Re: [question] VFIO Device Migration: The vCPU may be paused during vfio device DMA in iommu nested stage mode && vSVA

2021-09-24 Thread Kirti Wankhede




On 9/24/2021 12:17 PM, Tian, Kevin wrote:

From: Kunkun Jiang 
Sent: Friday, September 24, 2021 2:19 PM

Hi all,

I encountered a problem in vfio device migration testing. The
vCPU may be paused during vfio-pci DMA in iommu nested
stage mode && vSVA. This may lead to migration failure and
other problems related to the device hardware and driver
implementation.

It may be a bit early to discuss this issue; after all, iommu
nested stage mode and vSVA are not yet mature. But judging
from the current implementation, we will definitely encounter
this problem in the future.


Yes, this is a known limitation of supporting migration with vSVA.



This is the current flow of vSVA translation-fault handling
in iommu nested stage mode (take SMMU as an example):

1. SMMU: address translation fault
2. host os (vfio/smmu): notify qemu
3. qemu (vfio/vsmmu): inject the fault into the guest os
4. guest os: handle the translation fault
5. guest os: send CMD_RESUME to the vSMMU
6. qemu (vfio/vsmmu): deliver the response to the host os
7. host os (vfio/smmu): send CMD_RESUME to the SMMU
8. SMMU: retry or terminate

The order is 1 to 8.

Currently, qemu may pause the vCPU at any step. It is possible to
pause the vCPU at steps 1-5, that is, in the middle of a DMA. This
may lead to migration failure and other problems related to the
device hardware and driver implementation. For example, the device
state cannot be changed from RUNNING && SAVING to SAVING,
because the device DMA is not over.

As far as I can see, the vCPU should not be paused during a device
I/O process, such as DMA. However, live migration currently does
not pay attention to the state of the vfio device when pausing
the vCPU. And if the vCPU is not paused, the vfio device keeps
running. This looks like a *deadlock*.


Basically this requires:

1) stopping the vCPU after stopping the device (this sequence could
be selectively enabled for vSVA);



I don't think this change is required. When vCPUs are halted, the
vCPU states are already saved, which takes care of steps 4 and 5.
Then, when the device is transitioned to the SAVING state, save the
qemu and host os state in the migration stream, i.e. the state at
steps 2 and 3, and depending on that, choose whether to run step 6
or 7 while resuming.


Thanks,
Kirti


2) when stopping the device, the driver should block new requests
from the vCPU (queued to a pending list) and then drain all in-flight
requests, including faults;
 * blocking further requires switching from the fast path to the
slow trap-emulation path for the cmd portal before stopping
the device;

3) save the pending requests in the vm image and replay them
after the vm is resumed;
 * finally, disable blocking by switching back to the fast path for
the cmd portal;



Do you have any ideas to solve this problem?
Looking forward to your reply.



We verified that the above flow works in our internal POC.

Thanks
Kevin





Re: [PATCH 07/16] vfio: Avoid error_propagate() after migrate_add_blocker()

2021-07-21 Thread Kirti Wankhede




On 7/20/2021 6:23 PM, Markus Armbruster wrote:

When migrate_add_blocker(blocker, ) is followed by
error_propagate(errp, err), we can often just as well do
migrate_add_blocker(..., errp).  This is the case in
vfio_migration_probe().

Prior art: commit 386f6c07d2 "error: Avoid error_propagate() after
migrate_add_blocker()".

Cc: Kirti Wankhede 
Cc: Alex Williamson 
Signed-off-by: Markus Armbruster 
---
  hw/vfio/migration.c | 6 ++
  1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 82f654afb6..ff6b45de6b 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -858,7 +858,6 @@ int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
  {
  VFIOContainer *container = vbasedev->group->container;
  struct vfio_region_info *info = NULL;
-Error *local_err = NULL;
  int ret = -ENOTSUP;
  
  if (!vbasedev->enable_migration || !container->dirty_pages_supported) {

@@ -885,9 +884,8 @@ add_blocker:
 "VFIO device doesn't support migration");
  g_free(info);
  
-ret = migrate_add_blocker(vbasedev->migration_blocker, _err);

-if (local_err) {
-error_propagate(errp, local_err);
+ret = migrate_add_blocker(vbasedev->migration_blocker, errp);
+if (ret < 0) {
  error_free(vbasedev->migration_blocker);
  vbasedev->migration_blocker = NULL;
  }



Reviewed-by: Kirti Wankhede 



Re: [PATCH v2 17/21] contrib/gitdm: add domain-map for NVIDIA

2021-07-15 Thread Kirti Wankhede




On 7/14/2021 11:50 PM, Alex Bennée wrote:

Signed-off-by: Alex Bennée 
Cc: Kirti Wankhede 
Cc: Yishai Hadas 
Message-Id: <20210714093638.21077-18-alex.ben...@linaro.org>
---
  contrib/gitdm/domain-map | 1 +
  1 file changed, 1 insertion(+)

diff --git a/contrib/gitdm/domain-map b/contrib/gitdm/domain-map
index 0b0cd9feee..329ff09029 100644
--- a/contrib/gitdm/domain-map
+++ b/contrib/gitdm/domain-map
@@ -24,6 +24,7 @@ microsoft.com   Microsoft
  mvista.com  MontaVista
  nokia.com   Nokia
  nuviainc.comNUVIA
+nvidia.com  NVIDIA
  oracle.com  Oracle
  proxmox.com Proxmox
  quicinc.com Qualcomm Innovation Center



Reviewed-by: Kirti Wankhede 



Re: [PATCH v1 1/1] vfio: Make migration support non experimental by default.

2021-07-14 Thread Kirti Wankhede




On 7/10/2021 1:14 PM, Claudio Fontana wrote:

On 3/8/21 5:09 PM, Tarun Gupta wrote:

VFIO migration support in QEMU is experimental as of now, which was done to
provide soak time and resolve concerns regarding bit-stream.
But, with the patches discussed in
https://www.mail-archive.com/qemu-devel@nongnu.org/msg784931.html, we have
corrected the ordering of saving PCI config space and bit-stream.

So, this patch proposes to make vfio migration support in QEMU enabled
by default. Tested by successfully migrating an mdev device.

Signed-off-by: Tarun Gupta 
Signed-off-by: Kirti Wankhede 
---
  hw/vfio/pci.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index f74be78209..15e26f460b 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3199,7 +3199,7 @@ static Property vfio_pci_dev_properties[] = {
  DEFINE_PROP_BIT("x-igd-opregion", VFIOPCIDevice, features,
  VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT, false),
  DEFINE_PROP_BOOL("x-enable-migration", VFIOPCIDevice,
- vbasedev.enable_migration, false),
+ vbasedev.enable_migration, true),
  DEFINE_PROP_BOOL("x-no-mmap", VFIOPCIDevice, vbasedev.no_mmap, false),
  DEFINE_PROP_BOOL("x-balloon-allowed", VFIOPCIDevice,
   vbasedev.ram_block_discard_allowed, false),



Hello,

has plain snapshot been tested?


Yes.


If I issue the HMP command "savevm", and then "loadvm", will things work fine?


Yes

Thanks,
Kirti



Re: [PATCH v1 1/1] vfio/migration: Correct device state from vmstate change for savevm case.

2021-06-18 Thread Kirti Wankhede

CCing more Nvidia folks who are testing this patch.

Gentle Ping for review.

Thanks,
Kirti


On 6/9/2021 12:07 AM, Kirti Wankhede wrote:

Set the _SAVING flag for the device state from the vmstate change
handler when it is called from savevm.

Currently the state transition for savevm/suspend is seen as:
 _RUNNING -> _STOP -> Stop-and-copy -> _STOP

The state transition for savevm/suspend should be:
 _RUNNING -> Stop-and-copy -> _STOP

The transition from _RUNNING to _STOP occurs in vfio_vmstate_change()
when the vmstate changes from running to !running: the _RUNNING flag is
reset there, but when vfio_vmstate_change() is called for
RUN_STATE_SAVE_VM, the _SAVING bit should be set at the same time.

Reported by: Yishai Hadas 
Signed-off-by: Kirti Wankhede 
---
  hw/vfio/migration.c | 11 ++-
  1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 384576cfc051..33242b2313b9 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -725,7 +725,16 @@ static void vfio_vmstate_change(void *opaque, bool running, RunState state)
   * _RUNNING bit
   */
  mask = ~VFIO_DEVICE_STATE_RUNNING;
-value = 0;
+
+/*
+ * When VM state transition to stop for savevm command, device should
+ * start saving data.
+ */
+if (state == RUN_STATE_SAVE_VM) {
+value = VFIO_DEVICE_STATE_SAVING;
+} else {
+value = 0;
+}
  }
  
  ret = vfio_migration_set_state(vbasedev, mask, value);






[PATCH v1 1/1] vfio/migration: Correct device state from vmstate change for savevm case.

2021-06-08 Thread Kirti Wankhede
Set the _SAVING flag for the device state from the vmstate change
handler when it is called from savevm.

Currently the state transition for savevm/suspend is seen as:
_RUNNING -> _STOP -> Stop-and-copy -> _STOP

The state transition for savevm/suspend should be:
_RUNNING -> Stop-and-copy -> _STOP

The transition from _RUNNING to _STOP occurs in vfio_vmstate_change()
when the vmstate changes from running to !running: the _RUNNING flag is
reset there, but when vfio_vmstate_change() is called for
RUN_STATE_SAVE_VM, the _SAVING bit should be set at the same time.

Reported by: Yishai Hadas 
Signed-off-by: Kirti Wankhede 
---
 hw/vfio/migration.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 384576cfc051..33242b2313b9 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -725,7 +725,16 @@ static void vfio_vmstate_change(void *opaque, bool running, RunState state)
  * _RUNNING bit
  */
 mask = ~VFIO_DEVICE_STATE_RUNNING;
-value = 0;
+
+/*
+ * When VM state transition to stop for savevm command, device should
+ * start saving data.
+ */
+if (state == RUN_STATE_SAVE_VM) {
+value = VFIO_DEVICE_STATE_SAVING;
+} else {
+value = 0;
+}
 }
 
 ret = vfio_migration_set_state(vbasedev, mask, value);
-- 
2.7.0




Re: [PATCH] vfio: Fix unregister SaveVMHandler in vfio_migration_finalize

2021-05-28 Thread Kirti Wankhede




On 5/28/2021 7:34 AM, Kunkun Jiang wrote:

Hi Philippe,

On 2021/5/27 21:44, Philippe Mathieu-Daudé wrote:

On 5/27/21 2:31 PM, Kunkun Jiang wrote:

In vfio_migration_init(), the SaveVMHandlers are registered for the
VFIO device, but the corresponding 'unregister' operation is missing.
This leads to a 'Segmentation fault (core dumped)' in
qemu_savevm_state_setup() if live migration is performed after a
VFIO device has been hot-removed.

Fixes: 7c2f5f75f94 (vfio: Register SaveVMHandlers for VFIO device)
Reported-by: Qixin Gan 
Signed-off-by: Kunkun Jiang 

Cc: qemu-sta...@nongnu.org


---
  hw/vfio/migration.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 201642d75e..ef397ebe6c 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -892,6 +892,7 @@ void vfio_migration_finalize(VFIODevice *vbasedev)
  
remove_migration_state_change_notifier(>migration_state);

  qemu_del_vm_change_state_handler(migration->vm_state);
+    unregister_savevm(VMSTATE_IF(vbasedev->dev), "vfio", vbasedev);

Hmm what about devices using "%s/vfio" id?

unregister_savevm() takes a 'VMStateIf *obj'. If we pass a non-null 'obj'
to unregister_savevm(), it will handle the devices using the "%s/vfio" id
with the following code:

    if (obj) {
        char *oid = vmstate_if_get_id(obj);
        if (oid) {
            pstrcpy(id, sizeof(id), oid);
            pstrcat(id, sizeof(id), "/");
            g_free(oid);
        }
    }
    pstrcat(id, sizeof(id), idstr);


This fix seems fine to me.



By the way, I'm puzzled that register_savevm_live() and unregister_savevm()
handle devices using the "%s/vfio" id differently, so I looked into the
commit history of register_savevm_live() and unregister_savevm().

In the beginning, both of them took a 'DeviceState *dev', which was
replaced with VMStateIf in 3cad405babb. Later, in ce62df5378b, the 'dev'
parameter was removed, because no caller of register_savevm_live() needed
to pass a non-null 'dev' at that time.

So now vfio devices need to build the 'id' first and then call
register_savevm_live(). I am wondering whether we should add a
'VMStateIf *obj' parameter back to register_savevm_live(). What do you
think of this?



I think the proposed change above is independent of this fix. I'll defer
to other experts.


Reviewed-by: Kirti Wankhede 



Re: [PATCH v3 1/3] vfio: Move the saving of the config space to the right place in VFIO migration

2021-02-28 Thread Kirti Wankhede



Reviewed-by: Kirti Wankhede 

On 2/23/2021 7:52 AM, Shenming Lu wrote:

On ARM64 the VFIO SET_IRQS ioctl depends on the VM interrupt
setup; if the VFIO PCI device config space is restored before the
VGIC, an error might occur in the kernel.

So we move the saving of the config space to the non-iterable
process, so that it will be restored after the VGIC according to
their priorities.

As for the possible dependence of device-specific migration
data on its config space, we can let the vendor driver
include any config info it needs in its own data stream.

Signed-off-by: Shenming Lu 
---
  hw/vfio/migration.c | 25 +++--
  1 file changed, 15 insertions(+), 10 deletions(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 00daa50ed8..f5bf67f642 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -575,11 +575,6 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
  return ret;
  }
  
-ret = vfio_save_device_config_state(f, opaque);

-if (ret) {
-return ret;
-}
-
  ret = vfio_update_pending(vbasedev);
  if (ret) {
  return ret;
@@ -620,6 +615,19 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
  return ret;
  }
  
+static void vfio_save_state(QEMUFile *f, void *opaque)

+{
+VFIODevice *vbasedev = opaque;
+int ret;
+
+ret = vfio_save_device_config_state(f, opaque);
+if (ret) {
+error_report("%s: Failed to save device config space",
+ vbasedev->name);
+qemu_file_set_error(f, ret);
+}
+}
+
  static int vfio_load_setup(QEMUFile *f, void *opaque)
  {
  VFIODevice *vbasedev = opaque;
@@ -670,11 +678,7 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
  switch (data) {
  case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
  {
-ret = vfio_load_device_config_state(f, opaque);
-if (ret) {
-return ret;
-}
-break;
+return vfio_load_device_config_state(f, opaque);
  }
  case VFIO_MIG_FLAG_DEV_SETUP_STATE:
  {
@@ -720,6 +724,7 @@ static SaveVMHandlers savevm_vfio_handlers = {
  .save_live_pending = vfio_save_pending,
  .save_live_iterate = vfio_save_iterate,
  .save_live_complete_precopy = vfio_save_complete_precopy,
+.save_state = vfio_save_state,
  .load_setup = vfio_load_setup,
  .load_cleanup = vfio_load_cleanup,
  .load_state = vfio_load_state,





Re: [RFC PATCH v2 1/3] vfio: Move the saving of the config space to the right place in VFIO migration

2021-02-18 Thread Kirti Wankhede




On 1/27/2021 3:06 AM, Alex Williamson wrote:

On Thu, 10 Dec 2020 10:21:21 +0800
Shenming Lu  wrote:


On 2020/12/10 2:34, Alex Williamson wrote:

On Wed, 9 Dec 2020 13:29:47 +0100
Cornelia Huck  wrote:
   

On Wed, 9 Dec 2020 16:09:17 +0800
Shenming Lu  wrote:
  

On ARM64 the VFIO SET_IRQS ioctl depends on the VM interrupt
setup; if the VFIO PCI device config space is restored before the
VGIC, an error might occur in the kernel.

So we move the saving of the config space to the non-iterable
process, so that it will be restored after the VGIC according to
their priorities.

As for the possible dependence of device-specific migration
data on its config space, we can let the vendor driver
include any config info it needs in its own data stream.
(Should we note this in the header file linux-headers/linux/vfio.h?)


Given that the header is our primary source about how this interface
should act, we need to properly document expectations about what will
be saved/restored when there (well, in the source file in the kernel.)
That goes in both directions: what a userspace must implement, and what
a vendor driver can rely on.


Yeah, in order to make the vendor driver and QEMU cooperate better, we might
need to document some expectations about the data section in the migration
region...


[Related, but not a todo for you: I think we're still missing proper
documentation of the whole migration feature.]


Yes, we never saw anything past v1 of the documentation patch.  Thanks,



I'll get back on this and send next version soon.



By the way, is there anything improper with this patch? I'd appreciate
your suggestions. :-)


I'm really hoping for some feedback from Kirti, I understand the NVIDIA
vGPU driver to have some dependency on this.  Thanks,


The NVIDIA driver doesn't use device config space values/information
during device data restoration, so we are good with this change.


Thanks,
Kirti




Alex


Signed-off-by: Shenming Lu 
---
  hw/vfio/migration.c | 25 +++--
  1 file changed, 15 insertions(+), 10 deletions(-)


.
   








Re: [RFC PATCH v2 1/3] vfio: Move the saving of the config space to the right place in VFIO migration

2021-02-18 Thread Kirti Wankhede




On 12/9/2020 1:39 PM, Shenming Lu wrote:

On ARM64 the VFIO SET_IRQS ioctl depends on the VM interrupt
setup; if the VFIO PCI device config space is restored before the
VGIC, an error might occur in the kernel.

So we move the saving of the config space to the non-iterable
process, so that it will be restored after the VGIC according to
their priorities.

As for the possible dependence of device-specific migration
data on its config space, we can let the vendor driver
include any config info it needs in its own data stream.
(Should we note this in the header file linux-headers/linux/vfio.h?)

Signed-off-by: Shenming Lu 
---
  hw/vfio/migration.c | 25 +++--
  1 file changed, 15 insertions(+), 10 deletions(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 00daa50ed8..3b9de1353a 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -575,11 +575,6 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
  return ret;
  }
  
-ret = vfio_save_device_config_state(f, opaque);

-if (ret) {
-return ret;
-}
-
  ret = vfio_update_pending(vbasedev);
  if (ret) {
  return ret;
@@ -620,6 +615,19 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
  return ret;
  }
  
+static void vfio_save_state(QEMUFile *f, void *opaque)

+{
+VFIODevice *vbasedev = opaque;
+int ret;
+
+/* The device specific data is migrated in the iterable process. */
+ret = vfio_save_device_config_state(f, opaque);
+if (ret) {
+error_report("%s: Failed to save device config space",
+ vbasedev->name);
+}
+}
+


Since the error is not propagated, set an error in the migration stream
so that migration fails; use qemu_file_set_error() on error.


Thanks,
Kirti


  static int vfio_load_setup(QEMUFile *f, void *opaque)
  {
  VFIODevice *vbasedev = opaque;
@@ -670,11 +678,7 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
  switch (data) {
  case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
  {
-ret = vfio_load_device_config_state(f, opaque);
-if (ret) {
-return ret;
-}
-break;
+return vfio_load_device_config_state(f, opaque);
  }
  case VFIO_MIG_FLAG_DEV_SETUP_STATE:
  {
@@ -720,6 +724,7 @@ static SaveVMHandlers savevm_vfio_handlers = {
  .save_live_pending = vfio_save_pending,
  .save_live_iterate = vfio_save_iterate,
  .save_live_complete_precopy = vfio_save_complete_precopy,
+.save_state = vfio_save_state,
  .load_setup = vfio_load_setup,
  .load_cleanup = vfio_load_cleanup,
  .load_state = vfio_load_state,





Re: [RFC PATCH v2 1/3] vfio: Move the saving of the config space to the right place in VFIO migration

2021-01-27 Thread Kirti Wankhede




On 1/27/2021 3:06 AM, Alex Williamson wrote:

On Thu, 10 Dec 2020 10:21:21 +0800
Shenming Lu  wrote:


On 2020/12/10 2:34, Alex Williamson wrote:

On Wed, 9 Dec 2020 13:29:47 +0100
Cornelia Huck  wrote:
   

On Wed, 9 Dec 2020 16:09:17 +0800
Shenming Lu  wrote:
  

On ARM64 the VFIO SET_IRQS ioctl depends on the VM interrupt
setup; if the VFIO PCI device config space is restored before the
VGIC, an error might occur in the kernel.

So we move the saving of the config space to the non-iterable
process, so that it will be restored after the VGIC according to
their priorities.

As for the possible dependence of device-specific migration
data on its config space, we can let the vendor driver
include any config info it needs in its own data stream.
(Should we note this in the header file linux-headers/linux/vfio.h?)


Given that the header is our primary source about how this interface
should act, we need to properly document expectations about what will
be saved/restored when there (well, in the source file in the kernel.)
That goes in both directions: what a userspace must implement, and what
a vendor driver can rely on.


Yeah, in order to make the vendor driver and QEMU cooperate better, we might
need to document some expectations about the data section in the migration
region...


[Related, but not a todo for you: I think we're still missing proper
documentation of the whole migration feature.]


Yes, we never saw anything past v1 of the documentation patch.  Thanks,
   


By the way, is there anything improper with this patch? I'd appreciate
your suggestions. :-)


I'm really hoping for some feedback from Kirti, I understand the NVIDIA
vGPU driver to have some dependency on this.  Thanks,



I need to verify this patch. Give me a day to verify it.

Thanks,
Kirti



Alex


Signed-off-by: Shenming Lu 
---
  hw/vfio/migration.c | 25 +++--
  1 file changed, 15 insertions(+), 10 deletions(-)


.
   








Re: [PATCH] vfio/migrate: Move switch of dirty tracking into vfio_memory_listener

2021-01-27 Thread Kirti Wankhede




On 1/11/2021 1:04 PM, Keqian Zhu wrote:

For now the switch of vfio dirty page tracking is integrated into
the vfio_save_handler, which causes some problems [1].



Sorry, I missed [1] mail, somehow it didn't landed in my inbox.


The object of dirty tracking is guest memory, but the object of
the vfio_save_handler is device state. This mixed logic produces
unnecessary coupling and conflicts:

1. Coupling: Their saving granularity is different (per-VM vs per-device).
vfio will enable dirty_page_tracking for each device, while
once is enough.


That's correct, enabling dirty page tracking once is enough. But 
log_start and log_stop get called on address space update transactions, 
region_add() or region_del(), at which point migration may not be active. 
We don't want to allocate bitmap memory in the kernel for the lifetime of 
the VM without knowing whether migration will happen. The vfio_iommu_type1 
module should allocate bitmap memory only while migration is active.


Paolo's suggestion here to use the log_global_start and log_global_stop 
callbacks seems correct. But at that point the vfio device state is not 
yet changed to _SAVING, as you identified in [1]. Maybe we can start 
tracking the bitmap in the iommu_type1 module while the device is not 
yet _SAVING, but getting the dirty bitmap while the device is not yet in 
the _SAVING|_RUNNING state doesn't seem like an optimal solution.


Pasting here your question from [1]

> Before start dirty tracking, we will check and ensure that the device
>  is at _SAVING state and return error otherwise.  But the question is
>  that what is the rationale?  Why does the VFIO_IOMMU_DIRTY_PAGES
> ioctl have something to do with the device state?

Let's walk through the types of devices we are supporting:
1. mdev devices without an IOMMU-backed device
	The vendor driver pins pages as and when required at runtime. We can 
say the vendor driver is smart: it identifies the pages to pin. We 
are good here.


2. mdev device with an IOMMU-backed device
	This is similar to vfio-pci, a directly assigned device, where all 
pages are pinned at VM boot. The vendor driver is not smart, so a bitmap 
query will always report all pages dirty. If --auto-converge is not set, 
the VM gets stuck indefinitely in the pre-copy phase. This is known to us.


3. mdev device with an IOMMU-backed device and a smart vendor driver
	In this case as well, all pages are pinned at VM boot, but the vendor 
driver is smart enough to identify the pages and pin them explicitly.
Pages can be pinned at any time, i.e. during normal VM runtime, on 
setting the _SAVING flag (entering the pre-copy phase), or while in the 
iterative pre-copy phase. There is no restriction based on these phases 
for calling vfio_pin_pages(). The vendor driver can start pinning pages 
based on its device state when the _SAVING flag is set. In that case, if 
the dirty bitmap is queried before that, it will report all of sysmem as 
dirty, with an unnecessary copy of sysmem.
As an optimal solution, I think it is better to query the bitmap only 
after all vfio devices are in the pre-copy phase, i.e. after the _SAVING 
flag is set.



2. Conflicts: The ram_save_setup() traverses all memory_listeners
to execute their log_start() and log_sync() hooks to get the
first round dirty bitmap, which is used by the bulk stage of
ram saving. However, it can't get dirty bitmap from vfio, as
@savevm_ram_handlers is registered before @vfio_save_handler.

Right, but it can get the dirty bitmap from the vfio device in its
iterative callback:

ram_save_pending ->
migration_bitmap_sync_precopy() .. ->
 vfio_listerner_log_sync

Thanks,
Kirti


Move the switch of vfio dirty_page_tracking into vfio_memory_listener
can solve above problems. Besides, Do not require devices in SAVING
state for vfio_sync_dirty_bitmap().

[1] https://www.spinics.net/lists/kvm/msg229967.html

Reported-by: Zenghui Yu 
Signed-off-by: Keqian Zhu 
---
  hw/vfio/common.c| 53 +
  hw/vfio/migration.c | 35 --
  2 files changed, 44 insertions(+), 44 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 6ff1daa763..9128cd7ee1 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -311,7 +311,7 @@ bool vfio_mig_active(void)
  return true;
  }
  
-static bool vfio_devices_all_saving(VFIOContainer *container)

+static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
  {
  VFIOGroup *group;
  VFIODevice *vbasedev;
@@ -329,13 +329,8 @@ static bool vfio_devices_all_saving(VFIOContainer *container)
  return false;
  }
  
-if (migration->device_state & VFIO_DEVICE_STATE_SAVING) {

-if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF)
-&& (migration->device_state & VFIO_DEVICE_STATE_RUNNING)) {
-return false;
-}
-continue;
-} else {
+if ((vbasedev->pre_copy_dirty_page_tracking == 

[PATCH v2 1/1] Fix to show vfio migration stat in migration status

2020-12-01 Thread Kirti Wankhede
The header file where CONFIG_VFIO is defined is not included in
migration.c.

Move populate_vfio_info() to hw/vfio/common.c and add a stub in
stubs/vfio.c. Update the header files and the meson file accordingly.

Fixes: 3710586caa5d ("qapi: Add VFIO devices migration stats in Migration
stats")

Signed-off-by: Kirti Wankhede 
---
 hw/vfio/common.c  | 12 +++-
 include/hw/vfio/vfio-common.h |  1 -
 include/hw/vfio/vfio.h|  2 ++
 migration/migration.c | 16 +---
 stubs/meson.build |  1 +
 stubs/vfio.c  |  7 +++
 6 files changed, 22 insertions(+), 17 deletions(-)
 create mode 100644 stubs/vfio.c

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 6ff1daa763f8..4868c0fef504 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -25,6 +25,7 @@
 #endif
 #include 
 
+#include "qapi/qapi-types-migration.h"
 #include "hw/vfio/vfio-common.h"
 #include "hw/vfio/vfio.h"
 #include "exec/address-spaces.h"
@@ -292,7 +293,7 @@ const MemoryRegionOps vfio_region_ops = {
  * Device state interfaces
  */
 
-bool vfio_mig_active(void)
+static bool vfio_mig_active(void)
 {
 VFIOGroup *group;
 VFIODevice *vbasedev;
@@ -311,6 +312,15 @@ bool vfio_mig_active(void)
 return true;
 }
 
+void populate_vfio_info(MigrationInfo *info)
+{
+if (vfio_mig_active()) {
+info->has_vfio = true;
+info->vfio = g_malloc0(sizeof(*info->vfio));
+info->vfio->transferred = vfio_mig_bytes_transferred();
+}
+}
+
 static bool vfio_devices_all_saving(VFIOContainer *container)
 {
 VFIOGroup *group;
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 6141162d7aea..cc47bd7d4456 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -205,7 +205,6 @@ extern const MemoryRegionOps vfio_region_ops;
 typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
 extern VFIOGroupList vfio_group_list;
 
-bool vfio_mig_active(void);
 int64_t vfio_mig_bytes_transferred(void);
 
 #ifdef CONFIG_LINUX
diff --git a/include/hw/vfio/vfio.h b/include/hw/vfio/vfio.h
index 86248f54360a..d1e6f4b26f35 100644
--- a/include/hw/vfio/vfio.h
+++ b/include/hw/vfio/vfio.h
@@ -4,4 +4,6 @@
 bool vfio_eeh_as_ok(AddressSpace *as);
 int vfio_eeh_as_op(AddressSpace *as, uint32_t op);
 
+void populate_vfio_info(MigrationInfo *info);
+
 #endif
diff --git a/migration/migration.c b/migration/migration.c
index 87a9b59f83f4..c164594c1d8d 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -56,10 +56,7 @@
 #include "net/announce.h"
 #include "qemu/queue.h"
 #include "multifd.h"
-
-#ifdef CONFIG_VFIO
-#include "hw/vfio/vfio-common.h"
-#endif
+#include "hw/vfio/vfio.h"
 
#define MAX_THROTTLE  (128 << 20)  /* Migration transfer speed throttling */
 
@@ -1041,17 +1038,6 @@ static void populate_disk_info(MigrationInfo *info)
 }
 }
 
-static void populate_vfio_info(MigrationInfo *info)
-{
-#ifdef CONFIG_VFIO
-if (vfio_mig_active()) {
-info->has_vfio = true;
-info->vfio = g_malloc0(sizeof(*info->vfio));
-info->vfio->transferred = vfio_mig_bytes_transferred();
-}
-#endif
-}
-
 static void fill_source_migration_info(MigrationInfo *info)
 {
 MigrationState *s = migrate_get_current();
diff --git a/stubs/meson.build b/stubs/meson.build
index 82b7ba60abe5..909956674847 100644
--- a/stubs/meson.build
+++ b/stubs/meson.build
@@ -53,3 +53,4 @@ if have_system
   stub_ss.add(files('semihost.c'))
   stub_ss.add(files('xen-hw-stub.c'))
 endif
+stub_ss.add(files('vfio.c'))
diff --git a/stubs/vfio.c b/stubs/vfio.c
new file mode 100644
index ..9cc8753cd102
--- /dev/null
+++ b/stubs/vfio.c
@@ -0,0 +1,7 @@
+#include "qemu/osdep.h"
+#include "qapi/qapi-types-migration.h"
+#include "hw/vfio/vfio.h"
+
+void populate_vfio_info(MigrationInfo *info)
+{
+}
-- 
2.7.0




Re: [PATCH 1/1] Fix to show vfio migration stat in migration status

2020-11-25 Thread Kirti Wankhede




On 11/26/2020 12:33 AM, Dr. David Alan Gilbert wrote:

* Kirti Wankhede (kwankh...@nvidia.com) wrote:



On 11/25/2020 3:00 PM, Dr. David Alan Gilbert wrote:

* Kirti Wankhede (kwankh...@nvidia.com) wrote:

The header file where CONFIG_VFIO is defined is not included in
migration.c. Include the config-devices header file in migration.c.

Fixes: 3710586caa5d ("qapi: Add VFIO devices migration stats in Migration
stats")

Signed-off-by: Kirti Wankhede 


Given it's got build problems; I suggest actually something cleaner
would be to swing populate_vfio_info into one of the vfio specific
files, add a stubs/ entry somewhere and then migration.c doesn't need
to include the device or header stuff.



Still, the function prototype for populate_vfio_info() and its stub has
to be placed in some header file.


Which header file isn't that important; 


Any recommendation which header file to use?

Thanks,
Kirti


and the stub goes in a file in
stubs/


Earlier I used CONFIG_LINUX instead of CONFIG_VFIO which works here. Should
I change it back to CONFIG_LINUX?


No.


I'm not very familiar with the meson build system. I tested by
configuring a specific target, but I think that by default, if no target
is specified during configuration, it builds for multiple targets, and
that's where this build is failing. Any help on how to fix it would be
appreciated.


With my suggestion you don't have to do anything clever to meson
(which I don't know much about either).

Dave


Thanks,
Kirti


Dave


---
   meson.build   | 1 +
   migration/migration.c | 1 +
   2 files changed, 2 insertions(+)

diff --git a/meson.build b/meson.build
index 7ddf983ff7f5..24526499cfb5 100644
--- a/meson.build
+++ b/meson.build
@@ -1713,6 +1713,7 @@ common_ss.add_all(when: 'CONFIG_USER_ONLY', if_true: 
user_ss)
   common_all = common_ss.apply(config_all, strict: false)
   common_all = static_library('common',
+
c_args:'-DCONFIG_DEVICES="@0@-config-devices.h"'.format(target) ,
   build_by_default: false,
   sources: common_all.sources() + genh,
   dependencies: common_all.dependencies(),
diff --git a/migration/migration.c b/migration/migration.c
index 87a9b59f83f4..650efb81daad 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -57,6 +57,7 @@
   #include "qemu/queue.h"
   #include "multifd.h"
+#include CONFIG_DEVICES
   #ifdef CONFIG_VFIO
   #include "hw/vfio/vfio-common.h"
   #endif
--
2.7.0







Re: [PATCH 1/1] Fix to show vfio migration stat in migration status

2020-11-25 Thread Kirti Wankhede




On 11/25/2020 3:00 PM, Dr. David Alan Gilbert wrote:

* Kirti Wankhede (kwankh...@nvidia.com) wrote:

Header file where CONFIG_VFIO is defined is not included in migration.c
file. Include config devices header file in migration.c.

Fixes: 3710586caa5d ("qapi: Add VFIO devices migration stats in Migration
stats")

Signed-off-by: Kirti Wankhede 


Given it's got build problems; I suggest actually something cleaner
would be to swing populate_vfio_info into one of the vfio specific
files, add a stubs/ entry somewhere and then migration.c doesn't need
to include the device or header stuff.



Still, the prototype for populate_vfio_info() has to be placed in some header 
file, shared by the real implementation and the stub.


Earlier I used CONFIG_LINUX instead of CONFIG_VFIO, which works here. 
Should I change it back to CONFIG_LINUX?


I'm not very familiar with the meson build system. I tested by configuring a 
specific target, but I think that by default, if no target is specified during 
configuration, it builds for multiple targets, and that's where this 
build fails. Any help on how to fix it would be appreciated.


Thanks,
Kirti


Dave


---
  meson.build   | 1 +
  migration/migration.c | 1 +
  2 files changed, 2 insertions(+)

diff --git a/meson.build b/meson.build
index 7ddf983ff7f5..24526499cfb5 100644
--- a/meson.build
+++ b/meson.build
@@ -1713,6 +1713,7 @@ common_ss.add_all(when: 'CONFIG_USER_ONLY', if_true: 
user_ss)
  
  common_all = common_ss.apply(config_all, strict: false)

  common_all = static_library('common',
+
c_args:'-DCONFIG_DEVICES="@0@-config-devices.h"'.format(target) ,
  build_by_default: false,
  sources: common_all.sources() + genh,
  dependencies: common_all.dependencies(),
diff --git a/migration/migration.c b/migration/migration.c
index 87a9b59f83f4..650efb81daad 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -57,6 +57,7 @@
  #include "qemu/queue.h"
  #include "multifd.h"
  
+#include CONFIG_DEVICES

  #ifdef CONFIG_VFIO
  #include "hw/vfio/vfio-common.h"
  #endif
--
2.7.0





Re: [PATCH 1/1] Fix to show vfio migration stat in migration status

2020-11-23 Thread Kirti Wankhede



On 11/23/2020 10:03 PM, Alex Williamson wrote:

On Thu, 19 Nov 2020 01:58:47 +0530
Kirti Wankhede  wrote:


Header file where CONFIG_VFIO is defined is not included in migration.c
file. Include config devices header file in migration.c.

Fixes: 3710586caa5d ("qapi: Add VFIO devices migration stats in Migration
stats")

Signed-off-by: Kirti Wankhede 
---
  meson.build   | 1 +
  migration/migration.c | 1 +
  2 files changed, 2 insertions(+)

diff --git a/meson.build b/meson.build
index 7ddf983ff7f5..24526499cfb5 100644
--- a/meson.build
+++ b/meson.build
@@ -1713,6 +1713,7 @@ common_ss.add_all(when: 'CONFIG_USER_ONLY', if_true: 
user_ss)
  
  common_all = common_ss.apply(config_all, strict: false)

  common_all = static_library('common',
+
c_args:'-DCONFIG_DEVICES="@0@-config-devices.h"'.format(target) ,
  build_by_default: false,
  sources: common_all.sources() + genh,
  dependencies: common_all.dependencies(),
diff --git a/migration/migration.c b/migration/migration.c
index 87a9b59f83f4..650efb81daad 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -57,6 +57,7 @@
  #include "qemu/queue.h"
  #include "multifd.h"
  
+#include CONFIG_DEVICES

  #ifdef CONFIG_VFIO
  #include "hw/vfio/vfio-common.h"
  #endif


Fails to build...



I didn't see this in my testing. Is there a specific configuration or build 
that fails?


Thanks,
Kirti



[1705/8465] Compiling C object libcommon.fa.p/migration_postcopy-ram.c.o
[1706/8465] Compiling C object libcommon.fa.p/migration_migration.c.o
FAILED: libcommon.fa.p/migration_migration.c.o
cc -Ilibcommon.fa.p -I. -I.. -I../slirp -I../slirp/src -Iqapi -Itrace -Iui 
-Iui/shader -I/usr/include/libpng16 -I/usr/include/capstone -I/usr/include/SDL2 
-I/usr/include/gtk-3.0 -I/usr/include/pango-1.0 -I/usr/include/glib-2.0 
-I/usr/lib64/glib-2.0/include -I/usr/include/harfbuzz -I/usr/include/fribidi 
-I/usr/include/freetype2 -I/usr/include/cairo -I/usr/include/pixman-1 
-I/usr/include/gdk-pixbuf-2.0 -I/usr/include/libmount -I/usr/include/blkid 
-I/usr/include/gio-unix-2.0 -I/usr/include/atk-1.0 
-I/usr/include/at-spi2-atk/2.0 -I/usr/include/dbus-1.0 
-I/usr/lib64/dbus-1.0/include -I/usr/include/at-spi-2.0 -I/usr/include/spice-1 
-I/usr/include/spice-server -I/usr/include/cacard -I/usr/include/nss3 
-I/usr/include/nspr4 -I/usr/include/vte-2.91 -I/usr/include/virgl 
-I/usr/include/libusb-1.0 -fdiagnostics-color=auto -pipe -Wall -Winvalid-pch 
-std=gnu99 -O2 -g -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=2 -m64 -mcx16 
-D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -Wstrict-prototypes 
-Wredundant-decls -Wundef -Wwrite-strings -Wmissing-prototypes -fno-strict-aliasing 
-fno-common -fwrapv -Wold-style-declaration -Wold-style-definition -Wtype-limits 
-Wformat-security -Wformat-y2k -Winit-self -Wignored-qualifiers -Wempty-body 
-Wnested-externs -Wendif-labels -Wexpansion-to-defined -Wno-missing-include-dirs 
-Wno-shift-negative-value -Wno-psabi -fstack-protector-strong -isystem 
/tmp/tmp.HlKsni7iGC/linux-headers -isystem linux-headers -iquote 
/tmp/tmp.HlKsni7iGC/tcg/i386 -iquote . -iquote /tmp/tmp.HlKsni7iGC -iquote 
/tmp/tmp.HlKsni7iGC/accel/tcg -iquote /tmp/tmp.HlKsni7iGC/include -iquote 
/tmp/tmp.HlKsni7iGC/disas/libvixl -pthread -fPIC -DSTRUCT_IOVEC_DEFINED -D_DEFAULT_SOURCE 
-D_XOPEN_SOURCE=600 -DNCURSES_WIDECHAR -Wno-undef -D_REENTRANT 
'-DCONFIG_DEVICES="xtensa-linux-user-config-devices.h"' -MD -MQ 
libcommon.fa.p/migration_migration.c.o -MF libcommon.fa.p/migration_migration.c.o.d -o 
libcommon.fa.p/migration_migration.c.o -c ../migration/migration.c
: fatal error: xtensa-linux-user-config-devices.h: No such file 
or directory
compilation terminated.
[1707/8465] Compiling C object libcommon.fa.p/hw_pci-bridge_dec.c.o
[1708/8465] Compiling C object libcommon.fa.p/backends_hostmem-memfd.c.o
[1709/8465] Compiling C object libcommon.fa.p/hw_display_edid-region.c.o
[1710/8465] Compiling C object libcommon.fa.p/ui_gtk-gl-area.c.o
[1711/8465] Compiling C object libcommon.fa.p/disas_s390.c.o
[1712/8465] Compiling C object libcommon.fa.p/hw_pci-host_gpex-acpi.c.o
[1713/8465] Compiling C object libcommon.fa.p/hw_misc_macio_macio.c.o
[1714/8465] Compiling C object libcommon.fa.p/hw_misc_bcm2835_mbox.c.o
[1715/8465] Compiling C object libcommon.fa.p/hw_pci-bridge_xio3130_upstream.c.o
[1716/8465] Compiling C object libcommon.fa.p/hw_display_qxl-logger.c.o
[1717/8465] Compiling C object libcommon.fa.p/hw_net_net_tx_pkt.c.o
[1718/8465] Compiling C object libcommon.fa.p/hw_char_xen_console.c.o
[1719/8465] Compiling C object 
libqemu-mips64el-softmmu.fa.p/target_mips_msa_helper.c.o
[1720/8465] Compiling C object 
libqemu-mips64el-softmmu.fa.p/target_mips_translate.c.o
[1721/8465] Compiling C++ object libcommon.fa.p/disas_nanomips.cpp.o
ninja: build stopped: subcommand failed.
make[1]: *** [Makef

[PATCH v2 1/1] vfio: Change default dirty pages tracking behavior during migration

2020-11-23 Thread Kirti Wankhede
By default dirty pages tracking is enabled during iterative phase
(pre-copy phase).
Added per device opt-out option 'pre-copy-dirty-page-tracking' to
disable dirty pages tracking during iterative phase. If the option
'pre-copy-dirty-page-tracking=off' is set for any VFIO device, dirty
pages tracking during iterative phase will be disabled.

Signed-off-by: Kirti Wankhede 
---
 hw/vfio/common.c  | 11 +++
 hw/vfio/pci.c |  3 +++
 include/hw/vfio/vfio-common.h |  1 +
 3 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index c1fdbf17f2e6..6ff1daa763f8 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -311,7 +311,7 @@ bool vfio_mig_active(void)
 return true;
 }
 
-static bool vfio_devices_all_stopped_and_saving(VFIOContainer *container)
+static bool vfio_devices_all_saving(VFIOContainer *container)
 {
 VFIOGroup *group;
 VFIODevice *vbasedev;
@@ -329,8 +329,11 @@ static bool 
vfio_devices_all_stopped_and_saving(VFIOContainer *container)
 return false;
 }
 
-if ((migration->device_state & VFIO_DEVICE_STATE_SAVING) &&
-!(migration->device_state & VFIO_DEVICE_STATE_RUNNING)) {
+if (migration->device_state & VFIO_DEVICE_STATE_SAVING) {
+if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF)
+&& (migration->device_state & VFIO_DEVICE_STATE_RUNNING)) {
+return false;
+}
 continue;
 } else {
 return false;
@@ -1125,7 +1128,7 @@ static void vfio_listerner_log_sync(MemoryListener 
*listener,
 return;
 }
 
-if (vfio_devices_all_stopped_and_saving(container)) {
+if (vfio_devices_all_saving(container)) {
 vfio_sync_dirty_bitmap(container, section);
 }
 }
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 58c0ce8971e3..5601df6d6241 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3182,6 +3182,9 @@ static void vfio_instance_init(Object *obj)
 static Property vfio_pci_dev_properties[] = {
 DEFINE_PROP_PCI_HOST_DEVADDR("host", VFIOPCIDevice, host),
 DEFINE_PROP_STRING("sysfsdev", VFIOPCIDevice, vbasedev.sysfsdev),
+DEFINE_PROP_ON_OFF_AUTO("x-pre-copy-dirty-page-tracking", VFIOPCIDevice,
+vbasedev.pre_copy_dirty_page_tracking,
+ON_OFF_AUTO_ON),
 DEFINE_PROP_ON_OFF_AUTO("display", VFIOPCIDevice,
 display, ON_OFF_AUTO_OFF),
 DEFINE_PROP_UINT32("xres", VFIOPCIDevice, display_xres, 0),
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index baeb4dcff102..267cf854bbba 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -129,6 +129,7 @@ typedef struct VFIODevice {
 unsigned int flags;
 VFIOMigration *migration;
 Error *migration_blocker;
+OnOffAuto pre_copy_dirty_page_tracking;
 } VFIODevice;
 
 struct VFIODeviceOps {
-- 
2.7.0
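To make the intent of the vfio_devices_all_saving() change concrete, here is a
hypothetical Python model of the predicate (the flag values are illustrative,
not the real VFIO UAPI constants): the dirty bitmap is synced only if every
device is SAVING, and a device that is still RUNNING (pre-copy phase) counts
only when pre-copy dirty page tracking is enabled for it.

```python
RUNNING = 0x1  # illustrative stand-ins for VFIO_DEVICE_STATE_* bits
SAVING = 0x2

def all_saving(devices):
    """devices: list of (device_state, pre_copy_tracking_enabled) pairs."""
    for state, pre_copy_tracking in devices:
        if not state & SAVING:
            # Not saving at all: never sync the dirty bitmap.
            return False
        if (state & RUNNING) and not pre_copy_tracking:
            # Pre-copy phase, but this device opted out of pre-copy
            # dirty page tracking.
            return False
    return True

# Pre-copy phase (RUNNING | SAVING): tracked only when not opted out.
assert all_saving([(SAVING | RUNNING, True)])
assert not all_saving([(SAVING | RUNNING, False)])
# Stop-and-copy phase (SAVING only): always tracked.
assert all_saving([(SAVING, False)])
```

This matches the patch's behavior of defaulting the new property to on while
keeping the old stop-and-copy-only behavior available as an opt-out.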




Re: [v2 1/1] vfio: Change default dirty pages tracking behavior during migration

2020-11-23 Thread Kirti Wankhede

Sorry for the spam; resending again with 'PATCH' in the subject.

Kirti.

On 11/23/2020 7:38 PM, Kirti Wankhede wrote:

By default dirty pages tracking is enabled during iterative phase
(pre-copy phase).
Added per device opt-out option 'pre-copy-dirty-page-tracking' to
disable dirty pages tracking during iterative phase. If the option
'pre-copy-dirty-page-tracking=off' is set for any VFIO device, dirty
pages tracking during iterative phase will be disabled.

Signed-off-by: Kirti Wankhede 
---
  hw/vfio/common.c  | 11 +++
  hw/vfio/pci.c |  3 +++
  include/hw/vfio/vfio-common.h |  1 +
  3 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index c1fdbf17f2e6..6ff1daa763f8 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -311,7 +311,7 @@ bool vfio_mig_active(void)
  return true;
  }
  
-static bool vfio_devices_all_stopped_and_saving(VFIOContainer *container)

+static bool vfio_devices_all_saving(VFIOContainer *container)
  {
  VFIOGroup *group;
  VFIODevice *vbasedev;
@@ -329,8 +329,11 @@ static bool 
vfio_devices_all_stopped_and_saving(VFIOContainer *container)
  return false;
  }
  
-if ((migration->device_state & VFIO_DEVICE_STATE_SAVING) &&

-!(migration->device_state & VFIO_DEVICE_STATE_RUNNING)) {
+if (migration->device_state & VFIO_DEVICE_STATE_SAVING) {
+if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF)
+&& (migration->device_state & VFIO_DEVICE_STATE_RUNNING)) {
+return false;
+}
  continue;
  } else {
  return false;
@@ -1125,7 +1128,7 @@ static void vfio_listerner_log_sync(MemoryListener 
*listener,
  return;
  }
  
-if (vfio_devices_all_stopped_and_saving(container)) {

+if (vfio_devices_all_saving(container)) {
  vfio_sync_dirty_bitmap(container, section);
  }
  }
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 58c0ce8971e3..5601df6d6241 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3182,6 +3182,9 @@ static void vfio_instance_init(Object *obj)
  static Property vfio_pci_dev_properties[] = {
  DEFINE_PROP_PCI_HOST_DEVADDR("host", VFIOPCIDevice, host),
  DEFINE_PROP_STRING("sysfsdev", VFIOPCIDevice, vbasedev.sysfsdev),
+DEFINE_PROP_ON_OFF_AUTO("x-pre-copy-dirty-page-tracking", VFIOPCIDevice,
+vbasedev.pre_copy_dirty_page_tracking,
+ON_OFF_AUTO_ON),
  DEFINE_PROP_ON_OFF_AUTO("display", VFIOPCIDevice,
  display, ON_OFF_AUTO_OFF),
  DEFINE_PROP_UINT32("xres", VFIOPCIDevice, display_xres, 0),
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index baeb4dcff102..267cf854bbba 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -129,6 +129,7 @@ typedef struct VFIODevice {
  unsigned int flags;
  VFIOMigration *migration;
  Error *migration_blocker;
+OnOffAuto pre_copy_dirty_page_tracking;
  } VFIODevice;
  
  struct VFIODeviceOps {






[v2 1/1] vfio: Change default dirty pages tracking behavior during migration

2020-11-23 Thread Kirti Wankhede
By default dirty pages tracking is enabled during iterative phase
(pre-copy phase).
Added per device opt-out option 'pre-copy-dirty-page-tracking' to
disable dirty pages tracking during iterative phase. If the option
'pre-copy-dirty-page-tracking=off' is set for any VFIO device, dirty
pages tracking during iterative phase will be disabled.

Signed-off-by: Kirti Wankhede 
---
 hw/vfio/common.c  | 11 +++
 hw/vfio/pci.c |  3 +++
 include/hw/vfio/vfio-common.h |  1 +
 3 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index c1fdbf17f2e6..6ff1daa763f8 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -311,7 +311,7 @@ bool vfio_mig_active(void)
 return true;
 }
 
-static bool vfio_devices_all_stopped_and_saving(VFIOContainer *container)
+static bool vfio_devices_all_saving(VFIOContainer *container)
 {
 VFIOGroup *group;
 VFIODevice *vbasedev;
@@ -329,8 +329,11 @@ static bool 
vfio_devices_all_stopped_and_saving(VFIOContainer *container)
 return false;
 }
 
-if ((migration->device_state & VFIO_DEVICE_STATE_SAVING) &&
-!(migration->device_state & VFIO_DEVICE_STATE_RUNNING)) {
+if (migration->device_state & VFIO_DEVICE_STATE_SAVING) {
+if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF)
+&& (migration->device_state & VFIO_DEVICE_STATE_RUNNING)) {
+return false;
+}
 continue;
 } else {
 return false;
@@ -1125,7 +1128,7 @@ static void vfio_listerner_log_sync(MemoryListener 
*listener,
 return;
 }
 
-if (vfio_devices_all_stopped_and_saving(container)) {
+if (vfio_devices_all_saving(container)) {
 vfio_sync_dirty_bitmap(container, section);
 }
 }
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 58c0ce8971e3..5601df6d6241 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3182,6 +3182,9 @@ static void vfio_instance_init(Object *obj)
 static Property vfio_pci_dev_properties[] = {
 DEFINE_PROP_PCI_HOST_DEVADDR("host", VFIOPCIDevice, host),
 DEFINE_PROP_STRING("sysfsdev", VFIOPCIDevice, vbasedev.sysfsdev),
+DEFINE_PROP_ON_OFF_AUTO("x-pre-copy-dirty-page-tracking", VFIOPCIDevice,
+vbasedev.pre_copy_dirty_page_tracking,
+ON_OFF_AUTO_ON),
 DEFINE_PROP_ON_OFF_AUTO("display", VFIOPCIDevice,
 display, ON_OFF_AUTO_OFF),
 DEFINE_PROP_UINT32("xres", VFIOPCIDevice, display_xres, 0),
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index baeb4dcff102..267cf854bbba 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -129,6 +129,7 @@ typedef struct VFIODevice {
 unsigned int flags;
 VFIOMigration *migration;
 Error *migration_blocker;
+OnOffAuto pre_copy_dirty_page_tracking;
 } VFIODevice;
 
 struct VFIODeviceOps {
-- 
2.7.0




Re: [PATCH RFC] vfio: Move the saving of the config space to the right place in VFIO migration

2020-11-19 Thread Kirti Wankhede




On 11/14/2020 2:47 PM, Shenming Lu wrote:

When running VFIO migration, I found that the restoring of VFIO PCI device’s
config space is before VGIC on ARM64 target. But generally, interrupt 
controllers
need to be restored before PCI devices. 


Is there any other way by which the VGIC can be restored before the PCI device?


Besides, if a VFIO PCI device is
configured to have directly-injected MSIs (VLPIs), the restoring of its config
space will trigger the configuring of these VLPIs (in kernel), where it would
return an error as I saw due to the dependency on kvm’s vgic.



Can this be fixed in the kernel, by re-initializing the kernel state?


To avoid this, we can move the saving of the config space from the iterable
process to the non-iterable process, so that it will be called after VGIC
according to their priorities.



With this change, on the resume side, pre-copy phase data would reach the 
destination before the config space is restored. The VFIO device on the 
destination might need its config space set up and validated before it can 
accept further VFIO device-specific migration state.


This also changes the bit-stream, so it would break migration against the 
original migration patch set.


Thanks,
Kirti


Signed-off-by: Shenming Lu 
---
  hw/vfio/migration.c | 22 ++
  1 file changed, 6 insertions(+), 16 deletions(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 3ce285ea39..028da35a25 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -351,7 +351,7 @@ static int vfio_update_pending(VFIODevice *vbasedev)
  return 0;
  }
  
-static int vfio_save_device_config_state(QEMUFile *f, void *opaque)

+static void vfio_save_device_config_state(QEMUFile *f, void *opaque)
  {
  VFIODevice *vbasedev = opaque;
  
@@ -365,13 +365,14 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
  
  trace_vfio_save_device_config_state(vbasedev->name);
  
-return qemu_file_get_error(f);

+if (qemu_file_get_error(f))
+error_report("%s: Failed to save device config space",
+ vbasedev->name);
  }
  
  static int vfio_load_device_config_state(QEMUFile *f, void *opaque)

  {
  VFIODevice *vbasedev = opaque;
-uint64_t data;
  
  if (vbasedev->ops && vbasedev->ops->vfio_load_config) {

  int ret;
@@ -384,15 +385,8 @@ static int vfio_load_device_config_state(QEMUFile *f, void 
*opaque)
  }
  }
  
-data = qemu_get_be64(f);

-if (data != VFIO_MIG_FLAG_END_OF_STATE) {
-error_report("%s: Failed loading device config space, "
- "end flag incorrect 0x%"PRIx64, vbasedev->name, data);
-return -EINVAL;
-}
-
  trace_vfio_load_device_config_state(vbasedev->name);
-return qemu_file_get_error(f);
+return 0;
  }
  
  static int vfio_set_dirty_page_tracking(VFIODevice *vbasedev, bool start)

@@ -575,11 +569,6 @@ static int vfio_save_complete_precopy(QEMUFile *f, void 
*opaque)
  return ret;
  }
  
-ret = vfio_save_device_config_state(f, opaque);

-if (ret) {
-return ret;
-}
-
  ret = vfio_update_pending(vbasedev);
  if (ret) {
  return ret;
@@ -720,6 +709,7 @@ static SaveVMHandlers savevm_vfio_handlers = {
  .save_live_pending = vfio_save_pending,
  .save_live_iterate = vfio_save_iterate,
  .save_live_complete_precopy = vfio_save_complete_precopy,
+.save_state = vfio_save_device_config_state,
  .load_setup = vfio_load_setup,
  .load_cleanup = vfio_load_cleanup,
  .load_state = vfio_load_state,





[PATCH 1/1] vfio: Change default dirty pages tracking behavior during migration

2020-11-18 Thread Kirti Wankhede
By default dirty pages tracking is enabled during iterative phase
(pre-copy phase).
Added per device opt-out option 'pre-copy-dirty-page-tracking' to
disable dirty pages tracking during iterative phase. If the option
'pre-copy-dirty-page-tracking=off' is set for any VFIO device, dirty
pages tracking during iterative phase will be disabled.

Signed-off-by: Kirti Wankhede 
---
 hw/vfio/common.c  | 11 +++
 hw/vfio/pci.c |  3 +++
 include/hw/vfio/vfio-common.h |  1 +
 3 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index c1fdbf17f2e6..6ff1daa763f8 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -311,7 +311,7 @@ bool vfio_mig_active(void)
 return true;
 }
 
-static bool vfio_devices_all_stopped_and_saving(VFIOContainer *container)
+static bool vfio_devices_all_saving(VFIOContainer *container)
 {
 VFIOGroup *group;
 VFIODevice *vbasedev;
@@ -329,8 +329,11 @@ static bool 
vfio_devices_all_stopped_and_saving(VFIOContainer *container)
 return false;
 }
 
-if ((migration->device_state & VFIO_DEVICE_STATE_SAVING) &&
-!(migration->device_state & VFIO_DEVICE_STATE_RUNNING)) {
+if (migration->device_state & VFIO_DEVICE_STATE_SAVING) {
+if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF)
+&& (migration->device_state & VFIO_DEVICE_STATE_RUNNING)) {
+return false;
+}
 continue;
 } else {
 return false;
@@ -1125,7 +1128,7 @@ static void vfio_listerner_log_sync(MemoryListener 
*listener,
 return;
 }
 
-if (vfio_devices_all_stopped_and_saving(container)) {
+if (vfio_devices_all_saving(container)) {
 vfio_sync_dirty_bitmap(container, section);
 }
 }
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 58c0ce8971e3..5bea4b3e71f5 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3182,6 +3182,9 @@ static void vfio_instance_init(Object *obj)
 static Property vfio_pci_dev_properties[] = {
 DEFINE_PROP_PCI_HOST_DEVADDR("host", VFIOPCIDevice, host),
 DEFINE_PROP_STRING("sysfsdev", VFIOPCIDevice, vbasedev.sysfsdev),
+DEFINE_PROP_ON_OFF_AUTO("pre-copy-dirty-page-tracking", VFIOPCIDevice,
+vbasedev.pre_copy_dirty_page_tracking,
+ON_OFF_AUTO_ON),
 DEFINE_PROP_ON_OFF_AUTO("display", VFIOPCIDevice,
 display, ON_OFF_AUTO_OFF),
 DEFINE_PROP_UINT32("xres", VFIOPCIDevice, display_xres, 0),
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index baeb4dcff102..267cf854bbba 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -129,6 +129,7 @@ typedef struct VFIODevice {
 unsigned int flags;
 VFIOMigration *migration;
 Error *migration_blocker;
+OnOffAuto pre_copy_dirty_page_tracking;
 } VFIODevice;
 
 struct VFIODeviceOps {
-- 
2.7.0
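For illustration, the opt-out added by this patch would be requested per
device on the QEMU command line roughly as follows (the host BDF is a
placeholder; note that the later v2 revision renames the property with an
x- prefix to mark it experimental):

```
qemu-system-x86_64 ... \
    -device vfio-pci,host=0000:65:00.0,pre-copy-dirty-page-tracking=off
```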




Re: [PATCH RFC] vfio: Set the priority of VFIO VM state change handler explicitly

2020-11-18 Thread Kirti Wankhede




On 11/17/2020 7:10 AM, Shenming Lu wrote:

In VFIO VM state change handler, VFIO devices are transitioned in
_SAVING state, which should keep them from sending interrupts. Then
we can save the pending states of all interrupts in GIC VM state
change handler (on ARM).

So we have to set the priority of VFIO VM state change handler
explicitly (like virtio devices) to ensure it is called before GIC's
in saving.

Signed-off-by: Shenming Lu 
---
  hw/vfio/migration.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 55261562d4..d0d30864ba 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -857,7 +857,8 @@ static int vfio_migration_init(VFIODevice *vbasedev,
  register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1, &savevm_vfio_handlers,
   vbasedev);
  
-migration->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,

+migration->vm_state = qdev_add_vm_change_state_handler(vbasedev->dev,
+   vfio_vmstate_change,
 vbasedev);
  migration->migration_state.notify = vfio_migration_state_notifier;
  add_migration_state_change_notifier(&migration->migration_state);



Looks good to me.
Reviewed-by: Kirti Wankhede 
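
The ordering requirement behind this patch (the VFIO handler must transition
the device to _SAVING before the GIC's handler captures interrupt state) can
be sketched with a small hypothetical model in Python; priorities and names
here are illustrative, not QEMU's actual values:

```python
# Hypothetical model: VM state change handlers run in priority order on
# save, so a device handler registered with a higher priority quiesces
# the device before the interrupt controller's handler runs.
calls = []
handlers = []  # (priority, name) pairs

def add_vm_change_state_handler(priority, name):
    handlers.append((priority, name))

def vm_state_change_to_stopped():
    # Higher priority first when entering the stopped/saving state.
    for _, name in sorted(handlers, key=lambda h: -h[0]):
        calls.append(name)

add_vm_change_state_handler(0, "gic")        # default priority
add_vm_change_state_handler(10, "vfio-pci")  # explicit, runs earlier
vm_state_change_to_stopped()
assert calls == ["vfio-pci", "gic"]
```

qdev_add_vm_change_state_handler() achieves this ordering from the device
hierarchy rather than an explicit number, but the effect is the same: the
device is quiesced before dependent state is captured.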




[PATCH 1/1] Fix to show vfio migration stat in migration status

2020-11-18 Thread Kirti Wankhede
Header file where CONFIG_VFIO is defined is not included in migration.c
file. Include config devices header file in migration.c.

Fixes: 3710586caa5d ("qapi: Add VFIO devices migration stats in Migration
stats")

Signed-off-by: Kirti Wankhede 
---
 meson.build   | 1 +
 migration/migration.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/meson.build b/meson.build
index 7ddf983ff7f5..24526499cfb5 100644
--- a/meson.build
+++ b/meson.build
@@ -1713,6 +1713,7 @@ common_ss.add_all(when: 'CONFIG_USER_ONLY', if_true: 
user_ss)
 
 common_all = common_ss.apply(config_all, strict: false)
 common_all = static_library('common',
+
c_args:'-DCONFIG_DEVICES="@0@-config-devices.h"'.format(target) ,
 build_by_default: false,
 sources: common_all.sources() + genh,
 dependencies: common_all.dependencies(),
diff --git a/migration/migration.c b/migration/migration.c
index 87a9b59f83f4..650efb81daad 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -57,6 +57,7 @@
 #include "qemu/queue.h"
 #include "multifd.h"
 
+#include CONFIG_DEVICES
 #ifdef CONFIG_VFIO
 #include "hw/vfio/vfio-common.h"
 #endif
-- 
2.7.0




Re: [RFC PATCH for-QEMU-5.2] vfio: Make migration support experimental

2020-11-10 Thread Kirti Wankhede




On 11/10/2020 2:40 PM, Dr. David Alan Gilbert wrote:

* Alex Williamson (alex.william...@redhat.com) wrote:

On Mon, 9 Nov 2020 19:44:17 +
"Dr. David Alan Gilbert"  wrote:


* Alex Williamson (alex.william...@redhat.com) wrote:

Per the proposed documentation for vfio device migration:

   Dirty pages are tracked when device is in stop-and-copy phase
   because if pages are marked dirty during pre-copy phase and
   content is transferred from source to destination, there is no
   way to know newly dirtied pages from the point they were copied
   earlier until device stops. To avoid repeated copy of same
   content, pinned pages are marked dirty only during
   stop-and-copy phase.

Essentially, since we don't have hardware dirty page tracking for
assigned devices at this point, we consider any page that is pinned
by an mdev vendor driver or pinned and mapped through the IOMMU to
be perpetually dirty.  In the worst case, this may result in all of
guest memory being considered dirty during every iteration of live
migration.  The current vfio implementation of migration has chosen
to mask device dirtied pages until the final stages of migration in
order to avoid this worst case scenario.

Allowing the device to implement a policy decision to prioritize
reduced migration data like this jeopardizes QEMU's overall ability
to implement any degree of service level guarantees during migration.
For example, any estimates towards achieving acceptable downtime
margins cannot be trusted when such a device is present.  The vfio
device should participate in dirty page tracking to the best of its
ability throughout migration, even if that means the dirty footprint
of the device impedes migration progress, allowing both QEMU and
higher level management tools to decide whether to continue the
migration or abort due to failure to achieve the desired behavior.


I don't feel particularly badly about the decision to squash it in
during the stop-and-copy phase; for devices where the pinned memory
is large, I don't think doing it during the main phase makes much sense;
especially if you then have to deal with tracking changes in pinning.



AFAIK the kernel support for tracking changes in page pinning already
exists, this is largely the vfio device in QEMU that decides when to
start exposing the device dirty footprint to QEMU.  I'm a bit surprised
by this answer though, we don't really know what the device memory
footprint is.  It might be large, it might be nothing, but by not
participating in dirty page tracking until the VM is stopped, we can't
know what the footprint is and how it will affect downtime.  Is it
really the place of a QEMU device driver to impose this sort of policy?


If it could actually track changes then I'd agree we shouldn't impose
any policy; but if it's just marking the whole area as dirty we're going
to need a bodge somewhere; this bodge doesn't look any worse than the
others to me.




Having said that, I agree with marking it as experimental, because
I'm dubious how useful it will be for the same reason, I worry
about whether the downtime will be so large to make it pointless.




Not all device state is large. For example, a NIC might only report 
currently mapped RX buffers, which are usually not more than 1 GB and could 
be as low as tens of MB. A GPU might or might not have large data; that 
depends on its use cases.




TBH I think that's the wrong reason to mark it experimental.  There's
clearly demand for vfio device migration and even if the practical use
cases are initially small, they will expand over time and hardware will
get better.  My objection is that the current behavior masks the
hardware and device limitations, leading to unrealistic expectations.
If the user expects minimal downtime, configures convergence to account
for that, QEMU thinks it can achieve it, and then the device marks
everything dirty, that's not supportable.


Yes, agreed.


Yes, there is demand for vfio device migration, and many device owners have 
started scoping and developing migration support.
Instead of marking the whole migration support as experimental, we could add 
an opt-in option that decides whether to mark system memory pages dirty 
during the iterative (pre-copy) phase of migration.


Thanks,
Kirti




OTOH if the vfio device
participates in dirty tracking through pre-copy, then the practical use
cases will find themselves as migrations will either be aborted because
downtime tolerances cannot be achieved or downtimes will be configured
to match reality.  Thanks,


Without a way to prioritise the unpinned memory during that period,
we're going to be repeatedly sending the pinned memory which is going to
lead to a much larger bandwidth usage that required; so that's going in
completely the wrong direction and also wrong from the point of view of
the user.

Dave



Alex


Reviewed-by: Dr. David Alan Gilbert 


Link: https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg00807.html
Cc: Kirti Wankhede 
Cc: Neo J

Re: [PATCH v1] docs/devel: Add VFIO device migration documentation

2020-11-06 Thread Kirti Wankhede




On 11/6/2020 2:56 AM, Alex Williamson wrote:

On Fri, 6 Nov 2020 02:22:11 +0530
Kirti Wankhede  wrote:


On 11/6/2020 12:41 AM, Alex Williamson wrote:

On Fri, 6 Nov 2020 00:29:36 +0530
Kirti Wankhede  wrote:
   

On 11/4/2020 6:15 PM, Alex Williamson wrote:

On Wed, 4 Nov 2020 13:25:40 +0530
Kirti Wankhede  wrote:
  

On 11/4/2020 1:57 AM, Alex Williamson wrote:

On Wed, 4 Nov 2020 01:18:12 +0530
Kirti Wankhede  wrote:
 

On 10/30/2020 12:35 AM, Alex Williamson wrote:

On Thu, 29 Oct 2020 23:11:16 +0530
Kirti Wankhede  wrote:






+System memory dirty pages tracking
+--
+
+A ``log_sync`` memory listener callback is added to mark system memory pages


s/is added to mark/marks those/
   

+as dirty which are used for DMA by VFIO device. Dirty pages bitmap is queried


s/by/by the/
s/Dirty/The dirty/
   

+per container. All pages pinned by vendor driver through vfio_pin_pages()


s/by/by the/
   

+external API have to be marked as dirty during migration. When there are CPU
+writes, CPU dirty page tracking can identify dirtied pages, but any page pinned
+by vendor driver can also be written by device. There is currently no device


s/by/by the/ (x2)
   

+which has hardware support for dirty page tracking. So all pages which are
+pinned by vendor driver are considered as dirty.
+Dirty pages are tracked when device is in stop-and-copy phase because if pages
+are marked dirty during pre-copy phase and content is transferred from source to
+destination, there is no way to know newly dirtied pages from the point they
+were copied earlier until device stops. To avoid repeated copy of same content,
+pinned pages are marked dirty only during stop-and-copy phase.


   

Let me take a quick stab at rewriting this paragraph (not sure if I
understood it correctly):

"Dirty pages are tracked when the device is in the stop-and-copy phase.
During the pre-copy phase, it is not possible to distinguish a dirty
page that has been transferred from the source to the destination from
newly dirtied pages, which would lead to repeated copying of the same
content. Therefore, pinned pages are only marked dirty during the
stop-and-copy phase." ?
   


I think above rephrase only talks about repeated copying in pre-copy
phase. Used "copied earlier until device stops" to indicate both
pre-copy and stop-and-copy till device stops.



Now I'm confused, I thought we had abandoned the idea that we can only
report pinned pages during stop-and-copy.  Doesn't the device need to
expose its dirty memory footprint during the iterative phase regardless
of whether that causes repeat copies?  If QEMU iterates and sees that
all memory is still dirty, it may have transferred more data, but it
can actually predict if it can achieve its downtime tolerances.  Which
is more important, less data transfer or predictability?  Thanks,



Even if QEMU copies and transfers content of all sys mem pages during
pre-copy (worst case with IOMMU backed mdev device when its vendor
driver is not smart to pin pages explicitly and all sys mem pages are
marked dirty), then also its prediction about downtime tolerance will
not be correct, because during stop-and-copy again all pages need to be
copied as device can write to any of those pinned pages.


I think you're only reiterating my point.  If QEMU copies all of guest
memory during the iterative phase and each time it sees that all memory
is dirty, such as if CPUs or devices (including assigned devices) are
dirtying pages as fast as it copies them (or continuously marks them
dirty), then QEMU can predict that downtime will require copying all
pages.


But as of now there is no way to know if device has dirtied pages during
iterative phase.



This claim doesn't make any sense, pinned pages are considered
persistently dirtied, during the iterative phase and while stopped.



If instead devices don't mark dirty pages until the VM is
stopped, then QEMU might iterate through memory copy and predict a short
downtime because not much memory is dirty, only to be surprised that
all of memory is suddenly dirty.  At that point it's too late, the VM
is already stopped, the predicted short downtime takes far longer than
expected.  This is exactly why we made the kernel interface mark pinned
pages persistently dirty when it was proposed that we only report
pinned pages once.  Thanks,
 


Since there is no way to know if device dirtied pages during iterative
phase, QEMU should query pinned pages in stop-and-copy phase.



As above, I don't believe this is true.

  

Whenever there will be hardware support or some software mechanism to
report pages dirtied by device then we will add a capability bit in
migration capability and based on that capability bit qemu/user space
app should decide to query dirty pages in iterative phase.



Yes, we could advertise support for fine granularity dirty page
tracking, but I completely disagree that we should consider pinned
pages clean until suddenly exposing them as dirty once the VM is
stopped.  Thanks,

[PATCH v2 1/1] Fix use after free in vfio_migration_probe

2020-11-06 Thread Kirti Wankhede
Fixes Coverity issue:
CID 1436126:  Memory - illegal accesses  (USE_AFTER_FREE)

Fixes: a9e271ec9b36 ("vfio: Add migration region initialization and finalize
function")

Signed-off-by: Kirti Wankhede 
Reviewed-by: David Edmondson 
Reviewed-by: Alex Bennée 
Reviewed-by: Philippe Mathieu-Daudé 
---
 hw/vfio/migration.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 3ce285ea395d..55261562d4f3 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -897,8 +897,8 @@ int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
 goto add_blocker;
 }
 
-g_free(info);
 trace_vfio_migration_probe(vbasedev->name, info->index);
+g_free(info);
 return 0;
 
 add_blocker:
-- 
2.7.0




[PATCH 1/1] Change the order of g_free(info) and tracepoint

2020-11-06 Thread Kirti Wankhede
Fixes Coverity issue:
CID 1436126:  Memory - illegal accesses  (USE_AFTER_FREE)

Fixes: a9e271ec9b36 ("vfio: Add migration region initialization and finalize
function")

Signed-off-by: Kirti Wankhede 
---
 hw/vfio/migration.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 3ce285ea395d..55261562d4f3 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -897,8 +897,8 @@ int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
 goto add_blocker;
 }
 
-g_free(info);
 trace_vfio_migration_probe(vbasedev->name, info->index);
+g_free(info);
 return 0;
 
 add_blocker:
-- 
2.7.0




Re: [PATCH v1] docs/devel: Add VFIO device migration documentation

2020-11-05 Thread Kirti Wankhede




On 11/6/2020 12:41 AM, Alex Williamson wrote:

On Fri, 6 Nov 2020 00:29:36 +0530
Kirti Wankhede  wrote:


On 11/4/2020 6:15 PM, Alex Williamson wrote:

On Wed, 4 Nov 2020 13:25:40 +0530
Kirti Wankhede  wrote:
   

On 11/4/2020 1:57 AM, Alex Williamson wrote:

On Wed, 4 Nov 2020 01:18:12 +0530
Kirti Wankhede  wrote:
  

On 10/30/2020 12:35 AM, Alex Williamson wrote:

On Thu, 29 Oct 2020 23:11:16 +0530
Kirti Wankhede  wrote:
 



 

+System memory dirty pages tracking
+----------------------------------
+
+A ``log_sync`` memory listener callback is added to mark system memory pages


s/is added to mark/marks those/


+as dirty which are used for DMA by VFIO device. Dirty pages bitmap is queried


s/by/by the/
s/Dirty/The dirty/


+per container. All pages pinned by vendor driver through vfio_pin_pages()


s/by/by the/


+external API have to be marked as dirty during migration. When there are CPU
+writes, CPU dirty page tracking can identify dirtied pages, but any page pinned
+by vendor driver can also be written by device. There is currently no device


s/by/by the/ (x2)


+which has hardware support for dirty page tracking. So all pages which are
+pinned by vendor driver are considered as dirty.
+Dirty pages are tracked when device is in stop-and-copy phase because if pages
+are marked dirty during pre-copy phase and content is transfered from source to
+destination, there is no way to know newly dirtied pages from the point they
+were copied earlier until device stops. To avoid repeated copy of same content,
+pinned pages are marked dirty only during stop-and-copy phase.




Let me take a quick stab at rewriting this paragraph (not sure if I
understood it correctly):

"Dirty pages are tracked when the device is in the stop-and-copy phase.
During the pre-copy phase, it is not possible to distinguish a dirty
page that has been transferred from the source to the destination from
newly dirtied pages, which would lead to repeated copying of the same
content. Therefore, pinned pages are only marked dirty during the
stop-and-copy phase." ?



I think above rephrase only talks about repeated copying in pre-copy
phase. Used "copied earlier until device stops" to indicate both
pre-copy and stop-and-copy till device stops.



Now I'm confused, I thought we had abandoned the idea that we can only
report pinned pages during stop-and-copy.  Doesn't the device need to
expose its dirty memory footprint during the iterative phase regardless
of whether that causes repeat copies?  If QEMU iterates and sees that
all memory is still dirty, it may have transferred more data, but it
can actually predict if it can achieve its downtime tolerances.  Which
is more important, less data transfer or predictability?  Thanks,
 


Even if QEMU copies and transfers content of all sys mem pages during
pre-copy (worst case with IOMMU backed mdev device when its vendor
driver is not smart to pin pages explicitly and all sys mem pages are
marked dirty), then also its prediction about downtime tolerance will
not be correct, because during stop-and-copy again all pages need to be
copied as device can write to any of those pinned pages.


I think you're only reiterating my point.  If QEMU copies all of guest
memory during the iterative phase and each time it sees that all memory
is dirty, such as if CPUs or devices (including assigned devices) are
dirtying pages as fast as it copies them (or continuously marks them
dirty), then QEMU can predict that downtime will require copying all
pages.


But as of now there is no way to know if device has dirtied pages during
iterative phase.



This claim doesn't make any sense, pinned pages are considered
persistently dirtied, during the iterative phase and while stopped.

 

If instead devices don't mark dirty pages until the VM is
stopped, then QEMU might iterate through memory copy and predict a short
downtime because not much memory is dirty, only to be surprised that
all of memory is suddenly dirty.  At that point it's too late, the VM
is already stopped, the predicted short downtime takes far longer than
expected.  This is exactly why we made the kernel interface mark pinned
pages persistently dirty when it was proposed that we only report
pinned pages once.  Thanks,
  


Since there is no way to know if device dirtied pages during iterative
phase, QEMU should query pinned pages in stop-and-copy phase.



As above, I don't believe this is true.

   

Whenever there will be hardware support or some software mechanism to
report pages dirtied by device then we will add a capability bit in
migration capability and based on that capability bit qemu/user space
app should decide to query dirty pages in iterative phase.



Yes, we could advertise support for fine granularity dirty page
tracking, but I completely disagree that we should consider pinned
pages clean until suddenly exposing them as dirty once the VM is
stopped.  Thanks,

Re: [PATCH v1] docs/devel: Add VFIO device migration documentation

2020-11-05 Thread Kirti Wankhede




On 11/4/2020 6:15 PM, Alex Williamson wrote:

On Wed, 4 Nov 2020 13:25:40 +0530
Kirti Wankhede  wrote:


On 11/4/2020 1:57 AM, Alex Williamson wrote:

On Wed, 4 Nov 2020 01:18:12 +0530
Kirti Wankhede  wrote:
   

On 10/30/2020 12:35 AM, Alex Williamson wrote:

On Thu, 29 Oct 2020 23:11:16 +0530
Kirti Wankhede  wrote:
  



  

+System memory dirty pages tracking
+----------------------------------
+
+A ``log_sync`` memory listener callback is added to mark system memory pages


s/is added to mark/marks those/
 

+as dirty which are used for DMA by VFIO device. Dirty pages bitmap is queried


s/by/by the/
s/Dirty/The dirty/
 

+per container. All pages pinned by vendor driver through vfio_pin_pages()


s/by/by the/
 

+external API have to be marked as dirty during migration. When there are CPU
+writes, CPU dirty page tracking can identify dirtied pages, but any page pinned
+by vendor driver can also be written by device. There is currently no device


s/by/by the/ (x2)
 

+which has hardware support for dirty page tracking. So all pages which are
+pinned by vendor driver are considered as dirty.
+Dirty pages are tracked when device is in stop-and-copy phase because if pages
+are marked dirty during pre-copy phase and content is transfered from source to
+destination, there is no way to know newly dirtied pages from the point they
+were copied earlier until device stops. To avoid repeated copy of same content,
+pinned pages are marked dirty only during stop-and-copy phase.


 

Let me take a quick stab at rewriting this paragraph (not sure if I
understood it correctly):

"Dirty pages are tracked when the device is in the stop-and-copy phase.
During the pre-copy phase, it is not possible to distinguish a dirty
page that has been transferred from the source to the destination from
newly dirtied pages, which would lead to repeated copying of the same
content. Therefore, pinned pages are only marked dirty during the
stop-and-copy phase." ?
 


I think above rephrase only talks about repeated copying in pre-copy
phase. Used "copied earlier until device stops" to indicate both
pre-copy and stop-and-copy till device stops.



Now I'm confused, I thought we had abandoned the idea that we can only
report pinned pages during stop-and-copy.  Doesn't the device need to
expose its dirty memory footprint during the iterative phase regardless
of whether that causes repeat copies?  If QEMU iterates and sees that
all memory is still dirty, it may have transferred more data, but it
can actually predict if it can achieve its downtime tolerances.  Which
is more important, less data transfer or predictability?  Thanks,
  


Even if QEMU copies and transfers content of all sys mem pages during
pre-copy (worst case with IOMMU backed mdev device when its vendor
driver is not smart to pin pages explicitly and all sys mem pages are
marked dirty), then also its prediction about downtime tolerance will
not be correct, because during stop-and-copy again all pages need to be
copied as device can write to any of those pinned pages.


I think you're only reiterating my point.  If QEMU copies all of guest
memory during the iterative phase and each time it sees that all memory
is dirty, such as if CPUs or devices (including assigned devices) are
dirtying pages as fast as it copies them (or continuously marks them
dirty), then QEMU can predict that downtime will require copying all
pages.


But as of now there is no way to know if device has dirtied pages during
iterative phase.



This claim doesn't make any sense, pinned pages are considered
persistently dirtied, during the iterative phase and while stopped.

  

If instead devices don't mark dirty pages until the VM is
stopped, then QEMU might iterate through memory copy and predict a short
downtime because not much memory is dirty, only to be surprised that
all of memory is suddenly dirty.  At that point it's too late, the VM
is already stopped, the predicted short downtime takes far longer than
expected.  This is exactly why we made the kernel interface mark pinned
pages persistently dirty when it was proposed that we only report
pinned pages once.  Thanks,
   


Since there is no way to know if device dirtied pages during iterative
phase, QEMU should query pinned pages in stop-and-copy phase.



As above, I don't believe this is true.



Whenever there will be hardware support or some software mechanism to
report pages dirtied by device then we will add a capability bit in
migration capability and based on that capability bit qemu/user space
app should decide to query dirty pages in iterative phase.



Yes, we could advertise support for fine granularity dirty page
tracking, but I completely disagree that we should consider pinned
pages clean until suddenly exposing them as dirty once the VM is
stopped.  Thanks,



Should QEMU copy dirtied pages twice, during iterative phase and then 
when VM is stopped?


Thanks,
Kirti



Re: [PATCH v1] docs/devel: Add VFIO device migration documentation

2020-11-03 Thread Kirti Wankhede




On 11/4/2020 1:57 AM, Alex Williamson wrote:

On Wed, 4 Nov 2020 01:18:12 +0530
Kirti Wankhede  wrote:


On 10/30/2020 12:35 AM, Alex Williamson wrote:

On Thu, 29 Oct 2020 23:11:16 +0530
Kirti Wankhede  wrote:
   





+System memory dirty pages tracking
+----------------------------------
+
+A ``log_sync`` memory listener callback is added to mark system memory pages


s/is added to mark/marks those/
  

+as dirty which are used for DMA by VFIO device. Dirty pages bitmap is queried


s/by/by the/
s/Dirty/The dirty/
  

+per container. All pages pinned by vendor driver through vfio_pin_pages()


s/by/by the/
  

+external API have to be marked as dirty during migration. When there are CPU
+writes, CPU dirty page tracking can identify dirtied pages, but any page pinned
+by vendor driver can also be written by device. There is currently no device


s/by/by the/ (x2)
  

+which has hardware support for dirty page tracking. So all pages which are
+pinned by vendor driver are considered as dirty.
+Dirty pages are tracked when device is in stop-and-copy phase because if pages
+are marked dirty during pre-copy phase and content is transfered from source to
+destination, there is no way to know newly dirtied pages from the point they
+were copied earlier until device stops. To avoid repeated copy of same content,
+pinned pages are marked dirty only during stop-and-copy phase.


  

Let me take a quick stab at rewriting this paragraph (not sure if I
understood it correctly):

"Dirty pages are tracked when the device is in the stop-and-copy phase.
During the pre-copy phase, it is not possible to distinguish a dirty
page that has been transferred from the source to the destination from
newly dirtied pages, which would lead to repeated copying of the same
content. Therefore, pinned pages are only marked dirty during the
stop-and-copy phase." ?
  


I think above rephrase only talks about repeated copying in pre-copy
phase. Used "copied earlier until device stops" to indicate both
pre-copy and stop-and-copy till device stops.



Now I'm confused, I thought we had abandoned the idea that we can only
report pinned pages during stop-and-copy.  Doesn't the device need to
expose its dirty memory footprint during the iterative phase regardless
of whether that causes repeat copies?  If QEMU iterates and sees that
all memory is still dirty, it may have transferred more data, but it
can actually predict if it can achieve its downtime tolerances.  Which
is more important, less data transfer or predictability?  Thanks,
   


Even if QEMU copies and transfers content of all sys mem pages during
pre-copy (worst case with IOMMU backed mdev device when its vendor
driver is not smart to pin pages explicitly and all sys mem pages are
marked dirty), then also its prediction about downtime tolerance will
not be correct, because during stop-and-copy again all pages need to be
copied as device can write to any of those pinned pages.


I think you're only reiterating my point.  If QEMU copies all of guest
memory during the iterative phase and each time it sees that all memory
is dirty, such as if CPUs or devices (including assigned devices) are
dirtying pages as fast as it copies them (or continuously marks them
dirty), then QEMU can predict that downtime will require copying all
pages. 


But as of now there is no way to know if device has dirtied pages during 
iterative phase.



If instead devices don't mark dirty pages until the VM is
stopped, then QEMU might iterate through memory copy and predict a short
downtime because not much memory is dirty, only to be surprised that
all of memory is suddenly dirty.  At that point it's too late, the VM
is already stopped, the predicted short downtime takes far longer than
expected.  This is exactly why we made the kernel interface mark pinned
pages persistently dirty when it was proposed that we only report
pinned pages once.  Thanks,



Since there is no way to know if device dirtied pages during iterative 
phase, QEMU should query pinned pages in stop-and-copy phase.


Whenever there will be hardware support or some software mechanism to 
report pages dirtied by device then we will add a capability bit in 
migration capability and based on that capability bit qemu/user space 
app should decide to query dirty pages in iterative phase.


Thanks,
Kirti



Re: [PATCH v1] docs/devel: Add VFIO device migration documentation

2020-11-03 Thread Kirti Wankhede




On 10/30/2020 12:35 AM, Alex Williamson wrote:

On Thu, 29 Oct 2020 23:11:16 +0530
Kirti Wankhede  wrote:






+System memory dirty pages tracking
+----------------------------------
+
+A ``log_sync`` memory listener callback is added to mark system memory pages


s/is added to mark/marks those/
   

+as dirty which are used for DMA by VFIO device. Dirty pages bitmap is queried


s/by/by the/
s/Dirty/The dirty/
   

+per container. All pages pinned by vendor driver through vfio_pin_pages()


s/by/by the/
   

+external API have to be marked as dirty during migration. When there are CPU
+writes, CPU dirty page tracking can identify dirtied pages, but any page pinned
+by vendor driver can also be written by device. There is currently no device


s/by/by the/ (x2)
   

+which has hardware support for dirty page tracking. So all pages which are
+pinned by vendor driver are considered as dirty.
+Dirty pages are tracked when device is in stop-and-copy phase because if pages
+are marked dirty during pre-copy phase and content is transfered from source to
+destination, there is no way to know newly dirtied pages from the point they
+were copied earlier until device stops. To avoid repeated copy of same content,
+pinned pages are marked dirty only during stop-and-copy phase.




Let me take a quick stab at rewriting this paragraph (not sure if I
understood it correctly):

"Dirty pages are tracked when the device is in the stop-and-copy phase.
During the pre-copy phase, it is not possible to distinguish a dirty
page that has been transferred from the source to the destination from
newly dirtied pages, which would lead to repeated copying of the same
content. Therefore, pinned pages are only marked dirty during the
stop-and-copy phase." ?
   


I think above rephrase only talks about repeated copying in pre-copy
phase. Used "copied earlier until device stops" to indicate both
pre-copy and stop-and-copy till device stops.



Now I'm confused, I thought we had abandoned the idea that we can only
report pinned pages during stop-and-copy.  Doesn't the device need to
expose its dirty memory footprint during the iterative phase regardless
of whether that causes repeat copies?  If QEMU iterates and sees that
all memory is still dirty, it may have transferred more data, but it
can actually predict if it can achieve its downtime tolerances.  Which
is more important, less data transfer or predictability?  Thanks,



Even if QEMU copies and transfers content of all sys mem pages during 
pre-copy (worst case with IOMMU backed mdev device when its vendor 
driver is not smart to pin pages explicitly and all sys mem pages are 
marked dirty), then also its prediction about downtime tolerance will 
not be correct, because during stop-and-copy again all pages need to be 
copied as device can write to any of those pinned pages.


Thanks,
Kirti




Re: Out-of-Process Device Emulation session at KVM Forum 2020

2020-10-29 Thread Kirti Wankhede




On 10/29/2020 10:12 PM, Daniel P. Berrangé wrote:

On Thu, Oct 29, 2020 at 04:15:30PM +, David Edmondson wrote:

On Thursday, 2020-10-29 at 21:02:05 +08, Jason Wang wrote:


2) Did qemu even try to migrate opaque blobs before? It's probably a bad
design of migration protocol as well.


The TPM emulator backend migrates blobs that are only understood by
swtpm.


The separate slirp-helper net backend does the same too IIUC



When sys mem pages are marked dirty and content is copied to 
destination, content of sys mem is also opaque to QEMU.


Thanks,
Kirti



Re: [PATCH v1] docs/devel: Add VFIO device migration documentation

2020-10-29 Thread Kirti Wankhede
Thanks for the corrections, Cornelia. I have applied the corrections you
suggested and had not replied to each individually; see my comments in a
couple of places where I disagree.



On 10/29/2020 5:22 PM, Cornelia Huck wrote:

On Thu, 29 Oct 2020 11:23:11 +0530
Kirti Wankhede  wrote:


Document interfaces used for VFIO device migration. Added flow of state
changes during live migration with VFIO device.

Signed-off-by: Kirti Wankhede 
---
  MAINTAINERS   |   1 +
  docs/devel/vfio-migration.rst | 119 ++


You probably want to include this into the Developer's Guide via
index.rst.



Ok.


  2 files changed, 120 insertions(+)
  create mode 100644 docs/devel/vfio-migration.rst

diff --git a/MAINTAINERS b/MAINTAINERS
index 6a197bd358d6..6f3fcffc6b3d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1728,6 +1728,7 @@ M: Alex Williamson 
  S: Supported
  F: hw/vfio/*
  F: include/hw/vfio/
+F: docs/devel/vfio-migration.rst
  
  vfio-ccw

  M: Cornelia Huck 
diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
new file mode 100644
index ..dab9127825e4
--- /dev/null
+++ b/docs/devel/vfio-migration.rst
@@ -0,0 +1,119 @@
+=====================
+VFIO device Migration
+=====================
+
+VFIO devices use iterative approach for migration because certain VFIO devices


s/use/use an/ ?


+(e.g. GPU) have large amount of data to be transfered. The iterative pre-copy
+phase of migration allows for the guest to continue whilst the VFIO device state
+is transferred to destination, this helps to reduce the total downtime of the


s/to destination,/to the destination;/


+VM. VFIO devices can choose to skip the pre-copy phase of migration by returning
+pending_bytes as zero during pre-copy phase.


s/during/during the/


+
+Detailed description of UAPI for VFIO device for migration is in the comment
+above ``vfio_device_migration_info`` structure definition in header file
+linux-headers/linux/vfio.h.


I think I'd copy that to this file. If I'm looking at the
documentation, I'd rather not go hunting for source code to find out
what structure you are talking about. Plus, as it's UAPI, I don't
expect it to change much, so it should be easy to keep the definitions
in sync (famous last words).



I feel its duplication of documentation. I would like to know others 
views as well.



+
+VFIO device hooks for iterative approach:
+-  A ``save_setup`` function that setup migration region, sets _SAVING flag in


s/setup/sets up the/
s/in/in the/


+VFIO device state and inform VFIO IOMMU module to start dirty page tracking.


s/inform/informs the/


+
+- A ``load_setup`` function that setup migration region on the destination and


s/setup/sets up the/


+sets _RESUMING flag in VFIO device state.


s/in/in the/


+
+- A ``save_live_pending`` function that reads pending_bytes from vendor driver
+that indicate how much more data the vendor driver yet to save for the VFIO
+device.


"A ``save_live_pending`` function that reads pending_bytes from the
vendor driver, which indicates the amount of data that the vendor
driver has yet to save for the VFIO device." ?


+
+- A ``save_live_iterate`` function that reads VFIO device's data from vendor


s/reads/reads the/
s/from/from the/


+driver through migration region during iterative phase.


s/through/through the/


+
+- A ``save_live_complete_precopy`` function that resets _RUNNING flag from VFIO


s/from/from the/


+device state, saves device config space, if any, and iteratively copies


s/saves/saves the/


+remaining data for VFIO device till pending_bytes returned by vendor driver
+is zero.


"...and iteratively copies the remaining data for the VFIO device
until the vendor driver indicates that no data remains (pending_bytes
is zero)." ?


+
+- A ``load_state`` function loads config section and data sections generated by
+above save functions.


"A ``load_state`` function that loads the config section and the data
sections that are generated by the save functions above." ?


+
+- ``cleanup`` functions for both save and load that unmap migration region.


..."that perform any migration-related cleanup, including unmapping the
migration region." ?


+
+VM state change handler is registered to change VFIO device state based on VM
+state change.


"A VM state change handler is registered to change the VFIO device
state when the VM state changes." ?


+
+Similarly, a migration state change notifier is added to get a notification on


s/added/registered/ ?


+migration state change. These states are translated to VFIO device state and
+conveyed to vendor driver.
+
+System memory dirty pages tracking
+----------------------------------
+
+A ``log_sync`` memory listener callback is added to mark system memory pages


s/is added to mark/marks those/


+as dirty which are used for DMA by VFIO device. Dirty pages bitmap is queried


s/by/by the/
s/Dirty/The dirty/


+per container. All pages pinned by vendor driver through vfio_pin_pages()

[PATCH v1] docs/devel: Add VFIO device migration documentation

2020-10-29 Thread Kirti Wankhede
Document interfaces used for VFIO device migration. Added flow of state
changes during live migration with VFIO device.

Signed-off-by: Kirti Wankhede 
---
 MAINTAINERS   |   1 +
 docs/devel/vfio-migration.rst | 119 ++
 2 files changed, 120 insertions(+)
 create mode 100644 docs/devel/vfio-migration.rst

diff --git a/MAINTAINERS b/MAINTAINERS
index 6a197bd358d6..6f3fcffc6b3d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1728,6 +1728,7 @@ M: Alex Williamson 
 S: Supported
 F: hw/vfio/*
 F: include/hw/vfio/
+F: docs/devel/vfio-migration.rst
 
 vfio-ccw
 M: Cornelia Huck 
diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
new file mode 100644
index ..dab9127825e4
--- /dev/null
+++ b/docs/devel/vfio-migration.rst
@@ -0,0 +1,119 @@
+=====================
+VFIO device Migration
+=====================
+
+VFIO devices use iterative approach for migration because certain VFIO devices
+(e.g. GPU) have large amount of data to be transfered. The iterative pre-copy
+phase of migration allows for the guest to continue whilst the VFIO device state
+is transferred to destination, this helps to reduce the total downtime of the
+VM. VFIO devices can choose to skip the pre-copy phase of migration by returning
+pending_bytes as zero during pre-copy phase.
+
+Detailed description of UAPI for VFIO device for migration is in the comment
+above ``vfio_device_migration_info`` structure definition in header file
+linux-headers/linux/vfio.h.
+
+VFIO device hooks for iterative approach:
+-  A ``save_setup`` function that setup migration region, sets _SAVING flag in
+VFIO device state and inform VFIO IOMMU module to start dirty page tracking.
+
+- A ``load_setup`` function that setup migration region on the destination and
+sets _RESUMING flag in VFIO device state.
+
+- A ``save_live_pending`` function that reads pending_bytes from vendor driver
+that indicate how much more data the vendor driver yet to save for the VFIO
+device.
+
+- A ``save_live_iterate`` function that reads VFIO device's data from vendor
+driver through migration region during iterative phase.
+
+- A ``save_live_complete_precopy`` function that resets _RUNNING flag from VFIO
+device state, saves device config space, if any, and iteratively copies
+remaining data for VFIO device till pending_bytes returned by vendor driver
+is zero.
+
+- A ``load_state`` function loads config section and data sections generated by
+above save functions.
+
+- ``cleanup`` functions for both save and load that unmap migration region.
+
+VM state change handler is registered to change VFIO device state based on VM
+state change.
+
+Similarly, a migration state change notifier is added to get a notification on
+migration state change. These states are translated to VFIO device state and
+conveyed to vendor driver.
+
+System memory dirty pages tracking
+----------------------------------
+
+A ``log_sync`` memory listener callback is added to mark system memory pages
+as dirty which are used for DMA by VFIO device. Dirty pages bitmap is queried
+per container. All pages pinned by vendor driver through vfio_pin_pages()
+external API have to be marked as dirty during migration. When there are CPU
+writes, CPU dirty page tracking can identify dirtied pages, but any page pinned
+by vendor driver can also be written by device. There is currently no device
+which has hardware support for dirty page tracking. So all pages which are
+pinned by vendor driver are considered as dirty.
+Dirty pages are tracked when device is in stop-and-copy phase because if pages
+are marked dirty during pre-copy phase and content is transfered from source to
+destination, there is no way to know newly dirtied pages from the point they
+were copied earlier until device stops. To avoid repeated copy of same content,
+pinned pages are marked dirty only during stop-and-copy phase.
+
+System memory dirty pages tracking when vIOMMU is enabled
+---------------------------------------------------------
+With vIOMMU, IO virtual address range can get unmapped while in pre-copy phase
+of migration. In that case, unmap ioctl returns pages pinned in that range and
+QEMU reports corresponding guest physical pages dirty.
+During stop-and-copy phase, an IOMMU notifier is used to get a callback for
+mapped pages and then dirty pages bitmap is fetched from VFIO IOMMU modules for
+those mapped ranges.
+
+Flow of state changes during Live migration
+===========================================
+Below is the flow of state change during live migration where states in brackets
+represent VM state, migration state and VFIO device state as:
+(VM state, MIGRATION_STATUS, VFIO_DEVICE_STATE)
+
+Live migration save path
+------------------------
+QEMU normal running state
+(RUNNING, _NONE, _RUNNING)
+|
+   migrate_init spawns

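The dirty-tracking policy described in the documentation fragment above — pinned pages are reported dirty only once the stop-and-copy phase begins, to avoid re-sending pages the device may dirty again — can be sketched as follows. This is an illustrative model only, not QEMU code; the names `MigPhase` and `mark_pinned_dirty` are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch: pinned pages are marked dirty only while the
 * device is in the stop-and-copy phase, as described in the docs above. */
typedef enum { PHASE_PRE_COPY, PHASE_STOP_COPY } MigPhase;

/* Mark 'npages' pages starting at page frame 'pfn' in a byte-per-page
 * bitmap, but only once stop-and-copy has begun. Returns the number of
 * pages actually marked. */
static size_t mark_pinned_dirty(uint8_t *bitmap, size_t pfn,
                                size_t npages, MigPhase phase)
{
    if (phase != PHASE_STOP_COPY) {
        /* Pre-copy: skip, so the same pinned pages are not copied twice. */
        return 0;
    }
    for (size_t i = 0; i < npages; i++) {
        bitmap[pfn + i] = 1;
    }
    return npages;
}
```

Marking in pre-copy would be wasted work here because the device can keep writing pinned pages until it is stopped, so every pre-copy transfer of them would have to be repeated anyway.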
Re: [PATCH v29 05/17] vfio: Add VM state change handler to know state of VM

2020-10-26 Thread Kirti Wankhede




On 10/26/2020 7:32 PM, Alex Williamson wrote:

On Mon, 26 Oct 2020 19:18:51 +0530
Kirti Wankhede  wrote:


On 10/26/2020 6:30 PM, Alex Williamson wrote:

On Mon, 26 Oct 2020 15:06:15 +0530
Kirti Wankhede  wrote:
   

VM state change handler is called on change in VM's state. Based on
VM state, VFIO device state should be changed.
Added read/write helper functions for migration region.
Added function to set device_state.

Signed-off-by: Kirti Wankhede 
Reviewed-by: Neo Jia 
Reviewed-by: Dr. David Alan Gilbert 
Reviewed-by: Cornelia Huck 
---
   hw/vfio/migration.c   | 158 ++
   hw/vfio/trace-events  |   2 +
   include/hw/vfio/vfio-common.h |   4 ++
   3 files changed, 164 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index fd7faf423cdc..65ce735d667b 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c

[snip]

@@ -64,6 +216,9 @@ static int vfio_migration_init(VFIODevice *vbasedev,
   ret = -EINVAL;
   goto err;
   }
+
+migration->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
+   vbasedev);
   return 0;
   
   err:


Fails to build, @migration is not defined.  We could use
vbasedev->migration or pull defining and setting @migration from patch
06.  Thanks,
   


Pulling and setting migration from patch 06 seems better option.
Should I resend patch 5 & 6 only?


I've resolved this locally as patch 05:

@@ -38,6 +190,7 @@ static int vfio_migration_init(VFIODevice *vbasedev,
  {
  int ret;
  Object *obj;
+VFIOMigration *migration;
  
  if (!vbasedev->ops->vfio_get_object) {

  return -EINVAL;
@@ -64,6 +217,10 @@ static int vfio_migration_init(VFIODevice *vbasedev,
  ret = -EINVAL;
  goto err;
  }
+
+migration = vbasedev->migration;
+migration->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
+   vbasedev);
  return 0;
  
  err:


patch 06:

@@ -219,8 +243,11 @@ static int vfio_migration_init(VFIODevice *vbasedev,
  }
  
  migration = vbasedev->migration;

+migration->vbasedev = vbasedev;
   migration->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
 vbasedev);
+migration->migration_state.notify = vfio_migration_state_notifier;
+add_migration_state_change_notifier(&migration->migration_state);
  return 0;
  
  err:


If you're satisfied with that, no need to resend.  Thanks,



Yes, this is exactly what I was going to send.
Thanks for fixing it.

Thanks,
Kirti



Re: [PATCH v29 05/17] vfio: Add VM state change handler to know state of VM

2020-10-26 Thread Kirti Wankhede




On 10/26/2020 6:30 PM, Alex Williamson wrote:

On Mon, 26 Oct 2020 15:06:15 +0530
Kirti Wankhede  wrote:


VM state change handler is called on change in VM's state. Based on
VM state, VFIO device state should be changed.
Added read/write helper functions for migration region.
Added function to set device_state.

Signed-off-by: Kirti Wankhede 
Reviewed-by: Neo Jia 
Reviewed-by: Dr. David Alan Gilbert 
Reviewed-by: Cornelia Huck 
---
  hw/vfio/migration.c   | 158 ++
  hw/vfio/trace-events  |   2 +
  include/hw/vfio/vfio-common.h |   4 ++
  3 files changed, 164 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index fd7faf423cdc..65ce735d667b 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c

[snip]

@@ -64,6 +216,9 @@ static int vfio_migration_init(VFIODevice *vbasedev,
  ret = -EINVAL;
  goto err;
  }
+
+migration->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
+   vbasedev);
  return 0;
  
  err:


Fails to build, @migration is not defined.  We could use
vbasedev->migration or pull defining and setting @migration from patch
06.  Thanks,



Pulling and setting migration from patch 06 seems better option.
Should I resend patch 5 & 6 only?

Thanks,
Kirti



[PATCH v29 14/17] vfio: Dirty page tracking when vIOMMU is enabled

2020-10-26 Thread Kirti Wankhede
When vIOMMU is enabled, register MAP notifier from log_sync when all
devices in container are in stop and copy phase of migration. Call replay
and get dirty pages from notifier callback.

Suggested-by: Alex Williamson 
Signed-off-by: Kirti Wankhede 
Reviewed-by: Yan Zhao 
---
 hw/vfio/common.c | 88 
 hw/vfio/trace-events |  1 +
 2 files changed, 83 insertions(+), 6 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 2634387df948..c0b5b6245a47 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -442,8 +442,8 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
 }
 
 /* Called with rcu_read_lock held.  */
-static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
-   bool *read_only)
+static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
+   ram_addr_t *ram_addr, bool *read_only)
 {
 MemoryRegion *mr;
 hwaddr xlat;
@@ -474,8 +474,17 @@ static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
 return false;
 }
 
-*vaddr = memory_region_get_ram_ptr(mr) + xlat;
-*read_only = !writable || mr->readonly;
+if (vaddr) {
+*vaddr = memory_region_get_ram_ptr(mr) + xlat;
+}
+
+if (ram_addr) {
+*ram_addr = memory_region_get_ram_addr(mr) + xlat;
+}
+
+if (read_only) {
+*read_only = !writable || mr->readonly;
+}
 
 return true;
 }
@@ -485,7 +494,6 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
 VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
 VFIOContainer *container = giommu->container;
 hwaddr iova = iotlb->iova + giommu->iommu_offset;
-bool read_only;
 void *vaddr;
 int ret;
 
@@ -501,7 +509,9 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
 rcu_read_lock();
 
 if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
-if (!vfio_get_vaddr(iotlb, &vaddr, &read_only)) {
+bool read_only;
+
+if (!vfio_get_xlat_addr(iotlb, &vaddr, NULL, &read_only)) {
 goto out;
 }
 /*
@@ -899,11 +909,77 @@ err_out:
 return ret;
 }
 
+typedef struct {
+IOMMUNotifier n;
+VFIOGuestIOMMU *giommu;
+} vfio_giommu_dirty_notifier;
+
+static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
+{
+vfio_giommu_dirty_notifier *gdn = container_of(n,
+vfio_giommu_dirty_notifier, n);
+VFIOGuestIOMMU *giommu = gdn->giommu;
+VFIOContainer *container = giommu->container;
+hwaddr iova = iotlb->iova + giommu->iommu_offset;
+ram_addr_t translated_addr;
+
+trace_vfio_iommu_map_dirty_notify(iova, iova + iotlb->addr_mask);
+
+if (iotlb->target_as != _space_memory) {
+error_report("Wrong target AS \"%s\", only system memory is allowed",
+ iotlb->target_as->name ? iotlb->target_as->name : "none");
+return;
+}
+
+rcu_read_lock();
+if (vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL)) {
+int ret;
+
+ret = vfio_get_dirty_bitmap(container, iova, iotlb->addr_mask + 1,
+translated_addr);
+if (ret) {
+error_report("vfio_iommu_map_dirty_notify(%p, 0x%"HWADDR_PRIx", "
+ "0x%"HWADDR_PRIx") = %d (%m)",
+ container, iova,
+ iotlb->addr_mask + 1, ret);
+}
+}
+rcu_read_unlock();
+}
+
 static int vfio_sync_dirty_bitmap(VFIOContainer *container,
   MemoryRegionSection *section)
 {
 ram_addr_t ram_addr;
 
+if (memory_region_is_iommu(section->mr)) {
+VFIOGuestIOMMU *giommu;
+
+QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
+if (MEMORY_REGION(giommu->iommu) == section->mr &&
+giommu->n.start == section->offset_within_region) {
+Int128 llend;
+vfio_giommu_dirty_notifier gdn = { .giommu = giommu };
+int idx = memory_region_iommu_attrs_to_index(giommu->iommu,
+   MEMTXATTRS_UNSPECIFIED);
+
+llend = int128_add(int128_make64(section->offset_within_region),
+   section->size);
+llend = int128_sub(llend, int128_one());
+
+iommu_notifier_init(&gdn.n,
+vfio_iommu_map_dirty_notify,
+IOMMU_NOTIFIER_MAP,
+section->offset_within_region,
+int128_get64(llend),
+idx);
+memory_regi

[PATCH v29 11/17] vfio: Get migration capability flags for container

2020-10-26 Thread Kirti Wankhede
Added helper functions to get IOMMU info capability chain.
Added function to get migration capability information from that
capability chain for IOMMU container.

Similar change was proposed earlier:
https://lists.gnu.org/archive/html/qemu-devel/2018-05/msg03759.html

Disable migration for devices if IOMMU module doesn't support migration
capability.

Signed-off-by: Kirti Wankhede 
Cc: Shameer Kolothum 
Cc: Eric Auger 
---
 hw/vfio/common.c  | 90 +++
 hw/vfio/migration.c   |  7 +++-
 include/hw/vfio/vfio-common.h |  3 ++
 3 files changed, 91 insertions(+), 9 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index c6e98b8d61be..d4959c036dd1 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1228,6 +1228,75 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
 return 0;
 }
 
+static int vfio_get_iommu_info(VFIOContainer *container,
+   struct vfio_iommu_type1_info **info)
+{
+
+size_t argsz = sizeof(struct vfio_iommu_type1_info);
+
+*info = g_new0(struct vfio_iommu_type1_info, 1);
+again:
+(*info)->argsz = argsz;
+
+if (ioctl(container->fd, VFIO_IOMMU_GET_INFO, *info)) {
+g_free(*info);
+*info = NULL;
+return -errno;
+}
+
+if (((*info)->argsz > argsz)) {
+argsz = (*info)->argsz;
+*info = g_realloc(*info, argsz);
+goto again;
+}
+
+return 0;
+}
+
+static struct vfio_info_cap_header *
+vfio_get_iommu_info_cap(struct vfio_iommu_type1_info *info, uint16_t id)
+{
+struct vfio_info_cap_header *hdr;
+void *ptr = info;
+
+if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) {
+return NULL;
+}
+
+for (hdr = ptr + info->cap_offset; hdr != ptr; hdr = ptr + hdr->next) {
+if (hdr->id == id) {
+return hdr;
+}
+}
+
+return NULL;
+}
+
+static void vfio_get_iommu_info_migration(VFIOContainer *container,
+ struct vfio_iommu_type1_info *info)
+{
+struct vfio_info_cap_header *hdr;
+struct vfio_iommu_type1_info_cap_migration *cap_mig;
+
+hdr = vfio_get_iommu_info_cap(info, VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION);
+if (!hdr) {
+return;
+}
+
+cap_mig = container_of(hdr, struct vfio_iommu_type1_info_cap_migration,
+header);
+
+/*
+ * cpu_physical_memory_set_dirty_lebitmap() expects pages in bitmap of
+ * TARGET_PAGE_SIZE to mark those dirty.
+ */
+if (cap_mig->pgsize_bitmap & TARGET_PAGE_SIZE) {
+container->dirty_pages_supported = true;
+container->max_dirty_bitmap_size = cap_mig->max_dirty_bitmap_size;
+container->dirty_pgsizes = cap_mig->pgsize_bitmap;
+}
+}
+
 static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
   Error **errp)
 {
@@ -1297,6 +1366,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
 container->space = space;
 container->fd = fd;
 container->error = NULL;
+container->dirty_pages_supported = false;
+QLIST_INIT(&container->giommu_list);
+QLIST_INIT(&container->hostwin_list);
 
@@ -1309,7 +1379,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
 case VFIO_TYPE1v2_IOMMU:
 case VFIO_TYPE1_IOMMU:
 {
-struct vfio_iommu_type1_info info;
+struct vfio_iommu_type1_info *info;
 
 /*
  * FIXME: This assumes that a Type1 IOMMU can map any 64-bit
@@ -1318,15 +1388,19 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
  * existing Type1 IOMMUs generally support any IOVA we're
  * going to actually try in practice.
  */
-info.argsz = sizeof(info);
-ret = ioctl(fd, VFIO_IOMMU_GET_INFO, &info);
-/* Ignore errors */
-if (ret || !(info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
+ret = vfio_get_iommu_info(container, &info);
+
+if (ret || !(info->flags & VFIO_IOMMU_INFO_PGSIZES)) {
 /* Assume 4k IOVA page size */
-info.iova_pgsizes = 4096;
+info->iova_pgsizes = 4096;
 }
-vfio_host_win_add(container, 0, (hwaddr)-1, info.iova_pgsizes);
-container->pgsizes = info.iova_pgsizes;
+vfio_host_win_add(container, 0, (hwaddr)-1, info->iova_pgsizes);
+container->pgsizes = info->iova_pgsizes;
+
+if (!ret) {
+vfio_get_iommu_info_migration(container, info);
+}
+g_free(info);
 break;
 }
 case VFIO_SPAPR_TCE_v2_IOMMU:
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 6ac72b46a88b..93f8fe7bd869 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -832,9 +832,14 @@ err:
 
 int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
 {
+VFIOContainer *container = vbasedev->gro

[PATCH v29 17/17] qapi: Add VFIO devices migration stats in Migration stats

2020-10-26 Thread Kirti Wankhede
Added amount of bytes transferred to the VM at destination by all VFIO
devices

Signed-off-by: Kirti Wankhede 
Reviewed-by: Dr. David Alan Gilbert 
---
 hw/vfio/common.c  | 19 +++
 hw/vfio/migration.c   |  9 +
 include/hw/vfio/vfio-common.h |  3 +++
 migration/migration.c | 17 +
 monitor/hmp-cmds.c|  6 ++
 qapi/migration.json   | 17 +
 6 files changed, 71 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 49c68a5253ae..56f6fee66a55 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -292,6 +292,25 @@ const MemoryRegionOps vfio_region_ops = {
  * Device state interfaces
  */
 
+bool vfio_mig_active(void)
+{
+VFIOGroup *group;
+VFIODevice *vbasedev;
+
+if (QLIST_EMPTY(&vfio_group_list)) {
+return false;
+}
+
+QLIST_FOREACH(group, &vfio_group_list, next) {
+QLIST_FOREACH(vbasedev, &group->device_list, next) {
+if (vbasedev->migration_blocker) {
+return false;
+}
+}
+}
+return true;
+}
+
 static bool vfio_devices_all_stopped_and_saving(VFIOContainer *container)
 {
 VFIOGroup *group;
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index ffedbcca179d..2d657289c68e 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -45,6 +45,8 @@
#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
 
+static int64_t bytes_transferred;
+
 static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count,
   off_t off, bool iswrite)
 {
@@ -255,6 +257,7 @@ static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev, uint64_t *size)
 *size = data_size;
 }
 
+bytes_transferred += data_size;
 return ret;
 }
 
@@ -785,6 +788,7 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data)
 case MIGRATION_STATUS_CANCELLING:
 case MIGRATION_STATUS_CANCELLED:
 case MIGRATION_STATUS_FAILED:
+bytes_transferred = 0;
 ret = vfio_migration_set_state(vbasedev,
   ~(VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING),
   VFIO_DEVICE_STATE_RUNNING);
@@ -866,6 +870,11 @@ err:
 
 /* -- */
 
+int64_t vfio_mig_bytes_transferred(void)
+{
+return bytes_transferred;
+}
+
 int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
 {
 VFIOContainer *container = vbasedev->group->container;
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index b1c1b18fd228..24e299d97425 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -203,6 +203,9 @@ extern const MemoryRegionOps vfio_region_ops;
 typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
 extern VFIOGroupList vfio_group_list;
 
+bool vfio_mig_active(void);
+int64_t vfio_mig_bytes_transferred(void);
+
 #ifdef CONFIG_LINUX
 int vfio_get_region_info(VFIODevice *vbasedev, int index,
  struct vfio_region_info **info);
diff --git a/migration/migration.c b/migration/migration.c
index 0575ecb37953..995ccd96a774 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -57,6 +57,10 @@
 #include "qemu/queue.h"
 #include "multifd.h"
 
+#ifdef CONFIG_VFIO
+#include "hw/vfio/vfio-common.h"
+#endif
+
#define MAX_THROTTLE  (128 << 20)  /* Migration transfer speed throttling */
 
 /* Amount of time to allocate to each "chunk" of bandwidth-throttled
@@ -1002,6 +1006,17 @@ static void populate_disk_info(MigrationInfo *info)
 }
 }
 
+static void populate_vfio_info(MigrationInfo *info)
+{
+#ifdef CONFIG_VFIO
+if (vfio_mig_active()) {
+info->has_vfio = true;
+info->vfio = g_malloc0(sizeof(*info->vfio));
+info->vfio->transferred = vfio_mig_bytes_transferred();
+}
+#endif
+}
+
 static void fill_source_migration_info(MigrationInfo *info)
 {
 MigrationState *s = migrate_get_current();
@@ -1026,6 +1041,7 @@ static void fill_source_migration_info(MigrationInfo *info)
 populate_time_info(info, s);
 populate_ram_info(info, s);
 populate_disk_info(info);
+populate_vfio_info(info);
 break;
 case MIGRATION_STATUS_COLO:
 info->has_status = true;
@@ -1034,6 +1050,7 @@ static void fill_source_migration_info(MigrationInfo *info)
 case MIGRATION_STATUS_COMPLETED:
 populate_time_info(info, s);
 populate_ram_info(info, s);
+populate_vfio_info(info);
 break;
 case MIGRATION_STATUS_FAILED:
 info->has_status = true;
diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
index 9789f4277f50..56e9bad33d94 100644
--- a/monitor/hmp-cmds.c
+++ b/monitor/hmp-cmds.c
@@ -357,6 +357,12 @@ void hmp_info_migrate

[PATCH v29 09/17] vfio: Add load state functions to SaveVMHandlers

2020-10-26 Thread Kirti Wankhede
Sequence during _RESUMING device state:
While data for this device is available, repeat the steps below:
a. read data_offset, which tells where the user application should write data.
b. write data of data_size to the migration region from data_offset.
c. write data_size, which indicates to the vendor driver that data has been
   written to the staging buffer.

To the user, the data is opaque. The user should write data in the same order
as it was received.
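The chunking rule behind steps a–c above can be sketched in isolation: when the incoming data_size exceeds what fits between data_offset and the end of the destination's migration region (which may be smaller than the source's), the write is split and data_size is reported back once per chunk. This is purely illustrative arithmetic, not the QEMU implementation; `load_chunks` and `avail` are hypothetical names.

```c
#include <stdint.h>

/* Count how many write iterations vfio_load_buffer() would need for a
 * given incoming data_size when only 'avail' bytes fit in the region's
 * data section per iteration. One data_size write-back happens per chunk. */
static unsigned load_chunks(uint64_t data_size, uint64_t avail)
{
    unsigned chunks = 0;

    while (data_size) {
        uint64_t report = data_size > avail ? avail : data_size;

        data_size -= report;
        chunks++;
    }
    return chunks;
}
```

So a 10-byte payload against a 4-byte data section takes three iterations, while anything that fits takes exactly one.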

Signed-off-by: Kirti Wankhede 
Reviewed-by: Neo Jia 
Reviewed-by: Dr. David Alan Gilbert 
Reviewed-by: Yan Zhao 
---
 hw/vfio/migration.c  | 195 +++
 hw/vfio/trace-events |   4 ++
 2 files changed, 199 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 41d568558479..6ac72b46a88b 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -257,6 +257,77 @@ static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev, uint64_t *size)
 return ret;
 }
 
+static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
+uint64_t data_size)
+{
+VFIORegion *region = >migration->region;
+uint64_t data_offset = 0, size, report_size;
+int ret;
+
+do {
+ret = vfio_mig_read(vbasedev, &data_offset, sizeof(data_offset),
+  region->fd_offset + VFIO_MIG_STRUCT_OFFSET(data_offset));
+if (ret < 0) {
+return ret;
+}
+
+if (data_offset + data_size > region->size) {
+/*
+ * If data_size is greater than the data section of migration region
+ * then iterate the write buffer operation. This case can occur if
+ * size of migration region at destination is smaller than size of
+ * migration region at source.
+ */
+report_size = size = region->size - data_offset;
+data_size -= size;
+} else {
+report_size = size = data_size;
+data_size = 0;
+}
+
+trace_vfio_load_state_device_data(vbasedev->name, data_offset, size);
+
+while (size) {
+void *buf;
+uint64_t sec_size;
+bool buf_alloc = false;
+
+buf = get_data_section_size(region, data_offset, size, &sec_size);
+
+if (!buf) {
+buf = g_try_malloc(sec_size);
+if (!buf) {
+error_report("%s: Error allocating buffer ", __func__);
+return -ENOMEM;
+}
+buf_alloc = true;
+}
+
+qemu_get_buffer(f, buf, sec_size);
+
+if (buf_alloc) {
+ret = vfio_mig_write(vbasedev, buf, sec_size,
+region->fd_offset + data_offset);
+g_free(buf);
+
+if (ret < 0) {
+return ret;
+}
+}
+size -= sec_size;
+data_offset += sec_size;
+}
+
+ret = vfio_mig_write(vbasedev, &report_size, sizeof(report_size),
+region->fd_offset + VFIO_MIG_STRUCT_OFFSET(data_size));
+if (ret < 0) {
+return ret;
+}
+} while (data_size);
+
+return 0;
+}
+
 static int vfio_update_pending(VFIODevice *vbasedev)
 {
 VFIOMigration *migration = vbasedev->migration;
@@ -293,6 +364,33 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
 return qemu_file_get_error(f);
 }
 
+static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
+{
+VFIODevice *vbasedev = opaque;
+uint64_t data;
+
+if (vbasedev->ops && vbasedev->ops->vfio_load_config) {
+int ret;
+
+ret = vbasedev->ops->vfio_load_config(vbasedev, f);
+if (ret) {
+error_report("%s: Failed to load device config space",
+ vbasedev->name);
+return ret;
+}
+}
+
+data = qemu_get_be64(f);
+if (data != VFIO_MIG_FLAG_END_OF_STATE) {
+error_report("%s: Failed loading device config space, "
+ "end flag incorrect 0x%"PRIx64, vbasedev->name, data);
+return -EINVAL;
+}
+
+trace_vfio_load_device_config_state(vbasedev->name);
+return qemu_file_get_error(f);
+}
+
 static void vfio_migration_cleanup(VFIODevice *vbasedev)
 {
 VFIOMigration *migration = vbasedev->migration;
@@ -483,12 +581,109 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
 return ret;
 }
 
+static int vfio_load_setup(QEMUFile *f, void *opaque)
+{
+VFIODevice *vbasedev = opaque;
+VFIOMigration *migration = vbasedev->migration;
+int ret = 0;
+
+if (migration->region.mmaps) {
+ret = vfio_region_mmap(&migration->region);
+if (ret) {
+error_report("%s: Failed to mmap VFIO migration region %d: %s",
+ vbasedev->name, mi

[PATCH v29 16/17] vfio: Make vfio-pci device migration capable

2020-10-26 Thread Kirti Wankhede
If the device is not a failover primary device, call
vfio_migration_probe() to enable migration support for those devices
that support it, and vfio_migration_finalize() to tear it down again.
Removed the migration blocker from the VFIO PCI device specific structure
and use the migration blocker from the generic VFIO device structure.

Signed-off-by: Kirti Wankhede 
Reviewed-by: Neo Jia 
Reviewed-by: Dr. David Alan Gilbert 
Reviewed-by: Cornelia Huck 
---
 hw/vfio/pci.c | 28 
 hw/vfio/pci.h |  1 -
 2 files changed, 8 insertions(+), 21 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index e27c88be6d85..58c0ce8971e3 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2791,17 +2791,6 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
 return;
 }
 
-if (!pdev->failover_pair_id) {
-error_setg(>migration_blocker,
-"VFIO device doesn't support migration");
-ret = migrate_add_blocker(vdev->migration_blocker, errp);
-if (ret) {
-error_free(vdev->migration_blocker);
-vdev->migration_blocker = NULL;
-return;
-}
-}
-
 vdev->vbasedev.name = g_path_get_basename(vdev->vbasedev.sysfsdev);
 vdev->vbasedev.ops = _pci_ops;
 vdev->vbasedev.type = VFIO_DEVICE_TYPE_PCI;
@@ -3069,6 +3058,13 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
 }
 }
 
+if (!pdev->failover_pair_id) {
+ret = vfio_migration_probe(>vbasedev, errp);
+if (ret) {
+error_report("%s: Migration disabled", vdev->vbasedev.name);
+}
+}
+
 vfio_register_err_notifier(vdev);
 vfio_register_req_notifier(vdev);
 vfio_setup_resetfn_quirk(vdev);
@@ -3083,11 +3079,6 @@ out_teardown:
 vfio_bars_exit(vdev);
 error:
 error_prepend(errp, VFIO_MSG_PREFIX, vdev->vbasedev.name);
-if (vdev->migration_blocker) {
-migrate_del_blocker(vdev->migration_blocker);
-error_free(vdev->migration_blocker);
-vdev->migration_blocker = NULL;
-}
 }
 
 static void vfio_instance_finalize(Object *obj)
@@ -3099,10 +3090,6 @@ static void vfio_instance_finalize(Object *obj)
 vfio_bars_finalize(vdev);
 g_free(vdev->emulated_config_bits);
 g_free(vdev->rom);
-if (vdev->migration_blocker) {
-migrate_del_blocker(vdev->migration_blocker);
-error_free(vdev->migration_blocker);
-}
 /*
  * XXX Leaking igd_opregion is not an oversight, we can't remove the
  * fw_cfg entry therefore leaking this allocation seems like the safest
@@ -3130,6 +3117,7 @@ static void vfio_exitfn(PCIDevice *pdev)
 }
 vfio_teardown_msi(vdev);
 vfio_bars_exit(vdev);
+vfio_migration_finalize(>vbasedev);
 }
 
 static void vfio_pci_reset(DeviceState *dev)
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index bce71a9ac93f..1574ef983f8f 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -172,7 +172,6 @@ struct VFIOPCIDevice {
 bool no_vfio_ioeventfd;
 bool enable_ramfb;
 VFIODisplay *dpy;
-Error *migration_blocker;
 Notifier irqchip_change_notifier;
 };
 
-- 
2.7.0




[PATCH v29 06/17] vfio: Add migration state change notifier

2020-10-26 Thread Kirti Wankhede
Added migration state change notifier to get notification on migration state
change. These states are translated to VFIO device state and conveyed to
vendor driver.
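On cancel or failure, the notifier resets the device to _RUNNING using the `(state & mask) | value` update implemented by vfio_migration_set_state() earlier in the series. A sketch of just that arithmetic, with bit values as defined for the v1 migration protocol in `<linux/vfio.h>`:

```c
#include <stdint.h>

/* VFIO device state bits (v1 migration protocol, <linux/vfio.h>). */
#define VFIO_DEVICE_STATE_RUNNING  (1u << 0)
#define VFIO_DEVICE_STATE_SAVING   (1u << 1)
#define VFIO_DEVICE_STATE_RESUMING (1u << 2)

/* Clear the bits outside 'mask', then set 'value' — the update applied
 * by vfio_migration_set_state(). */
static uint32_t set_state(uint32_t state, uint32_t mask, uint32_t value)
{
    return (state & mask) | value;
}
```

With mask `~(_SAVING | _RESUMING)` and value `_RUNNING`, any combination of saving/resuming bits collapses back to a plain running state, which is exactly what the cancel/failure path wants.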

Signed-off-by: Kirti Wankhede 
Reviewed-by: Neo Jia 
Reviewed-by: Dr. David Alan Gilbert 
Reviewed-by: Cornelia Huck 
---
 hw/vfio/migration.c   | 30 ++
 hw/vfio/trace-events  |  1 +
 include/hw/vfio/vfio-common.h |  2 ++
 3 files changed, 33 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 65ce735d667b..888a615d39ea 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -175,6 +175,30 @@ static void vfio_vmstate_change(void *opaque, int running, RunState state)
 (migration->device_state & mask) | value);
 }
 
+static void vfio_migration_state_notifier(Notifier *notifier, void *data)
+{
+MigrationState *s = data;
+VFIOMigration *migration = container_of(notifier, VFIOMigration,
+migration_state);
+VFIODevice *vbasedev = migration->vbasedev;
+int ret;
+
+trace_vfio_migration_state_notifier(vbasedev->name,
+MigrationStatus_str(s->state));
+
+switch (s->state) {
+case MIGRATION_STATUS_CANCELLING:
+case MIGRATION_STATUS_CANCELLED:
+case MIGRATION_STATUS_FAILED:
+ret = vfio_migration_set_state(vbasedev,
+  ~(VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING),
+  VFIO_DEVICE_STATE_RUNNING);
+if (ret) {
+error_report("%s: Failed to set state RUNNING", vbasedev->name);
+}
+}
+}
+
 static void vfio_migration_exit(VFIODevice *vbasedev)
 {
 VFIOMigration *migration = vbasedev->migration;
@@ -190,6 +214,7 @@ static int vfio_migration_init(VFIODevice *vbasedev,
 {
 int ret;
 Object *obj;
+VFIOMigration *migration;
 
 if (!vbasedev->ops->vfio_get_object) {
 return -EINVAL;
@@ -217,8 +242,12 @@ static int vfio_migration_init(VFIODevice *vbasedev,
 goto err;
 }
 
+migration = vbasedev->migration;
+migration->vbasedev = vbasedev;
 migration->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
vbasedev);
+migration->migration_state.notify = vfio_migration_state_notifier;
+add_migration_state_change_notifier(&migration->migration_state);
 return 0;
 
 err:
@@ -268,6 +297,7 @@ void vfio_migration_finalize(VFIODevice *vbasedev)
 if (vbasedev->migration) {
 VFIOMigration *migration = vbasedev->migration;
 
+remove_migration_state_change_notifier(&migration->migration_state);
 qemu_del_vm_change_state_handler(migration->vm_state);
 vfio_migration_exit(vbasedev);
 }
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 41de81f12f60..78d7d83b5ef8 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -150,3 +150,4 @@ vfio_display_edid_write_error(void) ""
 vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
 vfio_migration_set_state(const char *name, uint32_t state) " (%s) state %d"
 vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
+vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 9a571f1fb552..2bd593ba38bb 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -59,10 +59,12 @@ typedef struct VFIORegion {
 } VFIORegion;
 
 typedef struct VFIOMigration {
+struct VFIODevice *vbasedev;
 VMChangeStateEntry *vm_state;
 VFIORegion region;
 uint32_t device_state;
 int vm_running;
+Notifier migration_state;
 } VFIOMigration;
 
 typedef struct VFIOAddressSpace {
-- 
2.7.0




[PATCH v29 10/17] memory: Set DIRTY_MEMORY_MIGRATION when IOMMU is enabled

2020-10-26 Thread Kirti Wankhede
mr->ram_block is NULL when mr->is_iommu is true, then fr.dirty_log_mask
wasn't set correctly due to which memory listener's log_sync doesn't
get called.
This patch returns log_mask with DIRTY_MEMORY_MIGRATION set when
IOMMU is enabled.

Signed-off-by: Kirti Wankhede 
Reviewed-by: Yan Zhao 
---
 softmmu/memory.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/softmmu/memory.c b/softmmu/memory.c
index 403ff3abc99b..94f606e9d9d9 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -1792,7 +1792,7 @@ bool memory_region_is_ram_device(MemoryRegion *mr)
 uint8_t memory_region_get_dirty_log_mask(MemoryRegion *mr)
 {
 uint8_t mask = mr->dirty_log_mask;
-if (global_dirty_log && mr->ram_block) {
+if (global_dirty_log && (mr->ram_block || memory_region_is_iommu(mr))) {
 mask |= (1 << DIRTY_MEMORY_MIGRATION);
 }
 return mask;
-- 
2.7.0




[PATCH v29 13/17] vfio: Add vfio_listener_log_sync to mark dirty pages

2020-10-26 Thread Kirti Wankhede
vfio_listener_log_sync gets the list of dirty pages from the container using
the VFIO_IOMMU_GET_DIRTY_BITMAP ioctl and marks those pages dirty when all
devices are stopped and saving state.
Return early for the RAM block section of a mapped MMIO region.
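The `vfio_get_dirty_bitmap()` helper in this patch sizes its bitmap by counting pages at TARGET_PAGE_SIZE granularity and rounding up to a whole number of 64-bit words, expressed in bytes. A standalone sketch of that arithmetic, assuming a 4 KiB target page (an assumption here — TARGET_PAGE_SIZE is target-dependent):

```c
#include <stdint.h>

#define PAGE_SIZE_ 4096ULL   /* assumed TARGET_PAGE_SIZE */
#define PAGE_BITS_ 12        /* assumed TARGET_PAGE_BITS */

/* Bytes needed for a dirty bitmap covering 'range_size' bytes:
 * pages = TARGET_PAGE_ALIGN(size) >> TARGET_PAGE_BITS, then round the
 * page count up to a multiple of 64 bits and convert to bytes, matching
 * the ROUND_UP(pages, 64) / BITS_PER_BYTE computation in the patch. */
static uint64_t dirty_bitmap_bytes(uint64_t range_size)
{
    uint64_t pages = (range_size + PAGE_SIZE_ - 1) >> PAGE_BITS_;
    uint64_t bits_per_word = sizeof(uint64_t) * 8; /* 64 */
    uint64_t rounded = (pages + bits_per_word - 1) / bits_per_word
                       * bits_per_word;

    return rounded / 8; /* bits -> bytes */
}
```

For a 1 MiB range this gives 256 pages, i.e. four 64-bit words, or 32 bytes; even a single page still costs a full 8-byte word.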

Signed-off-by: Kirti Wankhede 
Reviewed-by: Neo Jia 
---
 hw/vfio/common.c | 116 +++
 hw/vfio/trace-events |   1 +
 2 files changed, 117 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index d4959c036dd1..2634387df948 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -29,6 +29,7 @@
 #include "hw/vfio/vfio.h"
 #include "exec/address-spaces.h"
 #include "exec/memory.h"
+#include "exec/ram_addr.h"
 #include "hw/hw.h"
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
@@ -37,6 +38,7 @@
 #include "sysemu/reset.h"
 #include "trace.h"
 #include "qapi/error.h"
+#include "migration/migration.h"
 
 VFIOGroupList vfio_group_list =
 QLIST_HEAD_INITIALIZER(vfio_group_list);
@@ -287,6 +289,39 @@ const MemoryRegionOps vfio_region_ops = {
 };
 
 /*
+ * Device state interfaces
+ */
+
+static bool vfio_devices_all_stopped_and_saving(VFIOContainer *container)
+{
+VFIOGroup *group;
+VFIODevice *vbasedev;
+MigrationState *ms = migrate_get_current();
+
+if (!migration_is_setup_or_active(ms->state)) {
+return false;
+}
+
+QLIST_FOREACH(group, &container->group_list, container_next) {
+QLIST_FOREACH(vbasedev, &group->device_list, next) {
+VFIOMigration *migration = vbasedev->migration;
+
+if (!migration) {
+return false;
+}
+
+if ((migration->device_state & VFIO_DEVICE_STATE_SAVING) &&
+!(migration->device_state & VFIO_DEVICE_STATE_RUNNING)) {
+continue;
+} else {
+return false;
+}
+}
+}
+return true;
+}
+
+/*
  * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
  */
 static int vfio_dma_unmap(VFIOContainer *container,
@@ -812,9 +847,90 @@ static void vfio_listener_region_del(MemoryListener *listener,
 }
 }
 
+static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
+ uint64_t size, ram_addr_t ram_addr)
+{
+struct vfio_iommu_type1_dirty_bitmap *dbitmap;
+struct vfio_iommu_type1_dirty_bitmap_get *range;
+uint64_t pages;
+int ret;
+
+dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range));
+
+dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range);
+dbitmap->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
+range = (struct vfio_iommu_type1_dirty_bitmap_get *)&dbitmap->data;
+range->iova = iova;
+range->size = size;
+
+/*
+ * cpu_physical_memory_set_dirty_lebitmap() expects pages in bitmap of
+ * TARGET_PAGE_SIZE to mark those dirty. Hence set bitmap's pgsize to
+ * TARGET_PAGE_SIZE.
+ */
+range->bitmap.pgsize = TARGET_PAGE_SIZE;
+
+pages = TARGET_PAGE_ALIGN(range->size) >> TARGET_PAGE_BITS;
+range->bitmap.size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
+ BITS_PER_BYTE;
+range->bitmap.data = g_try_malloc0(range->bitmap.size);
+if (!range->bitmap.data) {
+ret = -ENOMEM;
+goto err_out;
+}
+
+ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap);
+if (ret) {
+error_report("Failed to get dirty bitmap for iova: 0x%llx "
+"size: 0x%llx err: %d",
+range->iova, range->size, errno);
+goto err_out;
+}
+
+cpu_physical_memory_set_dirty_lebitmap((uint64_t *)range->bitmap.data,
+ram_addr, pages);
+
+trace_vfio_get_dirty_bitmap(container->fd, range->iova, range->size,
+range->bitmap.size, ram_addr);
+err_out:
+g_free(range->bitmap.data);
+g_free(dbitmap);
+
+return ret;
+}
+
+static int vfio_sync_dirty_bitmap(VFIOContainer *container,
+  MemoryRegionSection *section)
+{
+ram_addr_t ram_addr;
+
+ram_addr = memory_region_get_ram_addr(section->mr) +
+   section->offset_within_region;
+
+return vfio_get_dirty_bitmap(container,
+   TARGET_PAGE_ALIGN(section->offset_within_address_space),
+   int128_get64(section->size), ram_addr);
+}
+
+static void vfio_listerner_log_sync(MemoryListener *listener,
+MemoryRegionSection *section)
+{
+VFIOContainer *container = container_of(listener, VFIOContainer, listener);
+
+if (vfio_listener_skipped_section(section) ||
+!container->dirty_pages_supported) {
+r
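The bitmap sizing arithmetic in vfio_get_dirty_bitmap() above can be checked in isolation. A minimal sketch, assuming 4 KiB target pages; the helper names here are illustrative, not QEMU API:

```c
#include <assert.h>
#include <stdint.h>

#define TARGET_PAGE_BITS 12                        /* assumption: 4 KiB pages */
#define TARGET_PAGE_SIZE (1ULL << TARGET_PAGE_BITS)
#define TARGET_PAGE_ALIGN(x) \
    (((x) + TARGET_PAGE_SIZE - 1) & ~(TARGET_PAGE_SIZE - 1))
#define BITS_PER_BYTE 8

/* Number of TARGET_PAGE_SIZE pages needed to cover `size` bytes. */
static uint64_t dirty_bitmap_pages(uint64_t size)
{
    return TARGET_PAGE_ALIGN(size) >> TARGET_PAGE_BITS;
}

/* Bitmap allocation size in bytes: one bit per page, rounded up to a whole
 * number of 64-bit words, matching the ROUND_UP in vfio_get_dirty_bitmap(). */
static uint64_t dirty_bitmap_bytes(uint64_t size)
{
    uint64_t pages = dirty_bitmap_pages(size);
    return ((pages + 63) / 64) * 64 / BITS_PER_BYTE;
}
```

So a 64-page range needs exactly one 8-byte word, and a 65-page range rounds up to two.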

[PATCH v29 07/17] vfio: Register SaveVMHandlers for VFIO device

2020-10-26 Thread Kirti Wankhede
Define flags to be used as delimiters in the migration stream for VFIO devices.
Added .save_setup and .save_cleanup functions. Map & unmap the migration
region from these functions at the source during the saving or pre-copy phase.

Set the VFIO device state depending on the VM's state. During live migration, the
VM is running when .save_setup is called, so the _SAVING | _RUNNING state is set
for the VFIO device. During save-restore, the VM is paused, so the _SAVING state
is set for the VFIO device.

Signed-off-by: Kirti Wankhede 
Reviewed-by: Neo Jia 
Reviewed-by: Cornelia Huck 
Reviewed-by: Yan Zhao 
---
 hw/vfio/migration.c  | 102 +++
 hw/vfio/trace-events |   2 +
 2 files changed, 104 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 888a615d39ea..d3ef9e18f39c 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -8,12 +8,15 @@
  */
 
 #include "qemu/osdep.h"
+#include "qemu/main-loop.h"
+#include "qemu/cutils.h"
 #include 
 
 #include "sysemu/runstate.h"
 #include "hw/vfio/vfio-common.h"
 #include "cpu.h"
 #include "migration/migration.h"
+#include "migration/vmstate.h"
 #include "migration/qemu-file.h"
 #include "migration/register.h"
 #include "migration/blocker.h"
@@ -25,6 +28,22 @@
 #include "trace.h"
 #include "hw/hw.h"
 
+/*
+ * Flags to be used as unique delimiters for VFIO devices in the migration
+ * stream. These flags are composed as:
+ * 0xffffffff => MSB 32-bit all 1s
+ * 0xef10 => Magic ID, represents emulated (virtual) function IO
+ * 0x0000 => 16-bits reserved for flags
+ *
+ * The beginning of state information is marked by _DEV_CONFIG_STATE,
+ * _DEV_SETUP_STATE, or _DEV_DATA_STATE, respectively. The end of a
+ * certain state information is marked by _END_OF_STATE.
+ */
+#define VFIO_MIG_FLAG_END_OF_STATE  (0xffffffffef100001ULL)
+#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
+#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
+#define VFIO_MIG_FLAG_DEV_DATA_STATE(0xffffffffef100004ULL)
+
 static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count,
   off_t off, bool iswrite)
 {
@@ -129,6 +148,75 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
 return 0;
 }
 
+static void vfio_migration_cleanup(VFIODevice *vbasedev)
+{
+VFIOMigration *migration = vbasedev->migration;
+
+if (migration->region.mmaps) {
+vfio_region_unmap(&migration->region);
+}
+}
+
+/* -- */
+
+static int vfio_save_setup(QEMUFile *f, void *opaque)
+{
+VFIODevice *vbasedev = opaque;
+VFIOMigration *migration = vbasedev->migration;
+int ret;
+
+trace_vfio_save_setup(vbasedev->name);
+
+qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
+
+if (migration->region.mmaps) {
+/*
+ * Calling vfio_region_mmap() from migration thread. Memory API called
+ * from this function require locking the iothread when called from
+ * outside the main loop thread.
+ */
+qemu_mutex_lock_iothread();
+ret = vfio_region_mmap(&migration->region);
+qemu_mutex_unlock_iothread();
+if (ret) {
+error_report("%s: Failed to mmap VFIO migration region: %s",
+ vbasedev->name, strerror(-ret));
+error_report("%s: Falling back to slow path", vbasedev->name);
+}
+}
+
+ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_MASK,
+   VFIO_DEVICE_STATE_SAVING);
+if (ret) {
+error_report("%s: Failed to set state SAVING", vbasedev->name);
+return ret;
+}
+
+qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+ret = qemu_file_get_error(f);
+if (ret) {
+return ret;
+}
+
+return 0;
+}
+
+static void vfio_save_cleanup(void *opaque)
+{
+VFIODevice *vbasedev = opaque;
+
+vfio_migration_cleanup(vbasedev);
+trace_vfio_save_cleanup(vbasedev->name);
+}
+
+static SaveVMHandlers savevm_vfio_handlers = {
+.save_setup = vfio_save_setup,
+.save_cleanup = vfio_save_cleanup,
+};
+
+/* -- */
+
 static void vfio_vmstate_change(void *opaque, int running, RunState state)
 {
 VFIODevice *vbasedev = opaque;
@@ -215,6 +303,8 @@ static int vfio_migration_init(VFIODevice *vbasedev,
 int ret;
 Object *obj;
 VFIOMigration *migration;
+char id[256] = "";
+g_autofree char *path = NULL, *oid = NULL;
 
 if (!vbasedev->ops->vfio_get_object) {
 return -EINVAL;
@@ -244,6 +334,18 @@ static int vfio_migration_init(VFIODevice *vbasedev,
 
 migration = vbasedev->migration;
 migration-&g
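The delimiter layout described in patch 07's comment block (all-1s upper word, the 0xef10 magic, 16 bits of flag space) makes stream markers cheap to recognize. A small sketch of that scheme; the constants below are an assumed reconstruction from the comment, not authoritative values:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed reconstruction of the delimiter scheme: upper 32 bits all 1s,
 * then the 0xef10 magic, then 16 bits of flag space. */
#define VFIO_MIG_MAGIC   0xef10ULL
#define VFIO_MIG_FLAG(n) \
    ((0xffffffffULL << 32) | (VFIO_MIG_MAGIC << 16) | (uint64_t)(n))

/* A stream reader can cheaply distinguish delimiters from payload words. */
static int is_vfio_mig_flag(uint64_t v)
{
    return (v >> 32) == 0xffffffffULL &&
           ((v >> 16) & 0xffffULL) == VFIO_MIG_MAGIC;
}
```

Ordinary payload values (data sizes, offsets) are vanishingly unlikely to collide with the marker pattern, which is the point of the 32 high bits being all 1s.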

[PATCH v29 12/17] vfio: Add function to start and stop dirty pages tracking

2020-10-26 Thread Kirti Wankhede
Call VFIO_IOMMU_DIRTY_PAGES ioctl to start and stop dirty pages tracking
for VFIO devices.

Signed-off-by: Kirti Wankhede 
Reviewed-by: Dr. David Alan Gilbert 
---
 hw/vfio/migration.c | 36 
 1 file changed, 36 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 93f8fe7bd869..ffedbcca179d 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -11,6 +11,7 @@
 #include "qemu/main-loop.h"
 #include "qemu/cutils.h"
 #include 
+#include 
 
 #include "sysemu/runstate.h"
 #include "hw/vfio/vfio-common.h"
@@ -391,10 +392,40 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
 return qemu_file_get_error(f);
 }
 
+static int vfio_set_dirty_page_tracking(VFIODevice *vbasedev, bool start)
+{
+int ret;
+VFIOMigration *migration = vbasedev->migration;
+VFIOContainer *container = vbasedev->group->container;
+struct vfio_iommu_type1_dirty_bitmap dirty = {
+.argsz = sizeof(dirty),
+};
+
+if (start) {
+if (migration->device_state & VFIO_DEVICE_STATE_SAVING) {
+dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START;
+} else {
+return -EINVAL;
+}
+} else {
+dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP;
+}
+
+ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, &dirty);
+if (ret) {
+error_report("Failed to set dirty tracking flag 0x%x errno: %d",
+ dirty.flags, errno);
+return -errno;
+}
+return ret;
+}
+
 static void vfio_migration_cleanup(VFIODevice *vbasedev)
 {
 VFIOMigration *migration = vbasedev->migration;
 
+vfio_set_dirty_page_tracking(vbasedev, false);
+
 if (migration->region.mmaps) {
vfio_region_unmap(&migration->region);
 }
@@ -435,6 +466,11 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
 return ret;
 }
 
+ret = vfio_set_dirty_page_tracking(vbasedev, true);
+if (ret) {
+return ret;
+}
+
 qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
 
 ret = qemu_file_get_error(f);
-- 
2.7.0
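The start/stop policy in vfio_set_dirty_page_tracking() above can be summarized as a tiny predicate: tracking may only be started while the device is in a _SAVING state, while stopping is always permitted. A sketch under the assumption that the v1 uAPI bit layout is as shown (illustrative only):

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed v1 uAPI device_state bits, for illustration only. */
#define VFIO_DEVICE_STATE_RUNNING (1u << 0)
#define VFIO_DEVICE_STATE_SAVING  (1u << 1)

/* Mirrors the policy above: starting dirty-page tracking is only valid
 * while _SAVING is set; stopping is always allowed. */
static bool dirty_tracking_request_valid(unsigned int device_state, bool start)
{
    return start ? (device_state & VFIO_DEVICE_STATE_SAVING) != 0 : true;
}
```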




[PATCH v29 15/17] vfio: Add ioctl to get dirty pages bitmap during dma unmap

2020-10-26 Thread Kirti Wankhede
With vIOMMU, an IO virtual address range can get unmapped while in the pre-copy
phase of migration. In that case, the unmap ioctl should return the pages pinned
in that range, and QEMU should find their corresponding guest physical
addresses and report those dirty.

Suggested-by: Alex Williamson 
Signed-off-by: Kirti Wankhede 
Reviewed-by: Neo Jia 
---
 hw/vfio/common.c | 96 +---
 1 file changed, 92 insertions(+), 4 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index c0b5b6245a47..49c68a5253ae 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -321,11 +321,94 @@ static bool vfio_devices_all_stopped_and_saving(VFIOContainer *container)
 return true;
 }
 
+static bool vfio_devices_all_running_and_saving(VFIOContainer *container)
+{
+VFIOGroup *group;
+VFIODevice *vbasedev;
+MigrationState *ms = migrate_get_current();
+
+if (!migration_is_setup_or_active(ms->state)) {
+return false;
+}
+
+QLIST_FOREACH(group, &container->group_list, container_next) {
+QLIST_FOREACH(vbasedev, &group->device_list, next) {
+VFIOMigration *migration = vbasedev->migration;
+
+if (!migration) {
+return false;
+}
+
+if ((migration->device_state & VFIO_DEVICE_STATE_SAVING) &&
+(migration->device_state & VFIO_DEVICE_STATE_RUNNING)) {
+continue;
+} else {
+return false;
+}
+}
+}
+return true;
+}
+
+static int vfio_dma_unmap_bitmap(VFIOContainer *container,
+ hwaddr iova, ram_addr_t size,
+ IOMMUTLBEntry *iotlb)
+{
+struct vfio_iommu_type1_dma_unmap *unmap;
+struct vfio_bitmap *bitmap;
+uint64_t pages = TARGET_PAGE_ALIGN(size) >> TARGET_PAGE_BITS;
+int ret;
+
+unmap = g_malloc0(sizeof(*unmap) + sizeof(*bitmap));
+
+unmap->argsz = sizeof(*unmap) + sizeof(*bitmap);
+unmap->iova = iova;
+unmap->size = size;
+unmap->flags |= VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP;
+bitmap = (struct vfio_bitmap *)&unmap->data;
+
+/*
+ * cpu_physical_memory_set_dirty_lebitmap() expects pages in bitmap of
+ * TARGET_PAGE_SIZE to mark those dirty. Hence set bitmap_pgsize to
+ * TARGET_PAGE_SIZE.
+ */
+
+bitmap->pgsize = TARGET_PAGE_SIZE;
+bitmap->size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
+   BITS_PER_BYTE;
+
+if (bitmap->size > container->max_dirty_bitmap_size) {
+error_report("UNMAP: Size of bitmap too big 0x%llx", bitmap->size);
+ret = -E2BIG;
+goto unmap_exit;
+}
+
+bitmap->data = g_try_malloc0(bitmap->size);
+if (!bitmap->data) {
+ret = -ENOMEM;
+goto unmap_exit;
+}
+
+ret = ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, unmap);
+if (!ret) {
+cpu_physical_memory_set_dirty_lebitmap((uint64_t *)bitmap->data,
+iotlb->translated_addr, pages);
+} else {
+error_report("VFIO_UNMAP_DMA with DIRTY_BITMAP : %m");
+}
+
+g_free(bitmap->data);
+unmap_exit:
+g_free(unmap);
+return ret;
+}
+
 /*
  * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
  */
 static int vfio_dma_unmap(VFIOContainer *container,
-  hwaddr iova, ram_addr_t size)
+  hwaddr iova, ram_addr_t size,
+  IOMMUTLBEntry *iotlb)
 {
 struct vfio_iommu_type1_dma_unmap unmap = {
 .argsz = sizeof(unmap),
@@ -334,6 +417,11 @@ static int vfio_dma_unmap(VFIOContainer *container,
 .size = size,
 };
 
+if (iotlb && container->dirty_pages_supported &&
+vfio_devices_all_running_and_saving(container)) {
+return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
+}
+
while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
 /*
  * The type1 backend has an off-by-one bug in the kernel (71a7d3d78e3c
@@ -381,7 +469,7 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
  * the VGA ROM space.
  */
if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 ||
-(errno == EBUSY && vfio_dma_unmap(container, iova, size) == 0 &&
+(errno == EBUSY && vfio_dma_unmap(container, iova, size, NULL) == 0 &&
 ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) {
 return 0;
 }
@@ -531,7 +619,7 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
  iotlb->addr_mask + 1, vaddr, ret);
 }
 } else {
-ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1);
+ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1, iotlb);
 if (ret) {
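The gate in vfio_dma_unmap() above only requests a dirty bitmap when every device in the container is in the _SAVING | _RUNNING (pre-copy) state. That predicate reduces to a per-device bit check; a sketch over a plain array of state words, with the v1 uAPI bit layout assumed for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed v1 uAPI device_state bits, for illustration only. */
#define VFIO_DEVICE_STATE_RUNNING (1u << 0)
#define VFIO_DEVICE_STATE_SAVING  (1u << 1)

/* Mirrors vfio_devices_all_running_and_saving(): a single device missing
 * either bit disables the unmap-with-bitmap optimization for the container. */
static bool all_running_and_saving(const unsigned int *states, size_t n)
{
    const unsigned int need = VFIO_DEVICE_STATE_SAVING |
                              VFIO_DEVICE_STATE_RUNNING;
    for (size_t i = 0; i < n; i++) {
        if ((states[i] & need) != need) {
            return false;
        }
    }
    return n > 0;   /* an empty container has nothing in pre-copy */
}
```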

[PATCH v29 08/17] vfio: Add save state functions to SaveVMHandlers

2020-10-26 Thread Kirti Wankhede
Added .save_live_pending, .save_live_iterate and .save_live_complete_precopy
functions. These functions handle the pre-copy and stop-and-copy phases.

In _SAVING|_RUNNING device state or pre-copy phase:
- read pending_bytes. If pending_bytes > 0, go through below steps.
- read data_offset - indicates kernel driver to write data to staging
  buffer.
- read data_size - amount of data in bytes written by vendor driver in
  migration region.
- read data_size bytes of data from data_offset in the migration region.
- Write data packet to file stream as below:
{VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
VFIO_MIG_FLAG_END_OF_STATE }

In _SAVING device state or stop-and-copy phase
a. read config space of device and save to migration file stream. This
   doesn't need to be from vendor driver. Any other special config state
   from driver can be saved as data in following iteration.
b. read pending_bytes. If pending_bytes > 0, go through below steps.
c. read data_offset - indicates kernel driver to write data to staging
   buffer.
d. read data_size - amount of data in bytes written by vendor driver in
   migration region.
e. read data_size bytes of data from data_offset in the migration region.
f. Write data packet as below:
   {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
g. iterate through steps b to f while (pending_bytes > 0)
h. Write {VFIO_MIG_FLAG_END_OF_STATE}

When the data region is mapped, it is the user's responsibility to read
data_size bytes of data from data_offset before moving to the next steps.

Added fix suggested by Artem Polyakov to reset pending_bytes in
vfio_save_iterate().
Added fix suggested by Zhi Wang to add 0 as data size in migration stream and
add END_OF_STATE delimiter to indicate phase complete.
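The packet framing in steps b-h above can be sketched as plain byte-buffer code. This is a stand-alone model, not QEMU code: put_be64() stands in for qemu_put_be64(), and the delimiter values are assumed from the flag scheme in patch 07.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Delimiter values assumed from the patch 07 flag scheme. */
#define VFIO_MIG_FLAG_END_OF_STATE   0xffffffffef100001ULL
#define VFIO_MIG_FLAG_DEV_DATA_STATE 0xffffffffef100004ULL

/* Stand-in for qemu_put_be64(): append one big-endian u64 to buf. */
static size_t put_be64(uint8_t *buf, size_t off, uint64_t v)
{
    for (int i = 0; i < 8; i++) {
        buf[off + i] = (uint8_t)(v >> (56 - 8 * i));
    }
    return off + 8;
}

/* Frame one data packet as described above:
 * { DEV_DATA_STATE, data_size, data..., END_OF_STATE once pending == 0 }. */
static size_t frame_data_packet(uint8_t *buf, const uint8_t *data,
                                uint64_t data_size, int last)
{
    size_t off = put_be64(buf, 0, VFIO_MIG_FLAG_DEV_DATA_STATE);
    off = put_be64(buf, off, data_size);
    memcpy(buf + off, data, data_size);
    off += data_size;
    if (last) {
        off = put_be64(buf, off, VFIO_MIG_FLAG_END_OF_STATE);
    }
    return off;
}

/* Self-check of the layout: 8-byte marker, 8-byte size, payload, 8-byte end. */
static int frame_self_check(void)
{
    uint8_t buf[64];
    const uint8_t d[4] = {1, 2, 3, 4};
    size_t n = frame_data_packet(buf, d, 4, 1);
    return n == 28 && buf[0] == 0xff && buf[7] == 0x04 &&
           buf[15] == 0x04 && buf[19] == 0x04 && buf[27] == 0x01;
}
```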

Suggested-by: Artem Polyakov 
Suggested-by: Zhi Wang 
Signed-off-by: Kirti Wankhede 
Reviewed-by: Neo Jia 
Reviewed-by: Yan Zhao 
---
 hw/vfio/migration.c   | 276 ++
 hw/vfio/trace-events  |   6 +
 include/hw/vfio/vfio-common.h |   1 +
 3 files changed, 283 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index d3ef9e18f39c..41d568558479 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -148,6 +148,151 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
 return 0;
 }
 
+static void *get_data_section_size(VFIORegion *region, uint64_t data_offset,
+   uint64_t data_size, uint64_t *size)
+{
+void *ptr = NULL;
+uint64_t limit = 0;
+int i;
+
+if (!region->mmaps) {
+if (size) {
+*size = MIN(data_size, region->size - data_offset);
+}
+return ptr;
+}
+
+for (i = 0; i < region->nr_mmaps; i++) {
+VFIOMmap *map = region->mmaps + i;
+
+if ((data_offset >= map->offset) &&
+(data_offset < map->offset + map->size)) {
+
+/* check if data_offset is within sparse mmap areas */
+ptr = map->mmap + data_offset - map->offset;
+if (size) {
+*size = MIN(data_size, map->offset + map->size - data_offset);
+}
+break;
+} else if ((data_offset < map->offset) &&
+   (!limit || limit > map->offset)) {
+/*
+ * data_offset is not within sparse mmap areas, find size of
+ * non-mapped area. Check through all list since region->mmaps list
+ * is not sorted.
+ */
+limit = map->offset;
+}
+}
+
+if (!ptr && size) {
+*size = limit ? MIN(data_size, limit - data_offset) : data_size;
+}
+return ptr;
+}
+
+static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev, uint64_t *size)
+{
+VFIOMigration *migration = vbasedev->migration;
+VFIORegion *region = &migration->region;
+uint64_t data_offset = 0, data_size = 0, sz;
+int ret;
+
+ret = vfio_mig_read(vbasedev, &data_offset, sizeof(data_offset),
+  region->fd_offset + VFIO_MIG_STRUCT_OFFSET(data_offset));
+if (ret < 0) {
+return ret;
+}
+
+ret = vfio_mig_read(vbasedev, &data_size, sizeof(data_size),
+region->fd_offset + VFIO_MIG_STRUCT_OFFSET(data_size));
+if (ret < 0) {
+return ret;
+}
+
+trace_vfio_save_buffer(vbasedev->name, data_offset, data_size,
+   migration->pending_bytes);
+
+qemu_put_be64(f, data_size);
+sz = data_size;
+
+while (sz) {
+void *buf;
+uint64_t sec_size;
+bool buf_allocated = false;
+
+buf = get_data_section_size(region, data_offset, sz, &sec_size);
+
+if (!buf) {
+buf = g_try_malloc(sec_size);
+if (!buf) {
+error_report("%s: Error allocating buffer ", __func__);
+ 
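The sparse-mmap lookup in get_data_section_size() above decides, for an unsorted list of mapped areas, how many bytes starting at data_offset can be handled in one chunk: either the remainder of the containing mapped area, or the gap up to the nearest mapped area above the offset. A self-contained model of that logic (the MmapArea type is a simplification of VFIOMmap):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Simplified model of one sparse-mmap'd area in the migration region. */
typedef struct { uint64_t offset, size; } MmapArea;

/* Mirrors get_data_section_size(): return the size of the next chunk at
 * data_offset, and whether that chunk lies inside a mapped area. The list
 * is scanned in full because, as the comment above notes, it is unsorted. */
static uint64_t data_section_size(const MmapArea *maps, int n,
                                  uint64_t data_offset, uint64_t data_size,
                                  int *mapped)
{
    uint64_t limit = 0;
    *mapped = 0;
    for (int i = 0; i < n; i++) {
        if (data_offset >= maps[i].offset &&
            data_offset < maps[i].offset + maps[i].size) {
            *mapped = 1;
            return MIN(data_size, maps[i].offset + maps[i].size - data_offset);
        }
        if (data_offset < maps[i].offset && (!limit || limit > maps[i].offset)) {
            limit = maps[i].offset;   /* nearest mapped area above the offset */
        }
    }
    return limit ? MIN(data_size, limit - data_offset) : data_size;
}

/* Two areas given out of order: chunking inside and between them. */
static int section_self_check(void)
{
    MmapArea maps[2] = { { 0x2000, 0x1000 }, { 0x0000, 0x1000 } };
    int mapped;

    if (data_section_size(maps, 2, 0x800, 0x4000, &mapped) != 0x800 || !mapped) {
        return 0;   /* inside the area at 0x0: rest of that area */
    }
    if (data_section_size(maps, 2, 0x1800, 0x4000, &mapped) != 0x800 || mapped) {
        return 0;   /* in the gap: limited by the area at 0x2000 */
    }
    return 1;
}
```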

[PATCH v29 05/17] vfio: Add VM state change handler to know state of VM

2020-10-26 Thread Kirti Wankhede
VM state change handler is called on change in VM's state. Based on
VM state, VFIO device state should be changed.
Added read/write helper functions for migration region.
Added function to set device_state.

Signed-off-by: Kirti Wankhede 
Reviewed-by: Neo Jia 
Reviewed-by: Dr. David Alan Gilbert 
Reviewed-by: Cornelia Huck 
---
 hw/vfio/migration.c   | 158 ++
 hw/vfio/trace-events  |   2 +
 include/hw/vfio/vfio-common.h |   4 ++
 3 files changed, 164 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index fd7faf423cdc..65ce735d667b 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -10,6 +10,7 @@
 #include "qemu/osdep.h"
 #include 
 
+#include "sysemu/runstate.h"
 #include "hw/vfio/vfio-common.h"
 #include "cpu.h"
 #include "migration/migration.h"
@@ -22,6 +23,157 @@
 #include "exec/ram_addr.h"
 #include "pci.h"
 #include "trace.h"
+#include "hw/hw.h"
+
+static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count,
+  off_t off, bool iswrite)
+{
+int ret;
+
+ret = iswrite ? pwrite(vbasedev->fd, val, count, off) :
+pread(vbasedev->fd, val, count, off);
+if (ret < count) {
+error_report("vfio_mig_%s %d byte %s: failed at offset 0x%lx, err: %s",
+ iswrite ? "write" : "read", count,
+ vbasedev->name, off, strerror(errno));
+return (ret < 0) ? ret : -EINVAL;
+}
+return 0;
+}
+
+static int vfio_mig_rw(VFIODevice *vbasedev, __u8 *buf, size_t count,
+   off_t off, bool iswrite)
+{
+int ret, done = 0;
+__u8 *tbuf = buf;
+
+while (count) {
+int bytes = 0;
+
+if (count >= 8 && !(off % 8)) {
+bytes = 8;
+} else if (count >= 4 && !(off % 4)) {
+bytes = 4;
+} else if (count >= 2 && !(off % 2)) {
+bytes = 2;
+} else {
+bytes = 1;
+}
+
+ret = vfio_mig_access(vbasedev, tbuf, bytes, off, iswrite);
+if (ret) {
+return ret;
+}
+
+count -= bytes;
+done += bytes;
+off += bytes;
+tbuf += bytes;
+}
+return done;
+}
+
+#define vfio_mig_read(f, v, c, o)   vfio_mig_rw(f, (__u8 *)v, c, o, false)
+#define vfio_mig_write(f, v, c, o)  vfio_mig_rw(f, (__u8 *)v, c, o, true)
+
+#define VFIO_MIG_STRUCT_OFFSET(f)   \
+ offsetof(struct vfio_device_migration_info, f)
+/*
+ * Change the device_state register for device @vbasedev. Bits set in @mask
+ * are preserved, bits set in @value are set, and bits not set in either @mask
+ * or @value are cleared in device_state. If the register cannot be accessed,
+ * the resulting state would be invalid, or the device enters an error state,
+ * an error is returned.
+ */
+
+static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
+uint32_t value)
+{
+VFIOMigration *migration = vbasedev->migration;
+VFIORegion *region = &migration->region;
+off_t dev_state_off = region->fd_offset +
+  VFIO_MIG_STRUCT_OFFSET(device_state);
+uint32_t device_state;
+int ret;
+
+ret = vfio_mig_read(vbasedev, &device_state, sizeof(device_state),
+dev_state_off);
+if (ret < 0) {
+return ret;
+}
+
+device_state = (device_state & mask) | value;
+
+if (!VFIO_DEVICE_STATE_VALID(device_state)) {
+return -EINVAL;
+}
+
+ret = vfio_mig_write(vbasedev, &device_state, sizeof(device_state),
+ dev_state_off);
+if (ret < 0) {
+int rret;
+
+rret = vfio_mig_read(vbasedev, &device_state, sizeof(device_state),
+ dev_state_off);
+
+if ((rret < 0) || (VFIO_DEVICE_STATE_IS_ERROR(device_state))) {
+hw_error("%s: Device in error state 0x%x", vbasedev->name,
+ device_state);
+return rret ? rret : -EIO;
+}
+return ret;
+}
+
+migration->device_state = device_state;
+trace_vfio_migration_set_state(vbasedev->name, device_state);
+return 0;
+}
+
+static void vfio_vmstate_change(void *opaque, int running, RunState state)
+{
+VFIODevice *vbasedev = opaque;
+VFIOMigration *migration = vbasedev->migration;
+uint32_t value, mask;
+int ret;
+
+if ((vbasedev->migration->vm_running == running)) {
+return;
+}
+
+if (running) {
+/*
+ * Here device state can have one of _SAVING, _RESUMING or _STOP bit.
+ * Transition from _SAVING to _RUNNING can happen if there is migration
+ * failure, in that case clear _S
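The register-update rule documented for vfio_migration_set_state() above (bits in the mask are preserved, bits in the value are set, everything else is cleared) is a one-liner worth seeing in isolation, together with a validity check. The bit values are the assumed v1 uAPI layout, for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed v1 uAPI device_state bits, for illustration only. */
#define VFIO_DEVICE_STATE_RUNNING  (1u << 0)
#define VFIO_DEVICE_STATE_SAVING   (1u << 1)
#define VFIO_DEVICE_STATE_RESUMING (1u << 2)
#define VFIO_DEVICE_STATE_MASK     (VFIO_DEVICE_STATE_RUNNING | \
                                    VFIO_DEVICE_STATE_SAVING | \
                                    VFIO_DEVICE_STATE_RESUMING)

/* The update rule from vfio_migration_set_state(): bits in `mask` are
 * preserved from the current value, bits in `value` are set, the rest
 * are cleared. */
static uint32_t next_device_state(uint32_t cur, uint32_t mask, uint32_t value)
{
    return (cur & mask) | value;
}

/* Per the v1 uAPI, _RESUMING must not be combined with any other bit. */
static int device_state_valid(uint32_t s)
{
    return (s & VFIO_DEVICE_STATE_RESUMING) ?
           (s & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1;
}
```

For example, entering pre-copy from _RUNNING passes the full mask with _SAVING as the value, yielding _RUNNING | _SAVING; stopping the device passes a mask with _RUNNING excluded, clearing that bit while keeping _SAVING.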

[PATCH v29 04/17] vfio: Add migration region initialization and finalize function

2020-10-26 Thread Kirti Wankhede
Whether the VFIO device supports migration or not is decided based on the
migration region query. If the migration region query and migration region
initialization are both successful, then migration is supported; otherwise
migration is blocked.

Signed-off-by: Kirti Wankhede 
Reviewed-by: Neo Jia 
Acked-by: Dr. David Alan Gilbert 
---
 hw/vfio/meson.build   |   1 +
 hw/vfio/migration.c   | 122 ++
 hw/vfio/trace-events  |   3 ++
 include/hw/vfio/vfio-common.h |   9 
 4 files changed, 135 insertions(+)
 create mode 100644 hw/vfio/migration.c

diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
index 37efa74018bc..da9af297a0c5 100644
--- a/hw/vfio/meson.build
+++ b/hw/vfio/meson.build
@@ -2,6 +2,7 @@ vfio_ss = ss.source_set()
 vfio_ss.add(files(
   'common.c',
   'spapr.c',
+  'migration.c',
 ))
 vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
   'display.c',
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
new file mode 100644
index ..fd7faf423cdc
--- /dev/null
+++ b/hw/vfio/migration.c
@@ -0,0 +1,122 @@
+/*
+ * Migration support for VFIO devices
+ *
+ * Copyright NVIDIA, Inc. 2020
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include 
+
+#include "hw/vfio/vfio-common.h"
+#include "cpu.h"
+#include "migration/migration.h"
+#include "migration/qemu-file.h"
+#include "migration/register.h"
+#include "migration/blocker.h"
+#include "migration/misc.h"
+#include "qapi/error.h"
+#include "exec/ramlist.h"
+#include "exec/ram_addr.h"
+#include "pci.h"
+#include "trace.h"
+
+static void vfio_migration_exit(VFIODevice *vbasedev)
+{
+VFIOMigration *migration = vbasedev->migration;
+
+vfio_region_exit(&migration->region);
+vfio_region_finalize(&migration->region);
+g_free(vbasedev->migration);
+vbasedev->migration = NULL;
+}
+
+static int vfio_migration_init(VFIODevice *vbasedev,
+   struct vfio_region_info *info)
+{
+int ret;
+Object *obj;
+
+if (!vbasedev->ops->vfio_get_object) {
+return -EINVAL;
+}
+
+obj = vbasedev->ops->vfio_get_object(vbasedev);
+if (!obj) {
+return -EINVAL;
+}
+
+vbasedev->migration = g_new0(VFIOMigration, 1);
+
+ret = vfio_region_setup(obj, vbasedev, &vbasedev->migration->region,
+info->index, "migration");
+if (ret) {
+error_report("%s: Failed to setup VFIO migration region %d: %s",
+ vbasedev->name, info->index, strerror(-ret));
+goto err;
+}
+
+if (!vbasedev->migration->region.size) {
+error_report("%s: Invalid zero-sized VFIO migration region %d",
+ vbasedev->name, info->index);
+ret = -EINVAL;
+goto err;
+}
+return 0;
+
+err:
+vfio_migration_exit(vbasedev);
+return ret;
+}
+
+/* -- */
+
+int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
+{
+struct vfio_region_info *info = NULL;
+Error *local_err = NULL;
+int ret;
+
+ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_MIGRATION,
+   VFIO_REGION_SUBTYPE_MIGRATION, &info);
+if (ret) {
+goto add_blocker;
+}
+
+ret = vfio_migration_init(vbasedev, info);
+if (ret) {
+goto add_blocker;
+}
+
+trace_vfio_migration_probe(vbasedev->name, info->index);
+g_free(info);
+return 0;
+
+add_blocker:
+error_setg(&vbasedev->migration_blocker,
+   "VFIO device doesn't support migration");
+g_free(info);
+
+ret = migrate_add_blocker(vbasedev->migration_blocker, &local_err);
+if (local_err) {
+error_propagate(errp, local_err);
+error_free(vbasedev->migration_blocker);
+vbasedev->migration_blocker = NULL;
+}
+return ret;
+}
+
+void vfio_migration_finalize(VFIODevice *vbasedev)
+{
+if (vbasedev->migration) {
+vfio_migration_exit(vbasedev);
+}
+
+if (vbasedev->migration_blocker) {
+migrate_del_blocker(vbasedev->migration_blocker);
+error_free(vbasedev->migration_blocker);
+vbasedev->migration_blocker = NULL;
+}
+}
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index a0c7b49a2ebc..9ced5ec6277c 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -145,3 +145,6 @@ vfio_display_edid_link_up(void) ""
 vfio_display_edid_link_down(void) ""
 vfio_display_edid_update(uint32_t prefx, uint32_t prefy) "%ux%u"
 vfio_display_edid_write_error(void) ""
+
+# migration.c
+vfio_migration_probe(const char 

[PATCH v29 02/17] vfio: Add vfio_get_object callback to VFIODeviceOps

2020-10-26 Thread Kirti Wankhede
Hook vfio_get_object callback for PCI devices.

Signed-off-by: Kirti Wankhede 
Reviewed-by: Neo Jia 
Suggested-by: Cornelia Huck 
Reviewed-by: Cornelia Huck 
---
 hw/vfio/pci.c | 8 
 include/hw/vfio/vfio-common.h | 1 +
 2 files changed, 9 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 0d83eb0e47bb..bffd5bfe3b78 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2394,10 +2394,18 @@ static void vfio_pci_compute_needs_reset(VFIODevice *vbasedev)
 }
 }
 
+static Object *vfio_pci_get_object(VFIODevice *vbasedev)
+{
+VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+
+return OBJECT(vdev);
+}
+
 static VFIODeviceOps vfio_pci_ops = {
 .vfio_compute_needs_reset = vfio_pci_compute_needs_reset,
 .vfio_hot_reset_multi = vfio_pci_hot_reset_multi,
 .vfio_eoi = vfio_intx_eoi,
+.vfio_get_object = vfio_pci_get_object,
 };
 
 int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp)
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index dc95f527b583..fe99c36a693a 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -119,6 +119,7 @@ struct VFIODeviceOps {
 void (*vfio_compute_needs_reset)(VFIODevice *vdev);
 int (*vfio_hot_reset_multi)(VFIODevice *vdev);
 void (*vfio_eoi)(VFIODevice *vdev);
+Object *(*vfio_get_object)(VFIODevice *vdev);
 };
 
 typedef struct VFIOGroup {
-- 
2.7.0




[PATCH v29 03/17] vfio: Add save and load functions for VFIO PCI devices

2020-10-26 Thread Kirti Wankhede
Added functions to save and restore PCI device specific data,
specifically config space of PCI device.

Signed-off-by: Kirti Wankhede 
Reviewed-by: Neo Jia 
---
 hw/vfio/pci.c | 51 +++
 include/hw/vfio/vfio-common.h |  2 ++
 2 files changed, 53 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index bffd5bfe3b78..e27c88be6d85 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -41,6 +41,7 @@
 #include "trace.h"
 #include "qapi/error.h"
 #include "migration/blocker.h"
+#include "migration/qemu-file.h"
 
 #define TYPE_VFIO_PCI_NOHOTPLUG "vfio-pci-nohotplug"
 
@@ -2401,11 +2402,61 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev)
 return OBJECT(vdev);
 }
 
+static bool vfio_msix_present(void *opaque, int version_id)
+{
+PCIDevice *pdev = opaque;
+
+return msix_present(pdev);
+}
+
+const VMStateDescription vmstate_vfio_pci_config = {
+.name = "VFIOPCIDevice",
+.version_id = 1,
+.minimum_version_id = 1,
+.fields = (VMStateField[]) {
+VMSTATE_PCI_DEVICE(pdev, VFIOPCIDevice),
+VMSTATE_MSIX_TEST(pdev, VFIOPCIDevice, vfio_msix_present),
+VMSTATE_END_OF_LIST()
+}
+};
+
+static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
+{
+VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+
+vmstate_save_state(f, &vmstate_vfio_pci_config, vdev, NULL);
+}
+
+static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
+{
+VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+PCIDevice *pdev = &vdev->pdev;
+int ret;
+
+ret = vmstate_load_state(f, &vmstate_vfio_pci_config, vdev, 1);
+if (ret) {
+return ret;
+}
+
+vfio_pci_write_config(pdev, PCI_COMMAND,
+  pci_get_word(pdev->config + PCI_COMMAND), 2);
+
+if (msi_enabled(pdev)) {
+vfio_msi_enable(vdev);
+} else if (msix_enabled(pdev)) {
+vfio_msix_enable(vdev);
+}
+
+return ret;
+}
+
 static VFIODeviceOps vfio_pci_ops = {
 .vfio_compute_needs_reset = vfio_pci_compute_needs_reset,
 .vfio_hot_reset_multi = vfio_pci_hot_reset_multi,
 .vfio_eoi = vfio_intx_eoi,
 .vfio_get_object = vfio_pci_get_object,
+.vfio_save_config = vfio_pci_save_config,
+.vfio_load_config = vfio_pci_load_config,
 };
 
 int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp)
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index fe99c36a693a..ba6169cd926e 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -120,6 +120,8 @@ struct VFIODeviceOps {
 int (*vfio_hot_reset_multi)(VFIODevice *vdev);
 void (*vfio_eoi)(VFIODevice *vdev);
 Object *(*vfio_get_object)(VFIODevice *vdev);
+void (*vfio_save_config)(VFIODevice *vdev, QEMUFile *f);
+int (*vfio_load_config)(VFIODevice *vdev, QEMUFile *f);
 };
 
 typedef struct VFIOGroup {
-- 
2.7.0




[PATCH v29 00/17] Add migration support for VFIO devices

2020-10-26 Thread Kirti Wankhede
opy is not supported.

v28 -> 29
- Nit picks.
- Write through PCI_COMMAND register on loading PCI config space as suggested by
  Yan.

v27 -> 28
- Nit picks and minor changes suggested by Alex.

v26 -> 27
- Major change in Patch 3 - PCI config space save and load using VMSTATE_*
- Major change in Patch 14 - Dirty page tracking when vIOMMU is enabled using 
IOMMU notifier and
  its replay functionality - as suggested by Alex.
- Some Structure changes to keep all migration related members at one place.
- Pulled fix suggested by Zhi Wang 
  https://www.mail-archive.com/qemu-devel@nongnu.org/msg743722.html
- Add comments wherever suggested and required.

v25 -> 26
- Removed emulated_config_bits cache and vdev->pdev.wmask from config space save
  load functions.
- Used VMStateDescription for config space save and load functionality.
- Major fixes from previous version review.
  https://www.mail-archive.com/qemu-devel@nongnu.org/msg714625.html

v23 -> 25
- Updated config space save and load to save config cache, emulated bits cache
  and wmask cache.
- Created idr string as suggested by Dr Dave that includes bus path.
- Updated save and load function to read/write data to mixed regions, mapped or
  trapped.
- When vIOMMU is enabled, created mapped iova range list which also keeps
  translated address. This list is used to mark dirty pages. This reduces
  downtime significantly with vIOMMU enabled than migration patches from
   previous version. 
- Removed get_address_limit() function from v23 patch as this not required now.

v22 -> v23
-- Fixed issue reported by Yan
https://lore.kernel.org/kvm/97977ede-3c5b-c5a5-7858-7eecd7dd5...@nvidia.com/
- Sending this version to test v23 kernel version patches:
https://lore.kernel.org/kvm/1589998088-3250-1-git-send-email-kwankh...@nvidia.com/

v18 -> v22
- Few fixes from v18 review. But not yet fixed all concerns. I'll address those
  concerns in subsequent iterations.
- Sending this version to test v22 kernel version patches:
https://lore.kernel.org/kvm/1589781397-28368-1-git-send-email-kwankh...@nvidia.com/

v16 -> v18
- Nit fixes
- Get migration capability flags from container
- Added VFIO stats to MigrationInfo
- Fixed bug reported by Yan
https://lists.gnu.org/archive/html/qemu-devel/2020-04/msg4.html

v9 -> v16
- KABI almost finalised on kernel patches.
- Added support for migration with vIOMMU enabled.

v8 -> v9:
- Split patch set in 2 sets, Kernel and QEMU sets.
- Dirty pages bitmap is queried from IOMMU container rather than from
  vendor driver for per device. Added 2 ioctls to achieve this.

v7 -> v8:
- Updated comments for KABI
- Added BAR address validation check during PCI device's config space load as
  suggested by Dr. David Alan Gilbert.
- Changed vfio_migration_set_state() to set or clear device state flags.
- Some nit fixes.

v6 -> v7:
- Fix build failures.

v5 -> v6:
- Fix build failure.

v4 -> v5:
- Added descriptive comment about the sequence of access of members of structure
  vfio_device_migration_info to be followed based on Alex's suggestion
- Updated get dirty pages sequence.
- As per Cornelia Huck's suggestion, added callbacks to VFIODeviceOps to
  get_object, save_config and load_config.
- Fixed multiple nit picks.
- Tested live migration with multiple vfio device assigned to a VM.

v3 -> v4:
- Added one more bit for _RESUMING flag to be set explicitly.
- data_offset field is read-only for user space application.
- data_size is read for every iteration before reading data from migration, that
  is removed assumption that data will be till end of migration region.
- If vendor driver supports mappable sparsed region, map those region during
  setup state of save/load, similarly unmap those from cleanup routines.
- Handles race condition that causes data corruption in migration region during
  save device state by adding mutex and serializing save_buffer and
  get_dirty_pages routines.
- Skipped get_dirty_pages routine for mapped MMIO region of device.
- Added trace events.
- Split into multiple functional patches.

v2 -> v3:
- Removed enum of VFIO device states. Defined VFIO device state with 2 bits.
- Re-structured vfio_device_migration_info to keep it minimal and defined action
  on read and write access on its members.

v1 -> v2:
- Defined MIGRATION region type and sub-type which should be used with region
  type capability.
- Re-structured vfio_device_migration_info. This structure will be placed at 0th
  offset of migration region.
- Replaced ioctl with read/write for trapped part of migration region.
- Added both types of access support, trapped or mmapped, for the data section
  of the region.
- Moved PCI device functions to pci file.
- Added iteration to get the dirty page bitmap until bitmaps for all requested
  pages are copied.

Thanks,
Kirti


Kirti Wankhede (17):
  vfio: Add function to unmap VFIO region
  vfio: Add vfio_get_object callback to VFIODeviceOps
  vfio: Add save and load functions for VFIO PCI devices

[PATCH v29 01/17] vfio: Add function to unmap VFIO region

2020-10-26 Thread Kirti Wankhede
This function will be used for the migration region.
The migration region is mmapped when migration starts and will be unmapped
when migration is complete.

Signed-off-by: Kirti Wankhede 
Reviewed-by: Neo Jia 
Reviewed-by: Cornelia Huck 
---
 hw/vfio/common.c  | 32 
 hw/vfio/trace-events  |  1 +
 include/hw/vfio/vfio-common.h |  1 +
 3 files changed, 30 insertions(+), 4 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 13471ae29436..c6e98b8d61be 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -924,6 +924,18 @@ int vfio_region_setup(Object *obj, VFIODevice *vbasedev, 
VFIORegion *region,
 return 0;
 }
 
+static void vfio_subregion_unmap(VFIORegion *region, int index)
+{
+trace_vfio_region_unmap(memory_region_name(&region->mmaps[index].mem),
+region->mmaps[index].offset,
+region->mmaps[index].offset +
+region->mmaps[index].size - 1);
+memory_region_del_subregion(region->mem, &region->mmaps[index].mem);
+munmap(region->mmaps[index].mmap, region->mmaps[index].size);
+object_unparent(OBJECT(&region->mmaps[index].mem));
+region->mmaps[index].mmap = NULL;
+}
+
 int vfio_region_mmap(VFIORegion *region)
 {
 int i, prot = 0;
@@ -954,10 +966,7 @@ int vfio_region_mmap(VFIORegion *region)
 region->mmaps[i].mmap = NULL;
 
 for (i--; i >= 0; i--) {
-memory_region_del_subregion(region->mem, &region->mmaps[i].mem);
-munmap(region->mmaps[i].mmap, region->mmaps[i].size);
-object_unparent(OBJECT(&region->mmaps[i].mem));
-region->mmaps[i].mmap = NULL;
+vfio_subregion_unmap(region, i);
 }
 
 return ret;
@@ -982,6 +991,21 @@ int vfio_region_mmap(VFIORegion *region)
 return 0;
 }
 
+void vfio_region_unmap(VFIORegion *region)
+{
+int i;
+
+if (!region->mem) {
+return;
+}
+
+for (i = 0; i < region->nr_mmaps; i++) {
+if (region->mmaps[i].mmap) {
+vfio_subregion_unmap(region, i);
+}
+}
+}
+
 void vfio_region_exit(VFIORegion *region)
 {
 int i;
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 93a0bc2522f8..a0c7b49a2ebc 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -113,6 +113,7 @@ vfio_region_mmap(const char *name, unsigned long offset, 
unsigned long end) "Reg
 vfio_region_exit(const char *name, int index) "Device %s, region %d"
 vfio_region_finalize(const char *name, int index) "Device %s, region %d"
 vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps 
enabled: %d"
+vfio_region_unmap(const char *name, unsigned long offset, unsigned long end) 
"Region %s unmap [0x%lx - 0x%lx]"
 vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) 
"Device %s region %d: %d sparse mmap entries"
 vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) 
"sparse entry %d [0x%lx - 0x%lx]"
 vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t 
subtype) "%s index %d, %08x/%0x8"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index c78f3ff5593c..dc95f527b583 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -171,6 +171,7 @@ int vfio_region_setup(Object *obj, VFIODevice *vbasedev, 
VFIORegion *region,
   int index, const char *name);
 int vfio_region_mmap(VFIORegion *region);
 void vfio_region_mmaps_set_enabled(VFIORegion *region, bool enabled);
+void vfio_region_unmap(VFIORegion *region);
 void vfio_region_exit(VFIORegion *region);
 void vfio_region_finalize(VFIORegion *region);
 void vfio_reset_handler(void *opaque);
-- 
2.7.0




Re: [PATCH v28 07/17] vfio: Register SaveVMHandlers for VFIO device

2020-10-24 Thread Kirti Wankhede




On 10/24/2020 4:56 PM, Yan Zhao wrote:

On Fri, Oct 23, 2020 at 04:10:33PM +0530, Kirti Wankhede wrote:

Define flags to be used as delimiter in migration stream for VFIO devices.
Added .save_setup and .save_cleanup functions. Map & unmap migration
region from these functions at source during saving or pre-copy phase.

Set VFIO device state depending on VM's state. During live migration, VM is
running when .save_setup is called, _SAVING | _RUNNING state is set for VFIO
device. During save-restore, VM is paused, _SAVING state is set for VFIO device.

Signed-off-by: Kirti Wankhede 
Reviewed-by: Neo Jia 
---
  hw/vfio/migration.c  | 102 +++
  hw/vfio/trace-events |   2 +
  2 files changed, 104 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index a0f0e79b9b73..94d2bdae5c54 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -8,12 +8,15 @@
   */
  
  #include "qemu/osdep.h"

+#include "qemu/main-loop.h"
+#include "qemu/cutils.h"
  #include 
  
  #include "sysemu/runstate.h"

  #include "hw/vfio/vfio-common.h"
  #include "cpu.h"
  #include "migration/migration.h"
+#include "migration/vmstate.h"
  #include "migration/qemu-file.h"
  #include "migration/register.h"
  #include "migration/blocker.h"
@@ -25,6 +28,22 @@
  #include "trace.h"
  #include "hw/hw.h"
  
+/*

+ * Flags to be used as unique delimiters for VFIO devices in the migration
+ * stream. These flags are composed as:
+ * 0xffffffff => MSB 32-bit all 1s
+ * 0xef10     => Magic ID, represents emulated (virtual) function IO
+ * 0x0000     => 16-bits reserved for flags
+ *
+ * The beginning of state information is marked by _DEV_CONFIG_STATE,
+ * _DEV_SETUP_STATE, or _DEV_DATA_STATE, respectively. The end of a
+ * certain state information is marked by _END_OF_STATE.
+ */
+#define VFIO_MIG_FLAG_END_OF_STATE  (0xffffffffef100001ULL)
+#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
+#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
+#define VFIO_MIG_FLAG_DEV_DATA_STATE(0xffffffffef100004ULL)
+
  static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count,
off_t off, bool iswrite)
  {
@@ -129,6 +148,75 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, 
uint32_t mask,
  return 0;
  }
  
+static void vfio_migration_cleanup(VFIODevice *vbasedev)

+{
+VFIOMigration *migration = vbasedev->migration;
+
+if (migration->region.mmaps) {
+vfio_region_unmap(&migration->region);
+}
+}
+
+/* -- */
+
+static int vfio_save_setup(QEMUFile *f, void *opaque)
+{
+VFIODevice *vbasedev = opaque;
+VFIOMigration *migration = vbasedev->migration;
+int ret;
+
+trace_vfio_save_setup(vbasedev->name);
+
+qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
+
+if (migration->region.mmaps) {
+/*
+ * Calling vfio_region_mmap() from migration thread. Memory API called
+ * from this function require locking the iothread when called from
+ * outside the main loop thread.
+ */
+qemu_mutex_lock_iothread();
+ret = vfio_region_mmap(&migration->region);
+qemu_mutex_unlock_iothread();
+if (ret) {
+error_report("%s: Failed to mmap VFIO migration region: %s",
+ vbasedev->name, strerror(-ret));
+error_report("%s: Falling back to slow path", vbasedev->name);
+}
+}
+
+ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_MASK,
+   VFIO_DEVICE_STATE_SAVING);
+if (ret) {
+error_report("%s: Failed to set state SAVING", vbasedev->name);
+return ret;
+}
+


Is it possible to call vfio_update_pending() and vfio_save_buffer() here,
so that the vendor driver has a chance to hook the compatibility-checking
string early in the save_setup stage, and can avoid hooking the string in both
the precopy iteration stage and the stop-and-copy stage?


I would say it's not about which stage; the very first string, irrespective
of migration stage, should be the version compatibility check.

I don't think that is needed in setup.



But I think it's ok if we agree to add this later.

Besides that,
Reviewed-by: Yan Zhao 



Thanks.

Kirti



+qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+ret = qemu_file_get_error(f);
+if (ret) {
+return ret;
+}
+
+return 0;
+}
+
+static void vfio_save_cleanup(void *opaque)
+{
+VFIODevice *vbasedev = opaque;
+
+vfio_migration_cleanup(vbasedev);
+trace_vfio_save_cleanup(vbasedev->name);
+}
+
+static SaveVMHandlers savevm_vfio_handlers = {
+.save_setup = vfio_save_setup,
+.save_cleanup = vfio_save_cleanup,

Re: [PATCH v28 05/17] vfio: Add VM state change handler to know state of VM

2020-10-24 Thread Kirti Wankhede




On 10/23/2020 5:02 PM, Cornelia Huck wrote:

On Fri, 23 Oct 2020 16:10:31 +0530
Kirti Wankhede  wrote:


VM state change handler is called on change in VM's state. Based on
VM state, VFIO device state should be changed.
Added read/write helper functions for migration region.
Added function to set device_state.

Signed-off-by: Kirti Wankhede 
Reviewed-by: Neo Jia 
Reviewed-by: Dr. David Alan Gilbert 


Hm, this version looks a bit different from the one Dave gave his R-b
for... does it still apply?



I would defer that to Dave.


---
  hw/vfio/migration.c   | 156 ++
  hw/vfio/trace-events  |   2 +
  include/hw/vfio/vfio-common.h |   4 ++
  3 files changed, 162 insertions(+)


Reviewed-by: Cornelia Huck 



Thanks.

Kirti



Re: [PATCH v28 04/17] vfio: Add migration region initialization and finalize function

2020-10-24 Thread Kirti Wankhede




On 10/23/2020 4:54 PM, Cornelia Huck wrote:

On Fri, 23 Oct 2020 16:10:30 +0530
Kirti Wankhede  wrote:


Whether the VFIO device supports migration or not is decided based on the
migration region query. If the migration region query and the migration region
initialization are successful, then migration is supported; otherwise
migration is blocked.

Signed-off-by: Kirti Wankhede 
Reviewed-by: Neo Jia 
Acked-by: Dr. David Alan Gilbert 
---
  hw/vfio/meson.build   |   1 +
  hw/vfio/migration.c   | 133 ++
  hw/vfio/trace-events  |   3 +
  include/hw/vfio/vfio-common.h |   9 +++
  4 files changed, 146 insertions(+)
  create mode 100644 hw/vfio/migration.c


(...)


+static int vfio_migration_init(VFIODevice *vbasedev,
+   struct vfio_region_info *info)
+{
+int ret;
+Object *obj;
+VFIOMigration *migration;
+
+if (!vbasedev->ops->vfio_get_object) {
+return -EINVAL;
+}
+
+obj = vbasedev->ops->vfio_get_object(vbasedev);
+if (!obj) {
+return -EINVAL;
+}
+
+migration = g_new0(VFIOMigration, 1);
+
+ret = vfio_region_setup(obj, vbasedev, &migration->region,
+info->index, "migration");
+if (ret) {
+error_report("%s: Failed to setup VFIO migration region %d: %s",
+ vbasedev->name, info->index, strerror(-ret));
+goto err;
+}
+
+vbasedev->migration = migration;
+
+if (!migration->region.size) {
+error_report("%s: Invalid zero-sized of VFIO migration region %d",


s/of //


+ vbasedev->name, info->index);
+ret = -EINVAL;
+goto err;
+}
+return 0;
+
+err:
+vfio_migration_region_exit(vbasedev);
+g_free(migration);
+vbasedev->migration = NULL;
+return ret;
+}


(...)


+void vfio_migration_finalize(VFIODevice *vbasedev)
+{
+VFIOMigration *migration = vbasedev->migration;


I don't think you need this variable?



Removing it.


+
+if (migration) {
+vfio_migration_region_exit(vbasedev);
+g_free(vbasedev->migration);
+vbasedev->migration = NULL;
+}
+
+if (vbasedev->migration_blocker) {
+migrate_del_blocker(vbasedev->migration_blocker);
+error_free(vbasedev->migration_blocker);
+vbasedev->migration_blocker = NULL;
+}
+}


(...)





Re: [PATCH v28 03/17] vfio: Add save and load functions for VFIO PCI devices

2020-10-24 Thread Kirti Wankhede




On 10/24/2020 7:46 PM, Alex Williamson wrote:

On Sat, 24 Oct 2020 19:53:39 +0800
Yan Zhao  wrote:


hi
when I migrate VFs, the PCI_COMMAND register is not properly saved, and the
target side hits the bug below:
root@tester:~# [  189.360671] ++>> reset starts here: iavf_reset_task 
!!!
[  199.360798] iavf :00:04.0: Reset never finished (0)
[  199.380504] kernel BUG at drivers/pci/msi.c:352!
[  199.382957] invalid opcode:  [#1] SMP PTI
[  199.384855] CPU: 1 PID: 419 Comm: kworker/1:2 Tainted: G   OE 
5.0.0-13-generic #14-Ubuntu
[  199.388204] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[  199.392401] Workqueue: events iavf_reset_task [iavf]
[  199.393586] RIP: 0010:free_msi_irqs+0x17b/0x1b0
[  199.394659] Code: 84 e1 fe ff ff 45 31 f6 eb 11 41 83 c6 01 44 39 73 14 0f 86 ce 
fe ff ff 8b 7b 10 44 01 f7 e8 3c 7a ba ff 48 83 78 70 00 74 e0 <0f> 0b 49 8d b5 
b0 00 00 00 e8 07 27 bb ff e9 cf fe ff ff 48 8b 78
[  199.399056] RSP: 0018:abd1006cfdb8 EFLAGS: 00010282
[  199.400302] RAX: 9e336d8a2800 RBX: 9eb006c0 RCX: 
[  199.402000] RDX:  RSI: 0019 RDI: baa68100
[  199.403168] RBP: abd1006cfde8 R08: 9e3375000248 R09: 9e3375000338
[  199.404343] R10:  R11: baa68108 R12: 9e3374ef12c0
[  199.405526] R13: 9e3374ef1000 R14:  R15: 9e3371f2d018
[  199.406702] FS:  () GS:9e3375b0() 
knlGS:
[  199.408027] CS:  0010 DS:  ES:  CR0: 80050033
[  199.408987] CR2:  CR3: 33266000 CR4: 06e0
[  199.410155] DR0:  DR1:  DR2: 
[  199.411321] DR3:  DR6: fffe0ff0 DR7: 0400
[  199.412437] Call Trace:
[  199.412750]  pci_disable_msix+0xf3/0x120
[  199.413227]  iavf_reset_interrupt_capability.part.40+0x19/0x40 [iavf]
[  199.413998]  iavf_reset_task+0x4b3/0x9d0 [iavf]
[  199.414544]  process_one_work+0x20f/0x410
[  199.415026]  worker_thread+0x34/0x400
[  199.415486]  kthread+0x120/0x140
[  199.415876]  ? process_one_work+0x410/0x410
[  199.416380]  ? __kthread_parkme+0x70/0x70
[  199.416864]  ret_from_fork+0x35/0x40



I verified MSIx with SRIOV VF, and I don't see this issue at my end.


I fixed it with below patch.


commit ad3efa0eeea7edb352294bfce35b904b8d3c759c
Author: Yan Zhao 
Date:   Sat Oct 24 19:45:01 2020 +0800

 msix fix.
 
 Signed-off-by: Yan Zhao 


diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index f63f15b553..92f71bf933 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2423,8 +2423,14 @@ const VMStateDescription vmstate_vfio_pci_config = {
  static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
  {
  VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+PCIDevice *pdev = &vdev->pdev;
+uint16_t pci_cmd;
+
+pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
+qemu_put_be16(f, pci_cmd);
  
  vmstate_save_state(f, &vmstate_vfio_pci_config, vdev, NULL);

+
  }
  
  static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)

@@ -2432,6 +2438,10 @@ static int vfio_pci_load_config(VFIODevice *vbasedev, 
QEMUFile *f)
  VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
  PCIDevice *pdev = &vdev->pdev;
  int ret;
+uint16_t pci_cmd;
+
+pci_cmd = qemu_get_be16(f);
+vfio_pci_write_config(pdev, PCI_COMMAND, pci_cmd, 2);
  
  ret = vmstate_load_state(f, &vmstate_vfio_pci_config, vdev, 1);

  if (ret) {




We need to avoid this sort of ad-hoc stuffing random fields into the
config stream.  The command register is already migrated in vconfig, it
only needs to be written through vfio:

vfio_pci_write_config(pdev, PCI_COMMAND,
  pci_get_word(pdev->config, PCI_COMMAND), 2);



I verified at my end again.
The pci command value (read with pci_default_read_config()) before
vmstate_save_state() is 0x507, and at the destination after
vmstate_load_state() it is also 0x507 - both pci_default_read_config() and
the cached config space value via pci_get_word() give 0x507.

VM restores successfully.

Yan, can you share the pci command values before and after, as above? What
exactly is missing?


Thanks,
Kirti


Thanks,
Alex



On Fri, Oct 23, 2020 at 04:10:29PM +0530, Kirti Wankhede wrote:

Added functions to save and restore PCI device specific data,
specifically config space of PCI device.

Signed-off-by: Kirti Wankhede 
Reviewed-by: Neo Jia 
---
  hw/vfio/pci.c | 48 +++
  include/hw/vfio/vfio-common.h |  2 ++
  2 files changed, 50 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index bffd5bfe3b78..92cc25a5489f 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -41,6 +41,7 @@
  #include "trace.h"
  #include "qapi/error.h"
 

Re: [PATCH v28 00/17] Add migration support for VFIO devices

2020-10-24 Thread Kirti Wankhede




On 10/24/2020 10:26 PM, Philippe Mathieu-Daudé wrote:

Hi Kirti,

On 10/23/20 12:40 PM, Kirti Wankhede wrote:

Hi,

This Patch set adds migration support for VFIO devices in QEMU.

...


Since there is no device which has hardware support for system memory
dirty bitmap tracking, right now there is no other API from the vendor driver
to the VFIO IOMMU module to report dirty pages. In future, when such hardware
support is implemented, an API will be required in the kernel such that the
vendor driver can report dirty pages to the VFIO module during migration
phases.


Below is the flow of state change for live migration where states in 
brackets

represent VM state, migration state and VFIO device state as:
 (VM state, MIGRATION_STATUS, VFIO_DEVICE_STATE)

Live migration save path:
 QEMU normal running state
 (RUNNING, _NONE, _RUNNING)
 |
 migrate_init spawns migration_thread.
 (RUNNING, _SETUP, _RUNNING|_SAVING)
 Migration thread then calls each device's .save_setup()
 |
 (RUNNING, _ACTIVE, _RUNNING|_SAVING)
 If device is active, get pending bytes by .save_live_pending()
 if pending bytes >= threshold_size,  call save_live_iterate()
 Data of VFIO device for pre-copy phase is copied.
 Iterate till total pending bytes converge and are less than 
threshold

 |
 On migration completion, vCPUs stops and calls 
.save_live_complete_precopy

 for each active device. VFIO device is then transitioned in
  _SAVING state.
 (FINISH_MIGRATE, _DEVICE, _SAVING)
 For VFIO device, iterate in .save_live_complete_precopy until
 pending data is 0.
 (FINISH_MIGRATE, _DEVICE, _STOPPED)
 |
 (FINISH_MIGRATE, _COMPLETED, _STOPPED)
 Migration thread schedules the cleanup bottom half and exits

Live migration resume path:
 Incoming migration calls .load_setup for each device
 (RESTORE_VM, _ACTIVE, _STOPPED)
 |
 For each device, .load_state is called for that device section data
 (RESTORE_VM, _ACTIVE, _RESUMING)
 |
 At the end, .load_cleanup is called for each device and vCPUs are
started.

 |
 (RUNNING, _NONE, _RUNNING)

Note that:
- Migration post copy is not supported.


Can you commit this ^^^ somewhere in docs/devel/ please?
(as a patch on top of this series)



Philippe, Alex,
I'm going to respin this series with the r-bs and the fix suggested by Yan.
Should this doc be part of this series, or can we add it later after 10/27,
in case review of this doc needs some more iterations?


Thanks,
Kirti



Re: [PATCH v28 04/17] vfio: Add migration region initialization and finalize function

2020-10-24 Thread Kirti Wankhede




On 10/24/2020 7:51 PM, Alex Williamson wrote:

On Sat, 24 Oct 2020 15:09:14 +0530
Kirti Wankhede  wrote:


On 10/23/2020 10:22 PM, Alex Williamson wrote:

On Fri, 23 Oct 2020 16:10:30 +0530
Kirti Wankhede  wrote:
   

Whether the VFIO device supports migration or not is decided based on the
migration region query. If the migration region query and the migration region
initialization are successful, then migration is supported; otherwise
migration is blocked.

Signed-off-by: Kirti Wankhede 
Reviewed-by: Neo Jia 
Acked-by: Dr. David Alan Gilbert 
---
   hw/vfio/meson.build   |   1 +
   hw/vfio/migration.c   | 133 
++
   hw/vfio/trace-events  |   3 +
   include/hw/vfio/vfio-common.h |   9 +++
   4 files changed, 146 insertions(+)
   create mode 100644 hw/vfio/migration.c

diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
index 37efa74018bc..da9af297a0c5 100644
--- a/hw/vfio/meson.build
+++ b/hw/vfio/meson.build
@@ -2,6 +2,7 @@ vfio_ss = ss.source_set()
   vfio_ss.add(files(
 'common.c',
 'spapr.c',
+  'migration.c',
   ))
   vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
 'display.c',
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
new file mode 100644
index ..bbe6e0b7a6cc
--- /dev/null
+++ b/hw/vfio/migration.c
@@ -0,0 +1,133 @@
+/*
+ * Migration support for VFIO devices
+ *
+ * Copyright NVIDIA, Inc. 2020
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include 
+
+#include "hw/vfio/vfio-common.h"
+#include "cpu.h"
+#include "migration/migration.h"
+#include "migration/qemu-file.h"
+#include "migration/register.h"
+#include "migration/blocker.h"
+#include "migration/misc.h"
+#include "qapi/error.h"
+#include "exec/ramlist.h"
+#include "exec/ram_addr.h"
+#include "pci.h"
+#include "trace.h"
+
+static void vfio_migration_region_exit(VFIODevice *vbasedev)
+{
+VFIOMigration *migration = vbasedev->migration;
+
+if (!migration) {
+return;
+}
+
+vfio_region_exit(&migration->region);
+vfio_region_finalize(&migration->region);


I think it would make sense to also:

g_free(migration);
vbasedev->migration = NULL;

here as well so the callers don't need to.


No, in the vfio_migration_init() case, the err path is also hit when
vbasedev->migration is not yet set but the local variable migration is
non-NULL.


So why do we even call vfio_migration_region_exit() for that error
case?  It seems that could just g_free(migration); return ret; rather
than goto err.  Thanks,



Removing the temporary local variable; with that, the above two statements
can be moved to the exit function.


Thanks,
Kirti


Alex


Not worth a re-spin itself,
maybe a follow-up if there's no other reason for a re-spin.  Thanks,

Alex
   

+}
+
+static int vfio_migration_init(VFIODevice *vbasedev,
+   struct vfio_region_info *info)
+{
+int ret;
+Object *obj;
+VFIOMigration *migration;
+
+if (!vbasedev->ops->vfio_get_object) {
+return -EINVAL;
+}
+
+obj = vbasedev->ops->vfio_get_object(vbasedev);
+if (!obj) {
+return -EINVAL;
+}
+
+migration = g_new0(VFIOMigration, 1);
+
+ret = vfio_region_setup(obj, vbasedev, &migration->region,
+info->index, "migration");
+if (ret) {
+error_report("%s: Failed to setup VFIO migration region %d: %s",
+ vbasedev->name, info->index, strerror(-ret));
+goto err;
+}
+
+vbasedev->migration = migration;
+
+if (!migration->region.size) {
+error_report("%s: Invalid zero-sized of VFIO migration region %d",
+ vbasedev->name, info->index);
+ret = -EINVAL;
+goto err;
+}
+return 0;
+
+err:
+vfio_migration_region_exit(vbasedev);
+g_free(migration);
+vbasedev->migration = NULL;
+return ret;
+}
+
+/* -- */
+
+int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
+{
+struct vfio_region_info *info = NULL;
+Error *local_err = NULL;
+int ret;
+
+ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_MIGRATION,
+   VFIO_REGION_SUBTYPE_MIGRATION, &info);
+if (ret) {
+goto add_blocker;
+}
+
+ret = vfio_migration_init(vbasedev, info);
+if (ret) {
+goto add_blocker;
+}
+
+g_free(info);
+trace_vfio_migration_probe(vbasedev->name, info->index);
+return 0;
+
+add_blocker:
+error_setg(&vbasedev->migration_blocker,
+   "VFIO device doesn't support migration");
+g_free(info);
+
+ret = migrate_add_blocker(vbasedev->migration_blocker, &local_err);
+  

Re: [PATCH v28 04/17] vfio: Add migration region initialization and finalize function

2020-10-24 Thread Kirti Wankhede




On 10/23/2020 10:22 PM, Alex Williamson wrote:

On Fri, 23 Oct 2020 16:10:30 +0530
Kirti Wankhede  wrote:


Whether the VFIO device supports migration or not is decided based on the
migration region query. If the migration region query and the migration region
initialization are successful, then migration is supported; otherwise
migration is blocked.

Signed-off-by: Kirti Wankhede 
Reviewed-by: Neo Jia 
Acked-by: Dr. David Alan Gilbert 
---
  hw/vfio/meson.build   |   1 +
  hw/vfio/migration.c   | 133 ++
  hw/vfio/trace-events  |   3 +
  include/hw/vfio/vfio-common.h |   9 +++
  4 files changed, 146 insertions(+)
  create mode 100644 hw/vfio/migration.c

diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
index 37efa74018bc..da9af297a0c5 100644
--- a/hw/vfio/meson.build
+++ b/hw/vfio/meson.build
@@ -2,6 +2,7 @@ vfio_ss = ss.source_set()
  vfio_ss.add(files(
'common.c',
'spapr.c',
+  'migration.c',
  ))
  vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
'display.c',
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
new file mode 100644
index ..bbe6e0b7a6cc
--- /dev/null
+++ b/hw/vfio/migration.c
@@ -0,0 +1,133 @@
+/*
+ * Migration support for VFIO devices
+ *
+ * Copyright NVIDIA, Inc. 2020
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include 
+
+#include "hw/vfio/vfio-common.h"
+#include "cpu.h"
+#include "migration/migration.h"
+#include "migration/qemu-file.h"
+#include "migration/register.h"
+#include "migration/blocker.h"
+#include "migration/misc.h"
+#include "qapi/error.h"
+#include "exec/ramlist.h"
+#include "exec/ram_addr.h"
+#include "pci.h"
+#include "trace.h"
+
+static void vfio_migration_region_exit(VFIODevice *vbasedev)
+{
+VFIOMigration *migration = vbasedev->migration;
+
+if (!migration) {
+return;
+}
+
+vfio_region_exit(&migration->region);
+vfio_region_finalize(&migration->region);


I think it would make sense to also:

g_free(migration);
vbasedev->migration = NULL;

here as well so the callers don't need to. 


No, in the vfio_migration_init() case, the err path is also hit when
vbasedev->migration is not yet set but the local variable migration is
non-NULL.


Thanks,
Kirti


Not worth a re-spin itself,
maybe a follow-up if there's no other reason for a re-spin.  Thanks,

Alex


+}
+
+static int vfio_migration_init(VFIODevice *vbasedev,
+   struct vfio_region_info *info)
+{
+int ret;
+Object *obj;
+VFIOMigration *migration;
+
+if (!vbasedev->ops->vfio_get_object) {
+return -EINVAL;
+}
+
+obj = vbasedev->ops->vfio_get_object(vbasedev);
+if (!obj) {
+return -EINVAL;
+}
+
+migration = g_new0(VFIOMigration, 1);
+
+ret = vfio_region_setup(obj, vbasedev, &migration->region,
+info->index, "migration");
+if (ret) {
+error_report("%s: Failed to setup VFIO migration region %d: %s",
+ vbasedev->name, info->index, strerror(-ret));
+goto err;
+}
+
+vbasedev->migration = migration;
+
+if (!migration->region.size) {
+error_report("%s: Invalid zero-sized of VFIO migration region %d",
+ vbasedev->name, info->index);
+ret = -EINVAL;
+goto err;
+}
+return 0;
+
+err:
+vfio_migration_region_exit(vbasedev);
+g_free(migration);
+vbasedev->migration = NULL;
+return ret;
+}
+
+/* -- */
+
+int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
+{
+struct vfio_region_info *info = NULL;
+Error *local_err = NULL;
+int ret;
+
+ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_MIGRATION,
+   VFIO_REGION_SUBTYPE_MIGRATION, &info);
+if (ret) {
+goto add_blocker;
+}
+
+ret = vfio_migration_init(vbasedev, info);
+if (ret) {
+goto add_blocker;
+}
+
+g_free(info);
+trace_vfio_migration_probe(vbasedev->name, info->index);
+return 0;
+
+add_blocker:
+error_setg(&vbasedev->migration_blocker,
+   "VFIO device doesn't support migration");
+g_free(info);
+
+ret = migrate_add_blocker(vbasedev->migration_blocker, &local_err);
+if (local_err) {
+error_propagate(errp, local_err);
+error_free(vbasedev->migration_blocker);
+vbasedev->migration_blocker = NULL;
+}
+return ret;
+}
+
+void vfio_migration_finalize(VFIODevice *vbasedev)
+{
+VFIOMigration *migration = vbasedev->migration;
+
+if (migration) {
+vfio_migration_region_exit(vbasedev);
+

[PATCH v28 17/17] qapi: Add VFIO devices migration stats in Migration stats

2020-10-23 Thread Kirti Wankhede
Added the amount of bytes transferred to the VM at the destination by all
VFIO devices.

Signed-off-by: Kirti Wankhede 
Reviewed-by: Dr. David Alan Gilbert 
---
 hw/vfio/common.c  | 19 +++
 hw/vfio/migration.c   |  9 +
 include/hw/vfio/vfio-common.h |  3 +++
 migration/migration.c | 17 +
 monitor/hmp-cmds.c|  6 ++
 qapi/migration.json   | 17 +
 6 files changed, 71 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 49c68a5253ae..56f6fee66a55 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -292,6 +292,25 @@ const MemoryRegionOps vfio_region_ops = {
  * Device state interfaces
  */
 
+bool vfio_mig_active(void)
+{
+VFIOGroup *group;
+VFIODevice *vbasedev;
+
+if (QLIST_EMPTY(&vfio_group_list)) {
+return false;
+}
+
+QLIST_FOREACH(group, &vfio_group_list, next) {
+QLIST_FOREACH(vbasedev, >device_list, next) {
+if (vbasedev->migration_blocker) {
+return false;
+}
+}
+}
+return true;
+}
+
 static bool vfio_devices_all_stopped_and_saving(VFIOContainer *container)
 {
 VFIOGroup *group;
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index d4ba24c2dfae..37390b9c05fe 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -45,6 +45,8 @@
 #define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
 #define VFIO_MIG_FLAG_DEV_DATA_STATE(0xffffffffef100004ULL)
 
+static int64_t bytes_transferred;
+
 static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count,
   off_t off, bool iswrite)
 {
@@ -255,6 +257,7 @@ static int vfio_save_buffer(QEMUFile *f, VFIODevice 
*vbasedev, uint64_t *size)
 *size = data_size;
 }
 
+bytes_transferred += data_size;
 return ret;
 }
 
@@ -785,6 +788,7 @@ static void vfio_migration_state_notifier(Notifier 
*notifier, void *data)
 case MIGRATION_STATUS_CANCELLING:
 case MIGRATION_STATUS_CANCELLED:
 case MIGRATION_STATUS_FAILED:
+bytes_transferred = 0;
 ret = vfio_migration_set_state(vbasedev,
   ~(VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING),
   VFIO_DEVICE_STATE_RUNNING);
@@ -871,6 +875,11 @@ err:
 
 /* -- */
 
+int64_t vfio_mig_bytes_transferred(void)
+{
+return bytes_transferred;
+}
+
 int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
 {
 VFIOContainer *container = vbasedev->group->container;
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index b1c1b18fd228..24e299d97425 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -203,6 +203,9 @@ extern const MemoryRegionOps vfio_region_ops;
 typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
 extern VFIOGroupList vfio_group_list;
 
+bool vfio_mig_active(void);
+int64_t vfio_mig_bytes_transferred(void);
+
 #ifdef CONFIG_LINUX
 int vfio_get_region_info(VFIODevice *vbasedev, int index,
  struct vfio_region_info **info);
diff --git a/migration/migration.c b/migration/migration.c
index 0575ecb37953..995ccd96a774 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -57,6 +57,10 @@
 #include "qemu/queue.h"
 #include "multifd.h"
 
+#ifdef CONFIG_VFIO
+#include "hw/vfio/vfio-common.h"
+#endif
+
 #define MAX_THROTTLE  (128 << 20)  /* Migration transfer speed throttling 
*/
 
 /* Amount of time to allocate to each "chunk" of bandwidth-throttled
@@ -1002,6 +1006,17 @@ static void populate_disk_info(MigrationInfo *info)
 }
 }
 
+static void populate_vfio_info(MigrationInfo *info)
+{
+#ifdef CONFIG_VFIO
+if (vfio_mig_active()) {
+info->has_vfio = true;
+info->vfio = g_malloc0(sizeof(*info->vfio));
+info->vfio->transferred = vfio_mig_bytes_transferred();
+}
+#endif
+}
+
 static void fill_source_migration_info(MigrationInfo *info)
 {
 MigrationState *s = migrate_get_current();
@@ -1026,6 +1041,7 @@ static void fill_source_migration_info(MigrationInfo 
*info)
 populate_time_info(info, s);
 populate_ram_info(info, s);
 populate_disk_info(info);
+populate_vfio_info(info);
 break;
 case MIGRATION_STATUS_COLO:
 info->has_status = true;
@@ -1034,6 +1050,7 @@ static void fill_source_migration_info(MigrationInfo 
*info)
 case MIGRATION_STATUS_COMPLETED:
 populate_time_info(info, s);
 populate_ram_info(info, s);
+populate_vfio_info(info);
 break;
 case MIGRATION_STATUS_FAILED:
 info->has_status = true;
diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
index 9789f4277f50..56e9bad33d94 100644
--- a/monitor/hmp-cmds.c
+++ b/monitor/hmp-cmds.c
@@ -357,6 +357,12 @@ void hmp_info_migrate

[PATCH v28 12/17] vfio: Add function to start and stop dirty pages tracking

2020-10-23 Thread Kirti Wankhede
Call VFIO_IOMMU_DIRTY_PAGES ioctl to start and stop dirty pages tracking
for VFIO devices.

Signed-off-by: Kirti Wankhede 
Reviewed-by: Dr. David Alan Gilbert 
---
 hw/vfio/migration.c | 36 
 1 file changed, 36 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 0dc40e34a4de..d4ba24c2dfae 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -11,6 +11,7 @@
 #include "qemu/main-loop.h"
 #include "qemu/cutils.h"
 #include 
+#include 
 
 #include "sysemu/runstate.h"
 #include "hw/vfio/vfio-common.h"
@@ -391,10 +392,40 @@ static int vfio_load_device_config_state(QEMUFile *f, 
void *opaque)
 return qemu_file_get_error(f);
 }
 
+static int vfio_set_dirty_page_tracking(VFIODevice *vbasedev, bool start)
+{
+int ret;
+VFIOMigration *migration = vbasedev->migration;
+VFIOContainer *container = vbasedev->group->container;
+struct vfio_iommu_type1_dirty_bitmap dirty = {
+.argsz = sizeof(dirty),
+};
+
+if (start) {
+if (migration->device_state & VFIO_DEVICE_STATE_SAVING) {
+dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START;
+} else {
+return -EINVAL;
+}
+} else {
+dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP;
+}
+
+ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, &dirty);
+if (ret) {
+error_report("Failed to set dirty tracking flag 0x%x errno: %d",
+ dirty.flags, errno);
+return -errno;
+}
+return ret;
+}
+
 static void vfio_migration_cleanup(VFIODevice *vbasedev)
 {
 VFIOMigration *migration = vbasedev->migration;
 
+vfio_set_dirty_page_tracking(vbasedev, false);
+
 if (migration->region.mmaps) {
vfio_region_unmap(&migration->region);
 }
@@ -435,6 +466,11 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
 return ret;
 }
 
+ret = vfio_set_dirty_page_tracking(vbasedev, true);
+if (ret) {
+return ret;
+}
+
 qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
 
 ret = qemu_file_get_error(f);
-- 
2.7.0




[PATCH v28 16/17] vfio: Make vfio-pci device migration capable

2020-10-23 Thread Kirti Wankhede
If the device is not a failover primary device, call
vfio_migration_probe() to enable migration support for devices
that support it, and vfio_migration_finalize() to tear it down again.
Remove the migration blocker from the VFIO PCI device specific
structure and use the migration blocker from the generic VFIO
device structure.

Signed-off-by: Kirti Wankhede 
Reviewed-by: Neo Jia 
Reviewed-by: Dr. David Alan Gilbert 
Reviewed-by: Cornelia Huck 
---
 hw/vfio/pci.c | 28 
 hw/vfio/pci.h |  1 -
 2 files changed, 8 insertions(+), 21 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 92cc25a5489f..d2a2b5756774 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2788,17 +2788,6 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
 return;
 }
 
-if (!pdev->failover_pair_id) {
-error_setg(&vdev->migration_blocker,
-"VFIO device doesn't support migration");
-ret = migrate_add_blocker(vdev->migration_blocker, errp);
-if (ret) {
-error_free(vdev->migration_blocker);
-vdev->migration_blocker = NULL;
-return;
-}
-}
-
 vdev->vbasedev.name = g_path_get_basename(vdev->vbasedev.sysfsdev);
vdev->vbasedev.ops = &vfio_pci_ops;
 vdev->vbasedev.type = VFIO_DEVICE_TYPE_PCI;
@@ -3066,6 +3055,13 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
 }
 }
 
+if (!pdev->failover_pair_id) {
+ret = vfio_migration_probe(&vdev->vbasedev, errp);
+if (ret) {
+error_report("%s: Migration disabled", vdev->vbasedev.name);
+}
+}
+
 vfio_register_err_notifier(vdev);
 vfio_register_req_notifier(vdev);
 vfio_setup_resetfn_quirk(vdev);
@@ -3080,11 +3076,6 @@ out_teardown:
 vfio_bars_exit(vdev);
 error:
 error_prepend(errp, VFIO_MSG_PREFIX, vdev->vbasedev.name);
-if (vdev->migration_blocker) {
-migrate_del_blocker(vdev->migration_blocker);
-error_free(vdev->migration_blocker);
-vdev->migration_blocker = NULL;
-}
 }
 
 static void vfio_instance_finalize(Object *obj)
@@ -3096,10 +3087,6 @@ static void vfio_instance_finalize(Object *obj)
 vfio_bars_finalize(vdev);
 g_free(vdev->emulated_config_bits);
 g_free(vdev->rom);
-if (vdev->migration_blocker) {
-migrate_del_blocker(vdev->migration_blocker);
-error_free(vdev->migration_blocker);
-}
 /*
  * XXX Leaking igd_opregion is not an oversight, we can't remove the
  * fw_cfg entry therefore leaking this allocation seems like the safest
@@ -3127,6 +3114,7 @@ static void vfio_exitfn(PCIDevice *pdev)
 }
 vfio_teardown_msi(vdev);
 vfio_bars_exit(vdev);
+vfio_migration_finalize(&vdev->vbasedev);
 }
 
 static void vfio_pci_reset(DeviceState *dev)
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index bce71a9ac93f..1574ef983f8f 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -172,7 +172,6 @@ struct VFIOPCIDevice {
 bool no_vfio_ioeventfd;
 bool enable_ramfb;
 VFIODisplay *dpy;
-Error *migration_blocker;
 Notifier irqchip_change_notifier;
 };
 
-- 
2.7.0




[PATCH v28 15/17] vfio: Add ioctl to get dirty pages bitmap during dma unmap

2020-10-23 Thread Kirti Wankhede
With vIOMMU, an IO virtual address range can get unmapped while in the
pre-copy phase of migration. In that case, the unmap ioctl should return
pages pinned in that range and QEMU should find their corresponding guest
physical addresses and report those dirty.

Suggested-by: Alex Williamson 
Signed-off-by: Kirti Wankhede 
Reviewed-by: Neo Jia 
---
 hw/vfio/common.c | 96 +---
 1 file changed, 92 insertions(+), 4 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index c0b5b6245a47..49c68a5253ae 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -321,11 +321,94 @@ static bool 
vfio_devices_all_stopped_and_saving(VFIOContainer *container)
 return true;
 }
 
+static bool vfio_devices_all_running_and_saving(VFIOContainer *container)
+{
+VFIOGroup *group;
+VFIODevice *vbasedev;
+MigrationState *ms = migrate_get_current();
+
+if (!migration_is_setup_or_active(ms->state)) {
+return false;
+}
+
+QLIST_FOREACH(group, &container->group_list, container_next) {
+QLIST_FOREACH(vbasedev, &group->device_list, next) {
+VFIOMigration *migration = vbasedev->migration;
+
+if (!migration) {
+return false;
+}
+
+if ((migration->device_state & VFIO_DEVICE_STATE_SAVING) &&
+(migration->device_state & VFIO_DEVICE_STATE_RUNNING)) {
+continue;
+} else {
+return false;
+}
+}
+}
+return true;
+}
+
+static int vfio_dma_unmap_bitmap(VFIOContainer *container,
+ hwaddr iova, ram_addr_t size,
+ IOMMUTLBEntry *iotlb)
+{
+struct vfio_iommu_type1_dma_unmap *unmap;
+struct vfio_bitmap *bitmap;
+uint64_t pages = TARGET_PAGE_ALIGN(size) >> TARGET_PAGE_BITS;
+int ret;
+
+unmap = g_malloc0(sizeof(*unmap) + sizeof(*bitmap));
+
+unmap->argsz = sizeof(*unmap) + sizeof(*bitmap);
+unmap->iova = iova;
+unmap->size = size;
+unmap->flags |= VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP;
+bitmap = (struct vfio_bitmap *)&unmap->data;
+
+/*
+ * cpu_physical_memory_set_dirty_lebitmap() expects pages in bitmap of
+ * TARGET_PAGE_SIZE to mark those dirty. Hence set bitmap_pgsize to
+ * TARGET_PAGE_SIZE.
+ */
+
+bitmap->pgsize = TARGET_PAGE_SIZE;
+bitmap->size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
+   BITS_PER_BYTE;
+
+if (bitmap->size > container->max_dirty_bitmap_size) {
+error_report("UNMAP: Size of bitmap too big 0x%llx", bitmap->size);
+ret = -E2BIG;
+goto unmap_exit;
+}
+
+bitmap->data = g_try_malloc0(bitmap->size);
+if (!bitmap->data) {
+ret = -ENOMEM;
+goto unmap_exit;
+}
+
+ret = ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, unmap);
+if (!ret) {
+cpu_physical_memory_set_dirty_lebitmap((uint64_t *)bitmap->data,
+iotlb->translated_addr, pages);
+} else {
+error_report("VFIO_UNMAP_DMA with DIRTY_BITMAP : %m");
+}
+
+g_free(bitmap->data);
+unmap_exit:
+g_free(unmap);
+return ret;
+}
+
 /*
  * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
  */
 static int vfio_dma_unmap(VFIOContainer *container,
-  hwaddr iova, ram_addr_t size)
+  hwaddr iova, ram_addr_t size,
+  IOMMUTLBEntry *iotlb)
 {
 struct vfio_iommu_type1_dma_unmap unmap = {
 .argsz = sizeof(unmap),
@@ -334,6 +417,11 @@ static int vfio_dma_unmap(VFIOContainer *container,
 .size = size,
 };
 
+if (iotlb && container->dirty_pages_supported &&
+vfio_devices_all_running_and_saving(container)) {
+return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
+}
+
while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
 /*
  * The type1 backend has an off-by-one bug in the kernel (71a7d3d78e3c
@@ -381,7 +469,7 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr 
iova,
  * the VGA ROM space.
  */
if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 ||
-(errno == EBUSY && vfio_dma_unmap(container, iova, size) == 0 &&
+(errno == EBUSY && vfio_dma_unmap(container, iova, size, NULL) == 0 &&
 ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) {
 return 0;
 }
@@ -531,7 +619,7 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, 
IOMMUTLBEntry *iotlb)
  iotlb->addr_mask + 1, vaddr, ret);
 }
 } else {
-ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1);
+ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1, iotlb);
 if (ret) {

[PATCH v28 11/17] vfio: Get migration capability flags for container

2020-10-23 Thread Kirti Wankhede
Added helper functions to get IOMMU info capability chain.
Added function to get migration capability information from that
capability chain for IOMMU container.

Similar change was proposed earlier:
https://lists.gnu.org/archive/html/qemu-devel/2018-05/msg03759.html

Disable migration for devices if IOMMU module doesn't support migration
capability.

Signed-off-by: Kirti Wankhede 
Cc: Shameer Kolothum 
Cc: Eric Auger 
---
 hw/vfio/common.c  | 90 +++
 hw/vfio/migration.c   |  7 +++-
 include/hw/vfio/vfio-common.h |  3 ++
 3 files changed, 91 insertions(+), 9 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index c6e98b8d61be..d4959c036dd1 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1228,6 +1228,75 @@ static int vfio_init_container(VFIOContainer *container, 
int group_fd,
 return 0;
 }
 
+static int vfio_get_iommu_info(VFIOContainer *container,
+   struct vfio_iommu_type1_info **info)
+{
+
+size_t argsz = sizeof(struct vfio_iommu_type1_info);
+
+*info = g_new0(struct vfio_iommu_type1_info, 1);
+again:
+(*info)->argsz = argsz;
+
+if (ioctl(container->fd, VFIO_IOMMU_GET_INFO, *info)) {
+g_free(*info);
+*info = NULL;
+return -errno;
+}
+
+if (((*info)->argsz > argsz)) {
+argsz = (*info)->argsz;
+*info = g_realloc(*info, argsz);
+goto again;
+}
+
+return 0;
+}
+
+static struct vfio_info_cap_header *
+vfio_get_iommu_info_cap(struct vfio_iommu_type1_info *info, uint16_t id)
+{
+struct vfio_info_cap_header *hdr;
+void *ptr = info;
+
+if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) {
+return NULL;
+}
+
+for (hdr = ptr + info->cap_offset; hdr != ptr; hdr = ptr + hdr->next) {
+if (hdr->id == id) {
+return hdr;
+}
+}
+
+return NULL;
+}
+
+static void vfio_get_iommu_info_migration(VFIOContainer *container,
+ struct vfio_iommu_type1_info *info)
+{
+struct vfio_info_cap_header *hdr;
+struct vfio_iommu_type1_info_cap_migration *cap_mig;
+
+hdr = vfio_get_iommu_info_cap(info, VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION);
+if (!hdr) {
+return;
+}
+
+cap_mig = container_of(hdr, struct vfio_iommu_type1_info_cap_migration,
+header);
+
+/*
+ * cpu_physical_memory_set_dirty_lebitmap() expects pages in bitmap of
+ * TARGET_PAGE_SIZE to mark those dirty.
+ */
+if (cap_mig->pgsize_bitmap & TARGET_PAGE_SIZE) {
+container->dirty_pages_supported = true;
+container->max_dirty_bitmap_size = cap_mig->max_dirty_bitmap_size;
+container->dirty_pgsizes = cap_mig->pgsize_bitmap;
+}
+}
+
 static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
   Error **errp)
 {
@@ -1297,6 +1366,7 @@ static int vfio_connect_container(VFIOGroup *group, 
AddressSpace *as,
 container->space = space;
 container->fd = fd;
 container->error = NULL;
+container->dirty_pages_supported = false;
 QLIST_INIT(>giommu_list);
 QLIST_INIT(>hostwin_list);
 
@@ -1309,7 +1379,7 @@ static int vfio_connect_container(VFIOGroup *group, 
AddressSpace *as,
 case VFIO_TYPE1v2_IOMMU:
 case VFIO_TYPE1_IOMMU:
 {
-struct vfio_iommu_type1_info info;
+struct vfio_iommu_type1_info *info;
 
 /*
  * FIXME: This assumes that a Type1 IOMMU can map any 64-bit
@@ -1318,15 +1388,19 @@ static int vfio_connect_container(VFIOGroup *group, 
AddressSpace *as,
  * existing Type1 IOMMUs generally support any IOVA we're
  * going to actually try in practice.
  */
-info.argsz = sizeof(info);
-ret = ioctl(fd, VFIO_IOMMU_GET_INFO, &info);
-/* Ignore errors */
-if (ret || !(info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
+ret = vfio_get_iommu_info(container, &info);
+
+if (ret || !(info->flags & VFIO_IOMMU_INFO_PGSIZES)) {
 /* Assume 4k IOVA page size */
-info.iova_pgsizes = 4096;
+info->iova_pgsizes = 4096;
 }
-vfio_host_win_add(container, 0, (hwaddr)-1, info.iova_pgsizes);
-container->pgsizes = info.iova_pgsizes;
+vfio_host_win_add(container, 0, (hwaddr)-1, info->iova_pgsizes);
+container->pgsizes = info->iova_pgsizes;
+
+if (!ret) {
+vfio_get_iommu_info_migration(container, info);
+}
+g_free(info);
 break;
 }
 case VFIO_SPAPR_TCE_v2_IOMMU:
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 240646592b39..0dc40e34a4de 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -837,9 +837,14 @@ err:
 
 int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
 {
+VFIOContainer *container = vbasedev->gro

[PATCH v28 14/17] vfio: Dirty page tracking when vIOMMU is enabled

2020-10-23 Thread Kirti Wankhede
When vIOMMU is enabled, add a MAP notifier from log_sync when all
devices in the container are in the stop-and-copy phase of migration.
Call replay and then, from the notifier callback, get dirty pages.

Suggested-by: Alex Williamson 
Signed-off-by: Kirti Wankhede 
---
 hw/vfio/common.c | 88 
 hw/vfio/trace-events |  1 +
 2 files changed, 83 insertions(+), 6 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 2634387df948..c0b5b6245a47 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -442,8 +442,8 @@ static bool 
vfio_listener_skipped_section(MemoryRegionSection *section)
 }
 
 /* Called with rcu_read_lock held.  */
-static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
-   bool *read_only)
+static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
+   ram_addr_t *ram_addr, bool *read_only)
 {
 MemoryRegion *mr;
 hwaddr xlat;
@@ -474,8 +474,17 @@ static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void 
**vaddr,
 return false;
 }
 
-*vaddr = memory_region_get_ram_ptr(mr) + xlat;
-*read_only = !writable || mr->readonly;
+if (vaddr) {
+*vaddr = memory_region_get_ram_ptr(mr) + xlat;
+}
+
+if (ram_addr) {
+*ram_addr = memory_region_get_ram_addr(mr) + xlat;
+}
+
+if (read_only) {
+*read_only = !writable || mr->readonly;
+}
 
 return true;
 }
@@ -485,7 +494,6 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, 
IOMMUTLBEntry *iotlb)
 VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
 VFIOContainer *container = giommu->container;
 hwaddr iova = iotlb->iova + giommu->iommu_offset;
-bool read_only;
 void *vaddr;
 int ret;
 
@@ -501,7 +509,9 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, 
IOMMUTLBEntry *iotlb)
 rcu_read_lock();
 
 if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
-if (!vfio_get_vaddr(iotlb, &vaddr, &read_only)) {
+bool read_only;
+
+if (!vfio_get_xlat_addr(iotlb, &vaddr, NULL, &read_only)) {
 goto out;
 }
 /*
@@ -899,11 +909,77 @@ err_out:
 return ret;
 }
 
+typedef struct {
+IOMMUNotifier n;
+VFIOGuestIOMMU *giommu;
+} vfio_giommu_dirty_notifier;
+
+static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
+{
+vfio_giommu_dirty_notifier *gdn = container_of(n,
+vfio_giommu_dirty_notifier, n);
+VFIOGuestIOMMU *giommu = gdn->giommu;
+VFIOContainer *container = giommu->container;
+hwaddr iova = iotlb->iova + giommu->iommu_offset;
+ram_addr_t translated_addr;
+
+trace_vfio_iommu_map_dirty_notify(iova, iova + iotlb->addr_mask);
+
+if (iotlb->target_as != _space_memory) {
+error_report("Wrong target AS \"%s\", only system memory is allowed",
+ iotlb->target_as->name ? iotlb->target_as->name : "none");
+return;
+}
+
+rcu_read_lock();
+if (vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL)) {
+int ret;
+
+ret = vfio_get_dirty_bitmap(container, iova, iotlb->addr_mask + 1,
+translated_addr);
+if (ret) {
+error_report("vfio_iommu_map_dirty_notify(%p, 0x%"HWADDR_PRIx", "
+ "0x%"HWADDR_PRIx") = %d (%m)",
+ container, iova,
+ iotlb->addr_mask + 1, ret);
+}
+}
+rcu_read_unlock();
+}
+
 static int vfio_sync_dirty_bitmap(VFIOContainer *container,
   MemoryRegionSection *section)
 {
 ram_addr_t ram_addr;
 
+if (memory_region_is_iommu(section->mr)) {
+VFIOGuestIOMMU *giommu;
+
+QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
+if (MEMORY_REGION(giommu->iommu) == section->mr &&
+giommu->n.start == section->offset_within_region) {
+Int128 llend;
+vfio_giommu_dirty_notifier gdn = { .giommu = giommu };
+int idx = memory_region_iommu_attrs_to_index(giommu->iommu,
+   MEMTXATTRS_UNSPECIFIED);
+
+llend = int128_add(int128_make64(section->offset_within_region),
+   section->size);
+llend = int128_sub(llend, int128_one());
+
+iommu_notifier_init(&gdn.n,
+vfio_iommu_map_dirty_notify,
+IOMMU_NOTIFIER_MAP,
+section->offset_within_region,
+int128_get64(llend),
+idx);
+memory_region_iommu_replay(giommu->

[PATCH v28 08/17] vfio: Add save state functions to SaveVMHandlers

2020-10-23 Thread Kirti Wankhede
Added .save_live_pending, .save_live_iterate and .save_live_complete_precopy
functions. These functions handle the pre-copy and stop-and-copy phases.

In _SAVING|_RUNNING device state or pre-copy phase:
- read pending_bytes. If pending_bytes > 0, go through below steps.
- read data_offset - indicates kernel driver to write data to staging
  buffer.
- read data_size - amount of data in bytes written by vendor driver in
  migration region.
- read data_size bytes of data from data_offset in the migration region.
- Write data packet to file stream as below:
{VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
VFIO_MIG_FLAG_END_OF_STATE }

In _SAVING device state or stop-and-copy phase
a. read config space of device and save to migration file stream. This
   doesn't need to be from vendor driver. Any other special config state
   from driver can be saved as data in following iteration.
b. read pending_bytes. If pending_bytes > 0, go through below steps.
c. read data_offset - indicates kernel driver to write data to staging
   buffer.
d. read data_size - amount of data in bytes written by vendor driver in
   migration region.
e. read data_size bytes of data from data_offset in the migration region.
f. Write data packet as below:
   {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
g. iterate through steps b to f while (pending_bytes > 0)
h. Write {VFIO_MIG_FLAG_END_OF_STATE}

When the data region is mapped, it is the user's responsibility to read
data_size bytes of data from data_offset before moving to the next steps.

Added fix suggested by Artem Polyakov to reset pending_bytes in
vfio_save_iterate().
Added fix suggested by Zhi Wang to add 0 as data size in migration stream and
add END_OF_STATE delimiter to indicate phase complete.

Suggested-by: Artem Polyakov 
Suggested-by: Zhi Wang 
Signed-off-by: Kirti Wankhede 
Reviewed-by: Neo Jia 
---
 hw/vfio/migration.c   | 276 ++
 hw/vfio/trace-events  |   6 +
 include/hw/vfio/vfio-common.h |   1 +
 3 files changed, 283 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 94d2bdae5c54..be9e4aba541d 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -148,6 +148,151 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, 
uint32_t mask,
 return 0;
 }
 
+static void *get_data_section_size(VFIORegion *region, uint64_t data_offset,
+   uint64_t data_size, uint64_t *size)
+{
+void *ptr = NULL;
+uint64_t limit = 0;
+int i;
+
+if (!region->mmaps) {
+if (size) {
+*size = MIN(data_size, region->size - data_offset);
+}
+return ptr;
+}
+
+for (i = 0; i < region->nr_mmaps; i++) {
+VFIOMmap *map = region->mmaps + i;
+
+if ((data_offset >= map->offset) &&
+(data_offset < map->offset + map->size)) {
+
+/* check if data_offset is within sparse mmap areas */
+ptr = map->mmap + data_offset - map->offset;
+if (size) {
+*size = MIN(data_size, map->offset + map->size - data_offset);
+}
+break;
+} else if ((data_offset < map->offset) &&
+   (!limit || limit > map->offset)) {
+/*
+ * data_offset is not within sparse mmap areas, find size of
+ * non-mapped area. Check through all list since region->mmaps list
+ * is not sorted.
+ */
+limit = map->offset;
+}
+}
+
+if (!ptr && size) {
+*size = limit ? MIN(data_size, limit - data_offset) : data_size;
+}
+return ptr;
+}
+
+static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev, uint64_t *size)
+{
+VFIOMigration *migration = vbasedev->migration;
+VFIORegion *region = &migration->region;
+uint64_t data_offset = 0, data_size = 0, sz;
+int ret;
+
+ret = vfio_mig_read(vbasedev, &data_offset, sizeof(data_offset),
+  region->fd_offset + VFIO_MIG_STRUCT_OFFSET(data_offset));
+if (ret < 0) {
+return ret;
+}
+
+ret = vfio_mig_read(vbasedev, &data_size, sizeof(data_size),
+region->fd_offset + VFIO_MIG_STRUCT_OFFSET(data_size));
+if (ret < 0) {
+return ret;
+}
+
+trace_vfio_save_buffer(vbasedev->name, data_offset, data_size,
+   migration->pending_bytes);
+
+qemu_put_be64(f, data_size);
+sz = data_size;
+
+while (sz) {
+void *buf;
+uint64_t sec_size;
+bool buf_allocated = false;
+
+buf = get_data_section_size(region, data_offset, sz, &sec_size);
+
+if (!buf) {
+buf = g_try_malloc(sec_size);
+if (!buf) {
+error_report("%s: Error allocating buffer ", __func__);
+return -ENOMEM;
+}
+buf_a

[PATCH v28 06/17] vfio: Add migration state change notifier

2020-10-23 Thread Kirti Wankhede
Added migration state change notifier to get notification on migration state
change. These states are translated to VFIO device state and conveyed to
vendor driver.

Signed-off-by: Kirti Wankhede 
Reviewed-by: Neo Jia 
Reviewed-by: Dr. David Alan Gilbert 
---
 hw/vfio/migration.c   | 28 
 hw/vfio/trace-events  |  1 +
 include/hw/vfio/vfio-common.h |  2 ++
 3 files changed, 31 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 9b6949439f8e..a0f0e79b9b73 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -175,6 +175,30 @@ static void vfio_vmstate_change(void *opaque, int running, 
RunState state)
 (migration->device_state & mask) | value);
 }
 
+static void vfio_migration_state_notifier(Notifier *notifier, void *data)
+{
+MigrationState *s = data;
+VFIOMigration *migration = container_of(notifier, VFIOMigration,
+migration_state);
+VFIODevice *vbasedev = migration->vbasedev;
+int ret;
+
+trace_vfio_migration_state_notifier(vbasedev->name,
+MigrationStatus_str(s->state));
+
+switch (s->state) {
+case MIGRATION_STATUS_CANCELLING:
+case MIGRATION_STATUS_CANCELLED:
+case MIGRATION_STATUS_FAILED:
+ret = vfio_migration_set_state(vbasedev,
+  ~(VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING),
+  VFIO_DEVICE_STATE_RUNNING);
+if (ret) {
+error_report("%s: Failed to set state RUNNING", vbasedev->name);
+}
+}
+}
+
 static void vfio_migration_region_exit(VFIODevice *vbasedev)
 {
 VFIOMigration *migration = vbasedev->migration;
@@ -222,8 +246,11 @@ static int vfio_migration_init(VFIODevice *vbasedev,
 goto err;
 }
 
+migration->vbasedev = vbasedev;
 migration->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
vbasedev);
+migration->migration_state.notify = vfio_migration_state_notifier;
+add_migration_state_change_notifier(&migration->migration_state);
 return 0;
 
 err:
@@ -275,6 +302,7 @@ void vfio_migration_finalize(VFIODevice *vbasedev)
 VFIOMigration *migration = vbasedev->migration;
 
 if (migration) {
+remove_migration_state_change_notifier(&migration->migration_state);
 qemu_del_vm_change_state_handler(migration->vm_state);
 vfio_migration_region_exit(vbasedev);
 g_free(vbasedev->migration);
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 41de81f12f60..78d7d83b5ef8 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -150,3 +150,4 @@ vfio_display_edid_write_error(void) ""
 vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
 vfio_migration_set_state(const char *name, uint32_t state) " (%s) state %d"
 vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
+vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 9a571f1fb552..2bd593ba38bb 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -59,10 +59,12 @@ typedef struct VFIORegion {
 } VFIORegion;
 
 typedef struct VFIOMigration {
+struct VFIODevice *vbasedev;
 VMChangeStateEntry *vm_state;
 VFIORegion region;
 uint32_t device_state;
 int vm_running;
+Notifier migration_state;
 } VFIOMigration;
 
 typedef struct VFIOAddressSpace {
-- 
2.7.0




[PATCH v28 07/17] vfio: Register SaveVMHandlers for VFIO device

2020-10-23 Thread Kirti Wankhede
Define flags to be used as delimiter in migration stream for VFIO devices.
Added .save_setup and .save_cleanup functions. Map & unmap migration
region from these functions at source during saving or pre-copy phase.

Set the VFIO device state depending on the VM's state. During live migration,
the VM is running when .save_setup is called, so the _SAVING | _RUNNING state
is set for the VFIO device. During save-restore, the VM is paused, so only the
_SAVING state is set for the VFIO device.

Signed-off-by: Kirti Wankhede 
Reviewed-by: Neo Jia 
---
 hw/vfio/migration.c  | 102 +++
 hw/vfio/trace-events |   2 +
 2 files changed, 104 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index a0f0e79b9b73..94d2bdae5c54 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -8,12 +8,15 @@
  */
 
 #include "qemu/osdep.h"
+#include "qemu/main-loop.h"
+#include "qemu/cutils.h"
 #include 
 
 #include "sysemu/runstate.h"
 #include "hw/vfio/vfio-common.h"
 #include "cpu.h"
 #include "migration/migration.h"
+#include "migration/vmstate.h"
 #include "migration/qemu-file.h"
 #include "migration/register.h"
 #include "migration/blocker.h"
@@ -25,6 +28,22 @@
 #include "trace.h"
 #include "hw/hw.h"
 
+/*
+ * Flags to be used as unique delimiters for VFIO devices in the migration
+ * stream. These flags are composed as:
+ * 0xffffffff => MSB 32-bit all 1s
+ * 0xef10 => Magic ID, represents emulated (virtual) function IO
+ * 0x0000 => 16-bits reserved for flags
+ *
+ * The beginning of state information is marked by _DEV_CONFIG_STATE,
+ * _DEV_SETUP_STATE, or _DEV_DATA_STATE, respectively. The end of a
+ * certain state information is marked by _END_OF_STATE.
+ */
+#define VFIO_MIG_FLAG_END_OF_STATE  (0xffffffffef100001ULL)
+#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
+#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
+#define VFIO_MIG_FLAG_DEV_DATA_STATE(0xffffffffef100004ULL)
+
 static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count,
   off_t off, bool iswrite)
 {
@@ -129,6 +148,75 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, 
uint32_t mask,
 return 0;
 }
 
+static void vfio_migration_cleanup(VFIODevice *vbasedev)
+{
+VFIOMigration *migration = vbasedev->migration;
+
+if (migration->region.mmaps) {
+vfio_region_unmap(>region);
+}
+}
+
+/* ---------------------------------------------------------------------- */
+
+static int vfio_save_setup(QEMUFile *f, void *opaque)
+{
+VFIODevice *vbasedev = opaque;
+VFIOMigration *migration = vbasedev->migration;
+int ret;
+
+trace_vfio_save_setup(vbasedev->name);
+
+qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
+
+if (migration->region.mmaps) {
+/*
+ * Calling vfio_region_mmap() from migration thread. Memory API called
+ * from this function require locking the iothread when called from
+ * outside the main loop thread.
+ */
+qemu_mutex_lock_iothread();
+ret = vfio_region_mmap(&migration->region);
+qemu_mutex_unlock_iothread();
+if (ret) {
+error_report("%s: Failed to mmap VFIO migration region: %s",
+ vbasedev->name, strerror(-ret));
+error_report("%s: Falling back to slow path", vbasedev->name);
+}
+}
+
+ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_MASK,
+   VFIO_DEVICE_STATE_SAVING);
+if (ret) {
+error_report("%s: Failed to set state SAVING", vbasedev->name);
+return ret;
+}
+
+qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+ret = qemu_file_get_error(f);
+if (ret) {
+return ret;
+}
+
+return 0;
+}
+
+static void vfio_save_cleanup(void *opaque)
+{
+VFIODevice *vbasedev = opaque;
+
+vfio_migration_cleanup(vbasedev);
+trace_vfio_save_cleanup(vbasedev->name);
+}
+
+static SaveVMHandlers savevm_vfio_handlers = {
+.save_setup = vfio_save_setup,
+.save_cleanup = vfio_save_cleanup,
+};
+
+/* ---------------------------------------------------------------------- */
+
 static void vfio_vmstate_change(void *opaque, int running, RunState state)
 {
 VFIODevice *vbasedev = opaque;
@@ -217,6 +305,8 @@ static int vfio_migration_init(VFIODevice *vbasedev,
 int ret;
 Object *obj;
 VFIOMigration *migration;
+char id[256] = "";
+g_autofree char *path = NULL, *oid = NULL;
 
 if (!vbasedev->ops->vfio_get_object) {
 return -EINVAL;
@@ -247,6 +337,18 @@ static int vfio_migration_init(VFIODevice *vbasedev,
 }
 
 migration->vbasedev = vbasedev;
+
+oid = vmstate_if_get_id(VMSTATE_IF(DEVICE(obj))

[PATCH v28 09/17] vfio: Add load state functions to SaveVMHandlers

2020-10-23 Thread Kirti Wankhede
Sequence during the _RESUMING device state:
While data for this device is available, repeat below steps:
a. read data_offset from where user application should write data.
b. write data of data_size to migration region from data_offset.
c. write data_size which indicates vendor driver that data is written in
   staging buffer.

For the user, the data is opaque. The user should write data in the same
order as it was received.

Signed-off-by: Kirti Wankhede 
Reviewed-by: Neo Jia 
Reviewed-by: Dr. David Alan Gilbert 
---
 hw/vfio/migration.c  | 195 +++
 hw/vfio/trace-events |   4 ++
 2 files changed, 199 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index be9e4aba541d..240646592b39 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -257,6 +257,77 @@ static int vfio_save_buffer(QEMUFile *f, VFIODevice 
*vbasedev, uint64_t *size)
 return ret;
 }
 
+static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
+uint64_t data_size)
+{
+VFIORegion *region = &vbasedev->migration->region;
+uint64_t data_offset = 0, size, report_size;
+int ret;
+
+do {
+ret = vfio_mig_read(vbasedev, &data_offset, sizeof(data_offset),
+  region->fd_offset + VFIO_MIG_STRUCT_OFFSET(data_offset));
+if (ret < 0) {
+return ret;
+}
+
+if (data_offset + data_size > region->size) {
+/*
+ * If data_size is greater than the data section of migration region
+ * then iterate the write buffer operation. This case can occur if
+ * size of migration region at destination is smaller than size of
+ * migration region at source.
+ */
+report_size = size = region->size - data_offset;
+data_size -= size;
+} else {
+report_size = size = data_size;
+data_size = 0;
+}
+
+trace_vfio_load_state_device_data(vbasedev->name, data_offset, size);
+
+while (size) {
+void *buf;
+uint64_t sec_size;
+bool buf_alloc = false;
+
+buf = get_data_section_size(region, data_offset, size, &sec_size);
+
+if (!buf) {
+buf = g_try_malloc(sec_size);
+if (!buf) {
+error_report("%s: Error allocating buffer ", __func__);
+return -ENOMEM;
+}
+buf_alloc = true;
+}
+
+qemu_get_buffer(f, buf, sec_size);
+
+if (buf_alloc) {
+ret = vfio_mig_write(vbasedev, buf, sec_size,
+region->fd_offset + data_offset);
+g_free(buf);
+
+if (ret < 0) {
+return ret;
+}
+}
+size -= sec_size;
+data_offset += sec_size;
+}
+
+ret = vfio_mig_write(vbasedev, &report_size, sizeof(report_size),
+region->fd_offset + VFIO_MIG_STRUCT_OFFSET(data_size));
+if (ret < 0) {
+return ret;
+}
+} while (data_size);
+
+return 0;
+}
+
 static int vfio_update_pending(VFIODevice *vbasedev)
 {
 VFIOMigration *migration = vbasedev->migration;
@@ -293,6 +364,33 @@ static int vfio_save_device_config_state(QEMUFile *f, void 
*opaque)
 return qemu_file_get_error(f);
 }
 
+static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
+{
+VFIODevice *vbasedev = opaque;
+uint64_t data;
+
+if (vbasedev->ops && vbasedev->ops->vfio_load_config) {
+int ret;
+
+ret = vbasedev->ops->vfio_load_config(vbasedev, f);
+if (ret) {
+error_report("%s: Failed to load device config space",
+ vbasedev->name);
+return ret;
+}
+}
+
+data = qemu_get_be64(f);
+if (data != VFIO_MIG_FLAG_END_OF_STATE) {
+error_report("%s: Failed loading device config space, "
+ "end flag incorrect 0x%"PRIx64, vbasedev->name, data);
+return -EINVAL;
+}
+
+trace_vfio_load_device_config_state(vbasedev->name);
+return qemu_file_get_error(f);
+}
+
 static void vfio_migration_cleanup(VFIODevice *vbasedev)
 {
 VFIOMigration *migration = vbasedev->migration;
@@ -483,12 +581,109 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
 return ret;
 }
 
+static int vfio_load_setup(QEMUFile *f, void *opaque)
+{
+VFIODevice *vbasedev = opaque;
+VFIOMigration *migration = vbasedev->migration;
+int ret = 0;
+
+if (migration->region.mmaps) {
+ret = vfio_region_mmap(&migration->region);
+if (ret) {
+error_report("%s: Failed to mmap VFIO migration region %d: %s",
+ vbasedev->name, migration->region.nr,
+   

[PATCH v28 04/17] vfio: Add migration region initialization and finalize function

2020-10-23 Thread Kirti Wankhede
Whether the VFIO device supports migration or not is decided based on the
migration region query. If the migration region query is successful and
migration region initialization is successful, then migration is supported;
otherwise migration is blocked.

Signed-off-by: Kirti Wankhede 
Reviewed-by: Neo Jia 
Acked-by: Dr. David Alan Gilbert 
---
 hw/vfio/meson.build   |   1 +
 hw/vfio/migration.c   | 133 ++
 hw/vfio/trace-events  |   3 +
 include/hw/vfio/vfio-common.h |   9 +++
 4 files changed, 146 insertions(+)
 create mode 100644 hw/vfio/migration.c

diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
index 37efa74018bc..da9af297a0c5 100644
--- a/hw/vfio/meson.build
+++ b/hw/vfio/meson.build
@@ -2,6 +2,7 @@ vfio_ss = ss.source_set()
 vfio_ss.add(files(
   'common.c',
   'spapr.c',
+  'migration.c',
 ))
 vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
   'display.c',
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
new file mode 100644
index ..bbe6e0b7a6cc
--- /dev/null
+++ b/hw/vfio/migration.c
@@ -0,0 +1,133 @@
+/*
+ * Migration support for VFIO devices
+ *
+ * Copyright NVIDIA, Inc. 2020
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include 
+
+#include "hw/vfio/vfio-common.h"
+#include "cpu.h"
+#include "migration/migration.h"
+#include "migration/qemu-file.h"
+#include "migration/register.h"
+#include "migration/blocker.h"
+#include "migration/misc.h"
+#include "qapi/error.h"
+#include "exec/ramlist.h"
+#include "exec/ram_addr.h"
+#include "pci.h"
+#include "trace.h"
+
+static void vfio_migration_region_exit(VFIODevice *vbasedev)
+{
+VFIOMigration *migration = vbasedev->migration;
+
+if (!migration) {
+return;
+}
+
+vfio_region_exit(&migration->region);
+vfio_region_finalize(&migration->region);
+}
+
+static int vfio_migration_init(VFIODevice *vbasedev,
+   struct vfio_region_info *info)
+{
+int ret;
+Object *obj;
+VFIOMigration *migration;
+
+if (!vbasedev->ops->vfio_get_object) {
+return -EINVAL;
+}
+
+obj = vbasedev->ops->vfio_get_object(vbasedev);
+if (!obj) {
+return -EINVAL;
+}
+
+migration = g_new0(VFIOMigration, 1);
+
+ret = vfio_region_setup(obj, vbasedev, &migration->region,
+info->index, "migration");
+if (ret) {
+error_report("%s: Failed to setup VFIO migration region %d: %s",
+ vbasedev->name, info->index, strerror(-ret));
+goto err;
+}
+
+vbasedev->migration = migration;
+
+if (!migration->region.size) {
+error_report("%s: Invalid zero-sized VFIO migration region %d",
+ vbasedev->name, info->index);
+ret = -EINVAL;
+goto err;
+}
+return 0;
+
+err:
+vfio_migration_region_exit(vbasedev);
+g_free(migration);
+vbasedev->migration = NULL;
+return ret;
+}
+
+/* -- */
+
+int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
+{
+struct vfio_region_info *info = NULL;
+Error *local_err = NULL;
+int ret;
+
+ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_MIGRATION,
+   VFIO_REGION_SUBTYPE_MIGRATION, &info);
+if (ret) {
+goto add_blocker;
+}
+
+ret = vfio_migration_init(vbasedev, info);
+if (ret) {
+goto add_blocker;
+}
+
+trace_vfio_migration_probe(vbasedev->name, info->index);
+g_free(info);
+return 0;
+
+add_blocker:
+error_setg(&vbasedev->migration_blocker,
+   "VFIO device doesn't support migration");
+g_free(info);
+
+ret = migrate_add_blocker(vbasedev->migration_blocker, &local_err);
+if (local_err) {
+error_propagate(errp, local_err);
+error_free(vbasedev->migration_blocker);
+vbasedev->migration_blocker = NULL;
+}
+return ret;
+}
+
+void vfio_migration_finalize(VFIODevice *vbasedev)
+{
+VFIOMigration *migration = vbasedev->migration;
+
+if (migration) {
+vfio_migration_region_exit(vbasedev);
+g_free(vbasedev->migration);
+vbasedev->migration = NULL;
+}
+
+if (vbasedev->migration_blocker) {
+migrate_del_blocker(vbasedev->migration_blocker);
+error_free(vbasedev->migration_blocker);
+vbasedev->migration_blocker = NULL;
+}
+}
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index a0c7b49a2ebc..9ced5ec6277c 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -145,3 +145,6 @@ vfio_display_edid_link_up(void) ""
 

[PATCH v28 13/17] vfio: Add vfio_listener_log_sync to mark dirty pages

2020-10-23 Thread Kirti Wankhede
vfio_listener_log_sync gets the list of dirty pages from the container using
the VFIO_IOMMU_GET_DIRTY_BITMAP ioctl and marks those pages dirty when all
devices are stopped and saving state.
Return early for the RAM block section of mapped MMIO region.

Signed-off-by: Kirti Wankhede 
Reviewed-by: Neo Jia 
---
 hw/vfio/common.c | 116 +++
 hw/vfio/trace-events |   1 +
 2 files changed, 117 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index d4959c036dd1..2634387df948 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -29,6 +29,7 @@
 #include "hw/vfio/vfio.h"
 #include "exec/address-spaces.h"
 #include "exec/memory.h"
+#include "exec/ram_addr.h"
 #include "hw/hw.h"
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
@@ -37,6 +38,7 @@
 #include "sysemu/reset.h"
 #include "trace.h"
 #include "qapi/error.h"
+#include "migration/migration.h"
 
 VFIOGroupList vfio_group_list =
 QLIST_HEAD_INITIALIZER(vfio_group_list);
@@ -287,6 +289,39 @@ const MemoryRegionOps vfio_region_ops = {
 };
 
 /*
+ * Device state interfaces
+ */
+
+static bool vfio_devices_all_stopped_and_saving(VFIOContainer *container)
+{
+VFIOGroup *group;
+VFIODevice *vbasedev;
+MigrationState *ms = migrate_get_current();
+
+if (!migration_is_setup_or_active(ms->state)) {
+return false;
+}
+
+QLIST_FOREACH(group, &container->group_list, container_next) {
+QLIST_FOREACH(vbasedev, &group->device_list, next) {
+VFIOMigration *migration = vbasedev->migration;
+
+if (!migration) {
+return false;
+}
+
+if ((migration->device_state & VFIO_DEVICE_STATE_SAVING) &&
+!(migration->device_state & VFIO_DEVICE_STATE_RUNNING)) {
+continue;
+} else {
+return false;
+}
+}
+}
+return true;
+}
+
+/*
  * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
  */
 static int vfio_dma_unmap(VFIOContainer *container,
@@ -812,9 +847,90 @@ static void vfio_listener_region_del(MemoryListener *listener,
 }
 }
 
+static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
+ uint64_t size, ram_addr_t ram_addr)
+{
+struct vfio_iommu_type1_dirty_bitmap *dbitmap;
+struct vfio_iommu_type1_dirty_bitmap_get *range;
+uint64_t pages;
+int ret;
+
+dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range));
+
+dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range);
+dbitmap->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
+range = (struct vfio_iommu_type1_dirty_bitmap_get *)&dbitmap->data;
+range->iova = iova;
+range->size = size;
+
+/*
+ * cpu_physical_memory_set_dirty_lebitmap() expects pages in bitmap of
+ * TARGET_PAGE_SIZE to mark those dirty. Hence set bitmap's pgsize to
+ * TARGET_PAGE_SIZE.
+ */
+range->bitmap.pgsize = TARGET_PAGE_SIZE;
+
+pages = TARGET_PAGE_ALIGN(range->size) >> TARGET_PAGE_BITS;
+range->bitmap.size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
+ BITS_PER_BYTE;
+range->bitmap.data = g_try_malloc0(range->bitmap.size);
+if (!range->bitmap.data) {
+ret = -ENOMEM;
+goto err_out;
+}
+
+ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap);
+if (ret) {
+error_report("Failed to get dirty bitmap for iova: 0x%llx "
+"size: 0x%llx err: %d",
+range->iova, range->size, errno);
+goto err_out;
+}
+
+cpu_physical_memory_set_dirty_lebitmap((uint64_t *)range->bitmap.data,
+ram_addr, pages);
+
+trace_vfio_get_dirty_bitmap(container->fd, range->iova, range->size,
+range->bitmap.size, ram_addr);
+err_out:
+g_free(range->bitmap.data);
+g_free(dbitmap);
+
+return ret;
+}
+
+static int vfio_sync_dirty_bitmap(VFIOContainer *container,
+  MemoryRegionSection *section)
+{
+ram_addr_t ram_addr;
+
+ram_addr = memory_region_get_ram_addr(section->mr) +
+   section->offset_within_region;
+
+return vfio_get_dirty_bitmap(container,
+   TARGET_PAGE_ALIGN(section->offset_within_address_space),
+   int128_get64(section->size), ram_addr);
+}
+
+static void vfio_listerner_log_sync(MemoryListener *listener,
+MemoryRegionSection *section)
+{
+VFIOContainer *container = container_of(listener, VFIOContainer, listener);
+
+if (vfio_listener_skipped_section(section) ||
+!container->dirty_pages_supported) {
+r

[PATCH v28 03/17] vfio: Add save and load functions for VFIO PCI devices

2020-10-23 Thread Kirti Wankhede
Added functions to save and restore PCI device specific data,
specifically config space of PCI device.

Signed-off-by: Kirti Wankhede 
Reviewed-by: Neo Jia 
---
 hw/vfio/pci.c | 48 +++
 include/hw/vfio/vfio-common.h |  2 ++
 2 files changed, 50 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index bffd5bfe3b78..92cc25a5489f 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -41,6 +41,7 @@
 #include "trace.h"
 #include "qapi/error.h"
 #include "migration/blocker.h"
+#include "migration/qemu-file.h"
 
 #define TYPE_VFIO_PCI_NOHOTPLUG "vfio-pci-nohotplug"
 
@@ -2401,11 +2402,58 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev)
 return OBJECT(vdev);
 }
 
+static bool vfio_msix_present(void *opaque, int version_id)
+{
+PCIDevice *pdev = opaque;
+
+return msix_present(pdev);
+}
+
+const VMStateDescription vmstate_vfio_pci_config = {
+.name = "VFIOPCIDevice",
+.version_id = 1,
+.minimum_version_id = 1,
+.fields = (VMStateField[]) {
+VMSTATE_PCI_DEVICE(pdev, VFIOPCIDevice),
+VMSTATE_MSIX_TEST(pdev, VFIOPCIDevice, vfio_msix_present),
+VMSTATE_END_OF_LIST()
+}
+};
+
+static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
+{
+VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+
+vmstate_save_state(f, &vmstate_vfio_pci_config, vdev, NULL);
+}
+
+static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
+{
+VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+PCIDevice *pdev = &vdev->pdev;
+int ret;
+
+ret = vmstate_load_state(f, &vmstate_vfio_pci_config, vdev, 1);
+if (ret) {
+return ret;
+}
+
+if (msi_enabled(pdev)) {
+vfio_msi_enable(vdev);
+} else if (msix_enabled(pdev)) {
+vfio_msix_enable(vdev);
+}
+
+return ret;
+}
+
 static VFIODeviceOps vfio_pci_ops = {
 .vfio_compute_needs_reset = vfio_pci_compute_needs_reset,
 .vfio_hot_reset_multi = vfio_pci_hot_reset_multi,
 .vfio_eoi = vfio_intx_eoi,
 .vfio_get_object = vfio_pci_get_object,
+.vfio_save_config = vfio_pci_save_config,
+.vfio_load_config = vfio_pci_load_config,
 };
 
 int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp)
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index fe99c36a693a..ba6169cd926e 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -120,6 +120,8 @@ struct VFIODeviceOps {
 int (*vfio_hot_reset_multi)(VFIODevice *vdev);
 void (*vfio_eoi)(VFIODevice *vdev);
 Object *(*vfio_get_object)(VFIODevice *vdev);
+void (*vfio_save_config)(VFIODevice *vdev, QEMUFile *f);
+int (*vfio_load_config)(VFIODevice *vdev, QEMUFile *f);
 };
 
 typedef struct VFIOGroup {
-- 
2.7.0




[PATCH v28 01/17] vfio: Add function to unmap VFIO region

2020-10-23 Thread Kirti Wankhede
This function will be used for the migration region.
The migration region is mmapped when migration starts and will be unmapped
when migration is complete.

Signed-off-by: Kirti Wankhede 
Reviewed-by: Neo Jia 
Reviewed-by: Cornelia Huck 
---
 hw/vfio/common.c  | 32 
 hw/vfio/trace-events  |  1 +
 include/hw/vfio/vfio-common.h |  1 +
 3 files changed, 30 insertions(+), 4 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 13471ae29436..c6e98b8d61be 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -924,6 +924,18 @@ int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region,
 return 0;
 }
 
+static void vfio_subregion_unmap(VFIORegion *region, int index)
+{
+trace_vfio_region_unmap(memory_region_name(&region->mmaps[index].mem),
+region->mmaps[index].offset,
+region->mmaps[index].offset +
+region->mmaps[index].size - 1);
+memory_region_del_subregion(region->mem, &region->mmaps[index].mem);
+munmap(region->mmaps[index].mmap, region->mmaps[index].size);
+object_unparent(OBJECT(&region->mmaps[index].mem));
+region->mmaps[index].mmap = NULL;
+}
+
 int vfio_region_mmap(VFIORegion *region)
 {
 int i, prot = 0;
@@ -954,10 +966,7 @@ int vfio_region_mmap(VFIORegion *region)
 region->mmaps[i].mmap = NULL;
 
 for (i--; i >= 0; i--) {
-memory_region_del_subregion(region->mem, &region->mmaps[i].mem);
-munmap(region->mmaps[i].mmap, region->mmaps[i].size);
-object_unparent(OBJECT(>mmaps[i].mem));
-region->mmaps[i].mmap = NULL;
+vfio_subregion_unmap(region, i);
 }
 
 return ret;
@@ -982,6 +991,21 @@ int vfio_region_mmap(VFIORegion *region)
 return 0;
 }
 
+void vfio_region_unmap(VFIORegion *region)
+{
+int i;
+
+if (!region->mem) {
+return;
+}
+
+for (i = 0; i < region->nr_mmaps; i++) {
+if (region->mmaps[i].mmap) {
+vfio_subregion_unmap(region, i);
+}
+}
+}
+
 void vfio_region_exit(VFIORegion *region)
 {
 int i;
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 93a0bc2522f8..a0c7b49a2ebc 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -113,6 +113,7 @@ vfio_region_mmap(const char *name, unsigned long offset, unsigned long end) "Reg
 vfio_region_exit(const char *name, int index) "Device %s, region %d"
 vfio_region_finalize(const char *name, int index) "Device %s, region %d"
 vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps enabled: %d"
+vfio_region_unmap(const char *name, unsigned long offset, unsigned long end) "Region %s unmap [0x%lx - 0x%lx]"
 vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Device %s region %d: %d sparse mmap entries"
 vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
 vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index c78f3ff5593c..dc95f527b583 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -171,6 +171,7 @@ int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region,
   int index, const char *name);
 int vfio_region_mmap(VFIORegion *region);
 void vfio_region_mmaps_set_enabled(VFIORegion *region, bool enabled);
+void vfio_region_unmap(VFIORegion *region);
 void vfio_region_exit(VFIORegion *region);
 void vfio_region_finalize(VFIORegion *region);
 void vfio_reset_handler(void *opaque);
-- 
2.7.0




[PATCH v28 10/17] memory: Set DIRTY_MEMORY_MIGRATION when IOMMU is enabled

2020-10-23 Thread Kirti Wankhede
mr->ram_block is NULL when mr->is_iommu is true, so fr.dirty_log_mask
wasn't set correctly, due to which the memory listener's log_sync doesn't
get called.
This patch returns log_mask with DIRTY_MEMORY_MIGRATION set when the
IOMMU is enabled.

Signed-off-by: Kirti Wankhede 
---
 softmmu/memory.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/softmmu/memory.c b/softmmu/memory.c
index 403ff3abc99b..94f606e9d9d9 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -1792,7 +1792,7 @@ bool memory_region_is_ram_device(MemoryRegion *mr)
 uint8_t memory_region_get_dirty_log_mask(MemoryRegion *mr)
 {
 uint8_t mask = mr->dirty_log_mask;
-if (global_dirty_log && mr->ram_block) {
+if (global_dirty_log && (mr->ram_block || memory_region_is_iommu(mr))) {
 mask |= (1 << DIRTY_MEMORY_MIGRATION);
 }
 return mask;
-- 
2.7.0




[PATCH v28 05/17] vfio: Add VM state change handler to know state of VM

2020-10-23 Thread Kirti Wankhede
VM state change handler is called on change in VM's state. Based on
VM state, VFIO device state should be changed.
Added read/write helper functions for migration region.
Added function to set device_state.

Signed-off-by: Kirti Wankhede 
Reviewed-by: Neo Jia 
Reviewed-by: Dr. David Alan Gilbert 
---
 hw/vfio/migration.c   | 156 ++
 hw/vfio/trace-events  |   2 +
 include/hw/vfio/vfio-common.h |   4 ++
 3 files changed, 162 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index bbe6e0b7a6cc..9b6949439f8e 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -10,6 +10,7 @@
 #include "qemu/osdep.h"
 #include 
 
+#include "sysemu/runstate.h"
 #include "hw/vfio/vfio-common.h"
 #include "cpu.h"
 #include "migration/migration.h"
@@ -22,6 +23,157 @@
 #include "exec/ram_addr.h"
 #include "pci.h"
 #include "trace.h"
+#include "hw/hw.h"
+
+static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count,
+  off_t off, bool iswrite)
+{
+int ret;
+
+ret = iswrite ? pwrite(vbasedev->fd, val, count, off) :
+pread(vbasedev->fd, val, count, off);
+if (ret < count) {
+error_report("vfio_mig_%s %d byte %s: failed at offset 0x%lx, err: %s",
+ iswrite ? "write" : "read", count,
+ vbasedev->name, off, strerror(errno));
+return (ret < 0) ? ret : -EINVAL;
+}
+return 0;
+}
+
+static int vfio_mig_rw(VFIODevice *vbasedev, __u8 *buf, size_t count,
+   off_t off, bool iswrite)
+{
+int ret, done = 0;
+__u8 *tbuf = buf;
+
+while (count) {
+int bytes = 0;
+
+if (count >= 8 && !(off % 8)) {
+bytes = 8;
+} else if (count >= 4 && !(off % 4)) {
+bytes = 4;
+} else if (count >= 2 && !(off % 2)) {
+bytes = 2;
+} else {
+bytes = 1;
+}
+
+ret = vfio_mig_access(vbasedev, tbuf, bytes, off, iswrite);
+if (ret) {
+return ret;
+}
+
+count -= bytes;
+done += bytes;
+off += bytes;
+tbuf += bytes;
+}
+return done;
+}
+
+#define vfio_mig_read(f, v, c, o)   vfio_mig_rw(f, (__u8 *)v, c, o, false)
+#define vfio_mig_write(f, v, c, o)  vfio_mig_rw(f, (__u8 *)v, c, o, true)
+
+#define VFIO_MIG_STRUCT_OFFSET(f)   \
+ offsetof(struct vfio_device_migration_info, f)
+/*
+ * Change the device_state register for device @vbasedev. Bits set in @mask
+ * are preserved, bits set in @value are set, and bits not set in either @mask
+ * or @value are cleared in device_state. If the register cannot be accessed,
+ * the resulting state would be invalid, or the device enters an error state,
+ * an error is returned.
+ */
+
+static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
+uint32_t value)
+{
+VFIOMigration *migration = vbasedev->migration;
+VFIORegion *region = &migration->region;
+off_t dev_state_off = region->fd_offset +
+  VFIO_MIG_STRUCT_OFFSET(device_state);
+uint32_t device_state;
+int ret;
+
+ret = vfio_mig_read(vbasedev, &device_state, sizeof(device_state),
+dev_state_off);
+if (ret < 0) {
+return ret;
+}
+
+device_state = (device_state & mask) | value;
+
+if (!VFIO_DEVICE_STATE_VALID(device_state)) {
+return -EINVAL;
+}
+
+ret = vfio_mig_write(vbasedev, &device_state, sizeof(device_state),
+ dev_state_off);
+if (ret < 0) {
+int rret;
+
+rret = vfio_mig_read(vbasedev, &device_state, sizeof(device_state),
+ dev_state_off);
+
+if ((rret < 0) || (VFIO_DEVICE_STATE_IS_ERROR(device_state))) {
+hw_error("%s: Device in error state 0x%x", vbasedev->name,
+ device_state);
+return rret ? rret : -EIO;
+}
+return ret;
+}
+
+migration->device_state = device_state;
+trace_vfio_migration_set_state(vbasedev->name, device_state);
+return 0;
+}
+
+static void vfio_vmstate_change(void *opaque, int running, RunState state)
+{
+VFIODevice *vbasedev = opaque;
+VFIOMigration *migration = vbasedev->migration;
+uint32_t value, mask;
+int ret;
+
+if ((vbasedev->migration->vm_running == running)) {
+return;
+}
+
+if (running) {
+/*
+ * Here device state can have one of _SAVING, _RESUMING or _STOP bit.
+ * Transition from _SAVING to _RUNNING can happen if there is migration
+ * failure, in that case clear _SAVING bit.
+ * Transition from _RE

[PATCH v28 00/17] Add migration support for VFIO devices

2020-10-23 Thread Kirti Wankhede
opy is not supported.

v27 -> v28
- Nit picks and minor changes suggested by Alex.

v26 -> 27
- Major change in Patch 3 - PCI config space save and load using VMSTATE_*
- Major change in Patch 14 - Dirty page tracking when vIOMMU is enabled using 
IOMMU notifier and
  its replay functionality - as suggested by Alex.
- Some Structure changes to keep all migration related members at one place.
- Pulled fix suggested by Zhi Wang 
  https://www.mail-archive.com/qemu-devel@nongnu.org/msg743722.html
- Added comments wherever suggested and required.

v25 -> 26
- Removed emulated_config_bits cache and vdev->pdev.wmask from config space save
  load functions.
- Used VMStateDescription for config space save and load functionality.
- Major fixes from previous version review.
  https://www.mail-archive.com/qemu-devel@nongnu.org/msg714625.html

v23 -> 25
- Updated config space save and load to save config cache, emulated bits cache
  and wmask cache.
- Created idr string as suggested by Dr Dave that includes bus path.
- Updated save and load function to read/write data to mixed regions, mapped or
  trapped.
- When vIOMMU is enabled, created mapped iova range list which also keeps
  translated address. This list is used to mark dirty pages. This reduces
  downtime significantly with vIOMMU enabled than migration patches from
   previous version. 
- Removed get_address_limit() function from v23 patch as this not required now.

v22 -> v23
-- Fixed issue reported by Yan
https://lore.kernel.org/kvm/97977ede-3c5b-c5a5-7858-7eecd7dd5...@nvidia.com/
- Sending this version to test v23 kernel version patches:
https://lore.kernel.org/kvm/1589998088-3250-1-git-send-email-kwankh...@nvidia.com/

v18 -> v22
- Few fixes from v18 review. But not yet fixed all concerns. I'll address those
  concerns in subsequent iterations.
- Sending this version to test v22 kernel version patches:
https://lore.kernel.org/kvm/1589781397-28368-1-git-send-email-kwankh...@nvidia.com/

v16 -> v18
- Nit fixes
- Get migration capability flags from container
- Added VFIO stats to MigrationInfo
- Fixed bug reported by Yan
https://lists.gnu.org/archive/html/qemu-devel/2020-04/msg4.html

v9 -> v16
- KABI almost finalised on kernel patches.
- Added support for migration with vIOMMU enabled.

v8 -> v9:
- Split patch set in 2 sets, Kernel and QEMU sets.
- Dirty pages bitmap is queried from IOMMU container rather than from
  vendor driver for per device. Added 2 ioctls to achieve this.

v7 -> v8:
- Updated comments for KABI
- Added BAR address validation check during PCI device's config space load as
  suggested by Dr. David Alan Gilbert.
- Changed vfio_migration_set_state() to set or clear device state flags.
- Some nit fixes.

v6 -> v7:
- Fix build failures.

v5 -> v6:
- Fix build failure.

v4 -> v5:
- Added descriptive comment about the sequence of access of members of structure
  vfio_device_migration_info to be followed based on Alex's suggestion
- Updated get dirty pages sequence.
- As per Cornelia Huck's suggestion, added callbacks to VFIODeviceOps to
  get_object, save_config and load_config.
- Fixed multiple nit picks.
- Tested live migration with multiple vfio device assigned to a VM.

v3 -> v4:
- Added one more bit for _RESUMING flag to be set explicitly.
- data_offset field is read-only for user space application.
- data_size is read for every iteration before reading data from migration, that
  is removed assumption that data will be till end of migration region.
- If vendor driver supports mappable sparsed region, map those region during
  setup state of save/load, similarly unmap those from cleanup routines.
- Handles race condition that causes data corruption in migration region during
  save device state by adding mutex and serializing save_buffer and
  get_dirty_pages routines.
- Skip called get_dirty_pages routine for mapped MMIO region of device.
- Added trace events.
- Split into multiple functional patches.

v2 -> v3:
- Removed enum of VFIO device states. Defined VFIO device state with 2 bits.
- Re-structured vfio_device_migration_info to keep it minimal and defined action
  on read and write access on its members.

v1 -> v2:
- Defined MIGRATION region type and sub-type which should be used with region
  type capability.
- Re-structured vfio_device_migration_info. This structure will be placed at 0th
  offset of migration region.
- Replaced ioctl with read/write for trapped part of migration region.
- Added both type of access support, trapped or mmapped, for data section of the
  region.
- Moved PCI device functions to pci file.
- Added iteration to get dirty page bitmap until bitmap for all requested pages
  are copied.

Thanks,
Kirti



Kirti Wankhede (17):
  vfio: Add function to unmap VFIO region
  vfio: Add vfio_get_object callback to VFIODeviceOps
  vfio: Add save and load functions for VFIO PCI devices
  vfio: Add migration region initialization and finalize function
  vfio: Add V

[PATCH v28 02/17] vfio: Add vfio_get_object callback to VFIODeviceOps

2020-10-23 Thread Kirti Wankhede
Hook vfio_get_object callback for PCI devices.

Signed-off-by: Kirti Wankhede 
Reviewed-by: Neo Jia 
Suggested-by: Cornelia Huck 
Reviewed-by: Cornelia Huck 
---
 hw/vfio/pci.c | 8 
 include/hw/vfio/vfio-common.h | 1 +
 2 files changed, 9 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 0d83eb0e47bb..bffd5bfe3b78 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2394,10 +2394,18 @@ static void vfio_pci_compute_needs_reset(VFIODevice *vbasedev)
 }
 }
 
+static Object *vfio_pci_get_object(VFIODevice *vbasedev)
+{
+VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+
+return OBJECT(vdev);
+}
+
 static VFIODeviceOps vfio_pci_ops = {
 .vfio_compute_needs_reset = vfio_pci_compute_needs_reset,
 .vfio_hot_reset_multi = vfio_pci_hot_reset_multi,
 .vfio_eoi = vfio_intx_eoi,
+.vfio_get_object = vfio_pci_get_object,
 };
 
 int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp)
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index dc95f527b583..fe99c36a693a 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -119,6 +119,7 @@ struct VFIODeviceOps {
 void (*vfio_compute_needs_reset)(VFIODevice *vdev);
 int (*vfio_hot_reset_multi)(VFIODevice *vdev);
 void (*vfio_eoi)(VFIODevice *vdev);
+Object *(*vfio_get_object)(VFIODevice *vdev);
 };
 
 typedef struct VFIOGroup {
-- 
2.7.0




Re: [PATCH v27 17/17] qapi: Add VFIO devices migration stats in Migration stats

2020-10-23 Thread Kirti Wankhede




On 10/23/2020 3:48 AM, Alex Williamson wrote:

On Thu, 22 Oct 2020 16:42:07 +0530
Kirti Wankhede  wrote:


Added the number of bytes transferred to the VM at the destination by all VFIO
devices.

Signed-off-by: Kirti Wankhede 
Reviewed-by: Dr. David Alan Gilbert 
---
  hw/vfio/common.c| 20 
  hw/vfio/migration.c | 10 ++
  include/qemu/vfio-helpers.h |  3 +++
  migration/migration.c   | 14 ++
  monitor/hmp-cmds.c  |  6 ++
  qapi/migration.json | 17 +
  6 files changed, 70 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 9c879e5c0f62..8d0758eda9fa 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -39,6 +39,7 @@
  #include "trace.h"
  #include "qapi/error.h"
  #include "migration/migration.h"
+#include "qemu/vfio-helpers.h"
  
  VFIOGroupList vfio_group_list =

  QLIST_HEAD_INITIALIZER(vfio_group_list);
@@ -292,6 +293,25 @@ const MemoryRegionOps vfio_region_ops = {
   * Device state interfaces
   */
  
+bool vfio_mig_active(void)

+{
+VFIOGroup *group;
+VFIODevice *vbasedev;
+
+if (QLIST_EMPTY(&vfio_group_list)) {
+return false;
+}
+
+QLIST_FOREACH(group, &vfio_group_list, next) {
+QLIST_FOREACH(vbasedev, &group->device_list, next) {
+if (vbasedev->migration_blocker) {
+return false;
+}
+}
+}
+return true;
+}
+
  static bool vfio_devices_all_stopped_and_saving(VFIOContainer *container)
  {
  VFIOGroup *group;
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 77ee60a43ea5..b23e21c6de2b 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -28,6 +28,7 @@
  #include "pci.h"
  #include "trace.h"
  #include "hw/hw.h"
+#include "qemu/vfio-helpers.h"
  
  /*

   * Flags to be used as unique delimiters for VFIO devices in the migration
@@ -45,6 +46,8 @@
  #define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xef13ULL)
  #define VFIO_MIG_FLAG_DEV_DATA_STATE(0xef14ULL)
  
+static int64_t bytes_transferred;

+
  static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count,
off_t off, bool iswrite)
  {
@@ -255,6 +258,7 @@ static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev, uint64_t *size)
  *size = data_size;
  }
  
+bytes_transferred += data_size;

  return ret;
  }
  
@@ -776,6 +780,7 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data)

  case MIGRATION_STATUS_CANCELLING:
  case MIGRATION_STATUS_CANCELLED:
  case MIGRATION_STATUS_FAILED:
+bytes_transferred = 0;
  ret = vfio_migration_set_state(vbasedev,
~(VFIO_DEVICE_STATE_SAVING | 
VFIO_DEVICE_STATE_RESUMING),
VFIO_DEVICE_STATE_RUNNING);
@@ -862,6 +867,11 @@ err:
  
  /* -- */
  
+int64_t vfio_mig_bytes_transferred(void)

+{
+return bytes_transferred;
+}
+
  int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
  {
  VFIOContainer *container = vbasedev->group->container;
diff --git a/include/qemu/vfio-helpers.h b/include/qemu/vfio-helpers.h
index 4491c8e1a6e9..7f7a46e6ef2d 100644
--- a/include/qemu/vfio-helpers.h
+++ b/include/qemu/vfio-helpers.h
@@ -29,4 +29,7 @@ void qemu_vfio_pci_unmap_bar(QEMUVFIOState *s, int index, 
void *bar,
  int qemu_vfio_pci_init_irq(QEMUVFIOState *s, EventNotifier *e,
 int irq_type, Error **errp);
  
+bool vfio_mig_active(void);

+int64_t vfio_mig_bytes_transferred(void);
+
  #endif



I don't think vfio-helpers is the right place for this, this header is
specifically for using util/vfio-helpers.c.  Would
include/hw/vfio/vfio-common.h work?




Yes, works with CONFIG_VFIO check. Changing it.


diff --git a/migration/migration.c b/migration/migration.c
index 0575ecb37953..8b2865d25ef4 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -56,6 +56,7 @@
  #include "net/announce.h"
  #include "qemu/queue.h"
  #include "multifd.h"
+#include "qemu/vfio-helpers.h"
  
  #define MAX_THROTTLE  (128 << 20)  /* Migration transfer speed throttling */
  
@@ -1002,6 +1003,17 @@ static void populate_disk_info(MigrationInfo *info)

  }
  }
  
+static void populate_vfio_info(MigrationInfo *info)

+{
+#ifdef CONFIG_LINUX


Use CONFIG_VFIO?  I get a build failure on qemu-system-avr

/usr/bin/ld: /tmp/tmp.3QbqxgbENl/build/../migration/migration.c:1012:
undefined reference to `vfio_mig_bytes_transferred'.  Thanks,



Ok Changing it.


Alex


+if (vfio_mig_active()) {
+info->has_vfio = true;
+info->vfio = g_malloc0(sizeof(*info->vfio));
+info->vfio->transferred = vfio_mig_bytes_transferred();
+}
+#endif
+}
+
  static void fill_sou

Re: [PATCH v27 09/17] vfio: Add load state functions to SaveVMHandlers

2020-10-23 Thread Kirti Wankhede




On 10/23/2020 1:20 AM, Alex Williamson wrote:

On Thu, 22 Oct 2020 16:41:59 +0530
Kirti Wankhede  wrote:


Sequence  during _RESUMING device state:
While data for this device is available, repeat below steps:
a. read data_offset from where user application should write data.
b. write data of data_size to migration region from data_offset.
c. write data_size, which indicates to the vendor driver that data is
written in the staging buffer.

For user, data is opaque. User should write data in the same order as
received.

Signed-off-by: Kirti Wankhede 
Reviewed-by: Neo Jia 
Reviewed-by: Dr. David Alan Gilbert 
---
  hw/vfio/migration.c  | 192 +++
  hw/vfio/trace-events |   3 +
  2 files changed, 195 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 5506cef15d88..46d05d230e2a 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -257,6 +257,77 @@ static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev, uint64_t *size)
  return ret;
  }
  
+static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
+uint64_t data_size)
+{
+VFIORegion *region = &vbasedev->migration->region;
+uint64_t data_offset = 0, size, report_size;
+int ret;
+
+do {
+ret = vfio_mig_read(vbasedev, &data_offset, sizeof(data_offset),
+  region->fd_offset + VFIO_MIG_STRUCT_OFFSET(data_offset));
+if (ret < 0) {
+return ret;
+}
+
+if (data_offset + data_size > region->size) {
+/*
+ * If data_size is greater than the data section of migration region
+ * then iterate the write buffer operation. This case can occur if
+ * size of migration region at destination is smaller than size of
+ * migration region at source.
+ */
+report_size = size = region->size - data_offset;
+data_size -= size;
+} else {
+report_size = size = data_size;
+data_size = 0;
+}
+
+trace_vfio_load_state_device_data(vbasedev->name, data_offset, size);
+
+while (size) {
+void *buf;
+uint64_t sec_size;
+bool buf_alloc = false;
+
+buf = get_data_section_size(region, data_offset, size, &sec_size);
+
+if (!buf) {
+buf = g_try_malloc(sec_size);
+if (!buf) {
+error_report("%s: Error allocating buffer ", __func__);
+return -ENOMEM;
+}
+buf_alloc = true;
+}
+
+qemu_get_buffer(f, buf, sec_size);
+
+if (buf_alloc) {
+ret = vfio_mig_write(vbasedev, buf, sec_size,
+region->fd_offset + data_offset);
+g_free(buf);
+
+if (ret < 0) {
+return ret;
+}
+}
+size -= sec_size;
+data_offset += sec_size;
+}
+
+ret = vfio_mig_write(vbasedev, &report_size, sizeof(report_size),
+region->fd_offset + VFIO_MIG_STRUCT_OFFSET(data_size));
+if (ret < 0) {
+return ret;
+}
+} while (data_size);
+
+return 0;
+}
+
  static int vfio_update_pending(VFIODevice *vbasedev)
  {
  VFIOMigration *migration = vbasedev->migration;
@@ -293,6 +364,33 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
  return qemu_file_get_error(f);
  }
  
+static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
+{
+VFIODevice *vbasedev = opaque;
+uint64_t data;
+
+if (vbasedev->ops && vbasedev->ops->vfio_load_config) {
+int ret;
+
+ret = vbasedev->ops->vfio_load_config(vbasedev, f);
+if (ret) {
+error_report("%s: Failed to load device config space",
+ vbasedev->name);
+return ret;
+}
+}
+
+data = qemu_get_be64(f);
+if (data != VFIO_MIG_FLAG_END_OF_STATE) {
+error_report("%s: Failed loading device config space, "
+ "end flag incorrect 0x%"PRIx64, vbasedev->name, data);
+return -EINVAL;
+}
+
+trace_vfio_load_device_config_state(vbasedev->name);
+return qemu_file_get_error(f);
+}
+
  /* -- */
  
  static int vfio_save_setup(QEMUFile *f, void *opaque)

@@ -477,12 +575,106 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
  return ret;
  }
  
+static int vfio_load_setup(QEMUFile *f, void *opaque)
+{
+VFIODevice *vbasedev = opaque;
+VFIOMigration *migration = vbasedev->migration;
+int ret = 0;
+
+if (migration->region.mmaps) {
+ret = vfio_region_mmap(&migration->region);



Checking, are we in the right t

Re: [PATCH v27 14/17] vfio: Dirty page tracking when vIOMMU is enabled

2020-10-23 Thread Kirti Wankhede




On 10/23/2020 2:07 AM, Alex Williamson wrote:

On Thu, 22 Oct 2020 16:42:04 +0530
Kirti Wankhede  wrote:


When vIOMMU is enabled, register MAP notifier from log_sync when all
devices in container are in stop and copy phase of migration. Call replay
and get dirty pages from notifier callback.

Suggested-by: Alex Williamson 
Signed-off-by: Kirti Wankhede 
---
  hw/vfio/common.c  | 95 ---
  hw/vfio/trace-events  |  1 +
  include/hw/vfio/vfio-common.h |  1 +
  3 files changed, 91 insertions(+), 6 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 2634387df948..98c2b1f9b190 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -442,8 +442,8 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
  }
  
  /* Called with rcu_read_lock held.  */

-static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
-   bool *read_only)
+static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
+   ram_addr_t *ram_addr, bool *read_only)
  {
  MemoryRegion *mr;
  hwaddr xlat;
@@ -474,8 +474,17 @@ static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
  return false;
  }
  
-*vaddr = memory_region_get_ram_ptr(mr) + xlat;
-*read_only = !writable || mr->readonly;
+if (vaddr) {
+*vaddr = memory_region_get_ram_ptr(mr) + xlat;
+}
+
+if (ram_addr) {
+*ram_addr = memory_region_get_ram_addr(mr) + xlat;
+}
+
+if (read_only) {
+*read_only = !writable || mr->readonly;
+}
  
  return true;

  }
@@ -485,7 +494,6 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
  VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
  VFIOContainer *container = giommu->container;
  hwaddr iova = iotlb->iova + giommu->iommu_offset;
-bool read_only;
  void *vaddr;
  int ret;
  
@@ -501,7 +509,9 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)

  rcu_read_lock();
  
  if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {

-if (!vfio_get_vaddr(iotlb, &vaddr, &read_only)) {
+bool read_only;
+
+if (!vfio_get_xlat_addr(iotlb, &vaddr, NULL, &read_only)) {
  goto out;
  }
  /*
@@ -899,11 +909,84 @@ err_out:
  return ret;
  }
  
+static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
+{
+VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, dirty_notify);
+VFIOContainer *container = giommu->container;
+hwaddr iova = iotlb->iova + giommu->iommu_offset;
+ram_addr_t translated_addr;
+
+trace_vfio_iommu_map_dirty_notify(iova, iova + iotlb->addr_mask);
+
+if (iotlb->target_as != &address_space_memory) {
+error_report("Wrong target AS \"%s\", only system memory is allowed",
+ iotlb->target_as->name ? iotlb->target_as->name : "none");
+return;
+}
+
+rcu_read_lock();
+
+if (vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL)) {
+int ret;
+
+ret = vfio_get_dirty_bitmap(container, iova, iotlb->addr_mask + 1,
+translated_addr);
+if (ret) {
+error_report("vfio_iommu_map_dirty_notify(%p, 0x%"HWADDR_PRIx", "
+ "0x%"HWADDR_PRIx") = %d (%m)",
+ container, iova,
+ iotlb->addr_mask + 1, ret);
+}
+}
+
+rcu_read_unlock();
+}
+
  static int vfio_sync_dirty_bitmap(VFIOContainer *container,
MemoryRegionSection *section)
  {
  ram_addr_t ram_addr;
  
+if (memory_region_is_iommu(section->mr)) {
+VFIOGuestIOMMU *giommu;
+int ret = 0;
+
+QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
+if (MEMORY_REGION(giommu->iommu) == section->mr &&
+giommu->n.start == section->offset_within_region) {
+Int128 llend;
+Error *err = NULL;
+int idx = memory_region_iommu_attrs_to_index(giommu->iommu,
+   MEMTXATTRS_UNSPECIFIED);
+
+llend = int128_add(int128_make64(section->offset_within_region),
+   section->size);
+llend = int128_sub(llend, int128_one());
+
+iommu_notifier_init(&giommu->dirty_notify,
+vfio_iommu_map_dirty_notify,
+IOMMU_NOTIFIER_MAP,
+section->offset_within_region,
+int128_get64(llend),
+idx);
+ret = memory_region_register_iommu_notifier(section->mr

Re: [PATCH v27 07/17] vfio: Register SaveVMHandlers for VFIO device

2020-10-23 Thread Kirti Wankhede




On 10/23/2020 12:21 AM, Alex Williamson wrote:

On Thu, 22 Oct 2020 16:41:57 +0530
Kirti Wankhede  wrote:


Define flags to be used as delimiter in migration stream for VFIO devices.
Added .save_setup and .save_cleanup functions. Map & unmap migration
region from these functions at source during saving or pre-copy phase.

Set VFIO device state depending on VM's state. During live migration, VM is
running when .save_setup is called, _SAVING | _RUNNING state is set for VFIO
device. During save-restore, VM is paused, _SAVING state is set for VFIO device.

Signed-off-by: Kirti Wankhede 
Reviewed-by: Neo Jia 
---
  hw/vfio/migration.c  | 96 
  hw/vfio/trace-events |  2 ++
  2 files changed, 98 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 7c4fa0d08ea6..2e1054bf7f43 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -8,12 +8,15 @@
   */
  
  #include "qemu/osdep.h"

+#include "qemu/main-loop.h"
+#include "qemu/cutils.h"
  #include <linux/vfio.h>
  
  #include "sysemu/runstate.h"

  #include "hw/vfio/vfio-common.h"
  #include "cpu.h"
  #include "migration/migration.h"
+#include "migration/vmstate.h"
  #include "migration/qemu-file.h"
  #include "migration/register.h"
  #include "migration/blocker.h"
@@ -25,6 +28,22 @@
  #include "trace.h"
  #include "hw/hw.h"
  
+/*
+ * Flags to be used as unique delimiters for VFIO devices in the migration
+ * stream. These flags are composed as:
+ * 0xffffffff => MSB 32-bit all 1s
+ * 0xef10 => Magic ID, represents emulated (virtual) function IO
+ * 0x0000 => 16-bits reserved for flags
+ *
+ * The beginning of state information is marked by _DEV_CONFIG_STATE,
+ * _DEV_SETUP_STATE, or _DEV_DATA_STATE, respectively. The end of a
+ * certain state information is marked by _END_OF_STATE.
+ */
+#define VFIO_MIG_FLAG_END_OF_STATE  (0xffffffffef100001ULL)
+#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
+#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
+#define VFIO_MIG_FLAG_DEV_DATA_STATE(0xffffffffef100004ULL)
+
  static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count,
off_t off, bool iswrite)
  {
@@ -129,6 +148,69 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
  return 0;
  }
  
+/* -- */

+
+static int vfio_save_setup(QEMUFile *f, void *opaque)
+{
+VFIODevice *vbasedev = opaque;
+VFIOMigration *migration = vbasedev->migration;
+int ret;
+
+trace_vfio_save_setup(vbasedev->name);
+
+qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
+
+if (migration->region.mmaps) {
+/*
+ * vfio_region_mmap() called from migration thread. Memory API called
+ * from vfio_region_mmap() needs the iothread lock when called from
+ * outside the main loop thread.
+ */


Thanks for adding this detail, maybe refine slightly as:

   Calling vfio_region_mmap() from migration thread.  Memory APIs called
   from this function require locking the iothread when called from
   outside the main loop thread.

Does that capture the intent?



Ok.


+qemu_mutex_lock_iothread();
+ret = vfio_region_mmap(&migration->region);
+qemu_mutex_unlock_iothread();
+if (ret) {
+error_report("%s: Failed to mmap VFIO migration region: %s",
+ vbasedev->name, strerror(-ret));
+error_report("%s: Falling back to slow path", vbasedev->name);
+}
+}
+
+ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_MASK,
+   VFIO_DEVICE_STATE_SAVING);
+if (ret) {
+error_report("%s: Failed to set state SAVING", vbasedev->name);
+return ret;
+}
+
+qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+ret = qemu_file_get_error(f);
+if (ret) {
+return ret;
+}
+
+return 0;
+}
+
+static void vfio_save_cleanup(void *opaque)
+{
+VFIODevice *vbasedev = opaque;
+VFIOMigration *migration = vbasedev->migration;
+
+if (migration->region.mmaps) {
+vfio_region_unmap(&migration->region);
+}



Are we in a different thread context here that we don't need that same
iothread locking?



qemu_savevm_state_setup() is called without holding iothread lock and 
qemu_savevm_state_cleanup() is called holding iothread lock, so we don't 
need lock here.





+trace_vfio_save_cleanup(vbasedev->name);
+}
+
+static SaveVMHandlers savevm_vfio_handlers = {
+.save_setup = vfio_save_setup,
+.save_cleanup = vfio_save_cleanup,
+};
+
+/* -- */
+
  static void vfio_vmstate_change(void *opaque, i

Re: [PATCH v27 05/17] vfio: Add VM state change handler to know state of VM

2020-10-22 Thread Kirti Wankhede




On 10/22/2020 10:05 PM, Alex Williamson wrote:

On Thu, 22 Oct 2020 16:41:55 +0530
Kirti Wankhede  wrote:


VM state change handler is called on change in VM's state. Based on
VM state, VFIO device state should be changed.
Added read/write helper functions for migration region.
Added function to set device_state.

Signed-off-by: Kirti Wankhede 
Reviewed-by: Neo Jia 
Reviewed-by: Dr. David Alan Gilbert 
---
  hw/vfio/migration.c   | 158 ++
  hw/vfio/trace-events  |   2 +
  include/hw/vfio/vfio-common.h |   4 ++
  3 files changed, 164 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 5f74a3ad1d72..34f39c7e2e28 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -10,6 +10,7 @@
  #include "qemu/osdep.h"
  #include <linux/vfio.h>
  
+#include "sysemu/runstate.h"

  #include "hw/vfio/vfio-common.h"
  #include "cpu.h"
  #include "migration/migration.h"
@@ -22,6 +23,157 @@
  #include "exec/ram_addr.h"
  #include "pci.h"
  #include "trace.h"
+#include "hw/hw.h"
+
+static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count,
+  off_t off, bool iswrite)
+{
+int ret;
+
+ret = iswrite ? pwrite(vbasedev->fd, val, count, off) :
+pread(vbasedev->fd, val, count, off);
+if (ret < count) {
+error_report("vfio_mig_%s %d byte %s: failed at offset 0x%lx, err: %s",
+ iswrite ? "write" : "read", count,
+ vbasedev->name, off, strerror(errno));
+return (ret < 0) ? ret : -EINVAL;
+}
+return 0;
+}
+
+static int vfio_mig_rw(VFIODevice *vbasedev, __u8 *buf, size_t count,
+   off_t off, bool iswrite)
+{
+int ret, done = 0;
+__u8 *tbuf = buf;
+
+while (count) {
+int bytes = 0;
+
+if (count >= 8 && !(off % 8)) {
+bytes = 8;
+} else if (count >= 4 && !(off % 4)) {
+bytes = 4;
+} else if (count >= 2 && !(off % 2)) {
+bytes = 2;
+} else {
+bytes = 1;
+}
+
+ret = vfio_mig_access(vbasedev, tbuf, bytes, off, iswrite);
+if (ret) {
+return ret;
+}
+
+count -= bytes;
+done += bytes;
+off += bytes;
+tbuf += bytes;
+}
+return done;
+}
+
+#define vfio_mig_read(f, v, c, o)   vfio_mig_rw(f, (__u8 *)v, c, o, false)
+#define vfio_mig_write(f, v, c, o)  vfio_mig_rw(f, (__u8 *)v, c, o, true)
+
+#define VFIO_MIG_STRUCT_OFFSET(f)   \
+ offsetof(struct vfio_device_migration_info, f)
+/*
+ * Change the device_state register for device @vbasedev. Bits set in @mask
+ * are preserved, bits set in @value are set, and bits not set in either @mask
+ * or @value are cleared in device_state. If the register cannot be accessed,
+ * the resulting state would be invalid, or the device enters an error state,
+ * an error is returned.
+ */
+
+static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
+uint32_t value)
+{
+VFIOMigration *migration = vbasedev->migration;
+VFIORegion *region = &migration->region;
+off_t dev_state_off = region->fd_offset +
+  VFIO_MIG_STRUCT_OFFSET(device_state);
+uint32_t device_state;
+int ret;
+
+ret = vfio_mig_read(vbasedev, &device_state, sizeof(device_state),
+dev_state_off);
+if (ret < 0) {
+return ret;
+}
+
+device_state = (device_state & mask) | value;
+
+if (!VFIO_DEVICE_STATE_VALID(device_state)) {
+return -EINVAL;
+}
+
+ret = vfio_mig_write(vbasedev, &device_state, sizeof(device_state),
+ dev_state_off);
+if (ret < 0) {
+int rret;
+
+rret = vfio_mig_read(vbasedev, &device_state, sizeof(device_state),
+ dev_state_off);
+
+if ((rret < 0) || (VFIO_DEVICE_STATE_IS_ERROR(device_state))) {
+hw_error("%s: Device in error state 0x%x", vbasedev->name,
+ device_state);
+return rret ? rret : -EIO;
+}
+return ret;
+}
+
+migration->device_state = device_state;
+trace_vfio_migration_set_state(vbasedev->name, device_state);
+return 0;
+}
+
+static void vfio_vmstate_change(void *opaque, int running, RunState state)
+{
+VFIODevice *vbasedev = opaque;
+VFIOMigration *migration = vbasedev->migration;
+uint32_t value, mask;
+int ret;
+
+if ((vbasedev->migration->vm_running == running)) {
+return;
+}
+
+if (running) {
+/*
+ * Here device state can have one of _SAVING, _RESUMING or _STOP bit.
+ * Transition from _SAVING to _

Re: [PATCH v27 04/17] vfio: Add migration region initialization and finalize function

2020-10-22 Thread Kirti Wankhede




On 10/22/2020 7:52 PM, Alex Williamson wrote:

On Thu, 22 Oct 2020 16:41:54 +0530
Kirti Wankhede  wrote:


Whether the VFIO device supports migration or not is decided based on
migration region query. If migration region query is successful and migration
region initialization is successful then migration is supported else
migration is blocked.

Signed-off-by: Kirti Wankhede 
Reviewed-by: Neo Jia 
Acked-by: Dr. David Alan Gilbert 
---
  hw/vfio/meson.build   |   1 +
  hw/vfio/migration.c   | 129 ++
  hw/vfio/trace-events  |   3 +
  include/hw/vfio/vfio-common.h |   9 +++
  4 files changed, 142 insertions(+)
  create mode 100644 hw/vfio/migration.c

diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
index 37efa74018bc..da9af297a0c5 100644
--- a/hw/vfio/meson.build
+++ b/hw/vfio/meson.build
@@ -2,6 +2,7 @@ vfio_ss = ss.source_set()
  vfio_ss.add(files(
'common.c',
'spapr.c',
+  'migration.c',
  ))
  vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
'display.c',
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
new file mode 100644
index ..5f74a3ad1d72
--- /dev/null
+++ b/hw/vfio/migration.c
@@ -0,0 +1,129 @@
+/*
+ * Migration support for VFIO devices
+ *
+ * Copyright NVIDIA, Inc. 2020
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include <linux/vfio.h>
+
+#include "hw/vfio/vfio-common.h"
+#include "cpu.h"
+#include "migration/migration.h"
+#include "migration/qemu-file.h"
+#include "migration/register.h"
+#include "migration/blocker.h"
+#include "migration/misc.h"
+#include "qapi/error.h"
+#include "exec/ramlist.h"
+#include "exec/ram_addr.h"
+#include "pci.h"
+#include "trace.h"
+
+static void vfio_migration_region_exit(VFIODevice *vbasedev)
+{
+VFIOMigration *migration = vbasedev->migration;
+
+if (!migration) {
+return;
+}
+
+if (migration->region.size) {
+vfio_region_exit(&migration->region);
+vfio_region_finalize(&migration->region);
+}
+}
+
+static int vfio_migration_init(VFIODevice *vbasedev,
+   struct vfio_region_info *info)
+{
+int ret;
+Object *obj;
+VFIOMigration *migration;
+
+if (!vbasedev->ops->vfio_get_object) {
+return -EINVAL;
+}
+
+obj = vbasedev->ops->vfio_get_object(vbasedev);
+if (!obj) {
+return -EINVAL;
+}
+
+migration = g_new0(VFIOMigration, 1);
+
+ret = vfio_region_setup(obj, vbasedev, &migration->region,
+info->index, "migration");
+if (ret) {
+error_report("%s: Failed to setup VFIO migration region %d: %s",
+ vbasedev->name, info->index, strerror(-ret));
+goto err;
+}
+
+if (!migration->region.size) {
+error_report("%s: Invalid zero-sized VFIO migration region %d",
+ vbasedev->name, info->index);
+ret = -EINVAL;
+goto err;
+}
+
+vbasedev->migration = migration;
+return 0;
+
+err:
+vfio_migration_region_exit(vbasedev);


We can't get here with vbasedev->migration set, did you intend to set
vbasedev->migration before testing region.size?  Thanks,

Oh yes, I missed to address this when I moved migration variable to 
VFIODevice. Moving vbasedev->migration before region.size check.


Also removing region.size check vfio_migration_region_exit() for 
vfio_region_exit() and vfio_region_finalize().


Thanks,
Kirti


Alex




+g_free(migration);
+return ret;
+}
+
+/* -- */
+
+int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
+{
+struct vfio_region_info *info = NULL;
+Error *local_err = NULL;
+int ret;
+
+ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_MIGRATION,
+   VFIO_REGION_SUBTYPE_MIGRATION, &info);
+if (ret) {
+goto add_blocker;
+}
+
+ret = vfio_migration_init(vbasedev, info);
+if (ret) {
+goto add_blocker;
+}
+
+trace_vfio_migration_probe(vbasedev->name, info->index);
+g_free(info);
+return 0;
+
+add_blocker:
+error_setg(&vbasedev->migration_blocker,
+   "VFIO device doesn't support migration");
+g_free(info);
+
+ret = migrate_add_blocker(vbasedev->migration_blocker, &local_err);
+if (local_err) {
+error_propagate(errp, local_err);
+error_free(vbasedev->migration_blocker);
+vbasedev->migration_blocker = NULL;
+}
+return ret;
+}
+
+void vfio_migration_finalize(VFIODevice *vbasedev)
+{
+if (vbasedev->migration_blocker) {
+migrate_del_blocker(vbasedev->migration_block

Re: [PATCH v27 03/17] vfio: Add save and load functions for VFIO PCI devices

2020-10-22 Thread Kirti Wankhede




On 10/22/2020 7:36 PM, Alex Williamson wrote:

On Thu, 22 Oct 2020 16:41:53 +0530
Kirti Wankhede  wrote:


Added functions to save and restore PCI device specific data,
specifically config space of PCI device.

Signed-off-by: Kirti Wankhede 
Reviewed-by: Neo Jia 
---
  hw/vfio/pci.c | 48 +++
  include/hw/vfio/vfio-common.h |  2 ++
  2 files changed, 50 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index bffd5bfe3b78..1036a5332772 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -41,6 +41,7 @@
  #include "trace.h"
  #include "qapi/error.h"
  #include "migration/blocker.h"
+#include "migration/qemu-file.h"
  
  #define TYPE_VFIO_PCI_NOHOTPLUG "vfio-pci-nohotplug"
  
@@ -2401,11 +2402,58 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev)

  return OBJECT(vdev);
  }
  
+static bool vfio_msix_enabled(void *opaque, int version_id)
+{
+PCIDevice *pdev = opaque;
+
+return msix_enabled(pdev);


Why msix_enabled() rather than msix_present()?  It seems that even if
MSI-X is not enabled at the point in time where this is called, there's
still emulated state in the vector table.  For example if the guest has
written the vectors but has not yet enabled the capability at the point
where we start a migration, this test might cause the guest on the
target to enable MSI-X with uninitialized data in the vector table.



You're correct. Changing it to check if present.


+}
+
+const VMStateDescription vmstate_vfio_pci_config = {
+.name = "VFIOPCIDevice",
+.version_id = 1,
+.minimum_version_id = 1,
+.fields = (VMStateField[]) {
+VMSTATE_PCI_DEVICE(pdev, VFIOPCIDevice),
+VMSTATE_MSIX_TEST(pdev, VFIOPCIDevice, vfio_msix_enabled),


MSI (not-X) state is entirely in config space, so doesn't need a
separate field, correct?



Yes.


Otherwise this looks quite a bit cleaner than previous version, I hope
VMState experts can confirm this is sufficiently extensible within the
migration framework.  Thanks,



Thanks,
Kirti


Alex


+VMSTATE_END_OF_LIST()
+}
+};
+
+static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
+{
+VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+
+vmstate_save_state(f, &vmstate_vfio_pci_config, vdev, NULL);
+}
+
+static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
+{
+VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+PCIDevice *pdev = &vdev->pdev;
+int ret;
+
+ret = vmstate_load_state(f, &vmstate_vfio_pci_config, vdev, 1);
+if (ret) {
+return ret;
+}
+
+if (msi_enabled(pdev)) {
+vfio_msi_enable(vdev);
+} else if (msix_enabled(pdev)) {
+vfio_msix_enable(vdev);
+}
+
+return ret;
+}
+
  static VFIODeviceOps vfio_pci_ops = {
  .vfio_compute_needs_reset = vfio_pci_compute_needs_reset,
  .vfio_hot_reset_multi = vfio_pci_hot_reset_multi,
  .vfio_eoi = vfio_intx_eoi,
  .vfio_get_object = vfio_pci_get_object,
+.vfio_save_config = vfio_pci_save_config,
+.vfio_load_config = vfio_pci_load_config,
  };
  
  int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp)

diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index fe99c36a693a..ba6169cd926e 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -120,6 +120,8 @@ struct VFIODeviceOps {
  int (*vfio_hot_reset_multi)(VFIODevice *vdev);
  void (*vfio_eoi)(VFIODevice *vdev);
  Object *(*vfio_get_object)(VFIODevice *vdev);
+void (*vfio_save_config)(VFIODevice *vdev, QEMUFile *f);
+int (*vfio_load_config)(VFIODevice *vdev, QEMUFile *f);
  };
  
  typedef struct VFIOGroup {






Re: [PATCH v26 05/17] vfio: Add VM state change handler to know state of VM

2020-10-22 Thread Kirti Wankhede




On 10/22/2020 1:21 PM, Cornelia Huck wrote:

On Wed, 21 Oct 2020 11:03:23 +0530
Kirti Wankhede  wrote:


On 10/20/2020 4:21 PM, Cornelia Huck wrote:

On Sun, 18 Oct 2020 01:54:56 +0530
Kirti Wankhede  wrote:
   

On 9/29/2020 4:33 PM, Dr. David Alan Gilbert wrote:

* Cornelia Huck (coh...@redhat.com) wrote:

On Wed, 23 Sep 2020 04:54:07 +0530
Kirti Wankhede  wrote:



+static void vfio_vmstate_change(void *opaque, int running, RunState state)
+{
+VFIODevice *vbasedev = opaque;
+
+if ((vbasedev->vm_running != running)) {
+int ret;
+uint32_t value = 0, mask = 0;
+
+if (running) {
+value = VFIO_DEVICE_STATE_RUNNING;
+if (vbasedev->device_state & VFIO_DEVICE_STATE_RESUMING) {
+mask = ~VFIO_DEVICE_STATE_RESUMING;


I've been staring at this for some time and I think that the desired
result is
- set _RUNNING
- if _RESUMING was set, clear it, but leave the other bits intact


Upto here, you're correct.
  

- if _RESUMING was not set, clear everything previously set
This would really benefit from a comment (or am I the only one
struggling here?)
 


Here mask should be ~0. Correcting it.


Hm, now I'm confused. With value == _RUNNING, ~_RUNNING and ~0 as mask
should be equivalent, shouldn't they?
   


I too got confused after reading your comment.
Lets walk through the device states and transitions can happen here:

if running
   - device state could be either _SAVING or _RESUMING or _STOP. Both
_SAVING and _RESUMING can't be set at a time, that is the error state.
_STOP means 0.
   - Transition from _SAVING to _RUNNING can happen if there is migration
failure, in that case we have to clear _SAVING
- Transition from _RESUMING to _RUNNING can happen on resuming and we
have to clear _RESUMING.
- In both the above cases, we have to set _RUNNING and clear rest 2 bits.
Then:
mask = ~VFIO_DEVICE_STATE_MASK;
value = VFIO_DEVICE_STATE_RUNNING;


ok



if !running
- device state could be either _RUNNING or _SAVING|_RUNNING. Here we
have to reset running bit.
Then:
mask = ~VFIO_DEVICE_STATE_RUNNING;
value = 0;


ok



I'll add comment in the code above.


That will help.

I'm a bit worried though that all that reasoning which flags are set or
cleared when is quite complex, and it's easy to make mistakes.

Can we model this as a FSM, where an event (running state changes)
transitions the device state from one state to another? I (personally)
find FSMs easier to comprehend, but I'm not sure whether that change
would be too invasive. If others can parse the state changes with that
mask/value interface, I won't object to it.



I agree FSM will be easy and for long term may be easy to maintain. But 
at this moment it will be intrusive change. For now we can go ahead with 
this code and later we can change to FSM model, if all agrees on it.


Thanks,
Kirti







  

+}
+} else {
+mask = ~VFIO_DEVICE_STATE_RUNNING;
+}






[PATCH v27 17/17] qapi: Add VFIO devices migration stats in Migration stats

2020-10-22 Thread Kirti Wankhede
Added the amount of bytes transferred to the VM at the destination by all VFIO
devices.

Signed-off-by: Kirti Wankhede 
Reviewed-by: Dr. David Alan Gilbert 
---
 hw/vfio/common.c| 20 
 hw/vfio/migration.c | 10 ++
 include/qemu/vfio-helpers.h |  3 +++
 migration/migration.c   | 14 ++
 monitor/hmp-cmds.c  |  6 ++
 qapi/migration.json | 17 +
 6 files changed, 70 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 9c879e5c0f62..8d0758eda9fa 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -39,6 +39,7 @@
 #include "trace.h"
 #include "qapi/error.h"
 #include "migration/migration.h"
+#include "qemu/vfio-helpers.h"
 
 VFIOGroupList vfio_group_list =
 QLIST_HEAD_INITIALIZER(vfio_group_list);
@@ -292,6 +293,25 @@ const MemoryRegionOps vfio_region_ops = {
  * Device state interfaces
  */
 
+bool vfio_mig_active(void)
+{
+VFIOGroup *group;
+VFIODevice *vbasedev;
+
if (QLIST_EMPTY(&vfio_group_list)) {
+return false;
+}
+
QLIST_FOREACH(group, &vfio_group_list, next) {
QLIST_FOREACH(vbasedev, &group->device_list, next) {
+if (vbasedev->migration_blocker) {
+return false;
+}
+}
+}
+return true;
+}
+
 static bool vfio_devices_all_stopped_and_saving(VFIOContainer *container)
 {
 VFIOGroup *group;
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 77ee60a43ea5..b23e21c6de2b 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -28,6 +28,7 @@
 #include "pci.h"
 #include "trace.h"
 #include "hw/hw.h"
+#include "qemu/vfio-helpers.h"
 
 /*
  * Flags to be used as unique delimiters for VFIO devices in the migration
@@ -45,6 +46,8 @@
 #define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
 #define VFIO_MIG_FLAG_DEV_DATA_STATE(0xffffffffef100004ULL)
 
+static int64_t bytes_transferred;
+
 static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count,
   off_t off, bool iswrite)
 {
@@ -255,6 +258,7 @@ static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev, uint64_t *size)
 *size = data_size;
 }
 
+bytes_transferred += data_size;
 return ret;
 }
 
@@ -776,6 +780,7 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data)
 case MIGRATION_STATUS_CANCELLING:
 case MIGRATION_STATUS_CANCELLED:
 case MIGRATION_STATUS_FAILED:
+bytes_transferred = 0;
 ret = vfio_migration_set_state(vbasedev,
   ~(VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING),
   VFIO_DEVICE_STATE_RUNNING);
@@ -862,6 +867,11 @@ err:
 
 /* -- */
 
+int64_t vfio_mig_bytes_transferred(void)
+{
+return bytes_transferred;
+}
+
 int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
 {
 VFIOContainer *container = vbasedev->group->container;
diff --git a/include/qemu/vfio-helpers.h b/include/qemu/vfio-helpers.h
index 4491c8e1a6e9..7f7a46e6ef2d 100644
--- a/include/qemu/vfio-helpers.h
+++ b/include/qemu/vfio-helpers.h
@@ -29,4 +29,7 @@ void qemu_vfio_pci_unmap_bar(QEMUVFIOState *s, int index, void *bar,
 int qemu_vfio_pci_init_irq(QEMUVFIOState *s, EventNotifier *e,
int irq_type, Error **errp);
 
+bool vfio_mig_active(void);
+int64_t vfio_mig_bytes_transferred(void);
+
 #endif
diff --git a/migration/migration.c b/migration/migration.c
index 0575ecb37953..8b2865d25ef4 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -56,6 +56,7 @@
 #include "net/announce.h"
 #include "qemu/queue.h"
 #include "multifd.h"
+#include "qemu/vfio-helpers.h"
 
 #define MAX_THROTTLE  (128 << 20)  /* Migration transfer speed throttling */
 
@@ -1002,6 +1003,17 @@ static void populate_disk_info(MigrationInfo *info)
 }
 }
 
+static void populate_vfio_info(MigrationInfo *info)
+{
+#ifdef CONFIG_LINUX
+if (vfio_mig_active()) {
+info->has_vfio = true;
+info->vfio = g_malloc0(sizeof(*info->vfio));
+info->vfio->transferred = vfio_mig_bytes_transferred();
+}
+#endif
+}
+
 static void fill_source_migration_info(MigrationInfo *info)
 {
 MigrationState *s = migrate_get_current();
@@ -1026,6 +1038,7 @@ static void fill_source_migration_info(MigrationInfo *info)
 populate_time_info(info, s);
 populate_ram_info(info, s);
 populate_disk_info(info);
+populate_vfio_info(info);
 break;
 case MIGRATION_STATUS_COLO:
 info->has_status = true;
@@ -1034,6 +1047,7 @@ static void fill_source_migration_info(MigrationInfo *info)
 case MIGRATION_STATUS_COMPLETED:
 populate_time_info(info, s);
 populate_ram_info(info, s);
+populate_vfio

[PATCH v27 14/17] vfio: Dirty page tracking when vIOMMU is enabled

2020-10-22 Thread Kirti Wankhede
When vIOMMU is enabled, register MAP notifier from log_sync when all
devices in container are in stop and copy phase of migration. Call replay
and get dirty pages from notifier callback.

Suggested-by: Alex Williamson 
Signed-off-by: Kirti Wankhede 
---
 hw/vfio/common.c  | 95 ---
 hw/vfio/trace-events  |  1 +
 include/hw/vfio/vfio-common.h |  1 +
 3 files changed, 91 insertions(+), 6 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 2634387df948..98c2b1f9b190 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -442,8 +442,8 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
 }
 
 /* Called with rcu_read_lock held.  */
-static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
-   bool *read_only)
+static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
+   ram_addr_t *ram_addr, bool *read_only)
 {
 MemoryRegion *mr;
 hwaddr xlat;
@@ -474,8 +474,17 @@ static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
 return false;
 }
 
-*vaddr = memory_region_get_ram_ptr(mr) + xlat;
-*read_only = !writable || mr->readonly;
+if (vaddr) {
+*vaddr = memory_region_get_ram_ptr(mr) + xlat;
+}
+
+if (ram_addr) {
+*ram_addr = memory_region_get_ram_addr(mr) + xlat;
+}
+
+if (read_only) {
+*read_only = !writable || mr->readonly;
+}
 
 return true;
 }
@@ -485,7 +494,6 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
 VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
 VFIOContainer *container = giommu->container;
 hwaddr iova = iotlb->iova + giommu->iommu_offset;
-bool read_only;
 void *vaddr;
 int ret;
 
@@ -501,7 +509,9 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
 rcu_read_lock();
 
 if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
-if (!vfio_get_vaddr(iotlb, &vaddr, &read_only)) {
+bool read_only;
+
+if (!vfio_get_xlat_addr(iotlb, &vaddr, NULL, &read_only)) {
 goto out;
 }
 /*
@@ -899,11 +909,84 @@ err_out:
 return ret;
 }
 
+static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
+{
+VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, dirty_notify);
+VFIOContainer *container = giommu->container;
+hwaddr iova = iotlb->iova + giommu->iommu_offset;
+ram_addr_t translated_addr;
+
+trace_vfio_iommu_map_dirty_notify(iova, iova + iotlb->addr_mask);
+
+if (iotlb->target_as != &address_space_memory) {
+error_report("Wrong target AS \"%s\", only system memory is allowed",
+ iotlb->target_as->name ? iotlb->target_as->name : "none");
+return;
+}
+
+rcu_read_lock();
+
+if (vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL)) {
+int ret;
+
+ret = vfio_get_dirty_bitmap(container, iova, iotlb->addr_mask + 1,
+translated_addr);
+if (ret) {
+error_report("vfio_iommu_map_dirty_notify(%p, 0x%"HWADDR_PRIx", "
+ "0x%"HWADDR_PRIx") = %d (%m)",
+ container, iova,
+ iotlb->addr_mask + 1, ret);
+}
+}
+
+rcu_read_unlock();
+}
+
 static int vfio_sync_dirty_bitmap(VFIOContainer *container,
   MemoryRegionSection *section)
 {
 ram_addr_t ram_addr;
 
+if (memory_region_is_iommu(section->mr)) {
+VFIOGuestIOMMU *giommu;
+int ret = 0;
+
+QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
+if (MEMORY_REGION(giommu->iommu) == section->mr &&
+giommu->n.start == section->offset_within_region) {
+Int128 llend;
+Error *err = NULL;
+int idx = memory_region_iommu_attrs_to_index(giommu->iommu,
+   MEMTXATTRS_UNSPECIFIED);
+
+llend = int128_add(int128_make64(section->offset_within_region),
+   section->size);
+llend = int128_sub(llend, int128_one());
+
+iommu_notifier_init(&giommu->dirty_notify,
+vfio_iommu_map_dirty_notify,
+IOMMU_NOTIFIER_MAP,
+section->offset_within_region,
+int128_get64(llend),
+idx);
+ret = memory_region_register_iommu_notifier(section->mr,
+  &giommu->dirty_notify, &err);
+if (ret) {
+error_report_err(err);

[PATCH v27 05/17] vfio: Add VM state change handler to know state of VM

2020-10-22 Thread Kirti Wankhede
VM state change handler is called on change in VM's state. Based on
VM state, VFIO device state should be changed.
Added read/write helper functions for migration region.
Added function to set device_state.

Signed-off-by: Kirti Wankhede 
Reviewed-by: Neo Jia 
Reviewed-by: Dr. David Alan Gilbert 
---
 hw/vfio/migration.c   | 158 ++
 hw/vfio/trace-events  |   2 +
 include/hw/vfio/vfio-common.h |   4 ++
 3 files changed, 164 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 5f74a3ad1d72..34f39c7e2e28 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -10,6 +10,7 @@
 #include "qemu/osdep.h"
 #include <linux/vfio.h>
 
+#include "sysemu/runstate.h"
 #include "hw/vfio/vfio-common.h"
 #include "cpu.h"
 #include "migration/migration.h"
@@ -22,6 +23,157 @@
 #include "exec/ram_addr.h"
 #include "pci.h"
 #include "trace.h"
+#include "hw/hw.h"
+
+static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count,
+  off_t off, bool iswrite)
+{
+int ret;
+
+ret = iswrite ? pwrite(vbasedev->fd, val, count, off) :
+pread(vbasedev->fd, val, count, off);
+if (ret < count) {
+error_report("vfio_mig_%s %d byte %s: failed at offset 0x%lx, err: %s",
+ iswrite ? "write" : "read", count,
+ vbasedev->name, off, strerror(errno));
+return (ret < 0) ? ret : -EINVAL;
+}
+return 0;
+}
+
+static int vfio_mig_rw(VFIODevice *vbasedev, __u8 *buf, size_t count,
+   off_t off, bool iswrite)
+{
+int ret, done = 0;
+__u8 *tbuf = buf;
+
+while (count) {
+int bytes = 0;
+
+if (count >= 8 && !(off % 8)) {
+bytes = 8;
+} else if (count >= 4 && !(off % 4)) {
+bytes = 4;
+} else if (count >= 2 && !(off % 2)) {
+bytes = 2;
+} else {
+bytes = 1;
+}
+
+ret = vfio_mig_access(vbasedev, tbuf, bytes, off, iswrite);
+if (ret) {
+return ret;
+}
+
+count -= bytes;
+done += bytes;
+off += bytes;
+tbuf += bytes;
+}
+return done;
+}
+
+#define vfio_mig_read(f, v, c, o)   vfio_mig_rw(f, (__u8 *)v, c, o, false)
+#define vfio_mig_write(f, v, c, o)  vfio_mig_rw(f, (__u8 *)v, c, o, true)
+
+#define VFIO_MIG_STRUCT_OFFSET(f)   \
+ offsetof(struct vfio_device_migration_info, f)
+/*
+ * Change the device_state register for device @vbasedev. Bits set in @mask
+ * are preserved, bits set in @value are set, and bits not set in either @mask
+ * or @value are cleared in device_state. If the register cannot be accessed,
+ * the resulting state would be invalid, or the device enters an error state,
+ * an error is returned.
+ */
+
+static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
+uint32_t value)
+{
+VFIOMigration *migration = vbasedev->migration;
+VFIORegion *region = &migration->region;
+off_t dev_state_off = region->fd_offset +
+  VFIO_MIG_STRUCT_OFFSET(device_state);
+uint32_t device_state;
+int ret;
+
+ret = vfio_mig_read(vbasedev, &device_state, sizeof(device_state),
+dev_state_off);
+if (ret < 0) {
+return ret;
+}
+
+device_state = (device_state & mask) | value;
+
+if (!VFIO_DEVICE_STATE_VALID(device_state)) {
+return -EINVAL;
+}
+
+ret = vfio_mig_write(vbasedev, &device_state, sizeof(device_state),
+ dev_state_off);
+if (ret < 0) {
+int rret;
+
+rret = vfio_mig_read(vbasedev, &device_state, sizeof(device_state),
+ dev_state_off);
+
+if ((rret < 0) || (VFIO_DEVICE_STATE_IS_ERROR(device_state))) {
+hw_error("%s: Device in error state 0x%x", vbasedev->name,
+ device_state);
+return rret ? rret : -EIO;
+}
+return ret;
+}
+
+migration->device_state = device_state;
+trace_vfio_migration_set_state(vbasedev->name, device_state);
+return 0;
+}
+
+static void vfio_vmstate_change(void *opaque, int running, RunState state)
+{
+VFIODevice *vbasedev = opaque;
+VFIOMigration *migration = vbasedev->migration;
+uint32_t value, mask;
+int ret;
+
+if ((vbasedev->migration->vm_running == running)) {
+return;
+}
+
+if (running) {
+/*
+ * Here device state can have one of _SAVING, _RESUMING or _STOP bit.
+ * Transition from _SAVING to _RUNNING can happen if there is migration
+ * failure, in that case clear _SAVING bit.
+ * Transition from _RE

[PATCH v27 15/17] vfio: Add ioctl to get dirty pages bitmap during dma unmap

2020-10-22 Thread Kirti Wankhede
With vIOMMU, an IO virtual address range can get unmapped while in the
pre-copy phase of migration. In that case, the unmap ioctl should return the
pages pinned in that range, and QEMU should find their corresponding guest
physical addresses and report those dirty.
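
The bitmap sizing used by vfio_dma_unmap_bitmap() below (and by
vfio_get_dirty_bitmap() in the log_sync patch) can be checked in isolation.
This is a sketch; dirty_bitmap_bytes() is a hypothetical helper name, and the
arithmetic mirrors the ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
BITS_PER_BYTE computation in the patch:

```c
#include <stdint.h>

/* One bit per TARGET_PAGE_SIZE page; the buffer is rounded up to a whole
 * number of 64-bit words because the kernel fills it in __u64 chunks. */
static uint64_t dirty_bitmap_bytes(uint64_t size, uint64_t page_size)
{
    uint64_t pages = (size + page_size - 1) / page_size; /* TARGET_PAGE_ALIGN */
    uint64_t bits = (pages + 63) / 64 * 64;              /* ROUND_UP(pages, 64) */
    return bits / 8;                                     /* bits -> bytes */
}
```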

Suggested-by: Alex Williamson 
Signed-off-by: Kirti Wankhede 
Reviewed-by: Neo Jia 
---
 hw/vfio/common.c | 96 +---
 1 file changed, 92 insertions(+), 4 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 98c2b1f9b190..9c879e5c0f62 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -321,11 +321,94 @@ static bool vfio_devices_all_stopped_and_saving(VFIOContainer *container)
 return true;
 }
 
+static bool vfio_devices_all_running_and_saving(VFIOContainer *container)
+{
+VFIOGroup *group;
+VFIODevice *vbasedev;
+MigrationState *ms = migrate_get_current();
+
+if (!migration_is_setup_or_active(ms->state)) {
+return false;
+}
+
+QLIST_FOREACH(group, &container->group_list, container_next) {
+QLIST_FOREACH(vbasedev, &group->device_list, next) {
+VFIOMigration *migration = vbasedev->migration;
+
+if (!migration) {
+return false;
+}
+
+if ((migration->device_state & VFIO_DEVICE_STATE_SAVING) &&
+(migration->device_state & VFIO_DEVICE_STATE_RUNNING)) {
+continue;
+} else {
+return false;
+}
+}
+}
+return true;
+}
+
+static int vfio_dma_unmap_bitmap(VFIOContainer *container,
+ hwaddr iova, ram_addr_t size,
+ IOMMUTLBEntry *iotlb)
+{
+struct vfio_iommu_type1_dma_unmap *unmap;
+struct vfio_bitmap *bitmap;
+uint64_t pages = TARGET_PAGE_ALIGN(size) >> TARGET_PAGE_BITS;
+int ret;
+
+unmap = g_malloc0(sizeof(*unmap) + sizeof(*bitmap));
+
+unmap->argsz = sizeof(*unmap) + sizeof(*bitmap);
+unmap->iova = iova;
+unmap->size = size;
+unmap->flags |= VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP;
+bitmap = (struct vfio_bitmap *)&unmap->data;
+
+/*
+ * cpu_physical_memory_set_dirty_lebitmap() expects pages in bitmap of
+ * TARGET_PAGE_SIZE to mark those dirty. Hence set bitmap_pgsize to
+ * TARGET_PAGE_SIZE.
+ */
+
+bitmap->pgsize = TARGET_PAGE_SIZE;
+bitmap->size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
+   BITS_PER_BYTE;
+
+if (bitmap->size > container->max_dirty_bitmap_size) {
+error_report("UNMAP: Size of bitmap too big 0x%llx", bitmap->size);
+ret = -E2BIG;
+goto unmap_exit;
+}
+
+bitmap->data = g_try_malloc0(bitmap->size);
+if (!bitmap->data) {
+ret = -ENOMEM;
+goto unmap_exit;
+}
+
+ret = ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, unmap);
+if (!ret) {
+cpu_physical_memory_set_dirty_lebitmap((uint64_t *)bitmap->data,
+iotlb->translated_addr, pages);
+} else {
+error_report("VFIO_UNMAP_DMA with DIRTY_BITMAP : %m");
+}
+
+g_free(bitmap->data);
+unmap_exit:
+g_free(unmap);
+return ret;
+}
+
 /*
  * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
  */
 static int vfio_dma_unmap(VFIOContainer *container,
-  hwaddr iova, ram_addr_t size)
+  hwaddr iova, ram_addr_t size,
+  IOMMUTLBEntry *iotlb)
 {
 struct vfio_iommu_type1_dma_unmap unmap = {
 .argsz = sizeof(unmap),
@@ -334,6 +417,11 @@ static int vfio_dma_unmap(VFIOContainer *container,
 .size = size,
 };
 
+if (iotlb && container->dirty_pages_supported &&
+vfio_devices_all_running_and_saving(container)) {
+return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
+}
+
 while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
 /*
  * The type1 backend has an off-by-one bug in the kernel (71a7d3d78e3c
@@ -381,7 +469,7 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr 
iova,
  * the VGA ROM space.
  */
 if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 ||
-(errno == EBUSY && vfio_dma_unmap(container, iova, size) == 0 &&
+(errno == EBUSY && vfio_dma_unmap(container, iova, size, NULL) == 0 &&
  ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) {
 return 0;
 }
@@ -531,7 +619,7 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
  iotlb->addr_mask + 1, vaddr, ret);
 }
 } else {
-ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1);
+ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1, iotlb);
 if (ret) {

[PATCH v27 13/17] vfio: Add vfio_listener_log_sync to mark dirty pages

2020-10-22 Thread Kirti Wankhede
vfio_listener_log_sync gets the list of dirty pages from the container using
the VFIO_IOMMU_GET_DIRTY_BITMAP ioctl and marks those pages dirty when all
devices are stopped and saving state.
Return early for the RAM block section of a mapped MMIO region.
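
The "all devices stopped and saving" predicate can be sketched with plain
state words. The bit values below match the v1 migration region API in
<linux/vfio.h>; stopped_and_saving() is a hypothetical stand-in for
vfio_devices_all_stopped_and_saving():

```c
#include <stdbool.h>
#include <stdint.h>

#define VFIO_DEVICE_STATE_RUNNING (1u << 0)
#define VFIO_DEVICE_STATE_SAVING  (1u << 1)

/* True only when every device is in _SAVING with _RUNNING cleared,
 * i.e. the stop-and-copy phase; log_sync reports dirty pages then. */
static bool stopped_and_saving(const uint32_t *states, int n)
{
    for (int i = 0; i < n; i++) {
        bool saving = states[i] & VFIO_DEVICE_STATE_SAVING;
        bool running = states[i] & VFIO_DEVICE_STATE_RUNNING;

        if (!(saving && !running)) {
            return false;
        }
    }
    return true;
}
```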

Signed-off-by: Kirti Wankhede 
Reviewed-by: Neo Jia 
---
 hw/vfio/common.c | 116 +++
 hw/vfio/trace-events |   1 +
 2 files changed, 117 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index d4959c036dd1..2634387df948 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -29,6 +29,7 @@
 #include "hw/vfio/vfio.h"
 #include "exec/address-spaces.h"
 #include "exec/memory.h"
+#include "exec/ram_addr.h"
 #include "hw/hw.h"
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
@@ -37,6 +38,7 @@
 #include "sysemu/reset.h"
 #include "trace.h"
 #include "qapi/error.h"
+#include "migration/migration.h"
 
 VFIOGroupList vfio_group_list =
 QLIST_HEAD_INITIALIZER(vfio_group_list);
@@ -287,6 +289,39 @@ const MemoryRegionOps vfio_region_ops = {
 };
 
 /*
+ * Device state interfaces
+ */
+
+static bool vfio_devices_all_stopped_and_saving(VFIOContainer *container)
+{
+VFIOGroup *group;
+VFIODevice *vbasedev;
+MigrationState *ms = migrate_get_current();
+
+if (!migration_is_setup_or_active(ms->state)) {
+return false;
+}
+
+QLIST_FOREACH(group, &container->group_list, container_next) {
+QLIST_FOREACH(vbasedev, &group->device_list, next) {
+VFIOMigration *migration = vbasedev->migration;
+
+if (!migration) {
+return false;
+}
+
+if ((migration->device_state & VFIO_DEVICE_STATE_SAVING) &&
+!(migration->device_state & VFIO_DEVICE_STATE_RUNNING)) {
+continue;
+} else {
+return false;
+}
+}
+}
+return true;
+}
+
+/*
  * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
  */
 static int vfio_dma_unmap(VFIOContainer *container,
@@ -812,9 +847,90 @@ static void vfio_listener_region_del(MemoryListener *listener,
 }
 }
 
+static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
+ uint64_t size, ram_addr_t ram_addr)
+{
+struct vfio_iommu_type1_dirty_bitmap *dbitmap;
+struct vfio_iommu_type1_dirty_bitmap_get *range;
+uint64_t pages;
+int ret;
+
+dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range));
+
+dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range);
+dbitmap->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
+range = (struct vfio_iommu_type1_dirty_bitmap_get *)&dbitmap->data;
+range->iova = iova;
+range->size = size;
+
+/*
+ * cpu_physical_memory_set_dirty_lebitmap() expects pages in bitmap of
+ * TARGET_PAGE_SIZE to mark those dirty. Hence set bitmap's pgsize to
+ * TARGET_PAGE_SIZE.
+ */
+range->bitmap.pgsize = TARGET_PAGE_SIZE;
+
+pages = TARGET_PAGE_ALIGN(range->size) >> TARGET_PAGE_BITS;
+range->bitmap.size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
+ BITS_PER_BYTE;
+range->bitmap.data = g_try_malloc0(range->bitmap.size);
+if (!range->bitmap.data) {
+ret = -ENOMEM;
+goto err_out;
+}
+
+ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap);
+if (ret) {
+error_report("Failed to get dirty bitmap for iova: 0x%llx "
+"size: 0x%llx err: %d",
+range->iova, range->size, errno);
+goto err_out;
+}
+
+cpu_physical_memory_set_dirty_lebitmap((uint64_t *)range->bitmap.data,
+ram_addr, pages);
+
+trace_vfio_get_dirty_bitmap(container->fd, range->iova, range->size,
+range->bitmap.size, ram_addr);
+err_out:
+g_free(range->bitmap.data);
+g_free(dbitmap);
+
+return ret;
+}
+
+static int vfio_sync_dirty_bitmap(VFIOContainer *container,
+  MemoryRegionSection *section)
+{
+ram_addr_t ram_addr;
+
+ram_addr = memory_region_get_ram_addr(section->mr) +
+   section->offset_within_region;
+
+return vfio_get_dirty_bitmap(container,
+   TARGET_PAGE_ALIGN(section->offset_within_address_space),
+   int128_get64(section->size), ram_addr);
+}
+
+static void vfio_listerner_log_sync(MemoryListener *listener,
+MemoryRegionSection *section)
+{
+VFIOContainer *container = container_of(listener, VFIOContainer, listener);
+
+if (vfio_listener_skipped_section(section) ||
+!container->dirty_pages_supported) {
+return;

[PATCH v27 10/17] memory: Set DIRTY_MEMORY_MIGRATION when IOMMU is enabled

2020-10-22 Thread Kirti Wankhede
mr->ram_block is NULL when mr->is_iommu is true, so fr.dirty_log_mask
wasn't set correctly, due to which the memory listener's log_sync doesn't
get called.
This patch returns log_mask with DIRTY_MEMORY_MIGRATION set when
IOMMU is enabled.
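
The one-line fix can be read as a truth table. The sketch below uses booleans
in place of the MemoryRegion fields, with DIRTY_MEMORY_MIGRATION's value taken
from QEMU's memory headers; dirty_log_mask() here is a hypothetical stand-in
for memory_region_get_dirty_log_mask():

```c
#include <stdbool.h>
#include <stdint.h>

#define DIRTY_MEMORY_MIGRATION 2 /* bit index, as in QEMU */

/* IOMMU regions have mr->ram_block == NULL, so before this patch the
 * migration bit was never set for them and log_sync was never called. */
static uint8_t dirty_log_mask(uint8_t mask, bool global_dirty_log,
                              bool has_ram_block, bool is_iommu)
{
    if (global_dirty_log && (has_ram_block || is_iommu)) {
        mask |= 1 << DIRTY_MEMORY_MIGRATION;
    }
    return mask;
}
```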

Signed-off-by: Kirti Wankhede 
---
 softmmu/memory.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/softmmu/memory.c b/softmmu/memory.c
index 403ff3abc99b..94f606e9d9d9 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -1792,7 +1792,7 @@ bool memory_region_is_ram_device(MemoryRegion *mr)
 uint8_t memory_region_get_dirty_log_mask(MemoryRegion *mr)
 {
 uint8_t mask = mr->dirty_log_mask;
-if (global_dirty_log && mr->ram_block) {
+if (global_dirty_log && (mr->ram_block || memory_region_is_iommu(mr))) {
 mask |= (1 << DIRTY_MEMORY_MIGRATION);
 }
 return mask;
-- 
2.7.0




[PATCH v27 07/17] vfio: Register SaveVMHandlers for VFIO device

2020-10-22 Thread Kirti Wankhede
Define flags to be used as delimiters in the migration stream for VFIO devices.
Added .save_setup and .save_cleanup functions. Map and unmap the migration
region from these functions at the source during the saving or pre-copy phase.

Set VFIO device state depending on VM's state. During live migration, VM is
running when .save_setup is called, _SAVING | _RUNNING state is set for VFIO
device. During save-restore, VM is paused, _SAVING state is set for VFIO device.
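
The delimiter layout described in the comment below (all-ones upper 32 bits, a
16-bit 0xef10 magic, 16 bits of flags) can be checked by composing a marker
from its parts; vfio_mig_flag() is a hypothetical helper, not part of the
patch:

```c
#include <stdint.h>

/* Compose a stream delimiter: 0xffffffff MSB | 0xef10 magic | 16-bit flag. */
static uint64_t vfio_mig_flag(uint16_t flag)
{
    return (0xffffffffULL << 32) | (0xef10ULL << 16) | flag;
}
```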

Signed-off-by: Kirti Wankhede 
Reviewed-by: Neo Jia 
---
 hw/vfio/migration.c  | 96 
 hw/vfio/trace-events |  2 ++
 2 files changed, 98 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 7c4fa0d08ea6..2e1054bf7f43 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -8,12 +8,15 @@
  */
 
 #include "qemu/osdep.h"
+#include "qemu/main-loop.h"
+#include "qemu/cutils.h"
 #include <linux/vfio.h>
 
 #include "sysemu/runstate.h"
 #include "hw/vfio/vfio-common.h"
 #include "cpu.h"
 #include "migration/migration.h"
+#include "migration/vmstate.h"
 #include "migration/qemu-file.h"
 #include "migration/register.h"
 #include "migration/blocker.h"
@@ -25,6 +28,22 @@
 #include "trace.h"
 #include "hw/hw.h"
 
+/*
+ * Flags to be used as unique delimiters for VFIO devices in the migration
+ * stream. These flags are composed as:
+ * 0xffffffff => MSB 32-bit all 1s
+ * 0xef10 => Magic ID, represents emulated (virtual) function IO
+ * 0x0000 => 16-bits reserved for flags
+ *
+ * The beginning of state information is marked by _DEV_CONFIG_STATE,
+ * _DEV_SETUP_STATE, or _DEV_DATA_STATE, respectively. The end of a
+ * certain state information is marked by _END_OF_STATE.
+ */
+#define VFIO_MIG_FLAG_END_OF_STATE  (0xffffffffef100001ULL)
+#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
+#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
+#define VFIO_MIG_FLAG_DEV_DATA_STATE(0xffffffffef100004ULL)
+
 static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count,
   off_t off, bool iswrite)
 {
@@ -129,6 +148,69 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
 return 0;
 }
 
+/* -- */
+
+static int vfio_save_setup(QEMUFile *f, void *opaque)
+{
+VFIODevice *vbasedev = opaque;
+VFIOMigration *migration = vbasedev->migration;
+int ret;
+
+trace_vfio_save_setup(vbasedev->name);
+
+qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
+
+if (migration->region.mmaps) {
+/*
+ * vfio_region_mmap() called from migration thread. Memory API called
+ * from vfio_region_mmap() needs it when called from outside the main loop
+ * thread.
+ */
+qemu_mutex_lock_iothread();
+ret = vfio_region_mmap(&migration->region);
+qemu_mutex_unlock_iothread();
+if (ret) {
+error_report("%s: Failed to mmap VFIO migration region: %s",
+ vbasedev->name, strerror(-ret));
+error_report("%s: Falling back to slow path", vbasedev->name);
+}
+}
+
+ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_MASK,
+   VFIO_DEVICE_STATE_SAVING);
+if (ret) {
+error_report("%s: Failed to set state SAVING", vbasedev->name);
+return ret;
+}
+
+qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+ret = qemu_file_get_error(f);
+if (ret) {
+return ret;
+}
+
+return 0;
+}
+
+static void vfio_save_cleanup(void *opaque)
+{
+VFIODevice *vbasedev = opaque;
+VFIOMigration *migration = vbasedev->migration;
+
+if (migration->region.mmaps) {
+vfio_region_unmap(&migration->region);
+}
+trace_vfio_save_cleanup(vbasedev->name);
+}
+
+static SaveVMHandlers savevm_vfio_handlers = {
+.save_setup = vfio_save_setup,
+.save_cleanup = vfio_save_cleanup,
+};
+
+/* -- */
+
 static void vfio_vmstate_change(void *opaque, int running, RunState state)
 {
 VFIODevice *vbasedev = opaque;
@@ -219,6 +301,8 @@ static int vfio_migration_init(VFIODevice *vbasedev,
 int ret;
 Object *obj;
 VFIOMigration *migration;
+char id[256] = "";
+g_autofree char *path = NULL, *oid;
 
 if (!vbasedev->ops->vfio_get_object) {
 return -EINVAL;
@@ -248,6 +332,18 @@ static int vfio_migration_init(VFIODevice *vbasedev,
 
 vbasedev->migration = migration;
 migration->vbasedev = vbasedev;
+
+oid = vmstate_if_get_id(VMSTATE_IF(DEVICE(obj)));
+if (oid) {
+path = g_strdup_printf("%s/vfio", oid);

[PATCH v27 16/17] vfio: Make vfio-pci device migration capable

2020-10-22 Thread Kirti Wankhede
If the device is not a failover primary device, call vfio_migration_probe()
to enable migration support for those devices that support it, and
vfio_migration_finalize() to tear it down again.
Removed the migration blocker from the VFIO PCI device specific structure and
use the migration blocker from the generic VFIO device structure.

Signed-off-by: Kirti Wankhede 
Reviewed-by: Neo Jia 
Reviewed-by: Dr. David Alan Gilbert 
Reviewed-by: Cornelia Huck 
---
 hw/vfio/pci.c | 28 
 hw/vfio/pci.h |  1 -
 2 files changed, 8 insertions(+), 21 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 1036a5332772..c67fb4cced8e 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2788,17 +2788,6 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
 return;
 }
 
-if (!pdev->failover_pair_id) {
-error_setg(&vdev->migration_blocker,
-"VFIO device doesn't support migration");
-ret = migrate_add_blocker(vdev->migration_blocker, errp);
-if (ret) {
-error_free(vdev->migration_blocker);
-vdev->migration_blocker = NULL;
-return;
-}
-}
-
 vdev->vbasedev.name = g_path_get_basename(vdev->vbasedev.sysfsdev);
 vdev->vbasedev.ops = &vfio_pci_ops;
 vdev->vbasedev.type = VFIO_DEVICE_TYPE_PCI;
@@ -3066,6 +3055,13 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
 }
 }
 
+if (!pdev->failover_pair_id) {
+ret = vfio_migration_probe(&vdev->vbasedev, errp);
+if (ret) {
+error_report("%s: Migration disabled", vdev->vbasedev.name);
+}
+}
+
 vfio_register_err_notifier(vdev);
 vfio_register_req_notifier(vdev);
 vfio_setup_resetfn_quirk(vdev);
@@ -3080,11 +3076,6 @@ out_teardown:
 vfio_bars_exit(vdev);
 error:
 error_prepend(errp, VFIO_MSG_PREFIX, vdev->vbasedev.name);
-if (vdev->migration_blocker) {
-migrate_del_blocker(vdev->migration_blocker);
-error_free(vdev->migration_blocker);
-vdev->migration_blocker = NULL;
-}
 }
 
 static void vfio_instance_finalize(Object *obj)
@@ -3096,10 +3087,6 @@ static void vfio_instance_finalize(Object *obj)
 vfio_bars_finalize(vdev);
 g_free(vdev->emulated_config_bits);
 g_free(vdev->rom);
-if (vdev->migration_blocker) {
-migrate_del_blocker(vdev->migration_blocker);
-error_free(vdev->migration_blocker);
-}
 /*
  * XXX Leaking igd_opregion is not an oversight, we can't remove the
  * fw_cfg entry therefore leaking this allocation seems like the safest
@@ -3127,6 +3114,7 @@ static void vfio_exitfn(PCIDevice *pdev)
 }
 vfio_teardown_msi(vdev);
 vfio_bars_exit(vdev);
+vfio_migration_finalize(&vdev->vbasedev);
 }
 
 static void vfio_pci_reset(DeviceState *dev)
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index bce71a9ac93f..1574ef983f8f 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -172,7 +172,6 @@ struct VFIOPCIDevice {
 bool no_vfio_ioeventfd;
 bool enable_ramfb;
 VFIODisplay *dpy;
-Error *migration_blocker;
 Notifier irqchip_change_notifier;
 };
 
-- 
2.7.0




[PATCH v27 12/17] vfio: Add function to start and stop dirty pages tracking

2020-10-22 Thread Kirti Wankhede
Call VFIO_IOMMU_DIRTY_PAGES ioctl to start and stop dirty pages tracking
for VFIO devices.

Signed-off-by: Kirti Wankhede 
Reviewed-by: Dr. David Alan Gilbert 
---
 hw/vfio/migration.c | 36 
 1 file changed, 36 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index ea5e0f1b8489..77ee60a43ea5 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -11,6 +11,7 @@
 #include "qemu/main-loop.h"
 #include "qemu/cutils.h"
 #include <linux/vfio.h>
+#include <sys/ioctl.h>
 
 #include "sysemu/runstate.h"
 #include "hw/vfio/vfio-common.h"
@@ -391,6 +392,34 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
 return qemu_file_get_error(f);
 }
 
+static int vfio_set_dirty_page_tracking(VFIODevice *vbasedev, bool start)
+{
+int ret;
+VFIOMigration *migration = vbasedev->migration;
+VFIOContainer *container = vbasedev->group->container;
+struct vfio_iommu_type1_dirty_bitmap dirty = {
+.argsz = sizeof(dirty),
+};
+
+if (start) {
+if (migration->device_state & VFIO_DEVICE_STATE_SAVING) {
+dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START;
+} else {
+return -EINVAL;
+}
+} else {
+dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP;
+}
+
+ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, &dirty);
+if (ret) {
+error_report("Failed to set dirty tracking flag 0x%x errno: %d",
+ dirty.flags, errno);
+return -errno;
+}
+return ret;
+}
+
 /* -- */
 
 static int vfio_save_setup(QEMUFile *f, void *opaque)
@@ -426,6 +455,11 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
 return ret;
 }
 
+ret = vfio_set_dirty_page_tracking(vbasedev, true);
+if (ret) {
+return ret;
+}
+
 qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
 
 ret = qemu_file_get_error(f);
@@ -441,6 +475,8 @@ static void vfio_save_cleanup(void *opaque)
 VFIODevice *vbasedev = opaque;
 VFIOMigration *migration = vbasedev->migration;
 
+vfio_set_dirty_page_tracking(vbasedev, false);
+
 if (migration->region.mmaps) {
 vfio_region_unmap(&migration->region);
 }
-- 
2.7.0




[PATCH v27 11/17] vfio: Get migration capability flags for container

2020-10-22 Thread Kirti Wankhede
Added helper functions to get IOMMU info capability chain.
Added function to get migration capability information from that
capability chain for IOMMU container.

Similar change was proposed earlier:
https://lists.gnu.org/archive/html/qemu-devel/2018-05/msg03759.html

Disable migration for devices if IOMMU module doesn't support migration
capability.
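
The capability-chain walk added here can be sketched against a small in-memory
buffer. This is an equivalent formulation of vfio_get_iommu_info_cap() under
the <linux/vfio.h> header layout; find_cap() and cap_header are illustrative
names:

```c
#include <stddef.h>
#include <stdint.h>

/* Layout of struct vfio_info_cap_header from <linux/vfio.h>. */
struct cap_header {
    uint16_t id;
    uint16_t version;
    uint32_t next; /* byte offset of the next capability; 0 ends the chain */
};

/* Follow the offsets from the start of the info buffer until the wanted
 * id is found or the chain ends. */
static struct cap_header *find_cap(void *base, uint32_t cap_offset, uint16_t id)
{
    uint32_t off = cap_offset;

    while (off != 0) {
        struct cap_header *hdr = (struct cap_header *)((char *)base + off);

        if (hdr->id == id) {
            return hdr;
        }
        off = hdr->next;
    }
    return NULL;
}
```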

Signed-off-by: Kirti Wankhede 
Cc: Shameer Kolothum 
Cc: Eric Auger 
---
 hw/vfio/common.c  | 90 +++
 hw/vfio/migration.c   |  7 +++-
 include/hw/vfio/vfio-common.h |  3 ++
 3 files changed, 91 insertions(+), 9 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index c6e98b8d61be..d4959c036dd1 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1228,6 +1228,75 @@ static int vfio_init_container(VFIOContainer *container, 
int group_fd,
 return 0;
 }
 
+static int vfio_get_iommu_info(VFIOContainer *container,
+   struct vfio_iommu_type1_info **info)
+{
+
+size_t argsz = sizeof(struct vfio_iommu_type1_info);
+
+*info = g_new0(struct vfio_iommu_type1_info, 1);
+again:
+(*info)->argsz = argsz;
+
+if (ioctl(container->fd, VFIO_IOMMU_GET_INFO, *info)) {
+g_free(*info);
+*info = NULL;
+return -errno;
+}
+
+if (((*info)->argsz > argsz)) {
+argsz = (*info)->argsz;
+*info = g_realloc(*info, argsz);
+goto again;
+}
+
+return 0;
+}
+
+static struct vfio_info_cap_header *
+vfio_get_iommu_info_cap(struct vfio_iommu_type1_info *info, uint16_t id)
+{
+struct vfio_info_cap_header *hdr;
+void *ptr = info;
+
+if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) {
+return NULL;
+}
+
+for (hdr = ptr + info->cap_offset; hdr != ptr; hdr = ptr + hdr->next) {
+if (hdr->id == id) {
+return hdr;
+}
+}
+
+return NULL;
+}
+
+static void vfio_get_iommu_info_migration(VFIOContainer *container,
+ struct vfio_iommu_type1_info *info)
+{
+struct vfio_info_cap_header *hdr;
+struct vfio_iommu_type1_info_cap_migration *cap_mig;
+
+hdr = vfio_get_iommu_info_cap(info, VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION);
+if (!hdr) {
+return;
+}
+
+cap_mig = container_of(hdr, struct vfio_iommu_type1_info_cap_migration,
+header);
+
+/*
+ * cpu_physical_memory_set_dirty_lebitmap() expects pages in bitmap of
+ * TARGET_PAGE_SIZE to mark those dirty.
+ */
+if (cap_mig->pgsize_bitmap & TARGET_PAGE_SIZE) {
+container->dirty_pages_supported = true;
+container->max_dirty_bitmap_size = cap_mig->max_dirty_bitmap_size;
+container->dirty_pgsizes = cap_mig->pgsize_bitmap;
+}
+}
+
 static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
   Error **errp)
 {
@@ -1297,6 +1366,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
 container->space = space;
 container->fd = fd;
 container->error = NULL;
+container->dirty_pages_supported = false;
 QLIST_INIT(>giommu_list);
 QLIST_INIT(>hostwin_list);
 
@@ -1309,7 +1379,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
 case VFIO_TYPE1v2_IOMMU:
 case VFIO_TYPE1_IOMMU:
 {
-struct vfio_iommu_type1_info info;
+struct vfio_iommu_type1_info *info;
 
 /*
  * FIXME: This assumes that a Type1 IOMMU can map any 64-bit
@@ -1318,15 +1388,19 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
  * existing Type1 IOMMUs generally support any IOVA we're
  * going to actually try in practice.
  */
-info.argsz = sizeof(info);
-ret = ioctl(fd, VFIO_IOMMU_GET_INFO, &info);
-/* Ignore errors */
-if (ret || !(info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
+ret = vfio_get_iommu_info(container, &info);
+
+if (ret || !(info->flags & VFIO_IOMMU_INFO_PGSIZES)) {
 /* Assume 4k IOVA page size */
-info.iova_pgsizes = 4096;
+info->iova_pgsizes = 4096;
 }
-vfio_host_win_add(container, 0, (hwaddr)-1, info.iova_pgsizes);
-container->pgsizes = info.iova_pgsizes;
+vfio_host_win_add(container, 0, (hwaddr)-1, info->iova_pgsizes);
+container->pgsizes = info->iova_pgsizes;
+
+if (!ret) {
+vfio_get_iommu_info_migration(container, info);
+}
+g_free(info);
 break;
 }
 case VFIO_SPAPR_TCE_v2_IOMMU:
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 46d05d230e2a..ea5e0f1b8489 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -828,9 +828,14 @@ err:
 
 int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
 {
+VFIOContainer *container = vbasedev->group->container;

[PATCH v27 03/17] vfio: Add save and load functions for VFIO PCI devices

2020-10-22 Thread Kirti Wankhede
Added functions to save and restore PCI device specific data,
specifically config space of PCI device.

Signed-off-by: Kirti Wankhede 
Reviewed-by: Neo Jia 
---
 hw/vfio/pci.c | 48 +++
 include/hw/vfio/vfio-common.h |  2 ++
 2 files changed, 50 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index bffd5bfe3b78..1036a5332772 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -41,6 +41,7 @@
 #include "trace.h"
 #include "qapi/error.h"
 #include "migration/blocker.h"
+#include "migration/qemu-file.h"
 
 #define TYPE_VFIO_PCI_NOHOTPLUG "vfio-pci-nohotplug"
 
@@ -2401,11 +2402,58 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev)
 return OBJECT(vdev);
 }
 
+static bool vfio_msix_enabled(void *opaque, int version_id)
+{
+PCIDevice *pdev = opaque;
+
+return msix_enabled(pdev);
+}
+
+const VMStateDescription vmstate_vfio_pci_config = {
+.name = "VFIOPCIDevice",
+.version_id = 1,
+.minimum_version_id = 1,
+.fields = (VMStateField[]) {
+VMSTATE_PCI_DEVICE(pdev, VFIOPCIDevice),
+VMSTATE_MSIX_TEST(pdev, VFIOPCIDevice, vfio_msix_enabled),
+VMSTATE_END_OF_LIST()
+}
+};
+
+static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
+{
+VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+
+vmstate_save_state(f, &vmstate_vfio_pci_config, vdev, NULL);
+}
+
+static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
+{
+VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+PCIDevice *pdev = >pdev;
+int ret;
+
+ret = vmstate_load_state(f, &vmstate_vfio_pci_config, vdev, 1);
+if (ret) {
+return ret;
+}
+
+if (msi_enabled(pdev)) {
+vfio_msi_enable(vdev);
+} else if (msix_enabled(pdev)) {
+vfio_msix_enable(vdev);
+}
+
+return ret;
+}
+
 static VFIODeviceOps vfio_pci_ops = {
 .vfio_compute_needs_reset = vfio_pci_compute_needs_reset,
 .vfio_hot_reset_multi = vfio_pci_hot_reset_multi,
 .vfio_eoi = vfio_intx_eoi,
 .vfio_get_object = vfio_pci_get_object,
+.vfio_save_config = vfio_pci_save_config,
+.vfio_load_config = vfio_pci_load_config,
 };
 
 int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp)
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index fe99c36a693a..ba6169cd926e 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -120,6 +120,8 @@ struct VFIODeviceOps {
 int (*vfio_hot_reset_multi)(VFIODevice *vdev);
 void (*vfio_eoi)(VFIODevice *vdev);
 Object *(*vfio_get_object)(VFIODevice *vdev);
+void (*vfio_save_config)(VFIODevice *vdev, QEMUFile *f);
+int (*vfio_load_config)(VFIODevice *vdev, QEMUFile *f);
 };
 
 typedef struct VFIOGroup {
-- 
2.7.0




[PATCH v27 09/17] vfio: Add load state functions to SaveVMHandlers

2020-10-22 Thread Kirti Wankhede
Sequence during the _RESUMING device state:
While data for this device is available, repeat the steps below:
a. read data_offset, from where the user application should write data.
b. write data of data_size to the migration region from data_offset.
c. write data_size, which indicates to the vendor driver that data has been
   written to the staging buffer.

For user, data is opaque. User should write data in the same order as
received.
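
The a-c sequence can be sketched against an in-memory mock of the migration region. The header struct below mirrors the field layout of `struct vfio_device_migration_info` from linux/vfio.h, but `mig_region`, `mig_info` and `resume_chunk` are hypothetical names for illustration, not part of the patch:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Layout mirroring struct vfio_device_migration_info (linux/vfio.h). */
struct mig_info {
    uint32_t device_state;
    uint32_t reserved;
    uint64_t pending_bytes;
    uint64_t data_offset;   /* step a: where the user must write the chunk */
    uint64_t data_size;     /* step c: written last, commits the chunk */
};

/* Mock migration region: header followed by the data section. */
struct mig_region {
    struct mig_info info;
    uint8_t data[256];
};

/* One _RESUMING iteration: read data_offset, write data, write data_size. */
static void resume_chunk(struct mig_region *r, const void *buf, uint64_t len)
{
    uint64_t off = r->info.data_offset;     /* step a: read data_offset */
    memcpy((uint8_t *)r + off, buf, len);   /* step b: write the opaque data */
    r->info.data_size = len;                /* step c: commit to the driver */
}
```

In the real flow the three accesses are pread/pwrite calls on the region file descriptor, and the loop repeats while the stream still carries data for this device.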

Signed-off-by: Kirti Wankhede 
Reviewed-by: Neo Jia 
Reviewed-by: Dr. David Alan Gilbert 
---
 hw/vfio/migration.c  | 192 +++
 hw/vfio/trace-events |   3 +
 2 files changed, 195 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 5506cef15d88..46d05d230e2a 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -257,6 +257,77 @@ static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev, uint64_t *size)
     return ret;
 }
 
+static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
+                            uint64_t data_size)
+{
+    VFIORegion *region = &vbasedev->migration->region;
+    uint64_t data_offset = 0, size, report_size;
+    int ret;
+
+    do {
+        ret = vfio_mig_read(vbasedev, &data_offset, sizeof(data_offset),
+                  region->fd_offset + VFIO_MIG_STRUCT_OFFSET(data_offset));
+        if (ret < 0) {
+            return ret;
+        }
+
+        if (data_offset + data_size > region->size) {
+            /*
+             * If data_size is greater than the data section of migration
+             * region then iterate the write buffer operation. This case can
+             * occur if size of migration region at destination is smaller
+             * than size of migration region at source.
+             */
+            report_size = size = region->size - data_offset;
+            data_size -= size;
+        } else {
+            report_size = size = data_size;
+            data_size = 0;
+        }
+
+        trace_vfio_load_state_device_data(vbasedev->name, data_offset, size);
+
+        while (size) {
+            void *buf;
+            uint64_t sec_size;
+            bool buf_alloc = false;
+
+            buf = get_data_section_size(region, data_offset, size, &sec_size);
+
+            if (!buf) {
+                buf = g_try_malloc(sec_size);
+                if (!buf) {
+                    error_report("%s: Error allocating buffer ", __func__);
+                    return -ENOMEM;
+                }
+                buf_alloc = true;
+            }
+
+            qemu_get_buffer(f, buf, sec_size);
+
+            if (buf_alloc) {
+                ret = vfio_mig_write(vbasedev, buf, sec_size,
+                                     region->fd_offset + data_offset);
+                g_free(buf);
+
+                if (ret < 0) {
+                    return ret;
+                }
+            }
+            size -= sec_size;
+            data_offset += sec_size;
+        }
+
+        ret = vfio_mig_write(vbasedev, &report_size, sizeof(report_size),
+                  region->fd_offset + VFIO_MIG_STRUCT_OFFSET(data_size));
+        if (ret < 0) {
+            return ret;
+        }
+    } while (data_size);
+
+    return 0;
+}
+
 static int vfio_update_pending(VFIODevice *vbasedev)
 {
     VFIOMigration *migration = vbasedev->migration;
@@ -293,6 +364,33 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
     return qemu_file_get_error(f);
 }
 
+static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    uint64_t data;
+
+    if (vbasedev->ops && vbasedev->ops->vfio_load_config) {
+        int ret;
+
+        ret = vbasedev->ops->vfio_load_config(vbasedev, f);
+        if (ret) {
+            error_report("%s: Failed to load device config space",
+                         vbasedev->name);
+            return ret;
+        }
+    }
+
+    data = qemu_get_be64(f);
+    if (data != VFIO_MIG_FLAG_END_OF_STATE) {
+        error_report("%s: Failed loading device config space, "
+                     "end flag incorrect 0x%"PRIx64, vbasedev->name, data);
+        return -EINVAL;
+    }
+
+    trace_vfio_load_device_config_state(vbasedev->name);
+    return qemu_file_get_error(f);
+}
+
 /* -- */
 
 static int vfio_save_setup(QEMUFile *f, void *opaque)
@@ -477,12 +575,106 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
     return ret;
 }
 
+static int vfio_load_setup(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret = 0;
+
+    if (migration->region.mmaps) {
+        ret = vfio_region_mmap(&migration->region);
+        if (ret) {
+            error_report("%s: Failed to mmap VFIO migration region %d: %s",
+ vbasedev->name, migra

[PATCH v27 04/17] vfio: Add migration region initialization and finalize function

2020-10-22 Thread Kirti Wankhede
Whether the VFIO device supports migration or not is decided based on the
migration region query. If the migration region query is successful and
migration region initialization is successful, then migration is supported;
otherwise migration is blocked.
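
That decision can be sketched with the two VFIO steps stubbed out as return codes; `probe_migration` and the global `blocker` are hypothetical illustrations of the control flow in vfio_migration_probe(), not the patch's API:

```c
#include <assert.h>
#include <string.h>

/* NULL means migration is allowed; otherwise holds the blocker message. */
static const char *blocker;

/*
 * Mirrors the shape of vfio_migration_probe(): migration is supported only
 * when both the region query and the region initialization succeed; any
 * failure installs a migration blocker and returns the failing step's error.
 */
static int probe_migration(int query_ret, int init_ret)
{
    if (query_ret || init_ret) {
        /* add_blocker: path */
        blocker = "VFIO device doesn't support migration";
        return query_ret ? query_ret : init_ret;
    }
    blocker = NULL;
    return 0;
}
```

The real code additionally propagates the blocker through migrate_add_blocker() so that any migration attempt fails with that message rather than hanging on an unmigratable device.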

Signed-off-by: Kirti Wankhede 
Reviewed-by: Neo Jia 
Acked-by: Dr. David Alan Gilbert 
---
 hw/vfio/meson.build   |   1 +
 hw/vfio/migration.c   | 129 ++
 hw/vfio/trace-events  |   3 +
 include/hw/vfio/vfio-common.h |   9 +++
 4 files changed, 142 insertions(+)
 create mode 100644 hw/vfio/migration.c

diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
index 37efa74018bc..da9af297a0c5 100644
--- a/hw/vfio/meson.build
+++ b/hw/vfio/meson.build
@@ -2,6 +2,7 @@ vfio_ss = ss.source_set()
 vfio_ss.add(files(
   'common.c',
   'spapr.c',
+  'migration.c',
 ))
 vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
   'display.c',
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
new file mode 100644
index ..5f74a3ad1d72
--- /dev/null
+++ b/hw/vfio/migration.c
@@ -0,0 +1,129 @@
+/*
+ * Migration support for VFIO devices
+ *
+ * Copyright NVIDIA, Inc. 2020
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include <linux/vfio.h>
+
+#include "hw/vfio/vfio-common.h"
+#include "cpu.h"
+#include "migration/migration.h"
+#include "migration/qemu-file.h"
+#include "migration/register.h"
+#include "migration/blocker.h"
+#include "migration/misc.h"
+#include "qapi/error.h"
+#include "exec/ramlist.h"
+#include "exec/ram_addr.h"
+#include "pci.h"
+#include "trace.h"
+
+static void vfio_migration_region_exit(VFIODevice *vbasedev)
+{
+    VFIOMigration *migration = vbasedev->migration;
+
+    if (!migration) {
+        return;
+    }
+
+    if (migration->region.size) {
+        vfio_region_exit(&migration->region);
+        vfio_region_finalize(&migration->region);
+    }
+}
+
+static int vfio_migration_init(VFIODevice *vbasedev,
+                               struct vfio_region_info *info)
+{
+    int ret;
+    Object *obj;
+    VFIOMigration *migration;
+
+    if (!vbasedev->ops->vfio_get_object) {
+        return -EINVAL;
+    }
+
+    obj = vbasedev->ops->vfio_get_object(vbasedev);
+    if (!obj) {
+        return -EINVAL;
+    }
+
+    migration = g_new0(VFIOMigration, 1);
+    vbasedev->migration = migration;
+
+    ret = vfio_region_setup(obj, vbasedev, &migration->region,
+                            info->index, "migration");
+    if (ret) {
+        error_report("%s: Failed to setup VFIO migration region %d: %s",
+                     vbasedev->name, info->index, strerror(-ret));
+        goto err;
+    }
+
+    if (!migration->region.size) {
+        error_report("%s: Invalid zero-sized VFIO migration region %d",
+                     vbasedev->name, info->index);
+        ret = -EINVAL;
+        goto err;
+    }
+
+    return 0;
+
+err:
+    vfio_migration_region_exit(vbasedev);
+    g_free(migration);
+    vbasedev->migration = NULL;
+    return ret;
+}
+
+/* -- */
+
+int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
+{
+    struct vfio_region_info *info = NULL;
+    Error *local_err = NULL;
+    int ret;
+
+    ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_MIGRATION,
+                                   VFIO_REGION_SUBTYPE_MIGRATION, &info);
+    if (ret) {
+        goto add_blocker;
+    }
+
+    ret = vfio_migration_init(vbasedev, info);
+    if (ret) {
+        goto add_blocker;
+    }
+
+    trace_vfio_migration_probe(vbasedev->name, info->index);
+    g_free(info);
+    return 0;
+
+add_blocker:
+    error_setg(&vbasedev->migration_blocker,
+               "VFIO device doesn't support migration");
+    g_free(info);
+
+    ret = migrate_add_blocker(vbasedev->migration_blocker, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        error_free(vbasedev->migration_blocker);
+        vbasedev->migration_blocker = NULL;
+    }
+    return ret;
+}
+
+void vfio_migration_finalize(VFIODevice *vbasedev)
+{
+    if (vbasedev->migration_blocker) {
+        migrate_del_blocker(vbasedev->migration_blocker);
+        error_free(vbasedev->migration_blocker);
+        vbasedev->migration_blocker = NULL;
+    }
+
+    vfio_migration_region_exit(vbasedev);
+    g_free(vbasedev->migration);
+}
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index a0c7b49a2ebc..9ced5ec6277c 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -145,3 +145,6 @@ vfio_display_edid_link_up(void) ""
 vfio_display_edid_link_down(void) ""
 vfio_display_edid_update(uint32_t prefx, uint32_t prefy) "%ux%u"
