Re: [PATCH 4/9] vfio/migration: Skip pre-copy if dirty page tracking is not supported

2022-05-16 Thread Alex Williamson
On Mon, 16 May 2022 13:22:14 +0200
Juan Quintela  wrote:

> Avihai Horon  wrote:
> > Currently, if IOMMU of a VFIO container doesn't support dirty page
> > tracking, migration is blocked completely. This is because a DMA-able
> > VFIO device can dirty RAM pages without updating QEMU about it, thus
> > breaking the migration.
> >
> > However, this doesn't mean that migration can't be done at all. If
> > migration pre-copy phase is skipped, the VFIO device doesn't have a
> > chance to dirty RAM pages that have been migrated already, thus
> > eliminating the problem previously mentioned.
> >
> > Hence, in such case allow migration but skip pre-copy phase.
> >
> > Signed-off-by: Avihai Horon   
> 
> I don't know (TM).
> Several issues:
> - Patch is ugly as hell (ok, that depends on taste)
> - It changes migration_iteration_run() instead of directly
>   migration_thread.
> - There is already another case where we skip the sending of RAM
>   (localhost migration with shared memory)
> 
> In migration/ram.c:
> 
> static int ram_find_and_save_block(RAMState *rs, bool last_stage)
> {
> PageSearchStatus pss;
> int pages = 0;
> bool again, found;
> 
> /* No dirty page as there is zero RAM */
> if (!ram_bytes_total()) {
> return pages;
> }
> 
> This is the other place where we _don't_ send any RAM at all.
> 
> I don't have a great idea about how to make things clear at a higher
> level, I have to think about this.

It seems like if we have devices dictating what type of migrations can
be performed then there probably needs to be a switch to restrict use of
such devices just as we have the -only-migratable switch now to prevent
attaching devices that don't support migration.  I'd guess that we need
the switch to opt-in to allowing such devices to maintain
compatibility.  There's probably a whole pile of qapi things missing to
expose this to management tools as well.  Thanks,

Alex

> > ---
> >  hw/vfio/migration.c   | 9 -
> >  migration/migration.c | 5 +
> >  migration/migration.h | 3 +++
> >  3 files changed, 16 insertions(+), 1 deletion(-)
> >
> > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > index 21e8f9d4d4..d4b6653026 100644
> > --- a/hw/vfio/migration.c
> > +++ b/hw/vfio/migration.c
> > @@ -863,10 +863,17 @@ int vfio_migration_probe(VFIODevice *vbasedev, Error 
> > **errp)
> >  struct vfio_region_info *info = NULL;
> >  int ret = -ENOTSUP;
> >  
> > -if (!vbasedev->enable_migration || !container->dirty_pages_supported) {
> > +if (!vbasedev->enable_migration) {
> >  goto add_blocker;
> >  }
> >  
> > +if (!container->dirty_pages_supported) {
> > +warn_report(
> > +"%s: IOMMU of the device's VFIO container doesn't support 
> > dirty page tracking, migration pre-copy phase will be skipped",
> > +vbasedev->name);
> > +migrate_get_current()->skip_precopy = true;
> > +}
> > +
> >  ret = vfio_get_dev_region_info(vbasedev,
> > VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
> > 
> > VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 5a31b23bd6..668343508d 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -3593,6 +3593,11 @@ static MigIterateState 
> > migration_iteration_run(MigrationState *s)
> >  uint64_t pending_size, pend_pre, pend_compat, pend_post;
> >  bool in_postcopy = s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE;
> >  
> > +if (s->skip_precopy) {
> > +migration_completion(s);
> > +return MIG_ITERATE_BREAK;
> > +}
> > +
> >  qemu_savevm_state_pending(s->to_dst_file, s->threshold_size, _pre,
> >_compat, _post);
> >  pending_size = pend_pre + pend_compat + pend_post;
> > diff --git a/migration/migration.h b/migration/migration.h
> > index a863032b71..876713e7e1 100644
> > --- a/migration/migration.h
> > +++ b/migration/migration.h
> > @@ -332,6 +332,9 @@ struct MigrationState {
> >   * This save hostname when out-going migration starts
> >   */
> >  char *hostname;
> > +
> > +/* Whether to skip pre-copy phase of migration or not */
> > +bool skip_precopy;
> >  };
> >  
> >  void migrate_set_state(int *state, int old_state, int new_state);  
> 




[PULL 0/1] Linux header update to v5.18-rc6

2022-05-13 Thread Alex Williamson
The following changes since commit 9de5f2b40860c5f8295e73fea9922df6f0b8d89a:

  Merge tag 'for-upstream' of https://gitlab.com/bonzini/qemu into staging 
(2022-05-12 10:52:15 -0700)

are available in the Git repository at:

  https://gitlab.com/alex.williamson/qemu.git tags/linux-headers-v5.18-rc6

for you to fetch changes up to e4082063e47e9731dbeb1c26174c17f6038f577f:

  linux-headers: Update to v5.18-rc6 (2022-05-13 08:20:11 -0600)


 * Linux header update to v5.18-rc6 and vfio file massaging (Alex Williamson)


Alex Williamson (1):
  linux-headers: Update to v5.18-rc6

 hw/vfio/common.c   |   6 +-
 hw/vfio/migration.c|  27 +-
 include/standard-headers/linux/input-event-codes.h |  25 +-
 include/standard-headers/linux/virtio_config.h |   6 +
 include/standard-headers/linux/virtio_crypto.h |  82 -
 linux-headers/asm-arm64/kvm.h  |  16 +
 linux-headers/asm-generic/mman-common.h|   2 +
 linux-headers/asm-mips/mman.h  |   2 +
 linux-headers/linux/kvm.h  |  27 +-
 linux-headers/linux/psci.h |   4 +
 linux-headers/linux/userfaultfd.h  |   8 +-
 linux-headers/linux/vfio.h | 406 ++---
 linux-headers/linux/vhost.h|   7 +
 13 files changed, 383 insertions(+), 235 deletions(-)




[PULL 1/1] linux-headers: Update to v5.18-rc6

2022-05-13 Thread Alex Williamson
Update to c5eb0a61238d ("Linux 5.18-rc6").  Mechanical search and
replace of vfio defines with white space massaging.

Signed-off-by: Alex Williamson 
---
 hw/vfio/common.c   |6 
 hw/vfio/migration.c|   27 +
 include/standard-headers/linux/input-event-codes.h |   25 +
 include/standard-headers/linux/virtio_config.h |6 
 include/standard-headers/linux/virtio_crypto.h |   82 
 linux-headers/asm-arm64/kvm.h  |   16 +
 linux-headers/asm-generic/mman-common.h|2 
 linux-headers/asm-mips/mman.h  |2 
 linux-headers/linux/kvm.h  |   27 +
 linux-headers/linux/psci.h |4 
 linux-headers/linux/userfaultfd.h  |8 
 linux-headers/linux/vfio.h |  406 ++--
 linux-headers/linux/vhost.h|7 
 13 files changed, 383 insertions(+), 235 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 159f910421bc..29982c7af8c4 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -355,7 +355,7 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer 
*container)
 }
 
 if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF)
-&& (migration->device_state & VFIO_DEVICE_STATE_RUNNING)) {
+&& (migration->device_state & VFIO_DEVICE_STATE_V1_RUNNING)) {
 return false;
 }
 }
@@ -381,8 +381,8 @@ static bool 
vfio_devices_all_running_and_saving(VFIOContainer *container)
 return false;
 }
 
-if ((migration->device_state & VFIO_DEVICE_STATE_SAVING) &&
-(migration->device_state & VFIO_DEVICE_STATE_RUNNING)) {
+if ((migration->device_state & VFIO_DEVICE_STATE_V1_SAVING) &&
+(migration->device_state & VFIO_DEVICE_STATE_V1_RUNNING)) {
 continue;
 } else {
 return false;
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index ff6b45de6b55..a6ad1f894561 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -432,7 +432,7 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
 }
 
 ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_MASK,
-   VFIO_DEVICE_STATE_SAVING);
+   VFIO_DEVICE_STATE_V1_SAVING);
 if (ret) {
 error_report("%s: Failed to set state SAVING", vbasedev->name);
 return ret;
@@ -531,8 +531,8 @@ static int vfio_save_complete_precopy(QEMUFile *f, void 
*opaque)
 uint64_t data_size;
 int ret;
 
-ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_RUNNING,
-   VFIO_DEVICE_STATE_SAVING);
+ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_V1_RUNNING,
+   VFIO_DEVICE_STATE_V1_SAVING);
 if (ret) {
 error_report("%s: Failed to set state STOP and SAVING",
  vbasedev->name);
@@ -569,7 +569,7 @@ static int vfio_save_complete_precopy(QEMUFile *f, void 
*opaque)
 return ret;
 }
 
-ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_SAVING, 0);
+ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_V1_SAVING, 0);
 if (ret) {
 error_report("%s: Failed to set state STOPPED", vbasedev->name);
 return ret;
@@ -609,7 +609,7 @@ static int vfio_load_setup(QEMUFile *f, void *opaque)
 }
 
 ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_MASK,
-   VFIO_DEVICE_STATE_RESUMING);
+   VFIO_DEVICE_STATE_V1_RESUMING);
 if (ret) {
 error_report("%s: Failed to set state RESUMING", vbasedev->name);
 if (migration->region.mmaps) {
@@ -717,20 +717,20 @@ static void vfio_vmstate_change(void *opaque, bool 
running, RunState state)
  * In both the above cases, set _RUNNING bit.
  */
 mask = ~VFIO_DEVICE_STATE_MASK;
-value = VFIO_DEVICE_STATE_RUNNING;
+value = VFIO_DEVICE_STATE_V1_RUNNING;
 } else {
 /*
  * Here device state could be either _RUNNING or _SAVING|_RUNNING. 
Reset
  * _RUNNING bit
  */
-mask = ~VFIO_DEVICE_STATE_RUNNING;
+mask = ~VFIO_DEVICE_STATE_V1_RUNNING;
 
 /*
  * When VM state transition to stop for savevm command, device should
  * start saving data.
  */
 if (state == RUN_STATE_SAVE_VM) {
-value = VFIO_DEVICE_STATE_SAVING;
+value = VFIO_DEVICE_STATE_V1_SAVING;
 } else {
 value = 0;
 }
@@ -768,8 +768,9 @@ static void 

Re: [PATCH 2/9] vfio: Fix compilation errors caused by VFIO migration v1 deprecation

2022-05-12 Thread Alex Williamson
On Thu, 12 May 2022 15:25:32 -0300
Jason Gunthorpe  wrote:

> On Thu, May 12, 2022 at 11:57:10AM -0600, Alex Williamson wrote:
> > > @@ -767,9 +767,10 @@ static void vfio_migration_state_notifier(Notifier 
> > > *notifier, void *data)
> > >  case MIGRATION_STATUS_CANCELLED:
> > >  case MIGRATION_STATUS_FAILED:
> > >  bytes_transferred = 0;
> > > -ret = vfio_migration_set_state(vbasedev,
> > > -  ~(VFIO_DEVICE_STATE_SAVING | 
> > > VFIO_DEVICE_STATE_RESUMING),
> > > -  VFIO_DEVICE_STATE_RUNNING);
> > > +ret = vfio_migration_set_state(
> > > +vbasedev,
> > > +~(VFIO_DEVICE_STATE_V1_SAVING | 
> > > VFIO_DEVICE_STATE_V1_RESUMING),
> > > +VFIO_DEVICE_STATE_V1_RUNNING);  
> > 
> > Yikes!  Please follow the line wrapping used elsewhere.  There's no need
> > to put the first arg on a new line and subsequent wrapped lines should
> > be indented to match the previous line, or at least to avoid wrapping
> > itself.  Here we can use something like:  
> 
> This is generated by clang-format with one of the qmeu styles, it
> follows the documented guide:
> 
>  In case of function, there are several variants:
> 
>  - 4 spaces indent from the beginning
>  - align the secondary lines just after the opening parenthesis of the
>first
> 
> clang-format selected the first option due to its optimization
> algorithm.
> 
> Knowing nothing about qmeu, I am confused??

Maybe someone needs to throw more AI models at clang-format so that it
considers the more readable option?  QEMU does a lot wrong with style
imo, and maybe it's technically compliant as written, but I think what
I proposed is also compliant, as well as more readable and more
consistent with the existing file.  Thanks,

Alex




Re: [PATCH 0/9] vfio/migration: Implement VFIO migration protocol v2

2022-05-12 Thread Alex Williamson
On Thu, 12 May 2022 18:43:11 +0300
Avihai Horon  wrote:

> Hello,
> 
> Following VFIO migration protocol v2 acceptance in kernel, this series
> implements VFIO migration according to the new v2 protocol and replaces
> the now deprecated v1 implementation.

Let's not bottleneck others waiting on a linux header file update on
also incorporating v2 support.  In the short term we just need the
first two patches here.

Are there any objections to folding those patches together for the sake
of bisection?  Thanks,

Alex




Re: [PATCH 2/9] vfio: Fix compilation errors caused by VFIO migration v1 deprecation

2022-05-12 Thread Alex Williamson
On Thu, 12 May 2022 18:43:13 +0300
Avihai Horon  wrote:

> VFIO migration protocol v1 was deprecated and as part of it some of the
> uAPI definitions were renamed. This caused compilation errors.
> Fix them.
> 
> Signed-off-by: Avihai Horon 
> ---
>  hw/vfio/common.c|  6 +++---
>  hw/vfio/migration.c | 29 -
>  2 files changed, 19 insertions(+), 16 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 159f910421..29982c7af8 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -355,7 +355,7 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer 
> *container)
>  }
>  
>  if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF)
> -&& (migration->device_state & VFIO_DEVICE_STATE_RUNNING)) {
> +&& (migration->device_state & VFIO_DEVICE_STATE_V1_RUNNING)) 
> {
>  return false;
>  }
>  }
> @@ -381,8 +381,8 @@ static bool 
> vfio_devices_all_running_and_saving(VFIOContainer *container)
>  return false;
>  }
>  
> -if ((migration->device_state & VFIO_DEVICE_STATE_SAVING) &&
> -(migration->device_state & VFIO_DEVICE_STATE_RUNNING)) {
> +if ((migration->device_state & VFIO_DEVICE_STATE_V1_SAVING) &&
> +(migration->device_state & VFIO_DEVICE_STATE_V1_RUNNING)) {
>  continue;
>  } else {
>  return false;
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index ff6b45de6b..835608cd23 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -432,7 +432,7 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
>  }
>  
>  ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_MASK,
> -   VFIO_DEVICE_STATE_SAVING);
> +   VFIO_DEVICE_STATE_V1_SAVING);
>  if (ret) {
>  error_report("%s: Failed to set state SAVING", vbasedev->name);
>  return ret;
> @@ -531,8 +531,8 @@ static int vfio_save_complete_precopy(QEMUFile *f, void 
> *opaque)
>  uint64_t data_size;
>  int ret;
>  
> -ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_RUNNING,
> -   VFIO_DEVICE_STATE_SAVING);
> +ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_V1_RUNNING,
> +   VFIO_DEVICE_STATE_V1_SAVING);
>  if (ret) {
>  error_report("%s: Failed to set state STOP and SAVING",
>   vbasedev->name);
> @@ -569,7 +569,7 @@ static int vfio_save_complete_precopy(QEMUFile *f, void 
> *opaque)
>  return ret;
>  }
>  
> -ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_SAVING, 0);
> +ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_V1_SAVING, 
> 0);
>  if (ret) {
>  error_report("%s: Failed to set state STOPPED", vbasedev->name);
>  return ret;
> @@ -609,7 +609,7 @@ static int vfio_load_setup(QEMUFile *f, void *opaque)
>  }
>  
>  ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_MASK,
> -   VFIO_DEVICE_STATE_RESUMING);
> +   VFIO_DEVICE_STATE_V1_RESUMING);
>  if (ret) {
>  error_report("%s: Failed to set state RESUMING", vbasedev->name);
>  if (migration->region.mmaps) {
> @@ -717,20 +717,20 @@ static void vfio_vmstate_change(void *opaque, bool 
> running, RunState state)
>   * In both the above cases, set _RUNNING bit.
>   */
>  mask = ~VFIO_DEVICE_STATE_MASK;
> -value = VFIO_DEVICE_STATE_RUNNING;
> +value = VFIO_DEVICE_STATE_V1_RUNNING;
>  } else {
>  /*
>   * Here device state could be either _RUNNING or _SAVING|_RUNNING. 
> Reset
>   * _RUNNING bit
>   */
> -mask = ~VFIO_DEVICE_STATE_RUNNING;
> +mask = ~VFIO_DEVICE_STATE_V1_RUNNING;
>  
>  /*
>   * When VM state transition to stop for savevm command, device should
>   * start saving data.
>   */
>  if (state == RUN_STATE_SAVE_VM) {
> -value = VFIO_DEVICE_STATE_SAVING;
> +value = VFIO_DEVICE_STATE_V1_SAVING;
>  } else {
>  value = 0;
>  }
> @@ -767,9 +767,10 @@ static void vfio_migration_state_notifier(Notifier 
> *notifier, void *data)
>  case MIGRATION_STATUS_CANCELLED:
>  case MIGRATION_STATUS_FAILED:
>  bytes_transferred = 0;
> -ret = vfio_migration_set_state(vbasedev,
> -  ~(VFIO_DEVICE_STATE_SAVING | 
> VFIO_DEVICE_STATE_RESUMING),
> -  VFIO_DEVICE_STATE_RUNNING);
> +ret = vfio_migration_set_state(
> +vbasedev,
> +~(VFIO_DEVICE_STATE_V1_SAVING | VFIO_DEVICE_STATE_V1_RESUMING),
> +VFIO_DEVICE_STATE_V1_RUNNING);

Yikes!  Please follow 

[PULL 11/11] vfio/common: Rename VFIOGuestIOMMU::iommu into ::iommu_mr

2022-05-06 Thread Alex Williamson
From: Yi Liu 

Rename VFIOGuestIOMMU iommu field into iommu_mr. Then it becomes clearer
it is an IOMMU memory region.

no functional change intended

Signed-off-by: Yi Liu 
Link: https://lore.kernel.org/r/20220502094223.36384-4-yi.l@intel.com
Signed-off-by: Alex Williamson 
---
 hw/vfio/common.c  |   16 
 include/hw/vfio/vfio-common.h |2 +-
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index cfcb71974a61..159f910421bc 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1017,7 +1017,7 @@ static void vfio_listener_region_add(MemoryListener 
*listener,
  * device emulation the VFIO iommu handles to use).
  */
 giommu = g_malloc0(sizeof(*giommu));
-giommu->iommu = iommu_mr;
+giommu->iommu_mr = iommu_mr;
 giommu->iommu_offset = section->offset_within_address_space -
section->offset_within_region;
 giommu->container = container;
@@ -1032,7 +1032,7 @@ static void vfio_listener_region_add(MemoryListener 
*listener,
 int128_get64(llend),
 iommu_idx);
 
-ret = memory_region_iommu_set_page_size_mask(giommu->iommu,
+ret = memory_region_iommu_set_page_size_mask(giommu->iommu_mr,
  container->pgsizes,
  );
 if (ret) {
@@ -1047,7 +1047,7 @@ static void vfio_listener_region_add(MemoryListener 
*listener,
 goto fail;
 }
 QLIST_INSERT_HEAD(>giommu_list, giommu, giommu_next);
-memory_region_iommu_replay(giommu->iommu, >n);
+memory_region_iommu_replay(giommu->iommu_mr, >n);
 
 return;
 }
@@ -1153,7 +1153,7 @@ static void vfio_listener_region_del(MemoryListener 
*listener,
 VFIOGuestIOMMU *giommu;
 
 QLIST_FOREACH(giommu, >giommu_list, giommu_next) {
-if (MEMORY_REGION(giommu->iommu) == section->mr &&
+if (MEMORY_REGION(giommu->iommu_mr) == section->mr &&
 giommu->n.start == section->offset_within_region) {
 memory_region_unregister_iommu_notifier(section->mr,
 >n);
@@ -1418,11 +1418,11 @@ static int vfio_sync_dirty_bitmap(VFIOContainer 
*container,
 VFIOGuestIOMMU *giommu;
 
 QLIST_FOREACH(giommu, >giommu_list, giommu_next) {
-if (MEMORY_REGION(giommu->iommu) == section->mr &&
+if (MEMORY_REGION(giommu->iommu_mr) == section->mr &&
 giommu->n.start == section->offset_within_region) {
 Int128 llend;
 vfio_giommu_dirty_notifier gdn = { .giommu = giommu };
-int idx = memory_region_iommu_attrs_to_index(giommu->iommu,
+int idx = memory_region_iommu_attrs_to_index(giommu->iommu_mr,
MEMTXATTRS_UNSPECIFIED);
 
 llend = 
int128_add(int128_make64(section->offset_within_region),
@@ -1435,7 +1435,7 @@ static int vfio_sync_dirty_bitmap(VFIOContainer 
*container,
 section->offset_within_region,
 int128_get64(llend),
 idx);
-memory_region_iommu_replay(giommu->iommu, );
+memory_region_iommu_replay(giommu->iommu_mr, );
 break;
 }
 }
@@ -2270,7 +2270,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
 
 QLIST_FOREACH_SAFE(giommu, >giommu_list, giommu_next, tmp) {
 memory_region_unregister_iommu_notifier(
-MEMORY_REGION(giommu->iommu), >n);
+MEMORY_REGION(giommu->iommu_mr), >n);
 QLIST_REMOVE(giommu, giommu_next);
 g_free(giommu);
 }
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 8af11b0a7692..e573f5a9f19f 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -98,7 +98,7 @@ typedef struct VFIOContainer {
 
 typedef struct VFIOGuestIOMMU {
 VFIOContainer *container;
-IOMMUMemoryRegion *iommu;
+IOMMUMemoryRegion *iommu_mr;
 hwaddr iommu_offset;
 IOMMUNotifier n;
 QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;





[PULL 10/11] vfio/pci: Use vbasedev local variable in vfio_realize()

2022-05-06 Thread Alex Williamson
From: Eric Auger 

Using a VFIODevice handle local variable to improve the code readability.

no functional change intended

Signed-off-by: Eric Auger 
Signed-off-by: Yi Liu 
Link: https://lore.kernel.org/r/20220502094223.36384-3-yi.l@intel.com
Signed-off-by: Alex Williamson 
---
 hw/vfio/pci.c |   49 +
 1 file changed, 25 insertions(+), 24 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index cb912bd3f4b2..939dcc3d4a9e 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2846,6 +2846,7 @@ static void vfio_unregister_req_notifier(VFIOPCIDevice 
*vdev)
 static void vfio_realize(PCIDevice *pdev, Error **errp)
 {
 VFIOPCIDevice *vdev = VFIO_PCI(pdev);
+VFIODevice *vbasedev = >vbasedev;
 VFIODevice *vbasedev_iter;
 VFIOGroup *group;
 char *tmp, *subsys, group_path[PATH_MAX], *group_name;
@@ -2856,7 +2857,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
 int i, ret;
 bool is_mdev;
 
-if (!vdev->vbasedev.sysfsdev) {
+if (!vbasedev->sysfsdev) {
 if (!(~vdev->host.domain || ~vdev->host.bus ||
   ~vdev->host.slot || ~vdev->host.function)) {
 error_setg(errp, "No provided host device");
@@ -2864,24 +2865,24 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
   "or -device vfio-pci,sysfsdev=PATH_TO_DEVICE\n");
 return;
 }
-vdev->vbasedev.sysfsdev =
+vbasedev->sysfsdev =
 g_strdup_printf("/sys/bus/pci/devices/%04x:%02x:%02x.%01x",
 vdev->host.domain, vdev->host.bus,
 vdev->host.slot, vdev->host.function);
 }
 
-if (stat(vdev->vbasedev.sysfsdev, ) < 0) {
+if (stat(vbasedev->sysfsdev, ) < 0) {
 error_setg_errno(errp, errno, "no such host device");
-error_prepend(errp, VFIO_MSG_PREFIX, vdev->vbasedev.sysfsdev);
+error_prepend(errp, VFIO_MSG_PREFIX, vbasedev->sysfsdev);
 return;
 }
 
-vdev->vbasedev.name = g_path_get_basename(vdev->vbasedev.sysfsdev);
-vdev->vbasedev.ops = _pci_ops;
-vdev->vbasedev.type = VFIO_DEVICE_TYPE_PCI;
-vdev->vbasedev.dev = DEVICE(vdev);
+vbasedev->name = g_path_get_basename(vbasedev->sysfsdev);
+vbasedev->ops = _pci_ops;
+vbasedev->type = VFIO_DEVICE_TYPE_PCI;
+vbasedev->dev = DEVICE(vdev);
 
-tmp = g_strdup_printf("%s/iommu_group", vdev->vbasedev.sysfsdev);
+tmp = g_strdup_printf("%s/iommu_group", vbasedev->sysfsdev);
 len = readlink(tmp, group_path, sizeof(group_path));
 g_free(tmp);
 
@@ -2899,7 +2900,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
 goto error;
 }
 
-trace_vfio_realize(vdev->vbasedev.name, groupid);
+trace_vfio_realize(vbasedev->name, groupid);
 
 group = vfio_get_group(groupid, pci_device_iommu_address_space(pdev), 
errp);
 if (!group) {
@@ -2907,7 +2908,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
 }
 
 QLIST_FOREACH(vbasedev_iter, >device_list, next) {
-if (strcmp(vbasedev_iter->name, vdev->vbasedev.name) == 0) {
+if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) {
 error_setg(errp, "device is already attached");
 vfio_put_group(group);
 goto error;
@@ -2920,22 +2921,22 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
  * stays in sync with the active working set of the guest driver.  Prevent
  * the x-balloon-allowed option unless this is minimally an mdev device.
  */
-tmp = g_strdup_printf("%s/subsystem", vdev->vbasedev.sysfsdev);
+tmp = g_strdup_printf("%s/subsystem", vbasedev->sysfsdev);
 subsys = realpath(tmp, NULL);
 g_free(tmp);
 is_mdev = subsys && (strcmp(subsys, "/sys/bus/mdev") == 0);
 free(subsys);
 
-trace_vfio_mdev(vdev->vbasedev.name, is_mdev);
+trace_vfio_mdev(vbasedev->name, is_mdev);
 
-if (vdev->vbasedev.ram_block_discard_allowed && !is_mdev) {
+if (vbasedev->ram_block_discard_allowed && !is_mdev) {
 error_setg(errp, "x-balloon-allowed only potentially compatible "
"with mdev devices");
 vfio_put_group(group);
 goto error;
 }
 
-ret = vfio_get_device(group, vdev->vbasedev.name, >vbasedev, errp);
+ret = vfio_get_device(group, vbasedev->name, vbasedev, errp);
 if (ret) {
 vfio_put_group(group);
 goto error;
@@ -2948,7 +2949,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
 }
 
 /* Get a copy of config space */
-ret = pread(vdev->vbasedev.fd, vdev->pdev.config,
+ret = pread(vbasedev->fd, 

[PULL 08/11] vfio/common: remove spurious tpm-crb-cmd misalignment warning

2022-05-06 Thread Alex Williamson
From: Eric Auger 

The CRB command buffer currently is a RAM MemoryRegion and given
its base address alignment, it causes an error report on
vfio_listener_region_add(). This region could have been a RAM device
region, easing the detection of such safe situation but this option
was not well received. So let's add a helper function that uses the
memory region owner type to detect the situation is safe wrt
the assignment. Other device types can be checked here if such kind
of problem occurs again.

Signed-off-by: Eric Auger 
Reviewed-by: Philippe Mathieu-Daudé 
Acked-by: Stefan Berger 
Reviewed-by: Cornelia Huck 
Link: https://lore.kernel.org/r/20220506132510.1847942-3-eric.au...@redhat.com
Signed-off-by: Alex Williamson 
---
 hw/vfio/common.c |   27 ++-
 hw/vfio/trace-events |1 +
 2 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 6065834717eb..cfcb71974a61 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -40,6 +40,7 @@
 #include "trace.h"
 #include "qapi/error.h"
 #include "migration/migration.h"
+#include "sysemu/tpm.h"
 
 VFIOGroupList vfio_group_list =
 QLIST_HEAD_INITIALIZER(vfio_group_list);
@@ -861,6 +862,22 @@ static void 
vfio_unregister_ram_discard_listener(VFIOContainer *container,
 g_free(vrdl);
 }
 
+static bool vfio_known_safe_misalignment(MemoryRegionSection *section)
+{
+MemoryRegion *mr = section->mr;
+
+if (!TPM_IS_CRB(mr->owner)) {
+return false;
+}
+
+/* this is a known safe misaligned region, just trace for debug purpose */
+trace_vfio_known_safe_misalignment(memory_region_name(mr),
+   section->offset_within_address_space,
+   section->offset_within_region,
+   qemu_real_host_page_size());
+return true;
+}
+
 static void vfio_listener_region_add(MemoryListener *listener,
  MemoryRegionSection *section)
 {
@@ -884,7 +901,15 @@ static void vfio_listener_region_add(MemoryListener 
*listener,
 if (unlikely((section->offset_within_address_space &
   ~qemu_real_host_page_mask()) !=
  (section->offset_within_region & 
~qemu_real_host_page_mask( {
-error_report("%s received unaligned region", __func__);
+if (!vfio_known_safe_misalignment(section)) {
+error_report("%s received unaligned region %s iova=0x%"PRIx64
+ " offset_within_region=0x%"PRIx64
+ " qemu_real_host_page_size=0x%"PRIxPTR,
+ __func__, memory_region_name(section->mr),
+ section->offset_within_address_space,
+ section->offset_within_region,
+ qemu_real_host_page_size());
+}
 return;
 }
 
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 0ef1b5f4a65f..582882db91c3 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -100,6 +100,7 @@ vfio_listener_region_add_skip(uint64_t start, uint64_t end) 
"SKIPPING region_add
 vfio_spapr_group_attach(int groupfd, int tablefd) "Attached groupfd %d to 
liobn fd %d"
 vfio_listener_region_add_iommu(uint64_t start, uint64_t end) "region_add 
[iommu] 0x%"PRIx64" - 0x%"PRIx64
 vfio_listener_region_add_ram(uint64_t iova_start, uint64_t iova_end, void 
*vaddr) "region_add [ram] 0x%"PRIx64" - 0x%"PRIx64" [%p]"
+vfio_known_safe_misalignment(const char *name, uint64_t iova, uint64_t 
offset_within_region, uintptr_t page_size) "Region \"%s\" iova=0x%"PRIx64" 
offset_within_region=0x%"PRIx64" qemu_real_host_page_size=0x%"PRIxPTR ": cannot 
be mapped for DMA"
 vfio_listener_region_add_no_dma_map(const char *name, uint64_t iova, uint64_t 
size, uint64_t page_size) "Region \"%s\" 0x%"PRIx64" size=0x%"PRIx64" is not 
aligned to 0x%"PRIx64" and cannot be mapped for DMA"
 vfio_listener_region_del_skip(uint64_t start, uint64_t end) "SKIPPING 
region_del 0x%"PRIx64" - 0x%"PRIx64
 vfio_listener_region_del(uint64_t start, uint64_t end) "region_del 0x%"PRIx64" 
- 0x%"PRIx64





[PULL 06/11] vfio/common: Fix a small boundary issue of a trace

2022-05-06 Thread Alex Williamson
From: Xiang Chen 

It uses [offset, offset + size - 1] to indicate that the length of range is
size in most places in vfio trace code (such as
trace_vfio_region_region_mmap()) execpt trace_vfio_region_sparse_mmap_entry().
So change it for trace_vfio_region_sparse_mmap_entry(), but if size is zero,
the trace will be weird with an underflow, so move the trace and trace it
only if size is not zero.

Signed-off-by: Xiang Chen 
Link: 
https://lore.kernel.org/r/1650100104-130737-1-git-send-email-chenxian...@hisilicon.com
Signed-off-by: Alex Williamson 
---
 hw/vfio/common.c |7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 2b1f78fdfaeb..6065834717eb 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1544,11 +1544,10 @@ static int vfio_setup_region_sparse_mmaps(VFIORegion 
*region,
 region->mmaps = g_new0(VFIOMmap, sparse->nr_areas);
 
 for (i = 0, j = 0; i < sparse->nr_areas; i++) {
-trace_vfio_region_sparse_mmap_entry(i, sparse->areas[i].offset,
-sparse->areas[i].offset +
-sparse->areas[i].size);
-
 if (sparse->areas[i].size) {
+trace_vfio_region_sparse_mmap_entry(i, sparse->areas[i].offset,
+sparse->areas[i].offset +
+sparse->areas[i].size - 1);
 region->mmaps[j].offset = sparse->areas[i].offset;
 region->mmaps[j].size = sparse->areas[i].size;
 j++;





[PULL 07/11] sysemu: tpm: Add a stub function for TPM_IS_CRB

2022-05-06 Thread Alex Williamson
From: Eric Auger 

In a subsequent patch, VFIO will need to recognize if
a memory region owner is a TPM CRB device. Hence VFIO
needs to use TPM_IS_CRB() even if CONFIG_TPM is unset. So
let's add a stub function.

Signed-off-by: Eric Auger 
Suggested-by: Cornelia Huck 
Reviewed-by: Stefan Berger 
Link: https://lore.kernel.org/r/20220506132510.1847942-2-eric.au...@redhat.com
Signed-off-by: Alex Williamson 
---
 include/sysemu/tpm.h |6 ++
 1 file changed, 6 insertions(+)

diff --git a/include/sysemu/tpm.h b/include/sysemu/tpm.h
index 68b2206463c5..fb40e30ff60e 100644
--- a/include/sysemu/tpm.h
+++ b/include/sysemu/tpm.h
@@ -80,6 +80,12 @@ static inline TPMVersion tpm_get_version(TPMIf *ti)
 #define tpm_init()  (0)
 #define tpm_cleanup()
 
+/* needed for an alignment check in non-tpm code */
+static inline Object *TPM_IS_CRB(Object *obj)
+{
+ return NULL;
+}
+
 #endif /* CONFIG_TPM */
 
 #endif /* QEMU_TPM_H */





[PULL 03/11] vfio: simplify the failure path in vfio_msi_enable

2022-05-06 Thread Alex Williamson
From: Longpeng(Mike) 

Use vfio_msi_disable_common to simplify the error handling
in vfio_msi_enable.

Signed-off-by: Longpeng(Mike) 
Link: https://lore.kernel.org/r/20220326060226.1892-4-longpe...@huawei.com
Signed-off-by: Alex Williamson 
---
 hw/vfio/pci.c |   16 ++--
 1 file changed, 2 insertions(+), 14 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index b3c27c22aaeb..50562629ea8f 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -47,6 +47,7 @@
 
 static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
 static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
+static void vfio_msi_disable_common(VFIOPCIDevice *vdev);
 
 /*
  * Disabling BAR mmaping can be slow, but toggling it around INTx can
@@ -658,24 +659,12 @@ retry:
  "MSI vectors, retry with %d", vdev->nr_vectors, ret);
 }
 
-for (i = 0; i < vdev->nr_vectors; i++) {
-VFIOMSIVector *vector = >msi_vectors[i];
-if (vector->virq >= 0) {
-vfio_remove_kvm_msi_virq(vector);
-}
-qemu_set_fd_handler(event_notifier_get_fd(>interrupt),
-NULL, NULL, NULL);
-event_notifier_cleanup(>interrupt);
-}
-
-g_free(vdev->msi_vectors);
-vdev->msi_vectors = NULL;
+vfio_msi_disable_common(vdev);
 
 if (ret > 0) {
 vdev->nr_vectors = ret;
 goto retry;
 }
-vdev->nr_vectors = 0;
 
 /*
  * Failing to setup MSI doesn't really fall within any specification.
@@ -683,7 +672,6 @@ retry:
  * out to fall back to INTx for this device.
  */
 error_report("vfio: Error: Failed to enable MSI");
-vdev->interrupt = VFIO_INT_NONE;
 
 return;
 }





[PULL 09/11] hw/vfio/pci: fix vfio_pci_hot_reset_result trace point

2022-05-06 Thread Alex Williamson
From: Eric Auger 

"%m" format specifier is not interpreted by the trace infrastructure
and thus "%m" is output instead of the actual errno string. Fix it by
outputting strerror(errno).

Signed-off-by: Eric Auger 
Signed-off-by: Yi Liu 
Link: https://lore.kernel.org/r/20220502094223.36384-2-yi.l@intel.com
[aw: replace commit log as provided by Eric]
Signed-off-by: Alex Williamson 
---
 hw/vfio/pci.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index ef9d7bf326de..cb912bd3f4b2 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2380,7 +2380,7 @@ static int vfio_pci_hot_reset(VFIOPCIDevice *vdev, bool 
single)
 g_free(reset);
 
 trace_vfio_pci_hot_reset_result(vdev->vbasedev.name,
-ret ? "%m" : "Success");
+ret ? strerror(errno) : "Success");
 
 out:
 /* Re-enable INTx on affected devices */





[PULL 05/11] vfio: defer to commit kvm irq routing when enable msi/msix

2022-05-06 Thread Alex Williamson
From: Longpeng(Mike) 

In migration resume phase, all unmasked msix vectors need to be
setup when loading the VF state. However, the setup operation would
take longer if the VM has more VFs and each VF has more unmasked
vectors.

The hot spot is kvm_irqchip_commit_routes, it'll scan and update
all irqfds that are already assigned each invocation, so more
vectors means need more time to process them.

vfio_pci_load_config
  vfio_msix_enable
msix_set_vector_notifiers
  for (vector = 0; vector < dev->msix_entries_nr; vector++) {
vfio_msix_vector_do_use
  vfio_add_kvm_msi_virq
kvm_irqchip_commit_routes <-- expensive
  }

We can reduce the cost by only committing once outside the loop.
The routes are cached in kvm_state, we commit them first and then
bind irqfd for each vector.

The test VM has 128 vcpus and 8 VF (each one has 65 vectors),
we measure the cost of the vfio_msix_enable for each VF, and
we can see 90+% costs can be reduce.

VF  Count of irqfds[*]  OriginalWith this patch

1st   658   2
2nd   130   15  2
3rd   195   22  2
4th   260   24  3
5th   325   36  2
6th   390   44  3
7th   455   51  3
8th   520   58  4
Total   258ms   21ms

[*] Count of irqfds
How many irqfds that already assigned and need to process in this
round.

The optimization can be applied to msi type too.

Signed-off-by: Longpeng(Mike) 
Link: https://lore.kernel.org/r/20220326060226.1892-6-longpe...@huawei.com
Signed-off-by: Alex Williamson 
---
 hw/vfio/pci.c |  130 +++--
 hw/vfio/pci.h |2 +
 2 files changed, 99 insertions(+), 33 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 8bc36f081afd..ef9d7bf326de 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -45,6 +45,9 @@
 
 #define TYPE_VFIO_PCI_NOHOTPLUG "vfio-pci-nohotplug"
 
+/* Protected by BQL */
+static KVMRouteChange vfio_route_change;
+
 static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
 static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
 static void vfio_msi_disable_common(VFIOPCIDevice *vdev);
@@ -413,33 +416,36 @@ static int vfio_enable_vectors(VFIOPCIDevice *vdev, bool 
msix)
 static void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
   int vector_n, bool msix)
 {
-KVMRouteChange c;
-int virq;
-
 if ((msix && vdev->no_kvm_msix) || (!msix && vdev->no_kvm_msi)) {
 return;
 }
 
-if (event_notifier_init(>kvm_interrupt, 0)) {
+vector->virq = kvm_irqchip_add_msi_route(_route_change,
+ vector_n, >pdev);
+}
+
+static void vfio_connect_kvm_msi_virq(VFIOMSIVector *vector)
+{
+if (vector->virq < 0) {
 return;
 }
 
-c = kvm_irqchip_begin_route_changes(kvm_state);
-virq = kvm_irqchip_add_msi_route(, vector_n, >pdev);
-if (virq < 0) {
-event_notifier_cleanup(>kvm_interrupt);
-return;
+if (event_notifier_init(>kvm_interrupt, 0)) {
+goto fail_notifier;
 }
-kvm_irqchip_commit_route_changes();
 
 if (kvm_irqchip_add_irqfd_notifier_gsi(kvm_state, >kvm_interrupt,
-   NULL, virq) < 0) {
-kvm_irqchip_release_virq(kvm_state, virq);
-event_notifier_cleanup(>kvm_interrupt);
-return;
+   NULL, vector->virq) < 0) {
+goto fail_kvm;
 }
 
-vector->virq = virq;
+return;
+
+fail_kvm:
+event_notifier_cleanup(>kvm_interrupt);
+fail_notifier:
+kvm_irqchip_release_virq(kvm_state, vector->virq);
+vector->virq = -1;
 }
 
 static void vfio_remove_kvm_msi_virq(VFIOMSIVector *vector)
@@ -494,7 +500,14 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, 
unsigned int nr,
 }
 } else {
 if (msg) {
-vfio_add_kvm_msi_virq(vdev, vector, nr, true);
+if (vdev->defer_kvm_irq_routing) {
+vfio_add_kvm_msi_virq(vdev, vector, nr, true);
+} else {
+vfio_route_change = kvm_irqchip_begin_route_changes(kvm_state);
+vfio_add_kvm_msi_virq(vdev, vector, nr, true);
+kvm_irqchip_commit_route_changes(_route_change);
+vfio_connect_kvm_msi_virq(vector);
+}
 }
 }
 
@@ -504,11 +517,13 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, 
unsigned int nr,
  * increase them as needed.
  */
 if (vdev->nr_vectors < nr + 1) {
-vfio_disable_irqindex(>vbasedev, VFIO_PCI_MSIX_IRQ_INDEX);
 vdev->nr_vectors = nr + 1;
-ret = 

[PULL 02/11] vfio: move re-enabling INTX out of the common helper

2022-05-06 Thread Alex Williamson
From: Longpeng(Mike) 

Move re-enabling INTX out, and the callers should decide to
re-enable it or not.

Signed-off-by: Longpeng(Mike) 
Link: https://lore.kernel.org/r/20220326060226.1892-3-longpe...@huawei.com
Signed-off-by: Alex Williamson 
---
 hw/vfio/pci.c |   17 +++--
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index cab1a6ef57f1..b3c27c22aaeb 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -693,7 +693,6 @@ retry:
 
 static void vfio_msi_disable_common(VFIOPCIDevice *vdev)
 {
-Error *err = NULL;
 int i;
 
 for (i = 0; i < vdev->nr_vectors; i++) {
@@ -712,15 +711,11 @@ static void vfio_msi_disable_common(VFIOPCIDevice *vdev)
 vdev->msi_vectors = NULL;
 vdev->nr_vectors = 0;
 vdev->interrupt = VFIO_INT_NONE;
-
-vfio_intx_enable(vdev, );
-if (err) {
-error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
-}
 }
 
 static void vfio_msix_disable(VFIOPCIDevice *vdev)
 {
+Error *err = NULL;
 int i;
 
 msix_unset_vector_notifiers(>pdev);
@@ -741,6 +736,10 @@ static void vfio_msix_disable(VFIOPCIDevice *vdev)
 }
 
 vfio_msi_disable_common(vdev);
+vfio_intx_enable(vdev, );
+if (err) {
+error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
+}
 
 memset(vdev->msix->pending, 0,
BITS_TO_LONGS(vdev->msix->entries) * sizeof(unsigned long));
@@ -750,8 +749,14 @@ static void vfio_msix_disable(VFIOPCIDevice *vdev)
 
 static void vfio_msi_disable(VFIOPCIDevice *vdev)
 {
+Error *err = NULL;
+
 vfio_disable_irqindex(>vbasedev, VFIO_PCI_MSI_IRQ_INDEX);
 vfio_msi_disable_common(vdev);
+vfio_intx_enable(vdev, );
+if (err) {
+error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
+}
 
 trace_vfio_msi_disable(vdev->vbasedev.name);
 }





[PULL 04/11] Revert "vfio: Avoid disabling and enabling vectors repeatedly in VFIO migration"

2022-05-06 Thread Alex Williamson
From: Longpeng(Mike) 

Commit ecebe53fe993 ("vfio: Avoid disabling and enabling vectors
repeatedly in VFIO migration") avoids inefficiently disabling and
enabling vectors repeatedly and lets the unmasked vectors be enabled
one by one.

But we want to batch multiple routes and defer the commit, and only
commit once outside the loop of setting vector notifiers, so we
cannot enable the vectors one by one in the loop now.

Revert that commit and we will take another way in the next patch,
it can not only avoid disabling/enabling vectors repeatedly, but
also satisfy our requirement of defer to commit.

Signed-off-by: Longpeng(Mike) 
Link: https://lore.kernel.org/r/20220326060226.1892-5-longpe...@huawei.com
Signed-off-by: Alex Williamson 
---
 hw/vfio/pci.c |   20 +++-
 1 file changed, 3 insertions(+), 17 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 50562629ea8f..8bc36f081afd 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -572,9 +572,6 @@ static void vfio_msix_vector_release(PCIDevice *pdev, 
unsigned int nr)
 
 static void vfio_msix_enable(VFIOPCIDevice *vdev)
 {
-PCIDevice *pdev = >pdev;
-unsigned int nr, max_vec = 0;
-
 vfio_disable_interrupts(vdev);
 
 vdev->msi_vectors = g_new0(VFIOMSIVector, vdev->msix->entries);
@@ -593,22 +590,11 @@ static void vfio_msix_enable(VFIOPCIDevice *vdev)
  * triggering to userspace, then immediately release the vector, leaving
  * the physical device with no vectors enabled, but MSI-X enabled, just
  * like the guest view.
- * If there are already unmasked vectors (in migration resume phase and
- * some guest startups) which will be enabled soon, we can allocate all
- * of them here to avoid inefficiently disabling and enabling vectors
- * repeatedly later.
  */
-if (!pdev->msix_function_masked) {
-for (nr = 0; nr < msix_nr_vectors_allocated(pdev); nr++) {
-if (!msix_is_masked(pdev, nr)) {
-max_vec = nr;
-}
-}
-}
-vfio_msix_vector_do_use(pdev, max_vec, NULL, NULL);
-vfio_msix_vector_release(pdev, max_vec);
+vfio_msix_vector_do_use(>pdev, 0, NULL, NULL);
+vfio_msix_vector_release(>pdev, 0);
 
-if (msix_set_vector_notifiers(pdev, vfio_msix_vector_use,
+if (msix_set_vector_notifiers(>pdev, vfio_msix_vector_use,
   vfio_msix_vector_release, NULL)) {
 error_report("vfio: msix_set_vector_notifiers failed");
 }





[PULL 01/11] vfio: simplify the conditional statements in vfio_msi_enable

2022-05-06 Thread Alex Williamson
From: Longpeng(Mike) 

It's unnecessary to test against the specific return value of
VFIO_DEVICE_SET_IRQS, since any positive return is an error
indicating the number of vectors we should retry with.

Signed-off-by: Longpeng(Mike) 
Link: https://lore.kernel.org/r/20220326060226.1892-2-longpe...@huawei.com
Signed-off-by: Alex Williamson 
---
 hw/vfio/pci.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 9fd9faee1d14..cab1a6ef57f1 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -653,7 +653,7 @@ retry:
 if (ret) {
 if (ret < 0) {
 error_report("vfio: Error: Failed to setup MSI fds: %m");
-} else if (ret != vdev->nr_vectors) {
+} else {
 error_report("vfio: Error: Failed to enable %d "
  "MSI vectors, retry with %d", vdev->nr_vectors, ret);
 }
@@ -671,7 +671,7 @@ retry:
 g_free(vdev->msi_vectors);
 vdev->msi_vectors = NULL;
 
-if (ret > 0 && ret != vdev->nr_vectors) {
+if (ret > 0) {
 vdev->nr_vectors = ret;
 goto retry;
 }





[PULL 00/11] Series short description

2022-05-06 Thread Alex Williamson
Switching to gitlab for pull requests to take advantage of the CI.
Sorry for the delay in some of these.  Thanks,

Alex

The following changes since commit 31abf61c4929a91275fe32f1fafe6e6b3e840b2a:

  Merge tag 'pull-ppc-20220505' of https://gitlab.com/danielhb/qemu into 
staging (2022-05-05 13:52:22 -0500)

are available in the Git repository at:

  https://gitlab.com/alex.williamson/qemu.git tags/vfio-updates-20220506.1

for you to fetch changes up to 44ee6aaae0c937abb631e57a9853c2cdef2bc9bb:

  vfio/common: Rename VFIOGuestIOMMU::iommu into ::iommu_mr (2022-05-06 
09:06:51 -0600)


VFIO updates 2022-05-06

 * Defer IRQ routing commits to improve setup and resume latency (Longpeng)

 * Fix trace sparse mmap boundary condition (Xiang Chen)

 * Quiet misalignment warning from TPM device mapping (Eric Auger)

 * Misc cleanups (Yi Liu, Eric Auger)


Eric Auger (4):
  sysemu: tpm: Add a stub function for TPM_IS_CRB
  vfio/common: remove spurious tpm-crb-cmd misalignment warning
  hw/vfio/pci: fix vfio_pci_hot_reset_result trace point
  vfio/pci: Use vbasedev local variable in vfio_realize()

Longpeng (Mike) (5):
  vfio: simplify the conditional statements in vfio_msi_enable
  vfio: move re-enabling INTX out of the common helper
  vfio: simplify the failure path in vfio_msi_enable
  Revert "vfio: Avoid disabling and enabling vectors repeatedly in VFIO 
migration"
  vfio: defer to commit kvm irq routing when enable msi/msix

Xiang Chen (1):
  vfio/common: Fix a small boundary issue of a trace

Yi Liu (1):
  vfio/common: Rename VFIOGuestIOMMU::iommu into ::iommu_mr

 hw/vfio/common.c  |  50 ++---
 hw/vfio/pci.c | 234 +-
 hw/vfio/pci.h |   2 +
 hw/vfio/trace-events  |   1 +
 include/hw/vfio/vfio-common.h |   2 +-
 include/sysemu/tpm.h  |   6 ++
 6 files changed, 186 insertions(+), 109 deletions(-)




Re: [PATCH v8 15/17] vfio-user: handle device interrupts

2022-05-05 Thread Alex Williamson
On Thu, 28 Apr 2022 10:54:04 +0100
Stefan Hajnoczi  wrote:

> On Mon, Apr 25, 2022 at 05:40:01PM +, Jag Raman wrote:
> > > On Apr 25, 2022, at 6:27 AM, Stefan Hajnoczi  wrote:
> > > 
> > > On Tue, Apr 19, 2022 at 04:44:20PM -0400, Jagannathan Raman wrote:  
> > >> +static MSIMessage vfu_object_msi_prepare_msg(PCIDevice *pci_dev,
> > >> + unsigned int vector)
> > >> +{
> > >> +MSIMessage msg;
> > >> +
> > >> +msg.address = 0;
> > >> +msg.data = vector;
> > >> +
> > >> +return msg;
> > >> +}
> > >> +
> > >> +static void vfu_object_msi_trigger(PCIDevice *pci_dev, MSIMessage msg)
> > >> +{
> > >> +vfu_ctx_t *vfu_ctx = pci_dev->irq_opaque;
> > >> +
> > >> +vfu_irq_trigger(vfu_ctx, msg.data);
> > >> +}  
> > > 
> > > Why did you switch to vfu_object_msi_prepare_msg() +
> > > vfu_object_msi_trigger() in this revision?  
> > 
> > We previously did not do this switch because the server didn’t get updates
> > to the MSIx table & PBA.
> > 
> > The latest client version (which is not part of this series) forwards 
> > accesses
> > to the MSIx table & PBA over to the server. It also reads the PBA set by the
> > server. These change make it possible for the server to make this switch.  
> 
> Interesting. That's different from kernel VFIO. Before vfio-user commits
> to a new approach it would be worth checking with Alex that he agrees
> with the design.
> 
> I remember sending an email asking about why VFIO MSI-X PBA does not
> offer the full semantics described in the PCIe spec but didn't get a
> response from Alex (Message-Id:
> YkMWp0lUJAHhivJA@stefanha-x1.localdomain).

IIUC, the question is why we redirect the MSI-X interrupt from the KVM
irqfd to be handled in QEMU when the vector is masked.  This is largely
to work around the fact that we haven't had a means to implement mask
and unmask in the kernel, therefore we leave the vector enabled and
only enable the emulated PBA if a masked vector fires.  This works
because nobody really cares about the PBA, nor operates in a mode where
vectors are masked and the PBA is polled.  Drivers that understand the
device likely have better places to poll for service requests than the
PBA.

Ideally, masking a vector would make use of the existing mask and
unmask uAPI via the SET_IRQS ioctl, but we haven't been able to
implement this due to lack of internal kernel APIs to support it.  We
may have those interfaces now, but lacking bandwidth, I haven't checked
recently and we seem to be getting by ok as is.  Thanks,

Alex




Re: [Patch 1/3] hw/vfio/pci: fix vfio_pci_hot_reset_result trace point

2022-05-02 Thread Alex Williamson
On Mon,  2 May 2022 02:42:21 -0700
Yi Liu  wrote:

> From: Eric Auger 
> 
> Properly output the errno string.

More explanation please, why is it broken and how does this fix it?
Thanks,

Alex
 
> Signed-off-by: Eric Auger 
> Signed-off-by: Yi Liu 
> ---
>  hw/vfio/pci.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 9fd9faee1d..4a66376be6 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -2337,7 +2337,7 @@ static int vfio_pci_hot_reset(VFIOPCIDevice *vdev, bool 
> single)
>  g_free(reset);
>  
>  trace_vfio_pci_hot_reset_result(vdev->vbasedev.name,
> -ret ? "%m" : "Success");
> +ret ? strerror(errno) : "Success");
>  
>  out:
>  /* Re-enable INTx on affected devices */




Re: [PATCH v4] vfio/common: remove spurious tpm-crb-cmd misalignment warning

2022-04-28 Thread Alex Williamson
On Thu, 28 Apr 2022 15:49:45 +0200
Eric Auger  wrote:

> The CRB command buffer currently is a RAM MemoryRegion and given
> its base address alignment, it causes an error report on
> vfio_listener_region_add(). This region could have been a RAM device
> region, easing the detection of such safe situation but this option
> was not well received. So let's add a helper function that uses the
> memory region owner type to detect the situation is safe wrt
> the assignment. Other device types can be checked here if such kind
> of problem occurs again.
> 
> Signed-off-by: Eric Auger 
> Reviewed-by: Philippe Mathieu-Daudé 
> Acked-by: Stefan Berger 
> Reviewed-by: Cornelia Huck 
> 
> ---
> 
> v3 -> v4:
> - rebase on top of qemu_real_host_page_size() and
>   qemu_real_host_page_size(). Print the size and make the message
>   consistent
> - Added Stefan's A-b and Connie R-b (despite the changes)
> ---
>  hw/vfio/common.c | 27 ++-
>  hw/vfio/trace-events |  1 +
>  2 files changed, 27 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 2b1f78fdfa..f6b9bb6d71 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -40,6 +40,7 @@
>  #include "trace.h"
>  #include "qapi/error.h"
>  #include "migration/migration.h"
> +#include "sysemu/tpm.h"
>  
>  VFIOGroupList vfio_group_list =
>  QLIST_HEAD_INITIALIZER(vfio_group_list);
> @@ -861,6 +862,22 @@ static void 
> vfio_unregister_ram_discard_listener(VFIOContainer *container,
>  g_free(vrdl);
>  }
>  
> +static bool vfio_known_safe_misalignment(MemoryRegionSection *section)
> +{
> +MemoryRegion *mr = section->mr;
> +
> +if (!TPM_IS_CRB(mr->owner)) {
> +return false;
> +}

It looks like this test is going to need to be wrapped in #ifdef
CONFIG_TPM:

https://gitlab.com/alex.williamson/qemu/-/jobs/2391952412

Thanks,

Alex

> +
> +/* this is a known safe misaligned region, just trace for debug purpose 
> */
> +trace_vfio_known_safe_misalignment(memory_region_name(mr),
> +   section->offset_within_address_space,
> +   section->offset_within_region,
> +   qemu_real_host_page_size());
> +return true;
> +}
> +
>  static void vfio_listener_region_add(MemoryListener *listener,
>   MemoryRegionSection *section)
>  {
> @@ -884,7 +901,15 @@ static void vfio_listener_region_add(MemoryListener 
> *listener,
>  if (unlikely((section->offset_within_address_space &
>~qemu_real_host_page_mask()) !=
>   (section->offset_within_region & 
> ~qemu_real_host_page_mask( {
> -error_report("%s received unaligned region", __func__);
> +if (!vfio_known_safe_misalignment(section)) {
> +error_report("%s received unaligned region %s iova=0x%"PRIx64
> + " offset_within_region=0x%"PRIx64
> + " qemu_real_host_page_size=0x%"PRIxPTR,
> + __func__, memory_region_name(section->mr),
> + section->offset_within_address_space,
> + section->offset_within_region,
> + qemu_real_host_page_size());
> +}
>  return;
>  }
>  
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 0ef1b5f4a6..582882db91 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -100,6 +100,7 @@ vfio_listener_region_add_skip(uint64_t start, uint64_t 
> end) "SKIPPING region_add
>  vfio_spapr_group_attach(int groupfd, int tablefd) "Attached groupfd %d to 
> liobn fd %d"
>  vfio_listener_region_add_iommu(uint64_t start, uint64_t end) "region_add 
> [iommu] 0x%"PRIx64" - 0x%"PRIx64
>  vfio_listener_region_add_ram(uint64_t iova_start, uint64_t iova_end, void 
> *vaddr) "region_add [ram] 0x%"PRIx64" - 0x%"PRIx64" [%p]"
> +vfio_known_safe_misalignment(const char *name, uint64_t iova, uint64_t 
> offset_within_region, uintptr_t page_size) "Region \"%s\" iova=0x%"PRIx64" 
> offset_within_region=0x%"PRIx64" qemu_real_host_page_size=0x%"PRIxPTR ": 
> cannot be mapped for DMA"
>  vfio_listener_region_add_no_dma_map(const char *name, uint64_t iova, 
> uint64_t size, uint64_t page_size) "Region \"%s\" 0x%"PRIx64" 
> size=0x%"PRIx64" is not aligned to 0x%"PRIx64" and cannot be mapped for DMA"
>  vfio_listener_region_del_skip(uint64_t start, uint64_t end) "SKIPPING 
> region_del 0x%"PRIx64" - 0x%"PRIx64
>  vfio_listener_region_del(uint64_t start, uint64_t end) "region_del 
> 0x%"PRIx64" - 0x%"PRIx64




Re: [RFC 00/18] vfio: Adopt iommufd

2022-04-28 Thread Alex Williamson
On Thu, 28 Apr 2022 03:21:45 +
"Tian, Kevin"  wrote:

> > From: Alex Williamson 
> > Sent: Wednesday, April 27, 2022 12:22 AM  
> > > >
> > > > My expectation would be that libvirt uses:
> > > >
> > > >  -object iommufd,id=iommufd0,fd=NNN
> > > >  -device vfio-pci,fd=MMM,iommufd=iommufd0
> > > >
> > > > Whereas simple QEMU command line would be:
> > > >
> > > >  -object iommufd,id=iommufd0
> > > >  -device vfio-pci,iommufd=iommufd0,host=:02:00.0
> > > >
> > > > The iommufd object would open /dev/iommufd itself.  Creating an
> > > > implicit iommufd object is someone problematic because one of the
> > > > things I forgot to highlight in my previous description is that the
> > > > iommufd object is meant to be shared across not only various vfio
> > > > devices (platform, ccw, ap, nvme, etc), but also across subsystems, ex.
> > > > vdpa.  
> > >
> > > Out of curiosity - in concept one iommufd is sufficient to support all
> > > ioas requirements across subsystems while having multiple iommufd's
> > > instead lose the benefit of centralized accounting. The latter will also
> > > cause some trouble when we start virtualizing ENQCMD which requires
> > > VM-wide PASID virtualization thus further needs to share that
> > > information across iommufd's. Not unsolvable but really no gain by
> > > adding such complexity. So I'm curious whether Qemu provide
> > > a way to restrict that certain object type can only have one instance
> > > to discourage such multi-iommufd attempt?  
> > 
> > I don't see any reason for QEMU to restrict iommufd objects.  The QEMU
> > philosophy seems to be to let users create whatever configuration they
> > want.  For libvirt though, the assumption would be that a single
> > iommufd object can be used across subsystems, so libvirt would never
> > automatically create multiple objects.  
> 
> I like the flexibility what the objection approach gives in your proposal.
> But with the said complexity in mind (with no foreseen benefit), I wonder

What's the actual complexity?  Front-end/backend splits are very common
in QEMU.  We're making the object connection via name, why is it
significantly more complicated to allow multiple iommufd objects?  On
the contrary, it seems to me that we'd need to go out of our way to add
code to block multiple iommufd objects.

> whether an alternative approach which treats iommufd as a global
> property instead of an object is acceptable in Qemu, i.e.:
> 
> -iommufd on/off
> -device vfio-pci,iommufd,[fd=MMM/host=:02:00.0]
> 
> All devices with iommufd specified then implicitly share a single iommufd
> object within Qemu.

QEMU requires key-value pairs AFAIK, so the above doesn't work, then
we're just back to the iommufd=on/off.
 
> This still allows vfio devices to be specified via fd but just requires 
> Libvirt
> to grant file permission on /dev/iommu. Is it a worthwhile tradeoff to be
> considered or just not a typical way in Qemu philosophy e.g. any object
> associated with a device must be explicitly specified?

Avoiding QEMU opening files was a significant focus of my alternate
proposal.  Also note that we must be able to support hotplug, so we
need to be able to dynamically add and remove the iommufd object, I
don't see that a global property allows for that.  Implicit
associations of devices to shared resources doesn't seem particularly
desirable to me.  Thanks,

Alex




Re: [RFC 15/18] vfio/iommufd: Implement iommufd backend

2022-04-26 Thread Alex Williamson
On Tue, 26 Apr 2022 16:27:03 -0300
Jason Gunthorpe  wrote:

> On Tue, Apr 26, 2022 at 12:45:41PM -0600, Alex Williamson wrote:
> > On Tue, 26 Apr 2022 11:11:56 -0300
> > Jason Gunthorpe  wrote:
> >   
> > > On Tue, Apr 26, 2022 at 10:08:30PM +0800, Yi Liu wrote:
> > >   
> > > > > I think it is strange that the allowed DMA a guest can do depends on
> > > > > the order how devices are plugged into the guest, and varys from
> > > > > device to device?
> > > > > 
> > > > > IMHO it would be nicer if qemu would be able to read the new reserved
> > > > > regions and unmap the conflicts before hot plugging the new device. We
> > > > > don't have a kernel API to do this, maybe we should have one?
> > > > 
> > > > For userspace drivers, it is fine to do it. For QEMU, it's not quite 
> > > > easy
> > > > since the IOVA is GPA which is determined per the e820 table.
> > > 
> > > Sure, that is why I said we may need a new API to get this data back
> > > so userspace can fix the address map before attempting to attach the
> > > new device. Currently that is not possible at all, the device attach
> > > fails and userspace has no way to learn what addresses are causing
> > > problems.  
> > 
> > We have APIs to get the IOVA ranges, both with legacy vfio and the
> > iommufd RFC, QEMU could compare these, but deciding to remove an
> > existing mapping is not something to be done lightly.   
> 
> Not quite, you can get the IOVA ranges after you attach the device,
> but device attach will fail if the new range restrictions intersect
> with the existing mappings. So we don't have an easy way to learn the
> new range restriction in a way that lets userspace ensure an attach
> will not fail due to reserved ranged overlapping with mappings.
> 
> The best you could do is make a dummy IOAS then attach the device,
> read the mappings, detatch, and then do your unmaps.

Right, the same thing the kernel does currently.

> I'm imagining something like IOMMUFD_DEVICE_GET_RANGES that can be
> called prior to attaching on the device ID.

Something like /sys/kernel/iommu_groups/$GROUP/reserved_regions?

> > We must be absolutely certain that there is no DMA to that range
> > before doing so.  
> 
> Yes, but at the same time if the VM thinks it can DMA to that memory
> then it is quite likely to DMA to it with the new device that doesn't
> have it mapped in the first place.

Sorry, this assertion doesn't make sense to me.  We can't assume a
vIOMMU on x86, so QEMU typically maps the entire VM address space (ie.
device address space == system memory).  Some of those mappings are
likely DMA targets (RAM), but only a tiny fraction of the address space
may actually be used for DMA.  Some of those mappings are exceedingly
unlikely P2P DMA targets (device memory), so we don't consider mapping
failures to be fatal to attaching the device.

If we have a case where a range failed for one device but worked for a
previous, we're in the latter scenario, because we should have failed
the device attach otherwise.  Your assertion would require that there
are existing devices (plural) making use of this mapping and that the
new device is also likely to make use of this mapping.  I have a hard
time believing that evidence exists to support that statement.
 
> It is also a bit odd that the behavior depends on the order the
> devices are installed as if you plug the narrower device first then
> the next device will happily use the narrower ranges, but viceversa
> will get a different result.

P2P use cases are sufficiently rare that this hasn't been an issue.  I
think there's also still a sufficient healthy dose of FUD whether a
system supports P2P that drivers do some validation before relying on
it.
 
> This is why I find it bit strange that qemu doesn't check the
> ranges. eg I would expect that anything declared as memory in the E820
> map has to be mappable to the iommu_domain or the device should not
> attach at all.

You have some interesting assumptions around associating
MemoryRegionSegments from the device AddressSpace to something like an
x86 specific E820 table.  The currently used rule of thumb is that if
we think it's memory, mapping failure is fatal to the device, otherwise
it's not.  If we want each device to have the most complete mapping
possible, then we'd use a container per device, but that implies a lot
of extra overhead.  Instead we try to attach the device to an existing
container within the address space and assume if it was good enough
there, it's good enough here.

> The P2P is a bit trickier, and I know we don't have a good story
> because we lack ACPI descript

Re: [RFC 00/18] vfio: Adopt iommufd

2022-04-26 Thread Alex Williamson
On Tue, 26 Apr 2022 13:42:17 -0300
Jason Gunthorpe  wrote:

> On Tue, Apr 26, 2022 at 10:21:59AM -0600, Alex Williamson wrote:
> > We also need to be able to advise libvirt as to how each iommufd object
> > or user of that object factors into the VM locked memory requirement.
> > When used by vfio-pci, we're only mapping VM RAM, so we'd ask libvirt
> > to set the locked memory limit to the size of VM RAM per iommufd,
> > regardless of the number of devices using a given iommufd.  However, I
> > don't know if all users of iommufd will be exclusively mapping VM RAM.
> > Combinations of devices where some map VM RAM and others map QEMU
> > buffer space could still require some incremental increase per device
> > (I'm not sure if vfio-nvme is such a device).  It seems like heuristics
> > will still be involved even after iommufd solves the per-device
> > vfio-pci locked memory limit issue.  Thanks,  
> 
> If the model is to pass the FD, how about we put a limit on the FD
> itself instead of abusing the locked memory limit?
> 
> We could have a no-way-out ioctl that directly limits the # of PFNs
> covered by iopt_pages inside an iommufd.

FD passing would likely only be the standard for libvirt invoked VMs.
The QEMU vfio-pci device would still parse a host= or sysfsdev= option
when invoked by mortals and associate to use the legacy vfio group
interface or the new vfio device interface based on whether an iommufd
is specified.

Does that rule out your suggestion?  I don't know, please reveal more
about the mechanics of putting a limit on the FD itself and this
no-way-out ioctl.  The latter name suggests to me that I should also
note that we need to support memory hotplug with these devices.  Thanks,

Alex




Re: [RFC 15/18] vfio/iommufd: Implement iommufd backend

2022-04-26 Thread Alex Williamson
On Tue, 26 Apr 2022 11:11:56 -0300
Jason Gunthorpe  wrote:

> On Tue, Apr 26, 2022 at 10:08:30PM +0800, Yi Liu wrote:
> 
> > > I think it is strange that the allowed DMA a guest can do depends on
> > > the order how devices are plugged into the guest, and varys from
> > > device to device?
> > > 
> > > IMHO it would be nicer if qemu would be able to read the new reserved
> > > regions and unmap the conflicts before hot plugging the new device. We
> > > don't have a kernel API to do this, maybe we should have one?  
> > 
> > For userspace drivers, it is fine to do it. For QEMU, it's not quite easy
> > since the IOVA is GPA which is determined per the e820 table.  
> 
> Sure, that is why I said we may need a new API to get this data back
> so userspace can fix the address map before attempting to attach the
> new device. Currently that is not possible at all, the device attach
> fails and userspace has no way to learn what addresses are causing
> problems.

We have APIs to get the IOVA ranges, both with legacy vfio and the
iommufd RFC, QEMU could compare these, but deciding to remove an
existing mapping is not something to be done lightly.  We must be
absolutely certain that there is no DMA to that range before doing so.
 
> > > eg currently I see the log messages that it is passing P2P BAR memory
> > > into iommufd map, this should be prevented inside qemu because it is
> > > not reliable right now if iommufd will correctly reject it.  
> > 
> > yeah. qemu can filter the P2P BAR mapping and just stop it in qemu. We
> > haven't added it as it is something you will add in future. so didn't
> > add it in this RFC. :-) Please let me know if it feels better to filter
> > it from today.  
> 
> I currently hope it will use a different map API entirely and not rely
> on discovering the P2P via the VMA. eg using a DMABUF FD or something.
> 
> So blocking it in qemu feels like the right thing to do.

Wait a sec, so legacy vfio supports p2p between devices, which has a
least a couple known use cases, primarily involving GPUs for at least
one of the peers, and we're not going to make equivalent support a
feature requirement for iommufd?  This would entirely fracture the
notion that iommufd is a direct replacement and upgrade from legacy
vfio and make a transparent transition for libvirt managed VMs
impossible.  Let's reconsider.  Thanks,

Alex




Re: [RFC 00/18] vfio: Adopt iommufd

2022-04-26 Thread Alex Williamson
On Tue, 26 Apr 2022 12:43:35 +
Shameerali Kolothum Thodi  wrote:

> > -Original Message-
> > From: Eric Auger [mailto:eric.au...@redhat.com]
> > Sent: 26 April 2022 12:45
> > To: Shameerali Kolothum Thodi ; Yi
> > Liu ; alex.william...@redhat.com; coh...@redhat.com;
> > qemu-devel@nongnu.org
> > Cc: da...@gibson.dropbear.id.au; th...@redhat.com; far...@linux.ibm.com;
> > mjros...@linux.ibm.com; akrow...@linux.ibm.com; pa...@linux.ibm.com;
> > jjhe...@linux.ibm.com; jasow...@redhat.com; k...@vger.kernel.org;
> > j...@nvidia.com; nicol...@nvidia.com; eric.auger@gmail.com;
> > kevin.t...@intel.com; chao.p.p...@intel.com; yi.y@intel.com;
> > pet...@redhat.com; Zhangfei Gao 
> > Subject: Re: [RFC 00/18] vfio: Adopt iommufd  
> 
> [...]
>  
> > >>  
> > https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6-iommufd_...@nvidia.com  
> > >> /
> > >> [2] https://github.com/luxis1999/iommufd/tree/iommufd-v5.17-rc6
> > >> [3] https://github.com/luxis1999/qemu/tree/qemu-for-5.17-rc6-vm-rfcv1  
> > > Hi,
> > >
> > > I had a go with the above branches on our ARM64 platform trying to  
> > pass-through  
> > > a VF dev, but Qemu reports an error as below,
> > >
> > > [0.444728] hisi_sec2 :00:01.0: enabling device ( -> 0002)
> > > qemu-system-aarch64-iommufd: IOMMU_IOAS_MAP failed: Bad address
> > > qemu-system-aarch64-iommufd: vfio_container_dma_map(0xfeb40ce0,  
> > 0x80, 0x1, 0xb40ef000) = -14 (Bad address)  
> > >
> > > I think this happens for the dev BAR addr range. I haven't debugged the  
> > kernel  
> > > yet to see where it actually reports that.  
> > Does it prevent your assigned device from working? I have such errors
> > too but this is a known issue. This is due to the fact P2P DMA is not
> > supported yet.
> >   
> 
> Yes, the basic tests all good so far. I am still not very clear how it works 
> if
> the map() fails though. It looks like it fails in,
> 
> iommufd_ioas_map()
>   iopt_map_user_pages()
>iopt_map_pages()
>..
>  pfn_reader_pin_pages()
> 
> So does it mean it just works because the page is resident()?

No, it just means that you're not triggering any accesses that require
peer-to-peer DMA support.  Any sort of test where the device is only
performing DMA to guest RAM, which is by far the standard use case,
will work fine.  This also doesn't affect vCPU access to BAR space.
It's only a failure of the mappings of the BAR space into the IOAS,
which is only used when a device tries to directly target another
device's BAR space via DMA.  Thanks,

Alex




Re: [RFC 00/18] vfio: Adopt iommufd

2022-04-26 Thread Alex Williamson
On Tue, 26 Apr 2022 08:37:41 +
"Tian, Kevin"  wrote:

> > From: Alex Williamson 
> > Sent: Monday, April 25, 2022 10:38 PM
> > 
> > On Mon, 25 Apr 2022 11:10:14 +0100
> > Daniel P. Berrangé  wrote:
> >   
> > > On Fri, Apr 22, 2022 at 04:09:43PM -0600, Alex Williamson wrote:  
> > > > [Cc +libvirt folks]
> > > >
> > > > On Thu, 14 Apr 2022 03:46:52 -0700
> > > > Yi Liu  wrote:
> > > >  
> > > > > With the introduction of iommufd[1], the linux kernel provides a  
> > generic  
> > > > > interface for userspace drivers to propagate their DMA mappings to  
> > kernel  
> > > > > for assigned devices. This series does the porting of the VFIO devices
> > > > > onto the /dev/iommu uapi and let it coexist with the legacy  
> > implementation.  
> > > > > Other devices like vpda, vfio mdev and etc. are not considered yet.  
> > >
> > > snip
> > >  
> > > > > The selection of the backend is made on a device basis using the new
> > > > > iommufd option (on/off/auto). By default the iommufd backend is  
> > selected  
> > > > > if supported by the host and by QEMU (iommufd KConfig). This option  
> > is  
> > > > > currently available only for the vfio-pci device. For other types of
> > > > > devices, it does not yet exist and the legacy BE is chosen by 
> > > > > default.  
> > > >
> > > > I've discussed this a bit with Eric, but let me propose a different
> > > > command line interface.  Libvirt generally likes to pass file
> > > > descriptors to QEMU rather than grant it access to those files
> > > > directly.  This was problematic with vfio-pci because libvirt can't
> > > > easily know when QEMU will want to grab another /dev/vfio/vfio
> > > > container.  Therefore we abandoned this approach and instead libvirt
> > > > grants file permissions.
> > > >
> > > > However, with iommufd there's no reason that QEMU ever needs more  
> > than  
> > > > a single instance of /dev/iommufd and we're using per device vfio file
> > > > descriptors, so it seems like a good time to revisit this.  
> > >
> > > I assume access to '/dev/iommufd' gives the process somewhat elevated
> > > privileges, such that you don't want to unconditionally give QEMU
> > > access to this device ?  
> > 
> > It's not that much dissimilar to /dev/vfio/vfio, it's an unprivileged
> > interface which should have limited scope for abuse, but more so here
> > the goal would be to de-privilege QEMU that one step further that it
> > cannot open the device file itself.
> >   
> > > > The interface I was considering would be to add an iommufd object to
> > > > QEMU, so we might have a:
> > > >
> > > > -device iommufd[,fd=#][,id=foo]
> > > >
> > > > For non-libivrt usage this would have the ability to open /dev/iommufd
> > > > itself if an fd is not provided.  This object could be shared with
> > > > other iommufd users in the VM and maybe we'd allow multiple instances
> > > > for more esoteric use cases.  [NB, maybe this should be a -object 
> > > > rather  
> > than  
> > > > -device since the iommufd is not a guest visible device?]  
> > >
> > > Yes,  -object would be the right answer for something that's purely
> > > a host side backend impl selector.
> > >  
> > > > The vfio-pci device might then become:
> > > >
> > > > -device vfio-  
> > pci[,host=:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=f
> > oo]  
> > > >
> > > > So essentially we can specify the device via host, sysfsdev, or passing
> > > > an fd to the vfio device file.  When an iommufd object is specified,
> > > > "foo" in the example above, each of those options would use the
> > > > vfio-device access mechanism, essentially the same as iommufd=on in
> > > > your example.  With the fd passing option, an iommufd object would be
> > > > required and necessarily use device level access.
> > > >
> > > > In your example, the iommufd=auto seems especially troublesome for
> > > > libvirt because QEMU is going to have different locked memory
> > > > requirements based on whether we're using type1 or iommufd, where  
> > the  
> > > > latter resolves the

Re: [RFC 00/18] vfio: Adopt iommufd

2022-04-25 Thread Alex Williamson
On Mon, 25 Apr 2022 22:23:05 +0200
Eric Auger  wrote:

> Hi Alex,
> 
> On 4/23/22 12:09 AM, Alex Williamson wrote:
> > [Cc +libvirt folks]
> >
> > On Thu, 14 Apr 2022 03:46:52 -0700
> > Yi Liu  wrote:
> >  
> >> With the introduction of iommufd[1], the linux kernel provides a generic
> >> interface for userspace drivers to propagate their DMA mappings to kernel
> >> for assigned devices. This series does the porting of the VFIO devices
> >> onto the /dev/iommu uapi and let it coexist with the legacy implementation.
> >> Other devices like vpda, vfio mdev and etc. are not considered yet.
> >>
> >> For vfio devices, the new interface is tied with device fd and iommufd
> >> as the iommufd solution is device-centric. This is different from legacy
> >> vfio which is group-centric. To support both interfaces in QEMU, this
> >> series introduces the iommu backend concept in the form of different
> >> container classes. The existing vfio container is named legacy container
> >> (equivalent with legacy iommu backend in this series), while the new
> >> iommufd based container is named as iommufd container (may also be 
> >> mentioned
> >> as iommufd backend in this series). The two backend types have their own
> >> way to setup secure context and dma management interface. Below diagram
> >> shows how it looks like with both BEs.
> >>
> >> VFIO   AddressSpace/Memory
> >> +---+  +--+  +-+  +-+
> >> |  pci  |  | platform |  |  ap |  | ccw |
> >> +---+---+  ++-+  +--+--+  +--+--+ +--+
> >> |   |   |||   AddressSpace   |
> >> |   |   ||++-+
> >> +---V---V---VV+   /
> >> |   VFIOAddressSpace  | <+
> >> |  |  |  MemoryListener
> >> |  VFIOContainer list |
> >> +---+++
> >> ||
> >> ||
> >> +---V--++V--+
> >> |   iommufd||vfio legacy|
> >> |  container   || container |
> >> +---+--+++--+
> >> ||
> >> | /dev/iommu | /dev/vfio/vfio
> >> | /dev/vfio/devices/vfioX| /dev/vfio/$group_id
> >>  Userspace  ||
> >>  ===++
> >>  Kernel |  device fd |
> >> +---+| group/container fd
> >> | (BIND_IOMMUFD || (SET_CONTAINER/SET_IOMMU)
> >> |  ATTACH_IOAS) || device fd
> >> |   ||
> >> |   +---VV-+
> >> iommufd |   |vfio  |
> >> (map/unmap  |   +-++---+
> >>  ioas_copy) | || map/unmap
> >> | ||
> >>  +--V--++-V--+  +--V+
> >>  | iommfd core ||  device|  |  vfio iommu   |
> >>  +-+++  +---+
> >>
> >> [Secure Context setup]
> >> - iommufd BE: uses device fd and iommufd to setup secure context
> >>   (bind_iommufd, attach_ioas)
> >> - vfio legacy BE: uses group fd and container fd to setup secure context
> >>   (set_container, set_iommu)
> >> [Device access]
> >> - iommufd BE: device fd is opened through /dev/vfio/devices/vfioX
> >> - vfio legacy BE: device fd is retrieved from group fd ioctl
> >> [DMA Mapping flow]
> >> - VFIOAddressSpace receives MemoryRegion add/del via MemoryListener
> >> - VFIO populates DMA map/unmap via the container BEs
> >>   *) iommufd BE: uses iommufd
> >>   *) vfio legacy BE: uses container fd
> >>
> >> This series qomifies the VFIOContainer object which acts as a base class
> >> for a container. This base 

Re: [PATCH v3 for-7.1] vfio/common: remove spurious tpm-crb-cmd misalignment warning

2022-04-25 Thread Alex Williamson
On Wed, 23 Mar 2022 21:31:19 +0100
Eric Auger  wrote:

> The CRB command buffer currently is a RAM MemoryRegion and given
> its base address alignment, it causes an error report on
> vfio_listener_region_add(). This region could have been a RAM device
> region, easing the detection of such safe situation but this option
> was not well received. So let's add a helper function that uses the
> memory region owner type to detect the situation is safe wrt
> the assignment. Other device types can be checked here if such kind
> of problem occurs again.
> 
> Signed-off-by: Eric Auger 
> Reviewed-by: Philippe Mathieu-Daudé 
> 
> ---
> 
> v2 -> v3:
> - Use TPM_IS_CRB()
> 
> v1 -> v2:
> - do not check the MR name but rather the owner type
> ---
>  hw/vfio/common.c | 27 ++-
>  hw/vfio/trace-events |  1 +
>  2 files changed, 27 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 080046e3f51..55bc116473e 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -40,6 +40,7 @@
>  #include "trace.h"
>  #include "qapi/error.h"
>  #include "migration/migration.h"
> +#include "sysemu/tpm.h"
>  
>  VFIOGroupList vfio_group_list =
>  QLIST_HEAD_INITIALIZER(vfio_group_list);
> @@ -861,6 +862,22 @@ static void 
> vfio_unregister_ram_discard_listener(VFIOContainer *container,
>  g_free(vrdl);
>  }
>  
> +static bool vfio_known_safe_misalignment(MemoryRegionSection *section)
> +{
> +MemoryRegion *mr = section->mr;
> +
> +if (!TPM_IS_CRB(mr->owner)) {
> +return false;
> +}
> +
> +/* this is a known safe misaligned region, just trace for debug purpose 
> */
> +trace_vfio_known_safe_misalignment(memory_region_name(mr),
> +   section->offset_within_address_space,
> +   section->offset_within_region,
> +   qemu_real_host_page_size);

qemu_real_host_page_size and qemu_real_host_page_mask are now functions.

I thought I'd just append "()" in each case, but then the 32-bit build
breaks...

> +return true;
> +}
> +
>  static void vfio_listener_region_add(MemoryListener *listener,
>   MemoryRegionSection *section)
>  {
> @@ -884,7 +901,15 @@ static void vfio_listener_region_add(MemoryListener 
> *listener,
>  if (unlikely((section->offset_within_address_space &
>~qemu_real_host_page_mask) !=
>   (section->offset_within_region & 
> ~qemu_real_host_page_mask))) {
> -error_report("%s received unaligned region", __func__);
> +if (!vfio_known_safe_misalignment(section)) {
> +error_report("%s received unaligned region %s iova=0x%"PRIx64
> + " offset_within_region=0x%"PRIx64
> + " qemu_real_host_page_mask=0x%"PRIxPTR,
> + __func__, memory_region_name(section->mr),
> + section->offset_within_address_space,
> + section->offset_within_region,
> + qemu_real_host_page_mask);

Note how here we're very verbosely printing
"qemu_real_host_page_mask=0x%..." and we're passing the
qemu_real_host_page_mask value.  In the previous trace command we're
passing qemu_real_host_page_size.

> +}
>  return;
>  }
>  
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 0ef1b5f4a65..6f38a2e6991 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -100,6 +100,7 @@ vfio_listener_region_add_skip(uint64_t start, uint64_t 
> end) "SKIPPING region_add
>  vfio_spapr_group_attach(int groupfd, int tablefd) "Attached groupfd %d to 
> liobn fd %d"
>  vfio_listener_region_add_iommu(uint64_t start, uint64_t end) "region_add 
> [iommu] 0x%"PRIx64" - 0x%"PRIx64
>  vfio_listener_region_add_ram(uint64_t iova_start, uint64_t iova_end, void 
> *vaddr) "region_add [ram] 0x%"PRIx64" - 0x%"PRIx64" [%p]"
> +vfio_known_safe_misalignment(const char *name, uint64_t iova, uint64_t 
> offset_within_region, uint64_t page_size) "Region \"%s\" iova=0x%"PRIx64" 
> offset_within_region=0x%"PRIx64" qemu_real_host_page_mask=0x%"PRIxPTR ": 
> cannot be mapped for DMA"

So here we've been passed qemu_real_host_page_size but we're again
printing "qemu_real_host_page_mask=0x%...".  To make things slightly
more complicated, qemu_real_host_page_mask is now an intptr_t, which is
arbitrarily not supported in trace commands, while
qemu_real_host_page_size is a uintptr_t which is supported in trace
commands :-\  I'll let you decide how you want to resolve this.  Thanks,

Alex




Re: [RFC 00/18] vfio: Adopt iommufd

2022-04-25 Thread Alex Williamson
On Mon, 25 Apr 2022 11:10:14 +0100
Daniel P. Berrangé  wrote:

> On Fri, Apr 22, 2022 at 04:09:43PM -0600, Alex Williamson wrote:
> > [Cc +libvirt folks]
> > 
> > On Thu, 14 Apr 2022 03:46:52 -0700
> > Yi Liu  wrote:
> >   
> > > With the introduction of iommufd[1], the linux kernel provides a generic
> > > interface for userspace drivers to propagate their DMA mappings to kernel
> > > for assigned devices. This series does the porting of the VFIO devices
> > > onto the /dev/iommu uapi and let it coexist with the legacy 
> > > implementation.
> > > Other devices like vpda, vfio mdev and etc. are not considered yet.  
> 
> snip
> 
> > > The selection of the backend is made on a device basis using the new
> > > iommufd option (on/off/auto). By default the iommufd backend is selected
> > > if supported by the host and by QEMU (iommufd KConfig). This option is
> > > currently available only for the vfio-pci device. For other types of
> > > devices, it does not yet exist and the legacy BE is chosen by default.  
> > 
> > I've discussed this a bit with Eric, but let me propose a different
> > command line interface.  Libvirt generally likes to pass file
> > descriptors to QEMU rather than grant it access to those files
> > directly.  This was problematic with vfio-pci because libvirt can't
> > easily know when QEMU will want to grab another /dev/vfio/vfio
> > container.  Therefore we abandoned this approach and instead libvirt
> > grants file permissions.
> > 
> > However, with iommufd there's no reason that QEMU ever needs more than
> > a single instance of /dev/iommufd and we're using per device vfio file
> > descriptors, so it seems like a good time to revisit this.  
> 
> I assume access to '/dev/iommufd' gives the process somewhat elevated
> privileges, such that you don't want to unconditionally give QEMU
> access to this device ?

It's not that much dissimilar to /dev/vfio/vfio, it's an unprivileged
interface which should have limited scope for abuse, but more so here
the goal would be to de-privilege QEMU that one step further that it
cannot open the device file itself.

> > The interface I was considering would be to add an iommufd object to
> > QEMU, so we might have a:
> > 
> > -device iommufd[,fd=#][,id=foo]
> > 
> > For non-libivrt usage this would have the ability to open /dev/iommufd
> > itself if an fd is not provided.  This object could be shared with
> > other iommufd users in the VM and maybe we'd allow multiple instances
> > for more esoteric use cases.  [NB, maybe this should be a -object rather 
> > than
> > -device since the iommufd is not a guest visible device?]  
> 
> Yes,  -object would be the right answer for something that's purely
> a host side backend impl selector.
> 
> > The vfio-pci device might then become:
> > 
> > -device 
> > vfio-pci[,host=:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=foo]
> > 
> > So essentially we can specify the device via host, sysfsdev, or passing
> > an fd to the vfio device file.  When an iommufd object is specified,
> > "foo" in the example above, each of those options would use the
> > vfio-device access mechanism, essentially the same as iommufd=on in
> > your example.  With the fd passing option, an iommufd object would be
> > required and necessarily use device level access.
> > 
> > In your example, the iommufd=auto seems especially troublesome for
> > libvirt because QEMU is going to have different locked memory
> > requirements based on whether we're using type1 or iommufd, where the
> > latter resolves the duplicate accounting issues.  libvirt needs to know
> > deterministically which backed is being used, which this proposal seems
> > to provide, while at the same time bringing us more in line with fd
> > passing.  Thoughts?  Thanks,  
> 
> Yep, I agree that libvirt needs to have more direct control over this.
> This is also even more important if there are notable feature differences
> in the 2 backends.
> 
> I wonder if anyone has considered an even more distinct impl, whereby
> we have a completely different device type on the backend, eg
> 
>   -device 
> vfio-iommu-pci[,host=:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=foo]
> 
> If a vendor wants to fully remove the legacy impl, they can then use the
> Kconfig mechanism to disable the build of the legacy impl device, while
> keeping the iommu impl (or vica-verca if the new iommu impl isn't considered
> reliable enough for them to support yet).
> 
> Libvirt would use
> 
>-obj

Re: [RFC 00/18] vfio: Adopt iommufd

2022-04-22 Thread Alex Williamson
[Cc +libvirt folks]

On Thu, 14 Apr 2022 03:46:52 -0700
Yi Liu  wrote:

> With the introduction of iommufd[1], the linux kernel provides a generic
> interface for userspace drivers to propagate their DMA mappings to kernel
> for assigned devices. This series does the porting of the VFIO devices
> onto the /dev/iommu uapi and let it coexist with the legacy implementation.
> Other devices like vpda, vfio mdev and etc. are not considered yet.
> 
> For vfio devices, the new interface is tied with device fd and iommufd
> as the iommufd solution is device-centric. This is different from legacy
> vfio which is group-centric. To support both interfaces in QEMU, this
> series introduces the iommu backend concept in the form of different
> container classes. The existing vfio container is named legacy container
> (equivalent with legacy iommu backend in this series), while the new
> iommufd based container is named as iommufd container (may also be mentioned
> as iommufd backend in this series). The two backend types have their own
> way to setup secure context and dma management interface. Below diagram
> shows how it looks like with both BEs.
> 
> VFIO   AddressSpace/Memory
> +---+  +--+  +-+  +-+
> |  pci  |  | platform |  |  ap |  | ccw |
> +---+---+  ++-+  +--+--+  +--+--+ +--+
> |   |   |||   AddressSpace   |
> |   |   ||++-+
> +---V---V---VV+   /
> |   VFIOAddressSpace  | <+
> |  |  |  MemoryListener
> |  VFIOContainer list |
> +---+++
> ||
> ||
> +---V--++V--+
> |   iommufd||vfio legacy|
> |  container   || container |
> +---+--+++--+
> ||
> | /dev/iommu | /dev/vfio/vfio
> | /dev/vfio/devices/vfioX| /dev/vfio/$group_id
>  Userspace  ||
>  ===++
>  Kernel |  device fd |
> +---+| group/container fd
> | (BIND_IOMMUFD || (SET_CONTAINER/SET_IOMMU)
> |  ATTACH_IOAS) || device fd
> |   ||
> |   +---VV-+
> iommufd |   |vfio  |
> (map/unmap  |   +-++---+
>  ioas_copy) | || map/unmap
> | ||
>  +--V--++-V--+  +--V+
>  | iommfd core ||  device|  |  vfio iommu   |
>  +-+++  +---+
> 
> [Secure Context setup]
> - iommufd BE: uses device fd and iommufd to setup secure context
>   (bind_iommufd, attach_ioas)
> - vfio legacy BE: uses group fd and container fd to setup secure context
>   (set_container, set_iommu)
> [Device access]
> - iommufd BE: device fd is opened through /dev/vfio/devices/vfioX
> - vfio legacy BE: device fd is retrieved from group fd ioctl
> [DMA Mapping flow]
> - VFIOAddressSpace receives MemoryRegion add/del via MemoryListener
> - VFIO populates DMA map/unmap via the container BEs
>   *) iommufd BE: uses iommufd
>   *) vfio legacy BE: uses container fd
> 
> This series qomifies the VFIOContainer object which acts as a base class
> for a container. This base class is derived into the legacy VFIO container
> and the new iommufd based container. The base class implements generic code
> such as code related to memory_listener and address space management whereas
> the derived class implements callbacks that depend on the kernel user space
> being used.
> 
> The selection of the backend is made on a device basis using the new
> iommufd option (on/off/auto). By default the iommufd backend is selected
> if supported by the host and by QEMU (iommufd KConfig). This option is
> currently available only for the vfio-pci device. For other types of
> devices, it does not yet exist and the legacy BE is chosen by default.

I've discussed this a bit with Eric, but let me propose a different
command line interface.  Libvirt generally likes to pass file
descriptors to QEMU rather than grant it access to those files
directly.  This was problematic with vfio-pci because libvirt can't
easily know when QEMU will want to grab another /dev/vfio/vfio
container.  Therefore we abandoned 

Re: [RFC 15/18] vfio/iommufd: Implement iommufd backend

2022-04-22 Thread Alex Williamson
On Fri, 22 Apr 2022 11:58:15 -0300
Jason Gunthorpe  wrote:
> 
> I don't see IOMMU_IOAS_IOVA_RANGES called at all, that seems like a
> problem..

Not as much as you might think.  Note that you also won't find QEMU
testing VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE in the QEMU vfio-pci
driver either.  The vfio-nvme driver does because it has control of the
address space it chooses to use, but for vfio-pci the address space is
dictated by the VM and there's not a lot of difference between knowing
in advance that a mapping conflicts with a reserved range or just
trying add the mapping and taking appropriate action if it fails.
Thanks,

Alex




Re: [PATCH] acpi: Bodge acpi_index migration

2022-04-05 Thread Alex Williamson
On Tue,  5 Apr 2022 20:06:58 +0100
"Dr. David Alan Gilbert (git)"  wrote:

> From: "Dr. David Alan Gilbert" 
> 
> The 'acpi_index' field is a statically configured field, which for
> some reason is migrated; this never makes much sense because it's
> command line static.
> 
> However, on piix4 it's conditional, and the condition/test function
> ends up having the wrong pointer passed to it (it gets a PIIX4PMState
> not the AcpiPciHpState it was expecting, because VMSTATE_PCI_HOTPLUG
> is a macro and not another struct).  This means the field is randomly
> loaded/saved based on a random pointer.  In 6.x this random pointer
> randomly seems to get 0 for everyone (!); in 7.0rc it's getting junk
> and trying to load a field that the source didn't send.

FWIW, after some hunting and pecking, 6.2 (64bit):

(gdb) p &((struct AcpiPciHpState *)0)->acpi_index
$1 = (uint32_t *) 0xc04

(gdb) p &((struct PIIX4PMState *)0)->ar.tmr.io.addr
$2 = (hwaddr *) 0xc00

f53faa70bb63:

(gdb) p &((struct AcpiPciHpState *)0)->acpi_index
$1 = (uint32_t *) 0xc04

(gdb) p &((struct PIIX4PMState *)0)->io_gpe.coalesced.tqh_circ.tql_prev
$2 = (struct QTailQLink **) 0xc00

So yeah, it seems 0xc04 will always be part of a pointer on current
mainline.  I can't really speak to the ACPIPMTimer MemoryRegion in the
PIIX4PMState, maybe if there's a hwaddr it's always 32bit and the upper
dword is reliably zero?  Thanks,

Alex

>  The migration
> stream gets out of line and hits the section footer.
> 
> The bodge is on piix4 never to load the field:
>   a) Most 6.x builds never send it, so most of the time the migration
> will work.
>   b) We can backport this fix to 6.x to remove the boobytrap.
>   c) It should never have made a difference anyway since the acpi-index
> is command line configured and should be correct on the destination
> anyway
>   d) ich9 is still sending/receiving this (unconditionally all the time)
> but due to (c) should never notice.  We could follow up to make it
> skip.
> 
> It worries me just when (a) actually happens.
> 
> Fixes: b32bd76 ("pci: introduce acpi-index property for PCI device")
> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/932
> 
> Signed-off-by: Dr. David Alan Gilbert 
> ---
>  hw/acpi/acpi-pci-hotplug-stub.c |  4 
>  hw/acpi/pcihp.c |  6 --
>  hw/acpi/piix4.c | 11 ++-
>  include/hw/acpi/pcihp.h |  2 --
>  4 files changed, 10 insertions(+), 13 deletions(-)
> 
> diff --git a/hw/acpi/acpi-pci-hotplug-stub.c b/hw/acpi/acpi-pci-hotplug-stub.c
> index 734e4c5986..a43f6dafc9 100644
> --- a/hw/acpi/acpi-pci-hotplug-stub.c
> +++ b/hw/acpi/acpi-pci-hotplug-stub.c
> @@ -41,7 +41,3 @@ void acpi_pcihp_reset(AcpiPciHpState *s, bool 
> acpihp_root_off)
>  return;
>  }
>  
> -bool vmstate_acpi_pcihp_use_acpi_index(void *opaque, int version_id)
> -{
> -return false;
> -}
> diff --git a/hw/acpi/pcihp.c b/hw/acpi/pcihp.c
> index 6351bd3424..bf65bbea49 100644
> --- a/hw/acpi/pcihp.c
> +++ b/hw/acpi/pcihp.c
> @@ -554,12 +554,6 @@ void acpi_pcihp_init(Object *owner, AcpiPciHpState *s, 
> PCIBus *root_bus,
> OBJ_PROP_FLAG_READ);
>  }
>  
> -bool vmstate_acpi_pcihp_use_acpi_index(void *opaque, int version_id)
> -{
> - AcpiPciHpState *s = opaque;
> - return s->acpi_index;
> -}
> -
>  const VMStateDescription vmstate_acpi_pcihp_pci_status = {
>  .name = "acpi_pcihp_pci_status",
>  .version_id = 1,
> diff --git a/hw/acpi/piix4.c b/hw/acpi/piix4.c
> index cc37fa3416..48aeedd5f0 100644
> --- a/hw/acpi/piix4.c
> +++ b/hw/acpi/piix4.c
> @@ -267,6 +267,15 @@ static bool piix4_vmstate_need_smbus(void *opaque, int 
> version_id)
>  return pm_smbus_vmstate_needed();
>  }
>  
> +/*
> + * This is a fudge to turn off the acpi_index field, whose
> + * test was always broken on piix4.
> + */
> +static bool vmstate_test_never(void *opaque, int version_id)
> +{
> +return false;
> +}
> +
>  /* qemu-kvm 1.2 uses version 3 but advertised as 2
>   * To support incoming qemu-kvm 1.2 migration, change version_id
>   * and minimum_version_id to 2 below (which breaks migration from
> @@ -297,7 +306,7 @@ static const VMStateDescription vmstate_acpi = {
>  struct AcpiPciHpPciStatus),
>  VMSTATE_PCI_HOTPLUG(acpi_pci_hotplug, PIIX4PMState,
>  vmstate_test_use_acpi_hotplug_bridge,
> -vmstate_acpi_pcihp_use_acpi_index),
> +vmstate_test_never),
>  VMSTATE_END_OF_LIST()
>  },
>  .subsections = (const VMStateDescription*[]) {
> diff --git a/include/hw/acpi/pcihp.h b/include/hw/acpi/pcihp.h
> index af1a169fc3..7e268c2c9c 100644
> --- a/include/hw/acpi/pcihp.h
> +++ b/include/hw/acpi/pcihp.h
> @@ -73,8 +73,6 @@ void acpi_pcihp_reset(AcpiPciHpState *s, bool 
> acpihp_root_off);
>  
>  extern const VMStateDescription vmstate_acpi_pcihp_pci_status;
>  
> -bool 

Re: [PATCH v6 0/5] optimize the downtime for vfio migration

2022-03-28 Thread Alex Williamson
On Sat, 26 Mar 2022 14:02:21 +0800
"Longpeng(Mike)"  wrote:

> From: Longpeng 
> 
> Hi guys,
>  
> In vfio migration resume phase, the cost would increase if the
> vfio device has more unmasked vectors. We try to optimize it in
> this series.
>  
> You can see the commit message in PATCH 6 for details.
>  
> Patch 1-3 are simple cleanups and fixup.
> Patch 4 are the preparations for the optimization.
> Patch 5 optimizes the vfio msix setup path.
> 
> v5: https://lore.kernel.org/all/20211103081657.1945-1-longpe...@huawei.com/T/
> 
> Change v5->v6:
>  - remove the Patch 4("kvm: irqchip: extract 
> kvm_irqchip_add_deferred_msi_route")
> of v5, and use KVMRouteChange API instead. [Paolo, Longpeng]
> 
> Changes v4->v5:
>  - setup the notifier and irqfd in the same function to makes
>the code neater.[Alex]
> 
> Changes v3->v4:
>  - fix several typos and grammatical errors [Alex]
>  - remove the patches that fix and clean the MSIX common part
>from this series [Alex]
>  - Patch 6:
> - use vector->use directly and fill it with -1 on error
>   paths [Alex]
> - add comment before enable deferring to commit [Alex]
> - move the code that do_use/release on vector 0 into an
>   "else" branch [Alex]
> - introduce vfio_prepare_kvm_msi_virq_batch() that enables
>   the 'defer_kvm_irq_routing' flag [Alex]
> - introduce vfio_commit_kvm_msi_virq_batch() that clears the
>   'defer_kvm_irq_routing' flag and does further work [Alex]
> 
> Changes v2->v3:
>  - fix two errors [Longpeng]
> 
> Changes v1->v2:
>  - fix several typos and grammatical errors [Alex, Philippe]
>  - split fixups and cleanups into separate patches  [Alex, Philippe]
>  - introduce kvm_irqchip_add_deferred_msi_route to
>minimize code changes[Alex]
>  - enable the optimization in msi setup path[Alex]
> 
> Longpeng (Mike) (5):
>   vfio: simplify the conditional statements in vfio_msi_enable
>   vfio: move re-enabling INTX out of the common helper
>   vfio: simplify the failure path in vfio_msi_enable
>   Revert "vfio: Avoid disabling and enabling vectors repeatedly in VFIO
> migration"
>   vfio: defer to commit kvm irq routing when enable msi/msix
> 
>  hw/vfio/pci.c | 183 +++---
>  hw/vfio/pci.h |   2 +
>  2 files changed, 115 insertions(+), 70 deletions(-)
> 

Nice to see you found a solution with Paolo's suggestion for
begin/commit batching.  Looks ok to me; I'll queue this for after the
v7.0 QEMU release and look for further reviews and comments in the
interim.  Thanks,

Alex




Re: [PATCH v3 for-7.1] vfio/common: remove spurious tpm-crb-cmd misalignment warning

2022-03-28 Thread Alex Williamson
On Wed, 23 Mar 2022 21:31:19 +0100
Eric Auger  wrote:

> The CRB command buffer currently is a RAM MemoryRegion and given
> its base address alignment, it causes an error report on
> vfio_listener_region_add(). This region could have been a RAM device
> region, easing the detection of such safe situation but this option
> was not well received. So let's add a helper function that uses the
> memory region owner type to detect the situation is safe wrt
> the assignment. Other device types can be checked here if such kind
> of problem occurs again.
> 
> Signed-off-by: Eric Auger 
> Reviewed-by: Philippe Mathieu-Daudé 
> 
> ---

Thanks Eric!  I'll queue this for after the v7.0 release with Connie's
and Stefan's reviews.  Thanks,

Alex

> 
> v2 -> v3:
> - Use TPM_IS_CRB()
> 
> v1 -> v2:
> - do not check the MR name but rather the owner type
> ---
>  hw/vfio/common.c | 27 ++-
>  hw/vfio/trace-events |  1 +
>  2 files changed, 27 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 080046e3f51..55bc116473e 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -40,6 +40,7 @@
>  #include "trace.h"
>  #include "qapi/error.h"
>  #include "migration/migration.h"
> +#include "sysemu/tpm.h"
>  
>  VFIOGroupList vfio_group_list =
>  QLIST_HEAD_INITIALIZER(vfio_group_list);
> @@ -861,6 +862,22 @@ static void 
> vfio_unregister_ram_discard_listener(VFIOContainer *container,
>  g_free(vrdl);
>  }
>  
> +static bool vfio_known_safe_misalignment(MemoryRegionSection *section)
> +{
> +MemoryRegion *mr = section->mr;
> +
> +if (!TPM_IS_CRB(mr->owner)) {
> +return false;
> +}
> +
> +/* this is a known safe misaligned region, just trace for debug purpose 
> */
> +trace_vfio_known_safe_misalignment(memory_region_name(mr),
> +   section->offset_within_address_space,
> +   section->offset_within_region,
> +   qemu_real_host_page_size);
> +return true;
> +}
> +
>  static void vfio_listener_region_add(MemoryListener *listener,
>   MemoryRegionSection *section)
>  {
> @@ -884,7 +901,15 @@ static void vfio_listener_region_add(MemoryListener 
> *listener,
>  if (unlikely((section->offset_within_address_space &
>~qemu_real_host_page_mask) !=
>   (section->offset_within_region & 
> ~qemu_real_host_page_mask))) {
> -error_report("%s received unaligned region", __func__);
> +if (!vfio_known_safe_misalignment(section)) {
> +error_report("%s received unaligned region %s iova=0x%"PRIx64
> + " offset_within_region=0x%"PRIx64
> + " qemu_real_host_page_mask=0x%"PRIxPTR,
> + __func__, memory_region_name(section->mr),
> + section->offset_within_address_space,
> + section->offset_within_region,
> + qemu_real_host_page_mask);
> +}
>  return;
>  }
>  
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 0ef1b5f4a65..6f38a2e6991 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -100,6 +100,7 @@ vfio_listener_region_add_skip(uint64_t start, uint64_t 
> end) "SKIPPING region_add
>  vfio_spapr_group_attach(int groupfd, int tablefd) "Attached groupfd %d to 
> liobn fd %d"
>  vfio_listener_region_add_iommu(uint64_t start, uint64_t end) "region_add 
> [iommu] 0x%"PRIx64" - 0x%"PRIx64
>  vfio_listener_region_add_ram(uint64_t iova_start, uint64_t iova_end, void 
> *vaddr) "region_add [ram] 0x%"PRIx64" - 0x%"PRIx64" [%p]"
> +vfio_known_safe_misalignment(const char *name, uint64_t iova, uint64_t 
> offset_within_region, uint64_t page_size) "Region \"%s\" iova=0x%"PRIx64" 
> offset_within_region=0x%"PRIx64" qemu_real_host_page_mask=0x%"PRIxPTR ": 
> cannot be mapped for DMA"
>  vfio_listener_region_add_no_dma_map(const char *name, uint64_t iova, 
> uint64_t size, uint64_t page_size) "Region \"%s\" 0x%"PRIx64" 
> size=0x%"PRIx64" is not aligned to 0x%"PRIx64" and cannot be mapped for DMA"
>  vfio_listener_region_del_skip(uint64_t start, uint64_t end) "SKIPPING 
> region_del 0x%"PRIx64" - 0x%"PRIx64
>  vfio_listener_region_del(uint64_t start, uint64_t end) "region_del 
> 0x%"PRIx64" - 0x%"PRIx64




Re: [PATCH for-7.1] vfio/common: remove spurious tpm-crb-cmd misalignment warning

2022-03-17 Thread Alex Williamson
On Thu, 17 Mar 2022 15:34:53 +0100
Eric Auger  wrote:

> Hi Alex,
> 
> On 3/17/22 3:23 PM, Alex Williamson wrote:
> > On Thu, 17 Mar 2022 14:57:30 +0100
> > Eric Auger  wrote:
> >  
> >> Hi Alex,
> >>
> >> On 3/17/22 12:08 AM, Alex Williamson wrote:  
> >>> On Wed, 16 Mar 2022 21:29:51 +0100
> >>> Eric Auger  wrote:
> >>>
> >>>> The CRB command buffer currently is a RAM MemoryRegion and given
> >>>> its base address alignment, it causes an error report on
> >>>> vfio_listener_region_add(). This region could have been a RAM device
> >>>> region, easing the detection of such safe situation but this option
> >>>> was not well received. So let's add a helper function that uses the
> >>>> memory region name to recognize the region and detect the situation
> >>>> is safe wrt assignment. Other regions can be listed here if such kind
> >>>> of problem occurs again.
> >>>>
> >>>> Signed-off-by: Eric Auger 
> >>>> ---
> >>>>  hw/vfio/common.c | 26 +-
> >>>>  hw/vfio/trace-events |  1 +
> >>>>  2 files changed, 26 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >>>> index 080046e3f51..b58a38f5c57 100644
> >>>> --- a/hw/vfio/common.c
> >>>> +++ b/hw/vfio/common.c
> >>>> @@ -861,6 +861,22 @@ static void 
> >>>> vfio_unregister_ram_discard_listener(VFIOContainer *container,
> >>>>  g_free(vrdl);
> >>>>  }
> >>>>  
> >>>> +static bool vfio_known_safe_misalignment(MemoryRegionSection *section)
> >>>> +{
> >>>> +MemoryRegion *mr = section->mr;
> >>>> +
> >>>> +if (strcmp(memory_region_name(mr), "tpm-crb-cmd") != 0) {
> >>>> +return false;
> >>>> +}
> >>> Hi Eric,
> >>>
> >>> I was thinking more along the lines that we could use
> >>> memory_region_owner() to get the owning Object, then on
> >>> that we could maybe use INTERFACE_CHECK to look for TYPE_MEMORY_DEVICE,
> >>> then consider anything else optional.  (a) could something like that
> >>> work and (b) do all required mappings currently expose that interface?
> >>> Thanks,
> >> If I understand correctly you just want to error_report() misalignement
> >> of MR sections belonging to
> >>
> >> TYPE_MEMORY_DEVICE devices and silence the rest? Is that a correct
> >> understanding? I thought you wanted to be much more protective and
> >> ignore misalignments on a case by case basis hence the white listing
> >> of this single tpm-crb-cmd region.  
> > Ah right, so I'm just slipping back into what we currently do, fail for
> > memory and warn on devices, which would be a generally reasonable long
> > term plan except people file bugs about those warnings.  Crud.
> >
> > I guess I don't have a better idea than creating essentially an
> > exception list like this.  Do you think it's better to do the strcmp
> > for the specific memory region or would it maybe be sufficient to test
> > the owner object is TYPE_TPM_CRB?  Thanks,  
> I asked myself the question and eventually chose to be more conservative
> with the granularity of the MR. Sometimes objects own several MRs and I
> wondered if some misalignments could be considered as safe while others
> unsafe, within the same object.  Nevertheless I don't have a strong
> opinion and will respin according to your preferencee.

Hi Eric,

As we discussed offline, I think the benefits of being able to test the
type, ie. TYPE_TPM_CRB, might outweigh the flexibility of having per mr
granularity.  The strcmp seems like a maintenance red flag since that's
subject to change, though maybe migration support forces it to be more
static than it would otherwise appear.  In any case, it's probably not
worth warning about any DMA mapping failures for mr's backed by a tpm
device.  Thanks,

Alex

> >>>> +
> >>>> +/* this is a known safe misaligned region, just trace for
> >>>> debug purpose */
> >>>> +trace_vfio_known_safe_misalignment(memory_region_name(mr),
> >>>> +
> >>>> section->offset_within_address_space,
> >>>> +
> >>>> section->offset_within_region,
> >>>> +   qemu_

Re: [PATCH for-7.1] vfio/common: remove spurious tpm-crb-cmd misalignment warning

2022-03-17 Thread Alex Williamson
On Thu, 17 Mar 2022 14:57:30 +0100
Eric Auger  wrote:

> Hi Alex,
> 
> On 3/17/22 12:08 AM, Alex Williamson wrote:
> > On Wed, 16 Mar 2022 21:29:51 +0100
> > Eric Auger  wrote:
> >  
> >> The CRB command buffer currently is a RAM MemoryRegion and given
> >> its base address alignment, it causes an error report on
> >> vfio_listener_region_add(). This region could have been a RAM device
> >> region, easing the detection of such safe situation but this option
> >> was not well received. So let's add a helper function that uses the
> >> memory region name to recognize the region and detect the situation
> >> is safe wrt assignment. Other regions can be listed here if such kind
> >> of problem occurs again.
> >>
> >> Signed-off-by: Eric Auger 
> >> ---
> >>  hw/vfio/common.c | 26 +-
> >>  hw/vfio/trace-events |  1 +
> >>  2 files changed, 26 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >> index 080046e3f51..b58a38f5c57 100644
> >> --- a/hw/vfio/common.c
> >> +++ b/hw/vfio/common.c
> >> @@ -861,6 +861,22 @@ static void 
> >> vfio_unregister_ram_discard_listener(VFIOContainer *container,
> >>  g_free(vrdl);
> >>  }
> >>  
> >> +static bool vfio_known_safe_misalignment(MemoryRegionSection *section)
> >> +{
> >> +MemoryRegion *mr = section->mr;
> >> +
> >> +if (strcmp(memory_region_name(mr), "tpm-crb-cmd") != 0) {
> >> +return false;
> >> +}  
> > Hi Eric,
> >
> > I was thinking more along the lines that we could use
> > memory_region_owner() to get the owning Object, then on
> > that we could maybe use INTERFACE_CHECK to look for TYPE_MEMORY_DEVICE,
> > then consider anything else optional.  (a) could something like that
> > work and (b) do all required mappings currently expose that interface?
> > Thanks,  
> If I understand correctly you just want to error_report() misalignement
> of MR sections belonging to
> 
> TYPE_MEMORY_DEVICE devices and silence the rest? Is that a correct
> understanding? I thought you wanted to be much more protective and
> ignore misalignments on a case by case basis hence the white listing
> of this single tpm-crb-cmd region.

Ah right, so I'm just slipping back into what we currently do, fail for
memory and warn on devices, which would be a generally reasonable long
term plan except people file bugs about those warnings.  Crud.

I guess I don't have a better idea than creating essentially an
exception list like this.  Do you think it's better to do the strcmp
for the specific memory region or would it maybe be sufficient to test
the owner object is TYPE_TPM_CRB?  Thanks,

Alex

> >> +
> >> +/* this is a known safe misaligned region, just trace for
> >> debug purpose */
> >> +trace_vfio_known_safe_misalignment(memory_region_name(mr),
> >> +
> >> section->offset_within_address_space,
> >> +
> >> section->offset_within_region,
> >> +   qemu_real_host_page_size);
> >> +return true;
> >> +}
> >> +
> >>  static void vfio_listener_region_add(MemoryListener *listener,
> >>   MemoryRegionSection *section)
> >>  {
> >> @@ -884,7 +900,15 @@ static void
> >> vfio_listener_region_add(MemoryListener *listener, if
> >> (unlikely((section->offset_within_address_space &
> >> ~qemu_real_host_page_mask) != (section->offset_within_region &
> >> ~qemu_real_host_page_mask))) {
> >> -error_report("%s received unaligned region", __func__);
> >> +if (!vfio_known_safe_misalignment(section)) {
> >> +error_report("%s received unaligned region %s
> >> iova=0x%"PRIx64
> >> + " offset_within_region=0x%"PRIx64
> >> + " qemu_real_host_page_mask=0x%"PRIxPTR,
> >> + __func__,
> >> memory_region_name(section->mr),
> >> + section->offset_within_address_space,
> >> + section->offset_within_region,
> >> + qemu_real_host_page_mask);
> >> +}
> >>  return;
> >>  }
> >>  
> >> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> >> index 0ef1b5f4a65..6f38a2e6991 10064

Re: [PATCH for-7.1] vfio/common: remove spurious tpm-crb-cmd misalignment warning

2022-03-16 Thread Alex Williamson
On Wed, 16 Mar 2022 21:29:51 +0100
Eric Auger  wrote:

> The CRB command buffer currently is a RAM MemoryRegion and given
> its base address alignment, it causes an error report on
> vfio_listener_region_add(). This region could have been a RAM device
> region, easing the detection of such safe situation but this option
> was not well received. So let's add a helper function that uses the
> memory region name to recognize the region and detect the situation
> is safe wrt assignment. Other regions can be listed here if such kind
> of problem occurs again.
> 
> Signed-off-by: Eric Auger 
> ---
>  hw/vfio/common.c | 26 +-
>  hw/vfio/trace-events |  1 +
>  2 files changed, 26 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 080046e3f51..b58a38f5c57 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -861,6 +861,22 @@ static void 
> vfio_unregister_ram_discard_listener(VFIOContainer *container,
>  g_free(vrdl);
>  }
>  
> +static bool vfio_known_safe_misalignment(MemoryRegionSection *section)
> +{
> +MemoryRegion *mr = section->mr;
> +
> +if (strcmp(memory_region_name(mr), "tpm-crb-cmd") != 0) {
> +return false;
> +}

Hi Eric,

I was thinking more along the lines that we could use
memory_region_owner() to get the owning Object, then on
that we could maybe use INTERFACE_CHECK to look for TYPE_MEMORY_DEVICE,
then consider anything else optional.  (a) could something like that
work and (b) do all required mappings currently expose that interface?
Thanks,

Alex


> +
> +/* this is a known safe misaligned region, just trace for debug purpose 
> */
> +trace_vfio_known_safe_misalignment(memory_region_name(mr),
> +   section->offset_within_address_space,
> +   section->offset_within_region,
> +   qemu_real_host_page_size);
> +return true;
> +}
> +
>  static void vfio_listener_region_add(MemoryListener *listener,
>   MemoryRegionSection *section)
>  {
> @@ -884,7 +900,15 @@ static void vfio_listener_region_add(MemoryListener 
> *listener,
>  if (unlikely((section->offset_within_address_space &
>~qemu_real_host_page_mask) !=
>   (section->offset_within_region & 
> ~qemu_real_host_page_mask))) {
> -error_report("%s received unaligned region", __func__);
> +if (!vfio_known_safe_misalignment(section)) {
> +error_report("%s received unaligned region %s iova=0x%"PRIx64
> + " offset_within_region=0x%"PRIx64
> + " qemu_real_host_page_mask=0x%"PRIxPTR,
> + __func__, memory_region_name(section->mr),
> + section->offset_within_address_space,
> + section->offset_within_region,
> + qemu_real_host_page_mask);
> +}
>  return;
>  }
>  
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 0ef1b5f4a65..6f38a2e6991 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -100,6 +100,7 @@ vfio_listener_region_add_skip(uint64_t start, uint64_t 
> end) "SKIPPING region_add
>  vfio_spapr_group_attach(int groupfd, int tablefd) "Attached groupfd %d to 
> liobn fd %d"
>  vfio_listener_region_add_iommu(uint64_t start, uint64_t end) "region_add 
> [iommu] 0x%"PRIx64" - 0x%"PRIx64
>  vfio_listener_region_add_ram(uint64_t iova_start, uint64_t iova_end, void 
> *vaddr) "region_add [ram] 0x%"PRIx64" - 0x%"PRIx64" [%p]"
> +vfio_known_safe_misalignment(const char *name, uint64_t iova, uint64_t 
> offset_within_region, uint64_t page_size) "Region \"%s\" iova=0x%"PRIx64" 
> offset_within_region=0x%"PRIx64" qemu_real_host_page_mask=0x%"PRIxPTR ": 
> cannot be mapped for DMA"
>  vfio_listener_region_add_no_dma_map(const char *name, uint64_t iova, 
> uint64_t size, uint64_t page_size) "Region \"%s\" 0x%"PRIx64" 
> size=0x%"PRIx64" is not aligned to 0x%"PRIx64" and cannot be mapped for DMA"
>  vfio_listener_region_del_skip(uint64_t start, uint64_t end) "SKIPPING 
> region_del 0x%"PRIx64" - 0x%"PRIx64
>  vfio_listener_region_del(uint64_t start, uint64_t end) "region_del 
> 0x%"PRIx64" - 0x%"PRIx64




Re: [RFC v4 01/21] vfio-user: introduce vfio-user protocol specification

2022-03-15 Thread Alex Williamson
On Tue, 15 Mar 2022 21:43:15 +
Thanos Makatos  wrote:

> > -Original Message-
> > From: Qemu-devel  > bounces+thanos.makatos=nutanix@nongnu.org> On Behalf Of Alex  
> > Williamson
> > Sent: 09 March 2022 22:35
> > To: John Johnson 
> > Cc: qemu-devel@nongnu.org
> > Subject: Re: [RFC v4 01/21] vfio-user: introduce vfio-user protocol 
> > specification
> > 
> > On Tue, 11 Jan 2022 16:43:37 -0800
> > John Johnson  wrote:  
> > > +VFIO region info cap sparse mmap
> > > +""""""""""""""""""""""""""""""""
> > > +
> > > ++--++--+
> > > +| Name | Offset | Size |
> > > ++==++==+
> > > +| nr_areas | 0  | 4|
> > > ++--++--+
> > > +| reserved | 4  | 4|
> > > ++--++--+
> > > +| offset   | 8  | 8|
> > > ++--++--+
> > > +| size | 16 | 9|
> > > ++--++--+  
> > 
> > Typo, I'm pretty sure size isn't 9 bytes.
> >   
> > > +| ...  ||  |
> > > ++--++--+
> > > +
> > > +* *nr_areas* is the number of sparse mmap areas in the region.
> > > +* *offset* and size describe a single area that can be mapped by the 
> > > client.
> > > +  There will be *nr_areas* pairs of offset and size. The offset will be 
> > > added to
> > > +  the base offset given in the ``VFIO_USER_DEVICE_GET_REGION_INFO`` to  
> > form the  
> > > +  offset argument of the subsequent mmap() call.
> > > +
> > > +The VFIO sparse mmap area is defined in  (``struct
> > > +vfio_region_info_cap_sparse_mmap``).
> > > +
> > > +VFIO region type cap header
> > > +"""""""""""""""""""""""""""
> > > +
> > > ++--+---+
> > > +| Name | Value |
> > > ++==+===+
> > > +| id   | VFIO_REGION_INFO_CAP_TYPE |
> > > ++--+---+
> > > +| version  | 0x1   |
> > > ++--+---+
> > > +| next | |
> > > ++--+---+
> > > +| region info type | VFIO region info type |
> > > ++--+---+
> > > +
> > > +This capability is defined when a region is specific to the device.
> > > +
> > > +VFIO region info type cap
> > > +"""""""""""""""""""""""""
> > > +
> > > +The VFIO region info type is defined in 
> > > +(``struct vfio_region_info_cap_type``).
> > > +
> > > ++-++--+
> > > +| Name| Offset | Size |
> > > ++=++==+
> > > +| type| 0  | 4|
> > > ++-++--+
> > > +| subtype | 4  | 4|
> > > ++-++--+
> > > +
> > > +The only device-specific region type and subtype supported by vfio-user 
> > > is
> > > +``VFIO_REGION_TYPE_MIGRATION`` (3) and  
> > ``VFIO_REGION_SUBTYPE_MIGRATION`` (1).
> > 
> > These should be considered deprecated from the kernel interface.  I
> > hope there are plans for vfio-user to adopt the new interface that's
> > currently available in linux-next and intended for v5.18.
> > 
> > ...  
> > > +Unused VFIO ``ioctl()`` commands
> > > +
> > > +
> > > +The following VFIO commands do not have an equivalent vfio-user  
> > command:  
> > > +
> > > +* ``VFIO_GET_API_VERSION``
> > > +* ``VFIO_CHECK_EXTENSION``
> > > +* ``VFIO_SET_IOMMU``
> > > +* ``VFIO_GROUP_GET_STATUS``
> > > +* ``VFIO_GROUP_SET_CONTAINER``
> > > +* ``VFIO_GROUP_UNSET_CONTAINER``
> > > +* ``VFIO_GROUP_GET_DEVICE_FD``
> > > +* ``VFIO_IOMMU_GET_INFO``
> > > +
> > > +However, once support for live migration f

Re: XIVE VFIO kernel resample failure in INTx mode under heavy load

2022-03-14 Thread Alex Williamson
[Cc +Alexey]

On Fri, 11 Mar 2022 12:35:45 -0600 (CST)
Timothy Pearson  wrote:

> All,
> 
> I've been struggling for some time with what is looking like a
> potential bug in QEMU/KVM on the POWER9 platform.  It appears that in
> XIVE mode, when the in-kernel IRQ chip is enabled, an external device
> that rapidly asserts IRQs via the legacy INTx level mechanism will
> only receive one interrupt in the KVM guest.
> 
> Changing any one of those items appears to avoid the glitch, e.g.
> XICS mode with the in-kernel IRQ chip works (all interrupts are
> passed through), and XIVE mode with the in-kernel IRQ chip disabled
> also works.  We are also not seeing any problems in XIVE mode with
> the in-kernel chip from MSI/MSI-X devices.
> 
> The device in question is a real time card that needs to raise an
> interrupt every 1ms.  It works perfectly on the host, but fails in
> the guest -- with the in-kernel IRQ chip and XIVE enabled, it
> receives exactly one interrupt, at which point the host continues to
> see INTx+ but the guest sees INTX-, and the IRQ handler in the guest
> kernel is never reentered.
> 
> We have also seen some very rare glitches where, over a long period
> of time, we can enter a similar deadlock in XICS mode.  Disabling the
> in-kernel IRQ chip in XIVE mode will also lead to the lockup with
> this device, since the userspace IRQ emulation cannot keep up with
> the rapid interrupt firing (measurements show around 100ms required
> for processing each interrupt in the user mode).
> 
> My understanding is the resample mechanism does some clever tricks
> with level IRQs, but that QEMU needs to check if the IRQ is still
> asserted by the device on guest EOI.  Since a failure here would
> explain these symptoms I'm wondering if there is a bug in either QEMU
> or KVM for POWER / pSeries (SPAPr) where the IRQ is not resampled and
> therefore not re-fired in the guest?
> 
> Unfortunately I lack the resources at the moment to dig through the
> QEMU codebase and try to find the bug.  Any IBMers here that might be
> able to help out?  I can provide access to a test setup if desired.

Your experiments with in-kernel vs QEMU irqchip would suggest to me
that both the device and the generic INTx handling code are working
correctly, though it's hard to say that definitively given the massive
timing differences.

As an experiment, does anything change with the "nointxmask=1" vfio-pci
module option?

Adding Alexey, I have zero XIVE knowledge myself. Thanks,

Alex




Re: [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma)

2022-03-10 Thread Alex Williamson
On Thu, 10 Mar 2022 14:55:50 -0500
Steven Sistare  wrote:

> On 3/10/2022 1:35 PM, Alex Williamson wrote:
> > On Thu, 10 Mar 2022 10:00:29 -0500
> > Steven Sistare  wrote:
> >   
> >> On 3/7/2022 5:16 PM, Alex Williamson wrote:  
> >>> On Wed, 22 Dec 2021 11:05:24 -0800
> >>> Steve Sistare  wrote:  
> >>>> @@ -1878,6 +1908,18 @@ static int vfio_init_container(VFIOContainer 
> >>>> *container, int group_fd,
> >>>>  {
> >>>>  int iommu_type, ret;
> >>>>  
> >>>> +/*
> >>>> + * If container is reused, just set its type and skip the ioctls, 
> >>>> as the
> >>>> + * container and group are already configured in the kernel.
> >>>> + * VFIO_TYPE1v2_IOMMU is the only type that supports reuse/cpr.
> >>>> + * If you ever add new types or spapr cpr support, kind reader, 
> >>>> please
> >>>> + * also implement VFIO_GET_IOMMU.
> >>>> + */
> >>>
> >>> VFIO_CHECK_EXTENSION should be able to tell us this, right?  Maybe the
> >>> problem is that vfio_iommu_type1_check_extension() should actually base
> >>> some of the details on the instantiated vfio_iommu, ex.
> >>>
> >>>   switch (arg) {
> >>>   case VFIO_TYPE1_IOMMU:
> >>>   return (iommu && iommu->v2) ? 0 : 1;
> >>>   case VFIO_UNMAP_ALL:
> >>>   case VFIO_UPDATE_VADDR:
> >>>   case VFIO_TYPE1v2_IOMMU:
> >>>   return (iommu && !iommu->v2) ? 0 : 1;
> >>>   case VFIO_TYPE1_NESTING_IOMMU:
> >>>   return (iommu && !iommu->nesting) ? 0 : 1;
> >>>   ...
> >>>
> >>> We can't support v1 if we've already set a v2 container and vice versa.
> >>> There are probably some corner cases and compatibility to puzzle
> >>> through, but I wouldn't think we need a new ioctl to check this.
> >>
> >> That change makes sense, and may be worth while on its own merits, but 
> >> does not
> >> solve the problem, which is that qemu will not be able to infer iommu_type 
> >> in
> >> the future if new types are added.  Given:
> >>   * a new kernel supporting shiny new TYPE1v3
> >>   * old qemu starts and selects TYPE1v2 in vfio_get_iommu_type because it 
> >> has no
> >> knowledge of v3
> >>   * live update to qemu which supports v3, which will be listed first in 
> >> vfio_get_iommu_type.
> >>
> >> Then the new qemu has no way to infer iommu_type.  If it has code that 
> >> makes 
> >> decisions based on iommu_type (eg, VFIO_SPAPR_TCE_v2_IOMMU in 
> >> vfio_container_region_add,
> >> or vfio_ram_block_discard_disable, or ...), then new qemu cannot function 
> >> correctly.
> >>
> >> For that, VFIO_GET_IOMMU would be the cleanest solution, to be added the 
> >> same time our
> >> hypothetical future developer adds TYPE1v3.  The current inability to ask 
> >> the kernel
> >> "what are you" about a container feels like a bug to me.  
> > 
> > Hmm, I don't think the kernel has an innate responsibility to remind
> > the user of a configuration that they've already made.
> 
> No, but it can make userland cleaner.  For example, CRIU checkpoint/restart 
> queries
> the kernel to save process state, and later makes syscalls to restore it.  
> Where the
> kernel does not export sufficient information, CRIU must provide interpose 
> libraries
> so it can remember state internally on its way to the kernel.  And 
> applications must
> link against the interpose libraries.

The counter argument is that it bloats the kernel to add interfaces to
report back things that userspace should already know.  Which has more
exploit vectors, a new kernel ioctl or yet another userspace library?
 
> > But I also
> > don't follow your TYPE1v3 example.  If we added such a type, I imagine
> > the switch would change to:
> > 
> > switch (arg)
> > case VFIO_TYPE1_IOMMU:
> > return (iommu && (iommu->v2 || iommu->v3) ? 0 : 1;
> > case VFIO_UNMAP_ALL:
> > case VFIO_UPDATE_VADDR:
> > return (iommu && !(iommu-v2 || iommu->v3) ? 0 : 1;
> > case VFIO_TYPE1v2_IOMMU:
> > return (iommu && !iommu-v2) ? 0 : 1;
> > case VFIO_TYPE1v3_IOMMU:
> > return (iommu && !iommu->v

Re: [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma)

2022-03-10 Thread Alex Williamson
On Thu, 10 Mar 2022 10:00:29 -0500
Steven Sistare  wrote:

> On 3/7/2022 5:16 PM, Alex Williamson wrote:
> > On Wed, 22 Dec 2021 11:05:24 -0800
> > Steve Sistare  wrote:
> >> @@ -1878,6 +1908,18 @@ static int vfio_init_container(VFIOContainer 
> >> *container, int group_fd,
> >>  {
> >>  int iommu_type, ret;
> >>  
> >> +/*
> >> + * If container is reused, just set its type and skip the ioctls, as 
> >> the
> >> + * container and group are already configured in the kernel.
> >> + * VFIO_TYPE1v2_IOMMU is the only type that supports reuse/cpr.
> >> + * If you ever add new types or spapr cpr support, kind reader, please
> >> + * also implement VFIO_GET_IOMMU.
> >> + */  
> > 
> > VFIO_CHECK_EXTENSION should be able to tell us this, right?  Maybe the
> > problem is that vfio_iommu_type1_check_extension() should actually base
> > some of the details on the instantiated vfio_iommu, ex.
> > 
> > switch (arg) {
> > case VFIO_TYPE1_IOMMU:
> > return (iommu && iommu->v2) ? 0 : 1;
> > case VFIO_UNMAP_ALL:
> > case VFIO_UPDATE_VADDR:
> > case VFIO_TYPE1v2_IOMMU:
> > return (iommu && !iommu->v2) ? 0 : 1;
> > case VFIO_TYPE1_NESTING_IOMMU:
> > return (iommu && !iommu->nesting) ? 0 : 1;
> > ...
> > 
> > We can't support v1 if we've already set a v2 container and vice versa.
> > There are probably some corner cases and compatibility to puzzle
> > through, but I wouldn't think we need a new ioctl to check this.  
> 
> That change makes sense, and may be worth while on its own merits, but does 
> not
> solve the problem, which is that qemu will not be able to infer iommu_type in
> the future if new types are added.  Given:
>   * a new kernel supporting shiny new TYPE1v3
>   * old qemu starts and selects TYPE1v2 in vfio_get_iommu_type because it has 
> no
> knowledge of v3
>   * live update to qemu which supports v3, which will be listed first in 
> vfio_get_iommu_type.
> 
> Then the new qemu has no way to infer iommu_type.  If it has code that makes 
> decisions based on iommu_type (eg, VFIO_SPAPR_TCE_v2_IOMMU in 
> vfio_container_region_add,
> or vfio_ram_block_discard_disable, or ...), then new qemu cannot function 
> correctly.
> 
> For that, VFIO_GET_IOMMU would be the cleanest solution, to be added the same 
> time our
> hypothetical future developer adds TYPE1v3.  The current inability to ask the 
> kernel
> "what are you" about a container feels like a bug to me.

Hmm, I don't think the kernel has an innate responsibility to remind
the user of a configuration that they've already made.  But I also
don't follow your TYPE1v3 example.  If we added such a type, I imagine
the switch would change to:

switch (arg)
case VFIO_TYPE1_IOMMU:
return (iommu && (iommu->v2 || iommu->v3) ? 0 : 1;
case VFIO_UNMAP_ALL:
case VFIO_UPDATE_VADDR:
return (iommu && !(iommu-v2 || iommu->v3) ? 0 : 1;
case VFIO_TYPE1v2_IOMMU:
return (iommu && !iommu-v2) ? 0 : 1;
case VFIO_TYPE1v3_IOMMU:
return (iommu && !iommu->v3) ? 0 : 1;
...

How would that not allow exactly the scenario described, ie. new QEMU
can see that old QEMU left it a v2 IOMMU.

...
> >> +
> >> +bool vfio_is_cpr_capable(VFIOContainer *container, Error **errp)
> >> +{
> >> +if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR) ||
> >> +!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UNMAP_ALL)) {
> >> +error_setg(errp, "VFIO container does not support 
> >> VFIO_UPDATE_VADDR "
> >> + "or VFIO_UNMAP_ALL");
> >> +return false;
> >> +} else {
> >> +return true;
> >> +}
> >> +}  
> > 
> > We could have minimally used this where we assumed a TYPE1v2 container.  
> 
> Are you referring to vfio_init_container (discussed above)?
> Are you suggesting that, if reused is true, we validate those extensions are
> present, before setting iommu_type = VFIO_TYPE1v2_IOMMU?

Yeah, though maybe it's not sufficiently precise to be worthwhile given
the current kernel behavior.

> >> +
> >> +/*
> >> + * Verify that all containers support CPR, and unmap all dma vaddr's.
> >> + */
> >> +int vfio_cpr_save(Error **errp)
> >> +{
> >> +ERRP_GUARD();
> >> +VFIOAddressSpace *spa

Re: [RFC v4 04/21] vfio-user: add region cache

2022-03-09 Thread Alex Williamson
On Tue, 11 Jan 2022 16:43:40 -0800
John Johnson  wrote:

> diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
> index 0cf69a8..223bd02 100644
> --- a/hw/vfio/pci-quirks.c
> +++ b/hw/vfio/pci-quirks.c
> @@ -1601,16 +1601,14 @@ int vfio_pci_nvidia_v100_ram_init(VFIOPCIDevice 
> *vdev, Error **errp)
>  
>  hdr = vfio_get_region_info_cap(nv2reg, 
> VFIO_REGION_INFO_CAP_NVLINK2_SSATGT);
>  if (!hdr) {
> -ret = -ENODEV;
> -goto free_exit;
> +return -ENODEV;
>  }
>  cap = (void *) hdr;
>  
>  p = mmap(NULL, nv2reg->size, PROT_READ | PROT_WRITE,
>   MAP_SHARED, vdev->vbasedev.fd, nv2reg->offset);
>  if (p == MAP_FAILED) {
> -ret = -errno;
> -goto free_exit;
> +return -errno;
>  }
>  
>  quirk = vfio_quirk_alloc(1);
> @@ -1623,7 +1621,7 @@ int vfio_pci_nvidia_v100_ram_init(VFIOPCIDevice *vdev, 
> Error **errp)
>  (void *) (uintptr_t) cap->tgt);
>  trace_vfio_pci_nvidia_gpu_setup_quirk(vdev->vbasedev.name, cap->tgt,
>nv2reg->size);
> -free_exit:
> +
>  g_free(nv2reg);

Shouldn't this g_free() be removed as well?

>  
>  return ret;
> @@ -1651,16 +1649,14 @@ int vfio_pci_nvlink2_init(VFIOPCIDevice *vdev, Error 
> **errp)
>  hdr = vfio_get_region_info_cap(atsdreg,
> VFIO_REGION_INFO_CAP_NVLINK2_SSATGT);
>  if (!hdr) {
> -ret = -ENODEV;
> -goto free_exit;
> +return -ENODEV;
>  }
>  captgt = (void *) hdr;
>  
>  hdr = vfio_get_region_info_cap(atsdreg,
> VFIO_REGION_INFO_CAP_NVLINK2_LNKSPD);
>  if (!hdr) {
> -ret = -ENODEV;
> -goto free_exit;
> +return -ENODEV;
>  }
>  capspeed = (void *) hdr;
>  
> @@ -1669,8 +1665,7 @@ int vfio_pci_nvlink2_init(VFIOPCIDevice *vdev, Error 
> **errp)
>  p = mmap(NULL, atsdreg->size, PROT_READ | PROT_WRITE,
>   MAP_SHARED, vdev->vbasedev.fd, atsdreg->offset);
>  if (p == MAP_FAILED) {
> -ret = -errno;
> -goto free_exit;
> +return -errno;
>  }
>  
>  quirk = vfio_quirk_alloc(1);
> @@ -1690,8 +1685,6 @@ int vfio_pci_nvlink2_init(VFIOPCIDevice *vdev, Error 
> **errp)
>  (void *) (uintptr_t) capspeed->link_speed);
>  trace_vfio_pci_nvlink2_setup_quirk_lnkspd(vdev->vbasedev.name,
>capspeed->link_speed);
> -free_exit:
> -g_free(atsdreg);

Like was done for this equivalent usage.  Thanks,

Alex




Re: [RFC v4 01/21] vfio-user: introduce vfio-user protocol specification

2022-03-09 Thread Alex Williamson
On Tue, 11 Jan 2022 16:43:37 -0800
John Johnson  wrote:
> +VFIO region info cap sparse mmap
> +
> +
> ++--++--+
> +| Name | Offset | Size |
> ++==++==+
> +| nr_areas | 0  | 4|
> ++--++--+
> +| reserved | 4  | 4|
> ++--++--+
> +| offset   | 8  | 8|
> ++--++--+
> +| size | 16 | 9|
> ++--++--+

Typo, I'm pretty sure size isn't 9 bytes.

> +| ...  ||  |
> ++--++--+
> +
> +* *nr_areas* is the number of sparse mmap areas in the region.
> +* *offset* and size describe a single area that can be mapped by the client.
> +  There will be *nr_areas* pairs of offset and size. The offset will be 
> added to
> +  the base offset given in the ``VFIO_USER_DEVICE_GET_REGION_INFO`` to form 
> the
> +  offset argument of the subsequent mmap() call.
> +
> +The VFIO sparse mmap area is defined in  (``struct
> +vfio_region_info_cap_sparse_mmap``).
> +
> +VFIO region type cap header
> +"""
> +
> ++--+---+
> +| Name | Value |
> ++==+===+
> +| id   | VFIO_REGION_INFO_CAP_TYPE |
> ++--+---+
> +| version  | 0x1   |
> ++--+---+
> +| next | |
> ++--+---+
> +| region info type | VFIO region info type |
> ++--+---+
> +
> +This capability is defined when a region is specific to the device.
> +
> +VFIO region info type cap
> +"
> +
> +The VFIO region info type is defined in 
> +(``struct vfio_region_info_cap_type``).
> +
> ++-++--+
> +| Name| Offset | Size |
> ++=++==+
> +| type| 0  | 4|
> ++-++--+
> +| subtype | 4  | 4|
> ++-++--+
> +
> +The only device-specific region type and subtype supported by vfio-user is
> +``VFIO_REGION_TYPE_MIGRATION`` (3) and ``VFIO_REGION_SUBTYPE_MIGRATION`` (1).

These should be considered deprecated from the kernel interface.  I
hope there are plans for vfio-user to adopt the new interface that's
currently available in linux-next and intended for v5.18.

...
> +Unused VFIO ``ioctl()`` commands
> +
> +
> +The following VFIO commands do not have an equivalent vfio-user command:
> +
> +* ``VFIO_GET_API_VERSION``
> +* ``VFIO_CHECK_EXTENSION``
> +* ``VFIO_SET_IOMMU``
> +* ``VFIO_GROUP_GET_STATUS``
> +* ``VFIO_GROUP_SET_CONTAINER``
> +* ``VFIO_GROUP_UNSET_CONTAINER``
> +* ``VFIO_GROUP_GET_DEVICE_FD``
> +* ``VFIO_IOMMU_GET_INFO``
> +
> +However, once support for live migration for VFIO devices is finalized some
> +of the above commands may have to be handled by the client in their
> +corresponding vfio-user form. This will be addressed in a future protocol
> +version.

As above, I'd go ahead and drop the migration region interface support,
it's being removed from the kernel.  Dirty page handling might also be
something you want to pull back on as we're expecting in-kernel vfio to
essentially deprecate its iommu backends in favor of a new shared
userspace iommufd interface.  We expect to have backwards compatibility
via that interface, but as QEMU migration support for vfio-pci devices
is experimental and there are desires not to consolidate dirty page
tracking behind the iommu interface in the new model, it's not clear if
the kernel will continue to expose the current dirty page tracking.

AIUI, we're expecting to see patches officially proposing the iommufd
interface in the kernel "soon".  Thanks,

Alex




Re: [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma)

2022-03-07 Thread Alex Williamson
On Wed, 22 Dec 2021 11:05:24 -0800
Steve Sistare  wrote:

> Enable vfio-pci devices to be saved and restored across an exec restart
> of qemu.
> 
> At vfio creation time, save the value of vfio container, group, and device
> descriptors in cpr state.
> 
> In cpr-save and cpr-exec, suspend the use of virtual addresses in DMA
> mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest ram will be remapped
> at a different VA after exec.  DMA to already-mapped pages continues.  Save
> the msi message area as part of vfio-pci vmstate, save the interrupt and
> notifier eventfd's in cpr state, and clear the close-on-exec flag for the
> vfio descriptors.  The flag is not cleared earlier because the descriptors
> should not persist across miscellaneous fork and exec calls that may be
> performed during normal operation.
> 
> On qemu restart, vfio_realize() finds the saved descriptors, uses
> the descriptors, and notes that the device is being reused.  Device and
> iommu state is already configured, so operations in vfio_realize that
> would modify the configuration are skipped for a reused device, including
> vfio ioctl's and writes to PCI configuration space.  The result is that
> vfio_realize constructs qemu data structures that reflect the current
> state of the device.  However, the reconstruction is not complete until
> cpr-load is called. cpr-load loads the msi data and finds eventfds in cpr
> state.  It rebuilds vector data structures and attaches the interrupts to
> the new KVM instance.  cpr-load then invokes the main vfio listener callback,
> which walks the flattened ranges of the vfio_address_spaces and calls
> VFIO_DMA_MAP_FLAG_VADDR to inform the kernel of the new VA's.  Lastly, it
> starts the VM and suppresses vfio pci device reset.
> 
> This functionality is delivered by 3 patches for clarity.  Part 1 handles
> device file descriptors and DMA.  Part 2 adds eventfd and MSI/MSI-X vector
> support.  Part 3 adds INTX support.
> 
> Signed-off-by: Steve Sistare 
> ---
>  MAINTAINERS   |   1 +
>  hw/pci/pci.c  |  10 
>  hw/vfio/common.c  | 115 
> ++
>  hw/vfio/cpr.c |  94 ++
>  hw/vfio/meson.build   |   1 +
>  hw/vfio/pci.c |  77 
>  hw/vfio/trace-events  |   1 +
>  include/hw/pci/pci.h  |   1 +
>  include/hw/vfio/vfio-common.h |   8 +++
>  include/migration/cpr.h   |   3 ++
>  migration/cpr.c   |  10 +++-
>  migration/target.c|  14 +
>  12 files changed, 324 insertions(+), 11 deletions(-)
>  create mode 100644 hw/vfio/cpr.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index cfe7480..feed239 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -2992,6 +2992,7 @@ CPR
>  M: Steve Sistare 
>  M: Mark Kanda 
>  S: Maintained
> +F: hw/vfio/cpr.c
>  F: include/migration/cpr.h
>  F: migration/cpr.c
>  F: qapi/cpr.json
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index 0fd21e1..e35df4f 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -307,6 +307,16 @@ static void pci_do_device_reset(PCIDevice *dev)
>  {
>  int r;
>  
> +/*
> + * A reused vfio-pci device is already configured, so do not reset it
> + * during qemu_system_reset prior to cpr-load, else interrupts may be
> + * lost.  By contrast, pure-virtual pci devices may be reset here and
> + * updated with new state in cpr-load with no ill effects.
> + */
> +if (dev->reused) {
> +return;
> +}
> +
>  pci_device_deassert_intx(dev);
>  assert(dev->irq_state == 0);
>  
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 5b87f95..90f66ad 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -31,6 +31,7 @@
>  #include "exec/memory.h"
>  #include "exec/ram_addr.h"
>  #include "hw/hw.h"
> +#include "migration/cpr.h"
>  #include "qemu/error-report.h"
>  #include "qemu/main-loop.h"
>  #include "qemu/range.h"
> @@ -459,6 +460,8 @@ static int vfio_dma_unmap(VFIOContainer *container,
>  .size = size,
>  };
>  
> +assert(!container->reused);
> +
>  if (iotlb && container->dirty_pages_supported &&
>  vfio_devices_all_running_and_saving(container)) {
>  return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
> @@ -495,12 +498,24 @@ static int vfio_dma_map(VFIOContainer *container, 
> hwaddr iova,
>  {
>  struct vfio_iommu_type1_dma_map map = {
>  .argsz = sizeof(map),
> -.flags = VFIO_DMA_MAP_FLAG_READ,
>  .vaddr = (__u64)(uintptr_t)vaddr,
>  .iova = iova,
>  .size = size,
>  };
>  
> +/*
> + * Set the new vaddr for any mappings registered during cpr-load.
> + * Reused is cleared thereafter.
> + */
> +if (container->reused) {
> +map.flags = VFIO_DMA_MAP_FLAG_VADDR;
> +if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, )) {
> +goto fail;
> +}

Re: [PATCH V7 18/29] vfio-pci: refactor for cpr

2022-03-03 Thread Alex Williamson
On Wed, 22 Dec 2021 11:05:23 -0800
Steve Sistare  wrote:

> +if (vfio_notifier_init(vdev, >intx.unmask, "intx-unmask", 0)) {
...
> +vfio_notifier_cleanup(vdev, >intx.unmask, "intx-unmask", 0);
...
> +vfio_notifier_cleanup(vdev, >intx.unmask, "intx-unmask", 0);
...
> +ret = vfio_notifier_init(vdev, >intx.interrupt, "intx-interrupt", 
> 0);
...
> +vfio_notifier_cleanup(vdev, >intx.interrupt, "intx-interrupt", 
> 0);
...
> +vfio_notifier_cleanup(vdev, >intx.interrupt, "intx-interrupt", 0);
...
> +const char *name = "kvm_interrupt";
...
> +if (vfio_notifier_init(vdev, >kvm_interrupt, name, nr)) {
...
> +vfio_notifier_cleanup(vdev, >kvm_interrupt, name, nr);
...
> +vfio_notifier_cleanup(vdev, >kvm_interrupt, name, nr);
...
> +vfio_notifier_cleanup(vdev, >kvm_interrupt, "kvm_interrupt", nr);
...
> +if (vfio_notifier_init(vdev, >interrupt, "interrupt", nr)) {
...
> +if (vfio_notifier_init(vdev, >interrupt, "interrupt", i)) {
...
> +vfio_notifier_cleanup(vdev, >interrupt, "interrupt", i);
...
> +vfio_notifier_cleanup(vdev, >interrupt, "interrupt", i);
...
> +if (vfio_notifier_init(vdev, >err_notifier, "err", 0)) {
...
> +vfio_notifier_cleanup(vdev, >err_notifier, "err_notifier", 0);
...
> +vfio_notifier_cleanup(vdev, >err_notifier, "err_notifier", 0);
...
> +if (vfio_notifier_init(vdev, >req_notifier, "req", 0)) {
...
> +vfio_notifier_cleanup(vdev, >req_notifier, "req_notifier", 0);
...
> +vfio_notifier_cleanup(vdev, >req_notifier, "req_notifier", 0);

Something seems to have gone astray with "err" and "req" vs
"err_notifier" and "req_notifier".  The pattern is broken.  Thanks,

Alex




Re: [PATCH 1/2] Allow returning EventNotifier's wfd

2022-03-02 Thread Alex Williamson
On Wed, 2 Mar 2022 16:23:42 +0100
Sergio Lopez  wrote:

> On Wed, Mar 02, 2022 at 08:12:34AM -0700, Alex Williamson wrote:
> > On Wed,  2 Mar 2022 12:36:43 +0100
> > Sergio Lopez  wrote:
> >   
> > > event_notifier_get_fd(const EventNotifier *e) always returns
> > > EventNotifier's read file descriptor (rfd). This is not a problem when
> > > the EventNotifier is backed by a an eventfd, as a single file
> > > descriptor is used both for reading and triggering events (rfd ==
> > > wfd).
> > > 
> > > But, when EventNotifier is backed by a pipefd, we have two file
> > > descriptors, one that can only be used for reads (rfd), and the other
> > > only for writes (wfd).
> > > 
> > > There's, at least, one known situation in which we need to obtain wfd
> > > instead of rfd, which is when setting up the file that's going to be
> > > sent to the peer in vhost's SET_VRING_CALL.
> > > 
> > > Extend event_notifier_get_fd() to receive an argument which indicates
> > > whether the caller wants to obtain rfd (false) or wfd (true).  
> > 
> > There are about 50 places where we add the false arg here and 1 where
> > we use true.  Seems it would save a lot of churn to hide this
> > internally, event_notifier_get_fd() returns an rfd, a new
> > event_notifier_get_wfd() returns the wfd.  Thanks,  
> 
> I agree. In fact, that's what I implemented in the first place. I
> changed to this version in which event_notifier_get_fd() is extended
> because it feels more "correct". But yes, the pragmatic option would
> be adding a new event_notifier_get_wfd().
> 
> I'll wait for more reviews, and unless someone voices against it, I'll
> respin the patches with that strategy (I already have it around here).

I'd argue that adding a bool as an arg to a function to change the
return value is sufficiently non-intuitive to program for that the
wrapper method is actually more correct.  event_notifier_get_fd()
essentially becomes a shorthand for event_notifier_get_rfd().  Thanks,

Alex




Re: [PATCH 1/2] Allow returning EventNotifier's wfd

2022-03-02 Thread Alex Williamson
On Wed,  2 Mar 2022 12:36:43 +0100
Sergio Lopez  wrote:

> event_notifier_get_fd(const EventNotifier *e) always returns
> EventNotifier's read file descriptor (rfd). This is not a problem when
> the EventNotifier is backed by a an eventfd, as a single file
> descriptor is used both for reading and triggering events (rfd ==
> wfd).
> 
> But, when EventNotifier is backed by a pipefd, we have two file
> descriptors, one that can only be used for reads (rfd), and the other
> only for writes (wfd).
> 
> There's, at least, one known situation in which we need to obtain wfd
> instead of rfd, which is when setting up the file that's going to be
> sent to the peer in vhost's SET_VRING_CALL.
> 
> Extend event_notifier_get_fd() to receive an argument which indicates
> whether the caller wants to obtain rfd (false) or wfd (true).

There are about 50 places where we add the false arg here and 1 where
we use true.  Seems it would save a lot of churn to hide this
internally, event_notifier_get_fd() returns an rfd, a new
event_notifier_get_wfd() returns the wfd.  Thanks,

Alex




Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable

2022-02-25 Thread Alex Williamson
On Fri, 25 Feb 2022 12:36:24 +
Joao Martins  wrote:

> On 2/24/22 21:40, Alex Williamson wrote:
> > On Thu, 24 Feb 2022 20:34:40 +
> > Joao Martins  wrote:
> >> Of all those cases I would feel the machine-property is better,
> >> and more flexible than having VFIO/VDPA deal with a bad memory-layout and
> >> discovering late stage that the user is doing something wrong (and thus
> >> fail the DMA_MAP operation for those who do check invalid iovas)  
> > 
> > The trouble is that anything we can glean from the host system where we
> > instantiate the VM is mostly meaningless relative to data center
> > orchestration.  We're relatively insulated from these sorts of issues
> > on x86 (apparently aside from this case), AIUI ARM is even worse about
> > having arbitrary reserved ranges within their IOVA space.
> >   
> In the multi-socket servers we have for ARM I haven't seen much
> issues /yet/ with VFIO. I only have this reserved region:
> 
> 0x0800 0x080f msi
> 
> But of course ARM servers aren't very good representatives of the
> shifting nature of other ARM machine models. ISTR some thread about GIC ITS 
> ranges
> being reserved by IOMMU in some hardware. Perhaps that's what you might
> be referring to about:
> 
> https://lore.kernel.org/qemu-devel/1510622154-17224-1-git-send-email-zhuyi...@huawei.com/


Right, and notice there also that the msi range is different.  On x86
the msi block is defined by the processor, not the platform and we have
commonality between Intel and AMD on that range.  We emulate the same
range in the guest, so for any x86 guest running on an x86 host, the
msi range is a non-issue because they overlap due to the architectural
standards.

How do you create an ARM guest that reserves a block at both 0x800
for your host and 0xc600 for the host in the above link?  Whatever
solution we develop to resolve that issue should equally apply to the
AMD reserved block:

0x00fd 0x00ff reserved

> > For a comprehensive solution, it's not a machine accelerator property
> > or enable such-and-such functionality flag, it's the ability to specify
> > a VM memory map on the QEMU command line and data center orchestration
> > tools gaining insight across all their systems to specify a memory
> > layout that can work regardless of how a VM might be migrated. 
> > Maybe
> > there's a "host" option to that memory map command line option that
> > would take into account the common case of a static host or at least
> > homogeneous set of hosts.  Overall, it's not unlike specifying CPU flags
> > to generate a least common denominator set such that the VM is
> > compatible to any host in the cluster.
> >   
> 
> I remember you iterated over the initial RFC over such idea. I do like that
> option of adjusting memory map... should any new restrictions appear in the
> IOVA space appear as opposed to have to change the machine code everytime
> that happens.
> 
> 
> I am trying to approach this iteratively and starting by fixing AMD 1T+ guests
> with something that hopefully is less painful to bear and unbreaks users doing
> multi-TB guests on kernels >= 5.4. While for < 5.4 it would not wrongly be
> DMA mapping bad IOVAs that may lead guests own spurious failures.
> For the longterm, qemu would need some sort of handling of configurable a 
> sparse
> map of all guest RAM which currently does not exist (and it's stuffed inside 
> on a
> per-machine basis as you're aware). What I am unsure is the churn associated
> with it (compat, migration, mem-hotplug, nvdimms, memory-backends) versus 
> benefit
> if it's "just" one class of x86 platforms (Intel not affected) -- which is 
> what I find
> attractive with the past 2 revisions via smaller change.
> 
> > On the device end, I really would prefer not to see device driver
> > specific enables and we simply cannot hot-add a device of the given
> > type without a pre-enabled VM.  Give the user visibility and
> > configurability to the issue and simply fail the device add (ideally
> > with a useful error message) if the device IOVA space cannot support
> > the VM memory layout (this is what vfio already does afaik).
> > 
> > When we have iommufd support common for vfio and vdpa, hopefully we'll
> > also be able to recommend a common means for learning about system and
> > IOMMU restrictions to IOVA spaces.   
> 
> Perhaps even advertising platform-wide regions (without a domain allocated) 
> that
> are common in any protection domain (for example on x86 this is one
> such case where MSI/HT ranges are hardcoded in Intel/AMD).
> 
> > For no

Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable

2022-02-24 Thread Alex Williamson
On Thu, 24 Feb 2022 20:34:40 +
Joao Martins  wrote:

> On 2/24/22 20:12, Michael S. Tsirkin wrote:
> > On Thu, Feb 24, 2022 at 08:04:48PM +, Joao Martins wrote:  
> >> On 2/24/22 19:54, Michael S. Tsirkin wrote:  
> >>> On Thu, Feb 24, 2022 at 07:44:26PM +, Joao Martins wrote:  
>  On 2/24/22 18:30, Michael S. Tsirkin wrote:  
> > On Thu, Feb 24, 2022 at 05:54:58PM +, Joao Martins wrote:  
> >> On 2/24/22 17:23, Michael S. Tsirkin wrote:  
> >>> On Thu, Feb 24, 2022 at 04:07:22PM +, Joao Martins wrote:  
>  On 2/23/22 23:35, Joao Martins wrote:  
> > On 2/23/22 21:22, Michael S. Tsirkin wrote:  
> >>> +static void x86_update_above_4g_mem_start(PCMachineState *pcms,
> >>> +  uint64_t 
> >>> pci_hole64_size)
> >>> +{
> >>> +X86MachineState *x86ms = X86_MACHINE(pcms);
> >>> +uint32_t eax, vendor[3];
> >>> +
> >>> +host_cpuid(0x0, 0, , [0], [2], [1]);
> >>> +if (!IS_AMD_VENDOR(vendor)) {
> >>> +return;
> >>> +}  
> >>
> >> Wait a sec, should this actually be tying things to the host CPU 
> >> ID?
> >> It's really about what we present to the guest though,
> >> isn't it?  
> >
> > It was the easier catch all to use cpuid without going into
> > Linux UAPI specifics. But it doesn't have to tie in there, it is 
> > only
> > for systems with an IOMMU present.
> >  
> >> Also, can't we tie this to whether the AMD IOMMU is present?
> >>  
> > I think so, I can add that. Something like a amd_iommu_exists() 
> > helper
> > in util/vfio-helpers.c which checks if there's any sysfs child 
> > entries
> > that start with ivhd in /sys/class/iommu/. Given that this HT 
> > region is
> > hardcoded in iommu reserved regions since >=4.11 (to latest) I 
> > don't think it's
> > even worth checking the range exists in:
> >
> > /sys/kernel/iommu_groups/0/reserved_regions
> >
> > (Also that sysfs ABI is >= 4.11 only)  
> 
>  Here's what I have staged in local tree, to address your comment.
> 
>  Naturally the first chunk is what's affected by this patch the rest 
>  is a
>  precedessor patch to introduce qemu_amd_iommu_is_present(). Seems to 
>  pass
>  all the tests and what not.
> 
>  I am not entirely sure this is the right place to put such a helper, 
>  open
>  to suggestions. wrt to the naming of the helper, I tried to follow 
>  the rest
>  of the file's style.
> 
>  diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>  index a9be5d33a291..2ea4430d5dcc 100644
>  --- a/hw/i386/pc.c
>  +++ b/hw/i386/pc.c
>  @@ -868,10 +868,8 @@ static void 
>  x86_update_above_4g_mem_start(PCMachineState *pcms,
> uint64_t pci_hole64_size)
>   {
>   X86MachineState *x86ms = X86_MACHINE(pcms);
>  -uint32_t eax, vendor[3];
> 
>  -host_cpuid(0x0, 0, , [0], [2], [1]);
>  -if (!IS_AMD_VENDOR(vendor)) {
>  +if (!qemu_amd_iommu_is_present()) {
>   return;
>   }
> 
>  diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
>  index 7bcce3bceb0f..eb4ea071ecec 100644
>  --- a/include/qemu/osdep.h
>  +++ b/include/qemu/osdep.h
>  @@ -637,6 +637,15 @@ char *qemu_get_host_name(Error **errp);
>    */
>   size_t qemu_get_host_physmem(void);
> 
>  +/**
>  + * qemu_amd_iommu_is_present:
>  + *
>  + * Operating system agnostic way of querying if an AMD IOMMU
>  + * is present.
>  + *
>  + */
>  +bool qemu_amd_iommu_is_present(void);
>  +
>   /*
>    * Toggle write/execute on the pages marked MAP_JIT
>    * for the current thread.
>  diff --git a/util/oslib-posix.c b/util/oslib-posix.c
>  index f2be7321c59f..54cef21217c4 100644
>  --- a/util/oslib-posix.c
>  +++ b/util/oslib-posix.c
>  @@ -982,3 +982,32 @@ size_t qemu_get_host_physmem(void)
>   #endif
>   return 0;
>   }
>  +
>  +bool qemu_amd_iommu_is_present(void)
>  +{
>  +bool found = false;
>  +#ifdef CONFIG_LINUX
>  +struct dirent *entry;
>  +char *path;
>  +DIR *dir;
>  +
>  +path = g_strdup_printf("/sys/class/iommu");
>  +dir = opendir(path);
>  +if (!dir) {
>  +g_free(path);
>  +return found;
> 

Re: [PATCH v5 03/18] pci: isolated address space for PCI bus

2022-02-10 Thread Alex Williamson
On Thu, 10 Feb 2022 18:28:56 -0500
"Michael S. Tsirkin"  wrote:

> On Thu, Feb 10, 2022 at 04:17:34PM -0700, Alex Williamson wrote:
> > On Thu, 10 Feb 2022 22:23:01 +
> > Jag Raman  wrote:
> >   
> > > > On Feb 10, 2022, at 3:02 AM, Michael S. Tsirkin  wrote:
> > > > 
> > > > On Thu, Feb 10, 2022 at 12:08:27AM +, Jag Raman wrote:
> > > >> 
> > > >> Thanks for the explanation, Alex. Thanks to everyone else in the 
> > > >> thread who
> > > >> helped to clarify this problem.
> > > >> 
> > > >> We have implemented the memory isolation based on the discussion in the
> > > >> thread. We will send the patches out shortly.
> > > >> 
> > > >> Devices such as “name" and “e1000” worked fine. But I’d like to note 
> > > >> that
> > > >> the LSI device (TYPE_LSI53C895A) had some problems - it doesn’t seem
> > > >> to be IOMMU aware. In LSI’s case, the kernel driver is asking the 
> > > >> device to
> > > >> read instructions from the CPU VA (lsi_execute_script() -> 
> > > >> read_dword()),
> > > >> which is forbidden when IOMMU is enabled. Specifically, the driver is 
> > > >> asking
> > > >> the device to access other BAR regions by using the BAR address 
> > > >> programmed
> > > >> in the PCI config space. This happens even without vfio-user patches. 
> > > >> For example,
> > > >> we could enable IOMMU using “-device intel-iommu” QEMU option and also
> > > >> adding the following to the kernel command-line: “intel_iommu=on 
> > > >> iommu=nopt”.
> > > >> In this case, we could see an IOMMU fault.
> > > > 
> > > > So, device accessing its own BAR is different. Basically, these
> > > > transactions never go on the bus at all, never mind get to the IOMMU.   
> > > >  
> > > 
> > > Hi Michael,
> > > 
> > > In LSI case, I did notice that it went to the IOMMU. The device is 
> > > reading the BAR
> > > address as if it was a DMA address.
> > >   
> > > > I think it's just used as a handle to address internal device memory.
> > > > This kind of trick is not universal, but not terribly unusual.
> > > > 
> > > > 
> > > >> Unfortunately, we started off our project with the LSI device. So that 
> > > >> lead to all the
> > > >> confusion about what is expected at the server end in-terms of
> > > >> vectoring/address-translation. It gave an impression as if the request 
> > > >> was still on
> > > >> the CPU side of the PCI root complex, but the actual problem was with 
> > > >> the
> > > >> device driver itself.
> > > >> 
> > > >> I’m wondering how to deal with this problem. Would it be OK if we 
> > > >> mapped the
> > > >> device’s BAR into the IOVA, at the same CPU VA programmed in the BAR 
> > > >> registers?
> > > >> This would help devices such as LSI to circumvent this problem. One 
> > > >> problem
> > > >> with this approach is that it has the potential to collide with 
> > > >> another legitimate
> > > >> IOVA address. Kindly share your thought on this.
> > > >> 
> > > >> Thank you!
> > > > 
> > > > I am not 100% sure what do you plan to do but it sounds fine since even
> > > > if it collides, with traditional PCI device must never initiate cycles  
> > > >   
> > > 
> > > OK sounds good, I’ll create a mapping of the device BARs in the IOVA.  
> > 
> > I don't think this is correct.  Look for instance at ACPI _TRA support
> > where a system can specify a translation offset such that, for example,
> > a CPU access to a device is required to add the provided offset to the
> > bus address of the device.  A system using this could have multiple
> > root bridges, where each is given the same, overlapping MMIO aperture.  
> > >From the processor perspective, each MMIO range is unique and possibly  
> > none of those devices have a zero _TRA, there could be system memory at
> > the equivalent flat memory address.  
> 
> I am guessing there are reasons to have these in acpi besides firmware
> vendors wanting to find corner cases in device implementations though
> :).

Re: [PATCH v5 03/18] pci: isolated address space for PCI bus

2022-02-10 Thread Alex Williamson
On Thu, 10 Feb 2022 22:23:01 +
Jag Raman  wrote:

> > On Feb 10, 2022, at 3:02 AM, Michael S. Tsirkin  wrote:
> > 
> > On Thu, Feb 10, 2022 at 12:08:27AM +, Jag Raman wrote:  
> >> 
> >> Thanks for the explanation, Alex. Thanks to everyone else in the thread who
> >> helped to clarify this problem.
> >> 
> >> We have implemented the memory isolation based on the discussion in the
> >> thread. We will send the patches out shortly.
> >> 
> >> Devices such as “name" and “e1000” worked fine. But I’d like to note that
> >> the LSI device (TYPE_LSI53C895A) had some problems - it doesn’t seem
> >> to be IOMMU aware. In LSI’s case, the kernel driver is asking the device to
> >> read instructions from the CPU VA (lsi_execute_script() -> read_dword()),
> >> which is forbidden when IOMMU is enabled. Specifically, the driver is 
> >> asking
> >> the device to access other BAR regions by using the BAR address programmed
> >> in the PCI config space. This happens even without vfio-user patches. For 
> >> example,
> >> we could enable IOMMU using “-device intel-iommu” QEMU option and also
> >> adding the following to the kernel command-line: “intel_iommu=on 
> >> iommu=nopt”.
> >> In this case, we could see an IOMMU fault.  
> > 
> > So, device accessing its own BAR is different. Basically, these
> > transactions never go on the bus at all, never mind get to the IOMMU.  
> 
> Hi Michael,
> 
> In LSI case, I did notice that it went to the IOMMU. The device is reading 
> the BAR
> address as if it was a DMA address.
> 
> > I think it's just used as a handle to address internal device memory.
> > This kind of trick is not universal, but not terribly unusual.
> > 
> >   
> >> Unfortunately, we started off our project with the LSI device. So that 
> >> lead to all the
> >> confusion about what is expected at the server end in-terms of
> >> vectoring/address-translation. It gave an impression as if the request was 
> >> still on
> >> the CPU side of the PCI root complex, but the actual problem was with the
> >> device driver itself.
> >> 
> >> I’m wondering how to deal with this problem. Would it be OK if we mapped 
> >> the
> >> device’s BAR into the IOVA, at the same CPU VA programmed in the BAR 
> >> registers?
> >> This would help devices such as LSI to circumvent this problem. One problem
> >> with this approach is that it has the potential to collide with another 
> >> legitimate
> >> IOVA address. Kindly share your thought on this.
> >> 
> >> Thank you!  
> > 
> > I am not 100% sure what do you plan to do but it sounds fine since even
> > if it collides, with traditional PCI device must never initiate cycles  
> 
> OK sounds good, I’ll create a mapping of the device BARs in the IOVA.

I don't think this is correct.  Look for instance at ACPI _TRA support
where a system can specify a translation offset such that, for example,
a CPU access to a device is required to add the provided offset to the
bus address of the device.  A system using this could have multiple
root bridges, where each is given the same, overlapping MMIO aperture.
>From the processor perspective, each MMIO range is unique and possibly
none of those devices have a zero _TRA, there could be system memory at
the equivalent flat memory address.

So if the transaction actually hits this bus, which I think is what
making use of the device AddressSpace implies, I don't think it can
assume that it's simply reflected back at itself.  Conventional PCI and
PCI Express may be software compatible, but there's a reason we don't
see IOMMUs that provide both translation and isolation in conventional
topologies.

Is this more a bug in the LSI device emulation model?  For instance in
vfio-pci, if I want to access an offset into a BAR from within QEMU, I
don't care what address is programmed into that BAR, I perform an
access relative to the vfio file descriptor region representing that
BAR space.  I'd expect that any viable device emulation model does the
same, an access to device memory uses an offset from an internal
resource, irrespective of the BAR address.

It would seem strange if the driver is actually programming the device
to DMA to itself and if that's actually happening, I'd wonder if this
driver is actually compatible with an IOMMU on bare metal.

> > within their own BAR range, and PCIe is software-compatible with PCI. So
> > devices won't be able to access this IOVA even if it was programmed in
> > the IOMMU.
> > 
> > As was mentioned elsewhere on this thread, devices accessing each
> > other's BAR is a different matter.
> > 
> > I do not remember which rules apply to multiple functions of a
> > multi-function device though. I think in a traditional PCI
> > they will never go out on the bus, but with e.g. SRIOV they
> > would probably do go out? Alex, any idea?

This falls under implementation specific behavior in the spec, IIRC.
This is actually why IOMMU grouping requires ACS support on
multi-function devices to clarify the behavior of p2p 

Re: [PATCH v4 1/2] tpm: CRB: Use ram_device for "tpm-crb-cmd" region

2022-02-08 Thread Alex Williamson
On Tue, 8 Feb 2022 16:01:48 +
Peter Maydell  wrote:

> On Tue, 8 Feb 2022 at 15:56, Eric Auger  wrote:
> >
> > Hi Peter,
> >
> > On 2/8/22 4:17 PM, Peter Maydell wrote:  
> > > On Tue, 8 Feb 2022 at 15:08, Eric Auger  wrote:  
> > >> Representing the CRB cmd/response buffer as a standard
> > >> RAM region causes some trouble when the device is used
> > >> with VFIO. Indeed VFIO attempts to DMA_MAP this region
> > >> as usual RAM but this latter does not have a valid page
> > >> size alignment causing such an error report:
> > >> "vfio_listener_region_add received unaligned region".
> > >> To allow VFIO to detect that failing dma mapping
> > >> this region is not an issue, let's use a ram_device
> > >> memory region type instead.  
> > > This seems like VFIO's problem to me. There's nothing
> > > that guarantees alignment for memory regions at all,
> > > whether they're RAM, IO or anything else.  
> >
> > VFIO dma maps all the guest RAM.  
> 
> Well, it can if it likes, but "this is a RAM-backed MemoryRegion"
> doesn't imply "this is really guest actual RAM RAM", so if it's
> using that as its discriminator it should probably use something else.
> What is it actually trying to do here ?

VFIO is device agnostic, we don't understand the device programming
model, we can't know how the device is programmed to perform DMA.  The
only way we can provide transparent assignment of arbitrary PCI devices
is to install DMA mappings for everything in the device AddressSpace
through the system IOMMU.  If we were to get a sub-page RAM mapping
through the MemoryListener and that mapping had the possibility of
being a DMA target, then we have a problem, because we cannot represent
that through the IOMMU.  If the device were to use that address for DMA,
we'd likely have data loss/corruption in the VM.

AFAIK, and I thought we had some general agreement on this, declaring
device memory as ram_device is the only means we have to differentiate
MemoryRegion segments generated by a device from actual system RAM.
For device memory, we can lean on the fact that peer-to-peer DMA is
much more rare and likely involves some degree of validation by the
drivers since it can be blocked on physical hardware due to various
topology and chipset related issues.  Therefore we can consider
failures to map device memory at a lower risk than failures to map
ranges we think are actual system RAM.

Are there better approaches?  We can't rely on the device sitting
behind a vIOMMU in the guest to restrict the address space and we can't
afford the performance hit for dyanmic DMA mappings through a vIOMMU
either.  Thanks,

Alex




Re: [PULL 0/2] VFIO fixes 2022-02-03

2022-02-07 Thread Alex Williamson
On Mon, 7 Feb 2022 09:54:59 -0700
Alex Williamson  wrote:

> On Mon, 7 Feb 2022 17:08:01 +0100
> Philippe Mathieu-Daudé  wrote:
> 
> > On 7/2/22 16:50, Alex Williamson wrote:  
> > > On Sat, 5 Feb 2022 10:49:35 +
> > > Peter Maydell  wrote:
> >   
> > >> Hi; this has a format-string issue that means it doesn't build
> > >> on 32-bit systems:
> > >>
> > >> https://gitlab.com/qemu-project/qemu/-/jobs/2057116569
> > >>
> > >> ../hw/vfio/common.c: In function 'vfio_listener_region_add':
> > >> ../hw/vfio/common.c:893:26: error: format '%llx' expects argument of
> > >> type 'long long unsigned int', but argument 6 has type 'intptr_t' {aka
> > >> 'int'} [-Werror=format=]
> > >> error_report("%s received unaligned region %s iova=0x%"PRIx64
> > >> ^~
> > >> ../hw/vfio/common.c:899:26:
> > >> qemu_real_host_page_mask);
> > >> 
> > >>
> > >> For intptr_t you want PRIxPTR.
> > > 
> > > Darn.  Well, let me use this opportunity to ask, how are folks doing
> > > 32-bit cross builds on Fedora?  I used to keep an i686 PAE VM for this
> > > purpose, but I was eventually no longer able to maintain the build
> > > dependencies.  Looks like this failed on a mipsel cross build, but I
> > > don't see such a cross compiler in Fedora.  I do mingw32/64 cross
> > > builds, but they leave a lot to be desired for code coverage.  Thanks,
> > 
> > You can use docker images:
> > https://wiki.qemu.org/Testing/DockerBuild  
> 
> Hmm, not ideal...
> 
> Clean git clone, HEAD 55ef0b702bc2 ("Merge remote-tracking branch 
> 'remotes/lvivier-gitlab/tags/linux-user-for-7.0-pull-request' into staging")
> 
> $ make docker-test-quick@debian-mips64el-cross J=16

Accidentally selected the mips64el, but tests failing seems to be
common.  I can reproduce the build issue with either the mipsel or
fedora-i386-cross, so I'll include some flavor of the test-build in my
build script.  Thanks,

Alex




Re: [PULL 0/2] VFIO fixes 2022-02-03

2022-02-07 Thread Alex Williamson
On Mon, 7 Feb 2022 17:08:01 +0100
Philippe Mathieu-Daudé  wrote:

> On 7/2/22 16:50, Alex Williamson wrote:
> > On Sat, 5 Feb 2022 10:49:35 +
> > Peter Maydell  wrote:  
> 
> >> Hi; this has a format-string issue that means it doesn't build
> >> on 32-bit systems:
> >>
> >> https://gitlab.com/qemu-project/qemu/-/jobs/2057116569
> >>
> >> ../hw/vfio/common.c: In function 'vfio_listener_region_add':
> >> ../hw/vfio/common.c:893:26: error: format '%llx' expects argument of
> >> type 'long long unsigned int', but argument 6 has type 'intptr_t' {aka
> >> 'int'} [-Werror=format=]
> >> error_report("%s received unaligned region %s iova=0x%"PRIx64
> >> ^~
> >> ../hw/vfio/common.c:899:26:
> >> qemu_real_host_page_mask);
> >> 
> >>
> >> For intptr_t you want PRIxPTR.  
> > 
> > Darn.  Well, let me use this opportunity to ask, how are folks doing
> > 32-bit cross builds on Fedora?  I used to keep an i686 PAE VM for this
> > purpose, but I was eventually no longer able to maintain the build
> > dependencies.  Looks like this failed on a mipsel cross build, but I
> > don't see such a cross compiler in Fedora.  I do mingw32/64 cross
> > builds, but they leave a lot to be desired for code coverage.  Thanks,  
> 
> You can use docker images:
> https://wiki.qemu.org/Testing/DockerBuild

Hmm, not ideal...

Clean git clone, HEAD 55ef0b702bc2 ("Merge remote-tracking branch 
'remotes/lvivier-gitlab/tags/linux-user-for-7.0-pull-request' into staging")

$ make docker-test-quick@debian-mips64el-cross J=16
...
1/1 qemu:block / qemu-iotests qcow2 RUNNING   
>>> PYTHON=/usr/bin/python3 MALLOC_PERTURB_=188 /bin/sh 
>>> /tmp/qemu-test/build/../src/tests/qemu-iotests/../check-block.sh qcow2
1/1 qemu:block / qemu-iotests qcow2 ERROR   0.18s   exit status 1


Summary of Failures:

1/1 qemu:block / qemu-iotests qcow2 ERROR   0.18s   exit status 1


Ok: 0   
Expected Fail:  0   
Fail:   1   
Unexpected Pass:0   
Skipped:0   
Timeout:0   

Full log written to /tmp/qemu-test/build/meson-logs/iotestslog.txt
make: *** [/tmp/qemu-test/src/tests/Makefile.include:160: check-block] Error 1
make: *** Waiting for unfinished jobs
130/131 qemu:qapi-schema+qapi-frontend / QAPI schema regression tests OK
  0.20s
131/131 qemu:decodetree / decodetree  OK
  1.75s


Ok: 3   
Expected Fail:  0   
Fail:   0   
Unexpected Pass:0   
Skipped:128 
Timeout:0   

Full log written to /tmp/qemu-test/build/meson-logs/testlog.txt
Traceback (most recent call last):
  File "/tmp/qemu.git/./tests/docker/docker.py", line 758, in 
sys.exit(main())
  File "/tmp/qemu.git/./tests/docker/docker.py", line 754, in main
return args.cmdobj.run(args, argv)
  File "/tmp/qemu.git/./tests/docker/docker.py", line 430, in run
return Docker().run(argv, args.keep, quiet=args.quiet,
  File "/tmp/qemu.git/./tests/docker/docker.py", line 388, in run
ret = self._do_check(["run", "--rm", "--label",
  File "/tmp/qemu.git/./tests/docker/docker.py", line 252, in _do_check
return subprocess.check_call(self._command + cmd, **kwargs)
  File "/usr/lib64/python3.9/subprocess.py", line 373, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['podman', 'run', '--rm', '--label', 
'com.qemu.instance.uuid=560d8331a06b4fd9bbb74910f3a2b436', '--userns=keep-id', 
'-u', '1000', '--security-opt', 'seccomp=unconfined', '--net=none', '-e', 
'TARGET_LIST=', '-e', 'EXTRA_CONFIGURE_OPTS=', '-e', 'V=', '-e', 'J=16', '-e', 
'DEBUG=', '-e', 'SHOW_ENV=', '-e', 'CCACHE_DIR=/var/tmp/ccache', '-v', 
'/home/alwillia/.cache/qemu-docker-ccache:/var/tmp/ccache:z', '-v', 
'/tmp/qemu.git/docker-src.2022-02-07-09.45.59.2258561:/var/tmp/qemu:z,ro', 
'qemu/debian-mips64el-cross', '/var/tmp/qemu/run', 'test-quick']' returned 
non-zero exit status 2.
filter=--filter=label=com.qemu.instance.uuid=560d8331a06b4fd9bbb74910f3a2b436
make[1]: *** [tests/docker/Makefile.include:306: docker-run] Error 1
make[1]: Leaving directory '/tmp/qemu.git'
make: *** [tests/docker/Makefile.include:339: 
docker-run-test-quick@debian-mips64el-cross] Error 2




Re: [PULL 0/2] VFIO fixes 2022-02-03

2022-02-07 Thread Alex Williamson
On Sat, 5 Feb 2022 10:49:35 +
Peter Maydell  wrote:

> On Thu, 3 Feb 2022 at 22:38, Alex Williamson  
> wrote:
> >
> > The following changes since commit 8f3e5ce773c62bb5c4a847f3a9a5c98bbb3b359f:
> >
> >   Merge remote-tracking branch 
> > 'remotes/hdeller/tags/hppa-updates-pull-request' into staging (2022-02-02 
> > 19:54:30 +)
> >
> > are available in the Git repository at:
> >
> >   git://github.com/awilliam/qemu-vfio.git tags/vfio-fixes-20220203.0
> >
> > for you to fetch changes up to 36fe5d5836c8d5d928ef6d34e999d6991a2f732e:
> >
> >   hw/vfio/common: Silence ram device offset alignment error traces 
> > (2022-02-03 15:05:05 -0700)
> >
> > 
> > VFIO fixes 2022-02-03
> >
> >  * Fix alignment warnings when using TPM CRB with vfio-pci devices
> >(Eric Auger & Philippe Mathieu-Daudé)  
> 
> Hi; this has a format-string issue that means it doesn't build
> on 32-bit systems:
> 
> https://gitlab.com/qemu-project/qemu/-/jobs/2057116569
> 
> ../hw/vfio/common.c: In function 'vfio_listener_region_add':
> ../hw/vfio/common.c:893:26: error: format '%llx' expects argument of
> type 'long long unsigned int', but argument 6 has type 'intptr_t' {aka
> 'int'} [-Werror=format=]
> error_report("%s received unaligned region %s iova=0x%"PRIx64
> ^~
> ../hw/vfio/common.c:899:26:
> qemu_real_host_page_mask);
> 
> 
> For intptr_t you want PRIxPTR.

Darn.  Well, let me use this opportunity to ask, how are folks doing
32-bit cross builds on Fedora?  I used to keep an i686 PAE VM for this
purpose, but I was eventually no longer able to maintain the build
dependencies.  Looks like this failed on a mipsel cross build, but I
don't see such a cross compiler in Fedora.  I do mingw32/64 cross
builds, but they leave a lot to be desired for code coverage.  Thanks,

Alex




[PULL 1/2] tpm: CRB: Use ram_device for "tpm-crb-cmd" region

2022-02-03 Thread Alex Williamson
From: Eric Auger 

Representing the CRB cmd/response buffer as a standard
RAM region causes some trouble when the device is used
with VFIO. Indeed VFIO attempts to DMA_MAP this region
as usual RAM but this latter does not have a valid page
size alignment causing such an error report:
"vfio_listener_region_add received unaligned region".
To allow VFIO to detect that failing dma mapping
this region is not an issue, let's use a ram_device
memory region type instead.

Signed-off-by: Eric Auger 
Tested-by: Stefan Berger 
Acked-by: Stefan Berger 
[PMD: Keep tpm_crb.c in meson's softmmu_ss]
Signed-off-by: Philippe Mathieu-Daudé 
Link: https://lore.kernel.org/r/20220120001242.230082-2-f4...@amsat.org
Signed-off-by: Alex Williamson 
---
 hw/tpm/tpm_crb.c |   22 --
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/hw/tpm/tpm_crb.c b/hw/tpm/tpm_crb.c
index 58ebd1469c35..be0884ea6031 100644
--- a/hw/tpm/tpm_crb.c
+++ b/hw/tpm/tpm_crb.c
@@ -25,6 +25,7 @@
 #include "sysemu/tpm_backend.h"
 #include "sysemu/tpm_util.h"
 #include "sysemu/reset.h"
+#include "exec/cpu-common.h"
 #include "tpm_prop.h"
 #include "tpm_ppi.h"
 #include "trace.h"
@@ -43,6 +44,7 @@ struct CRBState {
 
 bool ppi_enabled;
 TPMPPI ppi;
+uint8_t *crb_cmd_buf;
 };
 typedef struct CRBState CRBState;
 
@@ -291,10 +293,14 @@ static void tpm_crb_realize(DeviceState *dev, Error 
**errp)
 return;
 }
 
+s->crb_cmd_buf = qemu_memalign(qemu_real_host_page_size,
+HOST_PAGE_ALIGN(CRB_CTRL_CMD_SIZE));
+
 memory_region_init_io(>mmio, OBJECT(s), _crb_memory_ops, s,
 "tpm-crb-mmio", sizeof(s->regs));
-memory_region_init_ram(>cmdmem, OBJECT(s),
-"tpm-crb-cmd", CRB_CTRL_CMD_SIZE, errp);
+memory_region_init_ram_device_ptr(>cmdmem, OBJECT(s), "tpm-crb-cmd",
+  CRB_CTRL_CMD_SIZE, s->crb_cmd_buf);
+vmstate_register_ram(>cmdmem, DEVICE(s));
 
 memory_region_add_subregion(get_system_memory(),
 TPM_CRB_ADDR_BASE, >mmio);
@@ -309,12 +315,24 @@ static void tpm_crb_realize(DeviceState *dev, Error 
**errp)
 qemu_register_reset(tpm_crb_reset, dev);
 }
 
+static void tpm_crb_unrealize(DeviceState *dev)
+{
+CRBState *s = CRB(dev);
+
+qemu_vfree(s->crb_cmd_buf);
+
+if (s->ppi_enabled) {
+qemu_vfree(s->ppi.buf);
+}
+}
+
 static void tpm_crb_class_init(ObjectClass *klass, void *data)
 {
 DeviceClass *dc = DEVICE_CLASS(klass);
 TPMIfClass *tc = TPM_IF_CLASS(klass);
 
 dc->realize = tpm_crb_realize;
+dc->unrealize = tpm_crb_unrealize;
 device_class_set_props(dc, tpm_crb_properties);
 dc->vmsd  = _tpm_crb;
 dc->user_creatable = true;





[PULL 2/2] hw/vfio/common: Silence ram device offset alignment error traces

2022-02-03 Thread Alex Williamson
From: Eric Auger 

Failing to DMA MAP a ram_device should not cause an error message.
This is currently happening with the TPM CRB command region and
this is causing confusion.

We may want to keep the trace for debug purpose though.

Signed-off-by: Eric Auger 
Tested-by: Stefan Berger 
Acked-by: Alex Williamson 
Acked-by: Stefan Berger 
Signed-off-by: Philippe Mathieu-Daudé 
Link: https://lore.kernel.org/r/20220120001242.230082-3-f4...@amsat.org
Signed-off-by: Alex Williamson 
---
 hw/vfio/common.c |   15 ++-
 hw/vfio/trace-events |1 +
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 080046e3f511..9caa560b0788 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -884,7 +884,20 @@ static void vfio_listener_region_add(MemoryListener 
*listener,
 if (unlikely((section->offset_within_address_space &
   ~qemu_real_host_page_mask) !=
  (section->offset_within_region & ~qemu_real_host_page_mask))) 
{
-error_report("%s received unaligned region", __func__);
+if (memory_region_is_ram_device(section->mr)) { /* just debug purpose 
*/
+trace_vfio_listener_region_add_bad_offset_alignment(
+memory_region_name(section->mr),
+section->offset_within_address_space,
+section->offset_within_region, qemu_real_host_page_size);
+} else { /* error case we don't want to be fatal */
+error_report("%s received unaligned region %s iova=0x%"PRIx64
+ " offset_within_region=0x%"PRIx64
+ " qemu_real_host_page_mask=0x%"PRIx64,
+ __func__, memory_region_name(section->mr),
+ section->offset_within_address_space,
+ section->offset_within_region,
+ qemu_real_host_page_mask);
+}
 return;
 }
 
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 0ef1b5f4a65f..ccd9d7610d69 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -100,6 +100,7 @@ vfio_listener_region_add_skip(uint64_t start, uint64_t end) 
"SKIPPING region_add
 vfio_spapr_group_attach(int groupfd, int tablefd) "Attached groupfd %d to 
liobn fd %d"
 vfio_listener_region_add_iommu(uint64_t start, uint64_t end) "region_add 
[iommu] 0x%"PRIx64" - 0x%"PRIx64
 vfio_listener_region_add_ram(uint64_t iova_start, uint64_t iova_end, void 
*vaddr) "region_add [ram] 0x%"PRIx64" - 0x%"PRIx64" [%p]"
+vfio_listener_region_add_bad_offset_alignment(const char *name, uint64_t iova, 
uint64_t offset_within_region, uint64_t page_size) "Region \"%s\" @0x%"PRIx64", 
offset_within_region=0x%"PRIx64", qemu_real_host_page_mask=0x%"PRIx64 " cannot 
be mapped for DMA"
 vfio_listener_region_add_no_dma_map(const char *name, uint64_t iova, uint64_t 
size, uint64_t page_size) "Region \"%s\" 0x%"PRIx64" size=0x%"PRIx64" is not 
aligned to 0x%"PRIx64" and cannot be mapped for DMA"
 vfio_listener_region_del_skip(uint64_t start, uint64_t end) "SKIPPING 
region_del 0x%"PRIx64" - 0x%"PRIx64
 vfio_listener_region_del(uint64_t start, uint64_t end) "region_del 0x%"PRIx64" 
- 0x%"PRIx64





[PULL 0/2] VFIO fixes 2022-02-03

2022-02-03 Thread Alex Williamson
The following changes since commit 8f3e5ce773c62bb5c4a847f3a9a5c98bbb3b359f:

  Merge remote-tracking branch 'remotes/hdeller/tags/hppa-updates-pull-request' 
into staging (2022-02-02 19:54:30 +)

are available in the Git repository at:

  git://github.com/awilliam/qemu-vfio.git tags/vfio-fixes-20220203.0

for you to fetch changes up to 36fe5d5836c8d5d928ef6d34e999d6991a2f732e:

  hw/vfio/common: Silence ram device offset alignment error traces (2022-02-03 
15:05:05 -0700)


VFIO fixes 2022-02-03

 * Fix alignment warnings when using TPM CRB with vfio-pci devices
   (Eric Auger & Philippe Mathieu-Daudé)


Eric Auger (2):
  tpm: CRB: Use ram_device for "tpm-crb-cmd" region
  hw/vfio/common: Silence ram device offset alignment error traces

 hw/tpm/tpm_crb.c | 22 --
 hw/vfio/common.c | 15 ++-
 hw/vfio/trace-events |  1 +
 3 files changed, 35 insertions(+), 3 deletions(-)




Re: [PATCH v5 03/18] pci: isolated address space for PCI bus

2022-02-02 Thread Alex Williamson
On Wed, 2 Feb 2022 09:30:42 +
Peter Maydell  wrote:

> On Tue, 1 Feb 2022 at 23:51, Alex Williamson  
> wrote:
> >
> > On Tue, 1 Feb 2022 21:24:08 +
> > Jag Raman  wrote:  
> > > The PCIBus data structure already has address_space_mem and
> > > address_space_io to contain the BAR regions of devices attached
> > > to it. I understand that these two PCIBus members form the
> > > PCI address space.  
> >
> > These are the CPU address spaces.  When there's no IOMMU, the PCI bus is
> > identity mapped to the CPU address space.  When there is an IOMMU, the
> > device address space is determined by the granularity of the IOMMU and
> > may be entirely separate from address_space_mem.  
> 
> Note that those fields in PCIBus are just whatever MemoryRegions
> the pci controller model passed in to the call to pci_root_bus_init()
> or equivalent. They may or may not be specifically the CPU's view
> of anything. (For instance on the versatilepb board, the PCI controller
> is visible to the CPU via several MMIO "windows" at known addresses,
> which let the CPU access into the PCI address space at a programmable
> offset. We model that by creating a couple of container MRs which
> we pass to pci_root_bus_init() to be the PCI memory and IO spaces,
> and then using alias MRs to provide the view into those at the
> guest-programmed offset. The CPU sees those windows, and doesn't
> have direct access to the whole PCIBus::address_space_mem.)
> I guess you could say they're the PCI controller's view of the PCI
> address space ?

Sure, that's fair.

> We have a tendency to be a bit sloppy with use of AddressSpaces
> within QEMU where it happens that the view of the world that a
> DMA-capable device matches that of the CPU, but conceptually
> they can definitely be different, especially in the non-x86 world.
> (Linux also confuses matters here by preferring to program a 1:1
> mapping even if the hardware is more flexible and can do other things.
> The model of the h/w in QEMU should support the other cases too, not
> just 1:1.)

Right, this is why I prefer to look at the device address space as
simply an IOVA.  The IOVA might be a direct physical address or
coincidental identity mapped physical address via an IOMMU, but none of
that should be the concern of the device.
 
> > I/O port space is always the identity mapped CPU address space unless
> > sparse translations are used to create multiple I/O port spaces (not
> > implemented).  I/O port space is only accessed by the CPU, there are no
> > device initiated I/O port transactions, so the address space relative
> > to the device is irrelevant.  
> 
> Does the PCI spec actually forbid any master except the CPU from
> issuing I/O port transactions, or is it just that in practice nobody
> makes a PCI device that does weird stuff like that ?

As realized in reply to MST, more the latter.  Not used, no point to
enabling, no means to enable depending on the physical IOMMU
implementation.  Thanks,

Alex




Re: [PATCH v5 03/18] pci: isolated address space for PCI bus

2022-02-02 Thread Alex Williamson
On Wed, 2 Feb 2022 05:06:49 -0500
"Michael S. Tsirkin"  wrote:

> On Wed, Feb 02, 2022 at 09:30:42AM +, Peter Maydell wrote:
> > > I/O port space is always the identity mapped CPU address space unless
> > > sparse translations are used to create multiple I/O port spaces (not
> > > implemented).  I/O port space is only accessed by the CPU, there are no
> > > device initiated I/O port transactions, so the address space relative
> > > to the device is irrelevant.  
> > 
> > Does the PCI spec actually forbid any master except the CPU from
> > issuing I/O port transactions, or is it just that in practice nobody
> > makes a PCI device that does weird stuff like that ?
> > 
> > thanks
> > -- PMM  
> 
> Hmm, the only thing vaguely related in the spec that I know of is this:
> 
>   PCI Express supports I/O Space for compatibility with legacy devices 
> which require their use.
>   Future revisions of this specification may deprecate the use of I/O 
> Space.
> 
> Alex, what did you refer to?

My evidence is largely by omission, but that might be that in practice
it's not used rather than explicitly forbidden.  I note that the bus
master enable bit specifies:

Bus Master Enable - Controls the ability of a Function to issue
Memory and I/O Read/Write Requests, and the ability of
a Port to forward Memory and I/O Read/Write Requests in
the Upstream direction.

That would suggest it's possible, but for PCI device assignment, I'm
not aware of any means through which we could support this.  There is
no support in the IOMMU core for mapping I/O port space, nor could we
trap such device initiated transactions to emulate them.  I can't spot
any mention of I/O port space in the VT-d spec, however the AMD-Vi spec
does include a field in the device table:

controlIoCtl: port I/O control. Specifies whether
device-initiated port I/O space transactions are blocked,
forwarded, or translated.

00b=Device-initiated port I/O is not allowed. The IOMMU target
aborts the transaction if a port I/O space transaction is
received. Translation requests are target aborted.

01b=Device-initiated port I/O space transactions are allowed.
The IOMMU must pass port I/O accesses untranslated. Translation
requests are target aborted.

10b=Transactions in the port I/O space address range are
translated by the IOMMU page tables as memory transactions.

11b=Reserved.

I don't see this field among the macros used by the Linux driver in
configuring these device entries, so I assume it's left to the default
value, ie. zero, blocking device initiated I/O port transactions.

So yes, I suppose device initiated I/O port transactions are possible,
but we have no support or reason to support them, so I'm going to go
ahead and continue believing any I/O port address space from the device
perspective is largely irrelevant ;)  Thanks,

Alex




Re: [PATCH v5 03/18] pci: isolated address space for PCI bus

2022-02-01 Thread Alex Williamson
On Wed, 2 Feb 2022 01:13:22 +
Jag Raman  wrote:

> > On Feb 1, 2022, at 5:47 PM, Alex Williamson  
> > wrote:
> > 
> > On Tue, 1 Feb 2022 21:24:08 +
> > Jag Raman  wrote:
> >   
> >>> On Feb 1, 2022, at 10:24 AM, Alex Williamson  
> >>> wrote:
> >>> 
> >>> On Tue, 1 Feb 2022 09:30:35 +
> >>> Stefan Hajnoczi  wrote:
> >>>   
> >>>> On Mon, Jan 31, 2022 at 09:16:23AM -0700, Alex Williamson wrote:
> >>>>> On Fri, 28 Jan 2022 09:18:08 +
> >>>>> Stefan Hajnoczi  wrote:
> >>>>>   
> >>>>>> On Thu, Jan 27, 2022 at 02:22:53PM -0700, Alex Williamson wrote:  
> >>>>>>> If the goal here is to restrict DMA between devices, ie. peer-to-peer
> >>>>>>> (p2p), why are we trying to re-invent what an IOMMU already does? 
> >>>>>>>
> >>>>>> 
> >>>>>> The issue Dave raised is that vfio-user servers run in separate
> >>>>>> processses from QEMU with shared memory access to RAM but no direct
> >>>>>> access to non-RAM MemoryRegions. The virtiofs DAX Window BAR is one
> >>>>>> example of a non-RAM MemoryRegion that can be the source/target of DMA
> >>>>>> requests.
> >>>>>> 
> >>>>>> I don't think IOMMUs solve this problem but luckily the vfio-user
> >>>>>> protocol already has messages that vfio-user servers can use as a
> >>>>>> fallback when DMA cannot be completed through the shared memory RAM
> >>>>>> accesses.
> >>>>>>   
> >>>>>>> In
> >>>>>>> fact, it seems like an IOMMU does this better in providing an IOVA
> >>>>>>> address space per BDF.  Is the dynamic mapping overhead too much?  
> >>>>>>> What
> >>>>>>> physical hardware properties or specifications could we leverage to
> >>>>>>> restrict p2p mappings to a device?  Should it be governed by machine
> >>>>>>> type to provide consistency between devices?  Should each "isolated"
> >>>>>>> bus be in a separate root complex?  Thanks,
> >>>>>> 
> >>>>>> There is a separate issue in this patch series regarding isolating the
> >>>>>> address space where BAR accesses are made (i.e. the global
> >>>>>> address_space_memory/io). When one process hosts multiple vfio-user
> >>>>>> server instances (e.g. a software-defined network switch with multiple
> >>>>>> ethernet devices) then each instance needs isolated memory and io 
> >>>>>> address
> >>>>>> spaces so that vfio-user clients don't cause collisions when they map
> >>>>>> BARs to the same address.
> >>>>>> 
> >>>>>> I think the the separate root complex idea is a good solution. This
> >>>>>> patch series takes a different approach by adding the concept of
> >>>>>> isolated address spaces into hw/pci/.  
> >>>>> 
> >>>>> This all still seems pretty sketchy, BARs cannot overlap within the
> >>>>> same vCPU address space, perhaps with the exception of when they're
> >>>>> being sized, but DMA should be disabled during sizing.
> >>>>> 
> >>>>> Devices within the same VM context with identical BARs would need to
> >>>>> operate in different address spaces.  For example a translation offset
> >>>>> in the vCPU address space would allow unique addressing to the devices,
> >>>>> perhaps using the translation offset bits to address a root complex and
> >>>>> masking those bits for downstream transactions.
> >>>>> 
> >>>>> In general, the device simply operates in an address space, ie. an
> >>>>> IOVA.  When a mapping is made within that address space, we perform a
> >>>>> translation as necessary to generate a guest physical address.  The
> >>>>> IOVA itself is only meaningful within the context of the address space,
> >>>>> there is no requirement or expectation for it to be globally unique.
> >>>>> 
> >>>>> If the vfio-user server is making some sort of requirement that IOVAs
> >

Re: [PATCH v5 03/18] pci: isolated address space for PCI bus

2022-02-01 Thread Alex Williamson
On Tue, 1 Feb 2022 21:24:08 +
Jag Raman  wrote:

> > On Feb 1, 2022, at 10:24 AM, Alex Williamson  
> > wrote:
> > 
> > On Tue, 1 Feb 2022 09:30:35 +
> > Stefan Hajnoczi  wrote:
> >   
> >> On Mon, Jan 31, 2022 at 09:16:23AM -0700, Alex Williamson wrote:  
> >>> On Fri, 28 Jan 2022 09:18:08 +
> >>> Stefan Hajnoczi  wrote:
> >>>   
> >>>> On Thu, Jan 27, 2022 at 02:22:53PM -0700, Alex Williamson wrote:
> >>>>> If the goal here is to restrict DMA between devices, ie. peer-to-peer
> >>>>> (p2p), why are we trying to re-invent what an IOMMU already does?  
> >>>> 
> >>>> The issue Dave raised is that vfio-user servers run in separate
> >>>> processses from QEMU with shared memory access to RAM but no direct
> >>>> access to non-RAM MemoryRegions. The virtiofs DAX Window BAR is one
> >>>> example of a non-RAM MemoryRegion that can be the source/target of DMA
> >>>> requests.
> >>>> 
> >>>> I don't think IOMMUs solve this problem but luckily the vfio-user
> >>>> protocol already has messages that vfio-user servers can use as a
> >>>> fallback when DMA cannot be completed through the shared memory RAM
> >>>> accesses.
> >>>>   
> >>>>> In
> >>>>> fact, it seems like an IOMMU does this better in providing an IOVA
> >>>>> address space per BDF.  Is the dynamic mapping overhead too much?  What
> >>>>> physical hardware properties or specifications could we leverage to
> >>>>> restrict p2p mappings to a device?  Should it be governed by machine
> >>>>> type to provide consistency between devices?  Should each "isolated"
> >>>>> bus be in a separate root complex?  Thanks,  
> >>>> 
> >>>> There is a separate issue in this patch series regarding isolating the
> >>>> address space where BAR accesses are made (i.e. the global
> >>>> address_space_memory/io). When one process hosts multiple vfio-user
> >>>> server instances (e.g. a software-defined network switch with multiple
> >>>> ethernet devices) then each instance needs isolated memory and io address
> >>>> spaces so that vfio-user clients don't cause collisions when they map
> >>>> BARs to the same address.
> >>>> 
> >>>> I think the the separate root complex idea is a good solution. This
> >>>> patch series takes a different approach by adding the concept of
> >>>> isolated address spaces into hw/pci/.
> >>> 
> >>> This all still seems pretty sketchy, BARs cannot overlap within the
> >>> same vCPU address space, perhaps with the exception of when they're
> >>> being sized, but DMA should be disabled during sizing.
> >>> 
> >>> Devices within the same VM context with identical BARs would need to
> >>> operate in different address spaces.  For example a translation offset
> >>> in the vCPU address space would allow unique addressing to the devices,
> >>> perhaps using the translation offset bits to address a root complex and
> >>> masking those bits for downstream transactions.
> >>> 
> >>> In general, the device simply operates in an address space, ie. an
> >>> IOVA.  When a mapping is made within that address space, we perform a
> >>> translation as necessary to generate a guest physical address.  The
> >>> IOVA itself is only meaningful within the context of the address space,
> >>> there is no requirement or expectation for it to be globally unique.
> >>> 
> >>> If the vfio-user server is making some sort of requirement that IOVAs
> >>> are unique across all devices, that seems very, very wrong.  Thanks,
> >> 
> >> Yes, BARs and IOVAs don't need to be unique across all devices.
> >> 
> >> The issue is that there can be as many guest physical address spaces as
> >> there are vfio-user clients connected, so per-client isolated address
> >> spaces are required. This patch series has a solution to that problem
> >> with the new pci_isol_as_mem/io() API.  
> > 
> > Sorry, this still doesn't follow for me.  A server that hosts multiple
> > devices across many VMs (I'm not sure if you're referring to the device
> > or the VM as a client) needs to deal with different address spaces per
> > device.  The 

Re: [PATCH v5 03/18] pci: isolated address space for PCI bus

2022-02-01 Thread Alex Williamson
On Tue, 1 Feb 2022 09:30:35 +
Stefan Hajnoczi  wrote:

> On Mon, Jan 31, 2022 at 09:16:23AM -0700, Alex Williamson wrote:
> > On Fri, 28 Jan 2022 09:18:08 +
> > Stefan Hajnoczi  wrote:
> >   
> > > On Thu, Jan 27, 2022 at 02:22:53PM -0700, Alex Williamson wrote:  
> > > > If the goal here is to restrict DMA between devices, ie. peer-to-peer
> > > > (p2p), why are we trying to re-invent what an IOMMU already does?
> > > 
> > > The issue Dave raised is that vfio-user servers run in separate
> > > processses from QEMU with shared memory access to RAM but no direct
> > > access to non-RAM MemoryRegions. The virtiofs DAX Window BAR is one
> > > example of a non-RAM MemoryRegion that can be the source/target of DMA
> > > requests.
> > > 
> > > I don't think IOMMUs solve this problem but luckily the vfio-user
> > > protocol already has messages that vfio-user servers can use as a
> > > fallback when DMA cannot be completed through the shared memory RAM
> > > accesses.
> > >   
> > > > In
> > > > fact, it seems like an IOMMU does this better in providing an IOVA
> > > > address space per BDF.  Is the dynamic mapping overhead too much?  What
> > > > physical hardware properties or specifications could we leverage to
> > > > restrict p2p mappings to a device?  Should it be governed by machine
> > > > type to provide consistency between devices?  Should each "isolated"
> > > > bus be in a separate root complex?  Thanks,
> > > 
> > > There is a separate issue in this patch series regarding isolating the
> > > address space where BAR accesses are made (i.e. the global
> > > address_space_memory/io). When one process hosts multiple vfio-user
> > > server instances (e.g. a software-defined network switch with multiple
> > > ethernet devices) then each instance needs isolated memory and io address
> > > spaces so that vfio-user clients don't cause collisions when they map
> > > BARs to the same address.
> > > 
> > > I think the the separate root complex idea is a good solution. This
> > > patch series takes a different approach by adding the concept of
> > > isolated address spaces into hw/pci/.  
> > 
> > This all still seems pretty sketchy, BARs cannot overlap within the
> > same vCPU address space, perhaps with the exception of when they're
> > being sized, but DMA should be disabled during sizing.
> > 
> > Devices within the same VM context with identical BARs would need to
> > operate in different address spaces.  For example a translation offset
> > in the vCPU address space would allow unique addressing to the devices,
> > perhaps using the translation offset bits to address a root complex and
> > masking those bits for downstream transactions.
> > 
> > In general, the device simply operates in an address space, ie. an
> > IOVA.  When a mapping is made within that address space, we perform a
> > translation as necessary to generate a guest physical address.  The
> > IOVA itself is only meaningful within the context of the address space,
> > there is no requirement or expectation for it to be globally unique.
> > 
> > If the vfio-user server is making some sort of requirement that IOVAs
> > are unique across all devices, that seems very, very wrong.  Thanks,  
> 
> Yes, BARs and IOVAs don't need to be unique across all devices.
> 
> The issue is that there can be as many guest physical address spaces as
> there are vfio-user clients connected, so per-client isolated address
> spaces are required. This patch series has a solution to that problem
> with the new pci_isol_as_mem/io() API.

Sorry, this still doesn't follow for me.  A server that hosts multiple
devices across many VMs (I'm not sure if you're referring to the device
or the VM as a client) needs to deal with different address spaces per
device.  The server needs to be able to uniquely identify every DMA,
which must be part of the interface protocol.  But I don't see how that
imposes a requirement of an isolated address space.  If we want the
device isolated because we don't trust the server, that's where an IOMMU
provides per device isolation.  What is the restriction of the
per-client isolated address space and why do we need it?  The server
needing to support multiple clients is not a sufficient answer to
impose new PCI bus types with an implicit restriction on the VM.
 
> What I find strange about this approach is that exported PCI devices are
> on PCI root ports that are connected to the machine's main PCI bus. The
> PCI devices don't 

Re: [PATCH v5 03/18] pci: isolated address space for PCI bus

2022-01-31 Thread Alex Williamson
On Fri, 28 Jan 2022 09:18:08 +
Stefan Hajnoczi  wrote:

> On Thu, Jan 27, 2022 at 02:22:53PM -0700, Alex Williamson wrote:
> > If the goal here is to restrict DMA between devices, ie. peer-to-peer
> > (p2p), why are we trying to re-invent what an IOMMU already does?  
> 
> The issue Dave raised is that vfio-user servers run in separate
> processses from QEMU with shared memory access to RAM but no direct
> access to non-RAM MemoryRegions. The virtiofs DAX Window BAR is one
> example of a non-RAM MemoryRegion that can be the source/target of DMA
> requests.
> 
> I don't think IOMMUs solve this problem but luckily the vfio-user
> protocol already has messages that vfio-user servers can use as a
> fallback when DMA cannot be completed through the shared memory RAM
> accesses.
> 
> > In
> > fact, it seems like an IOMMU does this better in providing an IOVA
> > address space per BDF.  Is the dynamic mapping overhead too much?  What
> > physical hardware properties or specifications could we leverage to
> > restrict p2p mappings to a device?  Should it be governed by machine
> > type to provide consistency between devices?  Should each "isolated"
> > bus be in a separate root complex?  Thanks,  
> 
> There is a separate issue in this patch series regarding isolating the
> address space where BAR accesses are made (i.e. the global
> address_space_memory/io). When one process hosts multiple vfio-user
> server instances (e.g. a software-defined network switch with multiple
> ethernet devices) then each instance needs isolated memory and io address
> spaces so that vfio-user clients don't cause collisions when they map
> BARs to the same address.
> 
> I think the the separate root complex idea is a good solution. This
> patch series takes a different approach by adding the concept of
> isolated address spaces into hw/pci/.

This all still seems pretty sketchy, BARs cannot overlap within the
same vCPU address space, perhaps with the exception of when they're
being sized, but DMA should be disabled during sizing.

Devices within the same VM context with identical BARs would need to
operate in different address spaces.  For example a translation offset
in the vCPU address space would allow unique addressing to the devices,
perhaps using the translation offset bits to address a root complex and
masking those bits for downstream transactions.

In general, the device simply operates in an address space, ie. an
IOVA.  When a mapping is made within that address space, we perform a
translation as necessary to generate a guest physical address.  The
IOVA itself is only meaningful within the context of the address space,
there is no requirement or expectation for it to be globally unique.

If the vfio-user server is making some sort of requirement that IOVAs
are unique across all devices, that seems very, very wrong.  Thanks,

Alex




Re: [PATCH v5 03/18] pci: isolated address space for PCI bus

2022-01-27 Thread Alex Williamson
On Thu, 27 Jan 2022 08:30:13 +
Stefan Hajnoczi  wrote:

> On Wed, Jan 26, 2022 at 04:13:33PM -0500, Michael S. Tsirkin wrote:
> > On Wed, Jan 26, 2022 at 08:07:36PM +, Dr. David Alan Gilbert wrote:  
> > > * Stefan Hajnoczi (stefa...@redhat.com) wrote:  
> > > > On Wed, Jan 26, 2022 at 05:27:32AM +, Jag Raman wrote:  
> > > > > 
> > > > >   
> > > > > > On Jan 25, 2022, at 1:38 PM, Dr. David Alan Gilbert 
> > > > > >  wrote:
> > > > > > 
> > > > > > * Jag Raman (jag.ra...@oracle.com) wrote:  
> > > > > >> 
> > > > > >>   
> > > > > >>> On Jan 19, 2022, at 7:12 PM, Michael S. Tsirkin  
> > > > > >>> wrote:
> > > > > >>> 
> > > > > >>> On Wed, Jan 19, 2022 at 04:41:52PM -0500, Jagannathan Raman 
> > > > > >>> wrote:  
> > > > >  Allow PCI buses to be part of isolated CPU address spaces. This 
> > > > >  has a
> > > > >  niche usage.
> > > > >  
> > > > >  TYPE_REMOTE_MACHINE allows multiple VMs to house their PCI 
> > > > >  devices in
> > > > >  the same machine/server. This would cause address space 
> > > > >  collision as
> > > > >  well as be a security vulnerability. Having separate address 
> > > > >  spaces for
> > > > >  each PCI bus would solve this problem.  
> > > > > >>> 
> > > > > >>> Fascinating, but I am not sure I understand. any examples?  
> > > > > >> 
> > > > > >> Hi Michael!
> > > > > >> 
> > > > > >> multiprocess QEMU and vfio-user implement a client-server model to 
> > > > > >> allow
> > > > > >> out-of-process emulation of devices. The client QEMU, which makes 
> > > > > >> ioctls
> > > > > >> to the kernel and runs VCPUs, could attach devices running in a 
> > > > > >> server
> > > > > >> QEMU. The server QEMU needs access to parts of the client’s RAM to
> > > > > >> perform DMA.  
> > > > > > 
> > > > > > Do you ever have the opposite problem? i.e. when an emulated PCI 
> > > > > > device  
> > > > > 
> > > > > That’s an interesting question.
> > > > >   
> > > > > > exposes a chunk of RAM-like space (frame buffer, or maybe a mapped 
> > > > > > file)
> > > > > > that the client can see.  What happens if two emulated devices need 
> > > > > > to
> > > > > > access each others emulated address space?  
> > > > > 
> > > > > In this case, the kernel driver would map the destination’s chunk of 
> > > > > internal RAM into
> > > > > the DMA space of the source device. Then the source device could 
> > > > > write to that
> > > > > mapped address range, and the IOMMU should direct those writes to the
> > > > > destination device.
> > > > > 
> > > > > I would like to take a closer look at the IOMMU implementation on how 
> > > > > to achieve
> > > > > this, and get back to you. I think the IOMMU would handle this. Could 
> > > > > you please
> > > > > point me to the IOMMU implementation you have in mind?  
> > > > 
> > > > I don't know if the current vfio-user client/server patches already
> > > > implement device-to-device DMA, but the functionality is supported by
> > > > the vfio-user protocol.
> > > > 
> > > > Basically: if the DMA regions lookup inside the vfio-user server fails,
> > > > fall back to VFIO_USER_DMA_READ/WRITE messages instead.
> > > > https://github.com/nutanix/libvfio-user/blob/master/docs/vfio-user.rst#vfio-user-dma-read
> > > > 
> > > > Here is the flow:
> > > > 1. The vfio-user server with device A sends a DMA read to QEMU.
> > > > 2. QEMU finds the MemoryRegion associated with the DMA address and sees
> > > >it's a device.
> > > >a. If it's emulated inside the QEMU process then the normal
> > > >   device emulation code kicks in.
> > > >b. If it's another vfio-user PCI device then the vfio-user PCI proxy
> > > >   device forwards the DMA to the second vfio-user server's device 
> > > > B.  
> > > 
> > > I'm starting to be curious if there's a way to persuade the guest kernel
> > > to do it for us; in general is there a way to say to PCI devices that
> > > they can only DMA to the host and not other PCI devices?  
> > 
> > 
> > But of course - this is how e.g. VFIO protects host PCI devices from
> > each other when one of them is passed through to a VM.  
> 
> Michael: Are you saying just turn on vIOMMU? :)
> 
> Devices in different VFIO groups have their own IOMMU context, so their
> IOVA space is isolated. Just don't map other devices into the IOVA space
> and those other devices will be inaccessible.

Devices in different VFIO *containers* have their own IOMMU context.
Based on the group attachment to a container, groups can either have
shared or isolated IOVA space.  That determination is made by looking
at the address space of the bus, which is governed by the presence of a
vIOMMU.

If the goal here is to restrict DMA between devices, ie. peer-to-peer
(p2p), why are we trying to re-invent what an IOMMU already does?  In
fact, it seems like an IOMMU does this better in providing an IOVA
address space per BDF.  Is the dynamic mapping overhead too much?  What
physical hardware 

Re: [PATCH v3 0/2] tpm: CRB: Use ram_device for "tpm-crb-cmd" region

2022-01-27 Thread Alex Williamson
On Thu, 27 Jan 2022 08:51:15 +0100
Philippe Mathieu-Daudé  wrote:

> Hi Alex,
> 
> On 27/1/22 00:51, Alex Williamson wrote:
> > On Thu, 20 Jan 2022 01:12:40 +0100
> > Philippe Mathieu-Daudé  wrote:
> >   
> >> This is a respin of Eric's work, but not making tpm_crb.c target
> >> specific.
> >>
> >> Based-on: <2022012836.229419-1-f4...@amsat.org>
> >> "exec/cpu: Make host pages variables / macros 'target agnostic'"
> >> https://lore.kernel.org/qemu-devel/2022012836.229419-1-f4...@amsat.org/
> >>   
> 
> [*]
> 
> >> --
> >>
> >> Eric's v2 cover:
> >>
> >> This series aims at removing a spurious error message we get when
> >> launching a guest with a TPM-CRB device and VFIO-PCI devices.
> >>
> >> The CRB command buffer currently is a RAM MemoryRegion and given
> >> its base address alignment, it causes an error report on
> >> vfio_listener_region_add(). This series proposes to use a ram-device
> >> region instead which helps in better assessing the dma map error
> >> failure on VFIO side.
> >>
> >> Eric Auger (2):
> >>tpm: CRB: Use ram_device for "tpm-crb-cmd" region
> >>hw/vfio/common: Silence ram device offset alignment error traces
> >>
> >>   hw/tpm/tpm_crb.c | 22 --
> >>   hw/vfio/common.c | 15 ++-
> >>   hw/vfio/trace-events |  1 +
> >>   3 files changed, 35 insertions(+), 3 deletions(-)
> >>  
> > 
> > Unfortunately, FTB:
> > 
> > ../hw/tpm/tpm_crb.c: In function ‘tpm_crb_realize’:
> > ../hw/tpm/tpm_crb.c:297:33: warning: implicit declaration of function 
> > ‘HOST_PAGE_ALIGN’ [-Wimplicit-function-declaration]
> >297 | 
> > HOST_PAGE_ALIGN(CRB_CTRL_CMD_SIZE));
> >| ^~~
> > ../hw/tpm/tpm_crb.c:297:33: warning: nested extern declaration of 
> > ‘HOST_PAGE_ALIGN’ [-Wnested-externs]
> > 
> > This is a regression since Eric's v2.  Thanks,  
> 
> This series is based on another patch that Paolo already queued
> (see [*] earlier).
> 
> Next time I'll try to make it more explicit.

Sorry I missed that.  Since Paolo now has that pending in a pull
request I'll try again next week.  Thanks,

Alex




Re: [PATCH v3 0/2] tpm: CRB: Use ram_device for "tpm-crb-cmd" region

2022-01-26 Thread Alex Williamson
On Thu, 20 Jan 2022 01:12:40 +0100
Philippe Mathieu-Daudé  wrote:

> This is a respin of Eric's work, but not making tpm_crb.c target
> specific.
> 
> Based-on: <2022012836.229419-1-f4...@amsat.org>
> "exec/cpu: Make host pages variables / macros 'target agnostic'"
> https://lore.kernel.org/qemu-devel/2022012836.229419-1-f4...@amsat.org/
> 
> --
> 
> Eric's v2 cover:
> 
> This series aims at removing a spurious error message we get when
> launching a guest with a TPM-CRB device and VFIO-PCI devices.
> 
> The CRB command buffer currently is a RAM MemoryRegion and given
> its base address alignment, it causes an error report on
> vfio_listener_region_add(). This series proposes to use a ram-device
> region instead which helps in better assessing the dma map error
> failure on VFIO side.
> 
> Eric Auger (2):
>   tpm: CRB: Use ram_device for "tpm-crb-cmd" region
>   hw/vfio/common: Silence ram device offset alignment error traces
> 
>  hw/tpm/tpm_crb.c | 22 --
>  hw/vfio/common.c | 15 ++-
>  hw/vfio/trace-events |  1 +
>  3 files changed, 35 insertions(+), 3 deletions(-)
> 

Unfortunately, FTB:

../hw/tpm/tpm_crb.c: In function ‘tpm_crb_realize’:
../hw/tpm/tpm_crb.c:297:33: warning: implicit declaration of function 
‘HOST_PAGE_ALIGN’ [-Wimplicit-function-declaration]
  297 | HOST_PAGE_ALIGN(CRB_CTRL_CMD_SIZE));
  | ^~~
../hw/tpm/tpm_crb.c:297:33: warning: nested extern declaration of 
‘HOST_PAGE_ALIGN’ [-Wnested-externs]

This is a regression since Eric's v2.  Thanks,

Alex




Re: [PATCH v2 1/2] tpm: CRB: Use ram_device for "tpm-crb-cmd" region

2022-01-19 Thread Alex Williamson
On Wed, 19 Jan 2022 23:46:19 +0100
Philippe Mathieu-Daudé  wrote:

> On 18/1/22 16:33, Eric Auger wrote:
> > Representing the CRB cmd/response buffer as a standard
> > RAM region causes some trouble when the device is used
> > with VFIO. Indeed VFIO attempts to DMA_MAP this region
> > as usual RAM but this latter does not have a valid page
> > size alignment causing such an error report:
> > "vfio_listener_region_add received unaligned region".
> > To allow VFIO to detect that failing dma mapping
> > this region is not an issue, let's use a ram_device
> > memory region type instead.
> > 
> > The change in meson.build is required to include the
> > cpu.h header.
> > 
> > Signed-off-by: Eric Auger 
> > Tested-by: Stefan Berger 
> > 
> > ---
> > 
> > v1 -> v2:
> > - Add tpm_crb_unrealize
> > ---
> >   hw/tpm/meson.build |  2 +-
> >   hw/tpm/tpm_crb.c   | 22 --
> >   2 files changed, 21 insertions(+), 3 deletions(-)
> > 
> > diff --git a/hw/tpm/meson.build b/hw/tpm/meson.build
> > index 1c68d81d6a..3e74df945b 100644
> > --- a/hw/tpm/meson.build
> > +++ b/hw/tpm/meson.build
> > @@ -1,8 +1,8 @@
> >   softmmu_ss.add(when: 'CONFIG_TPM_TIS', if_true: files('tpm_tis_common.c'))
> >   softmmu_ss.add(when: 'CONFIG_TPM_TIS_ISA', if_true: 
> > files('tpm_tis_isa.c'))
> >   softmmu_ss.add(when: 'CONFIG_TPM_TIS_SYSBUS', if_true: 
> > files('tpm_tis_sysbus.c'))
> > -softmmu_ss.add(when: 'CONFIG_TPM_CRB', if_true: files('tpm_crb.c'))
> >   
> > +specific_ss.add(when: 'CONFIG_TPM_CRB', if_true: files('tpm_crb.c'))  
> 
> We don't need to make this file target-specific.
> 
> >   specific_ss.add(when: ['CONFIG_SOFTMMU', 'CONFIG_TPM_TIS'], if_true: 
> > files('tpm_ppi.c'))
> >   specific_ss.add(when: ['CONFIG_SOFTMMU', 'CONFIG_TPM_CRB'], if_true: 
> > files('tpm_ppi.c'))
> >   specific_ss.add(when: 'CONFIG_TPM_SPAPR', if_true: files('tpm_spapr.c'))
> > diff --git a/hw/tpm/tpm_crb.c b/hw/tpm/tpm_crb.c
> > index 58ebd1469c..6ec19a9911 100644
> > --- a/hw/tpm/tpm_crb.c
> > +++ b/hw/tpm/tpm_crb.c
> > @@ -25,6 +25,7 @@
> >   #include "sysemu/tpm_backend.h"
> >   #include "sysemu/tpm_util.h"
> >   #include "sysemu/reset.h"
> > +#include "cpu.h"
> >   #include "tpm_prop.h"
> >   #include "tpm_ppi.h"
> >   #include "trace.h"
> > @@ -43,6 +44,7 @@ struct CRBState {
> >   
> >   bool ppi_enabled;
> >   TPMPPI ppi;
> > +uint8_t *crb_cmd_buf;
> >   };
> >   typedef struct CRBState CRBState;
> >   
> > @@ -291,10 +293,14 @@ static void tpm_crb_realize(DeviceState *dev, Error 
> > **errp)
> >   return;
> >   }
> >   
> > +s->crb_cmd_buf = qemu_memalign(ç,
> > +HOST_PAGE_ALIGN(CRB_CTRL_CMD_SIZE));  
> 
> HOST_PAGE_ALIGN() and qemu_real_host_page_size() actually belong
> to "exec/cpu-common.h".
> 
> Alex, could you hold on a few days for this patch? I am going to send
> a cleanup series. Otherwise no worry, I will clean this on top too.

Sure.  Thanks,

Alex




Re: [PATCH v2 2/2] hw/vfio/common: Silence ram device offset alignment error traces

2022-01-19 Thread Alex Williamson
On Tue, 18 Jan 2022 16:33:06 +0100
Eric Auger  wrote:

> Failing to DMA MAP a ram_device should not cause an error message.
> This is currently happening with the TPM CRB command region and
> this is causing confusion.
> 
> We may want to keep the trace for debug purpose though.
> 
> Signed-off-by: Eric Auger 
> Tested-by: Stefan Berger 
> ---
>  hw/vfio/common.c | 15 ++-
>  hw/vfio/trace-events |  1 +
>  2 files changed, 15 insertions(+), 1 deletion(-)

Thanks!  Looks good to me.

Stefan, I can provide an ack here if you want to send a pull request
for both or likewise I can send a pull request with your ack on the
previous patch.  I suppose the patches themselves are technically
independent if you want to split them.  Whichever you prefer.

Acked-by: Alex Williamson 

> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 080046e3f5..9caa560b07 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -884,7 +884,20 @@ static void vfio_listener_region_add(MemoryListener 
> *listener,
>  if (unlikely((section->offset_within_address_space &
>~qemu_real_host_page_mask) !=
>   (section->offset_within_region & 
> ~qemu_real_host_page_mask))) {
> -error_report("%s received unaligned region", __func__);
> +if (memory_region_is_ram_device(section->mr)) { /* just debug 
> purpose */
> +trace_vfio_listener_region_add_bad_offset_alignment(
> +memory_region_name(section->mr),
> +section->offset_within_address_space,
> +section->offset_within_region, qemu_real_host_page_size);
> +} else { /* error case we don't want to be fatal */
> +error_report("%s received unaligned region %s iova=0x%"PRIx64
> + " offset_within_region=0x%"PRIx64
> + " qemu_real_host_page_mask=0x%"PRIx64,
> + __func__, memory_region_name(section->mr),
> + section->offset_within_address_space,
> + section->offset_within_region,
> + qemu_real_host_page_mask);
> +}
>  return;
>  }
>  
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 0ef1b5f4a6..ccd9d7610d 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -100,6 +100,7 @@ vfio_listener_region_add_skip(uint64_t start, uint64_t 
> end) "SKIPPING region_add
>  vfio_spapr_group_attach(int groupfd, int tablefd) "Attached groupfd %d to 
> liobn fd %d"
>  vfio_listener_region_add_iommu(uint64_t start, uint64_t end) "region_add 
> [iommu] 0x%"PRIx64" - 0x%"PRIx64
>  vfio_listener_region_add_ram(uint64_t iova_start, uint64_t iova_end, void 
> *vaddr) "region_add [ram] 0x%"PRIx64" - 0x%"PRIx64" [%p]"
> +vfio_listener_region_add_bad_offset_alignment(const char *name, uint64_t 
> iova, uint64_t offset_within_region, uint64_t page_size) "Region \"%s\" 
> @0x%"PRIx64", offset_within_region=0x%"PRIx64", 
> qemu_real_host_page_mask=0x%"PRIx64 " cannot be mapped for DMA"
>  vfio_listener_region_add_no_dma_map(const char *name, uint64_t iova, 
> uint64_t size, uint64_t page_size) "Region \"%s\" 0x%"PRIx64" 
> size=0x%"PRIx64" is not aligned to 0x%"PRIx64" and cannot be mapped for DMA"
>  vfio_listener_region_del_skip(uint64_t start, uint64_t end) "SKIPPING 
> region_del 0x%"PRIx64" - 0x%"PRIx64
>  vfio_listener_region_del(uint64_t start, uint64_t end) "region_del 
> 0x%"PRIx64" - 0x%"PRIx64




Re: [PATCH] vfio/pci: Generate more relevant log messages for reset failures

2022-01-05 Thread Alex Williamson
On Wed, 5 Jan 2022 16:05:45 -0500
"Michael S. Tsirkin"  wrote:

> On Wed, Jan 05, 2022 at 12:56:42PM -0700, Alex Williamson wrote:
> > The VFIO_DEVICE_RESET ioctl might be backed by several different reset
> > methods, including a device specific reset (ie. custom reset code in
> > kernel), an ACPI reset (ie. custom reset code in firmware), FLR, PM,
> > and bus resets.  This listing is also the default priority order used
> > by the kernel for trying reset methods.  Traditionally we've had some
> > FUD regarding the PM reset as the extent of a "Soft Reset" is not well
> > defined in the PCI specification.  Therefore we try to guess what type
> > of reset a device might use for the VFIO_DEVICE_RESET and insert a bus
> > reset via the vfio hot reset interface if we think it could be a PM
> > reset.
> > 
> > This results in a couple odd tests for PM reset in our hot reset code,
> > as we assume if we don't detect has_pm_reset support that we can't
> > reset the device otherwise.  Starting with kernel v5.15, the kernel
> > exposes a sysfs attribute for devices that can tell us the priority
> > order for device resets, so long term (not implemented here) we no
> > longer need to play this guessing game, and if permissions allow we
> > could manipulate the order ourselves so that we don't need to inject
> > our own hot reset.
> > 
> > In the shorter term, implemented here, let's not assume we're out of
> > reset methods if we can't perform a hot reset and the device doesn't
> > support PM reset.  We can use reset_works as the authority, which
> > allows us to generate more comprehensible error messages for the case
> > when it actually doesn't work.
> > 
> > The impetus for this change is a result of commit d5daff7d3126 ("pcie:
> > implement slot power control for pcie root ports"), where powering off
> > a slot now results in a device reset.  If the slot is powered off as a
> > result of qdev_unplug() via the device request event, that device
> > request is potentially the result of an unbind operation in the host.
> > That unbind operation holds the kernel device lock, which causes the
> > VFIO_DEVICE_RESET ioctl to fail (or in the case of some kernels, has
> > cleared the flag indicating support of a device reset function).  We
> > can then end up with an SR-IOV VF device trying to trigger a hot reset,
> > which finds that it needs ownership of the PF group to perform such a
> > reset, resulting in confusing log messages.
> > 
> > Ultimately the above commit still introduces a log message that we
> > didn't have prior on such an unplug, but it's not unjustified to
> > perform such a reset, though it might be considered unnecessary.
> > Arguably failure to reset the device should always generate some sort
> > of meaningful log message.
> > 
> > Signed-off-by: Alex Williamson   
> 
> Looks reasonable. Just an extra idea: do we want to maybe validate the
> return code from the ioctl? I assume it's something like EBUSY right?

Ideally it'd be EAGAIN to denote the lock contention, but for some
reason there was a recent time when the kernel would clear the
pci_dev.reset_fn flag as part of pci_stop_dev() before unbinding the
driver from the device, in that case we get an ENOTTY.

Hmm, I'm remembering now that an issue with this approach to log all
device reset failures is that we're going to get false positives every
time we reboot a VM where we need a bus reset for multiple devices.  We
handle multiple devices via a reset handler but we'll still get a
redundant per device reset and we have no way to associate that per
device reset to a VM reset where the reset handler multi-device
mechanism may have been successful :-\  This would be very common with
desktop GPUs.  I'll plug away at this some more.  Thanks,

Alex




Re: [PATCH] vfio/pci: Generate more relevant log messages for reset failures

2022-01-05 Thread Alex Williamson


Sorry for the dupe and fat-fingering MST's email, please reply to the
other thread at 164141259622.4193261.8252690438434562107.stgit@omen.
Thanks,

Alex




[PATCH] vfio/pci: Generate more relevant log messages for reset failures

2022-01-05 Thread Alex Williamson
The VFIO_DEVICE_RESET ioctl might be backed by several different reset
methods, including a device specific reset (ie. custom reset code in
kernel), an ACPI reset (ie. custom reset code in firmware), FLR, PM,
and bus resets.  This listing is also the default priority order used
by the kernel for trying reset methods.  Traditionally we've had some
FUD regarding the PM reset as the extent of a "Soft Reset" is not well
defined in the PCI specification.  Therefore we try to guess what type
of reset a device might use for the VFIO_DEVICE_RESET and insert a bus
reset via the vfio hot reset interface if we think it could be a PM
reset.

This results in a couple odd tests for PM reset in our hot reset code,
as we assume if we don't detect has_pm_reset support that we can't
reset the device otherwise.  Starting with kernel v5.15, the kernel
exposes a sysfs attribute for devices that can tell us the priority
order for device resets, so long term (not implemented here) we no
longer need to play this guessing game, and if permissions allow we
could manipulate the order ourselves so that we don't need to inject
our own hot reset.

In the shorter term, implemented here, let's not assume we're out of
reset methods if we can't perform a hot reset and the device doesn't
support PM reset.  We can use reset_works as the authority, which
allows us to generate more comprehensible error messages for the case
when it actually doesn't work.

The impetus for this change is a result of commit d5daff7d3126 ("pcie:
implement slot power control for pcie root ports"), where powering off
a slot now results in a device reset.  If the slot is powered off as a
result of qdev_unplug() via the device request event, that device
request is potentially the result of an unbind operation in the host.
That unbind operation holds the kernel device lock, which causes the
VFIO_DEVICE_RESET ioctl to fail (or in the case of some kernels, has
cleared the flag indicating support of a device reset function).  We
can then end up with an SR-IOV VF device trying to trigger a hot reset,
which finds that it needs ownership of the PF group to perform such a
reset, resulting in confusing log messages.

Ultimately the above commit still introduces a log message that we
didn't have prior on such an unplug, but it's not unjustified to
perform such a reset, though it might be considered unnecessary.
Arguably failure to reset the device should always generate some sort
of meaningful log message.

Signed-off-by: Alex Williamson 
---
 hw/vfio/pci.c |   44 
 1 file changed, 32 insertions(+), 12 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 7b45353ce27f..ea697951556e 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2224,7 +2224,7 @@ static int vfio_pci_hot_reset(VFIOPCIDevice *vdev, bool 
single)
 ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO, info);
 if (ret && errno != ENOSPC) {
 ret = -errno;
-if (!vdev->has_pm_reset) {
+if (!vdev->vbasedev.reset_works) {
 error_report("vfio: Cannot reset device %s, "
  "no available reset mechanism.", vdev->vbasedev.name);
 }
@@ -2270,7 +2270,7 @@ static int vfio_pci_hot_reset(VFIOPCIDevice *vdev, bool 
single)
 }
 
 if (!group) {
-if (!vdev->has_pm_reset) {
+if (!vdev->vbasedev.reset_works) {
 error_report("vfio: Cannot reset device %s, "
  "depends on group %d which is not owned.",
  vdev->vbasedev.name, devices[i].group_id);
@@ -3162,6 +3162,8 @@ static void vfio_exitfn(PCIDevice *pdev)
 static void vfio_pci_reset(DeviceState *dev)
 {
 VFIOPCIDevice *vdev = VFIO_PCI(dev);
+Error *err = NULL;
+int ret;
 
 trace_vfio_pci_reset(vdev->vbasedev.name);
 
@@ -3175,26 +3177,44 @@ static void vfio_pci_reset(DeviceState *dev)
 goto post_reset;
 }
 
-if (vdev->vbasedev.reset_works &&
-(vdev->has_flr || !vdev->has_pm_reset) &&
-!ioctl(vdev->vbasedev.fd, VFIO_DEVICE_RESET)) {
-trace_vfio_pci_reset_flr(vdev->vbasedev.name);
-goto post_reset;
+if (vdev->vbasedev.reset_works && (vdev->has_flr || !vdev->has_pm_reset)) {
+if (!ioctl(vdev->vbasedev.fd, VFIO_DEVICE_RESET)) {
+trace_vfio_pci_reset_flr(vdev->vbasedev.name);
+goto post_reset;
+}
+
+error_setg_errno(, errno, "Unable to reset device");
 }
 
 /* See if we can do our own bus reset */
-if (!vfio_pci_hot_reset_one(vdev)) {
+ret = vfio_pci_hot_reset_one(vdev);
+if (!ret) {
 goto post_reset;
 }
 
+if (!err) {
+error_setg_errno(, -ret, "Unable to perform bus reset");
+}
+
 /* If nothing el

[PATCH] vfio/pci: Generate more relevant log messages for reset failures

2022-01-05 Thread Alex Williamson
The VFIO_DEVICE_RESET ioctl might be backed by several different reset
methods, including a device specific reset (ie. custom reset code in
kernel), an ACPI reset (ie. custom reset code in firmware), FLR, PM,
and bus resets.  This listing is also the default priority order used
by the kernel for trying reset methods.  Traditionally we've had some
FUD regarding the PM reset as the extent of a "Soft Reset" is not well
defined in the PCI specification.  Therefore we try to guess what type
of reset a device might use for the VFIO_DEVICE_RESET and insert a bus
reset via the vfio hot reset interface if we think it could be a PM
reset.

This results in a couple odd tests for PM reset in our hot reset code,
as we assume if we don't detect has_pm_reset support that we can't
reset the device otherwise.  Starting with kernel v5.15, the kernel
exposes a sysfs attribute for devices that can tell us the priority
order for device resets, so long term (not implemented here) we no
longer need to play this guessing game, and if permissions allow we
could manipulate the order ourselves so that we don't need to inject
our own hot reset.

In the shorter term, implemented here, let's not assume we're out of
reset methods if we can't perform a hot reset and the device doesn't
support PM reset.  We can use reset_works as the authority, which
allows us to generate more comprehensible error messages for the case
when it actually doesn't work.

The impetus for this change is a result of commit d5daff7d3126 ("pcie:
implement slot power control for pcie root ports"), where powering off
a slot now results in a device reset.  If the slot is powered off as a
result of qdev_unplug() via the device request event, that device
request is potentially the result of an unbind operation in the host.
That unbind operation holds the kernel device lock, which causes the
VFIO_DEVICE_RESET ioctl to fail (or in the case of some kernels, has
cleared the flag indicating support of a device reset function).  We
can then end up with an SR-IOV VF device trying to trigger a hot reset,
which finds that it needs ownership of the PF group to perform such a
reset, resulting in confusing log messages.

Ultimately the above commit still introduces a log message that we
didn't have prior on such an unplug, but it's not unjustified to
perform such a reset, though it might be considered unnecessary.
Arguably failure to reset the device should always generate some sort
of meaningful log message.

Signed-off-by: Alex Williamson 
---
 hw/vfio/pci.c |   44 
 1 file changed, 32 insertions(+), 12 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 7b45353ce27f..ea697951556e 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2224,7 +2224,7 @@ static int vfio_pci_hot_reset(VFIOPCIDevice *vdev, bool 
single)
 ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO, info);
 if (ret && errno != ENOSPC) {
 ret = -errno;
-if (!vdev->has_pm_reset) {
+if (!vdev->vbasedev.reset_works) {
 error_report("vfio: Cannot reset device %s, "
  "no available reset mechanism.", vdev->vbasedev.name);
 }
@@ -2270,7 +2270,7 @@ static int vfio_pci_hot_reset(VFIOPCIDevice *vdev, bool 
single)
 }
 
 if (!group) {
-if (!vdev->has_pm_reset) {
+if (!vdev->vbasedev.reset_works) {
 error_report("vfio: Cannot reset device %s, "
  "depends on group %d which is not owned.",
  vdev->vbasedev.name, devices[i].group_id);
@@ -3162,6 +3162,8 @@ static void vfio_exitfn(PCIDevice *pdev)
 static void vfio_pci_reset(DeviceState *dev)
 {
 VFIOPCIDevice *vdev = VFIO_PCI(dev);
+Error *err = NULL;
+int ret;
 
 trace_vfio_pci_reset(vdev->vbasedev.name);
 
@@ -3175,26 +3177,44 @@ static void vfio_pci_reset(DeviceState *dev)
 goto post_reset;
 }
 
-if (vdev->vbasedev.reset_works &&
-(vdev->has_flr || !vdev->has_pm_reset) &&
-!ioctl(vdev->vbasedev.fd, VFIO_DEVICE_RESET)) {
-trace_vfio_pci_reset_flr(vdev->vbasedev.name);
-goto post_reset;
+if (vdev->vbasedev.reset_works && (vdev->has_flr || !vdev->has_pm_reset)) {
+if (!ioctl(vdev->vbasedev.fd, VFIO_DEVICE_RESET)) {
+trace_vfio_pci_reset_flr(vdev->vbasedev.name);
+goto post_reset;
+}
+
+error_setg_errno(, errno, "Unable to reset device");
 }
 
 /* See if we can do our own bus reset */
-if (!vfio_pci_hot_reset_one(vdev)) {
+ret = vfio_pci_hot_reset_one(vdev);
+if (!ret) {
 goto post_reset;
 }
 
+if (!err) {
+error_setg_errno(, -ret, "Unable to perform bus reset");
+}
+
 /* If nothing el

Re: [PATCH] pci: Skip power-off reset when pending unplug

2022-01-05 Thread Alex Williamson
On Thu, 23 Dec 2021 08:33:09 -0500
"Michael S. Tsirkin"  wrote:

> On Wed, Dec 22, 2021 at 04:10:07PM -0700, Alex Williamson wrote:
> > On Wed, 22 Dec 2021 15:48:24 -0500
> > "Michael S. Tsirkin"  wrote:
> >   
> > > On Wed, Dec 22, 2021 at 12:08:09PM -0700, Alex Williamson wrote:  
> > > > On Tue, 21 Dec 2021 18:40:09 -0500
> > > > "Michael S. Tsirkin"  wrote:
> > > The reset is actually just an attempt to approximate power off.
> > > So I'm not sure that is right powering device off and then on
> > > is just a slow but reasonable way for guests to reset a device.  
> > 
> > Agree, I don't have a problem with resetting devices in response to the
> > slot being powered off, just that it's pointless, and in some scenarios
> > causes us additional grief when it occurs when the device is being
> > removed anyway.
> >   
> > > >  In that case we could reorganize things to let the unplug occur
> > > > before the power transition.
> > > 
> > > Hmm you mean unplug on host immediately when it starts blinking?
> > > But drivers are not notified at this point, are they?  
> > 
> > I think this is confusing Attention Indicator and Power Indicator
> > again.  
> 
> Let's try to clear it up.
> 
> Here's text from the SHPC spec, pcie spec is less clear imho but
> same idea IIUC.
> 
> The Power Indicator provides visual feedback to the human operator (if the 
> system
> software accepts the request initiated by the Attention Button) by blinking. 
> Once the
> Power Indicator begins blinking, a 5-second abort interval exists during 
> which a second
> depression of the Attention Button cancels the operation.
> 
> Attention Indicator is confusingly unrelated to the Attention Button.
> Right?

Yeah, I think that's where I was getting confused.  So a qdev_unplug()
results in "pushing" the attention button, the power indicator starts
flashing for 5s, during which an additional attention button press
could cancel the event.  After that 5s period and with the power
indicator still flashing, the power controller is set to off, followed
by the power indicator turning off.

> >  The write sequence I noted for the slot control register was as
> > follows:
> > 
> > 01f1 - > 02f1 -> 06f1 -> 07f1
> > 
> >  01f1:
> >Attention Indicator: OFF
> >Power Indicator: ON
> >Power Controller: ON
> > 
> >  02f1:
> >Power Indicator: ON -> BLINK
> > 
> >  06f1:
> >Power Controller: ON -> OFF
> > 
> >  07f1:
> >Power Indicator: BLINK -> OFF
> > 
> > The device reset currently occurs at 06f1, when the Power Controller
> > removes power to the slot.  The unplug doesn't occur until 07f1 when
> > the Power Indicator turns off.  On bare metal, the device power would
> > be in some indeterminate state between those two writes as the power
> > drains.  We've chosen to reset the device at the beginning of this
> > phase, where power is first removed (ie. instantaneous power drain),
> > but on a physical device it happens somewhere in between.  
> 
> Yes, this is true I think. But I think on bare metal it's guaranteed to
> happen within 1 second after power is removed, whatever the state of the
> power indicator.
> Also, Gerd attempted to add PV code that special cases KVM and
> removes all the multi-second waiting from unplug path.
> So I am not sure we should rely on the 1 second wait, either.

Right, if we don't reset when power is removed we're in a guessing game
of whether the guest is following our assumed transitions.

> >  It therefore
> > seems valid that it could happen at the moment the Power Indicator
> > turns off such that we could precede the device reset with any
> > necessary unplug operations.  
> 
> However the power indicator is merely an indicator for the
> human operator. My understanding is that driver that does not want to permit
> removing the device can turn off power without turning off
> the power indicator.

Yes, on bare metal there's likely some small window where the device
power state is indeterminate, but to take advantage of that we'd need
to do something like setup a 2s timer to reset the device, where that
timer gets canceled if the power indicator turns off in the meantime.
It's a lot of heuristics.

> > > >  Of course the original proposal also
> > > > essentially supports this interpretation, the slot power off reset does
> > > > not occur for devices with a pending unplug and those devices are
> > > > removed after the slot transitio

Re: [PATCH] pci: Skip power-off reset when pending unplug

2021-12-22 Thread Alex Williamson
On Wed, 22 Dec 2021 15:48:24 -0500
"Michael S. Tsirkin"  wrote:

> On Wed, Dec 22, 2021 at 12:08:09PM -0700, Alex Williamson wrote:
> > On Tue, 21 Dec 2021 18:40:09 -0500
> > "Michael S. Tsirkin"  wrote:
> >   
> > > On Tue, Dec 21, 2021 at 09:36:56AM -0700, Alex Williamson wrote:  
> > > > On Mon, 20 Dec 2021 18:03:56 -0500
> > > > "Michael S. Tsirkin"  wrote:
> > > > 
> > > > > On Mon, Dec 20, 2021 at 11:26:59AM -0700, Alex Williamson wrote:
> > > > > > The below referenced commit introduced a change where devices under 
> > > > > > a
> > > > > > root port slot are reset in response to removing power to the slot.
> > > > > > This improves emulation relative to bare metal when the slot is 
> > > > > > powered
> > > > > > off, but introduces an unnecessary step when devices under that slot
> > > > > > are slated for removal.
> > > > > > 
> > > > > > In the case of an assigned device, there are mandatory delays
> > > > > > associated with many device reset mechanisms which can stall the hot
> > > > > > unplug operation.  Also, in cases where the unplug request is 
> > > > > > triggered
> > > > > > via a release operation of the host driver, internal device locking 
> > > > > > in
> > > > > > the host kernel may result in a failure of the device reset 
> > > > > > mechanism,
> > > > > > which generates unnecessary log warnings.
> > > > > > 
> > > > > > Skip the reset for devices that are slated for unplug.
> > > > > > 
> > > > > > Cc: qemu-sta...@nongnu.org
> > > > > > Fixes: d5daff7d3126 ("pcie: implement slot power control for pcie 
> > > > > > root ports")
> > > > > > Signed-off-by: Alex Williamson   
> > > > > 
> > > > > I am not sure this is safe. IIUC pending_deleted_event
> > > > > is normally set after host admin requested device removal,
> > > > > while the reset could be triggered by guest for its own reasons
> > > > > such as suspend or driver reload.
> > > > 
> > > > Right, the case where I mention that we get the warning looks exactly
> > > > like the admin doing a device eject, it calls qdev_unplug().  I'm not
> > > > trying to prevent arbitrary guest resets of the device, in fact there
> > > > are cases where the guest really should be able to reset the device,
> > > > nested assignment in addition to the cases you mention.  Gerd noted
> > > > that this was an unintended side effect of the referenced patch to
> > > > reset device that are imminently being removed.
> > > > 
> > > > > Looking at this some more, I am not sure I understand the
> > > > > issue completely.
> > > > > We have:
> > > > > 
> > > > > if ((sltsta & PCI_EXP_SLTSTA_PDS) && (val & PCI_EXP_SLTCTL_PCC) &&
> > > > > (val & PCI_EXP_SLTCTL_PIC_OFF) == PCI_EXP_SLTCTL_PIC_OFF &&
> > > > > (!(old_slt_ctl & PCI_EXP_SLTCTL_PCC) ||
> > > > > (old_slt_ctl & PCI_EXP_SLTCTL_PIC_OFF) != 
> > > > > PCI_EXP_SLTCTL_PIC_OFF)) {
> > > > > pcie_cap_slot_do_unplug(dev);
> > > > > }
> > > > > pcie_cap_update_power(dev);
> > > > > 
> > > > > so device unplug triggers first, reset follows and by that time
> > > > > there should be no devices under the bus, if there are then
> > > > > it's because guest did not clear the power indicator.
> > > > 
> > > > Note that the unplug only triggers here if the Power Indicator Control
> > > > is OFF, I see writes to SLTCTL in the following order:
> > > > 
> > > >  01f1 - > 02f1 -> 06f1 -> 07f1
> > > > 
> > > > So PIC changes to BLINK, then PCC changes the slot to OFF (this
> > > > triggers the reset), then PIC changes to OFF triggering the unplug.
> > > > 
> > > > The unnecessary reset that occurs here is universal.  Should the unplug
> > > > be occurring when:
> > > > 
> > > >   (val & PCI_EXP_SLTCTL_PIC_OFF) != PCI_EXP_SLTCTL_PIC_ON
> > > > 
> > > >

Re: [PATCH] pci: Skip power-off reset when pending unplug

2021-12-22 Thread Alex Williamson
On Tue, 21 Dec 2021 18:40:09 -0500
"Michael S. Tsirkin"  wrote:

> On Tue, Dec 21, 2021 at 09:36:56AM -0700, Alex Williamson wrote:
> > On Mon, 20 Dec 2021 18:03:56 -0500
> > "Michael S. Tsirkin"  wrote:
> >   
> > > On Mon, Dec 20, 2021 at 11:26:59AM -0700, Alex Williamson wrote:  
> > > > The below referenced commit introduced a change where devices under a
> > > > root port slot are reset in response to removing power to the slot.
> > > > This improves emulation relative to bare metal when the slot is powered
> > > > off, but introduces an unnecessary step when devices under that slot
> > > > are slated for removal.
> > > > 
> > > > In the case of an assigned device, there are mandatory delays
> > > > associated with many device reset mechanisms which can stall the hot
> > > > unplug operation.  Also, in cases where the unplug request is triggered
> > > > via a release operation of the host driver, internal device locking in
> > > > the host kernel may result in a failure of the device reset mechanism,
> > > > which generates unnecessary log warnings.
> > > > 
> > > > Skip the reset for devices that are slated for unplug.
> > > > 
> > > > Cc: qemu-sta...@nongnu.org
> > > > Fixes: d5daff7d3126 ("pcie: implement slot power control for pcie root 
> > > > ports")
> > > > Signed-off-by: Alex Williamson 
> > > 
> > > I am not sure this is safe. IIUC pending_deleted_event
> > > is normally set after host admin requested device removal,
> > > while the reset could be triggered by guest for its own reasons
> > > such as suspend or driver reload.  
> > 
> > Right, the case where I mention that we get the warning looks exactly
> > like the admin doing a device eject, it calls qdev_unplug().  I'm not
> > trying to prevent arbitrary guest resets of the device, in fact there
> > are cases where the guest really should be able to reset the device,
> > nested assignment in addition to the cases you mention.  Gerd noted
> > that this was an unintended side effect of the referenced patch to
> > reset device that are imminently being removed.
> >   
> > > Looking at this some more, I am not sure I understand the
> > > issue completely.
> > > We have:
> > > 
> > > if ((sltsta & PCI_EXP_SLTSTA_PDS) && (val & PCI_EXP_SLTCTL_PCC) &&
> > > (val & PCI_EXP_SLTCTL_PIC_OFF) == PCI_EXP_SLTCTL_PIC_OFF &&
> > > (!(old_slt_ctl & PCI_EXP_SLTCTL_PCC) ||
> > > (old_slt_ctl & PCI_EXP_SLTCTL_PIC_OFF) != 
> > > PCI_EXP_SLTCTL_PIC_OFF)) {
> > > pcie_cap_slot_do_unplug(dev);
> > > }
> > > pcie_cap_update_power(dev);
> > > 
> > > so device unplug triggers first, reset follows and by that time
> > > there should be no devices under the bus, if there are then
> > > it's because guest did not clear the power indicator.  
> > 
> > Note that the unplug only triggers here if the Power Indicator Control
> > is OFF, I see writes to SLTCTL in the following order:
> > 
> >  01f1 - > 02f1 -> 06f1 -> 07f1
> > 
> > So PIC changes to BLINK, then PCC changes the slot to OFF (this
> > triggers the reset), then PIC changes to OFF triggering the unplug.
> > 
> > The unnecessary reset that occurs here is universal.  Should the unplug
> > be occurring when:
> > 
> >   (val & PCI_EXP_SLTCTL_PIC_OFF) != PCI_EXP_SLTCTL_PIC_ON
> > 
> > ?  
> 
> well blinking generally means "do not remove yet".

Blinking indicates that the slot is in a transition phase, which we
could also interpret to mean that power has been removed and this is
the time required for the power to settle.  By that token, it might be
reasonable that a power state induced reset doesn't actually occur
until the slot reaches both the slot power off and power indicator off
state.  In that case we could reorganize things to let the unplug occur
before the power transition.  Of course the original proposal also
essentially supports this interpretation, the slot power off reset does
not occur for devices with a pending unplug and those devices are
removed after the slot transition grace period.

> > > So I am not sure how to fix the assignment issues as I'm not sure how do
> > > they trigger, but here is a wild idea: maybe it should support an API
> > > for starting reset asynchronously, then if the following access is
> > > trying to reset a

Re: [PATCH] vfio/pci: Don't setup VFIO MSI-X for Kunlun VF

2021-12-21 Thread Alex Williamson
On Tue, 14 Dec 2021 13:45:34 +0800
Cai Huoqing  wrote:

> No support MSI-X in BAIDU KUNLUN Virtual Function devices,
> so add a quirk to avoid setuping VFIO MSI-X
> 
> Signed-off-by: Cai Huoqing 
> ---
>  hw/vfio/pci.c | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 7b45353ce2..15f76bbe56 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -1994,6 +1994,13 @@ static int vfio_add_std_cap(VFIOPCIDevice *vdev, 
> uint8_t pos, Error **errp)
>  ret = vfio_setup_pcie_cap(vdev, pos, size, errp);
>  break;
>  case PCI_CAP_ID_MSIX:
> +/*
> + * BAIDU KUNLUN Virtual Function devices for KUNLUN AI processor
> + * don't support MSI-X, so don't setup VFIO MSI-X here.
> + */
> +if (vdev->vendor_id == PCI_VENDOR_ID_BAIDU &&
> +vdev->device_id == PCI_DEVICE_ID_KUNLUN_VF)
> +break;
>  ret = vfio_msix_setup(vdev, pos, errp);
>  break;
>  case PCI_CAP_ID_PM:


So the VF exposes an MSI-X capability but it's entirely unsupported
and/or bogus?  If it's not bogus, why can't we support it?  How does
the host kernel driver know to avoid MSI-X?  Should we use the same
mechanism used by the host driver to quirk whether vfio-pci exposes the
MSI-X capability to userspace at all?  Thanks,

Alex




Re: [PATCH] pci: Skip power-off reset when pending unplug

2021-12-21 Thread Alex Williamson
On Mon, 20 Dec 2021 18:03:56 -0500
"Michael S. Tsirkin"  wrote:

> On Mon, Dec 20, 2021 at 11:26:59AM -0700, Alex Williamson wrote:
> > The below referenced commit introduced a change where devices under a
> > root port slot are reset in response to removing power to the slot.
> > This improves emulation relative to bare metal when the slot is powered
> > off, but introduces an unnecessary step when devices under that slot
> > are slated for removal.
> > 
> > In the case of an assigned device, there are mandatory delays
> > associated with many device reset mechanisms which can stall the hot
> > unplug operation.  Also, in cases where the unplug request is triggered
> > via a release operation of the host driver, internal device locking in
> > the host kernel may result in a failure of the device reset mechanism,
> > which generates unnecessary log warnings.
> > 
> > Skip the reset for devices that are slated for unplug.
> > 
> > Cc: qemu-sta...@nongnu.org
> > Fixes: d5daff7d3126 ("pcie: implement slot power control for pcie root 
> > ports")
> > Signed-off-by: Alex Williamson   
> 
> I am not sure this is safe. IIUC pending_deleted_event
> is normally set after host admin requested device removal,
> while the reset could be triggered by guest for its own reasons
> such as suspend or driver reload.

Right, the case where I mention that we get the warning looks exactly
like the admin doing a device eject, it calls qdev_unplug().  I'm not
trying to prevent arbitrary guest resets of the device, in fact there
are cases where the guest really should be able to reset the device,
nested assignment in addition to the cases you mention.  Gerd noted
that this was an unintended side effect of the referenced patch to
reset device that are imminently being removed.

> Looking at this some more, I am not sure I understand the
> issue completely.
> We have:
> 
> if ((sltsta & PCI_EXP_SLTSTA_PDS) && (val & PCI_EXP_SLTCTL_PCC) &&
> (val & PCI_EXP_SLTCTL_PIC_OFF) == PCI_EXP_SLTCTL_PIC_OFF &&
> (!(old_slt_ctl & PCI_EXP_SLTCTL_PCC) ||
> (old_slt_ctl & PCI_EXP_SLTCTL_PIC_OFF) != PCI_EXP_SLTCTL_PIC_OFF)) {
> pcie_cap_slot_do_unplug(dev);
> }
> pcie_cap_update_power(dev);
> 
> so device unplug triggers first, reset follows and by that time
> there should be no devices under the bus, if there are then
> it's because guest did not clear the power indicator.

Note that the unplug only triggers here if the Power Indicator Control
is OFF, I see writes to SLTCTL in the following order:

 01f1 - > 02f1 -> 06f1 -> 07f1

So PIC changes to BLINK, then PCC changes the slot to OFF (this
triggers the reset), then PIC changes to OFF triggering the unplug.

The unnecessary reset that occurs here is universal.  Should the unplug
be occurring when:

  (val & PCI_EXP_SLTCTL_PIC_OFF) != PCI_EXP_SLTCTL_PIC_ON

?
 
> So I am not sure how to fix the assignment issues as I'm not sure how do
> they trigger, but here is a wild idea: maybe it should support an API
> for starting reset asynchronously, then if the following access is
> trying to reset again that second reset can just be skipped, while any
> other access will stall.

As above, there's not a concurrency problem, so I don't see how an
async API buys us anything.  It seems the ordering of the slot power
induced reset versus device unplug is not as you expected.  Can we fix
that?  Thanks,

Alex

> > ---
> >  hw/pci/pci.c |2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> > index e5993c1ef52b..f594da410797 100644
> > --- a/hw/pci/pci.c
> > +++ b/hw/pci/pci.c
> > @@ -2869,7 +2869,7 @@ void pci_set_power(PCIDevice *d, bool state)
> >  memory_region_set_enabled(>bus_master_enable_region,
> >(pci_get_word(d->config + PCI_COMMAND)
> > & PCI_COMMAND_MASTER) && d->has_power);
> > -if (!d->has_power) {
> > +if (!d->has_power && !d->qdev.pending_deleted_event) {
> >  pci_device_reset(d);
> >  }
> >  }
> >   
> 




[PATCH] pci: Skip power-off reset when pending unplug

2021-12-20 Thread Alex Williamson
The below referenced commit introduced a change where devices under a
root port slot are reset in response to removing power to the slot.
This improves emulation relative to bare metal when the slot is powered
off, but introduces an unnecessary step when devices under that slot
are slated for removal.

In the case of an assigned device, there are mandatory delays
associated with many device reset mechanisms which can stall the hot
unplug operation.  Also, in cases where the unplug request is triggered
via a release operation of the host driver, internal device locking in
the host kernel may result in a failure of the device reset mechanism,
which generates unnecessary log warnings.

Skip the reset for devices that are slated for unplug.

Cc: qemu-sta...@nongnu.org
Fixes: d5daff7d3126 ("pcie: implement slot power control for pcie root ports")
Signed-off-by: Alex Williamson 
---
 hw/pci/pci.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index e5993c1ef52b..f594da410797 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -2869,7 +2869,7 @@ void pci_set_power(PCIDevice *d, bool state)
 memory_region_set_enabled(>bus_master_enable_region,
   (pci_get_word(d->config + PCI_COMMAND)
& PCI_COMMAND_MASTER) && d->has_power);
-if (!d->has_power) {
+if (!d->has_power && !d->qdev.pending_deleted_event) {
 pci_device_reset(d);
 }
 }





Re: [PATCH] vfio/migration: Improve to read/write full migration region per chunk

2021-11-22 Thread Alex Williamson
On Mon, 22 Nov 2021 09:40:56 +0200
Yishai Hadas  wrote:

> Gentle ping for review, CCing more people who may be involved.

I'll wait for comments from others, but since we're already in the 6.2
freeze and vfio migration is still experimental (and I'm on PTO this
week), I expect this to be queued when the next development window
opens.  Thanks,

Alex


> On 11/11/2021 11:50 AM, Yishai Hadas wrote:
> > Upon reading/writing the migration data there is no real reason to limit
> > the read/write system call from the file to be 8 bytes.
> >
> > In addition, there is no reason to depend on the file offset alignment.
> > The offset is just some logical value which depends also on the region
> > index and has nothing to do with the amount of data that can be
> > accessed.
> >
> > Move to read/write the full region size per chunk, this reduces
> > dramatically the number of the systems calls that are needed and improve
> > performance.
> >
> > Signed-off-by: Yishai Hadas 
> > ---
> >   hw/vfio/migration.c | 36 ++--
> >   1 file changed, 2 insertions(+), 34 deletions(-)
> >
> > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > index ff6b45de6b5..b5f310bb831 100644
> > --- a/hw/vfio/migration.c
> > +++ b/hw/vfio/migration.c
> > @@ -62,40 +62,8 @@ static inline int vfio_mig_access(VFIODevice *vbasedev, 
> > void *val, int count,
> >   return 0;
> >   }
> >   
> > -static int vfio_mig_rw(VFIODevice *vbasedev, __u8 *buf, size_t count,
> > -   off_t off, bool iswrite)
> > -{
> > -int ret, done = 0;
> > -__u8 *tbuf = buf;
> > -
> > -while (count) {
> > -int bytes = 0;
> > -
> > -if (count >= 8 && !(off % 8)) {
> > -bytes = 8;
> > -} else if (count >= 4 && !(off % 4)) {
> > -bytes = 4;
> > -} else if (count >= 2 && !(off % 2)) {
> > -bytes = 2;
> > -} else {
> > -bytes = 1;
> > -}
> > -
> > -ret = vfio_mig_access(vbasedev, tbuf, bytes, off, iswrite);
> > -if (ret) {
> > -return ret;
> > -}
> > -
> > -count -= bytes;
> > -done += bytes;
> > -off += bytes;
> > -tbuf += bytes;
> > -}
> > -return done;
> > -}
> > -
> > -#define vfio_mig_read(f, v, c, o)   vfio_mig_rw(f, (__u8 *)v, c, o, 
> > false)
> > -#define vfio_mig_write(f, v, c, o)  vfio_mig_rw(f, (__u8 *)v, c, o, 
> > true)
> > +#define vfio_mig_read(f, v, c, o)   vfio_mig_access(f, (__u8 *)v, c, 
> > o, false)
> > +#define vfio_mig_write(f, v, c, o)  vfio_mig_access(f, (__u8 *)v, c, 
> > o, true)
> >   
> >   #define VFIO_MIG_STRUCT_OFFSET(f)   \
> >offsetof(struct 
> > vfio_device_migration_info, f)  
> 
> 




Re: [RFC v3 16/19] vfio-user: dma map/unmap operations

2021-11-19 Thread Alex Williamson
On Mon,  8 Nov 2021 16:46:44 -0800
John Johnson  wrote:

> Signed-off-by: Jagannathan Raman 
> Signed-off-by: Elena Ufimtseva 
> Signed-off-by: John G Johnson 
> ---
>  hw/vfio/pci.h |   1 +
>  hw/vfio/user-protocol.h   |  32 +++
>  hw/vfio/user.h|   1 +
>  include/hw/vfio/vfio-common.h |   4 +
>  hw/vfio/common.c  |  76 +---
>  hw/vfio/pci.c |   4 +
>  hw/vfio/user.c| 206 
> ++
>  7 files changed, 309 insertions(+), 15 deletions(-)
> 
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index 643ff75..156fee2 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -193,6 +193,7 @@ OBJECT_DECLARE_SIMPLE_TYPE(VFIOUserPCIDevice, 
> VFIO_USER_PCI)
>  struct VFIOUserPCIDevice {
>  VFIOPCIDevice device;
>  char *sock_name;
> +bool secure_dma;/* disable shared mem for DMA */

  It's there, it's gone, it's back.

>  bool send_queued;   /* all sends are queued */
>  bool no_post;   /* all regions write are sync */
>  };
> diff --git a/hw/vfio/user-protocol.h b/hw/vfio/user-protocol.h
> index 5614efa..ca53fce 100644
> --- a/hw/vfio/user-protocol.h
> +++ b/hw/vfio/user-protocol.h
> @@ -83,6 +83,31 @@ typedef struct {
>  
>  
>  /*
> + * VFIO_USER_DMA_MAP
> + * imported from struct vfio_iommu_type1_dma_map
> + */
> +typedef struct {
> +VFIOUserHdr hdr;
> +uint32_t argsz;
> +uint32_t flags;
> +uint64_t offset;/* FD offset */
> +uint64_t iova;
> +uint64_t size;
> +} VFIOUserDMAMap;
> +
> +/*
> + * VFIO_USER_DMA_UNMAP
> + * imported from struct vfio_iommu_type1_dma_unmap
> + */
> +typedef struct {
> +VFIOUserHdr hdr;
> +uint32_t argsz;
> +uint32_t flags;
> +uint64_t iova;
> +uint64_t size;
> +} VFIOUserDMAUnmap;
> +
> +/*
>   * VFIO_USER_DEVICE_GET_INFO
>   * imported from struct_device_info
>   */
> @@ -146,4 +171,11 @@ typedef struct {
>  char data[];
>  } VFIOUserRegionRW;
>  
> +/*imported from struct vfio_bitmap */
> +typedef struct {
> +uint64_t pgsize;
> +uint64_t size;
> +char data[];
> +} VFIOUserBitmap;
> +
>  #endif /* VFIO_USER_PROTOCOL_H */
> diff --git a/hw/vfio/user.h b/hw/vfio/user.h
> index 8d03e7c..997f748 100644
> --- a/hw/vfio/user.h
> +++ b/hw/vfio/user.h
> @@ -74,6 +74,7 @@ typedef struct VFIOProxy {
>  
>  /* VFIOProxy flags */
>  #define VFIO_PROXY_CLIENT0x1
> +#define VFIO_PROXY_SECURE0x2
>  #define VFIO_PROXY_FORCE_QUEUED  0x4
>  #define VFIO_PROXY_NO_POST   0x8
>  
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index c0e7632..dcfae2c 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -90,6 +90,8 @@ typedef struct VFIOContainer {
>  VFIOContIO *io_ops;
>  bool initialized;
>  bool dirty_pages_supported;
> +bool will_commit;

The entire will_commit concept hidden in the map and unmap operations
from many patches ago should be introduced here, or later.

> +bool need_map_fd;
>  uint64_t dirty_pgsizes;
>  uint64_t max_dirty_bitmap_size;
>  unsigned long pgsizes;
> @@ -210,6 +212,7 @@ struct VFIOContIO {
>  int (*dirty_bitmap)(VFIOContainer *container,
>  struct vfio_iommu_type1_dirty_bitmap *bitmap,
>  struct vfio_iommu_type1_dirty_bitmap_get *range);
> +void (*wait_commit)(VFIOContainer *container);
>  };
>  
>  #define CONT_DMA_MAP(cont, map, fd, will_commit) \
> @@ -218,6 +221,7 @@ struct VFIOContIO {
>  ((cont)->io_ops->dma_unmap((cont), (unmap), (bitmap), (will_commit)))
>  #define CONT_DIRTY_BITMAP(cont, bitmap, range) \
>  ((cont)->io_ops->dirty_bitmap((cont), (bitmap), (range)))
> +#define CONT_WAIT_COMMIT(cont) ((cont)->io_ops->wait_commit(cont))
>  
>  extern VFIODevIO vfio_dev_io_ioctl;
>  extern VFIOContIO vfio_cont_io_ioctl;
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index fdd2702..0840c8f 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -411,6 +411,7 @@ static int vfio_dma_unmap_bitmap(VFIOContainer *container,
>  struct vfio_iommu_type1_dma_unmap *unmap;
>  struct vfio_bitmap *bitmap;
>  uint64_t pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size;
> +bool will_commit = container->will_commit;
>  int ret;
>  
>  unmap = g_malloc0(sizeof(*unmap) + sizeof(*bitmap));
> @@ -444,7 +445,7 @@ static int vfio_dma_unmap_bitmap(VFIOContainer *container,
>  goto unmap_exit;
>  }
>  
> -ret = CONT_DMA_UNMAP(container, unmap, bitmap, false);
> +ret = CONT_DMA_UNMAP(container, unmap, bitmap, will_commit);
>  if (!ret) {
>  cpu_physical_memory_set_dirty_lebitmap((unsigned long *)bitmap->data,
>  iotlb->translated_addr, pages);
> @@ -471,16 +472,17 @@ static int vfio_dma_unmap(VFIOContainer *container,
>  .iova = iova,
>  .size = size,
>  };
> +bool 

Re: [RFC v3 13/19] vfio-user: pci_user_realize PCI setup

2021-11-19 Thread Alex Williamson
On Mon,  8 Nov 2021 16:46:41 -0800
John Johnson  wrote:

> PCI BARs read from remote device
> PCI config reads/writes sent to remote server
> 
> Signed-off-by: Elena Ufimtseva 
> Signed-off-by: John G Johnson 
> Signed-off-by: Jagannathan Raman 
> ---
>  hw/vfio/pci.c | 89 
> ++-
>  1 file changed, 88 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index d5f9987..f8729b2 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -3551,8 +3551,93 @@ static void vfio_user_pci_realize(PCIDevice *pdev, 
> Error **errp)
>  goto error;
>  }
>  

There's a LOT of duplication with the kernel realize function here.
Thanks,

Alex


> +/* Get a copy of config space */
> +ret = VDEV_REGION_READ(vbasedev, VFIO_PCI_CONFIG_REGION_INDEX, 0,
> +   MIN(pci_config_size(pdev), vdev->config_size),
> +   pdev->config);
> +if (ret < (int)MIN(pci_config_size(>pdev), vdev->config_size)) {
> +error_setg_errno(errp, -ret, "failed to read device config space");
> +goto error;
> +}
> +
> +/* vfio emulates a lot for us, but some bits need extra love */
> +vdev->emulated_config_bits = g_malloc0(vdev->config_size);
> +
> +/* QEMU can choose to expose the ROM or not */
> +memset(vdev->emulated_config_bits + PCI_ROM_ADDRESS, 0xff, 4);
> +/* QEMU can also add or extend BARs */
> +memset(vdev->emulated_config_bits + PCI_BASE_ADDRESS_0, 0xff, 6 * 4);
> +vdev->vendor_id = pci_get_word(pdev->config + PCI_VENDOR_ID);
> +vdev->device_id = pci_get_word(pdev->config + PCI_DEVICE_ID);
> +
> +/* QEMU can change multi-function devices to single function, or reverse 
> */
> +vdev->emulated_config_bits[PCI_HEADER_TYPE] =
> +  PCI_HEADER_TYPE_MULTI_FUNCTION;
> +
> +/* Restore or clear multifunction, this is always controlled by QEMU */
> +if (vdev->pdev.cap_present & QEMU_PCI_CAP_MULTIFUNCTION) {
> +vdev->pdev.config[PCI_HEADER_TYPE] |= PCI_HEADER_TYPE_MULTI_FUNCTION;
> +} else {
> +vdev->pdev.config[PCI_HEADER_TYPE] &= 
> ~PCI_HEADER_TYPE_MULTI_FUNCTION;
> +}
> +
> +/*
> + * Clear host resource mapping info.  If we choose not to register a
> + * BAR, such as might be the case with the option ROM, we can get
> + * confusing, unwritable, residual addresses from the host here.
> + */
> +memset(>pdev.config[PCI_BASE_ADDRESS_0], 0, 24);
> +memset(>pdev.config[PCI_ROM_ADDRESS], 0, 4);
> +
> +vfio_pci_size_rom(vdev);
> +
> +vfio_bars_prepare(vdev);
> +
> +vfio_msix_early_setup(vdev, );
> +if (err) {
> +error_propagate(errp, err);
> +goto error;
> +}
> +
> +vfio_bars_register(vdev);
> +
> +ret = vfio_add_capabilities(vdev, errp);
> +if (ret) {
> +goto out_teardown;
> +}
> +
> +/* QEMU emulates all of MSI & MSIX */
> +if (pdev->cap_present & QEMU_PCI_CAP_MSIX) {
> +memset(vdev->emulated_config_bits + pdev->msix_cap, 0xff,
> +   MSIX_CAP_LENGTH);
> +}
> +
> +if (pdev->cap_present & QEMU_PCI_CAP_MSI) {
> +memset(vdev->emulated_config_bits + pdev->msi_cap, 0xff,
> +   vdev->msi_cap_size);
> +}
> +
> +if (vdev->pdev.config[PCI_INTERRUPT_PIN] != 0) {
> +vdev->intx.mmap_timer = timer_new_ms(QEMU_CLOCK_VIRTUAL,
> + vfio_intx_mmap_enable, vdev);
> +pci_device_set_intx_routing_notifier(>pdev,
> + vfio_intx_routing_notifier);
> +vdev->irqchip_change_notifier.notify = vfio_irqchip_change;
> +kvm_irqchip_add_change_notifier(>irqchip_change_notifier);
> +ret = vfio_intx_enable(vdev, errp);
> +if (ret) {
> +goto out_deregister;
> +}
> +}
> +
>  return;
>  
> +out_deregister:
> +pci_device_set_intx_routing_notifier(>pdev, NULL);
> +kvm_irqchip_remove_change_notifier(>irqchip_change_notifier);
> +out_teardown:
> +vfio_teardown_msi(vdev);
> +vfio_bars_exit(vdev);
>  error:
>  vfio_user_disconnect(proxy);
>  error_prepend(errp, VFIO_MSG_PREFIX, vdev->vbasedev.name);
> @@ -3565,7 +3650,9 @@ static void vfio_user_instance_finalize(Object *obj)
>  
>  vfio_put_device(vdev);
>  
> -vfio_user_disconnect(vbasedev->proxy);
> +if (vbasedev->proxy != NULL) {
> +vfio_user_disconnect(vbasedev->proxy);
> +}
>  }
>  
>  static Property vfio_user_pci_dev_properties[] = {




Re: [RFC v3 05/19] Add validation ops vector

2021-11-19 Thread Alex Williamson


Add a prefix on Subject: please.  Same for previous in series.

On Mon,  8 Nov 2021 16:46:33 -0800
John Johnson  wrote:

> Validates cases where the return values aren't fully trusted
> (prep work for vfio-user, where the return values from the
> remote process aren't trusted)
> 
> Signed-off-by: John G Johnson 
> ---
>  include/hw/vfio/vfio-common.h | 21 ++
>  hw/vfio/pci.c | 67 
> +++
>  2 files changed, 88 insertions(+)
> 
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 43fa948..c0dbbfb 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -125,6 +125,7 @@ typedef struct VFIOHostDMAWindow {
>  
>  typedef struct VFIODeviceOps VFIODeviceOps;
>  typedef struct VFIODevIO VFIODevIO;
> +typedef struct VFIOValidOps VFIOValidOps;
>  
>  typedef struct VFIODevice {
>  QLIST_ENTRY(VFIODevice) next;
> @@ -141,6 +142,7 @@ typedef struct VFIODevice {
>  bool enable_migration;
>  VFIODeviceOps *ops;
>  VFIODevIO *io_ops;
> +VFIOValidOps *valid_ops;
>  unsigned int num_irqs;
>  unsigned int num_regions;
>  unsigned int flags;
> @@ -214,6 +216,25 @@ struct VFIOContIO {
>  extern VFIODevIO vfio_dev_io_ioctl;
>  extern VFIOContIO vfio_cont_io_ioctl;
>  
> +/*
> + * This ops vector allows for bus-specific verification
> + * routines in cases where the server may not be fully
> + * trusted.
> + */
> +struct VFIOValidOps {
> +int (*validate_get_info)(VFIODevice *vdev, struct vfio_device_info 
> *info);
> +int (*validate_get_region_info)(VFIODevice *vdev,
> +struct vfio_region_info *info, int *fd);
> +int (*validate_get_irq_info)(VFIODevice *vdev, struct vfio_irq_info 
> *info);
> +};
> +
> +#define VDEV_VALID_INFO(vdev, info) \
> +((vdev)->valid_ops->validate_get_info((vdev), (info)))
> +#define VDEV_VALID_REGION_INFO(vdev, info, fd) \
> +((vdev)->valid_ops->validate_get_region_info((vdev), (info), (fd)))
> +#define VDEV_VALID_IRQ_INFO(vdev, irq) \
> +((vdev)->valid_ops->validate_get_irq_info((vdev), (irq)))
> +
>  #endif /* CONFIG_LINUX */
>  
>  typedef struct VFIOGroup {
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 28f21f8..6e2ce35 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -3371,3 +3371,70 @@ static void register_vfio_pci_dev_type(void)
>  }
>  
>  type_init(register_vfio_pci_dev_type)
> +
> +
> +/*
> + * PCI validation ops - used when return values need
> + * validation before use
> + */
> +
> +static int vfio_pci_valid_info(VFIODevice *vbasedev,
> +   struct vfio_device_info *info)
> +{
> +/* must be PCI */
> +if ((info->flags & VFIO_DEVICE_FLAGS_PCI) == 0) {
> +return -EINVAL;
> +}
> +/* only other valid flag is reset */
> +if (info->flags & ~(VFIO_DEVICE_FLAGS_PCI | VFIO_DEVICE_FLAGS_RESET)) {
> +return -EINVAL;
> +}

This means QEMU vfio-pci breaks on any extension of the flags field.

> +/* account for extra migration region */
> +if (info->num_regions > VFIO_PCI_NUM_REGIONS + 1) {
> +return -EINVAL;
> +}

This is also invalid, there can be device specific regions beyond
migration.

> +if (info->num_irqs > VFIO_PCI_NUM_IRQS) {
> +return -EINVAL;
> +}

And device specific IRQs.

> +return 0;
> +}
> +
> +static int vfio_pci_valid_region_info(VFIODevice *vbasedev,
> +  struct vfio_region_info *info,
> +  int *fd)
> +{
> +if (info->flags & ~(VFIO_REGION_INFO_FLAG_READ |
> +VFIO_REGION_INFO_FLAG_WRITE |
> +VFIO_REGION_INFO_FLAG_MMAP |
> +VFIO_REGION_INFO_FLAG_CAPS)) {
> +return -EINVAL;
> +}

Similarly, this allows zero future extensions.  Notice for instance how
the CAPS flag was added later as a backwards compatible extension.

> +if (info->index > vbasedev->num_regions) {
> +return -EINVAL;
> +}
> +/* cap_offset in valid area */
> +if ((info->flags & VFIO_REGION_INFO_FLAG_CAPS) &&
> +(info->cap_offset < sizeof(*info) || info->cap_offset > 
> info->argsz)) {
> +return -EINVAL;
> +}
> +return 0;
> +}
> +
> +static int vfio_pci_valid_irq_info(VFIODevice *vbasedev,
> + struct vfio_irq_info *info)
> +{
> +if (info->flags & ~(VFIO_IRQ_INFO_EVENTFD | VFIO_IRQ_INFO_MASKABLE |
> +VFIO_IRQ_INFO_AUTOMASKED | VFIO_IRQ_INFO_NORESIZE)) {
> +return -EINVAL;
> +}

Similarly, nak.  Thanks,

Alex

> +if (info->index > vbasedev->num_irqs) {
> +return -EINVAL;
> +}
> +return 0;
> +}
> +
> +struct VFIOValidOps vfio_pci_valid_ops = {
> +.validate_get_info = vfio_pci_valid_info,
> +.validate_get_region_info = vfio_pci_valid_region_info,
> +.validate_get_irq_info = 

Re: [RFC v3 02/19] vfio-user: add VFIO base abstract class

2021-11-19 Thread Alex Williamson
On Mon,  8 Nov 2021 16:46:30 -0800
John Johnson  wrote:

> Add an abstract base class both the kernel driver
> and user socket implementations can use to share code.
> 
> Signed-off-by: John G Johnson 
> Signed-off-by: Elena Ufimtseva 
> Signed-off-by: Jagannathan Raman 
> ---
>  hw/vfio/pci.h |  16 +++--
>  hw/vfio/pci.c | 112 
> +++---
>  2 files changed, 81 insertions(+), 47 deletions(-)
> 
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index 6477751..bbc78aa 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -114,8 +114,13 @@ typedef struct VFIOMSIXInfo {
>  unsigned long *pending;
>  } VFIOMSIXInfo;
>  
> -#define TYPE_VFIO_PCI "vfio-pci"
> -OBJECT_DECLARE_SIMPLE_TYPE(VFIOPCIDevice, VFIO_PCI)
> +/*
> + * TYPE_VFIO_PCI_BASE is an abstract type used to share code
> + * between VFIO implementations that use a kernel driver
> + * with those that use user sockets.
> + */
> +#define TYPE_VFIO_PCI_BASE "vfio-pci-base"
> +OBJECT_DECLARE_SIMPLE_TYPE(VFIOPCIDevice, VFIO_PCI_BASE)
>  
>  struct VFIOPCIDevice {
>  PCIDevice pdev;
> @@ -175,6 +180,13 @@ struct VFIOPCIDevice {
>  Notifier irqchip_change_notifier;
>  };
>  
> +#define TYPE_VFIO_PCI "vfio-pci"
> +OBJECT_DECLARE_SIMPLE_TYPE(VFIOKernPCIDevice, VFIO_PCI)
> +
> +struct VFIOKernPCIDevice {
> +VFIOPCIDevice device;
> +};
> +
>  /* Use uin32_t for vendor & device so PCI_ANY_ID expands and cannot match hw 
> */
>  static inline bool vfio_pci_is(VFIOPCIDevice *vdev, uint32_t vendor, 
> uint32_t device)
>  {
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index e1ea1d8..122edf8 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -231,7 +231,7 @@ static void vfio_intx_update(VFIOPCIDevice *vdev, 
> PCIINTxRoute *route)
>  
>  static void vfio_intx_routing_notifier(PCIDevice *pdev)
>  {
> -VFIOPCIDevice *vdev = VFIO_PCI(pdev);
> +VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
>  PCIINTxRoute route;
>  
>  if (vdev->interrupt != VFIO_INT_INTx) {
> @@ -457,7 +457,7 @@ static void vfio_update_kvm_msi_virq(VFIOMSIVector 
> *vector, MSIMessage msg,
>  static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
> MSIMessage *msg, IOHandler *handler)
>  {
> -VFIOPCIDevice *vdev = VFIO_PCI(pdev);
> +VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
>  VFIOMSIVector *vector;
>  int ret;
>  
> @@ -542,7 +542,7 @@ static int vfio_msix_vector_use(PCIDevice *pdev,
>  
>  static void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr)
>  {
> -VFIOPCIDevice *vdev = VFIO_PCI(pdev);
> +VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
>  VFIOMSIVector *vector = >msi_vectors[nr];
>  
>  trace_vfio_msix_vector_release(vdev->vbasedev.name, nr);
> @@ -1063,7 +1063,7 @@ static const MemoryRegionOps vfio_vga_ops = {
>   */
>  static void vfio_sub_page_bar_update_mapping(PCIDevice *pdev, int bar)
>  {
> -VFIOPCIDevice *vdev = VFIO_PCI(pdev);
> +VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
>  VFIORegion *region = >bars[bar].region;
>  MemoryRegion *mmap_mr, *region_mr, *base_mr;
>  PCIIORegion *r;
> @@ -1109,7 +1109,7 @@ static void vfio_sub_page_bar_update_mapping(PCIDevice 
> *pdev, int bar)
>   */
>  uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len)
>  {
> -VFIOPCIDevice *vdev = VFIO_PCI(pdev);
> +VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
>  uint32_t emu_bits = 0, emu_val = 0, phys_val = 0, val;
>  
>  memcpy(_bits, vdev->emulated_config_bits + addr, len);
> @@ -1142,7 +1142,7 @@ uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t 
> addr, int len)
>  void vfio_pci_write_config(PCIDevice *pdev,
> uint32_t addr, uint32_t val, int len)
>  {
> -VFIOPCIDevice *vdev = VFIO_PCI(pdev);
> +VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
>  uint32_t val_le = cpu_to_le32(val);
>  
>  trace_vfio_pci_write_config(vdev->vbasedev.name, addr, val, len);
> @@ -2782,7 +2782,7 @@ static void vfio_unregister_req_notifier(VFIOPCIDevice 
> *vdev)
>  
>  static void vfio_realize(PCIDevice *pdev, Error **errp)
>  {
> -VFIOPCIDevice *vdev = VFIO_PCI(pdev);
> +VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
>  VFIODevice *vbasedev_iter;
>  VFIOGroup *group;
>  char *tmp, *subsys, group_path[PATH_MAX], *group_name;
> @@ -3105,7 +3105,7 @@ error:
>  
>  static void vfio_instance_finalize(Object *obj)
>  {
> -VFIOPCIDevice *vdev = VFIO_PCI(obj);
> +VFIOPCIDevice *vdev = VFIO_PCI_BASE(obj);
>  VFIOGroup *group = vdev->vbasedev.group;
>  
>  vfio_display_finalize(vdev);
> @@ -3125,7 +3125,7 @@ static void vfio_instance_finalize(Object *obj)
>  
>  static void vfio_exitfn(PCIDevice *pdev)
>  {
> -VFIOPCIDevice *vdev = VFIO_PCI(pdev);
> +VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
>  
>  vfio_unregister_req_notifier(vdev);
>  vfio_unregister_err_notifier(vdev);
> @@ -3144,7 +3144,7 @@ static void 

Re: [RFC v3 07/19] vfio-user: connect vfio proxy to remote server

2021-11-19 Thread Alex Williamson
On Mon,  8 Nov 2021 16:46:35 -0800
John Johnson  wrote:

> Signed-off-by: John G Johnson 
> Signed-off-by: Elena Ufimtseva 
> Signed-off-by: Jagannathan Raman 
> ---
>  hw/vfio/user.h|  78 +++
>  include/hw/vfio/vfio-common.h |   2 +
>  hw/vfio/pci.c |  20 +
>  hw/vfio/user.c| 170 
> ++
>  MAINTAINERS   |   4 +
>  hw/vfio/meson.build   |   1 +
>  6 files changed, 275 insertions(+)
>  create mode 100644 hw/vfio/user.h
>  create mode 100644 hw/vfio/user.c
> 
> diff --git a/hw/vfio/user.h b/hw/vfio/user.h
> new file mode 100644
> index 000..301ef6a
> --- /dev/null
> +++ b/hw/vfio/user.h
> @@ -0,0 +1,78 @@
> +#ifndef VFIO_USER_H
> +#define VFIO_USER_H
> +
> +/*
> + * vfio protocol over a UNIX socket.
> + *
> + * Copyright © 2018, 2021 Oracle and/or its affiliates.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.  See
> + * the COPYING file in the top-level directory.
> + *
> + */
> +
> +typedef struct {
> +int send_fds;
> +int recv_fds;
> +int *fds;
> +} VFIOUserFDs;
> +
> +enum msg_type {
> +VFIO_MSG_NONE,
> +VFIO_MSG_ASYNC,
> +VFIO_MSG_WAIT,
> +VFIO_MSG_NOWAIT,
> +VFIO_MSG_REQ,
> +};
> +
> +typedef struct VFIOUserMsg {
> +QTAILQ_ENTRY(VFIOUserMsg) next;
> +VFIOUserFDs *fds;
> +uint32_t rsize;
> +uint32_t id;
> +QemuCond cv;
> +bool complete;
> +enum msg_type type;
> +} VFIOUserMsg;
> +
> +
> +enum proxy_state {
> +VFIO_PROXY_CONNECTED = 1,
> +VFIO_PROXY_ERROR = 2,
> +VFIO_PROXY_CLOSING = 3,
> +VFIO_PROXY_CLOSED = 4,
> +};
> +
> +typedef QTAILQ_HEAD(VFIOUserMsgQ, VFIOUserMsg) VFIOUserMsgQ;
> +
> +typedef struct VFIOProxy {
> +QLIST_ENTRY(VFIOProxy) next;
> +char *sockname;
> +struct QIOChannel *ioc;
> +void (*request)(void *opaque, VFIOUserMsg *msg);
> +void *req_arg;
> +int flags;
> +QemuCond close_cv;
> +AioContext *ctx;
> +QEMUBH *req_bh;
> +
> +/*
> + * above only changed when BQL is held
> + * below are protected by per-proxy lock
> + */
> +QemuMutex lock;
> +VFIOUserMsgQ free;
> +VFIOUserMsgQ pending;
> +VFIOUserMsgQ incoming;
> +VFIOUserMsgQ outgoing;
> +VFIOUserMsg *last_nowait;
> +enum proxy_state state;
> +} VFIOProxy;
> +
> +/* VFIOProxy flags */
> +#define VFIO_PROXY_CLIENT   0x1
> +
> +VFIOProxy *vfio_user_connect_dev(SocketAddress *addr, Error **errp);
> +void vfio_user_disconnect(VFIOProxy *proxy);
> +
> +#endif /* VFIO_USER_H */
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index c0dbbfb..224dbf8 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -76,6 +76,7 @@ typedef struct VFIOAddressSpace {
>  
>  struct VFIOGroup;
>  typedef struct VFIOContIO VFIOContIO;
> +typedef struct VFIOProxy VFIOProxy;
>  
>  typedef struct VFIOContainer {
>  VFIOAddressSpace *space;
> @@ -150,6 +151,7 @@ typedef struct VFIODevice {
>  Error *migration_blocker;
>  OnOffAuto pre_copy_dirty_page_tracking;
>  struct vfio_region_info **regions;
> +VFIOProxy *proxy;
>  } VFIODevice;
>  
>  struct VFIODeviceOps {
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index fa3e028..ebfabb1 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -43,6 +43,7 @@
>  #include "qapi/error.h"
>  #include "migration/blocker.h"
>  #include "migration/qemu-file.h"
> +#include "hw/vfio/user.h"
>  
>  #define TYPE_VFIO_PCI_NOHOTPLUG "vfio-pci-nohotplug"
>  
> @@ -3476,6 +3477,9 @@ static void vfio_user_pci_realize(PCIDevice *pdev, 
> Error **errp)
>  VFIOUserPCIDevice *udev = VFIO_USER_PCI(pdev);
>  VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
>  VFIODevice *vbasedev = >vbasedev;
> +SocketAddress addr;
> +VFIOProxy *proxy;
> +Error *err = NULL;
>  
>  /*
>   * TODO: make option parser understand SocketAddress
> @@ -3488,6 +3492,16 @@ static void vfio_user_pci_realize(PCIDevice *pdev, 
> Error **errp)
>  return;
>  }
>  
> +memset(, 0, sizeof(addr));
> +addr.type = SOCKET_ADDRESS_TYPE_UNIX;
> +addr.u.q_unix.path = udev->sock_name;
> +proxy = vfio_user_connect_dev(, );
> +if (!proxy) {
> +error_setg(errp, "Remote proxy not found");
> +return;
> +}
> +vbasedev->proxy = proxy;
> +
>  vbasedev->name = g_strdup_printf("VFIO user <%s>", udev->sock_name);
>  vbasedev->dev = DEVICE(vdev);
>  vbasedev->fd = -1;
> @@ -3500,6 +3514,12 @@ static void vfio_user_pci_realize(PCIDevice *pdev, 
> Error **errp)
>  
>  static void vfio_user_instance_finalize(Object *obj)
>  {
> +VFIOPCIDevice *vdev = VFIO_PCI_BASE(obj);
> +VFIODevice *vbasedev = >vbasedev;
> +
> +vfio_put_device(vdev);

This looks suspiciously like the initial function in the previous patch
should not have been empty.  Thanks,

Alex

> +
> +

Re: [RFC v3 08/19] vfio-user: define socket receive functions

2021-11-19 Thread Alex Williamson
On Mon,  8 Nov 2021 16:46:36 -0800
John Johnson  wrote:

> Add infrastructure needed to receive incoming messages
> 
> Signed-off-by: John G Johnson 
> Signed-off-by: Elena Ufimtseva 
> Signed-off-by: Jagannathan Raman 
> ---
>  hw/vfio/pci.h   |   2 +-
>  hw/vfio/user-protocol.h |  62 +
>  hw/vfio/user.h  |   9 +-
>  hw/vfio/pci.c   |  12 +-
>  hw/vfio/user.c  | 326 
> 
>  MAINTAINERS |   1 +
>  6 files changed, 409 insertions(+), 3 deletions(-)
>  create mode 100644 hw/vfio/user-protocol.h
> 
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index 08ac647..ec9f345 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -193,7 +193,7 @@ OBJECT_DECLARE_SIMPLE_TYPE(VFIOUserPCIDevice, 
> VFIO_USER_PCI)
>  struct VFIOUserPCIDevice {
>  VFIOPCIDevice device;
>  char *sock_name;
> -bool secure_dma; /* disable shared mem for DMA */

Don't introduce it into the series to start with, confusing to review.

> +bool send_queued;   /* all sends are queued */
>  };
>  
>  /* Use uin32_t for vendor & device so PCI_ANY_ID expands and cannot match hw 
> */
> diff --git a/hw/vfio/user-protocol.h b/hw/vfio/user-protocol.h
> new file mode 100644
> index 000..27062cb
> --- /dev/null
> +++ b/hw/vfio/user-protocol.h
> @@ -0,0 +1,62 @@
> +#ifndef VFIO_USER_PROTOCOL_H
> +#define VFIO_USER_PROTOCOL_H
> +
> +/*
> + * vfio protocol over a UNIX socket.
> + *
> + * Copyright © 2018, 2021 Oracle and/or its affiliates.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.  See
> + * the COPYING file in the top-level directory.
> + *
> + * Each message has a standard header that describes the command
> + * being sent, which is almost always a VFIO ioctl().
> + *
> + * The header may be followed by command-specific data, such as the
> + * region and offset info for read and write commands.
> + */
> +
> +typedef struct {
> +uint16_t id;
> +uint16_t command;
> +uint32_t size;
> +uint32_t flags;
> +uint32_t error_reply;
> +} VFIOUserHdr;
> +

A comment referencing the doc would probably be a good idea about here.

> +/* VFIOUserHdr commands */
> +enum vfio_user_command {
> +VFIO_USER_VERSION   = 1,
> +VFIO_USER_DMA_MAP   = 2,
> +VFIO_USER_DMA_UNMAP = 3,
> +VFIO_USER_DEVICE_GET_INFO   = 4,
> +VFIO_USER_DEVICE_GET_REGION_INFO= 5,
> +VFIO_USER_DEVICE_GET_REGION_IO_FDS  = 6,
> +VFIO_USER_DEVICE_GET_IRQ_INFO   = 7,
> +VFIO_USER_DEVICE_SET_IRQS   = 8,
> +VFIO_USER_REGION_READ   = 9,
> +VFIO_USER_REGION_WRITE  = 10,
> +VFIO_USER_DMA_READ  = 11,
> +VFIO_USER_DMA_WRITE = 12,
> +VFIO_USER_DEVICE_RESET  = 13,
> +VFIO_USER_DIRTY_PAGES   = 14,
> +VFIO_USER_MAX,
> +};
> +
> +/* VFIOUserHdr flags */
> +#define VFIO_USER_REQUEST   0x0
> +#define VFIO_USER_REPLY 0x1
> +#define VFIO_USER_TYPE  0xF
> +
> +#define VFIO_USER_NO_REPLY  0x10
> +#define VFIO_USER_ERROR 0x20
> +
> +
> +#define VFIO_USER_DEF_MAX_FDS   8
> +#define VFIO_USER_MAX_MAX_FDS   16
> +
> +#define VFIO_USER_DEF_MAX_XFER  (1024 * 1024)
> +#define VFIO_USER_MAX_MAX_XFER  (64 * 1024 * 1024)

These are essentially magic numbers, some discussion of how these
limits are derived would be useful for future contributors, but also
only DEV_MAX_XFER is used in this patch and it's confusing why the
macro isn't used directly.  Most of the logic surrounding these is
added in the next patch, so it doesn't really make sense to add them
here.  Thanks,

Alex

> +
> +
> +#endif /* VFIO_USER_PROTOCOL_H */
> diff --git a/hw/vfio/user.h b/hw/vfio/user.h
> index 301ef6a..bd3717f 100644
> --- a/hw/vfio/user.h
> +++ b/hw/vfio/user.h
> @@ -11,6 +11,8 @@
>   *
>   */
>  
> +#include "user-protocol.h"
> +
>  typedef struct {
>  int send_fds;
>  int recv_fds;
> @@ -27,6 +29,7 @@ enum msg_type {
>  
>  typedef struct VFIOUserMsg {
>  QTAILQ_ENTRY(VFIOUserMsg) next;
> +VFIOUserHdr *hdr;
>  VFIOUserFDs *fds;
>  uint32_t rsize;
>  uint32_t id;
> @@ -70,9 +73,13 @@ typedef struct VFIOProxy {
>  } VFIOProxy;
>  
>  /* VFIOProxy flags */
> -#define VFIO_PROXY_CLIENT   0x1
> +#define VFIO_PROXY_CLIENT0x1
> +#define VFIO_PROXY_FORCE_QUEUED  0x4
>  
>  VFIOProxy *vfio_user_connect_dev(SocketAddress *addr, Error **errp);
>  void vfio_user_disconnect(VFIOProxy *proxy);
> +void vfio_user_set_handler(VFIODevice *vbasedev,
> +   void (*handler)(void *opaque, VFIOUserMsg *msg),
> +   void *reqarg);
>  
>  #endif /* VFIO_USER_H */
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index ebfabb1..db45179 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -3448,6 +3448,11 @@ struct VFIOValidOps vfio_pci_valid_ops = {
>   * vfio-user routines.
>   */
>  
> 

Re: [RFC v3 19/19] vfio-user: migration support

2021-11-19 Thread Alex Williamson
On Mon,  8 Nov 2021 16:46:47 -0800
John Johnson  wrote:

> bug fix: only set qemu file error if there is a file


I don't understand this commit log.  Is this meant to be a revision
log?  In general it would be nice to have more detailed commit logs for
many of these patches describing any nuances.

Note that the migration uAPI is being reevaluated on the kernel side,
so I expect we'll want to hold off on vfio-user beyond minimal stub
support.  Thanks,

Alex

> Signed-off-by: John G Johnson 
> Signed-off-by: Elena Ufimtseva 
> Signed-off-by: Jagannathan Raman 
> ---
>  hw/vfio/user-protocol.h | 18 +
>  hw/vfio/migration.c | 34 +++
>  hw/vfio/pci.c   |  7 +++
>  hw/vfio/user.c  | 54 
> +
>  4 files changed, 96 insertions(+), 17 deletions(-)
> 
> diff --git a/hw/vfio/user-protocol.h b/hw/vfio/user-protocol.h
> index c5d9473..bad067a 100644
> --- a/hw/vfio/user-protocol.h
> +++ b/hw/vfio/user-protocol.h
> @@ -182,6 +182,10 @@ typedef struct {
>  char data[];
>  } VFIOUserDMARW;
>  
> +/*
> + * VFIO_USER_DIRTY_PAGES
> + */
> +
>  /*imported from struct vfio_bitmap */
>  typedef struct {
>  uint64_t pgsize;
> @@ -189,4 +193,18 @@ typedef struct {
>  char data[];
>  } VFIOUserBitmap;
>  
> +/* imported from struct vfio_iommu_type1_dirty_bitmap_get */
> +typedef struct {
> +uint64_t iova;
> +uint64_t size;
> +VFIOUserBitmap bitmap;
> +} VFIOUserBitmapRange;
> +
> +/* imported from struct vfio_iommu_type1_dirty_bitmap */
> +typedef struct {
> +VFIOUserHdr hdr;
> +uint32_t argsz;
> +uint32_t flags;
> +} VFIOUserDirtyPages;
> +
>  #endif /* VFIO_USER_PROTOCOL_H */
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 82f654a..3d379cb 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -27,6 +27,7 @@
>  #include "pci.h"
>  #include "trace.h"
>  #include "hw/hw.h"
> +#include "user.h"
>  
>  /*
>   * Flags to be used as unique delimiters for VFIO devices in the migration
> @@ -49,11 +50,13 @@ static int64_t bytes_transferred;
>  static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count,
>off_t off, bool iswrite)
>  {
> +VFIORegion *region = >migration->region;
>  int ret;
>  
> -ret = iswrite ? pwrite(vbasedev->fd, val, count, off) :
> -pread(vbasedev->fd, val, count, off);
> -if (ret < count) {
> +ret = iswrite ?
> +VDEV_REGION_WRITE(vbasedev, region->nr, off, count, val, false) :
> +VDEV_REGION_READ(vbasedev, region->nr, off, count, val);
> + if (ret < count) {
>  error_report("vfio_mig_%s %d byte %s: failed at offset 0x%"
>   HWADDR_PRIx", err: %s", iswrite ? "write" : "read", 
> count,
>   vbasedev->name, off, strerror(errno));
> @@ -111,9 +114,7 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, 
> uint32_t mask,
>  uint32_t value)
>  {
>  VFIOMigration *migration = vbasedev->migration;
> -VFIORegion *region = >region;
> -off_t dev_state_off = region->fd_offset +
> -  VFIO_MIG_STRUCT_OFFSET(device_state);
> +off_t dev_state_off = VFIO_MIG_STRUCT_OFFSET(device_state);
>  uint32_t device_state;
>  int ret;
>  
> @@ -201,13 +202,13 @@ static int vfio_save_buffer(QEMUFile *f, VFIODevice 
> *vbasedev, uint64_t *size)
>  int ret;
>  
>  ret = vfio_mig_read(vbasedev, _offset, sizeof(data_offset),
> -  region->fd_offset + 
> VFIO_MIG_STRUCT_OFFSET(data_offset));
> +VFIO_MIG_STRUCT_OFFSET(data_offset));
>  if (ret < 0) {
>  return ret;
>  }
>  
>  ret = vfio_mig_read(vbasedev, _size, sizeof(data_size),
> -region->fd_offset + 
> VFIO_MIG_STRUCT_OFFSET(data_size));
> +VFIO_MIG_STRUCT_OFFSET(data_size));
>  if (ret < 0) {
>  return ret;
>  }
> @@ -233,8 +234,7 @@ static int vfio_save_buffer(QEMUFile *f, VFIODevice 
> *vbasedev, uint64_t *size)
>  }
>  buf_allocated = true;
>  
> -ret = vfio_mig_read(vbasedev, buf, sec_size,
> -region->fd_offset + data_offset);
> +ret = vfio_mig_read(vbasedev, buf, sec_size, data_offset);
>  if (ret < 0) {
>  g_free(buf);
>  return ret;
> @@ -269,7 +269,7 @@ static int vfio_load_buffer(QEMUFile *f, VFIODevice 
> *vbasedev,
>  
>  do {
>  ret = vfio_mig_read(vbasedev, _offset, sizeof(data_offset),
> -  region->fd_offset + 
> VFIO_MIG_STRUCT_OFFSET(data_offset));
> +VFIO_MIG_STRUCT_OFFSET(data_offset));
>  if (ret < 0) {
>  return ret;
>  }
> @@ -309,8 +309,7 @@ static int vfio_load_buffer(QEMUFile *f, VFIODevice 
> 

Re: [RFC v3 06/19] vfio-user: Define type vfio_user_pci_dev_info

2021-11-19 Thread Alex Williamson
On Mon,  8 Nov 2021 16:46:34 -0800
John Johnson  wrote:

> New class for vfio-user with its class and instance
> constructors and destructors, and its pci ops.
> 
> Signed-off-by: Elena Ufimtseva 
> Signed-off-by: John G Johnson 
> Signed-off-by: Jagannathan Raman 
> ---
>  hw/vfio/pci.h   |  9 ++
>  hw/vfio/pci.c   | 97 
> +
>  hw/vfio/Kconfig | 10 ++
>  3 files changed, 116 insertions(+)
> 
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index bbc78aa..08ac647 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -187,6 +187,15 @@ struct VFIOKernPCIDevice {
>  VFIOPCIDevice device;
>  };
>  
> +#define TYPE_VFIO_USER_PCI "vfio-user-pci"
> +OBJECT_DECLARE_SIMPLE_TYPE(VFIOUserPCIDevice, VFIO_USER_PCI)
> +
> +struct VFIOUserPCIDevice {
> +VFIOPCIDevice device;
> +char *sock_name;
> +bool secure_dma; /* disable shared mem for DMA */
> +};
> +
>  /* Use uin32_t for vendor & device so PCI_ANY_ID expands and cannot match hw 
> */
>  static inline bool vfio_pci_is(VFIOPCIDevice *vdev, uint32_t vendor, 
> uint32_t device)
>  {
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 6e2ce35..fa3e028 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -19,6 +19,7 @@
>   */
>  
>  #include "qemu/osdep.h"
> +#include CONFIG_DEVICES
>  #include 
>  #include 
>  
> @@ -3438,3 +3439,99 @@ struct VFIOValidOps vfio_pci_valid_ops = {
>  .validate_get_region_info = vfio_pci_valid_region_info,
>  .validate_get_irq_info = vfio_pci_valid_irq_info,
>  };
> +
> +
> +#ifdef CONFIG_VFIO_USER_PCI
> +
> +/*
> + * vfio-user routines.
> + */
> +
> +/*
> + * Emulated devices don't use host hot reset
> + */
> +static int vfio_user_pci_no_reset(VFIODevice *vbasedev)
> +{
> +error_printf("vfio-user - no hot reset\n");
> +return 0;
> +}
> +
> +static void vfio_user_pci_not_needed(VFIODevice *vbasedev)
> +{
> +vbasedev->needs_reset = false;
> +}

Seems like we should make some of these optional rather than stubbing
dummy functions.

> +
> +static VFIODeviceOps vfio_user_pci_ops = {
> +.vfio_compute_needs_reset = vfio_user_pci_not_needed,
> +.vfio_hot_reset_multi = vfio_user_pci_no_reset,
> +.vfio_eoi = vfio_intx_eoi,
> +.vfio_get_object = vfio_pci_get_object,
> +.vfio_save_config = vfio_pci_save_config,
> +.vfio_load_config = vfio_pci_load_config,
> +};
> +
> +static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
> +{
> +ERRP_GUARD();
> +VFIOUserPCIDevice *udev = VFIO_USER_PCI(pdev);
> +VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
> +VFIODevice *vbasedev = >vbasedev;
> +
> +/*
> + * TODO: make option parser understand SocketAddress
> + * and use that instead of having scalar options
> + * for each socket type.
> + */
> +if (!udev->sock_name) {
> +error_setg(errp, "No socket specified");
> +error_append_hint(errp, "Use -device vfio-user-pci,socket=\n");
> +return;
> +}
> +
> +vbasedev->name = g_strdup_printf("VFIO user <%s>", udev->sock_name);
> +vbasedev->dev = DEVICE(vdev);
> +vbasedev->fd = -1;
> +vbasedev->type = VFIO_DEVICE_TYPE_PCI;
> +vbasedev->no_mmap = false;

Why hard coded rather than a property?  This is a useful debugging
feature to be able to trap all device accesses.  The device should work
either way.

> +vbasedev->ops = _user_pci_ops;
> +vbasedev->valid_ops = _pci_valid_ops;
> +
> +}
> +
> +static void vfio_user_instance_finalize(Object *obj)
> +{
> +}
> +
> +static Property vfio_user_pci_dev_properties[] = {
> +DEFINE_PROP_STRING("socket", VFIOUserPCIDevice, sock_name),
> +DEFINE_PROP_BOOL("secure-dma", VFIOUserPCIDevice, secure_dma, false),

Add this when it means something.  Thanks,

Alex

> +DEFINE_PROP_END_OF_LIST(),
> +};
> +
> +static void vfio_user_pci_dev_class_init(ObjectClass *klass, void *data)
> +{
> +DeviceClass *dc = DEVICE_CLASS(klass);
> +PCIDeviceClass *pdc = PCI_DEVICE_CLASS(klass);
> +
> +device_class_set_props(dc, vfio_user_pci_dev_properties);
> +dc->desc = "VFIO over socket PCI device assignment";
> +pdc->realize = vfio_user_pci_realize;
> +}
> +
> +static const TypeInfo vfio_user_pci_dev_info = {
> +.name = TYPE_VFIO_USER_PCI,
> +.parent = TYPE_VFIO_PCI_BASE,
> +.instance_size = sizeof(VFIOUserPCIDevice),
> +.class_init = vfio_user_pci_dev_class_init,
> +.instance_init = vfio_instance_init,
> +.instance_finalize = vfio_user_instance_finalize,
> +};
> +
> +static void register_vfio_user_dev_type(void)
> +{
> +type_register_static(_user_pci_dev_info);
> +}
> +
> +type_init(register_vfio_user_dev_type)
> +
> +#endif /* VFIO_USER_PCI */
> diff --git a/hw/vfio/Kconfig b/hw/vfio/Kconfig
> index 7cdba05..301894e 100644
> --- a/hw/vfio/Kconfig
> +++ b/hw/vfio/Kconfig
> @@ -2,6 +2,10 @@ config VFIO
>  bool
>  depends on LINUX
>  
> +config VFIO_USER
> +bool
> +depends on VFIO
> +
>  config VFIO_PCI
>  

Re: [RFC v3 04/19] Add device IO ops vector

2021-11-19 Thread Alex Williamson
On Mon,  8 Nov 2021 16:46:32 -0800
John Johnson  wrote:

> Used for communication with VFIO driver
> (prep work for vfio-user, which will communicate over a socket)
> 
> Signed-off-by: John G Johnson 
> ---
>  include/hw/vfio/vfio-common.h |  28 
>  hw/vfio/common.c  | 159 
> ++
>  hw/vfio/pci.c | 146 --
>  3 files changed, 265 insertions(+), 68 deletions(-)
> 
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 9b3c5e5..43fa948 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -124,6 +124,7 @@ typedef struct VFIOHostDMAWindow {
>  } VFIOHostDMAWindow;
>  
>  typedef struct VFIODeviceOps VFIODeviceOps;
> +typedef struct VFIODevIO VFIODevIO;
>  
>  typedef struct VFIODevice {
>  QLIST_ENTRY(VFIODevice) next;
> @@ -139,12 +140,14 @@ typedef struct VFIODevice {
>  bool ram_block_discard_allowed;
>  bool enable_migration;
>  VFIODeviceOps *ops;
> +VFIODevIO *io_ops;
>  unsigned int num_irqs;
>  unsigned int num_regions;
>  unsigned int flags;
>  VFIOMigration *migration;
>  Error *migration_blocker;
>  OnOffAuto pre_copy_dirty_page_tracking;
> +struct vfio_region_info **regions;
>  } VFIODevice;
>  
>  struct VFIODeviceOps {
> @@ -164,6 +167,30 @@ struct VFIODeviceOps {
>   * through ioctl() to the kernel VFIO driver, but vfio-user
>   * can use a socket to a remote process.
>   */
> +struct VFIODevIO {
> +int (*get_info)(VFIODevice *vdev, struct vfio_device_info *info);
> +int (*get_region_info)(VFIODevice *vdev,
> +   struct vfio_region_info *info, int *fd);
> +int (*get_irq_info)(VFIODevice *vdev, struct vfio_irq_info *irq);
> +int (*set_irqs)(VFIODevice *vdev, struct vfio_irq_set *irqs);
> +int (*region_read)(VFIODevice *vdev, uint8_t nr, off_t off, uint32_t 
> size,
> +   void *data);
> +int (*region_write)(VFIODevice *vdev, uint8_t nr, off_t off, uint32_t 
> size,
> +void *data, bool post);

The @post arg is not really developed in this patch, it would make
review easier to add the arg when it becomes implemented and used.

> +};
> +
> +#define VDEV_GET_INFO(vdev, info) \
> +((vdev)->io_ops->get_info((vdev), (info)))
> +#define VDEV_GET_REGION_INFO(vdev, info, fd) \
> +((vdev)->io_ops->get_region_info((vdev), (info), (fd)))
> +#define VDEV_GET_IRQ_INFO(vdev, irq) \
> +((vdev)->io_ops->get_irq_info((vdev), (irq)))
> +#define VDEV_SET_IRQS(vdev, irqs) \
> +((vdev)->io_ops->set_irqs((vdev), (irqs)))
> +#define VDEV_REGION_READ(vdev, nr, off, size, data) \
> +((vdev)->io_ops->region_read((vdev), (nr), (off), (size), (data)))
> +#define VDEV_REGION_WRITE(vdev, nr, off, size, data, post) \
> +((vdev)->io_ops->region_write((vdev), (nr), (off), (size), (data), 
> (post)))
>  
>  struct VFIOContIO {
>  int (*dma_map)(VFIOContainer *container,
> @@ -184,6 +211,7 @@ struct VFIOContIO {
>  #define CONT_DIRTY_BITMAP(cont, bitmap, range) \
>  ((cont)->io_ops->dirty_bitmap((cont), (bitmap), (range)))
>  
> +extern VFIODevIO vfio_dev_io_ioctl;
>  extern VFIOContIO vfio_cont_io_ioctl;
>  
>  #endif /* CONFIG_LINUX */
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 50748be..41fdd78 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -70,7 +70,7 @@ void vfio_disable_irqindex(VFIODevice *vbasedev, int index)
>  .count = 0,
>  };
>  
> -ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, _set);
> +VDEV_SET_IRQS(vbasedev, _set);
>  }
>  
>  void vfio_unmask_single_irqindex(VFIODevice *vbasedev, int index)
> @@ -83,7 +83,7 @@ void vfio_unmask_single_irqindex(VFIODevice *vbasedev, int 
> index)
>  .count = 1,
>  };
>  
> -ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, _set);
> +VDEV_SET_IRQS(vbasedev, _set);
>  }
>  
>  void vfio_mask_single_irqindex(VFIODevice *vbasedev, int index)
> @@ -96,7 +96,7 @@ void vfio_mask_single_irqindex(VFIODevice *vbasedev, int 
> index)
>  .count = 1,
>  };
>  
> -ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, _set);
> +VDEV_SET_IRQS(vbasedev, _set);
>  }
>  
>  static inline const char *action_to_str(int action)
> @@ -177,9 +177,7 @@ int vfio_set_irq_signaling(VFIODevice *vbasedev, int 
> index, int subindex,
>  pfd = (int32_t *)_set->data;
>  *pfd = fd;
>  
> -if (ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, irq_set)) {
> -ret = -errno;
> -}
> +ret = VDEV_SET_IRQS(vbasedev, irq_set);
>  g_free(irq_set);
>  
>  if (!ret) {
> @@ -214,6 +212,7 @@ void vfio_region_write(void *opaque, hwaddr addr,
>  uint32_t dword;
>  uint64_t qword;
>  } buf;
> +int ret;
>  
>  switch (size) {
>  case 1:
> @@ -233,13 +232,15 @@ void vfio_region_write(void *opaque, hwaddr addr,
>  break;
>  }
>  
> -if (pwrite(vbasedev->fd, , 

Re: [RFC v3 11/19] vfio-user: get region info

2021-11-19 Thread Alex Williamson
On Mon,  8 Nov 2021 16:46:39 -0800
John Johnson  wrote:

> Signed-off-by: Elena Ufimtseva 
> Signed-off-by: John G Johnson 
> Signed-off-by: Jagannathan Raman 
> ---
>  hw/vfio/user-protocol.h   | 14 
>  include/hw/vfio/vfio-common.h |  4 +++-
>  hw/vfio/common.c  | 30 +-
>  hw/vfio/user.c| 50 
> +++
>  4 files changed, 96 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/vfio/user-protocol.h b/hw/vfio/user-protocol.h
> index 13e44eb..104bf4f 100644
> --- a/hw/vfio/user-protocol.h
> +++ b/hw/vfio/user-protocol.h
> @@ -95,4 +95,18 @@ typedef struct {
>  uint32_t cap_offset;
>  } VFIOUserDeviceInfo;
>  
> +/*
> + * VFIO_USER_DEVICE_GET_REGION_INFO
> + * imported from struct_vfio_region_info
> + */
> +typedef struct {
> +VFIOUserHdr hdr;
> +uint32_t argsz;
> +uint32_t flags;
> +uint32_t index;
> +uint32_t cap_offset;
> +uint64_t size;
> +uint64_t offset;
> +} VFIOUserRegionInfo;
> +
>  #endif /* VFIO_USER_PROTOCOL_H */
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 224dbf8..e2d7ee1 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -56,6 +56,7 @@ typedef struct VFIORegion {
>  uint32_t nr_mmaps;
>  VFIOMmap *mmaps;
>  uint8_t nr; /* cache the region number for debug */
> +int remfd; /* fd if exported from remote process */
>  } VFIORegion;
>  
>  typedef struct VFIOMigration {
> @@ -150,8 +151,9 @@ typedef struct VFIODevice {
>  VFIOMigration *migration;
>  Error *migration_blocker;
>  OnOffAuto pre_copy_dirty_page_tracking;
> -struct vfio_region_info **regions;
>  VFIOProxy *proxy;
> +struct vfio_region_info **regions;
> +int *regfds;
>  } VFIODevice;
>  
>  struct VFIODeviceOps {
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 41fdd78..47ec28f 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -40,6 +40,7 @@
>  #include "trace.h"
>  #include "qapi/error.h"
>  #include "migration/migration.h"
> +#include "hw/vfio/user.h"
>  
>  VFIOGroupList vfio_group_list =
>  QLIST_HEAD_INITIALIZER(vfio_group_list);
> @@ -1491,6 +1492,16 @@ bool vfio_get_info_dma_avail(struct 
> vfio_iommu_type1_info *info,
>  return true;
>  }
>  
> +static int vfio_get_region_info_remfd(VFIODevice *vbasedev, int index)
> +{
> +struct vfio_region_info *info;
> +
> +if (vbasedev->regions == NULL || vbasedev->regions[index] == NULL) {
> +vfio_get_region_info(vbasedev, index, );
> +}
> +return vbasedev->regfds != NULL ? vbasedev->regfds[index] : -1;
> +}

This patch is really obscure due to the region info fd being added many
patches ago and only now being used.

Do we really want a parallel array to regions for storing these fds?

Why do we call an array of these fds "regfds" but a single one "remfd"?

Ugh, why do we have both the regfds array and a remfd per VFIORegion?

TBH, I'm still not sure why we're caching region infos at all, this
seems to be gratuitously bloated.

> +
>  static int vfio_setup_region_sparse_mmaps(VFIORegion *region,
>struct vfio_region_info *info)
>  {
> @@ -1544,6 +1555,7 @@ int vfio_region_setup(Object *obj, VFIODevice 
> *vbasedev, VFIORegion *region,
>  region->size = info->size;
>  region->fd_offset = info->offset;
>  region->nr = index;
> +region->remfd = vfio_get_region_info_remfd(vbasedev, index);

Why didn't we just get an fd back from vfio_get_region_info() that we
could use here?

>  
>  if (region->size) {
>  region->mem = g_new0(MemoryRegion, 1);
> @@ -1587,6 +1599,7 @@ int vfio_region_mmap(VFIORegion *region)
>  {
>  int i, prot = 0;
>  char *name;
> +int fd;
>  
>  if (!region->mem) {
>  return 0;
> @@ -1595,9 +1608,11 @@ int vfio_region_mmap(VFIORegion *region)
>  prot |= region->flags & VFIO_REGION_INFO_FLAG_READ ? PROT_READ : 0;
>  prot |= region->flags & VFIO_REGION_INFO_FLAG_WRITE ? PROT_WRITE : 0;
>  
> +fd = region->remfd != -1 ? region->remfd : region->vbasedev->fd;

Gack, why can't VFIORegion.fd be either the remote process fd or the
vbasedevfd to avoid all these switches?  Thanks,

Alex

> +
>  for (i = 0; i < region->nr_mmaps; i++) {
>  region->mmaps[i].mmap = mmap(NULL, region->mmaps[i].size, prot,
> - MAP_SHARED, region->vbasedev->fd,
> + MAP_SHARED, fd,
>   region->fd_offset +
>   region->mmaps[i].offset);
>  if (region->mmaps[i].mmap == MAP_FAILED) {
> @@ -2379,10 +2394,17 @@ void vfio_put_base_device(VFIODevice *vbasedev)
>  int i;
>  
>  for (i = 0; i < vbasedev->num_regions; i++) {
> +if (vbasedev->regfds != NULL && vbasedev->regfds[i] != -1) {
> +

Re: [RFC v3 12/19] vfio-user: region read/write

2021-11-19 Thread Alex Williamson
On Mon,  8 Nov 2021 16:46:40 -0800
John Johnson  wrote:

> Signed-off-by: Elena Ufimtseva 
> Signed-off-by: John G Johnson 
> Signed-off-by: Jagannathan Raman 
> ---
>  hw/vfio/pci.h |   1 +
>  hw/vfio/user-protocol.h   |  12 +
>  hw/vfio/user.h|   1 +
>  include/hw/vfio/vfio-common.h |   1 +
>  hw/vfio/common.c  |   7 ++-
>  hw/vfio/pci.c |   7 +++
>  hw/vfio/user.c| 101 
> ++
>  7 files changed, 129 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index ec9f345..643ff75 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -194,6 +194,7 @@ struct VFIOUserPCIDevice {
>  VFIOPCIDevice device;
>  char *sock_name;
>  bool send_queued;   /* all sends are queued */
> +bool no_post;   /* all regions write are sync */
>  };
>  
>  /* Use uin32_t for vendor & device so PCI_ANY_ID expands and cannot match hw 
> */
> diff --git a/hw/vfio/user-protocol.h b/hw/vfio/user-protocol.h
> index 104bf4f..56904cf 100644
> --- a/hw/vfio/user-protocol.h
> +++ b/hw/vfio/user-protocol.h
> @@ -109,4 +109,16 @@ typedef struct {
>  uint64_t offset;
>  } VFIOUserRegionInfo;
>  
> +/*
> + * VFIO_USER_REGION_READ
> + * VFIO_USER_REGION_WRITE
> + */
> +typedef struct {
> +VFIOUserHdr hdr;
> +uint64_t offset;
> +uint32_t region;
> +uint32_t count;
> +char data[];
> +} VFIOUserRegionRW;
> +
>  #endif /* VFIO_USER_PROTOCOL_H */
> diff --git a/hw/vfio/user.h b/hw/vfio/user.h
> index 19edd84..f2098f2 100644
> --- a/hw/vfio/user.h
> +++ b/hw/vfio/user.h
> @@ -75,6 +75,7 @@ typedef struct VFIOProxy {
>  /* VFIOProxy flags */
>  #define VFIO_PROXY_CLIENT0x1
>  #define VFIO_PROXY_FORCE_QUEUED  0x4
> +#define VFIO_PROXY_NO_POST   0x8
>  
>  VFIOProxy *vfio_user_connect_dev(SocketAddress *addr, Error **errp);
>  void vfio_user_disconnect(VFIOProxy *proxy);
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index e2d7ee1..b498964 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -56,6 +56,7 @@ typedef struct VFIORegion {
>  uint32_t nr_mmaps;
>  VFIOMmap *mmaps;
>  uint8_t nr; /* cache the region number for debug */
> +bool post_wr; /* writes can be posted */

As with the fd in the previous patch, this is where the concept of
posted writes should be introduced throughout.  Or maybe even better
would be to introduce write support without posting and the next patch
could expose posted writes.  Thanks,

Alex

>  int remfd; /* fd if exported from remote process */
>  } VFIORegion;
>  
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 47ec28f..e19f321 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -213,6 +213,7 @@ void vfio_region_write(void *opaque, hwaddr addr,
>  uint32_t dword;
>  uint64_t qword;
>  } buf;
> +bool post = region->post_wr;
>  int ret;
>  
>  switch (size) {
> @@ -233,7 +234,11 @@ void vfio_region_write(void *opaque, hwaddr addr,
>  break;
>  }
>  
> -ret = VDEV_REGION_WRITE(vbasedev, region->nr, addr, size, , false);
> +/* read-after-write hazard if guest can directly access region */
> +if (region->nr_mmaps) {
> +post = false;
> +}
> +ret = VDEV_REGION_WRITE(vbasedev, region->nr, addr, size, , post);
>  if (ret != size) {
>  const char *err = ret < 0 ? strerror(-ret) : "short write";
>  
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 40eb9e6..d5f9987 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -1665,6 +1665,9 @@ static void vfio_bar_prepare(VFIOPCIDevice *vdev, int 
> nr)
>  bar->type = pci_bar & (bar->ioport ? ~PCI_BASE_ADDRESS_IO_MASK :
>   ~PCI_BASE_ADDRESS_MEM_MASK);
>  bar->size = bar->region.size;
> +
> +/* IO regions are sync, memory can be async */
> +bar->region.post_wr = (bar->ioport == 0);
>  }
>  
>  static void vfio_bars_prepare(VFIOPCIDevice *vdev)
> @@ -3513,6 +3516,9 @@ static void vfio_user_pci_realize(PCIDevice *pdev, 
> Error **errp)
>  if (udev->send_queued) {
>  proxy->flags |= VFIO_PROXY_FORCE_QUEUED;
>  }
> +if (udev->no_post) {
> +proxy->flags |= VFIO_PROXY_NO_POST;
> +}
>  
>  vfio_user_validate_version(vbasedev, );
>  if (err != NULL) {
> @@ -3565,6 +3571,7 @@ static void vfio_user_instance_finalize(Object *obj)
>  static Property vfio_user_pci_dev_properties[] = {
>  DEFINE_PROP_STRING("socket", VFIOUserPCIDevice, sock_name),
>  DEFINE_PROP_BOOL("x-send-queued", VFIOUserPCIDevice, send_queued, false),
> +DEFINE_PROP_BOOL("x-no-posted-writes", VFIOUserPCIDevice, no_post, 
> false),
>  DEFINE_PROP_END_OF_LIST(),
>  };
>  
> diff --git a/hw/vfio/user.c b/hw/vfio/user.c
> index b40c4ed..781cbfd 100644
> --- a/hw/vfio/user.c
> +++ b/hw/vfio/user.c
> @@ -50,6 +50,8 @@ static void 

Re: [PATCH v2] vfio: Fix memory leak of hostwin

2021-11-17 Thread Alex Williamson
On Wed, 17 Nov 2021 09:47:39 +0800
Peng Liang  wrote:

> hostwin is allocated and added to hostwin_list in vfio_host_win_add, but
> it is only deleted from hostwin_list in vfio_host_win_del, which causes
> a memory leak.  Also, freeing all elements in hostwin_list is missing in
> vfio_disconnect_container.
> 
> Fix: 2e4109de8e58 ("vfio/spapr: Create DMA window dynamically (SPAPR IOMMU 
> v2)")
> CC: qemu-sta...@nongnu.org
> Signed-off-by: Peng Liang 
> ---
> v1 -> v2:
> - Don't change to _SAFE variant in vfio_host_win_del. [Alex]
> ---
>  hw/vfio/common.c | 8 
>  1 file changed, 8 insertions(+)

Thanks, pull request sent to include this in 6.2.

Alex

> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index dd387b0d3959..080046e3f511 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -551,6 +551,7 @@ static int vfio_host_win_del(VFIOContainer *container, 
> hwaddr min_iova,
>  QLIST_FOREACH(hostwin, >hostwin_list, hostwin_next) {
>  if (hostwin->min_iova == min_iova && hostwin->max_iova == max_iova) {
>  QLIST_REMOVE(hostwin, hostwin_next);
> +g_free(hostwin);
>  return 0;
>  }
>  }
> @@ -2239,6 +2240,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
>  if (QLIST_EMPTY(>group_list)) {
>  VFIOAddressSpace *space = container->space;
>  VFIOGuestIOMMU *giommu, *tmp;
> +VFIOHostDMAWindow *hostwin, *next;
>  
>  QLIST_REMOVE(container, next);
>  
> @@ -2249,6 +2251,12 @@ static void vfio_disconnect_container(VFIOGroup *group)
>  g_free(giommu);
>  }
>  
> +QLIST_FOREACH_SAFE(hostwin, >hostwin_list, hostwin_next,
> +   next) {
> +QLIST_REMOVE(hostwin, hostwin_next);
> +g_free(hostwin);
> +}
> +
>  trace_vfio_disconnect_container(container->fd);
>  close(container->fd);
>  g_free(container);




[PULL 1/1] vfio: Fix memory leak of hostwin

2021-11-17 Thread Alex Williamson
From: Peng Liang 

hostwin is allocated and added to hostwin_list in vfio_host_win_add, but
it is only deleted from hostwin_list in vfio_host_win_del, which causes
a memory leak.  Also, freeing all elements in hostwin_list is missing in
vfio_disconnect_container.

Fix: 2e4109de8e58 ("vfio/spapr: Create DMA window dynamically (SPAPR IOMMU v2)")
CC: qemu-sta...@nongnu.org
Signed-off-by: Peng Liang 
Link: https://lore.kernel.org/r/2027014739.1839263-1-liangpen...@huawei.com
Signed-off-by: Alex Williamson 
---
 hw/vfio/common.c |8 
 1 file changed, 8 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index dd387b0d3959..080046e3f511 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -551,6 +551,7 @@ static int vfio_host_win_del(VFIOContainer *container, 
hwaddr min_iova,
 QLIST_FOREACH(hostwin, >hostwin_list, hostwin_next) {
 if (hostwin->min_iova == min_iova && hostwin->max_iova == max_iova) {
 QLIST_REMOVE(hostwin, hostwin_next);
+g_free(hostwin);
 return 0;
 }
 }
@@ -2239,6 +2240,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
 if (QLIST_EMPTY(>group_list)) {
 VFIOAddressSpace *space = container->space;
 VFIOGuestIOMMU *giommu, *tmp;
+VFIOHostDMAWindow *hostwin, *next;
 
 QLIST_REMOVE(container, next);
 
@@ -2249,6 +2251,12 @@ static void vfio_disconnect_container(VFIOGroup *group)
 g_free(giommu);
 }
 
+QLIST_FOREACH_SAFE(hostwin, >hostwin_list, hostwin_next,
+   next) {
+QLIST_REMOVE(hostwin, hostwin_next);
+g_free(hostwin);
+}
+
 trace_vfio_disconnect_container(container->fd);
 close(container->fd);
 g_free(container);





[PULL 0/1] VFIO fixes 2021-11-17 (for v6.2)

2021-11-17 Thread Alex Williamson
The following changes since commit 3bb87484e77d22cf4e580a78856529c982195d32:

  Merge tag 'pull-request-2021-11-17' of https://gitlab.com/thuth/qemu into 
staging (2021-11-17 12:35:51 +0100)

are available in the Git repository at:

  git://github.com/awilliam/qemu-vfio.git tags/vfio-fixes-2027.0

for you to fetch changes up to f3bc3a73c908df15966e66f88d5a633bd42fd029:

  vfio: Fix memory leak of hostwin (2021-11-17 11:25:55 -0700)


VFIO fixes 2021-11-17

 * Fix hostwin memory leak (Peng Liang)


Peng Liang (1):
  vfio: Fix memory leak of hostwin

 hw/vfio/common.c | 8 
 1 file changed, 8 insertions(+)





Re: [PATCH] vfio: Fix memory leak of hostwin

2021-11-16 Thread Alex Williamson
On Tue, 16 Nov 2021 19:56:26 +0800
Peng Liang  wrote:

> hostwin is allocated and added to hostwin_list in vfio_host_win_add, but
> it is only deleted from hostwin_list in vfio_host_win_del, which causes
> a memory leak.  Also, freeing all elements in hostwin_list is missing in
> vfio_disconnect_container.
> 
> Fix: 2e4109de8e58 ("vfio/spapr: Create DMA window dynamically (SPAPR IOMMU 
> v2)")
> CC: qemu-sta...@nongnu.org
> Signed-off-by: Peng Liang 
> ---
>  hw/vfio/common.c | 12 ++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index dd387b0d3959..2cce60c5fac3 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -546,11 +546,12 @@ static void vfio_host_win_add(VFIOContainer *container,
>  static int vfio_host_win_del(VFIOContainer *container, hwaddr min_iova,
>   hwaddr max_iova)
>  {
> -VFIOHostDMAWindow *hostwin;
> +VFIOHostDMAWindow *hostwin, *next;
>  
> -QLIST_FOREACH(hostwin, >hostwin_list, hostwin_next) {
> +QLIST_FOREACH_SAFE(hostwin, >hostwin_list, hostwin_next, 
> next) {

Unnecessary conversion to _SAFE variant here, we don't continue to walk
the list after removing an object.

>  if (hostwin->min_iova == min_iova && hostwin->max_iova == max_iova) {
>  QLIST_REMOVE(hostwin, hostwin_next);
> +g_free(hostwin);
>  return 0;
>  }
>  }
> @@ -2239,6 +2240,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
>  if (QLIST_EMPTY(>group_list)) {
>  VFIOAddressSpace *space = container->space;
>  VFIOGuestIOMMU *giommu, *tmp;
> +VFIOHostDMAWindow *hostwin, *next;
>  
>  QLIST_REMOVE(container, next);
>  
> @@ -2249,6 +2251,12 @@ static void vfio_disconnect_container(VFIOGroup *group)
>  g_free(giommu);
>  }
>  
> +QLIST_FOREACH_SAFE(hostwin, >hostwin_list, hostwin_next,
> +   next) {
> +QLIST_REMOVE(hostwin, hostwin_next);
> +g_free(hostwin);
> +}
> +

This usage looks good.  Thanks,

Alex

>  trace_vfio_disconnect_container(container->fd);
>  close(container->fd);
>  g_free(container);




Re: [PATCH v5 0/6] optimize the downtime for vfio migration

2021-11-03 Thread Alex Williamson
On Wed, 3 Nov 2021 16:16:51 +0800
"Longpeng(Mike)"  wrote:

> Hi guys,
>  
> In vfio migration resume phase, the cost would increase if the
> vfio device has more unmasked vectors. We try to optimize it in
> this series.
>  
> You can see the commit message in PATCH 6 for details.
>  
> Patch 1-3 are simple cleanups and fixup.
> Patch 4-5 are the preparations for the optimization.
> Patch 6 optimizes the vfio msix setup path.
> 
> Changes v4->v5:
>  - setup the notifier and irqfd in the same function to makes
>the code neater.[Alex]

I wish this was posted a day earlier, QEMU entered soft-freeze for the
6.2 release yesterday[1].  Since vfio migration is still an
experimental feature, let's pick this up when the next development
window opens, and please try to get an ack from Paolo for the deferred
msi route function in the meantime.  Thanks,

Alex

[1]https://wiki.qemu.org/Planning/6.2

> 
> Changes v3->v4:
>  - fix several typos and grammatical errors [Alex]
>  - remove the patches that fix and clean the MSIX common part
>from this series [Alex]
>  - Patch 6:
> - use vector->use directly and fill it with -1 on error
>   paths [Alex]
> - add comment before enable deferring to commit [Alex]
> - move the code that do_use/release on vector 0 into an
>   "else" branch [Alex]
> - introduce vfio_prepare_kvm_msi_virq_batch() that enables
>   the 'defer_kvm_irq_routing' flag [Alex]
> - introduce vfio_commit_kvm_msi_virq_batch() that clears the
>   'defer_kvm_irq_routing' flag and does further work [Alex]
> 
> Changes v2->v3:
>  - fix two errors [Longpeng]
> 
> Changes v1->v2:
>  - fix several typos and grammatical errors [Alex, Philippe]
>  - split fixups and cleanups into separate patches  [Alex, Philippe]
>  - introduce kvm_irqchip_add_deferred_msi_route to
>minimize code changes[Alex]
>  - enable the optimization in msi setup path[Alex]
> 
> Longpeng (Mike) (6):
>   vfio: simplify the conditional statements in vfio_msi_enable
>   vfio: move re-enabling INTX out of the common helper
>   vfio: simplify the failure path in vfio_msi_enable
>   kvm: irqchip: extract kvm_irqchip_add_deferred_msi_route
>   Revert "vfio: Avoid disabling and enabling vectors repeatedly in VFIO
> migration"
>   vfio: defer to commit kvm irq routing when enable msi/msix
> 
>  accel/kvm/kvm-all.c  |  15 -
>  hw/vfio/pci.c| 176 
> ---
>  hw/vfio/pci.h|   1 +
>  include/sysemu/kvm.h |   6 ++
>  4 files changed, 130 insertions(+), 68 deletions(-)
> 




[PULL 2/2] vfio/common: Add a trace point when a MMIO RAM section cannot be mapped

2021-11-01 Thread Alex Williamson
From: Kunkun Jiang 

The MSI-X structures of some devices and other non-MSI-X structures
may be in the same BAR. They may share one host page, especially in
the case of large page granularity, such as 64K.

For example, MSIX-Table size of 82599 NIC is 0x30 and the offset in
Bar 3(size 64KB) is 0x0. vfio_listener_region_add() will be called
to map the remaining range (0x30-0x). If host page size is 64KB,
it will return early at 'int128_ge((int128_make64(iova), llend))'
without any message. Let's add a trace point to inform users like commit
5c08600547c0 ("vfio: Use a trace point when a RAM section cannot be DMA mapped")
did.

Signed-off-by: Kunkun Jiang 
Link: https://lore.kernel.org/r/20211027090406.761-3-jiangkun...@huawei.com
Signed-off-by: Alex Williamson 
---
 hw/vfio/common.c |7 +++
 1 file changed, 7 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index a784b219e6d4..dd387b0d3959 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -893,6 +893,13 @@ static void vfio_listener_region_add(MemoryListener 
*listener,
 llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask));
 
 if (int128_ge(int128_make64(iova), llend)) {
+if (memory_region_is_ram_device(section->mr)) {
+trace_vfio_listener_region_add_no_dma_map(
+memory_region_name(section->mr),
+section->offset_within_address_space,
+int128_getlo(section->size),
+qemu_real_host_page_size);
+}
 return;
 }
 end = int128_get64(int128_sub(llend, int128_one()));





[PULL 0/2] VFIO update 2021-11-01 (for v6.2)

2021-11-01 Thread Alex Williamson
The following changes since commit af531756d25541a1b3b3d9a14e72e7fedd941a2e:

  Merge remote-tracking branch 'remotes/philmd/tags/renesas-20211030' into 
staging (2021-10-30 11:31:41 -0700)

are available in the Git repository at:

  git://github.com/awilliam/qemu-vfio.git tags/vfio-update-20211101.0

for you to fetch changes up to e4b34708388b20f1ceb55f1d563d8da925a32424:

  vfio/common: Add a trace point when a MMIO RAM section cannot be mapped 
(2021-11-01 12:17:51 -0600)


VFIO update 2021-11-01

 * Re-enable expanded sub-page BAR mappings after migration (Kunkun Jiang)

 * Trace dropped listener sections due to page alignment (Kunkun Jiang)


Kunkun Jiang (2):
  vfio/pci: Add support for mmapping sub-page MMIO BARs after live migration
  vfio/common: Add a trace point when a MMIO RAM section cannot be mapped

 hw/vfio/common.c |  7 +++
 hw/vfio/pci.c| 19 ++-
 2 files changed, 25 insertions(+), 1 deletion(-)




[PULL 1/2] vfio/pci: Add support for mmapping sub-page MMIO BARs after live migration

2021-11-01 Thread Alex Williamson
From: Kunkun Jiang 

We can expand MemoryRegions of sub-page MMIO BARs in
vfio_pci_write_config() to improve IO performance for some
devices. However, the MemoryRegions of destination VM are
not expanded any more after live migration. Because their
addresses have been updated in vmstate_load_state()
(vfio_pci_load_config) and vfio_sub_page_bar_update_mapping()
will not be called.

This may result in poor performance after live migration.
So iterate BARs in vfio_pci_load_config() and try to update
sub-page BARs.

Reported-by: Nianyao Tang 
Reported-by: Qixin Gan 
Signed-off-by: Kunkun Jiang 
Link: https://lore.kernel.org/r/20211027090406.761-2-jiangkun...@huawei.com
Signed-off-by: Alex Williamson 
---
 hw/vfio/pci.c |   19 ++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 5cdf1d4298a7..7b45353ce27f 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2453,7 +2453,12 @@ static int vfio_pci_load_config(VFIODevice *vbasedev, 
QEMUFile *f)
 {
 VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
 PCIDevice *pdev = >pdev;
-int ret;
+pcibus_t old_addr[PCI_NUM_REGIONS - 1];
+int bar, ret;
+
+for (bar = 0; bar < PCI_ROM_SLOT; bar++) {
+old_addr[bar] = pdev->io_regions[bar].addr;
+}
 
 ret = vmstate_load_state(f, _vfio_pci_config, vdev, 1);
 if (ret) {
@@ -2463,6 +2468,18 @@ static int vfio_pci_load_config(VFIODevice *vbasedev, 
QEMUFile *f)
 vfio_pci_write_config(pdev, PCI_COMMAND,
   pci_get_word(pdev->config + PCI_COMMAND), 2);
 
+for (bar = 0; bar < PCI_ROM_SLOT; bar++) {
+/*
+ * The address may not be changed in some scenarios
+ * (e.g. the VF driver isn't loaded in VM).
+ */
+if (old_addr[bar] != pdev->io_regions[bar].addr &&
+vdev->bars[bar].region.size > 0 &&
+vdev->bars[bar].region.size < qemu_real_host_page_size) {
+vfio_sub_page_bar_update_mapping(pdev, bar);
+}
+}
+
 if (msi_enabled(pdev)) {
 vfio_msi_enable(vdev);
 } else if (msix_enabled(pdev)) {





  1   2   3   4   5   6   7   8   9   10   >