Re: [RFC PATCH 04/10] vfio/pci: Use vfio_device_unmap_mapping_range()

2021-02-24 Thread Jason Gunthorpe
On Wed, Feb 24, 2021 at 02:55:05PM -0700, Alex Williamson wrote:

> Ok, but how does this really help us, unless you're also proposing some
> redesign of the memory_lock semaphore?  Even if we're zapping all the
> affected devices for a bus reset, that doesn't eliminate the need for
> device-level granularity for other events.

Ok, I missed the device-level one; forget this remark about the reflck
then, per-vfio_pci_device is the right granularity.

> > >  struct vfio_pci_device {
> > >   struct pci_dev  *pdev;
> > > + struct vfio_device  *device;  
> > 
> > Ah, I did this too, but I didn't use a pointer :)
> 
> vfio_device is embedded in vfio.c, so that worries me.

I'm working on what we talked about in the other thread to show how
VFIO would look if it followed the normal Linux container_of idiom, and
to then remove the vfio_mdev.c indirection as an illustration.

I want to directly make the case that VFIO is better off following the
standard kernel designs and driver core norms, so we can reach an
agreement to get Max's work on vfio_pci going forward.

You were concerned about hand waving and maintainability, so I'm
willing to put in some more work to make it concrete.

I hope you'll keep an open mind

> > All the places trying to call vfio_device_put() when they really want
> > a vfio_pci_device * become simpler now. Eg struct vfio_devices wants
> > to have an array of vfio_pci_device, and get_pf_vdev() only needs to
> > return one pointer.
> 
> Sure, that example would be a good simplification.  I'm not sure I see
> other cases where we're going out of our way to manage the vfio_device
> versus vfio_pci_device objects though.

What I've found is there are lots of little costs all over the
place. The above was just easy to describe in this context.

In my mind the biggest negative cost is the type erasure. Lots of
places are using 'device *', 'void *', 'kobj *' as generic handles to
things that are actually already concretely typed. In the kernel I
would say concretely typing things is considered a virtue.

Having gone through a lot of VFIO now, as it relates to the
vfio_device, I think the type erasure directly hurts readability and
maintainability. It just takes too much information away from the
reader and the compiler. The core code is manageable, but once you add
mdev on top, which follows the same pattern, it really starts hurting.

Jason


Re: [RFC PATCH 04/10] vfio/pci: Use vfio_device_unmap_mapping_range()

2021-02-24 Thread Alex Williamson
On Mon, 22 Feb 2021 13:22:30 -0400
Jason Gunthorpe  wrote:

> On Mon, Feb 22, 2021 at 09:51:13AM -0700, Alex Williamson wrote:
> 
> > +   vfio_device_unmap_mapping_range(vdev->device,
> > +   VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_BAR0_REGION_INDEX),
> > +   VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_ROM_REGION_INDEX) -
> > +   VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_BAR0_REGION_INDEX));  
> 
> Isn't this the same as invalidating everything? I see in
> vfio_pci_mmap():
> 
>   if (index >= VFIO_PCI_ROM_REGION_INDEX)
>   return -EINVAL;

No, immediately above that is:

	if (index >= VFIO_PCI_NUM_REGIONS) {
		int regnum = index - VFIO_PCI_NUM_REGIONS;
		struct vfio_pci_region *region = vdev->region + regnum;

		if (region && region->ops && region->ops->mmap &&
		    (region->flags & VFIO_REGION_INFO_FLAG_MMAP))
			return region->ops->mmap(vdev, region, vma);
		return -EINVAL;
	}

We can have device-specific regions that support mmap, but those
regions aren't necessarily on-device memory, so we can't assume they're
tied to the memory-enable bit in the command register.
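
For reference, a small userspace sketch of the offset arithmetic
bounding that zap range; the constants mirror vfio_pci_private.h and
the vfio uapi header as best I can recall, so treat the exact values as
assumptions:

/* Sketch only: the fixed mmap offset layout that bounds the zap. */
#include <stdio.h>
#include <stdint.h>

#define VFIO_PCI_OFFSET_SHIFT		40
#define VFIO_PCI_INDEX_TO_OFFSET(index)	((uint64_t)(index) << VFIO_PCI_OFFSET_SHIFT)

enum {
	VFIO_PCI_BAR0_REGION_INDEX = 0,	/* BARs 0-5 occupy indexes 0-5 */
	VFIO_PCI_ROM_REGION_INDEX = 6,
	VFIO_PCI_NUM_REGIONS = 9,	/* device-specific regions start here */
};

int main(void)
{
	uint64_t start = VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_BAR0_REGION_INDEX);
	uint64_t len = VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_ROM_REGION_INDEX) - start;

	/* Only BAR mmaps fall in [start, start + len); device-specific
	 * regions live at or above the VFIO_PCI_NUM_REGIONS offset and
	 * are deliberately left alone by the zap. */
	printf("zap range: [0x%llx, 0x%llx)\n",
	       (unsigned long long)start, (unsigned long long)(start + len));
	printf("first device-specific offset: 0x%llx\n",
	       (unsigned long long)VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_NUM_REGIONS));
	return 0;
}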
 
> > @@ -2273,15 +2112,13 @@ static int vfio_pci_try_zap_and_vma_lock_cb(struct pci_dev *pdev, void *data)
> >  
> > vdev = vfio_device_data(device);
> >  
> > -   /*
> > -* Locking multiple devices is prone to deadlock, runaway and
> > -* unwind if we hit contention.
> > -*/
> > -   if (!vfio_pci_zap_and_vma_lock(vdev, true)) {
> > +   if (!down_write_trylock(&vdev->memory_lock)) {
> > vfio_device_put(device);
> > return -EBUSY;
> > }  
> 
> And this is only done as part of VFIO_DEVICE_PCI_HOT_RESET?

Yes.

> It looks like VFIO_DEVICE_PCI_HOT_RESET affects the entire slot?

Yes.

> How about putting the inode on the reflck structure, which is also
> per-slot, and then a single unmap_mapping_range() will take care of
> everything, no need to iterate over things in the driver core.
>
> Note the vm_pgoff space doesn't have any special meaning; it is fine
> for two struct vfio_pci_device's to share the same address space using
> incompatible, overlapping pg_offs.

Ok, but how does this really help us, unless you're also proposing some
redesign of the memory_lock semaphore?  Even if we're zapping all the
affected devices for a bus reset, that doesn't eliminate the need for
device-level granularity for other events.  Maybe there's some layering
of the inodes that you're implying that allows both, but it still feels
like a minor optimization if we need to traverse devices for the
memory_lock.

> > diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> > index 9cd1882a05af..ba37f4eeefd0 100644
> > +++ b/drivers/vfio/pci/vfio_pci_private.h
> > @@ -101,6 +101,7 @@ struct vfio_pci_mmap_vma {
> >  
> >  struct vfio_pci_device {
> > struct pci_dev  *pdev;
> > +   struct vfio_device  *device;  
> 
> Ah, I did this too, but I didn't use a pointer :)

vfio_device is embedded in vfio.c, so that worries me.

> All the places trying to call vfio_device_put() when they really want
> a vfio_pci_device * become simpler now. Eg struct vfio_devices wants
> to have an array of vfio_pci_device, and get_pf_vdev() only needs to
> return one pointer.

Sure, that example would be a good simplification.  I'm not sure I see
other cases where we're going out of our way to manage the vfio_device
versus vfio_pci_device objects though.  Thanks,

Alex



Re: [RFC PATCH 04/10] vfio/pci: Use vfio_device_unmap_mapping_range()

2021-02-22 Thread Jason Gunthorpe
On Mon, Feb 22, 2021 at 09:51:13AM -0700, Alex Williamson wrote:

> + vfio_device_unmap_mapping_range(vdev->device,
> + VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_BAR0_REGION_INDEX),
> + VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_ROM_REGION_INDEX) -
> + VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_BAR0_REGION_INDEX));

Isn't this the same as invalidating everything? I see in
vfio_pci_mmap():

	if (index >= VFIO_PCI_ROM_REGION_INDEX)
		return -EINVAL;

> @@ -2273,15 +2112,13 @@ static int vfio_pci_try_zap_and_vma_lock_cb(struct pci_dev *pdev, void *data)
>  
>   vdev = vfio_device_data(device);
>  
> - /*
> -  * Locking multiple devices is prone to deadlock, runaway and
> -  * unwind if we hit contention.
> -  */
> - if (!vfio_pci_zap_and_vma_lock(vdev, true)) {
> + if (!down_write_trylock(&vdev->memory_lock)) {
>   vfio_device_put(device);
>   return -EBUSY;
>   }

And this is only done as part of VFIO_DEVICE_PCI_HOT_RESET?

It looks like VFIO_DEVICE_PCI_HOT_RESET affects the entire slot?

How about putting the inode on the reflck structure, which is also
per-slot, and then a single unmap_mapping_range() will take care of
everything, no need to iterate over things in the driver core.

Note the vm_pgoff space doesn't have any special meaning; it is fine
for two struct vfio_pci_device's to share the same address space using
incompatible, overlapping pg_offs.
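
A rough sketch of that suggestion, with a hypothetical mapping member
hung off the reflck (illustrative only, not from any posted patch):

/* Kernel-context sketch: if every device under one reflck mmaps through
 * a shared address_space, a single unmap_mapping_range() zaps the whole
 * slot/bus at once. */
#include <linux/fs.h>		/* struct address_space */
#include <linux/mm.h>		/* unmap_mapping_range() */
#include <linux/kref.h>
#include <linux/mutex.h>

struct vfio_pci_reflck {
	struct kref kref;
	struct mutex lock;
	struct address_space *mapping;	/* hypothetical: shared per slot/bus */
};

static void vfio_pci_reflck_zap_all(struct vfio_pci_reflck *reflck)
{
	/* holelen of 0 means "to the end of the file"; per-device pg_off
	 * ranges may overlap in this space, which is harmless for a full
	 * zap. */
	unmap_mapping_range(reflck->mapping, 0, 0, true);
}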

> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> index 9cd1882a05af..ba37f4eeefd0 100644
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -101,6 +101,7 @@ struct vfio_pci_mmap_vma {
>  
>  struct vfio_pci_device {
>   struct pci_dev  *pdev;
> + struct vfio_device  *device;

Ah, I did this too, but I didn't use a pointer :)

All the places trying to call vfio_device_put() when they really want
a vfio_pci_device * become simpler now. Eg struct vfio_devices wants
to have an array of vfio_pci_device, and get_pf_vdev() only needs to
return one pointer.
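
A sketch of what that could look like with vfio_device embedded in
vfio_pci_device; the shape of today's helper is reconstructed from
memory of the 5.11-era code, so treat the details as an assumption:

/* Kernel-context sketch: with the core object embedded, the PF lookup
 * returns one concretely typed pointer; the caller later drops the
 * reference via vfio_device_put(&pf_vdev->vdev). */
static struct vfio_pci_device *get_pf_vdev(struct vfio_pci_device *vdev)
{
	struct pci_dev *physfn = pci_physfn(vdev->pdev);
	struct vfio_device *pf_dev;

	if (!vdev->pdev->is_virtfn)
		return NULL;

	pf_dev = vfio_device_get_from_dev(&physfn->dev);
	if (!pf_dev)
		return NULL;

	if (pci_dev_driver(physfn) != &vfio_pci_driver) {
		vfio_device_put(pf_dev);
		return NULL;
	}

	return container_of(pf_dev, struct vfio_pci_device, vdev);
}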

Jason


[RFC PATCH 04/10] vfio/pci: Use vfio_device_unmap_mapping_range()

2021-02-22 Thread Alex Williamson
With the vfio device fd tied to the address space of the pseudo fs
inode, we can use the mm to track all vmas that might be mmap'ing
device BARs, which removes our vma_list and all the complicated
lock ordering necessary to manually zap each related vma.

Suggested-by: Jason Gunthorpe 
Signed-off-by: Alex Williamson 
---
 drivers/vfio/pci/vfio_pci.c |  217 ---
 drivers/vfio/pci/vfio_pci_private.h |3 
 2 files changed, 28 insertions(+), 192 deletions(-)
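
The vfio_device_unmap_mapping_range() helper used below is introduced
earlier in this series; a plausible shape of it, purely as a guess based
on the description above (the inode member is an assumption):

/* Plausible shape only, not the actual patch: forward the zap to the
 * pseudo-fs inode backing the device fd. */
void vfio_device_unmap_mapping_range(struct vfio_device *device,
				     loff_t start, loff_t len)
{
	unmap_mapping_range(device->inode->i_mapping, start, len, true);
}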

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index f0a1d05f0137..115f10f7b096 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -225,7 +225,7 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
 
 static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev);
 static void vfio_pci_disable(struct vfio_pci_device *vdev);
-static int vfio_pci_try_zap_and_vma_lock_cb(struct pci_dev *pdev, void *data);
+static int vfio_pci_mem_trylock_and_zap_cb(struct pci_dev *pdev, void *data);
 
 /*
  * INTx masking requires the ability to disable INTx signaling via PCI_COMMAND
@@ -1168,7 +1168,7 @@ static long vfio_pci_ioctl(void *device_data,
struct vfio_pci_group_info info;
struct vfio_devices devs = { .cur_index = 0 };
bool slot = false;
-   int i, group_idx, mem_idx = 0, count = 0, ret = 0;
+   int i, group_idx, count = 0, ret = 0;
 
minsz = offsetofend(struct vfio_pci_hot_reset, count);
 
@@ -1268,32 +1268,16 @@ static long vfio_pci_ioctl(void *device_data,
}
 
/*
-* We need to get memory_lock for each device, but devices
-* can share mmap_lock, therefore we need to zap and hold
-* the vma_lock for each device, and only then get each
-* memory_lock.
+* Try to get the memory_lock write lock for all devices and
+* zap all BAR mmaps.
 */
ret = vfio_pci_for_each_slot_or_bus(vdev->pdev,
-   vfio_pci_try_zap_and_vma_lock_cb,
+   vfio_pci_mem_trylock_and_zap_cb,
					    &devs, slot);
-   if (ret)
-   goto hot_reset_release;
-
-   for (; mem_idx < devs.cur_index; mem_idx++) {
-   struct vfio_pci_device *tmp;
-
-   tmp = vfio_device_data(devs.devices[mem_idx]);
-
-   ret = down_write_trylock(&tmp->memory_lock);
-   if (!ret) {
-   ret = -EBUSY;
-   goto hot_reset_release;
-   }
-   mutex_unlock(&tmp->vma_lock);
-   }
 
/* User has access, do the reset */
-   ret = pci_reset_bus(vdev->pdev);
+   if (!ret)
+   ret = pci_reset_bus(vdev->pdev);
 
 hot_reset_release:
for (i = 0; i < devs.cur_index; i++) {
@@ -1303,10 +1287,7 @@ static long vfio_pci_ioctl(void *device_data,
device = devs.devices[i];
tmp = vfio_device_data(device);
 
-   if (i < mem_idx)
-   up_write(&tmp->memory_lock);
-   else
-   mutex_unlock(&tmp->vma_lock);
+   up_write(&tmp->memory_lock);
vfio_device_put(device);
}
kfree(devs.devices);
@@ -1452,100 +1433,18 @@ static ssize_t vfio_pci_write(void *device_data, const char __user *buf,
return vfio_pci_rw(device_data, (char __user *)buf, count, ppos, true);
 }
 
-/* Return 1 on zap and vma_lock acquired, 0 on contention (only with @try) */
-static int vfio_pci_zap_and_vma_lock(struct vfio_pci_device *vdev, bool try)
+static void vfio_pci_zap_bars(struct vfio_pci_device *vdev)
 {
-   struct vfio_pci_mmap_vma *mmap_vma, *tmp;
-
-   /*
-* Lock ordering:
-* vma_lock is nested under mmap_lock for vm_ops callback paths.
-* The memory_lock semaphore is used by both code paths calling
-* into this function to zap vmas and the vm_ops.fault callback
-* to protect the memory enable state of the device.
-*
-* When zapping vmas we need to maintain the mmap_lock => vma_lock
-* ordering, which requires using vma_lock to walk vma_list to
-* acquire an mm, then dropping vma_lock to get the mmap_lock and
-* reacquiring vma_lock.  This logic is derived from similar
-* requirements in uverbs_user_mmap_disassociate().
-*
-* mmap_lock must always be the top-level lock when it is taken.
-* Therefore we can only hold the memory_lock write lock when
-* vma_list is empty, as we'd need to take