Re: [PATCH 1/3] vfio/type1: Support faulting PFNMAP vmas

2020-05-04 Thread Alex Williamson
On Fri, 1 May 2020 20:50:33 -0300
Jason Gunthorpe  wrote:

> On Fri, May 01, 2020 at 03:39:08PM -0600, Alex Williamson wrote:
> > With conversion to follow_pfn(), DMA mapping a PFNMAP range depends on
> > the range being faulted into the vma.  Add support to manually provide
> > that, in the same way as done on KVM with hva_to_pfn_remapped().
> > 
> > Signed-off-by: Alex Williamson 
> >  drivers/vfio/vfio_iommu_type1.c |   36 +---
> >  1 file changed, 33 insertions(+), 3 deletions(-)
> > 
> > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > > index cc1d64765ce7..4a4cb7cd86b2 100644
> > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -317,6 +317,32 @@ static int put_pfn(unsigned long pfn, int prot)
> > return 0;
> >  }
> >  
> > > +static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
> > +   unsigned long vaddr, unsigned long *pfn,
> > +   bool write_fault)
> > +{
> > +   int ret;
> > +
> > +   ret = follow_pfn(vma, vaddr, pfn);
> > +   if (ret) {
> > +   bool unlocked = false;
> > +
> > +   ret = fixup_user_fault(NULL, mm, vaddr,
> > +  FAULT_FLAG_REMOTE |
> > +  (write_fault ?  FAULT_FLAG_WRITE : 0),
> > +  );
> > +   if (unlocked)
> > +   return -EAGAIN;
> > +
> > +   if (ret)
> > +   return ret;
> > +
> > +   ret = follow_pfn(vma, vaddr, pfn);
> > +   }
> > +
> > +   return ret;
> > +}
> > +
> >  static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
> >  int prot, unsigned long *pfn)
> >  {
> > @@ -339,12 +365,16 @@ static int vaddr_get_pfn(struct mm_struct *mm, 
> > unsigned long vaddr,
> >  
> > vaddr = untagged_addr(vaddr);
> >  
> > +retry:
> > vma = find_vma_intersection(mm, vaddr, vaddr + 1);
> >  
> > if (vma && vma->vm_flags & VM_PFNMAP) {
> > -   if (!follow_pfn(vma, vaddr, pfn) &&
> > -   is_invalid_reserved_pfn(*pfn))
> > -   ret = 0;
> > +   ret = follow_fault_pfn(vma, mm, vaddr, pfn, prot & IOMMU_WRITE);
> > +   if (ret == -EAGAIN)
> > +   goto retry;
> > +
> > +   if (!ret && !is_invalid_reserved_pfn(*pfn))
> > +   ret = -EFAULT;  
> 
> I suggest checking vma->vm_ops == &vfio_pci_mmap_ops and adding a
> comment that this is racy and needs to be fixed up. The ops check
> makes this only used by other vfio bars and should prevent some
> abuses of this hacky thing

We can't do that, vfio-pci is only one bus driver within the vfio
ecosystem.

> However, I wonder if this could just link itself into the
> vma->private data so that when the vfio that owns the bar goes away,
> so does the iommu mapping?

I don't really see why we wouldn't use mmu notifiers so that the vfio
iommu backend and vfio bus driver remain independent.

> I feel like this patch set is not complete unless it also handles the
> shootdown of this path too?

It would be nice to solve both issues and I'll start working on the mmu
notifier side of things, but this series does solve a real issue on
its own and we're not changing the iommu mapping behavior here.  Thanks,

Alex



[GIT PULL] VFIO fixes for v5.7-rc4

2020-05-01 Thread Alex Williamson
Hi Linus,

The following changes since commit ae83d0b416db002fe95601e7f97f64b59514d936:

  Linux 5.7-rc2 (2020-04-19 14:35:30 -0700)

are available in the Git repository at:

  git://github.com/awilliam/linux-vfio.git tags/vfio-v5.7-rc4

for you to fetch changes up to 5cbf3264bc715e9eb384e2b68601f8c02bb9a61d:

  vfio/type1: Fix VA->PA translation for PFNMAP VMAs in vaddr_get_pfn() (2020-04-23 12:10:01 -0600)


VFIO fixes for v5.7-rc4

 - copy_*_user validity check for new vfio_dma_rw interface (Yan Zhao)

 - Fix a potential math overflow (Yan Zhao)

 - Use follow_pfn() for calculating PFNMAPs (Sean Christopherson)


Sean Christopherson (1):
  vfio/type1: Fix VA->PA translation for PFNMAP VMAs in vaddr_get_pfn()

Yan Zhao (2):
  vfio: checking of validity of user vaddr in vfio_dma_rw
  vfio: avoid possible overflow in vfio_iommu_type1_pin_pages

 drivers/vfio/vfio_iommu_type1.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)



[PATCH] vfio-pci: Mask cap zero

2020-05-01 Thread Alex Williamson
There is no PCI spec defined capability with ID 0, therefore we don't
expect to find it in a capability chain and we use this index in an
internal array for tracking the sizes of various capabilities to handle
standard config space.  Therefore if a device does present us with a
capability ID 0, we mark our capability map with nonsense that can
trigger conflicts with other capabilities in the chain.  Ignore ID 0
when walking the capability chain, handling it as a hidden capability.

Seen on an NVIDIA Tesla T4.

Signed-off-by: Alex Williamson 
---
 drivers/vfio/pci/vfio_pci_config.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index 87d0cc8c86ad..5935a804cb88 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -1487,7 +1487,7 @@ static int vfio_cap_init(struct vfio_pci_device *vdev)
if (ret)
return ret;
 
-   if (cap <= PCI_CAP_ID_MAX) {
+   if (cap && cap <= PCI_CAP_ID_MAX) {
len = pci_cap_length[cap];
if (len == 0xFF) { /* Variable length */
len = vfio_cap_len(vdev, cap, pos);
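
For readers following along, the effect of the added "cap &&" test can be
sketched in userspace.  This is only an illustration under stated
assumptions: a 256-byte config space image, and the hypothetical names
walk_caps() and CAP_ID_MAX, none of which are the vfio-pci code itself.

```c
#include <stddef.h>
#include <stdint.h>

#define CAP_LIST_PTR 0x34	/* offset of the capability list pointer */
#define CAP_ID_MAX   0x14	/* illustrative upper bound (assumption) */

/*
 * Walk the standard capability chain of a 256-byte config space image
 * and record each accepted capability ID in seen[].  ID 0 is skipped,
 * mirroring the "cap && cap <= PCI_CAP_ID_MAX" test in the patch.
 * Returns the number of IDs recorded.
 */
static int walk_caps(const uint8_t cfg[256], uint8_t seen[], size_t max_seen)
{
	uint8_t pos = cfg[CAP_LIST_PTR] & ~3;	/* low two bits are reserved */
	int loops = 48;				/* (256 - 64) / 4 slots, guards against loops */
	size_t n = 0;

	while (pos >= 0x40 && loops--) {
		uint8_t cap = cfg[pos];
		uint8_t next = cfg[pos + 1] & ~3;

		/* ID 0 is not tracked, but the walk still follows its next pointer */
		if (cap && cap <= CAP_ID_MAX && n < max_seen)
			seen[n++] = cap;
		pos = next;
	}
	return (int)n;
}
```

With a chain of ID 0x01, then a bogus ID 0, then ID 0x10, the walk records
only 0x01 and 0x10 and still reaches the entry behind the bogus one.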



[PATCH 3/3] vfio-pci: Invalidate mmaps and block MMIO access on disabled memory

2020-05-01 Thread Alex Williamson
Accessing the disabled memory space of a PCI device would typically
result in a master abort response on conventional PCI, or an
unsupported request on PCI express.  The user would generally see
these as a -1 response for the read return data and the write would be
silently discarded, possibly with an uncorrected, non-fatal AER error
triggered on the host.  Some systems however take it upon themselves
to bring down the entire system when they see something that might
indicate a loss of data, such as this discarded write to a disabled
memory space.

To avoid this, we want to try to block the user from accessing memory
spaces while they're disabled.  We start with a semaphore around the
memory enable bit, where writers modify the memory enable state and
must be serialized, while readers make use of the memory region and
can access in parallel.  Writers include both direct manipulation via
the command register, as well as any reset path where the internal
mechanics of the reset may both explicitly and implicitly disable
memory access, and manipulation of the MSI-X configuration, where the
MSI-X vector table resides in MMIO space of the device.  Readers
include the read and write file ops to access the vfio device fd
offsets as well as memory mapped access.  In the latter case, we make
use of our new vma list support to zap, or invalidate, those memory
mappings in order to force them to be faulted back in on access.

Our semaphore usage will stall user access to MMIO spaces across
internal operations like reset, but the user might experience new
behavior when trying to access the MMIO space while disabled via the
PCI command register.  Access via read or write while disabled will
return -EIO and access via memory maps will result in a SIGBUS.  This
is expected to be compatible with known use cases and potentially
provides better error handling capabilities than present in the
hardware, while avoiding the more readily accessible and severe
platform error responses that might otherwise occur.

Signed-off-by: Alex Williamson 
---
 drivers/vfio/pci/vfio_pci.c |  200 ---
 drivers/vfio/pci/vfio_pci_config.c  |   31 +
 drivers/vfio/pci/vfio_pci_intrs.c   |   18 +++
 drivers/vfio/pci/vfio_pci_private.h |4 +
 drivers/vfio/pci/vfio_pci_rdwr.c|   12 ++
 5 files changed, 246 insertions(+), 19 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index da2fef666d9c..ce2bb3e62b18 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "vfio_pci_private.h"
 
@@ -184,6 +185,7 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
 
 static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev);
 static void vfio_pci_disable(struct vfio_pci_device *vdev);
+static int vfio_pci_lock_mem(struct pci_dev *pdev, void *data);
 
 /*
  * INTx masking requires the ability to disable INTx signaling via PCI_COMMAND
@@ -736,6 +738,12 @@ int vfio_pci_register_dev_region(struct vfio_pci_device *vdev,
return 0;
 }
 
+struct vfio_devices {
+   struct vfio_device **devices;
+   int cur_index;
+   int max_index;
+};
+
 static long vfio_pci_ioctl(void *device_data,
   unsigned int cmd, unsigned long arg)
 {
@@ -984,8 +992,17 @@ static long vfio_pci_ioctl(void *device_data,
return ret;
 
} else if (cmd == VFIO_DEVICE_RESET) {
-   return vdev->reset_works ?
-   pci_try_reset_function(vdev->pdev) : -EINVAL;
+   int ret;
+
+   if (!vdev->reset_works)
+   return -EINVAL;
+
+   down_write(&vdev->memory_lock);
+   vfio_pci_zap_mmap_vmas(vdev);
+   ret = pci_try_reset_function(vdev->pdev);
+   up_write(&vdev->memory_lock);
+
+   return ret;
 
} else if (cmd == VFIO_DEVICE_GET_PCI_HOT_RESET_INFO) {
struct vfio_pci_hot_reset_info hdr;
@@ -1065,6 +1082,7 @@ static long vfio_pci_ioctl(void *device_data,
int32_t *group_fds;
struct vfio_pci_group_entry *groups;
struct vfio_pci_group_info info;
+   struct vfio_devices devs = { .cur_index = 0 };
bool slot = false;
int i, count = 0, ret = 0;
 
@@ -1153,11 +1171,39 @@ static long vfio_pci_ioctl(void *device_data,
ret = vfio_pci_for_each_slot_or_bus(vdev->pdev,
vfio_pci_validate_devs,
 &info, slot);
-   if (!ret)
-   /* User has access, do the reset */
-   ret = pci_reset_bus(vdev->pdev);
+   if (ret)
+   goto hot_reset_release;
+
+   devs.max_index = count;
+   devs.devi

[PATCH 1/3] vfio/type1: Support faulting PFNMAP vmas

2020-05-01 Thread Alex Williamson
With conversion to follow_pfn(), DMA mapping a PFNMAP range depends on
the range being faulted into the vma.  Add support to manually provide
that, in the same way as done on KVM with hva_to_pfn_remapped().

Signed-off-by: Alex Williamson 
---
 drivers/vfio/vfio_iommu_type1.c |   36 +---
 1 file changed, 33 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index cc1d64765ce7..4a4cb7cd86b2 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -317,6 +317,32 @@ static int put_pfn(unsigned long pfn, int prot)
return 0;
 }
 
+static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
+   unsigned long vaddr, unsigned long *pfn,
+   bool write_fault)
+{
+   int ret;
+
+   ret = follow_pfn(vma, vaddr, pfn);
+   if (ret) {
+   bool unlocked = false;
+
+   ret = fixup_user_fault(NULL, mm, vaddr,
+  FAULT_FLAG_REMOTE |
+  (write_fault ?  FAULT_FLAG_WRITE : 0),
+  &unlocked);
+   if (unlocked)
+   return -EAGAIN;
+
+   if (ret)
+   return ret;
+
+   ret = follow_pfn(vma, vaddr, pfn);
+   }
+
+   return ret;
+}
+
 static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 int prot, unsigned long *pfn)
 {
@@ -339,12 +365,16 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 
vaddr = untagged_addr(vaddr);
 
+retry:
vma = find_vma_intersection(mm, vaddr, vaddr + 1);
 
if (vma && vma->vm_flags & VM_PFNMAP) {
-   if (!follow_pfn(vma, vaddr, pfn) &&
-   is_invalid_reserved_pfn(*pfn))
-   ret = 0;
+   ret = follow_fault_pfn(vma, mm, vaddr, pfn, prot & IOMMU_WRITE);
+   if (ret == -EAGAIN)
+   goto retry;
+
+   if (!ret && !is_invalid_reserved_pfn(*pfn))
+   ret = -EFAULT;
}
 done:
 up_read(&mm->mmap_sem);
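
The lookup, fault, and retry flow above can be modelled in userspace.
This is a toy sketch, not kernel code: struct toy_vma and the toy_*()
helpers are hypothetical stand-ins for the vma, follow_pfn() and
fixup_user_fault(), and the "lock drop" is simulated with a flag.

```c
#include <errno.h>
#include <stdbool.h>

#define NPAGES 4

/* Toy stand-in for a vma: pte[i] == 0 means "not faulted in yet". */
struct toy_vma {
	unsigned long pte[NPAGES];	/* pfn per page once populated */
	unsigned long base_pfn;		/* pfn the fault handler installs */
	bool drop_lock_once;		/* next fault reports a lock drop */
};

/* Models follow_pfn(): succeeds only if the page is already populated. */
static int toy_follow_pfn(struct toy_vma *vma, unsigned int idx,
			  unsigned long *pfn)
{
	if (idx >= NPAGES || !vma->pte[idx])
		return -EINVAL;
	*pfn = vma->pte[idx];
	return 0;
}

/* Models fixup_user_fault(): populates the page, may report a lock drop. */
static int toy_fixup_fault(struct toy_vma *vma, unsigned int idx,
			   bool *unlocked)
{
	if (idx >= NPAGES)
		return -EFAULT;
	vma->pte[idx] = vma->base_pfn + idx;
	if (vma->drop_lock_once) {
		vma->drop_lock_once = false;
		*unlocked = true;
	}
	return 0;
}

/* Same shape as follow_fault_pfn() in the patch. */
static int toy_follow_fault_pfn(struct toy_vma *vma, unsigned int idx,
				unsigned long *pfn)
{
	int ret = toy_follow_pfn(vma, idx, pfn);

	if (ret) {
		bool unlocked = false;

		ret = toy_fixup_fault(vma, idx, &unlocked);
		if (unlocked)
			return -EAGAIN;	/* vma may be stale, look it up again */
		if (ret)
			return ret;
		ret = toy_follow_pfn(vma, idx, pfn);
	}
	return ret;
}

/* Caller loop corresponding to the "retry:" label in vaddr_get_pfn(). */
static int toy_get_pfn(struct toy_vma *vma, unsigned int idx,
		       unsigned long *pfn, int *retries)
{
	int ret;

	*retries = 0;
	while ((ret = toy_follow_fault_pfn(vma, idx, pfn)) == -EAGAIN)
		(*retries)++;
	return ret;
}
```

In the toy, a first lookup of an unpopulated page faults it in, reports
the simulated lock drop, and succeeds on the single retry; later lookups
of the same page succeed without faulting.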



[PATCH 2/3] vfio-pci: Fault mmaps to enable vma tracking

2020-05-01 Thread Alex Williamson
Rather than calling remap_pfn_range() when a region is mmap'd, setup
a vm_ops handler to support dynamic faulting of the range on access.
This allows us to manage a list of vmas actively mapping the area that
we can later use to invalidate those mappings.  The open callback
invalidates the vma range so that all tracking is inserted in the
fault handler and removed in the close handler.

Signed-off-by: Alex Williamson 
---
 drivers/vfio/pci/vfio_pci.c |   76 ++-
 drivers/vfio/pci/vfio_pci_private.h |7 +++
 2 files changed, 81 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 6c6b37b5c04e..da2fef666d9c 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1299,6 +1299,70 @@ static ssize_t vfio_pci_write(void *device_data, const char __user *buf,
return vfio_pci_rw(device_data, (char __user *)buf, count, ppos, true);
 }
 
+static int vfio_pci_add_vma(struct vfio_pci_device *vdev,
+   struct vm_area_struct *vma)
+{
+   struct vfio_pci_mmap_vma *mmap_vma;
+
+   mmap_vma = kzalloc(sizeof(*mmap_vma), GFP_KERNEL);
+   if (!mmap_vma)
+   return -ENOMEM;
+
+   mmap_vma->vma = vma;
+
+   mutex_lock(&vdev->vma_lock);
+   list_add(&mmap_vma->vma_next, &vdev->vma_list);
+   mutex_unlock(&vdev->vma_lock);
+
+   return 0;
+}
+
+/*
+ * Zap mmaps on open so that we can fault them in on access and therefore
+ * our vma_list only tracks mappings accessed since last zap.
+ */
+static void vfio_pci_mmap_open(struct vm_area_struct *vma)
+{
+   zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
+}
+
+static void vfio_pci_mmap_close(struct vm_area_struct *vma)
+{
+   struct vfio_pci_device *vdev = vma->vm_private_data;
+   struct vfio_pci_mmap_vma *mmap_vma;
+
+   mutex_lock(&vdev->vma_lock);
+   list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next) {
+   if (mmap_vma->vma == vma) {
+   list_del(&mmap_vma->vma_next);
+   kfree(mmap_vma);
+   break;
+   }
+   }
+   mutex_unlock(&vdev->vma_lock);
+}
+
+static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
+{
+   struct vm_area_struct *vma = vmf->vma;
+   struct vfio_pci_device *vdev = vma->vm_private_data;
+
+   if (vfio_pci_add_vma(vdev, vma))
+   return VM_FAULT_OOM;
+
+   if (remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
+   vma->vm_end - vma->vm_start, vma->vm_page_prot))
+   return VM_FAULT_SIGBUS;
+
+   return VM_FAULT_NOPAGE;
+}
+
+static const struct vm_operations_struct vfio_pci_mmap_ops = {
+   .open = vfio_pci_mmap_open,
+   .close = vfio_pci_mmap_close,
+   .fault = vfio_pci_mmap_fault,
+};
+
 static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
 {
struct vfio_pci_device *vdev = device_data;
@@ -1357,8 +1421,14 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
 
-   return remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
-  req_len, vma->vm_page_prot);
+   /*
+* See remap_pfn_range(), called from vfio_pci_mmap_fault() but we can't
+* change vm_flags within the fault handler.  Set them now.
+*/
+   vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
+   vma->vm_ops = &vfio_pci_mmap_ops;
+
+   return 0;
 }
 
 static void vfio_pci_request(void *device_data, unsigned int count)
@@ -1608,6 +1678,8 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 spin_lock_init(&vdev->irqlock);
 mutex_init(&vdev->ioeventfds_lock);
 INIT_LIST_HEAD(&vdev->ioeventfds_list);
+   mutex_init(&vdev->vma_lock);
+   INIT_LIST_HEAD(&vdev->vma_list);
 
 ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
if (ret)
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 36ec69081ecd..9b25f9f6ce1d 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -92,6 +92,11 @@ struct vfio_pci_vf_token {
int users;
 };
 
+struct vfio_pci_mmap_vma {
+   struct vm_area_struct   *vma;
+   struct list_headvma_next;
+};
+
 struct vfio_pci_device {
struct pci_dev  *pdev;
void __iomem*barmap[PCI_STD_NUM_BARS];
@@ -132,6 +137,8 @@ struct vfio_pci_device {
struct list_headioeventfds_list;
struct vfio_pci_vf_token*vf_token;
struct notifier_block   nb;
+   struct mutexvma_lock;
+   struct list_headvma_list;
 };
 
 #define is_intx(vdev) (vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX)
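
The tracking scheme in this patch (add on fault, remove on close, walk the
list later to invalidate) can be sketched in userspace as below.  All
names are hypothetical, a plain singly linked list stands in for the
kernel list_head, and the vma_lock mutex is omitted for brevity.

```c
#include <stdlib.h>

/* Toy stand-in for a user mapping (a vma in the real code). */
struct user_map {
	int id;
};

struct tracked_map {
	struct user_map *map;
	struct tracked_map *next;
};

struct toy_dev {
	struct tracked_map *vma_list;	/* mappings faulted in since last zap */
};

/* Fault path: remember that this mapping now has live PTEs. */
static int toy_add_vma(struct toy_dev *dev, struct user_map *map)
{
	struct tracked_map *t = calloc(1, sizeof(*t));

	if (!t)
		return -1;
	t->map = map;
	t->next = dev->vma_list;
	dev->vma_list = t;
	return 0;
}

/* Close path: forget the mapping if it was being tracked. */
static void toy_del_vma(struct toy_dev *dev, struct user_map *map)
{
	struct tracked_map **pp = &dev->vma_list;

	while (*pp) {
		if ((*pp)->map == map) {
			struct tracked_map *dead = *pp;

			*pp = dead->next;
			free(dead);
			return;
		}
		pp = &(*pp)->next;
	}
}

/* Zap path: drop every tracked mapping, return how many were dropped. */
static int toy_zap_all(struct toy_dev *dev)
{
	int n = 0;

	while (dev->vma_list) {
		struct tracked_map *dead = dev->vma_list;

		dev->vma_list = dead->next;
		free(dead);
		n++;
	}
	return n;
}
```

Because entries are only added in the fault path, the list after a zap
holds exactly the mappings that were touched again since that zap.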



[PATCH 0/3] vfio-pci: Block user access to disabled device MMIO

2020-05-01 Thread Alex Williamson
Add tracking of the device memory enable bit and block/fault accesses
to device MMIO space while disabled.  This provides synchronous fault
handling for CPU accesses to the device and prevents the user from
triggering platform level error handling present on some systems.
Device reset and MSI-X vector table accesses are also included such
that access is blocked across reset and vector table accesses do not
depend on the user configuration of the device.

This is based on the vfio for-linus branch currently in next, making
use of follow_pfn() in vaddr_get_pfn() and therefore requiring patch
1/3 to force the user fault in the case that a PFNMAP vma might be
DMA mapped before user access.  Further PFNMAP iommu invalidation
tracking is not yet included here.

As noted in the comments, I'm copying quite a bit of the logic from
rdma code for performing the zap_vma_ptes() calls and I'm also
attempting to resolve lock ordering issues in the fault handler to
lockdep's satisfaction.  I appreciate extra eyes on these sections in
particular.

I expect this to be functionally equivalent for any well behaved
userspace driver, but obviously there is a potential for the user to
get -EIO or SIGBUS on device access.  The device is provided to the
user enabled and device resets will restore the command register, so
by my evaluation a user would need to explicitly disable the memory
enable bit to trigger these faults.  We could potentially remap vmas
to a zero page rather than SIGBUS if we experience regressions, but
without known code requiring that, SIGBUS seems the appropriate
response to this condition.  Thanks,

Alex

---

Alex Williamson (3):
  vfio/type1: Support faulting PFNMAP vmas
  vfio-pci: Fault mmaps to enable vma tracking
  vfio-pci: Invalidate mmaps and block MMIO access on disabled memory


 drivers/vfio/pci/vfio_pci.c |  268 +--
 drivers/vfio/pci/vfio_pci_config.c  |   31 
 drivers/vfio/pci/vfio_pci_intrs.c   |   18 ++
 drivers/vfio/pci/vfio_pci_private.h |   11 +
 drivers/vfio/pci/vfio_pci_rdwr.c|   12 ++
 drivers/vfio/vfio_iommu_type1.c |   36 -
 6 files changed, 356 insertions(+), 20 deletions(-)
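
As a rough illustration of the user-visible behavior described above,
here is a userspace sketch of gating MMIO reads on the memory enable bit.
The struct and helper names are hypothetical and the rw-semaphore is left
out; only PCI_COMMAND_MEMORY (bit 1 of the command register) is taken
from the PCI definitions.

```c
#include <errno.h>
#include <stdint.h>

#define PCI_COMMAND_MEMORY 0x2	/* memory space enable bit */

struct toy_pci_dev {
	uint16_t command;	/* shadow of the PCI command register */
	uint32_t bar[4];	/* stand-in for a small MMIO BAR */
};

/*
 * Reader side: refuse MMIO reads while memory decode is disabled.
 * The real series would hold the memory lock for read around this.
 */
static int toy_mmio_read(const struct toy_pci_dev *dev, unsigned int idx,
			 uint32_t *val)
{
	if (!(dev->command & PCI_COMMAND_MEMORY))
		return -EIO;
	if (idx >= 4)
		return -EINVAL;
	*val = dev->bar[idx];
	return 0;
}

/*
 * Writer side: a reset that restores the saved command register.
 * The real series would hold the memory lock for write across this.
 */
static void toy_reset(struct toy_pci_dev *dev)
{
	uint16_t saved = dev->command;

	dev->command = 0;		/* decode is off while the reset runs */
	for (unsigned int i = 0; i < 4; i++)
		dev->bar[i] = 0;
	dev->command = saved;
}
```

A well-behaved user never sees -EIO in this model; it only appears after
the user clears the memory enable bit itself.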



Re: [regression?] Re: [PATCH v6 06/12] mm/gup: track FOLL_PIN pages

2020-04-29 Thread Alex Williamson
On Tue, 28 Apr 2020 21:29:03 -0300
Jason Gunthorpe  wrote:

> On Tue, Apr 28, 2020 at 02:12:23PM -0600, Alex Williamson wrote:
> 
> > > > Maybe I was just getting lucky before this commit.  For a
> > > > VM_PFNMAP, vaddr_get_pfn() only needs pin_user_pages_remote() to return
> > > > error and the vma information that we setup in vfio_pci_mmap().
> > > 
> > > I've written on this before, vfio should not be passing pages to the
> > > iommu that it cannot pin eg it should not touch VM_PFNMAP vma's in the
> > > first place.
> > > 
> > > It is a use-after-free security issue the way it is..  
> > 
> > > Where is the use after free?  Here I'm trying to map device mmio space
> > through the iommu, which we need to enable p2p when the user owns
> > multiple devices.  
> 
> Yes, I gathered what the intent was..
> 
> > The device is owned by the user, bound to vfio-pci, and can't be
> > unbound while the user has it open.  The iommu mappings are torn
> > down on release.  I guess I don't understand the problem.  
> 
> For PFNMAP VMAs the lifecycle rule is basically that the PFN inside
> the VMA can only be used inside the mmap_sem that read it. Ie you
> cannot take a PFN outside the mmap_sem and continue to use it.
> 
> This is because the owner of the VMA owns the lifetime of that PFN,
> and under the write side of the mmap_sem it can zap the PFN, or close
> the VMA. Afterwards the VMA owner knows that there are no active
> reference to the PFN in the system and can reclaim the PFN
> 
> ie the PFNMAP has no per-page pin counter. All lifetime revolves around
> the mmap_sem and the vma.
> 
> What vfio does is take the PFN out of the mmap_sem and program it into
> the iommu.
> 
> So when the VMA owner decides the PFN has no references, it actually
> doesn't: vfio continues to access it beyond its permitted lifetime.
> 
> HW like mlx5 and GPUs have BAR pages which have security
> properties. Once the PFN is returned to the driver the security
> context of the PFN can be reset and re-assigned to another
> process. Using VFIO a hostile user space can retain access to the BAR
> page and upon its reassignment access a security context they were not
> permitted to access.
> 
> This is why GUP does not return PFNMAP pages and vfio should not carry
> a reference outside the mmap_sem. It breaks all the lifetime rules.

Thanks for the explanation.  I'm inferring that there is no solution to
this, but why can't we use mmu notifiers to invalidate the iommu on zap
or close?  I know that at least QEMU won't consider these sorts of
mapping fatal, so we could possibly change the default and make support
for such mappings opt-in, but I don't know if I'd break DPDK, or
potentially users within QEMU that make use of p2p between devices.
Thanks,

Alex
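
To make the notifier idea concrete, a toy userspace model follows.  It is
purely illustrative: the types and the callback are hypothetical and do
not correspond to the mmu notifier API; they only show the owner of a
PFN telling a downstream user to drop its mapping before the PFN is
reclaimed.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy IOMMU entry: the device can reach "pfn" only while "valid" is set. */
struct toy_iommu_map {
	unsigned long pfn;
	bool valid;
};

/* Hypothetical invalidation hook, standing in for a notifier callback. */
typedef void (*toy_invalidate_fn)(void *ctx, unsigned long pfn);

/* Toy owner of a PFNMAP range; it decides when the pfn's lifetime ends. */
struct toy_pfn_owner {
	unsigned long pfn;
	toy_invalidate_fn notify;
	void *notify_ctx;
};

/* Downstream user: tear down the IOMMU entry for the zapped pfn. */
static void toy_iommu_invalidate(void *ctx, unsigned long pfn)
{
	struct toy_iommu_map *map = ctx;

	if (map->valid && map->pfn == pfn)
		map->valid = false;
}

/* Owner zaps the mapping: notify any registered user before reclaim. */
static void toy_owner_zap(struct toy_pfn_owner *owner)
{
	if (owner->notify)
		owner->notify(owner->notify_ctx, owner->pfn);
}
```

Without a registered hook the IOMMU entry survives the zap, which is the
stale-access case described above; with the hook it is cleared first.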



Re: [regression?] Re: [PATCH v6 06/12] mm/gup: track FOLL_PIN pages

2020-04-28 Thread Alex Williamson
On Tue, 28 Apr 2020 16:22:51 -0300
Jason Gunthorpe  wrote:

> On Tue, Apr 28, 2020 at 01:07:52PM -0600, Alex Williamson wrote:
> > On Tue, 28 Apr 2020 14:49:57 -0300
> > Jason Gunthorpe  wrote:
> >   
> > > On Tue, Apr 28, 2020 at 10:54:55AM -0600, Alex Williamson wrote:  
> > > >  static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
> > > >  {
> > > > struct vfio_pci_device *vdev = device_data;
> > > > @@ -1253,8 +1323,14 @@ static int vfio_pci_mmap(void *device_data, 
> > > > struct vm_area_struct *vma)
> > > > vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> > > > vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) 
> > > > + pgoff;
> > > >  
> > > > +   vma->vm_ops = _pci_mmap_ops;
> > > > +
> > > > +#if 1
> > > > +   return 0;
> > > > +#else
> > > > return remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
> > > > -  req_len, vma->vm_page_prot);
> > > > +  vma->vm_end - vma->vm_start, 
> > > > vma->vm_page_prot);
> > > 
> > > The remap_pfn_range here is what tells get_user_pages this is a
> > > non-struct page mapping:
> > > 
> > >   vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
> > > 
> > > Which has to be set when the VMA is created, they shouldn't be
> > > modified during fault.  
> > 
> > Aha, thanks Jason!  So fundamentally, pin_user_pages_remote() should
> > never have been faulting in this vma since the pages are non-struct
> > page backed.   
> 
> gup should not try to pin them.. I think the VM will still call fault
> though, not sure from memory?

Hmm, at commit 3faa52c03f44 the behavior is that I don't see a fault on
pin, maybe that's a bug.  But trying to rebase to current top of tree,
now my DMA mapping gets an -EFAULT, so something is still funky :-\

> > Maybe I was just getting lucky before this commit.  For a
> > VM_PFNMAP, vaddr_get_pfn() only needs pin_user_pages_remote() to return
> > error and the vma information that we setup in vfio_pci_mmap().  
> 
> I've written on this before, vfio should not be passing pages to the
> iommu that it cannot pin eg it should not touch VM_PFNMAP vma's in the
> first place.
> 
> It is a use-after-free security issue the way it is..

Where is the use after free?  Here I'm trying to map device mmio space
through the iommu, which we need to enable p2p when the user owns
multiple devices.  The device is owned by the user, bound to vfio-pci,
and can't be unbound while the user has it open.  The iommu mappings
are torn down on release.  I guess I don't understand the problem.

> > only need the fault handler to trigger for user access, which is what I
> > see with this change.  That should work for me.
> >   
> > > Also the vma code above looked a little strange to me, if you do send
> > > something like this cc me and I can look at it. I did some work like
> > > this for rdma a while ago..  
> > 
> > Cool, I'll do that.  I'd like to be able to zap the vmas from user
> > access at a later point and I have doubts that I'm holding the
> > refs/locks that I need to for that.  Thanks,  
> 
> Check rdma_umap_ops, it does what you described (actually it replaces
> them with 0 page, but along the way it zaps too).

Ok, thanks,

Alex



Re: [regression?] Re: [PATCH v6 06/12] mm/gup: track FOLL_PIN pages

2020-04-28 Thread Alex Williamson
On Tue, 28 Apr 2020 14:49:57 -0300
Jason Gunthorpe  wrote:

> On Tue, Apr 28, 2020 at 10:54:55AM -0600, Alex Williamson wrote:
> >  static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
> >  {
> > struct vfio_pci_device *vdev = device_data;
> > @@ -1253,8 +1323,14 @@ static int vfio_pci_mmap(void *device_data, struct 
> > vm_area_struct *vma)
> > vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> > vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
> >  
> > > +   vma->vm_ops = &vfio_pci_mmap_ops;
> > +
> > +#if 1
> > +   return 0;
> > +#else
> > return remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
> > -  req_len, vma->vm_page_prot);
> > +  vma->vm_end - vma->vm_start, vma->vm_page_prot); 
> >  
> 
> The remap_pfn_range here is what tells get_user_pages this is a
> non-struct page mapping:
> 
>   vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
> 
> Which has to be set when the VMA is created, they shouldn't be
> modified during fault.

Aha, thanks Jason!  So fundamentally, pin_user_pages_remote() should
never have been faulting in this vma since the pages are non-struct
page backed.  Maybe I was just getting lucky before this commit.  For a
VM_PFNMAP, vaddr_get_pfn() only needs pin_user_pages_remote() to return
error and the vma information that we setup in vfio_pci_mmap().  We
only need the fault handler to trigger for user access, which is what I
see with this change.  That should work for me.

> Also the vma code above looked a little strange to me, if you do send
> something like this cc me and I can look at it. I did some work like
> this for rdma a while ago..

Cool, I'll do that.  I'd like to be able to zap the vmas from user
access at a later point and I have doubts that I'm holding the
refs/locks that I need to for that.  Thanks,

Alex



Re: [regression?] Re: [PATCH v6 06/12] mm/gup: track FOLL_PIN pages

2020-04-28 Thread Alex Williamson
On Fri, 24 Apr 2020 15:58:29 -0700
John Hubbard  wrote:

> On 2020-04-24 13:15, Alex Williamson wrote:
> > On Fri, 24 Apr 2020 12:20:03 -0700
> > John Hubbard  wrote:
> >   
> >> On 2020-04-24 11:18, Alex Williamson wrote:
> >> ...  
> >>> Hi John,
> >>>
> >>> I'm seeing a regression bisected back to this commit (3faa52c03f44
> >>> mm/gup: track FOLL_PIN pages).  I've attached some vfio-pci test code
> >>> that reproduces this by mmap'ing a page of MMIO space of a device and
> >>> then tries to map that through the IOMMU, so this should be attempting
> >>> a gup/pin of a PFNMAP page.  Previously this failed gracefully (-EFAULT),
> >>> but now results in:  
> >>
> >>
> >> Hi Alex,
> >>
> >> Thanks for this report, and especially for source code to test it,
> >> seeing as how I can't immediately spot the problem just from the crash
> >> data so far.  I'll get set up and attempt a repro.
> >>
> >> Actually this looks like it should be relatively easier than the usual
> >> sort of "oops, we leaked a pin_user_pages() or unpin_user_pages() call,
> >> good luck finding which one" report that I fear the most. :) This one
> >> looks more like a crash that happens directly, when calling into the
> >> pin_user_pages_remote() code. Which should be a lot easier to solve...
> >>
> >> btw, if you are set up for it, it would be nice to know what source file
> >> and line number corresponds to the RIP (get_pfnblock_flags_mask+0x22)
> >> below. But if not, no problem, because I've likely got to do the repro
> >> in any case.  
> > 
> > Hey John,
> > 
> > TBH I'm feeling a lot less confident about this bisect.  This was
> > readily reproducible to me on a clean tree a bit ago, but now it
> > eludes me.  Let me go back and figure out what's going on before you
> > spend any more time on it.  Thanks,
> >   
> 
> OK. But I'm keeping the repro program! :)  It made it quick and easy to
> set up a vfio test, so it was worth doing in any case.

Great, because I've traced my steps, re-bisected and came back to the
same upstream commit with the same test program.  The major difference
is that I thought I was seeing this on pure upstream, but some vfio
code that I'm trying to prepare for upstream snuck in, so this isn't a
pure upstream regression, but the changes I was making worked on v5.6
and do not work with this commit.  Maybe still a latent regression,
maybe a bug in my changes.

> Anyway, I wanted to double check this just out of paranoia, and so
> now I have a data point for you: your test program runs and passes for
> me using today's linux.git kernel, with an NVIDIA GPU as the vfio
> device, and the kernel log is clean. No hint of any problems.

Yep, I agree.  The vfio change I'm experimenting with is to move the
remap_pfn_range() from vfio_pci_mmap() to a vm_ops.fault handler.  This
is why I have the test program creating an mmap of the device mmio
space and then immediately mapping it through the iommu without
touching it.  If the vma gets faulted in the dma mapping path via
pin_user_pages_remote(), I see the crash I reported initially.  If the
test program is modified to access the mmap before doing the dma
mapping, everything works normally.  In either case, the fault handler
is called and satisfies the fault with remap_pfn_range() and returns
VM_FAULT_NOPAGE (vfio patch attached).

Here's the crash I'm seeing with some further debugging:

BUG: unable to handle page fault for address: a5b8bfe14f38
#PF: supervisor read access in kernel mode
#PF: error_code(0x) - not-present page
PGD 0 P4D 0 
Oops:  [#1] SMP NOPTI
CPU: 70 PID: 3343 Comm: vfio-pci-dma-ma Not tainted 5.6.0-3faa52c03f44+ #20
Hardware name: AMD Corporation Diesel/Diesel, BIOS TDL100CB 03/17/2020
RIP: 0010:get_pfnblock_flags_mask+0x22/0x70
Code: c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 48 8b 05 bc e1 d9 01 48 89 f7 49 
89 c8 48 c1 ef 0f 48 85 c0 74 48 48 89 f1 48 c1 e9 17 <48> 8b 04 c8 48 85 c0 74 
0b 40 0f b6 ff 48 c1 e7 04 48 01 f8 48 c1
RSP: 0018:b2e3c910fcc8 EFLAGS: 00010216
RAX: 95b8bff5 RBX: 0001 RCX: 01fd89e7
RDX: 0002 RSI: fec4f3a899ba RDI: 0001fd89e751
RBP: b2e3c910fd88 R08: 0007 R09: 95a4aa79fce8
R10:  R11: b2e3c910f840 R12: 0001
R13:  R14: 0001 R15: 959caa266e80
FS:  7f1a95023740() GS:959caf18() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: a5b8bfe14f38 CR3: 000462a1e000 CR4: 003406e0
Call Trace:
 __gup_longterm_locked+0x274/0x620
 vaddr_get_pfn+0x74/0x110 [vfio_

[GIT PULL] VFIO fixes for v5.4-rc5

2019-10-23 Thread Alex Williamson
Hi Linus,

The following changes since commit 4f5cafb5cb8471e54afdc9054d973535614f7675:

  Linux 5.4-rc3 (2019-10-13 16:37:36 -0700)

are available in the Git repository at:

  git://github.com/awilliam/linux-vfio.git tags/vfio-v5.4-rc5

for you to fetch changes up to 95f89e090618efca63918b658c2002e57d393036:

  vfio/type1: Initialize resv_msi_base (2019-10-15 14:07:01 -0600)


VFIO fixes for v5.4-rc5

 - Fix (false) uninitialized variable warning (Joerg Roedel)


Joerg Roedel (1):
  vfio/type1: Initialize resv_msi_base

 drivers/vfio/vfio_iommu_type1.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)



Re: [PATCH] vfio/type1: Initialize resv_msi_base

2019-10-18 Thread Alex Williamson
On Tue, 15 Oct 2019 17:16:50 +0200
Joerg Roedel  wrote:

> From: Joerg Roedel 
> 
> After enabling CONFIG_IOMMU_DMA on X86 a new warning appears when
> compiling vfio:
> 
> drivers/vfio/vfio_iommu_type1.c: In function ‘vfio_iommu_type1_attach_group’:
> drivers/vfio/vfio_iommu_type1.c:1827:7: warning: ‘resv_msi_base’ may be used uninitialized in this function [-Wmaybe-uninitialized]
>ret = iommu_get_msi_cookie(domain->domain, resv_msi_base);
>^
> 
> The warning is a false positive, because the call to iommu_get_msi_cookie()
> only happens when vfio_iommu_has_sw_msi() returned true. And that only
> happens when it also set resv_msi_base.
> 
> But initialize the variable anyway to get rid of the warning.
> 
> Signed-off-by: Joerg Roedel 
> ---
>  drivers/vfio/vfio_iommu_type1.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 96fddc1dafc3..d864277ea16f 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -1658,7 +1658,7 @@ static int vfio_iommu_type1_attach_group(void 
> *iommu_data,
>   struct bus_type *bus = NULL;
>   int ret;
>   bool resv_msi, msi_remap;
> - phys_addr_t resv_msi_base;
> + phys_addr_t resv_msi_base = 0;
>   struct iommu_domain_geometry geo;
>   LIST_HEAD(iova_copy);
>   LIST_HEAD(group_resv_regions);

Thanks Joerg!  Added to vfio for-linus branch with Connie and Eric's
reviews for v5.4.  Thanks,

Alex


Re: [PATCH] vfio/type1: remove hugepage checks in is_invalid_reserved_pfn()

2019-10-18 Thread Alex Williamson
On Fri, 18 Oct 2019 14:42:32 +0800
Ben Luo  wrote:

> A friendly reminder :)
> 

Thanks Ben!  I've added this to the vfio next branch for v5.5 with
Andrea's R-b.  Thanks,

Alex

> On 2019/10/4 at 12:41 AM, Andrea Arcangeli wrote:
> > On Thu, Oct 03, 2019 at 11:49:42AM +0800, Ben Luo wrote:  
> >> Currently, no hugepage split code can transfer the reserved bit
> >> from head to tail during the split, so checking the head can't make
> >> a difference in a racing condition with hugepage splitting.
> >>
> >> The buddy wouldn't allow a driver to allocate an hugepage if any
> >> subpage is reserved in the e820 map at boot, if any driver sets the
> >> reserved bit of head page before mapping the hugepage in userland,
> >> it needs to set the reserved bit in all subpages to be safe.
> >>
> >> Signed-off-by: Ben Luo   
> > Reviewed-by: Andrea Arcangeli 
> >
> >  
> >> ---
> >>   drivers/vfio/vfio_iommu_type1.c | 26 --
> >>   1 file changed, 4 insertions(+), 22 deletions(-)
> >>
> >> diff --git a/drivers/vfio/vfio_iommu_type1.c 
> >> b/drivers/vfio/vfio_iommu_type1.c
> >> index 054391f..e2019ba 100644
> >> --- a/drivers/vfio/vfio_iommu_type1.c
> >> +++ b/drivers/vfio/vfio_iommu_type1.c
> >> @@ -287,31 +287,13 @@ static int vfio_lock_acct(struct vfio_dma *dma, long 
> >> npage, bool async)
> >>* Some mappings aren't backed by a struct page, for example an mmap'd
> >>* MMIO range for our own or another device.  These use a different
> >>* pfn conversion and shouldn't be tracked as locked pages.
> >> + * For compound pages, any driver that sets the reserved bit in head
> >> + * page needs to set the reserved bit in all subpages to be safe.
> >>*/
> >>   static bool is_invalid_reserved_pfn(unsigned long pfn)
> >>   {
> >> -  if (pfn_valid(pfn)) {
> >> -  bool reserved;
> >> -  struct page *tail = pfn_to_page(pfn);
> >> -  struct page *head = compound_head(tail);
> >> -  reserved = !!(PageReserved(head));
> >> -  if (head != tail) {
> >> -  /*
> >> -   * "head" is not a dangling pointer
> >> -   * (compound_head takes care of that)
> >> -   * but the hugepage may have been split
> >> -   * from under us (and we may not hold a
> >> -   * reference count on the head page so it can
> >> -   * be reused before we run PageReferenced), so
> >> -   * we've to check PageTail before returning
> >> -   * what we just read.
> >> -   */
> >> -  smp_rmb();
> >> -  if (PageTail(tail))
> >> -  return reserved;
> >> -  }
> >> -  return PageReserved(tail);
> >> -  }
> >> +  if (pfn_valid(pfn))
> >> +  return PageReserved(pfn_to_page(pfn));
> >>   
> >>return true;
> >>   }
> >> -- 
> >> 1.8.3.1
> >>  



Re: [PATCH RFC 0/1] VFIO: Region-specific file descriptors

2019-10-01 Thread Alex Williamson
On Mon, 30 Sep 2019 18:55:32 -0500
Shawn Anastasio  wrote:

> This patch adds region file descriptors to VFIO, a simple file descriptor type
> that allows read/write/mmap operations on a single region of a VFIO device.
> 
> This feature is particularly useful for privileged applications that use VFIO
> and wish to share file descriptors with unprivileged applications without
> handing over full control of the device.

Such as?  How do we define "privileged"?  VFIO already allows
"unprivileged applications" to own a device, only file permissions are
necessary for the VFIO group.  Does region level granularity really
allow us to claim that the consumer of this fd doesn't have full
control of the device?  Clearly device ioctls, including interrupts,
and DMA mappings are not granted with only access to a region, but said
unprivileged application may have absolute full control of the device
itself via that region.

> It also allows applications to use
> regular offsets in read/write/mmap instead of the region index + offset that
> must be used with device file descriptors.

How is this actually an issue that needs a solution?

> The current implementation is very raw (PCI only, no reference counting which
> is probably wrong), but I wanted to get a sense to see if this feature is
> desired. If it is, tips on how to implement this more correctly are
> appreciated.

Handling the ownership and life cycle of the region fds is the more
complicated problem.  If an unprivileged user has an mmap to a device
owned by a privileged user, how does it get revoked by the privileged
part of this equation?  How do we decide which regions merit this
support, for instance a device specific region could have just as
viable a use case as a BAR.  Why does this code limit support to
regions supporting mmap but then support read/write as well?

Technically, isn't the extent of functionality provided in this RFC
already available via the PCI resource files in sysfs?

Without a concrete use case, this looks like a solution in search of a
problem.  Thanks,

Alex


Re: [PATCH v2 13/13] vfio/type1: track iommu backed group attach

2019-09-30 Thread Alex Williamson
On Mon, 30 Sep 2019 12:41:03 +
"Liu, Yi L"  wrote:

> Hi Alex,
> 
> > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > Sent: Thursday, September 26, 2019 10:37 AM
> > To: Liu, Yi L 
> > Subject: Re: [PATCH v2 13/13] vfio/type1: track iommu backed group attach
> > 
> > On Thu,  5 Sep 2019 16:08:43 +0800
> > Liu Yi L  wrote:
> >   
> > > With the introduction of iommu aware mdev group, user may wrap a PF/VF
> > > as a mdev. Such mdevs will be called as wrapped PF/VF mdevs in following
> > > statements. If it's applied on a non-singleton iommu group, there would
> > > be multiple domain attach on an iommu_device group (equal to iommu backed
> > > group). Reason is that mdev group attaches is finally an iommu_device
> > > > group attach in the end. And existing vfio_domain.group_list has no idea
> > > about it. Thus multiple attach would happen.
> > >
> > > What's more, under default domain policy, group attach is allowed only
> > > when its in-use domain is equal to its default domain as the code below:
> > >
> > > static int __iommu_attach_group(struct iommu_domain *domain, ..)
> > > {
> > >   ..
> > >   if (group->default_domain && group->domain != group->default_domain)
> > >   return -EBUSY;
> > >   ...
> > > }
> > >
> > > So for the above scenario, only the first group attach on the
> > > non-singleton iommu group will be successful. Subsequent group
> > > > attaches will fail. However, this is a fairly valid use case
> > > > if the wrapped PF/VF mdevs and other devices are assigned to a single
> > > > VM. We may want to prevent that failure. In other words, the subsequent group
> > > attaches should return success before going to __iommu_attach_group().
> > >
> > > However, if user tries to assign the wrapped PF/VF mdevs and other
> > > devices to different VMs, the subsequent group attaches on a single
> > > iommu_device group should be failed. This means the subsequent group
> > > attach should finally calls into __iommu_attach_group() and be failed.
> > >
> > > To meet the above requirements, this patch introduces vfio_group_object
> > > > structure to track the group attach of an iommu_device group (a.k.a.
> > > iommu backed group). Each vfio_domain will have a group_obj_list to
> > > record the vfio_group_objects. The search of the group_obj_list should
> > > use iommu_device group if a group is mdev group.
> > >
> > >   struct vfio_group_object {
> > > >           atomic_t count;
> > > >           struct iommu_group *iommu_group;
> > > >           struct vfio_domain *domain;
> > > >           struct list_head next;
> > >   };
> > >
> > > Each time, a successful group attach should either have a new
> > > vfio_group_object created or count increasing of an existing
> > > vfio_group_object instance. Details can be found in
> > > vfio_domain_attach_group_object().
> > >
> > > For group detach, should have count decreasing. Please check
> > > vfio_domain_detach_group_object().
> > >
> > > As the vfio_domain.group_obj_list is within vfio container(vfio_iommu)
> > > scope, if user wants to passthru a non-singleton to multiple VMs, it
> > > will be failed as VMs will have separate vfio containers. Also, if
> > > vIOMMU is exposed, it will also fail the attempts of assigning multiple
> > > devices (via vfio-pci or PF/VF wrapped mdev) to a single VM. This is
> > > aligned with current vfio passthru rules.
> > >
> > > Cc: Kevin Tian 
> > > Cc: Lu Baolu 
> > > Suggested-by: Alex Williamson 
> > > Signed-off-by: Liu Yi L 
> > > ---
> > >  drivers/vfio/vfio_iommu_type1.c | 167  
> >   
> > >  1 file changed, 154 insertions(+), 13 deletions(-)
> > >
> > > diff --git a/drivers/vfio/vfio_iommu_type1.c 
> > > b/drivers/vfio/vfio_iommu_type1.c
> > > index 317430d..6a67bd6 100644
> > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > @@ -75,6 +75,7 @@ struct vfio_domain {
> > >   struct iommu_domain *domain;
> > > >   struct list_head next;
> > > >   struct list_head group_list;
> > > > + struct list_head group_obj_list;
> > >   int prot;   /* IOMMU_CACHE */
> > >   bool  

Re: [PATCH v2 13/13] vfio/type1: track iommu backed group attach

2019-09-25 Thread Alex Williamson
On Thu,  5 Sep 2019 16:08:43 +0800
Liu Yi L  wrote:

> With the introduction of iommu aware mdev group, user may wrap a PF/VF
> as a mdev. Such mdevs will be called as wrapped PF/VF mdevs in following
> statements. If it's applied on a non-singleton iommu group, there would
> be multiple domain attach on an iommu_device group (equal to iommu backed
> group). Reason is that mdev group attaches is finally an iommu_device
> group attach in the end. And existing vfio_domain.group_list has no idea
> about it. Thus multiple attach would happen.
> 
> What's more, under default domain policy, group attach is allowed only
> when its in-use domain is equal to its default domain as the code below:
> 
> static int __iommu_attach_group(struct iommu_domain *domain, ..)
> {
>   ..
>   if (group->default_domain && group->domain != group->default_domain)
>   return -EBUSY;
>   ...
> }
> 
> So for the above scenario, only the first group attach on the
> non-singleton iommu group will be successful. Subsequent group
> attaches will fail. However, this is a fairly valid use case
> if the wrapped PF/VF mdevs and other devices are assigned to a single
> VM. We may want to prevent that failure. In other words, the subsequent group
> attaches should return success before going to __iommu_attach_group().
> 
> However, if user tries to assign the wrapped PF/VF mdevs and other
> devices to different VMs, the subsequent group attaches on a single
> iommu_device group should be failed. This means the subsequent group
> attach should finally calls into __iommu_attach_group() and be failed.
> 
> To meet the above requirements, this patch introduces vfio_group_object
> structure to track the group attach of an iommu_device group (a.k.a.
> iommu backed group). Each vfio_domain will have a group_obj_list to
> record the vfio_group_objects. The search of the group_obj_list should
> use iommu_device group if a group is mdev group.
> 
>   struct vfio_group_object {
>           atomic_t count;
>           struct iommu_group *iommu_group;
>           struct vfio_domain *domain;
>           struct list_head next;
>   };
> 
> Each time, a successful group attach should either have a new
> vfio_group_object created or count increasing of an existing
> vfio_group_object instance. Details can be found in
> vfio_domain_attach_group_object().
> 
> For group detach, should have count decreasing. Please check
> vfio_domain_detach_group_object().
> 
> As the vfio_domain.group_obj_list is within vfio container(vfio_iommu)
> scope, if user wants to passthru a non-singleton to multiple VMs, it
> will be failed as VMs will have separate vfio containers. Also, if
> vIOMMU is exposed, it will also fail the attempts of assigning multiple
> devices (via vfio-pci or PF/VF wrapped mdev) to a single VM. This is
> aligned with current vfio passthru rules.
> 
> Cc: Kevin Tian 
> Cc: Lu Baolu 
> Suggested-by: Alex Williamson 
> Signed-off-by: Liu Yi L 
> ---
>  drivers/vfio/vfio_iommu_type1.c | 167 
> 
>  1 file changed, 154 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 317430d..6a67bd6 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -75,6 +75,7 @@ struct vfio_domain {
>   struct iommu_domain *domain;
>   struct list_head next;
>   struct list_head group_list;
> + struct list_head group_obj_list;
>   int prot;   /* IOMMU_CACHE */
>   bool fgsp;   /* Fine-grained super pages */
>  };
> @@ -97,6 +98,13 @@ struct vfio_group {
>   bool mdev_group; /* An mdev group */
>  };
>  
> +struct vfio_group_object {
> + atomic_t count;
> + struct iommu_group *iommu_group;
> + struct vfio_domain *domain;
> + struct list_head next;
> +};
> +

So vfio_domain already has a group_list for all the groups attached to
that iommu domain.  We add a vfio_group_object list, which is also
effectively a list of groups attached to the domain, but we're tracking
something different with it.  All groups seem to get added as a
vfio_group_object, so why do we need both lists?  As I suspected when
we discussed this last, this adds complexity for something that's
currently being proposed as a sample driver.

>  /*
>   * Guest RAM pinning working set or DMA target
>   */
> @@ -1263,6 +1271,85 @@ static struct vfio_group *find_iommu_group(struct 
> vfio_domain *domain,
>   ret

Re: [PATCH v2 12/13] vfio/type1: use iommu_attach_group() for wrapping PF/VF as mdev

2019-09-25 Thread Alex Williamson
On Thu,  5 Sep 2019 15:59:29 +0800
Liu Yi L  wrote:

> This patch uses iommu_attach_group() to do group attach when it is
> for the case of wrapping a PF/VF as a mdev. iommu_attach_device()
> doesn't support non-singleton iommu group attach. With this change,
> wrapping PF/VF as mdev can work on non-singleton iommu groups.
> 
> Cc: Kevin Tian 
> Cc: Lu Baolu 
> Suggested-by: Alex Williamson 
> Signed-off-by: Liu Yi L 
> ---
>  drivers/vfio/vfio_iommu_type1.c | 22 ++
>  1 file changed, 18 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 054391f..317430d 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -1312,13 +1312,20 @@ static int vfio_mdev_attach_domain(struct device 
> *dev, void *data)
>  {
>   struct iommu_domain *domain = data;
>   struct device *iommu_device;
> + struct iommu_group *group;
>  
>   iommu_device = vfio_mdev_get_iommu_device(dev);
>   if (iommu_device) {
>   if (iommu_dev_feature_enabled(iommu_device, IOMMU_DEV_FEAT_AUX))
>   return iommu_aux_attach_device(domain, iommu_device);
> - else
> - return iommu_attach_device(domain, iommu_device);
> + else {
> + group = iommu_group_get(iommu_device);
> + if (!group) {
> + WARN_ON(1);

What's the value of the WARN_ON here and below?

iommu_group_get() increments the kobject reference, looks like it's
leaked.  Thanks,

Alex
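
The leak pointed out above comes down to an unbalanced get/put pair. A minimal userspace model of the balanced version follows. It is a sketch under assumptions: struct group, group_get(), group_put() and attach_group() are toy stand-ins for struct iommu_group, iommu_group_get(), iommu_group_put() and iommu_attach_group(), and it assumes the caller does not need to keep the lookup reference once the attach call returns.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy stand-in for struct iommu_group; only the refcount is modelled. */
struct group {
	int refs;
};

/* Models iommu_group_get(): hands back the group with one extra reference. */
static struct group *group_get(struct group *g)
{
	if (g)
		g->refs++;
	return g;
}

/* Models iommu_group_put(): drops one reference. */
static void group_put(struct group *g)
{
	assert(g->refs > 0);
	g->refs--;
}

/* Hypothetical attach step; 'fail' forces the error path. */
static int attach_group(struct group *g, int fail)
{
	(void)g;
	return fail ? -EBUSY : 0;
}

/* Every successful group_get() is paired with a group_put(). */
static int attach_balanced(struct group *dev_group, int fail)
{
	struct group *g = group_get(dev_group);
	int ret;

	if (!g)
		return -EINVAL;

	ret = attach_group(g, fail);
	group_put(g);	/* drop the lookup reference on success and failure */
	return ret;
}
```

The same pairing would apply to the detach path, which takes its own reference through the same lookup.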

> + return -EINVAL;
> + }
> + return iommu_attach_group(domain, group);
> + }
>   }
>  
>   return -EINVAL;
> @@ -1328,13 +1335,20 @@ static int vfio_mdev_detach_domain(struct device 
> *dev, void *data)
>  {
>   struct iommu_domain *domain = data;
>   struct device *iommu_device;
> + struct iommu_group *group;
>  
>   iommu_device = vfio_mdev_get_iommu_device(dev);
>   if (iommu_device) {
>   if (iommu_dev_feature_enabled(iommu_device, IOMMU_DEV_FEAT_AUX))
>   iommu_aux_detach_device(domain, iommu_device);
> - else
> - iommu_detach_device(domain, iommu_device);
> + else {
> + group = iommu_group_get(iommu_device);
> + if (!group) {
> + WARN_ON(1);
> + return -EINVAL;
> + }
> + iommu_detach_group(domain, group);
> + }
>   }
>  
>   return 0;



Re: [PATCH v2 10/13] samples: refine vfio-mdev-pci driver

2019-09-25 Thread Alex Williamson
On Thu,  5 Sep 2019 15:59:27 +0800
Liu Yi L  wrote:

> From: Alex Williamson 
> 
> This patch refines the implementation of original vfio-mdev-pci driver.
> 
> And the vfio-mdev-pci-type_name will be named per the following rule:
> 
>   vmdev->attr.name = kasprintf(GFP_KERNEL,
>"%04x:%04x:%04x:%04x:%06x:%02x",
>pdev->vendor, pdev->device,
>pdev->subsystem_vendor,
>pdev->subsystem_device, pdev->class,
>pdev->revision);
> 
> Before usage, check the /sys/bus/pci/devices/$bdf/mdev_supported_types/
> to ensure the final mdev_supported_types.
> 
> Cc: Kevin Tian 
> Cc: Lu Baolu 
> Signed-off-by: Alex Williamson 
> Signed-off-by: Liu Yi L 
> ---
>  drivers/vfio/pci/vfio_mdev_pci.c | 123 
> +++
>  1 file changed, 72 insertions(+), 51 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_mdev_pci.c 
> b/drivers/vfio/pci/vfio_mdev_pci.c
> index 07c8067..09143d3 100644
> --- a/drivers/vfio/pci/vfio_mdev_pci.c
> +++ b/drivers/vfio/pci/vfio_mdev_pci.c
> @@ -65,18 +65,22 @@ MODULE_PARM_DESC(disable_idle_d3,
>  
>  static struct pci_driver vfio_mdev_pci_driver;
>  
> -static ssize_t
> -name_show(struct kobject *kobj, struct device *dev, char *buf)
> -{
> - return sprintf(buf, "%s-type1\n", dev_name(dev));
> -}
> -
> -MDEV_TYPE_ATTR_RO(name);
> +struct vfio_mdev_pci_device {
> + struct vfio_pci_device vdev;
> + struct mdev_parent_ops ops;
> + struct attribute_group *groups[2];
> + struct attribute_group attr;
> + atomic_t avail;
> +};
>  
>  static ssize_t
>  available_instances_show(struct kobject *kobj, struct device *dev, char *buf)
>  {
> - return sprintf(buf, "%d\n", 1);
> + struct vfio_mdev_pci_device *vmdev;
> +
> + vmdev = pci_get_drvdata(to_pci_dev(dev));
> +
> + return sprintf(buf, "%d\n", atomic_read(&vmdev->avail));
>  }
>  
>  MDEV_TYPE_ATTR_RO(available_instances);
> @@ -90,62 +94,57 @@ static ssize_t device_api_show(struct kobject *kobj, 
> struct device *dev,
>  MDEV_TYPE_ATTR_RO(device_api);
>  
>  static struct attribute *vfio_mdev_pci_types_attrs[] = {
> - &mdev_type_attr_name.attr,
>   &mdev_type_attr_device_api.attr,
>   &mdev_type_attr_available_instances.attr,
>   NULL,
>  };
>  
> -static struct attribute_group vfio_mdev_pci_type_group1 = {
> - .name  = "type1",
> - .attrs = vfio_mdev_pci_types_attrs,
> -};
> -
> -struct attribute_group *vfio_mdev_pci_type_groups[] = {
> - &vfio_mdev_pci_type_group1,
> - NULL,
> -};
> -
>  struct vfio_mdev_pci {
>   struct vfio_pci_device *vdev;
>   struct mdev_device *mdev;
> - unsigned long handle;
>  };
>  
>  static int vfio_mdev_pci_create(struct kobject *kobj, struct mdev_device 
> *mdev)
>  {
>   struct device *pdev;
> - struct vfio_pci_device *vdev;
> + struct vfio_mdev_pci_device *vmdev;
>   struct vfio_mdev_pci *pmdev;
>   int ret;
>  
>   pdev = mdev_parent_dev(mdev);
> - vdev = dev_get_drvdata(pdev);
> + vmdev = dev_get_drvdata(pdev);
> +
> + if (atomic_dec_if_positive(&vmdev->avail) < 0)
> + return -ENOSPC;
> +
>   pmdev = kzalloc(sizeof(struct vfio_mdev_pci), GFP_KERNEL);
> - if (pmdev == NULL) {
> - ret = -EBUSY;
> - goto out;
> - }
> + if (!pmdev)
> + return -ENOMEM;

Needs an atomic_inc(&vmdev->avail) in this error path.  Thanks,

Alex
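
The missing rollback can be illustrated with C11 atomics in userspace. This is a sketch, not the driver code: dec_if_positive() approximates the kernel's atomic_dec_if_positive(), and struct instance, create_instance(), remove_instance() and the simulate_oom flag are hypothetical names introduced only for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdlib.h>

/*
 * Userspace approximation of atomic_dec_if_positive(): decrement only
 * if the result stays >= 0, and return the old value minus one either way.
 */
static int dec_if_positive(atomic_int *v)
{
	int old = atomic_load(v);

	while (old > 0 && !atomic_compare_exchange_weak(v, &old, old - 1))
		;	/* 'old' is refreshed by a failed compare-exchange */
	return old - 1;
}

struct instance {
	int unused;
};

/* Claims one slot; every failure after the claim gives the slot back. */
static int create_instance(atomic_int *avail, int simulate_oom,
			   struct instance **out)
{
	struct instance *inst;

	assert(out != NULL);
	if (dec_if_positive(avail) < 0)
		return -ENOSPC;

	inst = simulate_oom ? NULL : calloc(1, sizeof(*inst));
	if (!inst) {
		atomic_fetch_add(avail, 1);	/* the rollback asked for above */
		return -ENOMEM;
	}

	*out = inst;
	return 0;
}

/* Frees the instance and releases its slot. */
static void remove_instance(atomic_int *avail, struct instance *inst)
{
	free(inst);
	atomic_fetch_add(avail, 1);
}
```

Without the rollback, a single failed allocation would permanently reduce the reported number of available instances.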

>  
>   pmdev->mdev = mdev;
> - pmdev->vdev = vdev;
> + pmdev->vdev = &vmdev->vdev;
>   mdev_set_drvdata(mdev, pmdev);
>   ret = mdev_set_iommu_device(mdev_dev(mdev), pdev);
>   if (ret) {
>   pr_info("%s, failed to config iommu isolation for mdev: %s on 
> pf: %s\n",
>   __func__, dev_name(mdev_dev(mdev)), dev_name(pdev));
> - goto out;
> + kfree(pmdev);
> + atomic_inc(&vmdev->avail);
> + return ret;
>   }
>  
> -out:
> - return ret;
> + return 0;
>  }
>  
>  static int vfio_mdev_pci_remove(struct mdev_device *mdev)
>  {
>   struct vfio_mdev_pci *pmdev = mdev_get_drvdata(mdev);
> + struct vfio_mdev_pci_device *vmdev;
> +
> + vmdev = container_of(pmdev->vdev, struct vfio_mdev_pci_device, vdev);
>  
>   kfree(pmdev);
> + atomic_inc(&vmdev->avail);
>   pr_info("%s, succeeded for mdev: %s\n&q

Re: [PATCH v2 11/13] samples/vfio-mdev-pci: call vfio_add_group_dev()

2019-09-25 Thread Alex Williamson
On Thu,  5 Sep 2019 15:59:28 +0800
Liu Yi L  wrote:

> This patch adds vfio_add_group_dev() calling in probe() to make
> vfio-mdev-pci work well with non-singleton iommu group. User could
> bind devices from a non-singleton iommu group to either vfio-pci
> driver or this sample driver. Existing passthru policy works well
> for this non-singleton group.
> 
> This is actually a policy choice. A device driver can make this call
> if it wants to be vfio viable. And it needs to provide dummy
> vfio_device_ops which is required by vfio framework. To prevent user
> from opening the device from the iommu backed group fd, the open
> callback of the dummy vfio_device_ops should return -ENODEV to fail
> the VFIO_GET_DEVICE_FD request from userspace.
> 
> Cc: Kevin Tian 
> Cc: Lu Baolu 
> Signed-off-by: Liu Yi L 
> ---
>  drivers/vfio/pci/vfio_mdev_pci.c | 91 
> 
>  1 file changed, 82 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_mdev_pci.c 
> b/drivers/vfio/pci/vfio_mdev_pci.c
> index 09143d3..a61c20d 100644
> --- a/drivers/vfio/pci/vfio_mdev_pci.c
> +++ b/drivers/vfio/pci/vfio_mdev_pci.c
> @@ -107,19 +107,27 @@ struct vfio_mdev_pci {
>  static int vfio_mdev_pci_create(struct kobject *kobj, struct mdev_device 
> *mdev)
>  {
>   struct device *pdev;
> + struct vfio_device *device;
>   struct vfio_mdev_pci_device *vmdev;
>   struct vfio_mdev_pci *pmdev;
>   int ret;
>  
>   pdev = mdev_parent_dev(mdev);
> - vmdev = dev_get_drvdata(pdev);
> + device = vfio_device_get_from_dev(pdev);
> + vmdev = vfio_device_data(device);
>  
> - if (atomic_dec_if_positive(&vmdev->avail) < 0)
> - return -ENOSPC;
> + if (atomic_dec_if_positive(&vmdev->avail) < 0) {
> + ret = -ENOSPC;
> + goto out;
> + }
>  
> + pr_info("%s, available instance: %d\n",
> + __func__, atomic_read(&vmdev->avail));
>   pmdev = kzalloc(sizeof(struct vfio_mdev_pci), GFP_KERNEL);
> - if (!pmdev)
> - return -ENOMEM;
> + if (!pmdev) {
> + ret = -ENOMEM;
> + goto out;
> + }
>  
>   pmdev->mdev = mdev;
>   pmdev->vdev = &vmdev->vdev;
> @@ -130,10 +138,11 @@ static int vfio_mdev_pci_create(struct kobject *kobj, 
> struct mdev_device *mdev)
>   __func__, dev_name(mdev_dev(mdev)), dev_name(pdev));
>   kfree(pmdev);
>   atomic_inc(&vmdev->avail);
> - return ret;
>   }
>  
> - return 0;
> +out:
> + vfio_device_put(device);
> + return ret;
>  }
>  
>  static int vfio_mdev_pci_remove(struct mdev_device *mdev)
> @@ -145,6 +154,8 @@ static int vfio_mdev_pci_remove(struct mdev_device *mdev)
>  
>   kfree(pmdev);
>   atomic_inc(&vmdev->avail);
> + pr_info("%s, available instance: %d\n",
> + __func__, atomic_read(&vmdev->avail));
>   pr_info("%s, succeeded for mdev: %s\n", __func__,
>dev_name(mdev_dev(mdev)));
>  
> @@ -236,12 +247,65 @@ static ssize_t vfio_mdev_pci_write(struct mdev_device 
> *mdev,
>   return vfio_pci_write(pmdev->vdev, (char __user *)buf, count, ppos);
>  }
>  
> +static int vfio_pci_dummy_open(void *device_data)
> +{
> + struct vfio_mdev_pci_device *vmdev =
> + (struct vfio_mdev_pci_device *) device_data;
> + pr_warn("Device %s is not viable for vfio-pci passthru, please follow"
> + " vfio-mdev passthru path as it has been wrapped as mdev!!!\n",
> + dev_name(&vmdev->vdev.pdev->dev));
> + return -ENODEV;
> +}
> +
> +static void vfio_pci_dummy_release(void *device_data)
> +{
> +}

Theoretically .release will never be called.  If we're paranoid, we
could keep it with a pr_warn.

> +
> +long vfio_pci_dummy_ioctl(void *device_data,
> +unsigned int cmd, unsigned long arg)
> +{
> + return 0;
> +}
> +
> +ssize_t vfio_pci_dummy_read(void *device_data, char __user *buf,
> +  size_t count, loff_t *ppos)
> +{
> + return 0;
> +}
> +
> +ssize_t vfio_pci_dummy_write(void *device_data, const char __user *buf,
> +   size_t count, loff_t *ppos)
> +{
> + return 0;
> +}
> +
> +int vfio_pci_dummy_mmap(void *device_data, struct vm_area_struct *vma)
> +{
> + return 0;
> +}
> +
> +void vfio_pci_dummy_request(void *device_data, unsigned int count)
> +{
> +}

AFAICT, none of .ioctl, .read, .write, .mmap, or .request need to be
provided, only .open and only .release for paranoia.

> +
> +static const struct vfio_device_ops vfio_pci_dummy_ops = {
> + .name   = "vfio-pci",

This is impersonating vfio-pci, shouldn't we use something like
"vfio-mdev-pci-dummy"?  Thanks,

Alex

> + .open   = vfio_pci_dummy_open,
> + .release= vfio_pci_dummy_release,
> + .ioctl  = vfio_pci_dummy_ioctl,
> + .read   = vfio_pci_dummy_read,
> + .write  = vfio_pci_dummy_write,
> + .mmap  

Re: [PATCH v2 08/13] vfio/pci: protect cap/ecap_perm bits alloc/free with atomic op

2019-09-25 Thread Alex Williamson
On Thu,  5 Sep 2019 15:59:25 +0800
Liu Yi L  wrote:

> There is a case in which cap_perms and ecap_perms can be reallocated
> by different modules. e.g. the vfio-mdev-pci sample driver. To secure
> the initialization of cap_perms and ecap_perms, this patch adds an
> atomic variable to track the user of cap/ecap_perms bits. First caller
> of vfio_pci_init_perm_bits() will initialize the bits. While the last
> caller of vfio_pci_uninit_perm_bits() will free the bits.

Yes, but it still allows races; we're not really protecting the data.
If driver A begins freeing the shared data in the uninit path, driver B
could start allocating shared data in the init path and we're left with
either use after free issues or memory leaks.  Probably better to hold
a semaphore around the allocation/free and a non-atomic for reference
counting.  Thanks,

Alex
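
A userspace sketch of the suggested scheme follows, with a pthread mutex swapped in for the kernel semaphore and a plain int as the reference count. The names perm_lock, perm_users, perm_table, perm_bits_init() and perm_bits_uninit() are hypothetical, and a single calloc'd buffer stands in for the cap_perms/ecap_perms tables.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t perm_lock = PTHREAD_MUTEX_INITIALIZER;
static int perm_users;			/* protected by perm_lock */
static unsigned char *perm_table;	/* stand-in for the shared tables */

/* The first caller allocates; later callers only take a reference. */
int perm_bits_init(void)
{
	int ret = 0;

	pthread_mutex_lock(&perm_lock);
	if (perm_users == 0) {
		perm_table = calloc(256, 1);
		if (!perm_table)
			ret = -ENOMEM;
	}
	if (!ret)
		perm_users++;
	pthread_mutex_unlock(&perm_lock);

	return ret;
}

/* The last caller frees, still under the lock. */
void perm_bits_uninit(void)
{
	pthread_mutex_lock(&perm_lock);
	assert(perm_users > 0);
	if (--perm_users == 0) {
		free(perm_table);
		perm_table = NULL;
	}
	pthread_mutex_unlock(&perm_lock);
}
```

Because the count check and the allocation or free happen inside the same critical section, a second driver entering the init path has to wait until a concurrent uninit has finished, which is the window the bare atomic counter leaves open.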
 
> Cc: Kevin Tian 
> Cc: Lu Baolu 
> Suggested-by: Alex Williamson 
> Signed-off-by: Liu Yi L 
> ---
>  drivers/vfio/pci/vfio_pci_config.c | 9 +
>  1 file changed, 9 insertions(+)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_config.c 
> b/drivers/vfio/pci/vfio_pci_config.c
> index f0891bd..1b3e6e5 100644
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c
> @@ -992,11 +992,17 @@ static int __init init_pci_ext_cap_pwr_perm(struct 
> perm_bits *perm)
>   return 0;
>  }
>  
> +/* Track the user number of the cap/ecap perm_bits */
> +atomic_t vfio_pci_perm_bits_users = ATOMIC_INIT(0);
> +
>  /*
>   * Initialize the shared permission tables
>   */
>  void vfio_pci_uninit_perm_bits(void)
>  {
> + if (atomic_dec_return(&vfio_pci_perm_bits_users))
> + return;
> +
>   free_perm_bits(&cap_perms[PCI_CAP_ID_BASIC]);
>  
>   free_perm_bits(&cap_perms[PCI_CAP_ID_PM]);
> @@ -1013,6 +1019,9 @@ int __init vfio_pci_init_perm_bits(void)
>  {
>   int ret;
>  
> + if (atomic_inc_return(&vfio_pci_perm_bits_users) != 1)
> + return 0;
> +
>   /* Basic config space */
>   ret = init_pci_cap_basic_perm(&cap_perms[PCI_CAP_ID_BASIC]);
>  



Re: [PATCH v2 02/13] vfio_pci: refine user config reference in vfio-pci module

2019-09-25 Thread Alex Williamson
On Thu,  5 Sep 2019 15:59:19 +0800
Liu Yi L  wrote:

> This patch adds three fields in struct vfio_pci_device to pass the user
> configs of vfio-pci module to some functions which could be common in
> future usage.
> 
> Cc: Kevin Tian 
> Cc: Lu Baolu 
> Signed-off-by: Liu Yi L 
> ---
>  drivers/vfio/pci/vfio_pci.c | 24 +++-
>  drivers/vfio/pci/vfio_pci_private.h |  9 +++--
>  2 files changed, 22 insertions(+), 11 deletions(-)

A subtle behavioral difference here is that disable_idle_d3 and
nointxmask are runtime modifiable parameters, if the value is changed
in sysfs the device will adopt the new behavior on its next
transition.  After this patch, each device operates in the mode defined
at the time it was probed.  Should we maybe refresh the value at key
points, like the user opening or releasing the device so that it tracks
the module parameter?  I think we could defend not changing the
behavior of a device while it's in use by a user.  Otherwise we might
want to make the module parameter read-only to avoid the
inconsistency.
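
One way to read the "refresh at key points" option is sketched below in userspace C. Everything here is an assumption made for illustration: the global disable_idle_d3 stands in for the writable module parameter, and struct dev_state, dev_probe(), dev_open() and dev_release() are hypothetical helpers, not vfio-pci functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for a module parameter that can be changed at runtime. */
static bool disable_idle_d3;

struct dev_state {
	bool disable_idle_d3;	/* per-device copy of the parameter */
	int open_count;
};

/* Probe takes the initial snapshot, as the patch does. */
static void dev_probe(struct dev_state *d)
{
	d->disable_idle_d3 = disable_idle_d3;
	d->open_count = 0;
}

/* Refresh only on the first open, so an in-use device never changes mode. */
static void dev_open(struct dev_state *d)
{
	if (d->open_count++ == 0)
		d->disable_idle_d3 = disable_idle_d3;
}

/* Refresh again once the last user is gone. */
static void dev_release(struct dev_state *d)
{
	assert(d->open_count > 0);
	if (--d->open_count == 0)
		d->disable_idle_d3 = disable_idle_d3;
}
```

With this arrangement a write to the parameter takes effect at the next first-open or last-release, and a device keeps a stable mode for as long as a user holds it.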

> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 38271df..fed2687 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -69,7 +69,8 @@ static unsigned int vfio_pci_set_vga_decode(void *opaque, 
> bool single_vga)
>   unsigned char max_busnr;
>   unsigned int decodes;
>  
> - if (single_vga || !vfio_vga_disabled() || pci_is_root_bus(pdev->bus))
> + if (single_vga || !vfio_vga_disabled(vdev) ||
> + pci_is_root_bus(pdev->bus))
>   return VGA_RSRC_NORMAL_IO | VGA_RSRC_NORMAL_MEM |
>  VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM;
>  
> @@ -273,7 +274,7 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
>   if (!vdev->pci_saved_state)
>   pci_dbg(pdev, "%s: Couldn't store saved state\n", __func__);
>  
> - if (likely(!nointxmask)) {
> + if (likely(!vdev->nointxmask)) {
>   if (vfio_pci_nointx(pdev)) {
>   pci_info(pdev, "Masking broken INTx support\n");
>   vdev->nointx = true;
> @@ -310,7 +311,7 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
>   } else
>   vdev->msix_bar = 0xFF;
>  
> - if (!vfio_vga_disabled() && vfio_pci_is_vga(pdev))
> + if (!vfio_vga_disabled(vdev) && vfio_pci_is_vga(pdev))
>   vdev->has_vga = true;
>  
>  
> @@ -436,7 +437,7 @@ static void vfio_pci_disable(struct vfio_pci_device *vdev)
>  
>   vfio_pci_try_bus_reset(vdev);
>  
> - if (!disable_idle_d3)
> + if (!vdev->disable_idle_d3)
>   vfio_pci_set_power_state(vdev, PCI_D3hot);
>  }
>  
> @@ -1304,6 +1305,11 @@ static int vfio_pci_probe(struct pci_dev *pdev, const 
> struct pci_device_id *id)
>   spin_lock_init(&vdev->irqlock);
>   mutex_init(&vdev->ioeventfds_lock);
>   INIT_LIST_HEAD(&vdev->ioeventfds_list);
> + vdev->nointxmask = nointxmask;
> +#ifdef CONFIG_VFIO_PCI_VGA
> + vdev->disable_vga = disable_vga;
> +#endif
> + vdev->disable_idle_d3 = disable_idle_d3;
>  
>   ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
>   if (ret) {
> @@ -1328,7 +1334,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const 
> struct pci_device_id *id)
>  
>   vfio_pci_probe_power_state(vdev);
>  
> - if (!disable_idle_d3) {
> + if (!vdev->disable_idle_d3) {
>   /*
>* pci-core sets the device power state to an unknown value at
>* bootup and after being removed from a driver.  The only
> @@ -1359,7 +1365,7 @@ static void vfio_pci_remove(struct pci_dev *pdev)
>   kfree(vdev->region);
>   mutex_destroy(&vdev->ioeventfds_lock);
>  
> - if (!disable_idle_d3)
> + if (!vdev->disable_idle_d3)
>   vfio_pci_set_power_state(vdev, PCI_D0);
>  
>   kfree(vdev->pm_save);
> @@ -1594,7 +1600,7 @@ static void vfio_pci_try_bus_reset(struct 
> vfio_pci_device *vdev)
>   if (!ret) {
>   tmp->needs_reset = false;
>  
> - if (tmp != vdev && !disable_idle_d3)
> + if (tmp != vdev && !tmp->disable_idle_d3)
>   vfio_pci_set_power_state(tmp, PCI_D3hot);
>   }
>  
> @@ -1610,7 +1616,7 @@ static void __exit vfio_pci_cleanup(void)
>   vfio_pci_uninit_perm_bits();
>  }
>  
> -static void __init vfio_pci_fill_ids(void)
> +static void __init vfio_pci_fill_ids(char *ids)
>  {
>   char *p, *id;
>   int rc;
> @@ -1665,7 +1671,7 @@ static int __init vfio_pci_init(void)
>   if (ret)
>   goto out_driver;
>  
> - vfio_pci_fill_ids();
> + vfio_pci_fill_ids(&ids[0]);

Or just 'ids'.  Thanks,

Alex

>  
>   return 0;
>  
> diff --git a/drivers/vfio/pci/vfio_pci_private.h 
> b/drivers/vfio/pci/vfio_pci_private.h
> index f12d92c..68521d2 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ 

Re: [PATCH v6 2/6] vfio: Introduce vGPU display irq type

2019-09-24 Thread Alex Williamson
On Tue, 24 Sep 2019 14:41:39 +0800
Tina Zhang  wrote:

> Introduce vGPU specific irq type VFIO_IRQ_TYPE_GFX, and
> VFIO_IRQ_SUBTYPE_GFX_DISPLAY_IRQ as the subtype for vGPU display.
> 
> Introduce vfio_irq_info_cap_display_plane_events capability to notify
> user space with the vGPU's plane update events
> 
> v3:
> - Add more description to VFIO_IRQ_SUBTYPE_GFX_DISPLAY_IRQ and
>   VFIO_IRQ_INFO_CAP_DISPLAY. (Alex & Gerd)
> 
> v2:
> - Add VFIO_IRQ_SUBTYPE_GFX_DISPLAY_IRQ description. (Alex & Kechen)
> - Introduce vfio_irq_info_cap_display_plane_events. (Gerd & Alex)
> 
> Signed-off-by: Tina Zhang 
> ---
>  include/uapi/linux/vfio.h | 38 ++
>  1 file changed, 38 insertions(+)
> 
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index aa6850f1daef..2946a028b0c3 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -476,6 +476,44 @@ struct vfio_irq_info_cap_type {
>   __u32 subtype;  /* type specific */
>  };
>  
> +/* vGPU IRQ TYPE */
> +#define VFIO_IRQ_TYPE_GFX(1)
> +
> +/* sub-types for VFIO_IRQ_TYPE_GFX */
> +/*
> + * vGPU device display refresh interrupt request. This irq asking for
> + * a user space display refresh, covers all display updates events,
> + * i.e. user space can stop the display update timer and fully depend
> + * on getting the notification if an update is needed.
> + */
> +#define VFIO_IRQ_SUBTYPE_GFX_DISPLAY_IRQ (1)
> +
> +/*
> + * Display capability of reporting display refresh interrupt events.

Perhaps, "Capability for interpreting GFX_DISPLAY_IRQ eventfd value"

> + * This gives user space the capability to identify different display
> + * updates events of the display refresh interrupt request.
> + *
> + * When notified by VFIO_IRQ_SUBTYPE_GFX_DISPLAY_IRQ, user space can
> + * use the eventfd counter value to identify which plane has been
> + * updated.
> + *
> + * Note that there might be some cases like counter_value >
> + * (cur_event_val + pri_event_val), if notifications haven't been
> + * handled on time in user mode. These cases can be handled as all
> + * plane updated case or signle plane updated case depending on the
> + * value.

Seems like in the GVT-g implementation such a value is not possible.
In fact, when this capability is provided, doesn't userspace interpret
the eventfd value more as a bitmask of events rather than a counter?
If so, (cur_event_val + pri_event_val) may be mathematically accurate,
but maybe obfuscates the logical interpretation... or maybe that's just
me.

> + *
> + * cur_event_val: eventfd counter value for cursor plane change event.
> + * pri_event_val: eventfd counter value for primary plane change event.

I think there should be a note that this capability is optional and
lacking this feature, userspace should refresh all display events on
notification.

> + */
> +#define VFIO_IRQ_INFO_CAP_DISPLAY2
> +
> +struct vfio_irq_info_cap_display_plane_events {
> + struct vfio_info_cap_header header;
> + __u64 cur_event_val;
> + __u64 pri_event_val;

AIUI, the GVT-g implementation sets a single bit and userspace expects
one or both of those bits to be set on notification.  Should we
simplify this a bit and just define these as cur_event_bit,
pri_event_bit and use a __u8 for each to define the bit position?
Thanks,

Alex
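
To illustrate the bit-position idea, here is a rough userspace sketch of how a consumer might decode the value read from the eventfd. The struct and function names are hypothetical and only model the suggestion above; they are not part of the proposed uapi. The reading assumes each bit is signaled at most once between reads, since an eventfd accumulates the values written to it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions, as the capability might report them. */
struct plane_event_bits {
	uint8_t cur_event_bit;	/* bit position for cursor plane updates */
	uint8_t pri_event_bit;	/* bit position for primary plane updates */
};

/* Which planes the notification says were updated. */
struct plane_updates {
	bool cursor;
	bool primary;
};

/* Treat the value read from the eventfd as a bitmask of plane events. */
static struct plane_updates decode_display_events(uint64_t val,
						  struct plane_event_bits bits)
{
	struct plane_updates u;

	assert(bits.cur_event_bit < 64 && bits.pri_event_bit < 64);
	u.cursor = (val >> bits.cur_event_bit) & 1;
	u.primary = (val >> bits.pri_event_bit) & 1;
	return u;
}
```

With bit positions instead of values, userspace tests bits rather than comparing against sums such as (cur_event_val + pri_event_val).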


Re: [PATCH v6 5/6] drm/i915/gvt: Deliver async primary plane page flip events at vblank

2019-09-24 Thread Alex Williamson
On Tue, 24 Sep 2019 14:41:42 +0800
Tina Zhang  wrote:

> From: Kechen Lu 
> 
> Only sync primary plane page flip events are checked and delivered
> as the display refresh events before, this patch tries to deliver async
> primary page flip events bounded by vblanks.
> 
> To deliver correct async page flip, the new async flip bitmap is
> introduced and in vblank emulation handler to check bitset.
> 
> Signed-off-by: Kechen Lu 
> Signed-off-by: Tina Zhang 
> ---
>  drivers/gpu/drm/i915/gvt/cmd_parser.c |  6 --
>  drivers/gpu/drm/i915/gvt/display.c| 10 ++
>  drivers/gpu/drm/i915/gvt/gvt.h|  2 ++
>  drivers/gpu/drm/i915/gvt/handlers.c   |  5 +++--
>  4 files changed, 19 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gvt/cmd_parser.c 
> b/drivers/gpu/drm/i915/gvt/cmd_parser.c
> index e753b1e706e2..1abb05431177 100644
> --- a/drivers/gpu/drm/i915/gvt/cmd_parser.c
> +++ b/drivers/gpu/drm/i915/gvt/cmd_parser.c
> @@ -1365,9 +1365,11 @@ static int gen8_update_plane_mmio_from_mi_display_flip(
>   if (info->plane == PLANE_PRIMARY)
>   vgpu_vreg_t(vgpu, PIPE_FLIPCOUNT_G4X(info->pipe))++;
>  
> - if (info->async_flip)
> + if (info->async_flip) {
>   intel_vgpu_trigger_virtual_event(vgpu, info->event);
> - else
> + set_bit(info->plane,
> + vgpu->display.async_flip_event[info->pipe]);
> + } else
>   set_bit(info->event, vgpu->irq.flip_done_event[info->pipe]);
>  
>   return 0;
> diff --git a/drivers/gpu/drm/i915/gvt/display.c 
> b/drivers/gpu/drm/i915/gvt/display.c
> index 9f2c2cd10369..9acde0bdd5f4 100644
> --- a/drivers/gpu/drm/i915/gvt/display.c
> +++ b/drivers/gpu/drm/i915/gvt/display.c
> @@ -419,6 +419,16 @@ static void emulate_vblank_on_pipe(struct intel_vgpu 
> *vgpu, int pipe)
>   intel_vgpu_trigger_virtual_event(vgpu, event);
>   }
>  
> + for_each_set_bit(event, vgpu->display.async_flip_event[pipe],
> + I915_MAX_PLANES) {
> + clear_bit(event, vgpu->display.async_flip_event[pipe]);
> + if (!pipe_is_enabled(vgpu, pipe))
> + continue;
> +
> + if (event == PLANE_PRIMARY)
> + eventfd_signal_val |= DISPLAY_PRI_REFRESH_EVENT_VAL;

Is it worthwhile to continue the for_each_set_bit here, or should we
clear the remaining bits and break from the loop?  Thanks,

Alex
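
As a sketch of the "clear and stop" alternative, the userspace model below consumes all pending bits for a pipe in one step and only checks the primary plane. The macros and the function are hypothetical stand-ins for PLANE_PRIMARY, DISPLAY_PRI_REFRESH_EVENT_VAL and the per-pipe bitmap; the real values are not shown in this thread.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PLANE_PRIMARY	0		/* stand-in for PLANE_PRIMARY */
#define MODEL_PRI_REFRESH_VAL	(1ULL << 0)	/* stand-in for DISPLAY_PRI_REFRESH_EVENT_VAL */

/*
 * Consume the pending async flip bits for one pipe and return the
 * contribution to the eventfd value.  All pending bits are cleared up
 * front, so no per-bit loop is needed.
 */
static unsigned long long consume_async_flips(unsigned long *pending,
					      int pipe_enabled)
{
	unsigned long bits;

	assert(pending != NULL);
	bits = *pending;
	*pending = 0;

	if (pipe_enabled && (bits & (1UL << MODEL_PLANE_PRIMARY)))
		return MODEL_PRI_REFRESH_VAL;
	return 0;
}
```

In the kernel the equivalent would presumably be a test of the primary bit followed by zeroing the bitmap, but that is an assumption about how the patch could be reworked, not something taken from it.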

> + }
> +
>   if (eventfd_signal_val)
>   vgpu->no_pageflip_count = 0;
>   else if (!eventfd_signal_val && vgpu->no_pageflip_count > 
> PAGEFLIP_DELAY_THR)
> diff --git a/drivers/gpu/drm/i915/gvt/gvt.h b/drivers/gpu/drm/i915/gvt/gvt.h
> index cc39b449b061..73769a87b407 100644
> --- a/drivers/gpu/drm/i915/gvt/gvt.h
> +++ b/drivers/gpu/drm/i915/gvt/gvt.h
> @@ -128,6 +128,8 @@ struct intel_vgpu_display {
>   struct intel_vgpu_i2c_edid i2c_edid;
>   struct intel_vgpu_port ports[I915_MAX_PORTS];
>   struct intel_vgpu_sbi sbi;
> + DECLARE_BITMAP(async_flip_event[I915_MAX_PIPES],
> +I915_MAX_PLANES);
>  };
>  
>  struct vgpu_sched_ctl {
> diff --git a/drivers/gpu/drm/i915/gvt/handlers.c 
> b/drivers/gpu/drm/i915/gvt/handlers.c
> index 45a9124e53b6..e5a022c2e7bb 100644
> --- a/drivers/gpu/drm/i915/gvt/handlers.c
> +++ b/drivers/gpu/drm/i915/gvt/handlers.c
> @@ -760,9 +760,10 @@ static int pri_surf_mmio_write(struct intel_vgpu *vgpu, 
> unsigned int offset,
>  
>   vgpu_vreg_t(vgpu, PIPE_FLIPCOUNT_G4X(pipe))++;
>  
> - if (vgpu_vreg_t(vgpu, DSPCNTR(pipe)) & PLANE_CTL_ASYNC_FLIP)
> + if (vgpu_vreg_t(vgpu, DSPCNTR(pipe)) & PLANE_CTL_ASYNC_FLIP) {
>   intel_vgpu_trigger_virtual_event(vgpu, event);
> - else
> + set_bit(PLANE_PRIMARY, vgpu->display.async_flip_event[pipe]);
> + } else
>   set_bit(event, vgpu->irq.flip_done_event[pipe]);
>  
>   return 0;



Re: [PATCH v6 4/6] drm/i915/gvt: Deliver vGPU refresh event to userspace

2019-09-24 Thread Alex Williamson
On Tue, 24 Sep 2019 14:41:41 +0800
Tina Zhang  wrote:

> Deliver the display refresh events to the user land. Userspace can use
> the irq mask/unmask mechanism to disable or enable the event delivery.
> 
> As we know, delivering refresh event at each vblank safely avoids
> tearing and unexpected event overwhelming, but there are still spaces
> to optimize.
> 
> For handling the normal case, deliver the page flip refresh
> event at each vblank, in other words, bounded by vblanks. Skipping some
> events bring performance enhancement while not hurting user experience.
> 
> For single framebuffer case, deliver the refresh events to userspace at
> all vblanks. This heuristic at each vblank leverages pageflip_count
> incresements to determine if there is no page flip happens after a certain
> period and so that the case is regarded as single framebuffer one.
> Although this heuristic makes incorrect decision sometimes and it depends
> on guest behavior, for example, when no cursor movements happen, the
> user experience does not harm and front buffer is still correctly acquired.
> Meanwhile, in actual single framebuffer case, the user experience is
> enhanced compared with page flip events only.
> 
> Addtionally, to mitigate the events delivering footprints, one eventfd and
> 8 byte eventfd counter partition are leveraged.
> 
> v3:
> - make no_pageflip_count be per-vgpu instead of static. (Zhenyu)
> 
> v2:
> - Support vfio_irq_info_cap_display_plane_events. (Tina)
> 
> Signed-off-by: Tina Zhang 
> Signed-off-by: Kechen Lu 
> ---
>  drivers/gpu/drm/i915/gvt/display.c |  20 
>  drivers/gpu/drm/i915/gvt/gvt.h |   3 +
>  drivers/gpu/drm/i915/gvt/kvmgt.c   | 159 +++--
>  3 files changed, 173 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gvt/display.c 
> b/drivers/gpu/drm/i915/gvt/display.c
> index 1a0a4ae4826e..9f2c2cd10369 100644
> --- a/drivers/gpu/drm/i915/gvt/display.c
> +++ b/drivers/gpu/drm/i915/gvt/display.c
> @@ -34,6 +34,8 @@
>  
>  #include "i915_drv.h"
>  #include "gvt.h"
> +#include 
> +#include 
>  
>  static int get_edp_pipe(struct intel_vgpu *vgpu)
>  {
> @@ -387,6 +389,8 @@ void intel_gvt_check_vblank_emulation(struct intel_gvt 
> *gvt)
>   mutex_unlock(>lock);
>  }
>  
> +#define PAGEFLIP_DELAY_THR 10
> +
>  static void emulate_vblank_on_pipe(struct intel_vgpu *vgpu, int pipe)
>  {
>   struct drm_i915_private *dev_priv = vgpu->gvt->dev_priv;
> @@ -396,7 +400,9 @@ static void emulate_vblank_on_pipe(struct intel_vgpu 
> *vgpu, int pipe)
>   [PIPE_B] = PIPE_B_VBLANK,
>   [PIPE_C] = PIPE_C_VBLANK,
>   };
> + int pri_flip_event = SKL_FLIP_EVENT(pipe, PLANE_PRIMARY);
>   int event;
> + u64 eventfd_signal_val = 0;
>  
>   if (pipe < PIPE_A || pipe > PIPE_C)
>   return;
> @@ -407,9 +413,23 @@ static void emulate_vblank_on_pipe(struct intel_vgpu 
> *vgpu, int pipe)
>   if (!pipe_is_enabled(vgpu, pipe))
>   continue;
>  
> + if (event == pri_flip_event)
> + eventfd_signal_val |= DISPLAY_PRI_REFRESH_EVENT_VAL;
> +
>   intel_vgpu_trigger_virtual_event(vgpu, event);
>   }
>  
> + if (eventfd_signal_val)
> + vgpu->no_pageflip_count = 0;
> + else if (!eventfd_signal_val && vgpu->no_pageflip_count > 
> PAGEFLIP_DELAY_THR)

The !eventfd_signal_val test is redundant since this is an else branch
of a test for eventfd_signal_val.
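
A small userspace model of the chain with the redundant test dropped, just to show the behavior is unchanged. The macro and function names are hypothetical; only the threshold of 10 mirrors PAGEFLIP_DELAY_THR from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGEFLIP_DELAY_THR	10		/* mirrors PAGEFLIP_DELAY_THR */
#define MODEL_REFRESH_VAL		((uint64_t)1)	/* stand-in for DISPLAY_PRI_REFRESH_EVENT_VAL */

/*
 * Returns the value to signal.  'val' is nonzero when a page flip was
 * seen at this vblank; '*no_pageflip_count' tracks vblanks without one.
 */
static uint64_t update_refresh_val(uint64_t val, int *no_pageflip_count)
{
	assert(no_pageflip_count != NULL);

	if (val)
		*no_pageflip_count = 0;
	else if (*no_pageflip_count > MODEL_PAGEFLIP_DELAY_THR)
		val |= MODEL_REFRESH_VAL;	/* val is known to be 0 here */
	else
		(*no_pageflip_count)++;

	return val;
}
```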

> + eventfd_signal_val |= DISPLAY_PRI_REFRESH_EVENT_VAL;
> + else
> + vgpu->no_pageflip_count++;
> +
> + if (vgpu->vdev.vblank_trigger && !vgpu->vdev.display_event_mask &&
> + eventfd_signal_val)
> + eventfd_signal(vgpu->vdev.vblank_trigger, eventfd_signal_val);
> +
>   if (pipe_is_enabled(vgpu, pipe)) {
>   vgpu_vreg_t(vgpu, PIPE_FRMCOUNT_G4X(pipe))++;
>   intel_vgpu_trigger_virtual_event(vgpu, vblank_event[pipe]);
> diff --git a/drivers/gpu/drm/i915/gvt/gvt.h b/drivers/gpu/drm/i915/gvt/gvt.h
> index 8008047d026c..cc39b449b061 100644
> --- a/drivers/gpu/drm/i915/gvt/gvt.h
> +++ b/drivers/gpu/drm/i915/gvt/gvt.h
> @@ -205,6 +205,8 @@ struct intel_vgpu {
>   int num_irqs;
>   struct eventfd_ctx *intx_trigger;
>   struct eventfd_ctx *msi_trigger;
> + struct eventfd_ctx *vblank_trigger;
> + bool display_event_mask;
>  
>   /*
>* Two caches are used to avoid mapping duplicated pages (eg.
> @@ -229,6 +231,7 @@ struct intel_vgpu {
>   struct idr object_idr;
>  
>   struct completion vblank_done;
> + int no_pageflip_count;
>  
>   u32 scan_nonprivbb;
>  };
> diff --git a/drivers/gpu/drm/i915/gvt/kvmgt.c 
> b/drivers/gpu/drm/i915/gvt/kvmgt.c
> index 269506300310..f30b7a5272e8 100644
> --- a/drivers/gpu/drm/i915/gvt/kvmgt.c
> +++ b/drivers/gpu/drm/i915/gvt/kvmgt.c
> @@ -1250,6 +1250,8 @@ static int 

Re: [PATCH 5/6] vringh: fix copy direction of vringh_iov_push_kern()

2019-09-23 Thread Alex Williamson
On Mon, 23 Sep 2019 21:03:30 +0800
Jason Wang  wrote:

> We want to copy from iov to buf, so the direction was wrong.
> 
> Signed-off-by: Jason Wang 
> ---
>  drivers/vhost/vringh.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)


Why is this included in the series?  Seems like an unrelated fix being
held up within a proposal for a new feature.  Thanks,

Alex
 
> diff --git a/drivers/vhost/vringh.c b/drivers/vhost/vringh.c
> index 08ad0d1f0476..a0a2d74967ef 100644
> --- a/drivers/vhost/vringh.c
> +++ b/drivers/vhost/vringh.c
> @@ -852,6 +852,12 @@ static inline int xfer_kern(void *src, void *dst, size_t 
> len)
>   return 0;
>  }
>  
> +static inline int kern_xfer(void *dst, void *src, size_t len)
> +{
> + memcpy(dst, src, len);
> + return 0;
> +}
> +
>  /**
>   * vringh_init_kern - initialize a vringh for a kernelspace vring.
>   * @vrh: the vringh to initialize.
> @@ -958,7 +964,7 @@ EXPORT_SYMBOL(vringh_iov_pull_kern);
>  ssize_t vringh_iov_push_kern(struct vringh_kiov *wiov,
>const void *src, size_t len)
>  {
> - return vringh_iov_xfer(wiov, (void *)src, len, xfer_kern);
> + return vringh_iov_xfer(wiov, (void *)src, len, kern_xfer);
>  }
>  EXPORT_SYMBOL(vringh_iov_push_kern);
>  



[GIT PULL] VFIO updates for v5.4-rc1

2019-09-20 Thread Alex Williamson
Hi Linus,

The following changes since commit e3fb13b7e47cd18b2bd067ea8a491020b4644baf:

  Merge tag 'modules-for-v5.3-rc6' of 
git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux (2019-08-23 09:22:00 
-0700)

are available in the Git repository at:

  git://github.com/awilliam/linux-vfio.git tags/vfio-v5.4-rc1

for you to fetch changes up to e6c5d727db0a86a3ff21aca6824aae87f3bc055f:

  Merge branches 'v5.4/vfio/alexey-tce-memory-free-v1', 
'v5.4/vfio/connie-re-arrange-v2', 'v5.4/vfio/hexin-pci-reset-v3', 
'v5.4/vfio/parav-mtty-uuid-v2' and 'v5.4/vfio/shameer-iova-list-v8' into 
v5.4/vfio/next (2019-08-23 11:26:24 -0600)


VFIO updates for v5.4-rc1

 - Fix spapr iommu error case (Alexey Kardashevskiy)

 - Consolidate region type definitions (Cornelia Huck)

 - Restore saved original PCI state on release (hexin)

 - Simplify mtty sample driver interrupt path (Parav Pandit)

 - Support for reporting valid IOVA regions to user (Shameer Kolothum)


Alex Williamson (1):
  Merge branches 'v5.4/vfio/alexey-tce-memory-free-v1', 
'v5.4/vfio/connie-re-arrange-v2', 'v5.4/vfio/hexin-pci-reset-v3', 
'v5.4/vfio/parav-mtty-uuid-v2' and 'v5.4/vfio/shameer-iova-list-v8' into 
v5.4/vfio/next

Alexey Kardashevskiy (1):
  vfio/spapr_tce: Fix incorrect tce_iommu_group memory free

Cornelia Huck (1):
  vfio: re-arrange vfio region definitions

Parav Pandit (1):
  vfio-mdev/mtty: Simplify interrupt generation

Shameer Kolothum (6):
  vfio/type1: Introduce iova list and add iommu aperture validity check
  vfio/type1: Check reserved region conflict and update iova list
  vfio/type1: Update iova list on detach
  vfio/type1: check dma map request is within a valid iova range
  vfio/type1: Add IOVA range capability support
  vfio/type1: remove duplicate retrieval of reserved regions

hexin (1):
  vfio_pci: Restore original state on release

 drivers/vfio/pci/vfio_pci.c |  17 +-
 drivers/vfio/vfio_iommu_spapr_tce.c |   9 +-
 drivers/vfio/vfio_iommu_type1.c | 518 +++-
 include/uapi/linux/vfio.h   |  71 +++--
 samples/vfio-mdev/mtty.c|  39 +--
 5 files changed, 583 insertions(+), 71 deletions(-)


Re: [PATCH v4 3/4] vfio: zpci: defining the VFIO headers

2019-09-19 Thread Alex Williamson
On Thu, 19 Sep 2019 16:55:57 -0400
Matthew Rosato  wrote:

> On 9/19/19 11:20 AM, Cornelia Huck wrote:
> > On Fri,  6 Sep 2019 20:13:50 -0400
> > Matthew Rosato  wrote:
> >   
> >> From: Pierre Morel 
> >>
> >> We define a new device region in vfio.h to be able to
> >> get the ZPCI CLP information by reading this region from
> >> userland.
> >>
> >> We create a new file, vfio_zdev.h to define the structure
> >> of the new region we defined in vfio.h
> >>
> >> Signed-off-by: Pierre Morel 
> >> Signed-off-by: Matthew Rosato 
> >> ---
> >>  include/uapi/linux/vfio.h  |  1 +
> >>  include/uapi/linux/vfio_zdev.h | 35 +++
> >>  2 files changed, 36 insertions(+)
> >>  create mode 100644 include/uapi/linux/vfio_zdev.h
> >>
> >> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> >> index 8f10748..8328c87 100644
> >> --- a/include/uapi/linux/vfio.h
> >> +++ b/include/uapi/linux/vfio.h
> >> @@ -371,6 +371,7 @@ struct vfio_region_gfx_edid {
> >>   * to do TLB invalidation on a GPU.
> >>   */
> >>  #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD  (1)
> >> +#define VFIO_REGION_SUBTYPE_ZDEV_CLP  (2)  
> > 
> > Using a subtype is fine, but maybe add a comment what this is for?
> >   
> 
> Fair point.  Maybe something like "IBM ZDEV CLP is used to pass zPCI
> device features to guest"

And if you're going to use a PCI vendor ID subtype, maintain consistent
naming, VFIO_REGION_SUBTYPE_IBM_ZPCI_CLP or something.  Ideally there'd
also be a reference to the struct provided through this region
otherwise it's rather obscure to find by looking for the call to
vfio_pci_register_dev_region() and ops defined for the region.  I
wouldn't be opposed to defining the region structure here too rather
than a separate file, but I guess you're following the example set by
ccw.

> >>  
> >>  /*
> >>   * The MSIX mappable capability informs that MSIX data of a BAR can be 
> >> mmapped
> >> diff --git a/include/uapi/linux/vfio_zdev.h 
> >> b/include/uapi/linux/vfio_zdev.h
> >> new file mode 100644
> >> index 000..55e0d6d
> >> --- /dev/null
> >> +++ b/include/uapi/linux/vfio_zdev.h
> >> @@ -0,0 +1,35 @@
> >> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> >> +/*
> >> + * Region definition for ZPCI devices
> >> + *
> >> + * Copyright IBM Corp. 2019
> >> + *
> >> + * Author(s): Pierre Morel 
> >> + */
> >> +
> >> +#ifndef _VFIO_ZDEV_H_
> >> +#define _VFIO_ZDEV_H_
> >> +
> >> +#include 
> >> +
> >> +/**
> >> + * struct vfio_region_zpci_info - ZPCI information.  
> > 
> > Hm... probably should also get some more explanation. E.g. is that
> > derived from a hardware structure?
> >   
> 
> The structure itself is not mapped 1:1 to a hardware structure, but it
> does serve as a collection of information that was derived from other
> hardware structures.
> 
> "Used for passing hardware feature information about a zpci device
> between the host and guest" ?
> 
> >> + *
> >> + */
> >> +struct vfio_region_zpci_info {
> >> +  __u64 dasm;
> >> +  __u64 start_dma;
> >> +  __u64 end_dma;
> >> +  __u64 msi_addr;
> >> +  __u64 flags;
> >> +  __u16 pchid;
> >> +  __u16 mui;
> >> +  __u16 noi;
> >> +  __u16 maxstbl;
> >> +  __u8 version;
> >> +  __u8 gid;
> >> +#define VFIO_PCI_ZDEV_FLAGS_REFRESH 1
> >> +  __u8 util_str[];
> >> +} __packed;
> >> +
> >> +#endif  

I'm half tempted to suggest that this struct could be exposed directly
through an info capability, the trouble is where.  It would be somewhat
awkward to pick an arbitrary BAR or config space region to expose this
info.  The VFIO_DEVICE_GET_INFO ioctl could include it, but we don't
support capabilities on that return structure and I'm not sure it's
worth implementing versus the solution here.  Just a thought.  Thanks,

Alex


Re: [PATCH] sample: vfio mdev display - Fix a missing error code in an error handling path

2019-09-18 Thread Alex Williamson
On Mon, 16 Sep 2019 22:22:40 +0200
Christophe JAILLET  wrote:

> 'ret' is known to be 0 at this point. So explicitly set it to -ENOMEM if
> 'framebuffer_alloc()' fails.
> 
> Signed-off-by: Christophe JAILLET 
> ---
>  samples/vfio-mdev/mdpy-fb.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/samples/vfio-mdev/mdpy-fb.c b/samples/vfio-mdev/mdpy-fb.c
> index 2719bb259653..6fe0187f47a2 100644
> --- a/samples/vfio-mdev/mdpy-fb.c
> +++ b/samples/vfio-mdev/mdpy-fb.c
> @@ -131,8 +131,10 @@ static int mdpy_fb_probe(struct pci_dev *pdev,
>width, height);
>  
>   info = framebuffer_alloc(sizeof(struct mdpy_fb_par), >dev);
> - if (!info)
> + if (!info) {
> + ret = -ENOMEM;
>   goto err_release_regions;
> + }
>   pci_set_drvdata(pdev, info);
>   par = info->par;
>  

I think you're only scratching the surface here, this looks more
complete to me:

diff --git a/samples/vfio-mdev/mdpy-fb.c b/samples/vfio-mdev/mdpy-fb.c
index 2719bb259653..a760e130bd0d 100644
--- a/samples/vfio-mdev/mdpy-fb.c
+++ b/samples/vfio-mdev/mdpy-fb.c
@@ -117,22 +117,27 @@ static int mdpy_fb_probe(struct pci_dev *pdev,
if (format != DRM_FORMAT_XRGB) {
pci_err(pdev, "format mismatch (0x%x != 0x%x)\n",
format, DRM_FORMAT_XRGB);
-   return -EINVAL;
+   ret = -EINVAL;
+   goto err_release_regions;
}
if (width < 100  || width > 1) {
pci_err(pdev, "width (%d) out of range\n", width);
-   return -EINVAL;
+   ret = -EINVAL;
+   goto err_release_regions;
}
if (height < 100 || height > 1) {
pci_err(pdev, "height (%d) out of range\n", height);
-   return -EINVAL;
+   ret = -EINVAL;
+   goto err_release_regions;
}
pci_info(pdev, "mdpy found: %dx%d framebuffer\n",
 width, height);
 
info = framebuffer_alloc(sizeof(struct mdpy_fb_par), >dev);
-   if (!info)
+   if (!info) {
+   ret = -ENOMEM;
goto err_release_regions;
+   }
pci_set_drvdata(pdev, info);
par = info->par;
 


Re: [RFC PATCH 2/4] mdev: introduce helper to set per device dma ops

2019-09-17 Thread Alex Williamson
On Tue, 10 Sep 2019 16:19:33 +0800
Jason Wang  wrote:

> This patch introduces mdev_set_dma_ops() which allows parent to set
> per device DMA ops. This help for the kernel driver to setup a correct
> DMA mappings.
> 
> Signed-off-by: Jason Wang 
> ---
>  drivers/vfio/mdev/mdev_core.c | 7 +++
>  include/linux/mdev.h  | 2 ++
>  2 files changed, 9 insertions(+)
> 
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> index b558d4cfd082..eb28552082d7 100644
> --- a/drivers/vfio/mdev/mdev_core.c
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -13,6 +13,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "mdev_private.h"
>  
> @@ -27,6 +28,12 @@ static struct class_compat *mdev_bus_compat_class;
>  static LIST_HEAD(mdev_list);
>  static DEFINE_MUTEX(mdev_list_lock);
>  
> +void mdev_set_dma_ops(struct mdev_device *mdev, struct dma_map_ops *ops)
> +{
> + set_dma_ops(>dev, ops);
> +}
> +EXPORT_SYMBOL(mdev_set_dma_ops);
> +

Why does mdev need to be involved here?  Your sample driver in 4/4 calls
this from its create callback, where it could just as easily call:

  set_dma_ops(mdev_dev(mdev), ops);

Thanks,
Alex

>  struct device *mdev_parent_dev(struct mdev_device *mdev)
>  {
>   return mdev->parent->dev;
> diff --git a/include/linux/mdev.h b/include/linux/mdev.h
> index 0ce30ca78db0..7195f40bf8bf 100644
> --- a/include/linux/mdev.h
> +++ b/include/linux/mdev.h
> @@ -145,4 +145,6 @@ struct device *mdev_parent_dev(struct mdev_device *mdev);
>  struct device *mdev_dev(struct mdev_device *mdev);
>  struct mdev_device *mdev_from_dev(struct device *dev);
>  
> +void mdev_set_dma_ops(struct mdev_device *mdev, struct dma_map_ops *ops);
> +
>  #endif /* MDEV_H */



Re: [PATCH 00/11] KVM: x86/mmu: Restore fast invalidate/zap flow

2019-09-16 Thread Alex Williamson
On Thu, 12 Sep 2019 19:46:01 -0700
Sean Christopherson  wrote:

> Restore the fast invalidate flow for zapping shadow pages and use it
> whenever vCPUs can be active in the VM.  This fixes (in theory, not yet
> confirmed) a regression reported by James Harvey where KVM can livelock
> in kvm_mmu_zap_all() when it's invoked in response to a memslot update.
> 
> The fast invalidate flow was removed as it was deemed to be unnecessary
> after its primary user, memslot flushing, was reworked to zap only the
> memslot in question instead of all shadow pages.  Unfortunately, zapping
> only the memslot being (re)moved during a memslot update introduced a
> regression for VMs with assigned devices.  Because we could not discern
> why zapping only the relevant memslot broke device assignment, or if the
> regression extended beyond device assignment, we reverted to zapping all
> shadow pages when a memslot is (re)moved.
> 
> The revert to "zap all" failed to account for subsequent changes that
> have been made to kvm_mmu_zap_all() between then and now.  Specifically,
> kvm_mmu_zap_all() now conditionally drops reschedules and drops mmu_lock
> if a reschedule is needed or if the lock is contended.  Dropping the lock
> allows other vCPUs to add shadow pages, and, with enough vCPUs, can cause
> kvm_mmu_zap_all() to get stuck in an infinite loop as it can never zap all
> pages before observing lock contention or the need to reschedule.
> 
> The reasoning behind having kvm_mmu_zap_all() conditionally reschedule was
> that it would only be used when the VM is inaccesible, e.g. when its
> mm_struct is dying or when the VM itself is being destroyed.  In that case,
> playing nice with the rest of the kernel instead of hogging cycles to free
> unused shadow pages made sense.
> 
> Since it's unlikely we'll root cause the device assignment regression any
> time soon, and that simply removing the conditional rescheduling isn't
> guaranteed to return us to a known good state, restore the fast invalidate
> flow for zapping on memslot updates, including mmio generation wraparound.
> Opportunisticaly tack on a bug fix and a couple enhancements.
> 
> Alex and James, it probably goes without saying... please test, especially
> patch 01/11 as a standalone patch as that'll likely need to be applied to
> stable branches, assuming it works.  Thanks!

It looks like Paolo already included patch 01/11 in v5.3, I tested that
and it behaves ok for the GPU assignment Windows issue.  I applied the
remaining 10 patches on v5.3 and tested those separately.  They also
behave well for this test case.

Tested-by: Alex Williamson 

Thanks,
Alex 

> 
> Sean Christopherson (11):
>   KVM: x86/mmu: Reintroduce fast invalidate/zap for flushing memslot
>   KVM: x86/mmu: Treat invalid shadow pages as obsolete
>   KVM: x86/mmu: Use fast invalidate mechanism to zap MMIO sptes
>   KVM: x86/mmu: Revert "Revert "KVM: MMU: show mmu_valid_gen in shadow
> page related tracepoints""
>   KVM: x86/mmu: Revert "Revert "KVM: MMU: add tracepoint for
> kvm_mmu_invalidate_all_pages""
>   KVM: x86/mmu: Revert "Revert "KVM: MMU: zap pages in batch""
>   KVM: x86/mmu: Revert "Revert "KVM: MMU: collapse TLB flushes when zap
> all pages""
>   KVM: x86/mmu: Revert "Revert "KVM: MMU: reclaim the zapped-obsolete
> page first""
>   KVM: x86/mmu: Revert "KVM: x86/mmu: Remove is_obsolete() call"
>   KVM: x86/mmu: Explicitly track only a single invalid mmu generation
>   KVM: x86/mmu: Skip invalid pages during zapping iff root_count is zero
> 
>  arch/x86/include/asm/kvm_host.h |   4 +-
>  arch/x86/kvm/mmu.c  | 154 
>  arch/x86/kvm/mmutrace.h |  42 +++--
>  arch/x86/kvm/x86.c  |   1 +
>  4 files changed, 173 insertions(+), 28 deletions(-)
> 



Re: [RFC V1 3/7] x86/ims: Add support for a new IMS irq domain

2019-09-13 Thread Alex Williamson
On Thu, 12 Sep 2019 18:32:04 -0700
Megha Dey  wrote:

> This patch adds support for the creation of a new IMS irq domain. It
> creates a new irq_chip associated with the IMS domain and adds the
> necessary domain operations to it.
> 
> Cc: Jacob Pan 
> Signed-off-by: Sanjay Kumar 
> Signed-off-by: Megha Dey 
> ---
>  arch/x86/include/asm/msi.h   |  4 ++
>  arch/x86/kernel/apic/Makefile|  1 +
>  arch/x86/kernel/apic/ims.c   | 93 
> 
>  arch/x86/kernel/apic/msi.c   |  4 +-
>  drivers/vfio/mdev/mdev_core.c|  6 +++
>  drivers/vfio/mdev/mdev_private.h |  1 -
>  include/linux/mdev.h |  2 +
>  7 files changed, 108 insertions(+), 3 deletions(-)
>  create mode 100644 arch/x86/kernel/apic/ims.c
> 
> diff --git a/arch/x86/include/asm/msi.h b/arch/x86/include/asm/msi.h
> index 25ddd09..51f9d25 100644
> --- a/arch/x86/include/asm/msi.h
> +++ b/arch/x86/include/asm/msi.h
> @@ -11,4 +11,8 @@ int pci_msi_prepare(struct irq_domain *domain, struct 
> device *dev, int nvec,
>  
>  void pci_msi_set_desc(msi_alloc_info_t *arg, struct msi_desc *desc);
>  
> +struct msi_domain_info;
> +
> +irq_hw_number_t msi_get_hwirq(struct msi_domain_info *info,
> + msi_alloc_info_t *arg);
>  #endif /* _ASM_X86_MSI_H */
> diff --git a/arch/x86/kernel/apic/Makefile b/arch/x86/kernel/apic/Makefile
> index a6fcaf16..75a2270 100644
> --- a/arch/x86/kernel/apic/Makefile
> +++ b/arch/x86/kernel/apic/Makefile
> @@ -12,6 +12,7 @@ obj-y   += hw_nmi.o
>  
>  obj-$(CONFIG_X86_IO_APIC)+= io_apic.o
>  obj-$(CONFIG_PCI_MSI)+= msi.o
> +obj-$(CONFIG_MSI_IMS)+= ims.o
>  obj-$(CONFIG_SMP)+= ipi.o
>  
>  ifeq ($(CONFIG_X86_64),y)
> diff --git a/arch/x86/kernel/apic/ims.c b/arch/x86/kernel/apic/ims.c
> new file mode 100644
> index 000..d9808a5
> --- /dev/null
> +++ b/arch/x86/kernel/apic/ims.c
> @@ -0,0 +1,93 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright © 2019 Intel Corporation.
> + *
> + * Author: Megha Dey 
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +/*
> + * Determine if a dev is mdev or not. Return NULL if not mdev device.
> + * Return mdev's parent dev if success.
> + */
> +static inline struct device *mdev_to_parent(struct device *dev)
> +{
> + struct device *ret = NULL;
> + struct device *(*fn)(struct device *dev);
> + struct bus_type *bus = symbol_get(mdev_bus_type);
> +
> + if (bus && dev->bus == bus) {
> + fn = symbol_get(mdev_dev_to_parent_dev);
> + ret = fn(dev);
> + symbol_put(mdev_dev_to_parent_dev);
> + symbol_put(mdev_bus_type);

Leaks a reference to the mdev module if dev->bus != bus.  The new
version of dev_is_mdev() unconditionally leaks a reference.  Thanks,

Alex
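
To make the imbalance concrete, here is a userspace model in which every get is paired with a put on every path, including the "not an mdev" path. The counters and helpers are hypothetical stand-ins for symbol_get()/symbol_put() and do not reflect the real module reference machinery.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference counters modelling symbol_get()/symbol_put(). */
static int model_bus_refs;
static int model_fn_refs;

/* Plays the role of mdev_bus_type; only its address matters. */
static int model_bus_type;

static const int *model_get_bus(void)
{
	model_bus_refs++;
	return &model_bus_type;
}

static void model_put_bus(void)
{
	model_bus_refs--;
}

static void model_get_fn(void)
{
	model_fn_refs++;
}

static void model_put_fn(void)
{
	model_fn_refs--;
}

/* Returns 1 if dev_bus is the modelled mdev bus, dropping all references. */
static int model_is_mdev(const int *dev_bus)
{
	const int *bus = model_get_bus();
	int match = 0;

	if (bus && dev_bus == bus) {
		model_get_fn();
		match = 1;
		model_put_fn();
	}

	if (bus)
		model_put_bus();	/* released whether or not the bus matched */

	assert(model_bus_refs >= 0 && model_fn_refs >= 0);
	return match;
}
```

The point is only that the put for the bus symbol sits outside the comparison, so a non-matching device does not leave a reference behind.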

> + }
> +
> + return ret;
> +}
> +
> +static struct pci_dev *ims_get_pci_dev(struct device *dev)
> +{
> + struct pci_dev *pdev;
> +
> + if (dev_is_mdev(dev)) {
> + struct device *parent = mdev_to_parent(dev);
> +
> + pdev = to_pci_dev(parent);
> + } else {
> + pdev = to_pci_dev(dev);
> + }
> +
> + return pdev;
> +}
> +
> +int dev_ims_prepare(struct irq_domain *domain, struct device *dev, int nvec,
> + msi_alloc_info_t *arg)
> +{
> + struct pci_dev *pdev = ims_get_pci_dev(dev);
> +
> + init_irq_alloc_info(arg, NULL);
> + arg->msi_dev = pdev;
> + arg->type = X86_IRQ_ALLOC_TYPE_MSIX;
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(dev_ims_prepare);
> +
> +#ifdef CONFIG_IRQ_REMAP
> +
> +static struct msi_domain_ops dev_ims_domain_ops = {
> + .get_hwirq  = msi_get_hwirq,
> + .msi_prepare= dev_ims_prepare,
> +};
> +
> +static struct irq_chip dev_ims_ir_controller = {
> + .name   = "IR-DEV-IMS",
> + .irq_unmask = dev_ims_unmask_irq,
> + .irq_mask   = dev_ims_mask_irq,
> + .irq_ack= irq_chip_ack_parent,
> + .irq_retrigger  = irq_chip_retrigger_hierarchy,
> + .irq_set_vcpu_affinity  = irq_chip_set_vcpu_affinity_parent,
> + .flags  = IRQCHIP_SKIP_SET_WAKE,
> + .irq_write_msi_msg  = dev_ims_write_msg,
> +};
> +
> +static struct msi_domain_info ims_ir_domain_info = {
> + .flags  = MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS |
> +   MSI_FLAG_MULTI_PCI_MSI | MSI_FLAG_PCI_MSIX,
> + .ops= _ims_domain_ops,
> + .chip   = _ims_ir_controller,
> + .handler= handle_edge_irq,
> + .handler_name   = "edge",
> +};
> +
> +struct irq_domain *arch_create_ims_irq_domain(struct irq_domain *parent)
> +{
> + return pci_msi_create_irq_domain(NULL, _ir_domain_info, parent);
> +}
> +
> +#endif
> diff --git a/arch/x86/kernel/apic/msi.c b/arch/x86/kernel/apic/msi.c
> index 435bcda..65da813 100644
> --- 

Re: [PATCH v3 0/5] Introduce variable length mdev alias

2019-09-13 Thread Alex Williamson
On Wed, 11 Sep 2019 16:38:49 +
Parav Pandit  wrote:

> > -Original Message-
> > From: linux-kernel-ow...@vger.kernel.org On Behalf Of Parav Pandit
> > Sent: Wednesday, September 11, 2019 10:31 AM
> > To: Alex Williamson 
> > Cc: Jiri Pirko ; kwankh...@nvidia.com;
> > coh...@redhat.com; da...@davemloft.net; k...@vger.kernel.org; linux-
> > ker...@vger.kernel.org; net...@vger.kernel.org
> > Subject: RE: [PATCH v3 0/5] Introduce variable length mdev alias
> > 
> > Hi Alex,
> >   
> > > -Original Message-
> > > From: Alex Williamson 
> > > Sent: Wednesday, September 11, 2019 8:56 AM
> > > To: Parav Pandit 
> > > Cc: Jiri Pirko ; kwankh...@nvidia.com;
> > > coh...@redhat.com; da...@davemloft.net; k...@vger.kernel.org; linux-
> > > ker...@vger.kernel.org; net...@vger.kernel.org
> > > Subject: Re: [PATCH v3 0/5] Introduce variable length mdev alias
> > >
> > > On Mon, 9 Sep 2019 20:42:32 +
> > > Parav Pandit  wrote:
> > >  
> > > > Hi Alex,
> > > >  
> > > > > -Original Message-
> > > > > From: Parav Pandit 
> > > > > Sent: Sunday, September 1, 2019 11:25 PM
> > > > > To: alex.william...@redhat.com; Jiri Pirko ;
> > > > > kwankh...@nvidia.com; coh...@redhat.com; da...@davemloft.net
> > > > > Cc: k...@vger.kernel.org; linux-kernel@vger.kernel.org;
> > > > > net...@vger.kernel.org; Parav Pandit 
> > > > > Subject: [PATCH v3 0/5] Introduce variable length mdev alias
> > > > >
> > > > > To have consistent naming for the netdevice of a mdev and to have
> > > > > consistent naming of the devlink port [1] of a mdev, which is
> > > > > formed using phys_port_name of the devlink port, current UUID is
> > > > > not usable because UUID is too long.
> > > > >
> > > > > UUID in string format is 36-characters long and in binary 128-bit.
> > > > > Both formats are not able to fit within 15 characters limit of
> > > > > > netdev name.
> > > > >
> > > > > It is desired to have mdev device naming consistent using UUID.
> > > > > So that widely used user space framework such as ovs [2] can make
> > > > > use of mdev representor in similar way as PCIe SR-IOV VF and PF  
> > > > > > representors.
> > > > >
> > > > > Hence,
> > > > > (a) mdev alias is created which is derived using sha1 from the
> > > > > > mdev name.
> > > > > (b) Vendor driver describes how long an alias should be for the
> > > > > child mdev created for a given parent.
> > > > > (c) Mdev aliases are unique at system level.
> > > > > (d) alias is created optionally whenever parent requested.
> > > > > This ensures that non networking mdev parents can function without
> > > > > alias creation overhead.
> > > > >
> > > > > This design is discussed at [3].
> > > > >
> > > > > An example systemd/udev extension will have,
> > > > >
> > > > > 1. netdev name created using mdev alias available in sysfs.
> > > > >
> > > > > mdev UUID=83b8f4f2-509f-382f-3c1e-e6bfe0fa1001
> > > > > mdev 12 character alias=cd5b146a80a5
> > > > >
> > > > > > netdev name of this mdev = enmcd5b146a80a5
> > > > > > Here en = Ethernet link
> > > > > > m = mediated device
> > > > >
> > > > > 2. devlink port phys_port_name created using mdev alias.
> > > > > devlink phys_port_name=pcd5b146a80a5
> > > > >
> > > > > This patchset enables mdev core to maintain unique alias for a mdev.
> > > > >
> > > > > Patch-1 Introduces mdev alias using sha1.
> > > > > Patch-2 Ensures that mdev alias is unique in a system.
> > > > > > Patch-3 Exposes mdev alias in a sysfs hierarchy, update
> > > > > > Documentation
> > > > > Patch-4 Introduces mdev_alias() API.
> > > > > Patch-5 Extends mtty driver to optionally provide alias generation.
> > > > > This also enables to test UUID based sha1 collision and trigger
> > > > > error handling for duplicate sha1 results.
> > > > >
> > > > > [1] http://man7.org/linux/man-pages/man8/devlink-port.8.html

Re: [PATCH v2] vfio/type1: avoid redundant PageReserved checking

2019-09-13 Thread Alex Williamson
On Mon, 2 Sep 2019 15:32:42 +0800
Ben Luo  wrote:

> On 2019/8/30 1:06 AM, Alex Williamson wrote:
> > On Fri, 30 Aug 2019 00:58:22 +0800
> > Ben Luo  wrote:
> >  
> >> On 2019/8/28 11:55 PM, Alex Williamson wrote:
> >>> On Wed, 28 Aug 2019 12:28:04 +0800
> >>> Ben Luo  wrote:
> >>> 
> >>>> currently, if the page is not a tail of compound page, it will be
> >>>> checked twice for the same thing.
> >>>>
> >>>> Signed-off-by: Ben Luo 
> >>>> ---
> >>>>drivers/vfio/vfio_iommu_type1.c | 3 +--
> >>>>1 file changed, 1 insertion(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/drivers/vfio/vfio_iommu_type1.c 
> >>>> b/drivers/vfio/vfio_iommu_type1.c
> >>>> index 054391f..d0f7346 100644
> >>>> --- a/drivers/vfio/vfio_iommu_type1.c
> >>>> +++ b/drivers/vfio/vfio_iommu_type1.c
> >>>> @@ -291,11 +291,10 @@ static int vfio_lock_acct(struct vfio_dma *dma, 
> >>>> long npage, bool async)
> >>>>static bool is_invalid_reserved_pfn(unsigned long pfn)
> >>>>{
> >>>>  if (pfn_valid(pfn)) {
> >>>> -bool reserved;
> >>>>  struct page *tail = pfn_to_page(pfn);
> >>>>  struct page *head = compound_head(tail);
> >>>> -reserved = !!(PageReserved(head));
> >>>>  if (head != tail) {
> >>>> +bool reserved = PageReserved(head);
> >>>>  /*
> >>>>   * "head" is not a dangling pointer
> >>>>   * (compound_head takes care of that)  
> >>> Thinking more about this, the code here was originally just a copy of
> >>> kvm_is_mmio_pfn() which was simplified in v3.12 with the commit below.
> >>> Should we instead do the same thing here?  Thanks,
> >>>
> >>> Alex  
> >> ok, and kvm_is_mmio_pfn() has also been updated since then, I will take
> >> a look at that and compose a new patch  
> > I'm not sure if the further updates are quite as relevant for vfio, but
> > appreciate your review of them.  Thanks,
> >
> > Alex  
> 
> After studying the related code, my understanding is as follows.
> 
> kvm_is_mmio_pfn() is used to find out whether a memory range is MMIO
> mapped, so that the proper MTRR type can be set in the spte.
> 
> is_invalid_reserved_pfn() is used in two scenarios:
> 1. To tell whether a page should be counted against the user's mlock
>    limits. As the function's name implies, all 'invalid' PFNs that are
>    not backed by a struct page, and reserved pages (including the zero
>    page and those from NVDIMM DAX), should be excluded.
> 2. To check whether we have got a valid and pinned pfn for a vma with
>    the VM_PFNMAP flag.
> 
> So, for the zero page and 'RAM' backed PFNs without a 'struct page',
> kvm_is_mmio_pfn() should return false because they are not MMIO and are
> cacheable, but is_invalid_reserved_pfn() should return true since they
> are truly reserved or invalid and should not be counted against the
> user's mlock limits.
> 
> For an fsdax page, get_user_pages() currently returns -EOPNOTSUPP, and
> VFIO returns this error code to the user as well. fsdax does not seem to
> be supported for now, so there is currently no chance of calling into
> is_invalid_reserved_pfn() with one. If fsdax is to be supported, not only
> this function but also vaddr_get_pfn() will need changes.
> 
> Now, with the assumption that PFNs of compound pages with the reserved
> bit set in the head will not be passed to is_invalid_reserved_pfn(), we
> can simplify this function to:
> 
> static bool is_invalid_reserved_pfn(unsigned long pfn)
> {
> 	if (pfn_valid(pfn))
> 		return PageReserved(pfn_to_page(pfn));
> 
> 	return true;
> }
> 
> But I am not sure whether the assumption above is true. If it is not, we
> still need to check the reserved bit of the head for a tail page, as this
> PATCH v2 does.

I believe what you've described is correct.  Andrea, have we missed
anything?  Thanks,

Alex
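
For readers following the simplification discussed above, here is a minimal user-space sketch of the proposed control flow. It is not kernel code: mock_pfn_valid(), mock_pfn_to_page() and struct mock_page are hypothetical stand-ins for pfn_valid(), pfn_to_page() and the PG_reserved test, and the four-entry memmap is invented purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page: only the reserved flag is modelled. */
struct mock_page {
	bool reserved;
};

/* Hypothetical memmap: PFNs 0..3 are backed by a page, all others are not. */
#define MOCK_NR_PAGES 4
static struct mock_page mock_memmap[MOCK_NR_PAGES] = {
	{ .reserved = false },	/* ordinary RAM page */
	{ .reserved = true },	/* e.g. the zero page */
	{ .reserved = false },
	{ .reserved = true },
};

/* Stand-in for pfn_valid(): true only if a mock page backs the PFN. */
static bool mock_pfn_valid(unsigned long pfn)
{
	return pfn < MOCK_NR_PAGES;
}

/* Stand-in for pfn_to_page(); callers must check validity first. */
static struct mock_page *mock_pfn_to_page(unsigned long pfn)
{
	assert(mock_pfn_valid(pfn));
	return &mock_memmap[pfn];
}

/*
 * Same shape as the simplified function proposed in the thread:
 * a valid PFN is judged by its own reserved flag, and a PFN with
 * no backing page is always treated as invalid/reserved.
 */
static bool mock_is_invalid_reserved_pfn(unsigned long pfn)
{
	if (mock_pfn_valid(pfn))
		return mock_pfn_to_page(pfn)->reserved;

	return true;
}
```

With this model, PFN 0 would be counted against the mlock limit, while PFN 1 (reserved) and PFN 100 (no backing page) would not. Whether the head page of a compound page still needs checking is exactly the open question raised above; the sketch does not answer it.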


> >  
> >>> commit 11feeb498086a3a5907b8148bdf1786a9b18fc55
> >>> Author: Andrea Arcangeli 
> >>> Date:   Thu Jul 25 03:04:38 2013 +0200
> >>>
> >>>   kvm: optimize away THP checks in kvm_is_mmio_pfn()
> >>>   
> >>>   The checks on PG_reserved in the page structure on h

Re: [PATCH v6 0/3] genirq/vfio: Introduce irq_update_devid() and optimize VFIO irq ops

2019-09-13 Thread Alex Williamson
On Tue, 10 Sep 2019 14:30:16 +0800
Ben Luo  wrote:

> A friendly reminder.

The vfio patch looks ok to me.  Thomas, do you have further comments or
a preference on how to merge these?  I'd tend to prefer the vfio
changes through my branch for testing and can pull the irq changes with
your approval, but we could do the reverse or split them and I could
follow with the vfio change once the irq changes are in mainline.
Thanks,

Alex

> On 2019/9/2 12:01 PM, Ben Luo wrote:
> > Currently, VFIO takes a free-then-request-irq way to do interrupt
> > affinity setting and masking/unmasking for a VM with device passthru
> > via VFIO. Sometimes it only changes the cookie data of irqaction or even
> > changes nothing. The free-then-request-irq not only adds more latency,
> > but also increases the risk of losing interrupt, which may lead to a
> > VM hang forever in waiting for IO completion
> >
> > This patchset solved the issue by:
> > Patch 2 introduces irq_update_devid() to only update dev_id of irqaction
> > Patch 3 make use of this function and optimize irq operations in VFIO
> >
> > changes from v5:
> >   - Patch 3: remove an error log to avoid a potential denial-of-service attack
> >   - Patch 3: fix typo in comment
> >
> > changes from v4:
> >   - Patch 3: follow the previous behavior to disable interrupt on error path
> >   - Patch 3: do irqbypass registration before update or free the interrupt
> >   - Patch 3: add more comments
> >
> > changes from v3:
> >   - Patch 2: rename the new function to irq_update_devid()
> >   - Patch 2: use disable_irq() to avoid a twist for threaded interrupt
> >   - ALL: amend commit messages and code comments
> >
> > changes from v2:
> >   - reformat to avoid quoted string split across lines and etc.
> >
> > changes from v1:
> >   - add Patch 1 to enhance error recovery etc. in free irq per tglx's 
> > comments
> >   - enhance error recovery code and debugging info in irq_update_devid
> >   - use __must_check in external referencing of this function
> >   - use EXPORT_SYMBOL_GPL for irq_update_devid
> >   - reformat code of patch 3 for better readability
> >
> > Ben Luo (3):
> >genirq: enhance error recovery code in free irq
> >genirq: introduce irq_update_devid()
> >vfio/pci: make use of irq_update_devid() and optimize irq ops
> >
> >   drivers/vfio/pci/vfio_pci_intrs.c | 118 
> > ++
> >   include/linux/interrupt.h |   3 +
> >   kernel/irq/manage.c   | 105 +
> >   3 files changed, 177 insertions(+), 49 deletions(-)
> >  
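
The cases described in the cover letter above (the trigger changes, stays the same, or is removed) can be sketched as a small user-space state machine. This is only an illustration under stated assumptions: struct mock_irqaction, enum trigger_action and apply_trigger() are hypothetical names, and the real irq_update_devid() in the series also handles locking and threaded handlers, none of which is modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the one irqaction attached to an MSI vector. */
struct mock_irqaction {
	const void *dev_id;	/* the cookie, i.e. the eventfd trigger */
	bool requested;		/* true while the handler is installed */
};

enum trigger_action {
	TRIGGER_NOTHING,	/* no old and no new trigger */
	TRIGGER_REQUEST_NEW,	/* no old trigger: install the handler */
	TRIGGER_DISABLE,	/* new trigger is NULL: remove the handler */
	TRIGGER_BOUNCE_ONLY,	/* same trigger: only re-register the bypass */
	TRIGGER_UPDATE_DEVID,	/* different trigger: swap dev_id in place */
};

/*
 * Decide what to do for a new trigger and apply it to the model.
 * The point of the series is the last case: the handler stays
 * installed and only the cookie changes, so no interrupt can be
 * lost between a free and a subsequent request.
 */
static enum trigger_action apply_trigger(struct mock_irqaction *act,
					 const void *new_trigger)
{
	assert(act != NULL);

	if (!act->requested) {
		if (!new_trigger)
			return TRIGGER_NOTHING;
		act->dev_id = new_trigger;
		act->requested = true;
		return TRIGGER_REQUEST_NEW;
	}

	if (!new_trigger) {
		act->dev_id = NULL;
		act->requested = false;
		return TRIGGER_DISABLE;
	}

	if (act->dev_id == new_trigger)
		return TRIGGER_BOUNCE_ONLY;

	act->dev_id = new_trigger;	/* handler remains installed */
	return TRIGGER_UPDATE_DEVID;
}
```

In the free-then-request approach, both of the last two cases would pass through a window where requested is false; the sketch keeps it true throughout.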



Re: [PATCH v3 0/5] Introduce variable length mdev alias

2019-09-11 Thread Alex Williamson
On Mon, 9 Sep 2019 20:42:32 +
Parav Pandit  wrote:

> Hi Alex,
> 
> > -Original Message-
> > From: Parav Pandit 
> > Sent: Sunday, September 1, 2019 11:25 PM
> > To: alex.william...@redhat.com; Jiri Pirko ;
> > kwankh...@nvidia.com; coh...@redhat.com; da...@davemloft.net
> > Cc: k...@vger.kernel.org; linux-kernel@vger.kernel.org;
> > net...@vger.kernel.org; Parav Pandit 
> > Subject: [PATCH v3 0/5] Introduce variable length mdev alias
> > 
> > To have consistent naming for the netdevice of a mdev and to have consistent
> > naming of the devlink port [1] of a mdev, which is formed using
> > phys_port_name of the devlink port, current UUID is not usable because UUID
> > is too long.
> > 
> > UUID in string format is 36-characters long and in binary 128-bit.
> > Both formats are not able to fit within 15 characters limit of netdev name.
> > 
> > It is desired to have mdev device naming consistent using UUID.
> > So that widely used user space framework such as ovs [2] can make use of
> > mdev representor in similar way as PCIe SR-IOV VF and PF representors.
> > 
> > Hence,
> > (a) mdev alias is created which is derived using sha1 from the mdev name.
> > (b) Vendor driver describes how long an alias should be for the child mdev
> > created for a given parent.
> > (c) Mdev aliases are unique at system level.
> > (d) alias is created optionally whenever parent requested.
> > This ensures that non networking mdev parents can function without alias
> > creation overhead.
> > 
> > This design is discussed at [3].
> > 
> > An example systemd/udev extension will have,
> > 
> > 1. netdev name created using mdev alias available in sysfs.
> > 
> > mdev UUID=83b8f4f2-509f-382f-3c1e-e6bfe0fa1001
> > mdev 12 character alias=cd5b146a80a5
> > 
> > netdev name of this mdev = enmcd5b146a80a5
> > Here en = Ethernet link
> > m = mediated device
> > 
> > 2. devlink port phys_port_name created using mdev alias.
> > devlink phys_port_name=pcd5b146a80a5
> > 
> > This patchset enables mdev core to maintain unique alias for a mdev.
> > 
> > Patch-1 Introduces mdev alias using sha1.
> > Patch-2 Ensures that mdev alias is unique in a system.
> > Patch-3 Exposes mdev alias in a sysfs hierarchy, update Documentation
> > Patch-4 Introduces mdev_alias() API.
> > Patch-5 Extends mtty driver to optionally provide alias generation.
> > This also enables to test UUID based sha1 collision and trigger error
> > handling for duplicate sha1 results.
> > 
> > [1] http://man7.org/linux/man-pages/man8/devlink-port.8.html
> > [2] https://docs.openstack.org/os-vif/latest/user/plugins/ovs.html
> > [3] https://patchwork.kernel.org/cover/11084231/
> > 
> > ---
> > Changelog:
> > v2->v3:
> >  - Addressed comment from Yunsheng Lin
> >  - Changed strcmp() ==0 to !strcmp()
> >  - Addressed comment from Cornelia Huck
> >  - Merged sysfs Documentation patch with sysfs patch
> >  - Added more description for alias return value  
> 
> Did you get a chance to review this updated series?
> I addressed Cornelia's and your comments.
> I do not think allocating alias memory twice, once for comparison and
> once for storing, is a good idea, nor is moving the alias generation
> logic inside the mdev_list_lock. So I didn't address that suggestion
> of Cornelia's.

Sorry, I'm at LPC this week.  I agree, I don't think the double
allocation is necessary, I thought the comment was sufficient to
clarify null'ing the variable.  It's awkward, but seems correct.

I'm not sure what we do with this patch series though, has the real
consumer of this even been proposed?  It feels optimistic to include at
this point.  We've used the sample driver as a placeholder in the past
for mdev_uuid(), but we arrived at that via a conversion rather than
explicitly adding the API.  Please let me know where the consumer
patches stand, perhaps it would make more sense for them to go in
together rather than risk adding an unused API.  Thanks,

Alex
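
The length constraint from the cover letter can be checked with simple arithmetic: the 3-character prefix "enm" plus a 12-character alias gives exactly 15 characters, whereas the 36-character UUID string cannot fit. A minimal user-space sketch follows; build_netdev_name() and NETDEV_NAME_MAX are hypothetical names that are not part of the posted series, and the 15-character limit is the figure quoted in the cover letter.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Limit quoted in the cover letter; the name is hypothetical. */
#define NETDEV_NAME_MAX 15

/*
 * Concatenate prefix and alias into out.
 * Returns 0 on success, or -1 if the result would exceed the
 * 15-character limit or the caller's buffer (including the NUL).
 */
static int build_netdev_name(char *out, size_t out_len,
			     const char *prefix, const char *alias)
{
	size_t need;

	assert(out && prefix && alias);

	need = strlen(prefix) + strlen(alias);
	if (need > NETDEV_NAME_MAX || need >= out_len)
		return -1;

	snprintf(out, out_len, "%s%s", prefix, alias);
	return 0;
}
```

Called with "enm" and the example alias cd5b146a80a5, this yields the name enmcd5b146a80a5 from the cover letter; called with the full UUID string in place of the alias, it fails.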

> > v1->v2:
> >  - Corrected a typo from 'and' to 'an'
> >  - Addressed comments from Alex Williamson
> >  - Kept mdev_device naturally aligned
> >  - Added error checking for crypt_*() calls
> >  - Moved alias NULL check at beginning
> >  - Added mdev_alias() API
> >  - Updated mtty driver to show example mdev_alias() usage
> >  - Changed return type of generate_alias() from int to char*
> > v0->v1:
> >  - Addressed comments from Alex Williamson, Cornelia Huck and Mark
> > Bloch
> >  - Moved alias length check outside of the parent lock
> 

Re: [PATCH] PCI: Add PCIE ACS quirk for IPROC PAXB

2019-09-05 Thread Alex Williamson
On Thu, 5 Sep 2019 17:26:49 -0500
Bjorn Helgaas  wrote:

> [+cc Alex]
> 
> On Tue, Aug 20, 2019 at 10:09:45AM +0530, Srinath Mannam wrote:
> > From: Abhinav Ratna 
> > 
> > IPROC PAXB RC doesn't support ACS capabilities and control registers.
> > Add quirk to have separate IOMMU groups for all EPs and functions connected
> > to root port, by masking RR/CR/SV/UF bits.
> > 
> > Signed-off-by: Abhinav Ratna 
> > Signed-off-by: Srinath Mannam   
> 
> I tentatively applied this to pci/misc with Scott's ack for v5.4.
> 
> I tweaked the patch itself to follow the style of similar quirks
> (interdiff is below, plus a diff of the commit log).  Please make sure
> I didn't break it.
> 
> I also went out on a limb and reworded the comment to give what I
> *think* is the justification for this patch, as opposed to merely a
> description of the code.  I'm making a lot of assumptions there, so
> please confirm that they're correct, or supply alternate justification
> if they're not.

Agreed, this really needs to be the vendor vouching for ACS equivalent
functionality, not simply splitting IOMMU groups because it's
inconvenient.  Thanks,

Alex
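
As a sanity check on the interdiff quoted below, the posted form of the quirk and the restyled form can be compared in user space. This is a sketch only: quirk_as_posted() and quirk_as_applied() are hypothetical names for the two function bodies with the unused pci_dev argument dropped, and the PCI_ACS_* values are copied from include/uapi/linux/pci_regs.h.

```c
#include <assert.h>
#include <stdint.h>

/* ACS capability bits, values as in include/uapi/linux/pci_regs.h. */
#define PCI_ACS_SV 0x0001	/* Source Validation */
#define PCI_ACS_TB 0x0002	/* Translation Blocking */
#define PCI_ACS_RR 0x0004	/* P2P Request Redirect */
#define PCI_ACS_CR 0x0008	/* P2P Completion Redirect */
#define PCI_ACS_UF 0x0010	/* Upstream Forwarding */

/* Body of the quirk as originally posted. */
static int quirk_as_posted(uint16_t acs_flags)
{
	uint16_t flags = (PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF | PCI_ACS_SV);
	int ret = acs_flags & ~flags ? 0 : 1;

	return ret;
}

/* Body of the quirk after the style tweak in the interdiff. */
static int quirk_as_applied(uint16_t acs_flags)
{
	acs_flags &= ~(PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF);

	return acs_flags ? 0 : 1;
}

/* Exhaustively confirm both forms agree for every 16-bit flag value. */
static void check_quirks_agree(void)
{
	uint32_t f;

	for (f = 0; f <= 0xFFFF; f++)
		assert(quirk_as_posted((uint16_t)f) == quirk_as_applied((uint16_t)f));
}
```

Both forms return 1 when only SV/RR/CR/UF are requested and 0 as soon as any other bit, such as Translation Blocking, is requested. The equivalence says nothing about whether the hardware actually provides the isolation, which is the point of the reply above.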

> 
> > ---
> >  drivers/pci/quirks.c | 16 
> >  1 file changed, 16 insertions(+)
> > 
> > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > index 0f16acc..f9584c0 100644
> > --- a/drivers/pci/quirks.c
> > +++ b/drivers/pci/quirks.c
> > @@ -4466,6 +4466,21 @@ static int pci_quirk_mf_endpoint_acs(struct pci_dev 
> > *dev, u16 acs_flags)
> > return acs_flags ? 0 : 1;
> >  }
> >  
> > +static int pcie_quirk_brcm_bridge_acs(struct pci_dev *dev, u16 acs_flags)
> > +{
> > +   /*
> > +* IPROC PAXB RC doesn't support ACS capabilities and control registers.
> > +* Add quirk to to have separate IOMMU groups for all EPs and functions
> > +* connected to root port, by masking RR/CR/SV/UF bits.
> > +*/
> > +
> > +   u16 flags = (PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF | PCI_ACS_SV);
> > +   int ret = acs_flags & ~flags ? 0 : 1;
> > +
> > +   return ret;
> > +}
> > +
> > +
> >  static const struct pci_dev_acs_enabled {
> > u16 vendor;
> > u16 device;
> > @@ -4559,6 +4574,7 @@ static const struct pci_dev_acs_enabled {
> > { PCI_VENDOR_ID_AMPERE, 0xE00A, pci_quirk_xgene_acs },
> > { PCI_VENDOR_ID_AMPERE, 0xE00B, pci_quirk_xgene_acs },
> > { PCI_VENDOR_ID_AMPERE, 0xE00C, pci_quirk_xgene_acs },
> > +   { PCI_VENDOR_ID_BROADCOM, 0xD714, pcie_quirk_brcm_bridge_acs },
> > { 0 }
> >  };  
> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 77c0330ac922..2edbce35e8c5 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -4466,21 +4466,19 @@ static int pci_quirk_mf_endpoint_acs(struct pci_dev 
> *dev, u16 acs_flags)
>   return acs_flags ? 0 : 1;
>  }
>  
> -static int pcie_quirk_brcm_bridge_acs(struct pci_dev *dev, u16 acs_flags)
> +static int pci_quirk_brcm_acs(struct pci_dev *dev, u16 acs_flags)
>  {
>   /*
> -  * IPROC PAXB RC doesn't support ACS capabilities and control registers.
> -  * Add quirk to to have separate IOMMU groups for all EPs and functions
> -  * connected to root port, by masking RR/CR/SV/UF bits.
> +  * iProc PAXB Root Ports don't advertise an ACS capability, but
> +  * they do not allow peer-to-peer transactions between Root Ports.
> +  * Allow each Root Port to be in a separate IOMMU group by masking
> +  * SV/RR/CR/UF bits.
>*/
> + acs_flags &= ~(PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF);
>  
> - u16 flags = (PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF | PCI_ACS_SV);
> - int ret = acs_flags & ~flags ? 0 : 1;
> -
> - return ret;
> + return acs_flags ? 0 : 1;
>  }
>  
> -
>  static const struct pci_dev_acs_enabled {
>   u16 vendor;
>   u16 device;
> @@ -4574,7 +4572,7 @@ static const struct pci_dev_acs_enabled {
>   { PCI_VENDOR_ID_AMPERE, 0xE00A, pci_quirk_xgene_acs },
>   { PCI_VENDOR_ID_AMPERE, 0xE00B, pci_quirk_xgene_acs },
>   { PCI_VENDOR_ID_AMPERE, 0xE00C, pci_quirk_xgene_acs },
> - { PCI_VENDOR_ID_BROADCOM, 0xD714, pcie_quirk_brcm_bridge_acs },
> + { PCI_VENDOR_ID_BROADCOM, 0xD714, pci_quirk_brcm_acs },
>   { 0 }
>  };
>  
> 
> 
> 
> @@ -1,49 +1,49 @@
> -commit b50ae502eff0
> +commit 46b2c32df7a4
>  Author: Abhinav Ratna 
>  Date:   Tue Aug 20 10:09:45 2019 +0530
>  
> -PCI: Add PCIE ACS quirk for IPROC PAXB
> +PCI: Add ACS quirk for iProc PAXB
>  
> -IPROC PAXB RC doesn't support ACS capabilities and control registers.
> -Add quirk to have separate IOMMU groups for all EPs and functions 
> connected
> -to root port, by masking RR/CR/SV/UF bits.
> +iProc PAXB Root Ports don't advertise an ACS capability, but they do not
> +allow peer-to-peer transactions between Root Ports.  Add an ACS quirk so
> +each Root Port can be in a separate IOMMU group.
>  
> +[bhelgaas: commit log, comment, use common implementation style]
>  Link: 
> 

Re: [PATCH v5 3/3] vfio/pci: make use of irq_update_devid and optimize irq ops

2019-08-30 Thread Alex Williamson
On Fri, 30 Aug 2019 16:42:06 +0800
Ben Luo  wrote:

> When userspace (e.g. qemu) triggers a switch between KVM
> irqfd and userspace eventfd, only dev_id of irqaction
> (i.e. the "trigger" in this patch's context) will be
> changed, but a free-then-request-irq action is taken in
> current code. And, irq affinity setting in VM will also
> trigger a free-then-request-irq action, which actually
> changes nothing, but only need to bounce the irqbypass
> registration in case that posted-interrupt is in use.
> 
> This patch makes use of irq_update_devid() and optimize
> both cases above, which reduces the risk of losing interrupt
> and also cuts some overhead.
> 
> Signed-off-by: Ben Luo 
> ---
>  drivers/vfio/pci/vfio_pci_intrs.c | 124 
> ++
>  1 file changed, 87 insertions(+), 37 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c 
> b/drivers/vfio/pci/vfio_pci_intrs.c
> index 3fa3f72..d3a93d7 100644
> --- a/drivers/vfio/pci/vfio_pci_intrs.c
> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
> @@ -284,70 +284,120 @@ static int vfio_msi_enable(struct vfio_pci_device 
> *vdev, int nvec, bool msix)
>  static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
> int vector, int fd, bool msix)
>  {
> + struct eventfd_ctx *trigger = NULL;
>   struct pci_dev *pdev = vdev->pdev;
> - struct eventfd_ctx *trigger;
>   int irq, ret;
>  
>   if (vector < 0 || vector >= vdev->num_ctx)
>   return -EINVAL;
>  
> + if (fd >= 0) {
> + trigger = eventfd_ctx_fdget(fd);
> + if (IS_ERR(trigger)) {
> + /* oops, going to disable this interrupt */
> + dev_info(&pdev->dev,
> +  "get ctx error on bad fd: %d for vector:%d\n",
> +  fd, vector);

I think a user could trigger this maliciously as a denial of service by
simply providing a bogus file descriptor.  The user is informed of the
error by the return value, why do we need to spam the logs?

> + }
> + }
> +
>   irq = pci_irq_vector(pdev, vector);
>  
> + /*
> +  * 'trigger' is NULL or invalid, disable the interrupt
> +  * 'trigger' is same as before, only bounce the bypass registration
> +  * 'trigger' is a new invalid one, update it to irqaction and other

s/invalid/valid/

> +  * data structures referencing to the old one; fallback to disable
> +  * the interrupt on error
> +  */
>   if (vdev->ctx[vector].trigger) {
> - free_irq(irq, vdev->ctx[vector].trigger);
> + /*
> +  * even if the trigger is unchanged we need to bounce the
> +  * interrupt bypass connection to allow affinity changes in
> +  * the guest to be realized.
> +  */
>   irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
> - kfree(vdev->ctx[vector].name);
> - eventfd_ctx_put(vdev->ctx[vector].trigger);
> - vdev->ctx[vector].trigger = NULL;
> +
> + if (vdev->ctx[vector].trigger == trigger) {
> + /* avoid duplicated referencing to the same trigger */
> + eventfd_ctx_put(trigger);
> +
> + } else if (trigger && !IS_ERR(trigger)) {
> + ret = irq_update_devid(irq,
> +        vdev->ctx[vector].trigger, trigger);
> + if (unlikely(ret)) {
> + dev_info(&pdev->dev,
> +  "update devid of %d (token %p) failed: %d\n",
> +  irq, vdev->ctx[vector].trigger, ret);
> + eventfd_ctx_put(trigger);
> + free_irq(irq, vdev->ctx[vector].trigger);
> + kfree(vdev->ctx[vector].name);
> + eventfd_ctx_put(vdev->ctx[vector].trigger);
> + vdev->ctx[vector].trigger = NULL;
> + return ret;
> + }
> + eventfd_ctx_put(vdev->ctx[vector].trigger);
> + vdev->ctx[vector].producer.token = trigger;
> + vdev->ctx[vector].trigger = trigger;
> +
> + } else {
> + free_irq(irq, vdev->ctx[vector].trigger);
> + kfree(vdev->ctx[vector].name);
> + eventfd_ctx_put(vdev->ctx[vector].trigger);
> + vdev->ctx[vector].trigger = NULL;
> + }
>   }
>  
>   if (fd < 0)
>   return 0;
> + else if (IS_ERR(trigger))
> + return PTR_ERR(trigger);
>  
> - vdev->ctx[vector].name = kasprintf(GFP_KERNEL, "vfio-msi%s[%d](%s)",
> -msix ? "x" : "", vector,
> -pci_name(pdev));
> - if 

Re: [PATCH v2] vfio/type1: avoid redundant PageReserved checking

2019-08-29 Thread Alex Williamson
On Fri, 30 Aug 2019 00:58:22 +0800
Ben Luo  wrote:

> On 2019/8/28 11:55 PM, Alex Williamson wrote:
> > On Wed, 28 Aug 2019 12:28:04 +0800
> > Ben Luo  wrote:
> >  
> >> currently, if the page is not a tail of compound page, it will be
> >> checked twice for the same thing.
> >>
> >> Signed-off-by: Ben Luo 
> >> ---
> >>   drivers/vfio/vfio_iommu_type1.c | 3 +--
> >>   1 file changed, 1 insertion(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/vfio/vfio_iommu_type1.c 
> >> b/drivers/vfio/vfio_iommu_type1.c
> >> index 054391f..d0f7346 100644
> >> --- a/drivers/vfio/vfio_iommu_type1.c
> >> +++ b/drivers/vfio/vfio_iommu_type1.c
> >> @@ -291,11 +291,10 @@ static int vfio_lock_acct(struct vfio_dma *dma, long 
> >> npage, bool async)
> >>   static bool is_invalid_reserved_pfn(unsigned long pfn)
> >>   {
> >>if (pfn_valid(pfn)) {
> >> -  bool reserved;
> >>struct page *tail = pfn_to_page(pfn);
> >>struct page *head = compound_head(tail);
> >> -  reserved = !!(PageReserved(head));
> >>if (head != tail) {
> >> +  bool reserved = PageReserved(head);
> >>/*
> >> * "head" is not a dangling pointer
> >> * (compound_head takes care of that)  
> > Thinking more about this, the code here was originally just a copy of
> > kvm_is_mmio_pfn() which was simplified in v3.12 with the commit below.
> > Should we instead do the same thing here?  Thanks,
> >
> > Alex  
> ok, and kvm_is_mmio_pfn() has also been updated since then, I will take 
> a look at that and compose a new patch

I'm not sure if the further updates are quite as relevant for vfio, but
appreciate your review of them.  Thanks,

Alex

> >
> > commit 11feeb498086a3a5907b8148bdf1786a9b18fc55
> > Author: Andrea Arcangeli 
> > Date:   Thu Jul 25 03:04:38 2013 +0200
> >
> >  kvm: optimize away THP checks in kvm_is_mmio_pfn()
> >  
> >  The checks on PG_reserved in the page structure on head and tail pages
> >  aren't necessary because split_huge_page wouldn't transfer the
> >  PG_reserved bit from head to tail anyway.
> >  
> >  This was a forward-thinking check done in the case PageReserved was
> >  set by a driver-owned page mapped in userland with something like
> >  remap_pfn_range in a VM_PFNMAP region, but using hugepmds (not
> >  possible right now). It was meant to be very safe, but it's overkill
> >  as it's unlikely split_huge_page could ever run without the driver
> >  noticing and tearing down the hugepage itself.
> >  
> >  And if a driver in the future will really want to map a reserved
> >  hugepage in userland using an huge pmd it should simply take care of
> >  marking all subpages reserved too to keep KVM safe. This of course
> >  would require such a hypothetical driver to tear down the huge pmd
> >  itself and splitting the hugepage itself, instead of relaying on
> >  split_huge_page, but that sounds very reasonable, especially
> >  considering split_huge_page wouldn't currently transfer the reserved
> >  bit anyway.
> >  
> >  Signed-off-by: Andrea Arcangeli 
> >  Signed-off-by: Gleb Natapov 
> >
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index d2836788561e..0fc25aed79a8 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -102,28 +102,8 @@ static bool largepages_enabled = true;
> >   
> >   bool kvm_is_mmio_pfn(pfn_t pfn)
> >   {
> > -   if (pfn_valid(pfn)) {
> > -   int reserved;
> > -   struct page *tail = pfn_to_page(pfn);
> > -   struct page *head = compound_trans_head(tail);
> > -   reserved = PageReserved(head);
> > -   if (head != tail) {
> > -   /*
> > -* "head" is not a dangling pointer
> > -* (compound_trans_head takes care of that)
> > -* but the hugepage may have been splitted
> > -* from under us (and we may not hold a
> > -* reference count on the head page so it can
> > -* be reused before we run PageReferenced), so
> > -* we've to check PageTail before returning
> > -* what we just read.
> > -*/
> > -   smp_rmb();
> > -   if (PageTail(tail))
> > -   return reserved;
> > -   }
> > -   return PageReserved(tail);
> > -   }
> > +   if (pfn_valid(pfn))
> > +   return PageReserved(pfn_to_page(pfn));
> >   
> >  return true;
> >   }  



Re: [PATCH v4 3/3] vfio/pci: make use of irq_update_devid and optimize irq ops

2019-08-29 Thread Alex Williamson
On Thu, 29 Aug 2019 13:40:59 +0800
Ben Luo  wrote:

> On 2019/8/29 1:23 AM, Alex Williamson wrote:
> > On Wed, 28 Aug 2019 18:08:02 +0800
> > Ben Luo  wrote:
> >  
> >> On 2019/8/28 4:33 AM, Alex Williamson wrote:
> >>> On Thu, 22 Aug 2019 23:34:43 +0800
> >>> Ben Luo  wrote:
> >>> 
> >>>> When userspace (e.g. qemu) triggers a switch between KVM
> >>>> irqfd and userspace eventfd, only dev_id of irq action
> >>>> (i.e. the "trigger" in this patch's context) will be
> >>>> changed, but a free-then-request-irq action is taken in
> >>>> current code. And, irq affinity setting in VM will also
> >>>> trigger a free-then-request-irq action, which actually
> >>>> changes nothing, but only fires a producer re-registration
> >>>> to update irte in case that posted-interrupt is in use.
> >>>>
> >>>> This patch makes use of irq_update_devid() and optimize
> >>>> both cases above, which reduces the risk of losing interrupt
> >>>> and also cuts some overhead.
> >>>>
> >>>> Signed-off-by: Ben Luo 
> >>>> ---
> >>>>drivers/vfio/pci/vfio_pci_intrs.c | 112 
> >>>> +-
> >>>>1 file changed, 74 insertions(+), 38 deletions(-)
> >>>>
> >>>> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c 
> >>>> b/drivers/vfio/pci/vfio_pci_intrs.c
> >>>> index 3fa3f72..60d3023 100644
> >>>> --- a/drivers/vfio/pci/vfio_pci_intrs.c
> >>>> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
> >>>> @@ -284,70 +284,106 @@ static int vfio_msi_enable(struct vfio_pci_device 
> >>>> *vdev, int nvec, bool msix)
> >>>>static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
> >>>>int vector, int fd, bool msix)
> >>>>{
> >>>> +struct eventfd_ctx *trigger = NULL;
> >>>>  struct pci_dev *pdev = vdev->pdev;
> >>>> -struct eventfd_ctx *trigger;
> >>>>  int irq, ret;
> >>>>
> >>>>  if (vector < 0 || vector >= vdev->num_ctx)
> >>>>  return -EINVAL;
> >>>>
> >>>> +if (fd >= 0) {
> >>>> +trigger = eventfd_ctx_fdget(fd);
> >>>> +if (IS_ERR(trigger))
> >>>> +return PTR_ERR(trigger);
> >>>> +}  
> >>> I think this is a user visible change.  Previously the vector is
> >>> disabled first, then if an error occurs re-enabling, we return an errno
> >>> with the vector disabled.  Here we instead fail the ioctl and leave the
> >>> state as if it had never happened.  For instance with QEMU, if they
> >>> were trying to change from KVM to userspace signaling and entered this
> >>> condition, previously the interrupt would signal to neither eventfd, now
> >>> it would continue signaling to KVM. If QEMU's intent was to emulate
> >>> vector masking, this could induce unhandled interrupts in the guest.
> >>> Maybe we need a tear-down on fault here to maintain that behavior, or
> >>> do you see some justification for the change?  
> >> Thanks for your comments, this reminds me to think more about the
> >> effects to users.
> >>
> >> After I reviewed the related code in Qemu and VFIO, I think maybe there
> >> is a problem in current behavior
> >> for the signal path changing case. Qemu has neither recovery nor retry
> >> code in case that ioctl with
> >> 'VFIO_DEVICE_SET_IRQS' command fails, so if the old signal path has been
> >> disabled on fault of setting
> >> up new path, the corresponding vector may be disabled forever. Following
> >> is an example from qemu's
> >> vfio_msix_vector_do_use():
> >>
> >>       ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
> >>       g_free(irq_set);
> >>       if (ret) {
> >>       error_report("vfio: failed to modify vector, %d", ret);
> >>       }
> >>
> >> I think the signal path before changing should be still working at this
> >> moment and the caller should keep it
> >> working if the changing fails, so that at least we still have the old
> >

Re: [PATCH v1 2/5] mdev: Make mdev alias unique among all mdevs

2019-08-28 Thread Alex Williamson
On Tue, 27 Aug 2019 14:16:51 -0500
Parav Pandit  wrote:

> Mdev alias should be unique among all the mdevs, so that when such alias
> is used by the mdev users to derive other objects, there is no
> collision in a given system.
> 
> Signed-off-by: Parav Pandit 
> 
> ---
> Changelog:
> v0->v1:
>  - Fixed inclusion of alias for NULL check
>  - Added ratelimited debug print for sha1 hash collision error
> ---
>  drivers/vfio/mdev/mdev_core.c | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> index 62d29f57fe0c..4b9899e40665 100644
> --- a/drivers/vfio/mdev/mdev_core.c
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -375,6 +375,13 @@ int mdev_device_create(struct kobject *kobj, struct device *dev,
>   ret = -EEXIST;
>   goto mdev_fail;
>   }
> + if (tmp->alias && alias && strcmp(tmp->alias, alias) == 0) {

Nit: test whether the device we're adding has an alias before testing the
device we're comparing against.  The compiler can better optimize by
keeping alias hot.
Thanks,

Alex
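
A minimal userspace sketch of the ordering suggested above, for
illustration only.  struct fake_mdev and alias_in_use() are hypothetical
stand-ins, not mdev code; the point is just that the new device's alias
is tested before the entry being compared:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the list entries walked in mdev_device_create(). */
struct fake_mdev {
	const char *alias;	/* NULL when no alias was requested */
};

/*
 * Return true if 'alias' collides with an entry in 'devs'.
 * The new device's alias is tested first: when it is NULL, every
 * iteration short-circuits without loading tmp->alias.
 */
bool alias_in_use(const struct fake_mdev *devs, size_t n, const char *alias)
{
	assert(devs != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		const struct fake_mdev *tmp = &devs[i];

		if (alias && tmp->alias && strcmp(alias, tmp->alias) == 0)
			return true;
	}
	return false;
}
```

Since alias does not change across the loop, a compiler is also free to
hoist the NULL test out of the loop entirely.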

> + mutex_unlock(&mdev_list_lock);
> + ret = -EEXIST;
> + dev_dbg_ratelimited(dev, "Hash collision in alias creation for UUID %pUl\n",
> + uuid);
> + goto mdev_fail;
> + }
>   }
>  
>   mdev = kzalloc(sizeof(*mdev), GFP_KERNEL);



Re: [PATCH v1 1/5] mdev: Introduce sha1 based mdev alias

2019-08-28 Thread Alex Williamson
On Wed, 28 Aug 2019 15:25:44 -0600
Alex Williamson  wrote:

> On Tue, 27 Aug 2019 14:16:50 -0500
> Parav Pandit  wrote:
> >  module_init(mdev_init)
> > diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
> > index 7d922950caaf..cf1c0d9842c6 100644
> > --- a/drivers/vfio/mdev/mdev_private.h
> > +++ b/drivers/vfio/mdev/mdev_private.h
> > @@ -33,6 +33,7 @@ struct mdev_device {
> > struct kobject *type_kobj;
> > struct device *iommu_device;
> > bool active;
> > +   const char *alias;

Nit, put this above active to avoid creating a hole in the structure.
Thanks,

Alex
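
To illustrate the hole, a small userspace sketch with hypothetical,
trimmed-down layouts (these are not the real struct mdev_device; only
the last few members are modelled).  It measures the padding between
the two members in each ordering:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, trimmed-down layouts; not the real struct mdev_device. */
struct layout_as_posted {
	void *type_kobj;
	void *iommu_device;
	bool active;
	const char *alias;	/* pointer alignment forces padding after 'active' */
};

struct layout_suggested {
	void *type_kobj;
	void *iommu_device;
	const char *alias;
	bool active;		/* any padding now sits at the tail */
};

/* Bytes of padding between 'active' and 'alias' in the posted layout. */
size_t hole_as_posted(void)
{
	size_t end_of_active = offsetof(struct layout_as_posted, active) +
			       sizeof(bool);

	assert(offsetof(struct layout_as_posted, alias) >= end_of_active);
	return offsetof(struct layout_as_posted, alias) - end_of_active;
}

/* Bytes of padding between 'alias' and 'active' in the suggested layout. */
size_t hole_suggested(void)
{
	size_t end_of_alias = offsetof(struct layout_suggested, alias) +
			      sizeof(const char *);

	assert(offsetof(struct layout_suggested, active) >= end_of_alias);
	return offsetof(struct layout_suggested, active) - end_of_alias;
}
```

On common ABIs sizeof() is the same for both layouts; the difference is
that the padding moves from the middle of the structure to the tail,
where a later small member could reuse it.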


Re: [PATCH v1 1/5] mdev: Introduce sha1 based mdev alias

2019-08-28 Thread Alex Williamson
On Tue, 27 Aug 2019 14:16:50 -0500
Parav Pandit  wrote:

> Some vendor drivers want an identifier for an mdev device that is
> shorter than the UUID, due to length restrictions in the consumers of
> that identifier.
> 
> Add a callback that allows a vendor driver to request an alias of a
> specified length to be generated for an mdev device. If generated,
> that alias is checked for collisions.
> 
> It is an optional attribute.
> mdev alias is generated using sha1 from the mdev name.
> 
> Signed-off-by: Parav Pandit 
> 
> ---
> Changelog:
> 
> v0->v1:
>  - Moved alias length check outside of the parent lock
>  - Moved alias and digest allocation from kvzalloc to kzalloc
>  - [0] changed to alias
>  - alias_length check is nested under get_alias_length callback check
>  - Changed comments to start with an empty line
>  - Fixed cleanup of hash if mdev_bus_register() fails
>  - Added comment where alias memory ownership is handed over to mdev device
>  - Updated commit log to indicate motivation for this feature
> ---
>  drivers/vfio/mdev/mdev_core.c| 110 ++-
>  drivers/vfio/mdev/mdev_private.h |   5 +-
>  drivers/vfio/mdev/mdev_sysfs.c   |  13 ++--
>  include/linux/mdev.h |   4 ++
>  4 files changed, 122 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> index b558d4cfd082..62d29f57fe0c 100644
> --- a/drivers/vfio/mdev/mdev_core.c
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -10,9 +10,11 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "mdev_private.h"
>  
> @@ -27,6 +29,8 @@ static struct class_compat *mdev_bus_compat_class;
>  static LIST_HEAD(mdev_list);
>  static DEFINE_MUTEX(mdev_list_lock);
>  
> +static struct crypto_shash *alias_hash;
> +
>  struct device *mdev_parent_dev(struct mdev_device *mdev)
>  {
>   return mdev->parent->dev;
> @@ -150,6 +154,16 @@ int mdev_register_device(struct device *dev, const struct mdev_parent_ops *ops)
>   if (!ops || !ops->create || !ops->remove || !ops->supported_type_groups)
>   return -EINVAL;
>  
> + if (ops->get_alias_length) {
> + unsigned int digest_size;
> + unsigned int aligned_len;
> +
> + aligned_len = roundup(ops->get_alias_length(), 2);
> + digest_size = crypto_shash_digestsize(alias_hash);
> + if (aligned_len / 2 > digest_size)
> + return -EINVAL;
> + }
> +
>   dev = get_device(dev);
>   if (!dev)
>   return -EINVAL;
> @@ -259,6 +273,7 @@ static void mdev_device_free(struct mdev_device *mdev)
>   mutex_unlock(&mdev_list_lock);
>  
>   dev_dbg(&mdev->dev, "MDEV: destroying\n");
> + kfree(mdev->alias);
>   kfree(mdev);
>  }
>  
> @@ -269,18 +284,88 @@ static void mdev_device_release(struct device *dev)
>   mdev_device_free(mdev);
>  }
>  
> -int mdev_device_create(struct kobject *kobj,
> -struct device *dev, const guid_t *uuid)
> +static const char *
> +generate_alias(const char *uuid, unsigned int max_alias_len)
> +{
> + struct shash_desc *hash_desc;
> + unsigned int digest_size;
> + unsigned char *digest;
> + unsigned int alias_len;
> + char *alias;
> + int ret = 0;
> +
> + /*
> +  * Align to multiple of 2 as bin2hex will generate
> +  * even number of bytes.
> +  */
> + alias_len = roundup(max_alias_len, 2);
> + alias = kzalloc(alias_len + 1, GFP_KERNEL);
> + if (!alias)
> + return NULL;
> +
> + /* Allocate and init descriptor */
> + hash_desc = kvzalloc(sizeof(*hash_desc) +
> +  crypto_shash_descsize(alias_hash),
> +  GFP_KERNEL);
> + if (!hash_desc)
> + goto desc_err;
> +
> + hash_desc->tfm = alias_hash;
> +
> + digest_size = crypto_shash_digestsize(alias_hash);
> +
> + digest = kzalloc(digest_size, GFP_KERNEL);
> + if (!digest) {
> + ret = -ENOMEM;
> + goto digest_err;
> + }
> + crypto_shash_init(hash_desc);
> + crypto_shash_update(hash_desc, uuid, UUID_STRING_LEN);
> + crypto_shash_final(hash_desc, digest);

All of these can fail, and many, if not most, of the existing callers
appear to test the return value.  Thanks,

Alex
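
A userspace sketch of the error propagation being asked for, under
stated assumptions: fake_init(), fake_update() and fake_final() are
hypothetical stand-ins for the three crypto_shash_*() calls, each
returning 0 or a negative errno, and fail_at is only there to inject a
failure:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical stand-ins for the init/update/final hash steps.
 * 'fail_at' selects which step (1..3) reports an error; 0 means none.
 */
struct fake_desc {
	int fail_at;
	int step;
};

static int fake_step(struct fake_desc *d)
{
	d->step++;
	return d->step == d->fail_at ? -EINVAL : 0;
}

static int fake_init(struct fake_desc *d)
{
	return fake_step(d);
}

static int fake_update(struct fake_desc *d, const void *data, size_t len)
{
	(void)data;
	(void)len;
	return fake_step(d);
}

static int fake_final(struct fake_desc *d, unsigned char *out)
{
	int ret = fake_step(d);

	if (!ret)
		out[0] = 0xcd;	/* pretend a digest was written */
	return ret;
}

/* Compute the digest, returning the first failure to the caller. */
int hash_uuid(struct fake_desc *d, const char *uuid, size_t len,
	      unsigned char *digest)
{
	int ret;

	assert(d != NULL && digest != NULL);

	ret = fake_init(d);
	if (ret)
		return ret;

	ret = fake_update(d, uuid, len);
	if (ret)
		return ret;

	return fake_final(d, digest);
}
```

In generate_alias() the equivalent checks would jump to the existing
unwind labels rather than return directly, so that the digest,
descriptor and alias buffers are still freed.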

> + bin2hex(alias, digest, min_t(unsigned int, digest_size, alias_len / 2));
> + /*
> +  * When alias length is odd, zero out the additional last byte
> +  * that bin2hex has copied.
> +  */
> + if (max_alias_len % 2)
> + alias[max_alias_len] = 0;
> +
> + kfree(digest);
> + kvfree(hash_desc);
> + return alias;
> +
> +digest_err:
> + kvfree(hash_desc);
> +desc_err:
> + kfree(alias);
> + return NULL;
> +}
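
For reference, a userspace sketch of the truncation logic in
generate_alias() above.  It is illustrative only: to_hex() is a local
re-implementation standing in for the kernel's bin2hex(),
alias_from_digest() is a hypothetical name, and the digest is passed in
rather than computed:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Userspace equivalent of bin2hex(): two lowercase hex chars per byte. */
static void to_hex(char *dst, const unsigned char *src, size_t count)
{
	static const char digits[] = "0123456789abcdef";

	while (count--) {
		*dst++ = digits[*src >> 4];
		*dst++ = digits[*src++ & 0x0f];
	}
}

/*
 * Build an alias of at most max_len hex characters from a digest.
 * Returns a heap string the caller frees, or NULL on allocation failure.
 */
char *alias_from_digest(const unsigned char *digest, size_t digest_size,
			size_t max_len)
{
	size_t even_len = (max_len + 1) & ~(size_t)1;	/* round up to even */
	size_t nbytes = even_len / 2;
	char *alias;

	assert(digest != NULL);

	if (nbytes > digest_size)
		nbytes = digest_size;

	alias = calloc(even_len + 1, 1);
	if (!alias)
		return NULL;

	to_hex(alias, digest, nbytes);
	if (max_len % 2)
		alias[max_len] = '\0';	/* drop the extra hex character */
	return alias;
}
```

With a digest starting cd 5b 14 6a 80 a5, a length of 12 yields
"cd5b146a80a5" and a length of 5 yields "cd5b1".
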
> +
> +int mdev_device_create(struct kobject *kobj, struct device *dev,
> +const char *uuid_str, const guid_t *uuid)
>  {
>   

Re: [PATCH v4 3/3] vfio/pci: make use of irq_update_devid and optimize irq ops

2019-08-28 Thread Alex Williamson
On Wed, 28 Aug 2019 18:08:02 +0800
Ben Luo  wrote:

> On 2019/8/28 at 4:33 AM, Alex Williamson wrote:
> > On Thu, 22 Aug 2019 23:34:43 +0800
> > Ben Luo  wrote:
> >  
> >> When userspace (e.g. qemu) triggers a switch between KVM
> >> irqfd and userspace eventfd, only dev_id of irq action
> >> (i.e. the "trigger" in this patch's context) will be
> >> changed, but a free-then-request-irq action is taken in
> >> current code. And, irq affinity setting in VM will also
> >> trigger a free-then-request-irq action, which actually
> >> changes nothing, but only fires a producer re-registration
> >> to update irte in case that posted-interrupt is in use.
> >>
> >> This patch makes use of irq_update_devid() and optimize
> >> both cases above, which reduces the risk of losing interrupt
> >> and also cuts some overhead.
> >>
> >> Signed-off-by: Ben Luo 
> >> ---
> >>   drivers/vfio/pci/vfio_pci_intrs.c | 112 
> >> +-
> >>   1 file changed, 74 insertions(+), 38 deletions(-)
> >>
> >> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
> >> index 3fa3f72..60d3023 100644
> >> --- a/drivers/vfio/pci/vfio_pci_intrs.c
> >> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
> >> @@ -284,70 +284,106 @@ static int vfio_msi_enable(struct vfio_pci_device *vdev, int nvec, bool msix)
> >>   static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
> >>  int vector, int fd, bool msix)
> >>   {
> >> +  struct eventfd_ctx *trigger = NULL;
> >>struct pci_dev *pdev = vdev->pdev;
> >> -  struct eventfd_ctx *trigger;
> >>int irq, ret;
> >>   
> >>if (vector < 0 || vector >= vdev->num_ctx)
> >>return -EINVAL;
> >>   
> >> +  if (fd >= 0) {
> >> +  trigger = eventfd_ctx_fdget(fd);
> >> +  if (IS_ERR(trigger))
> >> +  return PTR_ERR(trigger);
> >> +  }  
> > I think this is a user visible change.  Previously the vector is
> > disabled first, then if an error occurs re-enabling, we return an errno
> > with the vector disabled.  Here we instead fail the ioctl and leave the
> > state as if it had never happened.  For instance with QEMU, if they
> > were trying to change from KVM to userspace signaling and entered this
> > condition, previously the interrupt would signal to neither eventfd, now
> > it would continue signaling to KVM. If QEMU's intent was to emulate
> > vector masking, this could induce unhandled interrupts in the guest.
> > Maybe we need a tear-down on fault here to maintain that behavior, or
> > do you see some justification for the change?  
> 
> Thanks for your comments, this reminds me to think more about the 
> effects to users.
> 
> After I reviewed the related code in Qemu and VFIO, I think maybe there 
> is a problem in current behavior
> for the signal path changing case. Qemu has neither recovery nor retry 
> code in case that ioctl with
> 'VFIO_DEVICE_SET_IRQS' command fails, so if the old signal path has been 
> disabled on fault of setting
> up new path, the corresponding vector may be disabled forever. Following 
> is an example from qemu's
> vfio_msix_vector_do_use():
> 
>      ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
>      g_free(irq_set);
>      if (ret) {
>      error_report("vfio: failed to modify vector, %d", ret);
>      }
> 
> I think the signal path before changing should be still working at this 
> moment and the caller should keep it
> working if the changing fails, so that at least we still have the old 
> path instead of no path.
> 
> For masking vector case, the 'fd' should be -1, and the interrupt will 
> be freed as before this patch.

QEMU doesn't really have an opportunity to signal an error to the
guest, we're emulating the hardware masking of MSI and MSI-X.  The
guest is simply trying to write a mask bit in the vector, there's no
provision in the PCI spec that setting this bit can fail.  The current
behavior is that the vector is disabled on error.  We can argue whether
that's the optimal behavior, but it's the existing behavior and
changing it would require an evaluation of all existing users.

> >> +
> >>irq = pci_irq_vector(pdev, vector);
> >>   
> >> +  /*
> >> +   * For KVM-VFIO case, interrupt from passthrough device will be directly
> >> +   * delivered to VM a

Re: [PATCH v2] vfio/type1: avoid redundant PageReserved checking

2019-08-28 Thread Alex Williamson
On Wed, 28 Aug 2019 12:28:04 +0800
Ben Luo  wrote:

> currently, if the page is not a tail of compound page, it will be
> checked twice for the same thing.
> 
> Signed-off-by: Ben Luo 
> ---
>  drivers/vfio/vfio_iommu_type1.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 054391f..d0f7346 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -291,11 +291,10 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
>  static bool is_invalid_reserved_pfn(unsigned long pfn)
>  {
>   if (pfn_valid(pfn)) {
> - bool reserved;
>   struct page *tail = pfn_to_page(pfn);
>   struct page *head = compound_head(tail);
> - reserved = !!(PageReserved(head));
>   if (head != tail) {
> + bool reserved = PageReserved(head);
>   /*
>* "head" is not a dangling pointer
>* (compound_head takes care of that)

Thinking more about this, the code here was originally just a copy of
kvm_is_mmio_pfn() which was simplified in v3.12 with the commit below.
Should we instead do the same thing here?  Thanks,

Alex

commit 11feeb498086a3a5907b8148bdf1786a9b18fc55
Author: Andrea Arcangeli 
Date:   Thu Jul 25 03:04:38 2013 +0200

kvm: optimize away THP checks in kvm_is_mmio_pfn()

The checks on PG_reserved in the page structure on head and tail pages
aren't necessary because split_huge_page wouldn't transfer the
PG_reserved bit from head to tail anyway.

This was a forward-thinking check done in the case PageReserved was
set by a driver-owned page mapped in userland with something like
remap_pfn_range in a VM_PFNMAP region, but using hugepmds (not
possible right now). It was meant to be very safe, but it's overkill
as it's unlikely split_huge_page could ever run without the driver
noticing and tearing down the hugepage itself.

And if a driver in the future will really want to map a reserved
hugepage in userland using an huge pmd it should simply take care of
marking all subpages reserved too to keep KVM safe. This of course
would require such a hypothetical driver to tear down the huge pmd
itself and splitting the hugepage itself, instead of relying on
split_huge_page, but that sounds very reasonable, especially
considering split_huge_page wouldn't currently transfer the reserved
bit anyway.

Signed-off-by: Andrea Arcangeli 
Signed-off-by: Gleb Natapov 

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d2836788561e..0fc25aed79a8 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -102,28 +102,8 @@ static bool largepages_enabled = true;
 
 bool kvm_is_mmio_pfn(pfn_t pfn)
 {
-   if (pfn_valid(pfn)) {
-   int reserved;
-   struct page *tail = pfn_to_page(pfn);
-   struct page *head = compound_trans_head(tail);
-   reserved = PageReserved(head);
-   if (head != tail) {
-   /*
-* "head" is not a dangling pointer
-* (compound_trans_head takes care of that)
-* but the hugepage may have been splitted
-* from under us (and we may not hold a
-* reference count on the head page so it can
-* be reused before we run PageReferenced), so
-* we've to check PageTail before returning
-* what we just read.
-*/
-   smp_rmb();
-   if (PageTail(tail))
-   return reserved;
-   }
-   return PageReserved(tail);
-   }
+   if (pfn_valid(pfn))
+   return PageReserved(pfn_to_page(pfn));
 
return true;
 }
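
Applied to the vfio side, the same simplification would presumably take
the shape below.  This is a userspace sketch only: struct fake_page,
fake_pfn_valid() and fake_pfn_to_page() are hypothetical stand-ins for
the kernel's struct page, pfn_valid() and pfn_to_page():

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's page array and helpers. */
struct fake_page {
	bool reserved;
};

#define FAKE_NR_PAGES 4

static struct fake_page fake_mem_map[FAKE_NR_PAGES] = {
	{ false }, { true }, { false }, { false },
};

static bool fake_pfn_valid(unsigned long pfn)
{
	return pfn < FAKE_NR_PAGES;
}

static struct fake_page *fake_pfn_to_page(unsigned long pfn)
{
	assert(fake_pfn_valid(pfn));
	return &fake_mem_map[pfn];
}

/* Same shape as kvm_is_mmio_pfn() after the commit quoted above. */
bool is_invalid_reserved_pfn_simple(unsigned long pfn)
{
	if (fake_pfn_valid(pfn))
		return fake_pfn_to_page(pfn)->reserved;

	return true;	/* no backing page: treat as invalid/reserved */
}
```

No head/tail lookup or memory barrier remains, which is the whole point
of the quoted commit.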


Re: [PATCH v4 3/3] vfio/pci: make use of irq_update_devid and optimize irq ops

2019-08-27 Thread Alex Williamson
On Thu, 22 Aug 2019 23:34:43 +0800
Ben Luo  wrote:

> When userspace (e.g. qemu) triggers a switch between KVM
> irqfd and userspace eventfd, only dev_id of irq action
> (i.e. the "trigger" in this patch's context) will be
> changed, but a free-then-request-irq action is taken in
> current code. And, irq affinity setting in VM will also
> trigger a free-then-request-irq action, which actually
> changes nothing, but only fires a producer re-registration
> to update irte in case that posted-interrupt is in use.
> 
> This patch makes use of irq_update_devid() and optimize
> both cases above, which reduces the risk of losing interrupt
> and also cuts some overhead.
> 
> Signed-off-by: Ben Luo 
> ---
>  drivers/vfio/pci/vfio_pci_intrs.c | 112 
> +-
>  1 file changed, 74 insertions(+), 38 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
> index 3fa3f72..60d3023 100644
> --- a/drivers/vfio/pci/vfio_pci_intrs.c
> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
> @@ -284,70 +284,106 @@ static int vfio_msi_enable(struct vfio_pci_device *vdev, int nvec, bool msix)
>  static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
> int vector, int fd, bool msix)
>  {
> + struct eventfd_ctx *trigger = NULL;
>   struct pci_dev *pdev = vdev->pdev;
> - struct eventfd_ctx *trigger;
>   int irq, ret;
>  
>   if (vector < 0 || vector >= vdev->num_ctx)
>   return -EINVAL;
>  
> + if (fd >= 0) {
> + trigger = eventfd_ctx_fdget(fd);
> + if (IS_ERR(trigger))
> + return PTR_ERR(trigger);
> + }

I think this is a user visible change.  Previously the vector is
disabled first, then if an error occurs re-enabling, we return an errno
with the vector disabled.  Here we instead fail the ioctl and leave the
state as if it had never happened.  For instance with QEMU, if they
were trying to change from KVM to userspace signaling and entered this
condition, previously the interrupt would signal to neither eventfd, now
it would continue signaling to KVM.  If QEMU's intent was to emulate
vector masking, this could induce unhandled interrupts in the guest.
Maybe we need a tear-down on fault here to maintain that behavior, or
do you see some justification for the change?
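
To make the "tear-down on fault" idea concrete, a userspace model under
stated assumptions: struct fake_vector, fake_fdget() and
set_vector_signal() are hypothetical, and fake_fdget() merely imitates
a lookup that can fail the way eventfd_ctx_fdget() can:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model: one vector with an opaque 'trigger' token. */
struct fake_vector {
	const void *trigger;	/* NULL means the vector is disabled */
};

/*
 * Stand-in for the eventfd lookup: fds 0..99 map to a static token,
 * anything else fails with -EBADF.
 */
static const void *fake_fdget(int fd, int *err)
{
	static const char tokens[100];

	if (fd < 0 || fd >= 100) {
		*err = -EBADF;
		return NULL;
	}
	*err = 0;
	return &tokens[fd];
}

/*
 * Switch the vector to 'fd' (fd < 0 disables it).  If the lookup fails,
 * the old trigger is torn down before returning, so the caller sees the
 * pre-patch behaviour: an errno and a disabled vector.
 */
int set_vector_signal(struct fake_vector *v, int fd)
{
	const void *trigger = NULL;
	int err = 0;

	assert(v != NULL);

	if (fd >= 0)
		trigger = fake_fdget(fd, &err);

	if (err) {
		v->trigger = NULL;	/* tear-down on fault */
		return err;
	}

	v->trigger = trigger;
	return 0;
}
```

The early lookup is kept, but a failure no longer leaves the previous
signaling path live.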

> +
>   irq = pci_irq_vector(pdev, vector);
>  
> + /*
> +  * For KVM-VFIO case, interrupt from passthrough device will be directly
> +  * delivered to VM after producer and consumer connected successfully.
> +  * If producer and consumer are disconnected, this interrupt process
> +  * will fall back to remap mode, where interrupt handler uses 'trigger'
> +  * to find the right way to deliver the interrupt to VM. So, it is safe
> +  * to do irq_update_devid() before irq_bypass_unregister_producer() which
> +  * switches interrupt process to remap mode. To producer and consumer,
> +  * 'trigger' is only a token used for pairing them together.
> +  */
>   if (vdev->ctx[vector].trigger) {
> - free_irq(irq, vdev->ctx[vector].trigger);
> - irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
> - kfree(vdev->ctx[vector].name);
> - eventfd_ctx_put(vdev->ctx[vector].trigger);
> - vdev->ctx[vector].trigger = NULL;
> + if (vdev->ctx[vector].trigger == trigger) {
> + /* switch back to remap mode */
> + irq_bypass_unregister_producer(&vdev->ctx[vector].producer);

I think we leak the fd context we acquired above in this case.

Why do we do anything in this case, couldn't we just 'put' the extra ctx
and return 0 here?
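
A userspace sketch of "put the extra ctx and return 0", for
illustration.  struct fake_ctx with its get/put helpers is a
hypothetical refcounted stand-in for struct eventfd_ctx, and
set_trigger() is not the vfio function:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical refcounted context standing in for struct eventfd_ctx. */
struct fake_ctx {
	int refs;
};

static struct fake_ctx *fake_ctx_get(struct fake_ctx *ctx)
{
	ctx->refs++;
	return ctx;
}

static void fake_ctx_put(struct fake_ctx *ctx)
{
	assert(ctx->refs > 0);
	ctx->refs--;
}

struct fake_vector {
	struct fake_ctx *trigger;	/* holds one reference while set */
};

/*
 * Install 'requested' as the vector's trigger.  The lookup takes a new
 * reference; if the vector already uses the same context, that extra
 * reference is dropped and nothing else changes.
 */
int set_trigger(struct fake_vector *v, struct fake_ctx *requested)
{
	struct fake_ctx *trigger = fake_ctx_get(requested);

	if (v->trigger == trigger) {
		fake_ctx_put(trigger);	/* avoid leaking the new reference */
		return 0;
	}

	if (v->trigger)
		fake_ctx_put(v->trigger);
	v->trigger = trigger;
	return 0;
}
```

Setting the same context twice leaves its reference count at one, which
is the invariant the posted code breaks.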

> + } else if (trigger) {
> + ret = irq_update_devid(irq,
> +vdev->ctx[vector].trigger, trigger);
> + if (unlikely(ret)) {
> + dev_info(&pdev->dev,
> +  "update devid of %d (token %p) failed: %d\n",
> +  irq, vdev->ctx[vector].trigger, ret);
> + eventfd_ctx_put(trigger);
> + return ret;
> + }
> + irq_bypass_unregister_producer(&vdev->ctx[vector].producer);

Can you explain this ordering?  I would have expected that we'd
unregister the bypass before we updated the devid.  Thanks,

Alex

> + eventfd_ctx_put(vdev->ctx[vector].trigger);
> + vdev->ctx[vector].producer.token = trigger;
> + vdev->ctx[vector].trigger = trigger;
> + } else {
> + free_irq(irq, vdev->ctx[vector].trigger);
> + irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
> + kfree(vdev->ctx[vector].name);
> + 

Re: [PATCH] vfio/type1: avoid redundant PageReserved checking

2019-08-27 Thread Alex Williamson
On Tue, 27 Aug 2019 20:49:48 +0800
Ben Luo  wrote:

> currently, if the page is not a tail of compound page, it will be
> checked twice for the same thing.
> 
> Signed-off-by: Ben Luo 
> ---
>  drivers/vfio/vfio_iommu_type1.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 054391f..cbe0d88 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -291,11 +291,10 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
>  static bool is_invalid_reserved_pfn(unsigned long pfn)
>  {
>   if (pfn_valid(pfn)) {
> - bool reserved;
>   struct page *tail = pfn_to_page(pfn);
>   struct page *head = compound_head(tail);
> - reserved = !!(PageReserved(head));
>   if (head != tail) {
> + bool reserved = !!(PageReserved(head));
>   /*
>* "head" is not a dangling pointer
>* (compound_head takes care of that)
> @@ -310,7 +309,7 @@ static bool is_invalid_reserved_pfn(unsigned long pfn)
>   if (PageTail(tail))
>   return reserved;
>   }
> - return PageReserved(tail);
> + return !!(PageReserved(tail));
>   }
>  
>   return true;

Logic seems fine to me, though I'd actually prefer to get rid of the !!
in the first use rather than duplicate it at the second use.  Thanks,

Alex
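
The reason the !! can go: conversion to bool already maps any nonzero
value to true.  A tiny userspace sketch, where FAKE_PG_RESERVED and
raw_test() are hypothetical and only imitate a flag test that returns
the raw masked bit:

```c
#include <assert.h>
#include <stdbool.h>

#define FAKE_PG_RESERVED (1UL << 10)	/* hypothetical flag bit */

/* Returns the raw masked bit, i.e. 0 or 1024, not 0 or 1. */
static unsigned long raw_test(unsigned long flags)
{
	return flags & FAKE_PG_RESERVED;
}

/* Assigning to (or returning as) bool normalizes to 0/1 without "!!". */
bool is_reserved(unsigned long flags)
{
	bool reserved = raw_test(flags);

	return reserved;
}

/* Small self-check showing the normalized values. */
void check_bool_conversion(void)
{
	assert(is_reserved(FAKE_PG_RESERVED) == 1);
	assert(is_reserved(0) == 0);
}
```

The same holds for a function declared to return bool, so neither use
needs the double negation.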


Re: [PATCH 0/4] Introduce variable length mdev alias

2019-08-27 Thread Alex Williamson
On Tue, 27 Aug 2019 13:11:17 +
Parav Pandit  wrote:

> Hi Alex, Cornelia,
> 
> > -Original Message-
> > From: kvm-ow...@vger.kernel.org  On Behalf
> > Of Parav Pandit
> > Sent: Tuesday, August 27, 2019 2:11 AM
> > To: alex.william...@redhat.com; Jiri Pirko ;
> > kwankh...@nvidia.com; coh...@redhat.com; da...@davemloft.net
> > Cc: k...@vger.kernel.org; linux-kernel@vger.kernel.org;
> > net...@vger.kernel.org; Parav Pandit 
> > Subject: [PATCH 0/4] Introduce variable length mdev alias
> > 
> > To have consistent naming for the netdevice of a mdev and to have consistent
> > naming of the devlink port [1] of a mdev, which is formed using
> > phys_port_name of the devlink port, current UUID is not usable because UUID
> > is too long.
> > 
> > UUID in string format is 36-characters long and in binary 128-bit.
> > Both formats are not able to fit within 15 characters limit of netdev name.
> > 
> > It is desired to have mdev device naming consistent using UUID.
> > So that widely used user space framework such as ovs [2] can make use of
> > mdev representor in similar way as PCIe SR-IOV VF and PF representors.
> > 
> > Hence,
> > (a) mdev alias is created which is derived using sha1 from the mdev name.
> > (b) Vendor driver describes how long an alias should be for the child mdev
> > created for a given parent.
> > (c) Mdev aliases are unique at system level.
> > (d) alias is created optionally whenever parent requested.
> > This ensures that non networking mdev parents can function without alias
> > creation overhead.
> > 
> > This design is discussed at [3].
> > 
> > An example systemd/udev extension will have,
> > 
> > 1. netdev name created using mdev alias available in sysfs.
> > 
> > mdev UUID=83b8f4f2-509f-382f-3c1e-e6bfe0fa1001
> > mdev 12 character alias=cd5b146a80a5
> > 
> > netdev name of this mdev = enmcd5b146a80a5
> > Here en = Ethernet link
> > m = mediated device
> > 
> > 2. devlink port phys_port_name created using mdev alias.
> > devlink phys_port_name=pcd5b146a80a5
> > 
> > This patchset enables mdev core to maintain unique alias for a mdev.
> > 
> > Patch-1 Introduces mdev alias using sha1.
> > Patch-2 Ensures that mdev alias is unique in a system.
> > Patch-3 Exposes mdev alias in a sysfs hierarchy.
> > Patch-4 Extends mtty driver to optionally provide alias generation.
> > This also enables testing UUID based sha1 collisions and triggering error handling
> > for duplicate sha1 results.
> > 
> > In future, when a networking driver wants to use the mdev alias, an mdev_alias() API will
> > be added to derive the devlink port name.
> >   
> Now that the majority of the above patches look to be in shape and I have addressed all comments,
> in the next v1 post I was considering including mdev_alias() and having an
> example use in the mtty driver.
> 
> This way, a subsequent series for mlx5_core, which intends to use the
> mdev_alias() API, is easy to review and merge through Dave M's
> netdev tree. Is that ok with you?

What would be the timing for the mlx5_core use case?  Can we coordinate
within the same development cycle?  I wouldn't want someone to come
clean up the sample driver and remove the API ;)  Thanks,

Alex
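
As a quick check of the length arithmetic in the quoted cover letter, a
userspace sketch.  mdev_netdev_name() is a hypothetical helper, not a
proposed API; the only fact relied on is that IFNAMSIZ is 16, leaving
15 usable characters:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NETDEV_NAME_MAX 15	/* IFNAMSIZ (16) minus the trailing NUL */

/*
 * Compose "enm" + alias into buf.  Returns 0 on success, -1 if the
 * result would not fit in a netdev name or in the caller's buffer.
 */
int mdev_netdev_name(char *buf, size_t buflen, const char *alias)
{
	int n;

	assert(buf != NULL && alias != NULL && buflen > 0);

	n = snprintf(buf, buflen, "enm%s", alias);
	if (n < 0 || (size_t)n >= buflen || n > NETDEV_NAME_MAX)
		return -1;
	return 0;
}
```

A 12 character alias gives exactly 15 characters with the "enm" prefix,
while the 36 character UUID string cannot fit.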


Re: [PATCH 1/4] mdev: Introduce sha1 based mdev alias

2019-08-27 Thread Alex Williamson
On Tue, 27 Aug 2019 15:35:10 +0200
Cornelia Huck  wrote:

> On Tue, 27 Aug 2019 11:57:07 +
> Parav Pandit  wrote:
> 
> > > -Original Message-
> > > From: Cornelia Huck 
> > > Sent: Tuesday, August 27, 2019 5:11 PM
> > > To: Parav Pandit 
> > > Cc: alex.william...@redhat.com; Jiri Pirko ;
> > > kwankh...@nvidia.com; da...@davemloft.net; k...@vger.kernel.org; linux-
> > > ker...@vger.kernel.org; net...@vger.kernel.org
> > > Subject: Re: [PATCH 1/4] mdev: Introduce sha1 based mdev alias
> > > 
> > > On Tue, 27 Aug 2019 11:33:54 +
> > > Parav Pandit  wrote:
> > > 
> > > > > -Original Message-
> > > > > From: Cornelia Huck 
> > > > > Sent: Tuesday, August 27, 2019 4:54 PM
> > > > > To: Parav Pandit 
> > > > > Cc: alex.william...@redhat.com; Jiri Pirko ;
> > > > > kwankh...@nvidia.com; da...@davemloft.net; k...@vger.kernel.org;
> > > > > linux- ker...@vger.kernel.org; net...@vger.kernel.org
> > > > > Subject: Re: [PATCH 1/4] mdev: Introduce sha1 based mdev alias
> > > > >
> > > > > On Tue, 27 Aug 2019 11:12:23 +
> > > > > Parav Pandit  wrote:
> > > > >
> > > > > > > -Original Message-
> > > > > > > From: Cornelia Huck 
> > > > > > > Sent: Tuesday, August 27, 2019 3:54 PM
> > > > > > > To: Parav Pandit 
> > > > > > > Cc: alex.william...@redhat.com; Jiri Pirko ;
> > > > > > > kwankh...@nvidia.com; da...@davemloft.net; k...@vger.kernel.org;
> > > > > > > linux- ker...@vger.kernel.org; net...@vger.kernel.org
> > > > > > > Subject: Re: [PATCH 1/4] mdev: Introduce sha1 based mdev alias
> > > > > > >
> > > 
> > > > > > > What about:
> > > > > > >
> > > > > > > > * @get_alias_length: optional callback to specify length of the alias to create
> > > > > > > > *                    Returns unsigned integer: length of the alias to be created,
> > > > > > > > *                    0 to not create an alias
> > > > > > >
> > > > > > Ack.
> > > > > >
> > > > > > > I also think it might be beneficial to add a device parameter
> > > > > > > > here now (rather than later); that seems to be something that makes sense.
> > > > > > >
> > > > > > Without showing the use, it shouldn't be added.
> > > > >
> > > > > It just feels like an omission: Why should the vendor driver only be
> > > > > able to return one value here, without knowing which device it is for?
> > > > > If a driver supports different devices, it may have different
> > > > > requirements for them.
> > > > >
> > > > > Sure. Let's first have this requirement before we add it.
> > > > > I am against adding this length field itself without an actual vendor use case,
> > > > > which is adding some complexity in code today.
> > > > > But it was ok to have length field instead of bool.
> > > > >
> > > > > Let's not further add "no-requirement futuristic knobs" which haven't shown their need yet.
> > > > > When a vendor driver needs it, there is nothing that prevents such an addition.
> > > >
> > > 
> > > > Frankly, I do not see how it adds complexity; the other callbacks have device
> > > > arguments already,
> > > Other ioctls such as create, remove, mmap, likely need to access the parent.
> > > Hence it makes sense to have parent pointer in there.
> > > 
> > > I am not against complexity; I am just saying that at present there is no
> > > use case. Let's have a use case and then we add it.
> > >   
> > > > and the vendor driver is free to ignore it if it does not have
> > > > a use for it. I'd rather add the argument before a possible future user tries
> > > > weird hacks to allow multiple values, but I'll leave the decision to the
> > > > maintainers.
> > > Why would a possible future user try a weird hack?
> > > If a user needs to access the parent device, that driver maintainer should ask for it.
> 
> I've seen the situation often enough that folks tried to do hacks
> instead of enhancing the interface.
> 
> Again, let's get a maintainer opinion.

Sure, make someone else have an opinion ;)  I don't have a strong one.
The argument against a dev arg, as I see it, is that it's unused
currently, so why should we try to predict a future use case.  The
argument for, is that we're defining an API between the core and vendor
driver, where our job in defining that API could certainly be seen as
anticipating future use cases so as not to unnecessarily churn the
API.  So do we lean towards a more stable API or do we lean towards
minimalism?
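
To make the trade-off concrete, a userspace sketch of the variant with
a device argument.  Everything here is hypothetical (struct
fake_device, struct fake_parent_ops, the two sample callbacks); it only
shows that a driver with no per-device needs can ignore the argument:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical device with one property a driver might care about. */
struct fake_device {
	int is_netdev_parent;
};

/* Hypothetical variant of the callback that receives the parent device. */
struct fake_parent_ops {
	unsigned int (*get_alias_length)(struct fake_device *dev);
};

/* A vendor driver with no per-device needs ignores the argument... */
unsigned int fixed_alias_length(struct fake_device *dev)
{
	(void)dev;
	return 12;
}

/* ...while another varies the length, returning 0 for "no alias". */
unsigned int per_device_alias_length(struct fake_device *dev)
{
	return dev->is_netdev_parent ? 12 : 0;
}

/* Core-side helper: an absent callback means no alias is requested. */
unsigned int alias_length_for(const struct fake_parent_ops *ops,
			      struct fake_device *dev)
{
	assert(ops != NULL);
	return ops->get_alias_length ? ops->get_alias_length(dev) : 0;
}
```

The cost of the extra argument is one unused parameter in the simple
case; the cost of omitting it is an API change if a per-device length
is ever needed.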

When called from mdev_register_device(), the arg we'd add seems obvious
because we really have nothing more to work with than the parent
device.  But this is only a sanity test and the value there seems
questionable anyway.  If we look to the real use case in
mdev_device_create() then clearly dev stands out as a likely useful
arg, but is the type or kobj also useful?  Would we forfeit the sanity
test to include those?  I don't have a lot of confidence in being able
to 

Re: [PATCH 2/4] mdev: Make mdev alias unique among all mdevs

2019-08-27 Thread Alex Williamson
On Tue, 27 Aug 2019 16:13:27 +
Parav Pandit  wrote:

> > -Original Message-
> > From: Alex Williamson 
> > Sent: Tuesday, August 27, 2019 8:59 PM
> > To: Cornelia Huck 
> > Cc: Parav Pandit ; Jiri Pirko ;
> > kwankh...@nvidia.com; da...@davemloft.net; k...@vger.kernel.org; linux-
> > ker...@vger.kernel.org; net...@vger.kernel.org
> > Subject: Re: [PATCH 2/4] mdev: Make mdev alias unique among all mdevs
> > 
> > On Tue, 27 Aug 2019 13:29:46 +0200
> > Cornelia Huck  wrote:
> >   
> > > On Tue, 27 Aug 2019 11:08:59 +
> > > Parav Pandit  wrote:
> > >  
> > > > > -Original Message-
> > > > > From: Cornelia Huck 
> > > > > Sent: Tuesday, August 27, 2019 3:59 PM
> > > > > To: Parav Pandit 
> > > > > Cc: alex.william...@redhat.com; Jiri Pirko ;
> > > > > kwankh...@nvidia.com; da...@davemloft.net; k...@vger.kernel.org;
> > > > > linux- ker...@vger.kernel.org; net...@vger.kernel.org
> > > > > Subject: Re: [PATCH 2/4] mdev: Make mdev alias unique among all
> > > > > mdevs
> > > > >
> > > > > On Mon, 26 Aug 2019 15:41:17 -0500 Parav Pandit
> > > > >  wrote:
> > > > >  
> > > > > > Mdev alias should be unique among all the mdevs, so that when
> > > > > > such alias is used by the mdev users to derive other objects,
> > > > > > there is no collision in a given system.
> > > > > >
> > > > > > Signed-off-by: Parav Pandit 
> > > > > > ---
> > > > > >  drivers/vfio/mdev/mdev_core.c | 5 +
> > > > > >  1 file changed, 5 insertions(+)
> > > > > >
> > > > > > diff --git a/drivers/vfio/mdev/mdev_core.c
> > > > > > b/drivers/vfio/mdev/mdev_core.c index e825ff38b037..6eb37f0c6369
> > > > > > 100644
> > > > > > --- a/drivers/vfio/mdev/mdev_core.c
> > > > > > +++ b/drivers/vfio/mdev/mdev_core.c
> > > > > > @@ -375,6 +375,11 @@ int mdev_device_create(struct kobject *kobj,  
> > struct  
> > > > > device *dev,  
> > > > > > ret = -EEXIST;
> > > > > > goto mdev_fail;
> > > > > > }
> > > > > > +   if (tmp->alias && strcmp(tmp->alias, alias) == 0) {  
> > > > >
> > > > > > Any way we can relay to the caller that the uuid was fine, but
> > > > > > that we had a hash collision? Duplicate uuids are much more obvious than
> > > > > > a collision here.
> > > > >  
> > > > How do you want to relay this rare event?
> > > > > Netlink interface has way to return the error message back, but sysfs is
> > > > > limited due to its error code based interface.
> > >
> > > I don't know, that's why I asked :)
> > >
> > > The problem is that "uuid already used" and "hash collision" are
> > > indistinguishable. While "use a different uuid" will probably work in
> > > both cases, "increase alias length" might be a good alternative in
> > > some cases.
> > >
> > > But if there is no good way to relay the problem, we can live with it.  
> > 
> > It's a rare event, maybe just dev_dbg(dev, "Hash collision creating alias \"%s\"
> > for mdev device %pUl\n",...
> >   
> Ok.
> dev_dbg_once() to avoid message flood.

I'd suggest a rate-limit rather than a once.  The fact that the kernel
may have experienced a collision at some time in the past does not help
someone debug why they can't create a device now.  The only way we're
going to get a flood is if a user sufficiently privileged to create
mdev devices stumbles onto a collision and continues to repeat the same
operation.  That falls into shoot-yourself-in-the-foot behavior imo.
Thanks,

Alex


Re: [PATCH 2/4] mdev: Make mdev alias unique among all mdevs

2019-08-27 Thread Alex Williamson
On Tue, 27 Aug 2019 13:29:46 +0200
Cornelia Huck  wrote:

> On Tue, 27 Aug 2019 11:08:59 +
> Parav Pandit  wrote:
> 
> > > -Original Message-
> > > From: Cornelia Huck 
> > > Sent: Tuesday, August 27, 2019 3:59 PM
> > > To: Parav Pandit 
> > > Cc: alex.william...@redhat.com; Jiri Pirko ;
> > > kwankh...@nvidia.com; da...@davemloft.net; k...@vger.kernel.org; linux-
> > > ker...@vger.kernel.org; net...@vger.kernel.org
> > > Subject: Re: [PATCH 2/4] mdev: Make mdev alias unique among all mdevs
> > > 
> > > On Mon, 26 Aug 2019 15:41:17 -0500
> > > Parav Pandit  wrote:
> > > 
> > > > Mdev alias should be unique among all the mdevs, so that when such
> > > > alias is used by the mdev users to derive other objects, there is no
> > > > collision in a given system.
> > > >
> > > > Signed-off-by: Parav Pandit 
> > > > ---
> > > >  drivers/vfio/mdev/mdev_core.c | 5 +
> > > >  1 file changed, 5 insertions(+)
> > > >
> > > > diff --git a/drivers/vfio/mdev/mdev_core.c
> > > > b/drivers/vfio/mdev/mdev_core.c index e825ff38b037..6eb37f0c6369
> > > > 100644
> > > > --- a/drivers/vfio/mdev/mdev_core.c
> > > > +++ b/drivers/vfio/mdev/mdev_core.c
> > > > @@ -375,6 +375,11 @@ int mdev_device_create(struct kobject *kobj, 
> > > > struct
> > > device *dev,
> > > > ret = -EEXIST;
> > > > goto mdev_fail;
> > > > }
> > > > +   if (tmp->alias && strcmp(tmp->alias, alias) == 0) {
> > > 
> > > > Any way we can relay to the caller that the uuid was fine, but that we
> > > > had a hash collision? Duplicate uuids are much more obvious than a
> > > > collision here.
> > > 
> > How do you want to relay this rare event?
> > Netlink interface has way to return the error message back, but sysfs is 
> > limited due to its error code based interface.  
> 
> I don't know, that's why I asked :)
> 
> The problem is that "uuid already used" and "hash collision" are
> indistinguishable. While "use a different uuid" will probably work in
> both cases, "increase alias length" might be a good alternative in some
> cases.
> 
> But if there is no good way to relay the problem, we can live with it.

It's a rare event, maybe just dev_dbg(dev, "Hash collision creating alias 
\"%s\" for mdev device %pUl\n",...

Thanks,
Alex

> > > > +   mutex_unlock(_list_lock);
> > > > +   ret = -EEXIST;
> > > > +   goto mdev_fail;
> > > > +   }
> > > > }
> > > >
> > > > mdev = kzalloc(sizeof(*mdev), GFP_KERNEL);
> >   
> 



Re: [PATCH 2/4] mdev: Make mdev alias unique among all mdevs

2019-08-27 Thread Alex Williamson
On Tue, 27 Aug 2019 04:28:37 +
Parav Pandit  wrote:

> Hi Mark,
> 
> > -Original Message-
> > From: Mark Bloch 
> > Sent: Tuesday, August 27, 2019 4:32 AM
> > To: Parav Pandit ; alex.william...@redhat.com; Jiri
> > Pirko ; kwankh...@nvidia.com; coh...@redhat.com;
> > da...@davemloft.net
> > Cc: k...@vger.kernel.org; linux-kernel@vger.kernel.org;
> > net...@vger.kernel.org
> > Subject: Re: [PATCH 2/4] mdev: Make mdev alias unique among all mdevs
> > 
> > 
> > 
> > On 8/26/19 1:41 PM, Parav Pandit wrote:  
> > > Mdev alias should be unique among all the mdevs, so that when such
> > > alias is used by the mdev users to derive other objects, there is no
> > > collision in a given system.
> > >
> > > Signed-off-by: Parav Pandit 
> > > ---
> > >  drivers/vfio/mdev/mdev_core.c | 5 +
> > >  1 file changed, 5 insertions(+)
> > >
> > > diff --git a/drivers/vfio/mdev/mdev_core.c
> > > b/drivers/vfio/mdev/mdev_core.c index e825ff38b037..6eb37f0c6369
> > > 100644
> > > --- a/drivers/vfio/mdev/mdev_core.c
> > > +++ b/drivers/vfio/mdev/mdev_core.c
> > > @@ -375,6 +375,11 @@ int mdev_device_create(struct kobject *kobj,  
> > struct device *dev,  
> > >   ret = -EEXIST;
> > >   goto mdev_fail;
> > >   }
> > > + if (tmp->alias && strcmp(tmp->alias, alias) == 0) {  
> > 
> > alias can be NULL here no?
> >   
> If alias is NULL, tmp->alias would also be null because for given parent 
> either we have alias or we don’t.
> So its not possible to have tmp->alias as null and alias as non null.
> But it may be good/defensive to add check for both.

mdev_list is a global list of all mdev devices, how can we make any
assumptions that an element has the same parent?  Thanks,

Alex
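
A minimal userspace sketch of the defensive comparison discussed above, assuming a global list in which any entry may or may not carry an alias. The function names and the array-based list are hypothetical; the point is only that neither side of the strcmp() is assumed to be non-NULL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Two aliases collide only if both exist and are identical strings. */
static bool alias_collides(const char *existing, const char *candidate)
{
	if (!existing || !candidate)
		return false;
	return strcmp(existing, candidate) == 0;
}

/*
 * Walk a global list of aliases, some of which may be NULL because
 * their parent never asked for one, and report whether 'candidate'
 * is already taken.
 */
static bool alias_in_use(const char *const *aliases, size_t n,
			 const char *candidate)
{
	size_t i;

	assert(aliases != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (alias_collides(aliases[i], candidate))
			return true;
	}
	return false;
}
```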
 
> > > + mutex_unlock(_list_lock);
> > > + ret = -EEXIST;
> > > + goto mdev_fail;
> > > + }
> > >   }
> > >
> > >   mdev = kzalloc(sizeof(*mdev), GFP_KERNEL);
> > >  
> > 
> > Mark  



Re: [PATCH 3/4] mdev: Expose mdev alias in sysfs tree

2019-08-26 Thread Alex Williamson
On Mon, 26 Aug 2019 15:41:18 -0500
Parav Pandit  wrote:

> Expose mdev alias as string in a sysfs tree so that such attribute can
> be used to generate netdevice name by systemd/udev or can be used to
> match other kernel objects based on the alias of the mdev.
> 
> Signed-off-by: Parav Pandit 
> ---
>  drivers/vfio/mdev/mdev_sysfs.c | 13 +
>  1 file changed, 13 insertions(+)
> 
> diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
> index 43afe0e80b76..59f4e3cc5233 100644
> --- a/drivers/vfio/mdev/mdev_sysfs.c
> +++ b/drivers/vfio/mdev/mdev_sysfs.c
> @@ -246,7 +246,20 @@ static ssize_t remove_store(struct device *dev, struct 
> device_attribute *attr,
>  
>  static DEVICE_ATTR_WO(remove);
>  
> +static ssize_t alias_show(struct device *device,
> +   struct device_attribute *attr, char *buf)
> +{
> + struct mdev_device *dev = mdev_from_dev(device);
> +
> + if (!dev->alias)
> + return -EOPNOTSUPP;

Wouldn't it be better to not create the alias at all?  Thanks,

Alex
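
One possible shape for this, sketched in plain userspace C with hypothetical names: each attribute carries an optional visibility predicate, and an attribute whose predicate returns false is simply never created. In the kernel, the is_visible callback of struct attribute_group serves this purpose; the code below only models the idea and is not the sysfs API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an mdev device. */
struct fake_mdev {
	const char *alias;	/* NULL if the parent requested no alias */
};

/* Hypothetical attribute with an optional visibility predicate. */
struct fake_attr {
	const char *name;
	bool (*is_visible)(const struct fake_mdev *dev); /* NULL: always */
};

static bool alias_visible(const struct fake_mdev *dev)
{
	return dev->alias != NULL;
}

static const struct fake_attr fake_attrs[] = {
	{ "alias", alias_visible },
	{ "remove", NULL },
};

/* Count the attributes that would be created for this device. */
static size_t count_visible(const struct fake_mdev *dev)
{
	size_t i, n = 0;

	assert(dev != NULL);
	for (i = 0; i < sizeof(fake_attrs) / sizeof(fake_attrs[0]); i++) {
		if (!fake_attrs[i].is_visible || fake_attrs[i].is_visible(dev))
			n++;
	}
	return n;
}
```

With this arrangement a reader of the attribute never needs an -EOPNOTSUPP path, because the file does not exist for devices without an alias.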

> +
> + return sprintf(buf, "%s\n", dev->alias);
> +}
> +static DEVICE_ATTR_RO(alias);
> +
>  static const struct attribute *mdev_device_attrs[] = {
> + &dev_attr_alias.attr,
>   &dev_attr_remove.attr,
>   NULL,
>  };



Re: [PATCH 1/4] mdev: Introduce sha1 based mdev alias

2019-08-26 Thread Alex Williamson
On Mon, 26 Aug 2019 19:44:56 -0600
Alex Williamson  wrote:

> On Mon, 26 Aug 2019 15:41:16 -0500
> Parav Pandit  wrote:
> 
> > Whenever a parent requests to generate mdev alias, generate a mdev
> > alias.
> > It is an optional attribute that parent can request to generate
> > for each of its child mdev.
> > mdev alias is generated using sha1 from the mdev name.
> > 
> > Signed-off-by: Parav Pandit 
> > ---
> >  drivers/vfio/mdev/mdev_core.c| 98 +++-
> >  drivers/vfio/mdev/mdev_private.h |  5 +-
> >  drivers/vfio/mdev/mdev_sysfs.c   | 13 +++--
> >  include/linux/mdev.h |  4 ++
> >  4 files changed, 111 insertions(+), 9 deletions(-)
> > 
> > diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> > index b558d4cfd082..e825ff38b037 100644
> > --- a/drivers/vfio/mdev/mdev_core.c
> > +++ b/drivers/vfio/mdev/mdev_core.c
> > @@ -10,9 +10,11 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  
> >  #include "mdev_private.h"
> >  
> > @@ -27,6 +29,8 @@ static struct class_compat *mdev_bus_compat_class;
> >  static LIST_HEAD(mdev_list);
> >  static DEFINE_MUTEX(mdev_list_lock);
> >  
> > +static struct crypto_shash *alias_hash;
> > +
> >  struct device *mdev_parent_dev(struct mdev_device *mdev)
> >  {
> > return mdev->parent->dev;
> > @@ -164,6 +168,18 @@ int mdev_register_device(struct device *dev, const 
> > struct mdev_parent_ops *ops)
> > goto add_dev_err;
> > }
> >  
> > +   if (ops->get_alias_length) {
> > +   unsigned int digest_size;
> > +   unsigned int aligned_len;
> > +
> > +   aligned_len = roundup(ops->get_alias_length(), 2);
> > +   digest_size = crypto_shash_digestsize(alias_hash);
> > +   if (aligned_len / 2 > digest_size) {
> > +   ret = -EINVAL;
> > +   goto add_dev_err;
> > +   }
> > +   }  
> 
> This looks like a sanity check, it could be done outside of the
> parent_list_lock, even before we get a parent device reference.
> 
> I think we're using a callback for get_alias_length() rather than a
> fixed field to support the mtty module option added in patch 4, right?
> Its utility is rather limited with no args.  I could imagine that if a
> parent wanted to generate an alias that could be incorporated into a
> string with the parent device name that it would be useful to call this
> with the parent device as an arg.  I guess we can save that until a
> user comes along though.
> 
> There doesn't seem to be anything serializing use of alias_hash.
> 
> > +
> > parent = kzalloc(sizeof(*parent), GFP_KERNEL);
> > if (!parent) {
> > ret = -ENOMEM;
> > @@ -259,6 +275,7 @@ static void mdev_device_free(struct mdev_device *mdev)
> > > mutex_unlock(&mdev_list_lock);
> > >  
> > > dev_dbg(&mdev->dev, "MDEV: destroying\n");
> > +   kvfree(mdev->alias);
> > kfree(mdev);
> >  }
> >  
> > @@ -269,18 +286,86 @@ static void mdev_device_release(struct device *dev)
> > mdev_device_free(mdev);
> >  }
> >  
> > -int mdev_device_create(struct kobject *kobj,
> > -  struct device *dev, const guid_t *uuid)
> > +static const char *
> > +generate_alias(const char *uuid, unsigned int max_alias_len)
> > +{
> > +   struct shash_desc *hash_desc;
> > +   unsigned int digest_size;
> > +   unsigned char *digest;
> > +   unsigned int alias_len;
> > +   char *alias;
> > +   int ret = 0;
> > +
> > +   /* Align to multiple of 2 as bin2hex will generate
> > +* even number of bytes.
> > +*/  
> 
> Comment style for non-networking code please.
> 
> > +   alias_len = roundup(max_alias_len, 2);
> > +   alias = kvzalloc(alias_len + 1, GFP_KERNEL);  

Oops, here's the null termination of alias for the even case (+ 1),
ignore the comment below about odd/even.  Thanks,

Alex

> 
> The size we're generating here should be small enough to just use
> kzalloc(), probably below too.
> 
> > +   if (!alias)
> > +   return NULL;
> > +
> > +   /* Allocate and init descriptor */
> > +   hash_desc = kvzalloc(sizeof(*hash_desc) +
> > +crypto_shash_descsize(alias_hash),
> > +GFP_KERNEL);
> > +   if (!hash_desc)
> > +  

Re: [PATCH 1/4] mdev: Introduce sha1 based mdev alias

2019-08-26 Thread Alex Williamson
On Mon, 26 Aug 2019 15:41:16 -0500
Parav Pandit  wrote:

> Whenever a parent requests to generate mdev alias, generate a mdev
> alias.
> It is an optional attribute that parent can request to generate
> for each of its child mdev.
> mdev alias is generated using sha1 from the mdev name.
> 
> Signed-off-by: Parav Pandit 
> ---
>  drivers/vfio/mdev/mdev_core.c| 98 +++-
>  drivers/vfio/mdev/mdev_private.h |  5 +-
>  drivers/vfio/mdev/mdev_sysfs.c   | 13 +++--
>  include/linux/mdev.h |  4 ++
>  4 files changed, 111 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> index b558d4cfd082..e825ff38b037 100644
> --- a/drivers/vfio/mdev/mdev_core.c
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -10,9 +10,11 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "mdev_private.h"
>  
> @@ -27,6 +29,8 @@ static struct class_compat *mdev_bus_compat_class;
>  static LIST_HEAD(mdev_list);
>  static DEFINE_MUTEX(mdev_list_lock);
>  
> +static struct crypto_shash *alias_hash;
> +
>  struct device *mdev_parent_dev(struct mdev_device *mdev)
>  {
>   return mdev->parent->dev;
> @@ -164,6 +168,18 @@ int mdev_register_device(struct device *dev, const 
> struct mdev_parent_ops *ops)
>   goto add_dev_err;
>   }
>  
> + if (ops->get_alias_length) {
> + unsigned int digest_size;
> + unsigned int aligned_len;
> +
> + aligned_len = roundup(ops->get_alias_length(), 2);
> + digest_size = crypto_shash_digestsize(alias_hash);
> + if (aligned_len / 2 > digest_size) {
> + ret = -EINVAL;
> + goto add_dev_err;
> + }
> + }

This looks like a sanity check, it could be done outside of the
parent_list_lock, even before we get a parent device reference.

I think we're using a callback for get_alias_length() rather than a
fixed field to support the mtty module option added in patch 4, right?
Its utility is rather limited with no args.  I could imagine that if a
parent wanted to generate an alias that could be incorporated into a
string with the parent device name that it would be useful to call this
with the parent device as an arg.  I guess we can save that until a
user comes along though.

There doesn't seem to be anything serializing use of alias_hash.

> +
>   parent = kzalloc(sizeof(*parent), GFP_KERNEL);
>   if (!parent) {
>   ret = -ENOMEM;
> @@ -259,6 +275,7 @@ static void mdev_device_free(struct mdev_device *mdev)
>   mutex_unlock(&mdev_list_lock);
>  
>   dev_dbg(&mdev->dev, "MDEV: destroying\n");
> + kvfree(mdev->alias);
>   kfree(mdev);
>  }
>  
> @@ -269,18 +286,86 @@ static void mdev_device_release(struct device *dev)
>   mdev_device_free(mdev);
>  }
>  
> -int mdev_device_create(struct kobject *kobj,
> -struct device *dev, const guid_t *uuid)
> +static const char *
> +generate_alias(const char *uuid, unsigned int max_alias_len)
> +{
> + struct shash_desc *hash_desc;
> + unsigned int digest_size;
> + unsigned char *digest;
> + unsigned int alias_len;
> + char *alias;
> + int ret = 0;
> +
> + /* Align to multiple of 2 as bin2hex will generate
> +  * even number of bytes.
> +  */

Comment style for non-networking code please.

> + alias_len = roundup(max_alias_len, 2);
> + alias = kvzalloc(alias_len + 1, GFP_KERNEL);

The size we're generating here should be small enough to just use
kzalloc(), probably below too.

> + if (!alias)
> + return NULL;
> +
> + /* Allocate and init descriptor */
> + hash_desc = kvzalloc(sizeof(*hash_desc) +
> +  crypto_shash_descsize(alias_hash),
> +  GFP_KERNEL);
> + if (!hash_desc)
> + goto desc_err;
> +
> + hash_desc->tfm = alias_hash;
> +
> + digest_size = crypto_shash_digestsize(alias_hash);
> +
> + digest = kvzalloc(digest_size, GFP_KERNEL);
> + if (!digest) {
> + ret = -ENOMEM;
> + goto digest_err;
> + }
> + crypto_shash_init(hash_desc);
> + crypto_shash_update(hash_desc, uuid, UUID_STRING_LEN);
> + crypto_shash_final(hash_desc, digest);
> + bin2hex(&alias[0], digest,

&alias[0], ie. alias

> + min_t(unsigned int, digest_size, alias_len / 2));
> + /* When alias length is odd, zero out and additional last byte
> +  * that bin2hex has copied.
> +  */
> + if (max_alias_len % 2)
> + alias[max_alias_len] = 0;

Doesn't this give us a null terminated string for odd numbers but not
even numbers?  Probably best to define that we always provide a null
terminated string then we could do this unconditionally.
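
A minimal userspace sketch of the "always null terminated" variant, under the assumption that the buffer is sized for the rounded-up length plus one byte. make_alias() and its hex table are hypothetical stand-ins for the kernel's generate_alias() and bin2hex(); the terminator is written unconditionally at max_alias_len, so odd and even lengths take the same path.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hex-encode up to max_alias_len characters of 'digest' into a newly
 * allocated, always null-terminated string.  Returns NULL on
 * allocation failure; the caller frees the result.
 */
static char *make_alias(const unsigned char *digest, size_t digest_size,
			size_t max_alias_len)
{
	static const char hex[] = "0123456789abcdef";
	/* Hex encoding emits two characters per byte: round up to even. */
	size_t alias_len = max_alias_len + (max_alias_len % 2);
	size_t nbytes = alias_len / 2;
	size_t i;
	char *alias;

	assert(digest != NULL);
	if (nbytes > digest_size)
		nbytes = digest_size;

	alias = calloc(alias_len + 1, 1);
	if (!alias)
		return NULL;

	for (i = 0; i < nbytes; i++) {
		alias[2 * i] = hex[digest[i] >> 4];
		alias[2 * i + 1] = hex[digest[i] & 0x0f];
	}

	/* Unconditional: truncates the extra character for odd lengths. */
	alias[max_alias_len] = '\0';
	return alias;
}
```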

> +
> + kvfree(digest);
> + kvfree(hash_desc);
> + return alias;
> +
> +digest_err:
> + kvfree(hash_desc);
> +desc_err:
> +  

Re: [PATCH v2 0/2] Simplify mtty driver and mdev core

2019-08-23 Thread Alex Williamson
On Sat, 24 Aug 2019 03:56:08 +
Parav Pandit  wrote:

> > -Original Message-
> > From: Alex Williamson 
> > Sent: Saturday, August 24, 2019 1:14 AM
> > To: Parav Pandit 
> > Cc: Jiri Pirko ; Jiri Pirko ; David S 
> > . Miller
> > ; Kirti Wankhede ; Cornelia
> > Huck ; k...@vger.kernel.org; linux-
> > ker...@vger.kernel.org; cjia ; net...@vger.kernel.org
> > Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
> > 
> > On Fri, 23 Aug 2019 18:00:30 +
> > Parav Pandit  wrote:
> >   
> > > > -Original Message-
> > > > From: Alex Williamson 
> > > > Sent: Friday, August 23, 2019 10:47 PM
> > > > To: Parav Pandit 
> > > > Cc: Jiri Pirko ; Jiri Pirko ;
> > > > David S . Miller ; Kirti Wankhede
> > > > ; Cornelia Huck ;
> > > > k...@vger.kernel.org; linux- ker...@vger.kernel.org; cjia
> > > > ; net...@vger.kernel.org
> > > > Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
> > > >
> > > > On Fri, 23 Aug 2019 16:14:04 +
> > > > Parav Pandit  wrote:
> > > >  
> > > > > > > Idea is to have mdev alias as optional.
> > > > > > > Each mdev_parent says whether it wants mdev_core to generate
> > > > > > > an alias or not. So only networking device drivers would set it 
> > > > > > > to true.
> > > > > > > For rest, alias won't be generated, and won't be compared
> > > > > > > either during creation time. User continue to provide only uuid.  
> > > > > >
> > > > > > Ok
> > > > > >  
> > > > > > > I am tempted to have alias collision detection only within
> > > > > > > children mdevs of the same parent, but doing so will always
> > > > > > > mandate to prefix in netdev name. And currently we are left
> > > > > > > with only 3 characters to prefix it, so that may not be good 
> > > > > > > either.
> > > > > > > Hence, I think mdev core wide alias is better with 12 characters. 
> > > > > > >  
> > > > > >
> > > > > > I suppose it depends on the API, if the vendor driver can ask
> > > > > > the mdev core for an alias as part of the device creation
> > > > > > process, then it could manage the netdev namespace for all its
> > > > > > devices, choosing how many characters to use, and fail the
> > > > > > creation if it can't meet a uniqueness requirement.  IOW,
> > > > > > mdev-core would always provide a full
> > > > > > sha1 and therefore gets itself out of the uniqueness/collision 
> > > > > > aspects.
> > > > > >  
> > > > > This doesn't work. At mdev core level 20 bytes sha1 are unique, so
> > > > > mdev core allowed to create a mdev.  
> > > >
> > > > The mdev vendor driver has the opportunity to fail the device
> > > > creation in mdev_parent_ops.create().
> > > >  
> > > That is not helpful for below reasons.
> > > 1. vendor driver doesn't have visibility in other vendor's alias.
> > > 2. Even for single vendor, it needs to maintain global list of devices to
> > > see collision.
> > > 3. multiple vendors needs to implement same scheme.
> > >
> > > Mdev core should be the owner. Shifting ownership from one layer to a
> > > lower layer in vendor driver doesn't solve the problem (if there is
> > > one, which I think doesn't exist).
> > >  
> > > > > And then devlink core chooses
> > > > > only 6 bytes (12 characters) and there is collision. Things fall
> > > > > apart. Since mdev provides unique uuid based scheme, it's the mdev
> > > > > core's ownership to provide unique aliases.  
> > > >
> > > > You're suggesting/contemplating multiple solutions here, 3-char
> > > > prefix + 12- char sha1 vs  + ?-char sha1.  Also, the
> > > > 15-char total limit is imposed by an external subsystem, where the
> > > > vendor driver is the gateway between that subsystem and mdev.  How
> > > > would mdev integrate with another subsystem that maybe only has
> > > > 9-chars available?  Would the vendor driver API specify "I need an
> > > > alias" or would it specify "I need an X-char length al

Re: [PATCH v2] vfio: re-arrange vfio region definitions

2019-08-23 Thread Alex Williamson
On Wed, 14 Aug 2019 11:52:14 -0600
Alex Williamson  wrote:

> On Tue,  6 Aug 2019 11:30:00 +0200
> Cornelia Huck  wrote:
> 
> > It is easy to miss already defined region types. Let's re-arrange
> > the definitions a bit and add more comments to make it hopefully
> > a bit clearer.
> > 
> > No functional change.
> > 
> > Signed-off-by: Cornelia Huck 
> > ---
> > v1 -> v2:
> >   - moved all pci subtypes together
> >   - tweaked comments a bit more
> > ---
> >  include/uapi/linux/vfio.h | 45 ++-
> >  1 file changed, 26 insertions(+), 19 deletions(-)  
> 
> Thanks Connie!  This looks good to me, I'll queue it for v5.4.  Thanks,

Thanks for your patience, Connie.  This is now in the vfio next branch
for v5.4.  Thanks,

Alex

> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index 8f10748dac79..e809b22f6a60 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -295,15 +295,38 @@ struct vfio_region_info_cap_type {
> > __u32 subtype;  /* type specific */
> >  };
> >  
> > +/*
> > + * List of region types, global per bus driver.
> > + * If you introduce a new type, please add it here.
> > + */
> > +
> > +/* PCI region type containing a PCI vendor part */
> >  #define VFIO_REGION_TYPE_PCI_VENDOR_TYPE   (1 << 31)
> >  #define VFIO_REGION_TYPE_PCI_VENDOR_MASK   (0x)
> > +#define VFIO_REGION_TYPE_GFX(1)
> > +#define VFIO_REGION_TYPE_CCW   (2)
> > +
> > +/* sub-types for VFIO_REGION_TYPE_PCI_* */
> >  
> > -/* 8086 Vendor sub-types */
> > +/* 8086 vendor PCI sub-types */
> >  #define VFIO_REGION_SUBTYPE_INTEL_IGD_OPREGION (1)
> >  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG (2)
> >  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG  (3)
> >  
> > -#define VFIO_REGION_TYPE_GFX(1)
> > +/* 10de vendor PCI sub-types */
> > +/*
> > + * NVIDIA GPU NVlink2 RAM is coherent RAM mapped onto the host address 
> > space.
> > + */
> > +#define VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2_RAM (1)
> > +
> > +/* 1014 vendor PCI sub-types */
> > +/*
> > + * IBM NPU NVlink2 ATSD (Address Translation Shootdown) register of NPU
> > + * to do TLB invalidation on a GPU.
> > + */
> > +#define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD   (1)
> > +
> > +/* sub-types for VFIO_REGION_TYPE_GFX */
> >  #define VFIO_REGION_SUBTYPE_GFX_EDID(1)
> >  
> >  /**
> > @@ -353,25 +376,9 @@ struct vfio_region_gfx_edid {
> >  #define VFIO_DEVICE_GFX_LINK_STATE_DOWN  2
> >  };
> >  
> > -#define VFIO_REGION_TYPE_CCW   (2)
> > -/* ccw sub-types */
> > +/* sub-types for VFIO_REGION_TYPE_CCW */
> >  #define VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD  (1)
> >  
> > -/*
> > - * 10de vendor sub-type
> > - *
> > - * NVIDIA GPU NVlink2 RAM is coherent RAM mapped onto the host address 
> > space.
> > - */
> > -#define VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2_RAM (1)
> > -
> > -/*
> > - * 1014 vendor sub-type
> > - *
> > - * IBM NPU NVlink2 ATSD (Address Translation Shootdown) register of NPU
> > - * to do TLB invalidation on a GPU.
> > - */
> > -#define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD   (1)
> > -
> >  /*
> >   * The MSIX mappable capability informs that MSIX data of a BAR can be 
> > mmapped
> >   * which allows direct access to non-MSIX registers which happened to be 
> > within  
> 



Re: [PATCH v2 1/2] vfio-mdev/mtty: Simplify interrupt generation

2019-08-23 Thread Alex Williamson
On Thu,  8 Aug 2019 09:12:54 -0500
Parav Pandit  wrote:

> While generating interrupt, mdev_state is already available for which
> interrupt is generated.
> Instead of doing indirect way from state->device->uuid-> to searching
> state linearly in linked list on every interrupt generation,
> directly use the available state.
> 
> Hence, simplify the code to use mdev_state and remove unused helper
> function with that.
> 
> Reviewed-by: Cornelia Huck 
> Signed-off-by: Parav Pandit 
> ---
>  samples/vfio-mdev/mtty.c | 39 ---
>  1 file changed, 8 insertions(+), 31 deletions(-)

Applied this commit to vfio next branch with Christoph's review for
v5.4.  As Connie has another use case for the mdev_uuid() API in
flight, I'm not applying patch 2/2.  Thanks,

Alex

> diff --git a/samples/vfio-mdev/mtty.c b/samples/vfio-mdev/mtty.c
> index 92e770a06ea2..ce84a300a4da 100644
> --- a/samples/vfio-mdev/mtty.c
> +++ b/samples/vfio-mdev/mtty.c
> @@ -152,20 +152,9 @@ static const struct file_operations vd_fops = {
>  
>  /* function prototypes */
>  
> -static int mtty_trigger_interrupt(const guid_t *uuid);
> +static int mtty_trigger_interrupt(struct mdev_state *mdev_state);
>  
>  /* Helper functions */
> -static struct mdev_state *find_mdev_state_by_uuid(const guid_t *uuid)
> -{
> - struct mdev_state *mds;
> -
> - list_for_each_entry(mds, &mdev_devices_list, next) {
> - if (guid_equal(mdev_uuid(mds->mdev), uuid))
> - return mds;
> - }
> -
> - return NULL;
> -}
>  
>  static void dump_buffer(u8 *buf, uint32_t count)
>  {
> @@ -337,8 +326,7 @@ static void handle_bar_write(unsigned int index, struct 
> mdev_state *mdev_state,
>   pr_err("Serial port %d: Fifo level trigger\n",
>   index);
>  #endif
> - mtty_trigger_interrupt(
> - mdev_uuid(mdev_state->mdev));
> + mtty_trigger_interrupt(mdev_state);
>   }
>   } else {
>  #if defined(DEBUG_INTR)
> @@ -352,8 +340,7 @@ static void handle_bar_write(unsigned int index, struct 
> mdev_state *mdev_state,
>*/
>   if (mdev_state->s[index].uart_reg[UART_IER] &
>   UART_IER_RLSI)
> - mtty_trigger_interrupt(
> - mdev_uuid(mdev_state->mdev));
> + mtty_trigger_interrupt(mdev_state);
>   }
>   mutex_unlock(&mdev_state->rxtx_lock);
>   break;
> @@ -372,8 +359,7 @@ static void handle_bar_write(unsigned int index, struct 
> mdev_state *mdev_state,
>   pr_err("Serial port %d: IER_THRI write\n",
>   index);
>  #endif
> - mtty_trigger_interrupt(
> - mdev_uuid(mdev_state->mdev));
> + mtty_trigger_interrupt(mdev_state);
>   }
>  
>   mutex_unlock(&mdev_state->rxtx_lock);
> @@ -444,7 +430,7 @@ static void handle_bar_write(unsigned int index, struct 
> mdev_state *mdev_state,
>  #if defined(DEBUG_INTR)
>   pr_err("Serial port %d: MCR_OUT2 write\n", index);
>  #endif
> - mtty_trigger_interrupt(mdev_uuid(mdev_state->mdev));
> + mtty_trigger_interrupt(mdev_state);
>   }
>  
>   if ((mdev_state->s[index].uart_reg[UART_IER] & UART_IER_MSI) &&
> @@ -452,7 +438,7 @@ static void handle_bar_write(unsigned int index, struct 
> mdev_state *mdev_state,
>  #if defined(DEBUG_INTR)
>   pr_err("Serial port %d: MCR RTS/DTR write\n", index);
>  #endif
> - mtty_trigger_interrupt(mdev_uuid(mdev_state->mdev));
> + mtty_trigger_interrupt(mdev_state);
>   }
>   break;
>  
> @@ -503,8 +489,7 @@ static void handle_bar_read(unsigned int index, struct 
> mdev_state *mdev_state,
>  #endif
>   if (mdev_state->s[index].uart_reg[UART_IER] &
>UART_IER_THRI)
> - mtty_trigger_interrupt(
> - mdev_uuid(mdev_state->mdev));
> + mtty_trigger_interrupt(mdev_state);
>   }
>   mutex_unlock(&mdev_state->rxtx_lock);
>  
> @@ -1028,17 +1013,9 @@ static int mtty_set_irqs(struct mdev_device *mdev, 
> uint32_t flags,
>   return ret;
>  }
>  
> -static int mtty_trigger_interrupt(const guid_t *uuid)
> +static int mtty_trigger_interrupt(struct mdev_state *mdev_state)
>  {
>   int ret = -1;
> - struct mdev_state *mdev_state;
> -
> - mdev_state = find_mdev_state_by_uuid(uuid);
> -
> - if 

Re: [PATCH v3] vfio_pci: Restore original state on release

2019-08-23 Thread Alex Williamson
On Thu, 22 Aug 2019 11:35:19 +0800
hexin  wrote:

> vfio_pci_enable() saves the device's initial configuration information
> with the intent that it is restored in vfio_pci_disable().  However,
> the commit referenced in Fixes: below replaced the call to
> __pci_reset_function_locked(), which is not wrapped in a state save
> and restore, with pci_try_reset_function(), which overwrites the
> restored device state with the current state before applying it to the
> device.  Reinstate use of __pci_reset_function_locked() to return to
> the desired behavior.
> 
> Fixes: 890ed578df82 ("vfio-pci: Use pci "try" reset interface")
> Signed-off-by: hexin 
> Signed-off-by: Liu Qi 
> Signed-off-by: Zhang Yu 
> ---

Applied to vfio next branch for v5.4.  Thanks,

Alex

> v2->v3:
> - change commit log 
> v1->v2:
> - add fixes tag
> - add comment to warn 
> 
> [1] 
> https://lore.kernel.org/linux-pci/1565926427-21675-1-git-send-email-hexi...@baidu.com
> [2] 
> https://lore.kernel.org/linux-pci/1566042663-16694-1-git-send-email-hexi...@baidu.com
> 
>  drivers/vfio/pci/vfio_pci.c | 17 +
>  1 file changed, 13 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 703948c..0220616 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -438,11 +438,20 @@ static void vfio_pci_disable(struct vfio_pci_device 
> *vdev)
>   pci_write_config_word(pdev, PCI_COMMAND, PCI_COMMAND_INTX_DISABLE);
>  
>   /*
> -  * Try to reset the device.  The success of this is dependent on
> -  * being able to lock the device, which is not always possible.
> +  * Try to get the locks ourselves to prevent a deadlock. The
> +  * success of this is dependent on being able to lock the device,
> +  * which is not always possible.
> +  * We can not use the "try" reset interface here, which will
> +  * overwrite the previously restored configuration information.
>*/
> - if (vdev->reset_works && !pci_try_reset_function(pdev))
> - vdev->needs_reset = false;
> + if (vdev->reset_works && pci_cfg_access_trylock(pdev)) {
> + if (device_trylock(&pdev->dev)) {
> + if (!__pci_reset_function_locked(pdev))
> + vdev->needs_reset = false;
> + device_unlock(&pdev->dev);
> + }
> + pci_cfg_access_unlock(pdev);
> + }
>  
>   pci_restore_state(pdev);
>  out:



Re: [PATCH v2 0/2] Simplify mtty driver and mdev core

2019-08-23 Thread Alex Williamson
On Fri, 23 Aug 2019 18:00:30 +
Parav Pandit  wrote:

> > -Original Message-
> > From: Alex Williamson 
> > Sent: Friday, August 23, 2019 10:47 PM
> > To: Parav Pandit 
> > Cc: Jiri Pirko ; Jiri Pirko ; David S 
> > . Miller
> > ; Kirti Wankhede ; Cornelia
> > Huck ; k...@vger.kernel.org; linux-
> > ker...@vger.kernel.org; cjia ; net...@vger.kernel.org
> > Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
> > 
> > On Fri, 23 Aug 2019 16:14:04 +
> > Parav Pandit  wrote:
> >   
> > > > > Idea is to have mdev alias as optional.
> > > > > Each mdev_parent says whether it wants mdev_core to generate an
> > > > > alias or not. So only networking device drivers would set it to true.
> > > > > For rest, alias won't be generated, and won't be compared either
> > > > > during creation time. User continue to provide only uuid.  
> > > >
> > > > Ok
> > > >  
> > > > > I am tempted to have alias collision detection only within
> > > > > children mdevs of the same parent, but doing so will always
> > > > > mandate to prefix in netdev name. And currently we are left with
> > > > > only 3 characters to prefix it, so that may not be good either.
> > > > > Hence, I think mdev core wide alias is better with 12 characters.  
> > > >
> > > > I suppose it depends on the API, if the vendor driver can ask the
> > > > mdev core for an alias as part of the device creation process, then
> > > > it could manage the netdev namespace for all its devices, choosing
> > > > how many characters to use, and fail the creation if it can't meet a
> > > > uniqueness requirement.  IOW, mdev-core would always provide a full
> > > > sha1 and therefore gets itself out of the uniqueness/collision aspects.
> > > >  
> > > This doesn't work. At mdev core level 20 bytes sha1 are unique, so
> > > mdev core allowed to create a mdev.  
> > 
> > The mdev vendor driver has the opportunity to fail the device creation in
> > mdev_parent_ops.create().
> >   
> That is not helpful for below reasons.
> 1. vendor driver doesn't have visibility in other vendor's alias.
> 2. Even for single vendor, it needs to maintain global list of devices to see 
> collision.
> 3. multiple vendors needs to implement same scheme.
> 
> Mdev core should be the owner. Shifting ownership from one layer to a
> lower layer in vendor driver doesn't solve the problem (if there is
> one, which I think doesn't exist).
> 
> > > And then devlink core chooses
> > > only 6 bytes (12 characters) and there is collision. Things fall
> > > apart. Since mdev provides unique uuid based scheme, it's the mdev
> > > core's ownership to provide unique aliases.  
> > 
> > You're suggesting/contemplating multiple solutions here, 3-char
> > prefix + 12- char sha1 vs  + ?-char sha1.  Also, the
> > 15-char total limit is imposed by an external subsystem, where the
> > vendor driver is the gateway between that subsystem and mdev.  How
> > would mdev integrate with another subsystem that maybe only has
> > 9-chars available?  Would the vendor driver API specify "I need an
> > alias" or would it specify "I need an X-char length alias"?  
> Yes, Vendor driver should say how long the alias it wants.
> However before we implement that, I suggest let such
> vendor/user/driver arrive which needs that. Such variable length
> alias can be added at that time and even with that alias collision
> can be detected by single mdev module.

If we agree that different alias lengths are possible, then I would
request that minimally an mdev sample driver be modified to request an
alias with a length that can be adjusted without recompiling in order
to exercise the collision path.

If mdev-core is guaranteeing uniqueness, does this indicate that each
alias length constitutes a separate namespace?  ie. strictly a
strcmp(), not a strncmp() to the shorter alias.
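
The two readings of "unique" can be made concrete with a small userspace sketch; the helper names are hypothetical. With an exact strcmp() comparison, a 9-character alias and a 12-character alias sharing the same leading characters live in separate namespaces, whereas a strncmp() over the shorter length treats them as a collision.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Each alias length is its own namespace: only identical strings collide. */
static bool collide_exact(const char *a, const char *b)
{
	assert(a && b);
	return strcmp(a, b) == 0;
}

/* Single shared namespace: a shorter alias collides with any extension. */
static bool collide_prefix(const char *a, const char *b)
{
	size_t la, lb;

	assert(a && b);
	la = strlen(a);
	lb = strlen(b);
	return strncmp(a, b, la < lb ? la : lb) == 0;
}
```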

> > Does it make sense that mdev-core would fail creation of a device
> > if there's a collision in the 12-char address space between
> > different subsystems?  For example, does enm0123456789ab really
> > collide with xyz0123456789ab?   
> I think so, because at mdev level its 12-char alias matters.
> Choosing the prefix not adding prefix is really a user space choice.
> 
> >  So if
> > mdev were to provided a 40-char sha1, is it possible that the
> > vendor driver could consume this in its create callback, truncate
> > it to the number of 

Re: [PATCH v2 0/2] Simplify mtty driver and mdev core

2019-08-23 Thread Alex Williamson
On Fri, 23 Aug 2019 16:14:04 +
Parav Pandit  wrote:


Re: [PATCH v2 0/2] Simplify mtty driver and mdev core

2019-08-23 Thread Alex Williamson
On Fri, 23 Aug 2019 14:53:06 +
Parav Pandit  wrote:


Re: [PATCH v2 0/2] Simplify mtty driver and mdev core

2019-08-23 Thread Alex Williamson
On Fri, 23 Aug 2019 08:14:39 +
Parav Pandit  wrote:

> Hi Alex,
> 
> 
> > >> >> >> >> > > > > Just an example of the alias, not proposing how it's 
> > >> >> >> >> > > > > set.
> > >> >> >> >> > > > > In fact, proposing that the user does not set it,
> > >> >> >> >> > > > > mdev-core provides one automatically.
> > >> >> >> >> > > > >  
> > >> >> >> >> > > > > > > Since there seems to be some prefix overhead, as
> > >> >> >> >> > > > > > > I ask about above in how many characters we
> > >> >> >> >> > > > > > > actually have to work with in IFNAMESZ, maybe we
> > >> >> >>

Re: [PATCH v2] vfio_pci: Replace pci_try_reset_function() with __pci_reset_function_locked() to ensure that the pci device configuration space is restored to its original state

2019-08-21 Thread Alex Williamson
On Wed, 21 Aug 2019 23:13:08 +0800
hexin  wrote:

> On Tue, 20 Aug 2019 at 3:53 AM, Alex Williamson wrote:
> >
> > On Sat, 17 Aug 2019 19:51:03 +0800
> > hexin  wrote:
> >  
> > > In vfio_pci_enable(), save the device's initial configuration information
> > > and then restore the configuration in vfio_pci_disable(). However, the
> > > execution result is not the same. Since the pci_try_reset_function()
> > > function saves the current state before resetting, the configuration
> > > information restored by pci_load_and_free_saved_state() will be
> > > overwritten. The __pci_reset_function_locked() function can be used
> > > to prevent the configuration space from being overwritten.
> > >
> > > Fixes: 890ed578df82 ("vfio-pci: Use pci "try" reset interface")
> > > Signed-off-by: hexin 
> > > Signed-off-by: Liu Qi 
> > > Signed-off-by: Zhang Yu 
> > > ---
> > >  drivers/vfio/pci/vfio_pci.c | 17 +
> > >  1 file changed, 13 insertions(+), 4 deletions(-)  
> >
> > This looks good, but the subject is too long and I find the commit log
> > somewhat confusing.  May I update these as follows?
> >
> > vfio_pci: Restore original state on release
> >
> > vfio_pci_enable() saves the device's initial configuration information
> > with the intent that it is restored in vfio_pci_disable().  However,
> > commit 890ed578df82 ("vfio-pci: Use pci "try" reset interface")
> > replaced the call to __pci_reset_function_locked(), which is not wrapped
> > in a state save and restore, with pci_try_reset_function(), which
> > overwrites the restored device state with the current state before
> > applying it to the device.  Restore use of __pci_reset_function_locked()
> > to return to the desired behavior.
> >
> > Thanks,
> > Alex
> >
> >  
> 
> Thanks for your update, the updated commit log is clearer than before.
> However, when I run checkpatch.pl on the patch, it reports the
> following error:
> 
> ERROR: Please use git commit description style 'commit <12+ chars of
> sha1> ("<title line>")'
> - ie: 'commit 890ed578df82 ("vfio-pci: Use pci "try" reset interface")'
> 
> At lines 2785 ~ 2801 of checkpatch.pl, the script can't handle a commit title
> that contains double quotes because of the expression `([^"]+)`, like
> the "try" above.
> Maybe checkpatch.pl needs to be modified.

I think we're following the intention of the rule, and as you've
identified it's the implementation of the rule checker that's unable
to handle a commit title with internal quotes.  We can ignore it, and
maybe follow up with a checkpatch.pl patch, or we could just avoid it
as follows:

vfio_pci: Restore original state on release

vfio_pci_enable() saves the device's initial configuration information
with the intent that it is restored in vfio_pci_disable().  However,
the commit referenced in Fixes: below replaced the call to
__pci_reset_function_locked(), which is not wrapped in a state save
and restore, with pci_try_reset_function(), which overwrites the
restored device state with the current state before applying it to the
device.  Reinstate use of __pci_reset_function_locked() to return to
the desired behavior.

Thanks,
Alex
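
The ordering problem described in the commit log can be illustrated with
a toy model.  This is only a sketch under stated assumptions: struct
toy_dev and the toy_*() helpers are hypothetical stand-ins and are not
the PCI core API; they only mimic the "save, reset, restore" behaviour
attributed above to pci_try_reset_function() versus the bare reset of
__pci_reset_function_locked().

```c
#include <stdbool.h>

/* Toy stand-in for a device: one config value and one saved-state slot. */
struct toy_dev {
	int config;	/* current "configuration space" contents */
	int saved;	/* state held in the device's saved-state slot */
};

static void toy_save_state(struct toy_dev *d)    { d->saved = d->config; }
static void toy_restore_state(struct toy_dev *d) { d->config = d->saved; }

/* Models a bare reset: no save/restore wrapped around it. */
static void toy_reset_locked(struct toy_dev *d)
{
	d->config = 0;
}

/* Models a "try" reset: saves the current state, resets, restores it. */
static void toy_try_reset(struct toy_dev *d)
{
	toy_save_state(d);	/* clobbers whatever was loaded before */
	toy_reset_locked(d);
	toy_restore_state(d);
}

/*
 * Models the disable path: load the original state into the saved slot,
 * reset through one of the two interfaces, then restore from the slot.
 * Returns the config value the device ends up with.
 */
static int toy_disable(struct toy_dev *d, int original, bool use_try_reset)
{
	d->saved = original;
	if (use_try_reset)
		toy_try_reset(d);
	else
		toy_reset_locked(d);
	toy_restore_state(d);
	return d->config;
}
```

With a device whose config was changed from 1 to 7 while in use,
toy_disable(&dev, 1, true) leaves it at 7, whereas
toy_disable(&dev, 1, false) returns it to the original 1.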

> > > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> > > index 703948c..0220616 100644
> > > --- a/drivers/vfio/pci/vfio_pci.c
> > > +++ b/drivers/vfio/pci/vfio_pci.c
> > > @@ -438,11 +438,20 @@ static void vfio_pci_disable(struct vfio_pci_device *vdev)
> > >  	pci_write_config_word(pdev, PCI_COMMAND, PCI_COMMAND_INTX_DISABLE);
> > >
> > >  	/*
> > > -	 * Try to reset the device.  The success of this is dependent on
> > > -	 * being able to lock the device, which is not always possible.
> > > +	 * Try to get the locks ourselves to prevent a deadlock. The
> > > +	 * success of this is dependent on being able to lock the device,
> > > +	 * which is not always possible.
> > > +	 * We can not use the "try" reset interface here, which will
> > > +	 * overwrite the previously restored configuration information.
> > >  	 */
> > > -	if (vdev->reset_works && !pci_try_reset_function(pdev))
> > > -		vdev->needs_reset = false;
> > > +	if (vdev->reset_works && pci_cfg_access_trylock(pdev)) {
> > > +		if (device_trylock(&pdev->dev)) {
> > > +			if (!__pci_reset_function_locked(pdev))
> > > +				vdev->needs_reset = false;
> > > +			device_unlock(&pdev->dev);
> > > +		}
> > > +		pci_cfg_access_unlock(pdev);
> > > +	}
> > >
> > >  	pci_restore_state(pdev);
> > >  out:
> >  



Re: [PATCH v2 0/2] Simplify mtty driver and mdev core

2019-08-20 Thread Alex Williamson
On Wed, 21 Aug 2019 05:01:52 +
Parav Pandit  wrote:

> > On Wed, 21 Aug 2019 04:40:15 +
> > Parav Pandit  wrote:
> >   
> > > > On Wed, 21 Aug 2019 03:42:25 +
> > > > Parav Pandit  wrote:
> > > >  
> > > > > > On Tue, 20 Aug 2019 08:58:02 + Parav Pandit
> > > > > >  wrote:
> > > > > >  
> > > > > > > + Dave.
> > > > > > >
> > > > > > > Hi Jiri, Dave, Alex, Kirti, Cornelia,
> > > > > > >
> > > > > > > Please provide your feedback on it, how shall we proceed?
> > > > > > >
> > > > > > > Short summary of requirements.
> > > > > > > For a given mdev (mediated device [1]), there is one
> > > > > > > representor netdevice and devlink port in switchdev mode
> > > > > > > > (similar to SR-IOV VF), And there is one netdevice for the actual
> > > > > > > > mdev when mdev is probed.
> > > > > > >
> > > > > > > (a) representor netdev and devlink port should be able derive
> > > > > > > phys_port_name(). So that representor netdev name can be built
> > > > > > > deterministically across reboots.
> > > > > > >
> > > > > > > (b) for mdev's netdevice, mdev's device should have an attribute.
> > > > > > > This attribute can be used by udev rules/systemd or something
> > > > > > > else to rename netdev name deterministically.
> > > > > > >
> > > > > > > (c) IFNAMSIZ of 16 bytes is too small to fit whole UUID.
> > > > > > > A simple grep IFNAMSIZ in stack hints hundreds of users of
> > > > > > > IFNAMSIZ in drivers, uapi, netlink, boot config area and more.
> > > > > > > Changing IFNAMSIZ for a mdev bus doesn't really look
> > > > > > > > reasonable option to me.
> > > > > >
> > > > > > How many characters do we really have to work with?  Your
> > > > > > examples below prepend various characters, ex. option-1 results
> > > > > > in ens2f0_m10 or enm10.  Do the extra 8 or 3 characters in these 
> > > > > > > count against IFNAMSIZ?
> > > > > >  
> > > > > Maximum 15. Last is null termination.
> > > > > Some udev rules setting by user prefix the PF netdev interface. I
> > > > > > took such example below where ens2f0 netdev named is prefixed.
> > > > > Some prefer not to prefix.
> > > > >  
> > > > > > > Hence, I would like to discuss below options.
> > > > > > >
> > > > > > > Option-1: mdev index
> > > > > > > Introduce an optional mdev index/handle as u32 during mdev
> > > > > > > create time. User passes mdev index/handle as input.
> > > >

Re: [PATCH v2 0/2] Simplify mtty driver and mdev core

2019-08-20 Thread Alex Williamson
On Wed, 21 Aug 2019 04:40:15 +
Parav Pandit  wrote:

> > On Wed, 21 Aug 2019 03:42:25 +
> > Parav Pandit  wrote:
> >   
> > > > On Tue, 20 Aug 2019 08:58:02 +
> > > > Parav Pandit  wrote:
> > > >  
> > > > > + Dave.
> > > > >
> > > > > Hi Jiri, Dave, Alex, Kirti, Cornelia,
> > > > >
> > > > > Please provide your feedback on it, how shall we proceed?
> > > > >
> > > > > Short summary of requirements.
> > > > > For a given mdev (mediated device [1]), there is one representor
> > > > > netdevice and devlink port in switchdev mode (similar to SR-IOV
> > > > > VF), And there is one netdevice for the actual mdev when mdev is 
> > > > > probed.
> > > > >
> > > > > (a) representor netdev and devlink port should be able derive
> > > > > phys_port_name(). So that representor netdev name can be built
> > > > > deterministically across reboots.
> > > > >
> > > > > (b) for mdev's netdevice, mdev's device should have an attribute.
> > > > > This attribute can be used by udev rules/systemd or something else
> > > > > to rename netdev name deterministically.
> > > > >
> > > > > (c) IFNAMSIZ of 16 bytes is too small to fit whole UUID.
> > > > > A simple grep IFNAMSIZ in stack hints hundreds of users of
> > > > > IFNAMSIZ in drivers, uapi, netlink, boot config area and more.
> > > > > Changing IFNAMSIZ for a mdev bus doesn't really look reasonable 
> > > > > > option to me.
> > > >
> > > > How many characters do we really have to work with?  Your examples
> > > > below prepend various characters, ex. option-1 results in ens2f0_m10
> > > > or enm10.  Do the extra 8 or 3 characters in these count against 
> > > > IFNAMSIZ?
> > > >  
> > > Maximum 15. Last is null termination.
> > > Some udev rules setting by user prefix the PF netdev interface. I took 
> > > > such example below where ens2f0 netdev named is prefixed.
> > > Some prefer not to prefix.
> > >  
> > > > > Hence, I would like to discuss below options.
> > > > >
> > > > > Option-1: mdev index
> > > > > Introduce an optional mdev index/handle as u32 during mdev create
> > > > > time. User passes mdev index/handle as input.
> > > > >
> > > > > phys_port_name=mIndex=m%u
> > > > > mdev_index will be available in sysfs as mdev attribute for udev
> > > > > to name the mdev's netdev.
> > > > >
> > > > > example mdev create command:
> > > > > UUID=$(uuidgen)
> > > > > echo $UUID index=10  
> > > > > > /sys/class/net/ens2f0/mdev_supported_types/mlx5_core_mdev/create  
> > > >
> > > > Nit, IIRC previous discussions of additional parameters used comma
> > > > separators, ex. echo $UUID,index=10 >...
> > > >  
> > > Yes, ok.
> > >  
> > > > > > example netdevs:  
> > > > > repnetdev=ens2f0_m10  /*ens2f0 is parent PF's netdevice */  
> > > >
> > > > Is the parent really relevant in the name?  
> > > No. I just picked one udev example who prefixed the parent netdev name.
> > > But there are users who do not prefix it.
> > >  
> > > > Tools like mdevctl are meant to
> > > > provide persistence, creating the same mdev devices on the same
> > > > parent, but that's sim

Re: [PATCH v2 0/2] Simplify mtty driver and mdev core

2019-08-20 Thread Alex Williamson
On Wed, 21 Aug 2019 03:42:25 +
Parav Pandit  wrote:

> > On Tue, 20 Aug 2019 08:58:02 +
> > Parav Pandit  wrote:
> >   
> > > + Dave.
> > >
> > > Hi Jiri, Dave, Alex, Kirti, Cornelia,
> > >
> > > Please provide your feedback on it, how shall we proceed?
> > >
> > > Short summary of requirements.
> > > For a given mdev (mediated device [1]), there is one representor
> > > netdevice and devlink port in switchdev mode (similar to SR-IOV VF),
> > > And there is one netdevice for the actual mdev when mdev is probed.
> > >
> > > (a) representor netdev and devlink port should be able derive
> > > phys_port_name(). So that representor netdev name can be built
> > > deterministically across reboots.
> > >
> > > (b) for mdev's netdevice, mdev's device should have an attribute.
> > > This attribute can be used by udev rules/systemd or something else to
> > > rename netdev name deterministically.
> > >
> > > (c) IFNAMSIZ of 16 bytes is too small to fit whole UUID.
> > > A simple grep IFNAMSIZ in stack hints hundreds of users of IFNAMSIZ in
> > > drivers, uapi, netlink, boot config area and more. Changing IFNAMSIZ
> > > for a mdev bus doesn't really look reasonable option to me.  
> > 
> > How many characters do we really have to work with?  Your examples below
> > prepend various characters, ex. option-1 results in ens2f0_m10 or enm10.  Do
> > the extra 8 or 3 characters in these count against IFNAMSIZ?
> >   
> Maximum 15. Last is null termination.
> Some udev rules setting by user prefix the PF netdev interface. I took such 
> example below where ens2f0 netdev named is prefixed.
> Some prefer not to prefix.
> 
> > > Hence, I would like to discuss below options.
> > >
> > > Option-1: mdev index
> > > Introduce an optional mdev index/handle as u32 during mdev create
> > > time. User passes mdev index/handle as input.
> > >
> > > phys_port_name=mIndex=m%u
> > > mdev_index will be available in sysfs as mdev attribute for udev to
> > > name the mdev's netdev.
> > >
> > > example mdev create command:
> > > UUID=$(uuidgen)
> > > echo $UUID index=10  
> > > > /sys/class/net/ens2f0/mdev_supported_types/mlx5_core_mdev/create  
> > 
> > Nit, IIRC previous discussions of additional parameters used comma 
> > separators,
> > ex. echo $UUID,index=10 >...
> >   
> Yes, ok.
> 
> > > > example netdevs:  
> > > repnetdev=ens2f0_m10  /*ens2f0 is parent PF's netdevice */  
> > 
> > Is the parent really relevant in the name?
> No. I just picked one udev example who prefixed the parent netdev name.
> But there are users who do not prefix it.
> 
> > Tools like mdevctl are meant to
> > provide persistence, creating the same mdev devices on the same parent, but
> > that's simply the easiest policy decision.  We can also imagine that 
> > multiple
> > parent devices might support a specified mdev type and policies factoring in
> > proximity, load-balancing, power consumption, etc might be weighed such that
> > we really don't want to promote userspace creating dependencies on the
> > parent association.
> >   
> > > mdev_netdev=enm10
> > >
> > > Pros:
> > > 1. mdevctl and any other existing tools are unaffected.
> > > 2. netdev stack, ovs and other switching platforms are unaffected.
> > > 3. achieves unique phys_port_name for representor netdev 4. achieves
> > > unique mdev eth netdev name for the mdev using udev/systemd extension.
> > > 5. Aligns well with mdev and netdev subsystem and similar to existing
> > > sriov bdf's.  
> > 
> > A user provided index seems strange to me.  It's not really an index, just 
> > a user
> > specified instance number.  Presumably you have the user providing this
> > because if it really were an index, then the value depends on the creation 
> > order
> > and persistence is lost.  Now the user needs to both avoid uuid collision 
> > as well
> > as "index" number collision.  The uuid namespace is large enough to

Re: [PATCH v2 0/2] Simplify mtty driver and mdev core

2019-08-20 Thread Alex Williamson
On Tue, 20 Aug 2019 08:58:02 +
Parav Pandit  wrote:

> + Dave.
> 
> Hi Jiri, Dave, Alex, Kirti, Cornelia,
> 
> Please provide your feedback on it, how shall we proceed?
> 
> Short summary of requirements.
> For a given mdev (mediated device [1]), there is one representor
> netdevice and devlink port in switchdev mode (similar to SR-IOV VF),
> And there is one netdevice for the actual mdev when mdev is probed.
> 
> (a) representor netdev and devlink port should be able to derive
> phys_port_name(). So that representor netdev name can be built
> deterministically across reboots.
> 
> (b) for mdev's netdevice, mdev's device should have an attribute.
> This attribute can be used by udev rules/systemd or something else to
> rename netdev name deterministically.
> 
> (c) IFNAMSIZ of 16 bytes is too small to fit whole UUID.
> A simple grep IFNAMSIZ in stack hints hundreds of users of IFNAMSIZ
> in drivers, uapi, netlink, boot config area and more. Changing
> IFNAMSIZ for a mdev bus doesn't really look like a reasonable option to me.

How many characters do we really have to work with?  Your examples
below prepend various characters, ex. option-1 results in ens2f0_m10 or
enm10.  Do the extra 8 or 3 characters in these count against IFNAMSIZ?
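
As a rough illustration of the budget being asked about, here is a
minimal sketch.  It assumes the 16-byte IFNAMSIZ quoted above, including
the terminating NUL, mirrored here as the hypothetical TOY_IFNAMSIZ so
that no kernel header is needed; fits_ifname() and
chars_left_after_prefix() are likewise hypothetical helpers.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumption: mirrors the 16-byte IFNAMSIZ discussed in this thread. */
#define TOY_IFNAMSIZ 16

/* A name fits if it leaves room for the terminating NUL byte. */
static bool fits_ifname(const char *name)
{
	return strlen(name) < TOY_IFNAMSIZ;
}

/* Number of characters still available once a prefix has been chosen. */
static size_t chars_left_after_prefix(const char *prefix)
{
	size_t len = strlen(prefix);
	size_t max = TOY_IFNAMSIZ - 1;

	return len >= max ? 0 : max - len;
}
```

With these assumptions a 3-char prefix such as "enm" leaves 12
characters and an 8-char prefix such as "ens2f0_m" leaves 7, while a
full 36-character UUID string does not fit at all.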

> Hence, I would like to discuss below options.
> 
> Option-1: mdev index
> Introduce an optional mdev index/handle as u32 during mdev create
> time. User passes mdev index/handle as input.
> 
> phys_port_name=mIndex=m%u
> mdev_index will be available in sysfs as mdev attribute for udev to
> name the mdev's netdev.
> 
> example mdev create command:
> UUID=$(uuidgen)
> echo $UUID index=10
> > /sys/class/net/ens2f0/mdev_supported_types/mlx5_core_mdev/create

Nit, IIRC previous discussions of additional parameters used comma
separators, ex. echo $UUID,index=10 >...

> > example netdevs:
> repnetdev=ens2f0_m10  /*ens2f0 is parent PF's netdevice */

Is the parent really relevant in the name?  Tools like mdevctl are
meant to provide persistence, creating the same mdev devices on the
same parent, but that's simply the easiest policy decision.  We can
also imagine that multiple parent devices might support a specified
mdev type and policies factoring in proximity, load-balancing, power
consumption, etc might be weighed such that we really don't want to
promote userspace creating dependencies on the parent association.

> mdev_netdev=enm10
> 
> Pros:
> 1. mdevctl and any other existing tools are unaffected.
> 2. netdev stack, ovs and other switching platforms are unaffected.
> 3. achieves unique phys_port_name for representor netdev
> 4. achieves unique mdev eth netdev name for the mdev using
> udev/systemd extension.
> 5. Aligns well with mdev and netdev subsystem and similar to existing
> sriov bdf's.

A user-provided index seems strange to me.  It's not really an index,
just a user-specified instance number.  Presumably you have the user
providing this because if it really were an index, then the value
depends on the creation order and persistence is lost.  Now the user
needs to both avoid uuid collision as well as "index" number
collision.  The uuid namespace is large enough to mostly ignore this,
but this is not.  This seems like a burden.

> Option-2: shorter mdev name
> Extend mdev to have shorter mdev device name in addition to UUID.
> such as 'foo', 'bar'.
> Mdev will continue to have UUID.
> phys_port_name=mdev_name
> 
> Pros:
> 1. All same as option-1, except mdevctl needs upgrade for newer usage.
> It is common practice to upgrade iproute2 package along with the
> kernel. Similar practice to be done with mdevctl.
> 2. Newer users of mdevctl who want to work with non-UUID names will
> use newer mdevctl/tools.
> Cons:
> 1. Dual naming scheme of mdev might affect some of the existing tools.
> It's unclear how/if it actually affects.
> mdevctl [2] is very recently developed and can be enhanced for dual
> naming scheme.

I think we've already nak'ed this one, the device namespace becomes
meaningless if the name becomes just a string where a uuid might be an
example string.  mdevs are named by uuid.
 
> Option-3: mdev uuid alias
> Instead of shorter mdev name or mdev index, have alpha-numeric name
> alias. Alias is an optional mdev sysfs attribute such as 'foo', 'bar'.
> example mdev create command:
> UUID=$(uuidgen)
> echo $UUID alias=foo
> > /sys/class/net/ens2f0/mdev_supported_types/mlx5_core_mdev/create
> example netdevs:
> repnetdev = ens2f0_mfoo
> mdev_netdev=enmfoo
> 
> Pros:
> 1. All same as option-1.
> 2. Doesn't affect existing mdev naming scheme.
> Cons:
> 1. Index scheme of option-1 is better which can number large number
> of mdevs with fewer characters, simplifying the management tool.

No better than option-1, simply a larger secondary namespace, but still
requires the user to come up with two independent names for the device.

> Option-4: extend IFNAMSIZ to be 64 bytes
> Extend IFNAMSIZ from 16 to 64 bytes.
> phys_port_name=mdev_UUID_string

Re: [PATCH v5 2/6] vfio: Introduce vGPU display irq type

2019-08-20 Thread Alex Williamson
On Tue, 20 Aug 2019 02:12:10 +
"Zhang, Tina"  wrote:

> BTW, IIRC, we might also have one question waiting to be replied:
> - Can we just use VFIO_IRQ_TYPE_GFX w/o proposing a new sub type
> (i.e. VFIO_IRQ_SUBTYPE_GFX_DISPLAY_IRQ)? Well, only if we can agree
> that we won't have any other GFX IRQ requirements in the future.
> Otherwise, we might need a sub type to differentiate them.

I think you've answered your own question ;)  We already have the
infrastructure for defining type/sub-type and it allows us to
categorize and group interrupt types together consistent with how we do
for regions, so what's the overhead in this approach?  Otherwise we
tend to have an ad-hoc list.  We can't say with absolute certainty that
we won't have additional GFX related IRQs.  Thanks,

Alex


Re: [PATCH v3 0/3] genirq/vfio: Introduce update_irq_devid and optimize VFIO irq ops

2019-08-20 Thread Alex Williamson
On Tue, 20 Aug 2019 12:03:50 +0800
luoben  wrote:

> On 2019/8/20 at 4:51 AM, Alex Williamson wrote:
> > On Thu, 15 Aug 2019 21:02:58 +0800
> > Ben Luo  wrote:
> >  
> >> Currently, VFIO takes a lot of free-then-request-irq actions whenever
> >> a VM (with device passthru via VFIO) sets irq affinity or mask/unmask
> >> irq. Those actions only change the cookie data of irqaction or even
> >> change nothing. The free-then-request-irq not only adds more latency,
> >> but also increases the risk of losing interrupt, which may lead to a
> >> VM hung forever in waiting for IO completion  
> > What guest environment is generating this?  Typically I don't see that
> > Windows or Linux guests bounce the interrupt configuration much.
> > Thanks,
> >
> > Alex  
> 
> By tracing centos5u8 on the host, I found it keeps masking and unmasking
> the interrupt like this:
> 
> [1566032533709879] index:28 irte_hi:00010004a601 
> irte_lo:adb54bc000b98001
> [1566032533711242] index:28 irte_hi: 
> irte_lo:
> [1566032533711258] index:28 irte_hi:0004a601 
> irte_lo:3fff00ac002d
> [1566032533711269] index:28 irte_hi:0004a601 
> irte_lo:3fff00ac002d
[snip] 
> "[1566032533720007]" is timestamp in μs, so centos5u8 triggers 30+ irte
> modifications within 10ms

Ok, that matches my understanding that only very old guests behave in
this manner.  It's a curious case to optimize as RHEL5 is in extended
life-cycle support, with regular maintenance releases ending 2+ years
ago.  Thanks,

Alex


Re: [PATCH v5 2/6] vfio: Introduce vGPU display irq type

2019-08-20 Thread Alex Williamson
On Tue, 20 Aug 2019 09:20:30 +0200
"kra...@redhat.com"  wrote:

> > > > +#define VFIO_IRQ_TYPE_GFX  (1)
> > > > +/*
> > > > + * vGPU vendor sub-type
> > > > + * vGPU device display related interrupts e.g. vblank/pageflip  */
> > > > +#define VFIO_IRQ_SUBTYPE_GFX_DISPLAY_IRQ   (1)  
> > > 
> > > If this is a GFX/DISPLAY IRQ, why are we talking about a "vGPU" in the
> > > description?  It's not specific to a vGPU implementation, right?  Is this
> > > > related to a physical display or a virtual display?  If it's related to
> > > > the GFX PLANE ioctls, it should state that.  It's not well specified
> > > > what this interrupt signals.  Is it vblank?  Is it pageflip?
> > > > Is it both?  Neither?  Something else?
> > 
> > Sorry for the confusion caused here. 
> > 
> > The original idea here was to use VFIO_IRQ_SUBTYPE_GFX_DISPLAY_IRQ to
> > notify user space with the display refresh event. The display refresh
> > event is general. When notified, user space can use
> > VFIO_DEVICE_QUERY_GFX_PLANE and VFIO_DEVICE_GET_GFX_DMABUF to get the
> > updated framebuffer, instead of polling them all the time.
> > 
> > In order to give user space more choice to do the optimization,
> > vfio_irq_info_cap_display_plane_events is proposed to tell user space
> > the different plane refresh event values. So when notified by
> > VFIO_IRQ_SUBTYPE_GFX_DISPLAY_IRQ, user space can get the value of the
> > > eventfd counter and understand which plane the refresh event
> > comes from and choose to get the framebuffer on that plane instead of
> > all the planes.
> > 
> > So, from the VFIO user point of view, there is only the display
> > refresh event (i.e. no other events like vblank, pageflip ...). For
> > GTV-g, this display refresh event is implemented by both vblank and
> > pageflip, which is only the implementation thing and can be
> > transparent to the user space. Again sorry about the confusion cased
> > here, I'll correct the comments in the next version.  
> 
> All this should be explained in a comment for the IRQ in the header file.

Yes, Tina's update and your clarification all make sense to me, but it
needs to be specified in the header how this is supposed to work, what
events get signaled and what the user is intended to do in response to
that signal.  The information is all here, it just needs to be included
in the uapi definition.  Thanks,

Alex

> Key point for the API is that (a) this is a "the display should be
> updated" event and (b) this covers all display updates, i.e. user space
> can stop the display update timer and fully depend on getting
> notifications if an update is needed.
> 
> That GVT-g watches guest pageflips is an implementation detail.  Should
> nvidia support this they will probably do something completely
> different.  As far as I know they render the guest display to some
> framebuffer at something like 10fps, so it would make sense for them to
> send an event each time they refreshed the framebuffer.
> 
> Also note the relationships (cur_event_val is for DRM_PLANE_TYPE_CURSOR
> updates and pri_event_val for DRM_PLANE_TYPE_PRIMARY).



Re: [PATCH v3 0/3] genirq/vfio: Introduce update_irq_devid and optimize VFIO irq ops

2019-08-19 Thread Alex Williamson
On Thu, 15 Aug 2019 21:02:58 +0800
Ben Luo  wrote:

> Currently, VFIO takes a lot of free-then-request-irq actions whenever
> a VM (with device passthru via VFIO) sets irq affinity or mask/unmask
> irq. Those actions only change the cookie data of irqaction or even
> change nothing. The free-then-request-irq not only adds more latency,
> but also increases the risk of losing interrupt, which may lead to a
> VM hung forever in waiting for IO completion

What guest environment is generating this?  Typically I don't see that
Windows or Linux guests bounce the interrupt configuration much.
Thanks,

Alex

> 
> This patchset solved the issue by:
> Patch 2 introduces update_irq_devid to only update dev_id of irqaction
> Patch 3 make use of update_irq_devid and optimize irq operations in VFIO
> 
> changes from v2:
>  - reformat to avoid quoted string split across lines and etc.
> 
> changes from v1:
>  - add Patch 1 to enhance error recovery etc. in free irq per tglx's comments
>  - enhance error recovery code and debugging info in update_irq_devid
>  - use __must_check in external referencing of update_irq_devid
>  - use EXPORT_SYMBOL_GPL for update_irq_devid
>  - reformat code of patch 3 for better readability
> 
> Ben Luo (3):
>   genirq: enhance error recovery code in free irq
>   genirq: introduce update_irq_devid()
>   vfio_pci: make use of update_irq_devid and optimize irq ops
> 
>  drivers/vfio/pci/vfio_pci_intrs.c | 101 +-
>  include/linux/interrupt.h |   3 ++
>  kernel/irq/manage.c   | 110 +-
>  3 files changed, 164 insertions(+), 50 deletions(-)
> 



Re: [PATCH v2] vfio_pci: Replace pci_try_reset_function() with __pci_reset_function_locked() to ensure that the pci device configuration space is restored to its original state

2019-08-19 Thread Alex Williamson
On Sat, 17 Aug 2019 19:51:03 +0800
hexin  wrote:

> In vfio_pci_enable(), save the device's initial configuration information
> and then restore the configuration in vfio_pci_disable(). However, the
> execution result is not the same. Since the pci_try_reset_function()
> function saves the current state before resetting, the configuration
> information restored by pci_load_and_free_saved_state() will be
> overwritten. The __pci_reset_function_locked() function can be used
> to prevent the configuration space from being overwritten.
> 
> Fixes: 890ed578df82 ("vfio-pci: Use pci "try" reset interface")
> Signed-off-by: hexin 
> Signed-off-by: Liu Qi 
> Signed-off-by: Zhang Yu 
> ---
>  drivers/vfio/pci/vfio_pci.c | 17 +
>  1 file changed, 13 insertions(+), 4 deletions(-)

This looks good, but the subject is too long and I find the commit log
somewhat confusing.  May I update these as follows?

vfio_pci: Restore original state on release

vfio_pci_enable() saves the device's initial configuration information
with the intent that it is restored in vfio_pci_disable().  However,
commit 890ed578df82 ("vfio-pci: Use pci "try" reset interface")
replaced the call to __pci_reset_function_locked(), which is not wrapped
in a state save and restore, with pci_try_reset_function(), which
overwrites the restored device state with the current state before
applying it to the device.  Restore use of __pci_reset_function_locked()
to return to the desired behavior.

Thanks,
Alex
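
For readers following the state-overwrite argument, a minimal userspace
model may help.  This is not kernel code: it assumes a device with a
single saved-state slot, and every example_* name is invented for the
sketch rather than taken from the PCI core.  It only models the ordering
described above (load the original state, reset, restore).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a device with one saved-state slot. */
struct example_dev {
	int config;        /* live configuration value */
	int saved;         /* the single saved-state slot */
	bool state_saved;  /* true while 'saved' holds something to restore */
};

static void example_save_state(struct example_dev *d)
{
	d->saved = d->config;
	d->state_saved = true;
}

static void example_restore_state(struct example_dev *d)
{
	if (!d->state_saved)
		return;
	d->config = d->saved;
	d->state_saved = false;
}

/* Models loading the state captured at enable time into the slot. */
static void example_load_saved_state(struct example_dev *d, int original)
{
	d->saved = original;
	d->state_saved = true;
}

/* Reset wrapped in save/restore: clobbers whatever was loaded before. */
static void example_reset_saving(struct example_dev *d)
{
	example_save_state(d);
	d->config = 0;
	example_restore_state(d);
}

/* Plain reset: leaves the saved-state slot alone. */
static void example_reset_plain(struct example_dev *d)
{
	d->config = 0;
}

/* Returns the configuration the device ends up with after "disable". */
static int example_disable(int original, int user_value, bool saving_reset)
{
	struct example_dev d = { .config = user_value };

	example_load_saved_state(&d, original);
	if (saving_reset)
		example_reset_saving(&d);
	else
		example_reset_plain(&d);
	example_restore_state(&d);
	return d.config;
}
```

In this model, example_disable(0x10, 0x99, true) leaves the user's value
in place, while example_disable(0x10, 0x99, false) returns the device to
the original value, which is the behavior the commit log asks for.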


> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 703948c..0220616 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -438,11 +438,20 @@ static void vfio_pci_disable(struct vfio_pci_device *vdev)
>   pci_write_config_word(pdev, PCI_COMMAND, PCI_COMMAND_INTX_DISABLE);
>  
>   /*
> -  * Try to reset the device.  The success of this is dependent on
> -  * being able to lock the device, which is not always possible.
> +  * Try to get the locks ourselves to prevent a deadlock. The
> +  * success of this is dependent on being able to lock the device,
> +  * which is not always possible.
> +  * We can not use the "try" reset interface here, which will
> +  * overwrite the previously restored configuration information.
>*/
> - if (vdev->reset_works && !pci_try_reset_function(pdev))
> - vdev->needs_reset = false;
> + if (vdev->reset_works && pci_cfg_access_trylock(pdev)) {
> + if (device_trylock(&pdev->dev)) {
> + if (!__pci_reset_function_locked(pdev))
> + vdev->needs_reset = false;
> + device_unlock(&pdev->dev);
> + }
> + pci_cfg_access_unlock(pdev);
> + }
>  
>   pci_restore_state(pdev);
>  out:



Re: [PATCH v5 2/6] vfio: Introduce vGPU display irq type

2019-08-16 Thread Alex Williamson
On Fri, 16 Aug 2019 10:35:24 +0800
Tina Zhang  wrote:

> Introduce vGPU specific irq type VFIO_IRQ_TYPE_GFX, and
> VFIO_IRQ_SUBTYPE_GFX_DISPLAY_IRQ as the subtype for vGPU display.
> 
> Introduce vfio_irq_info_cap_display_plane_events capability to notify
> user space with the vGPU's plane update events
> 
> v2:
> - Add VFIO_IRQ_SUBTYPE_GFX_DISPLAY_IRQ description. (Alex & Kechen)
> - Introduce vfio_irq_info_cap_display_plane_events. (Gerd & Alex)
> 
> Signed-off-by: Tina Zhang 
> ---
>  include/uapi/linux/vfio.h | 21 +
>  1 file changed, 21 insertions(+)
> 
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index d83c9f136a5b..21ac69f0e1a9 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -465,6 +465,27 @@ struct vfio_irq_info_cap_type {
>   __u32 subtype;  /* type specific */
>  };
>  
> +#define VFIO_IRQ_TYPE_GFX (1)
> +/*
> + * vGPU vendor sub-type
> + * vGPU device display related interrupts e.g. vblank/pageflip
> + */
> +#define VFIO_IRQ_SUBTYPE_GFX_DISPLAY_IRQ (1)

If this is a GFX/DISPLAY IRQ, why are we talking about a "vGPU" in the
description?  It's not specific to a vGPU implementation, right?  Is
this related to a physical display or a virtual display?  If it's
related to the GFX PLANE ioctls, it should state that.  It's not well
specified what this interrupt signals.  Is it vblank?  Is it pageflip?
Is it both?  Neither?  Something else?

> +
> +/*
> + * Display capability of using one eventfd to notify user space with the
> + * vGPU's plane update events.
> + * cur_event_val: eventfd value stands for cursor plane change event.
> + * pri_event_val: eventfd value stands for primary plane change event.
> + */
> +#define VFIO_IRQ_INFO_CAP_DISPLAY 4
> +
> +struct vfio_irq_info_cap_display_plane_events {
> + struct vfio_info_cap_header header;
> + __u64 cur_event_val;
> + __u64 pri_event_val;
> +};

Again, what display?  Does this reference a GFX plane?  The event_val
data is not well specified, examples might be necessary.  They seem to
be used as a flag bit, so should we simply define a bit index for the
flag rather than a u64 value?  Where are the actual events per plane
defined?
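
To illustrate the bit-index alternative raised above, here is a rough
sketch.  The enum and helper names are hypothetical and appear nowhere
in the posted patch; the only point is that a small index can describe
each plane event while the u64 value is derived from it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit indexes; these names are not part of the patch. */
enum example_plane_event_bit {
	EXAMPLE_PLANE_EVENT_PRIMARY = 0,
	EXAMPLE_PLANE_EVENT_CURSOR  = 1,
};

/* The u64 value that would be signaled for a given bit index. */
static uint64_t example_plane_event_value(enum example_plane_event_bit bit)
{
	return UINT64_C(1) << bit;
}

/* Tests whether a given plane's bit is set in a received event word. */
static bool example_plane_event_pending(uint64_t events,
					enum example_plane_event_bit bit)
{
	return (events >> bit) & 1;
}
```

One caveat for any such scheme: a plain eventfd counter accumulates
written values by addition, so the same event signaled twice before a
read no longer looks like a single flag bit.  That is part of what the
uapi comment would need to spell out.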

I'm not sure this patch shouldn't be rolled back into 1, I couldn't
find the previous discussion that triggered it to be separate.  Perhaps
simply for sharing with the work Eric is doing?  If so, that's fine,
but maybe make note of it in the cover letter.  Thanks,

Alex


Re: [PATCH v5 1/6] vfio: Define device specific irq type capability

2019-08-16 Thread Alex Williamson
On Fri, 16 Aug 2019 10:35:23 +0800
Tina Zhang  wrote:

> Cap the number of irqs with fixed indexes and use capability chains
> to chain device specific irqs.
> 
> Signed-off-by: Tina Zhang 
> Signed-off-by: Eric Auger 
> ---
>  include/uapi/linux/vfio.h | 19 ++-
>  1 file changed, 18 insertions(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 02bb7ad6e986..d83c9f136a5b 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -444,11 +444,27 @@ struct vfio_irq_info {
>  #define VFIO_IRQ_INFO_MASKABLE   (1 << 1)
>  #define VFIO_IRQ_INFO_AUTOMASKED (1 << 2)
>  #define VFIO_IRQ_INFO_NORESIZE   (1 << 3)
> +#define VFIO_IRQ_INFO_FLAG_CAPS  (1 << 4) /* Info supports caps */
>   __u32   index;  /* IRQ index */
>   __u32   count;  /* Number of IRQs within this index */
> + __u32   cap_offset; /* Offset within info struct of first cap */
>  };
>  #define VFIO_DEVICE_GET_IRQ_INFO _IO(VFIO_TYPE, VFIO_BASE + 9)
>  
> +/*
> + * The irq type capability allows irqs unique to a specific device or
> + * class of devices to be exposed.
> + *
> + * The structures below define version 1 of this capability.
> + */
> +#define VFIO_IRQ_INFO_CAP_TYPE  3

Why 3?  What's using 1 and 2 of this newly defined info cap ID?  Thanks,

Alex
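
For context on why the ID value matters to userspace, below is a sketch
of how a capability chain is typically walked: each header carries an
ID and the offset of the next capability from the start of the info
struct, with 0 terminating the chain.  The struct here is a local
stand-in with the same layout as vfio_info_cap_header so the example is
self-contained, and example_find_cap() is a hypothetical helper, not an
existing API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local stand-in mirroring the layout of struct vfio_info_cap_header. */
struct example_cap_header {
	uint16_t id;       /* capability ID */
	uint16_t version;  /* version of this capability */
	uint32_t next;     /* offset of next capability, 0 ends the chain */
};

/*
 * Walk the chain starting at cap_offset inside the info buffer and
 * return the offset of the first capability with a matching ID, or 0
 * if none is found or the chain is malformed.
 */
static uint32_t example_find_cap(const uint8_t *info, size_t info_size,
				 uint32_t cap_offset, uint16_t id)
{
	uint32_t off = cap_offset;
	size_t hops = 0;

	while (off != 0) {
		struct example_cap_header hdr;

		/* Header must lie entirely inside the buffer. */
		if (off > info_size || info_size - off < sizeof(hdr))
			return 0;
		/* Guard against chains that loop back on themselves. */
		if (hops++ > info_size / sizeof(hdr))
			return 0;

		memcpy(&hdr, info + off, sizeof(hdr));
		if (hdr.id == id)
			return off;
		off = hdr.next;
	}
	return 0;
}
```

Since lookup is purely by ID, the IDs only need to be unique within the
IRQ info capability namespace, which is why starting a new namespace at
3 raises the question above.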

> +
> +struct vfio_irq_info_cap_type {
> + struct vfio_info_cap_header header;
> + __u32 type; /* global per bus driver */
> + __u32 subtype;  /* type specific */
> +};
> +
>  /**
>   * VFIO_DEVICE_SET_IRQS - _IOW(VFIO_TYPE, VFIO_BASE + 10, struct vfio_irq_set)
>   *
> @@ -550,7 +566,8 @@ enum {
>   VFIO_PCI_MSIX_IRQ_INDEX,
>   VFIO_PCI_ERR_IRQ_INDEX,
>   VFIO_PCI_REQ_IRQ_INDEX,
> - VFIO_PCI_NUM_IRQS
> + VFIO_PCI_NUM_IRQS = 5   /* Fixed user ABI, IRQ indexes >=5 use   */
> + /* device specific cap to define content */
>  };
>  
>  /*



Re: [PATCH v2 08/10] vfio_pci: Loop using PCI_STD_NUM_BARS

2019-08-16 Thread Alex Williamson
On Fri, 16 Aug 2019 12:24:35 +0300
Denis Efremov  wrote:

> Refactor loops to use 'i < PCI_STD_NUM_BARS' instead of
> 'i <= PCI_STD_RESOURCE_END'.
> 
> Signed-off-by: Denis Efremov 
> ---
>  drivers/vfio/pci/vfio_pci.c | 11 +++
>  drivers/vfio/pci/vfio_pci_config.c  | 10 ++
>  drivers/vfio/pci/vfio_pci_private.h |  4 ++--
>  3 files changed, 15 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 703948c9fbe1..cb7d220d3246 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -110,13 +110,15 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
>  static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
>  {
>   struct resource *res;
> - int bar;
> + int i;
>   struct vfio_pci_dummy_resource *dummy_res;
>  
>   INIT_LIST_HEAD(&vdev->dummy_resources_list);
>  
> - for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) {
> - res = vdev->pdev->resource + bar;
> + for (i = 0; i < PCI_STD_NUM_BARS; i++) {
> + int bar = i + PCI_STD_RESOURCES;
> +
> + res = &vdev->pdev->resource[bar];
>  
>   if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP))
>   goto no_mmap;
> @@ -399,7 +401,8 @@ static void vfio_pci_disable(struct vfio_pci_device *vdev)
>  
>   vfio_config_free(vdev);
>  
> - for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) {
> + for (i = 0; i < PCI_STD_NUM_BARS; i++) {
> + bar = i + PCI_STD_RESOURCES;
>   if (!vdev->barmap[bar])
>   continue;
>   pci_iounmap(pdev, vdev->barmap[bar]);
> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
> index f0891bd8444c..df8772395219 100644
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c
> @@ -455,16 +455,18 @@ static void vfio_bar_fixup(struct vfio_pci_device *vdev)
>  
>   bar = (__le32 *)&vdev->vconfig[PCI_BASE_ADDRESS_0];
>  
> - for (i = PCI_STD_RESOURCES; i <= PCI_STD_RESOURCE_END; i++, bar++) {
> - if (!pci_resource_start(pdev, i)) {
> + for (i = 0; i < PCI_STD_NUM_BARS; i++, bar++) {
> + int ibar = i + PCI_STD_RESOURCES;
> +
> + if (!pci_resource_start(pdev, ibar)) {
>   *bar = 0; /* Unmapped by host = unimplemented to user */
>   continue;
>   }
>  
> - mask = ~(pci_resource_len(pdev, i) - 1);
> + mask = ~(pci_resource_len(pdev, ibar) - 1);
>  
>   *bar &= cpu_to_le32((u32)mask);
> - *bar |= vfio_generate_bar_flags(pdev, i);
> + *bar |= vfio_generate_bar_flags(pdev, ibar);
>  
>   if (*bar & cpu_to_le32(PCI_BASE_ADDRESS_MEM_TYPE_64)) {
>   bar++;

It might be a bit cleaner to rename the 'bar' variable to 'vbar', then
we have 'bar' available to use as the BAR number.  It seems more
consistent with other uses.  Otherwise the logic looks fine.  Thanks,

Alex
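
A rough sketch of the suggested naming, written as a userspace model
rather than a patch: 'vbar' walks the virtual config space while 'bar'
is the BAR number.  The example_res struct and EXAMPLE_MEM_TYPE_64 are
stand-ins invented here (the latter for PCI_BASE_ADDRESS_MEM_TYPE_64),
PCI_STD_RESOURCES is assumed to be 0, and endian conversion is left out
for brevity.

```c
#include <assert.h>
#include <stdint.h>

#define PCI_STD_NUM_BARS    6      /* local definition for the sketch */
#define EXAMPLE_MEM_TYPE_64 0x04u  /* stand-in for the 64-bit BAR flag */

/* Hypothetical stand-in for the host resource data of one BAR. */
struct example_res {
	uint64_t start;  /* 0 means unmapped by the host */
	uint64_t len;    /* power-of-two size in bytes */
	uint32_t flags;  /* low BAR flag bits */
};

static void example_bar_fixup(uint32_t *vconfig_bars,
			      const struct example_res *res)
{
	uint32_t *vbar = vconfig_bars;  /* virtual BAR register */
	int bar;                        /* BAR number */

	for (bar = 0; bar < PCI_STD_NUM_BARS; bar++, vbar++) {
		uint64_t mask;

		if (!res[bar].start) {
			*vbar = 0;  /* unmapped by host = unimplemented */
			continue;
		}

		mask = ~(res[bar].len - 1);
		*vbar &= (uint32_t)mask;
		*vbar |= res[bar].flags;

		if (*vbar & EXAMPLE_MEM_TYPE_64) {
			/* A 64-bit BAR consumes the next register too. */
			vbar++;
			*vbar &= (uint32_t)(mask >> 32);
			bar++;
		}
	}
}
```

With this naming, every helper that takes a BAR number is passed 'bar'
directly and no separate 'ibar' temporary is needed.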

> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> index ee6ee91718a4..8a2c7607d513 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -86,8 +86,8 @@ struct vfio_pci_reflck {
>  
>  struct vfio_pci_device {
>   struct pci_dev  *pdev;
> - void __iomem *barmap[PCI_STD_RESOURCE_END + 1];
> - bool bar_mmap_supported[PCI_STD_RESOURCE_END + 1];
> + void __iomem *barmap[PCI_STD_NUM_BARS];
> + bool bar_mmap_supported[PCI_STD_NUM_BARS];
>   u8  *pci_config_map;
>   u8  *vconfig;
>   struct perm_bits*msi_perm;



Re: [PATCH] vfio_pci: Replace pci_try_reset_function() with __pci_reset_function_locked() to ensure that the pci device configuration space is restored to its original state

2019-08-16 Thread Alex Williamson
On Fri, 16 Aug 2019 11:33:47 +0800
hexin  wrote:

> In vfio_pci_enable(), save the device's initial configuration information
> and then restore the configuration in vfio_pci_disable(). However, the
> execution result is not the same. Since the pci_try_reset_function()
> function saves the current state before resetting, the configuration
> information restored by pci_load_and_free_saved_state() will be
> overwritten. The __pci_reset_function_locked() function can be used
> to prevent the configuration space from being overwritten.
> 
> Signed-off-by: hexin 
> Signed-off-by: Liu Qi 
> Signed-off-by: Zhang Yu 
> ---
>  drivers/vfio/pci/vfio_pci.c | 10 --
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 703948c..3c93492 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -441,8 +441,14 @@ static void vfio_pci_disable(struct vfio_pci_device *vdev)
>* Try to reset the device.  The success of this is dependent on
>* being able to lock the device, which is not always possible.
>*/
> - if (vdev->reset_works && !pci_try_reset_function(pdev))
> - vdev->needs_reset = false;
> + if (vdev->reset_works && pci_cfg_access_trylock(pdev)) {
> + if (device_trylock(&pdev->dev)) {
> + if (!__pci_reset_function_locked(pdev))
> + vdev->needs_reset = false;
> + device_unlock(&pdev->dev);
> + }
> + pci_cfg_access_unlock(pdev);
> + }
>  
>   pci_restore_state(pdev);
>  out:

This used to work, I think what happened is that we initially called
__pci_reset_function() to avoid the saved state getting overwritten,
then commit d24cdbfd28b7 ("vfio-pci: Avoid deadlock on remove") added
the trylock support to avoid deadlock, then commit 890ed578df82
("vfio-pci: Use pci "try" reset interface") assumed the trylock was the
reason for the unusual calling convention and simply replaced it with
pci_try_reset_function().  So, I think we need two things.  First, a
fixes tag:

Fixes: 890ed578df82 ("vfio-pci: Use pci "try" reset interface")

Second, a comment to warn us against performing a similar cleanup again
in the future.  Thanks,

Alex


Re: [PATCH] KVM: x86/MMU: Zap all when removing memslot if VM has assigned device

2019-08-15 Thread Alex Williamson
On Thu, 15 Aug 2019 08:12:28 -0700
Sean Christopherson  wrote:

> Alex Williamson reported regressions with device assignment when KVM
> changed its memslot removal logic to zap only the SPTEs for the memslot
> being removed.  The source of the bug is unknown at this time, and root
> causing the issue will likely be a slow process.  In the short term, fix
> the regression by zapping all SPTEs when removing a memslot from a VM
> with assigned device(s).
> 
> Fixes: 4e103134b862 ("KVM: x86/mmu: Zap only the relevant pages when removing a memslot", 2019-02-05)
> Reported-by: Alex Williamson 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Sean Christopherson 
> ---
> 
> An alternative idea to a full revert.  I assume this would be easy to
> backport, and also easy to revert or quirk depending on where the bug
> is hiding.
> 
>  arch/x86/kvm/mmu.c | 11 +++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 8f72526e2f68..358b93882ac6 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -5659,6 +5659,17 @@ static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
>   bool flush;
>   gfn_t gfn;
>  
> + /*
> +  * Zapping only the removed memslot introduced regressions for VMs with
> +  * assigned devices.  It is unknown what piece of code is buggy.  Until
> +  * the source of the bug is identified, zap everything if the VM has an
> +  * assigned device.
> +  */
> + if (kvm_arch_has_assigned_device(kvm)) {
> + kvm_mmu_zap_all(kvm);
> + return;
> + }
> +
>   spin_lock(&kvm->mmu_lock);
>  
>   if (list_empty(&kvm->arch.active_mmu_pages))

Though if we want to zoom in a little further, the patch below seems to
work.  Both versions of these perhaps just highlight that we don't
really know why the original code doesn't work with device assignment,
whether it's something special about GPU mapping, or if it hints that
there's something more generally wrong and difficult to trigger.

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 24843cf49579..3956b5844479 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -5670,7 +5670,8 @@ static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
gfn = slot->base_gfn + i;
 
for_each_valid_sp(kvm, sp, gfn) {
-   if (sp->gfn != gfn)
+   if (sp->gfn != gfn &&
+   !kvm_arch_has_assigned_device(kvm))
continue;
 
                        kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);


Re: [PATCH v2] vfio: re-arrange vfio region definitions

2019-08-14 Thread Alex Williamson
On Tue,  6 Aug 2019 11:30:00 +0200
Cornelia Huck  wrote:

> It is easy to miss already defined region types. Let's re-arrange
> the definitions a bit and add more comments to make it hopefully
> a bit clearer.
> 
> No functional change.
> 
> Signed-off-by: Cornelia Huck 
> ---
> v1 -> v2:
>   - moved all pci subtypes together
>   - tweaked comments a bit more
> ---
>  include/uapi/linux/vfio.h | 45 ++-
>  1 file changed, 26 insertions(+), 19 deletions(-)

Thanks Connie!  This looks good to me, I'll queue it for v5.4.  Thanks,

Alex
 
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 8f10748dac79..e809b22f6a60 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -295,15 +295,38 @@ struct vfio_region_info_cap_type {
>   __u32 subtype;  /* type specific */
>  };
>  
> +/*
> + * List of region types, global per bus driver.
> + * If you introduce a new type, please add it here.
> + */
> +
> +/* PCI region type containing a PCI vendor part */
>  #define VFIO_REGION_TYPE_PCI_VENDOR_TYPE (1 << 31)
>  #define VFIO_REGION_TYPE_PCI_VENDOR_MASK (0x)
> +#define VFIO_REGION_TYPE_GFX (1)
> +#define VFIO_REGION_TYPE_CCW (2)
> +
> +/* sub-types for VFIO_REGION_TYPE_PCI_* */
>  
> -/* 8086 Vendor sub-types */
> +/* 8086 vendor PCI sub-types */
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_OPREGION   (1)
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG   (2)
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG (3)
>  
> -#define VFIO_REGION_TYPE_GFX (1)
> +/* 10de vendor PCI sub-types */
> +/*
> + * NVIDIA GPU NVlink2 RAM is coherent RAM mapped onto the host address space.
> + */
> +#define VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2_RAM   (1)
> +
> +/* 1014 vendor PCI sub-types */
> +/*
> + * IBM NPU NVlink2 ATSD (Address Translation Shootdown) register of NPU
> + * to do TLB invalidation on a GPU.
> + */
> +#define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD (1)
> +
> +/* sub-types for VFIO_REGION_TYPE_GFX */
>  #define VFIO_REGION_SUBTYPE_GFX_EDID (1)
>  
>  /**
> @@ -353,25 +376,9 @@ struct vfio_region_gfx_edid {
>  #define VFIO_DEVICE_GFX_LINK_STATE_DOWN  2
>  };
>  
> -#define VFIO_REGION_TYPE_CCW (2)
> -/* ccw sub-types */
> +/* sub-types for VFIO_REGION_TYPE_CCW */
>  #define VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD (1)
>  
> -/*
> - * 10de vendor sub-type
> - *
> - * NVIDIA GPU NVlink2 RAM is coherent RAM mapped onto the host address space.
> - */
> -#define VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2_RAM   (1)
> -
> -/*
> - * 1014 vendor sub-type
> - *
> - * IBM NPU NVlink2 ATSD (Address Translation Shootdown) register of NPU
> - * to do TLB invalidation on a GPU.
> - */
> -#define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD (1)
> -
>  /*
>   * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>   * which allows direct access to non-MSIX registers which happened to be within



Re: [PATCH v2 0/2] Simplify mtty driver and mdev core

2019-08-14 Thread Alex Williamson
On Wed, 14 Aug 2019 13:45:49 +
Parav Pandit  wrote:

> > -Original Message-
> > From: Cornelia Huck 
> > Sent: Wednesday, August 14, 2019 6:39 PM
> > To: Parav Pandit 
> > Cc: Alex Williamson ; Kirti Wankhede
> > ; k...@vger.kernel.org; linux-
> > ker...@vger.kernel.org; c...@nvidia.com; Jiri Pirko ;
> > net...@vger.kernel.org
> > Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
> > 
> > On Wed, 14 Aug 2019 12:27:01 +
> > Parav Pandit  wrote:
> >   
> > > + Jiri, + netdev
> > > To get perspective on the ndo->phys_port_name for the representor netdev
> > > of mdev.
> > >
> > > Hi Cornelia,
> > >  
> > > > -----Original Message-
> > > > From: Cornelia Huck 
> > > > Sent: Wednesday, August 14, 2019 1:32 PM
> > > > To: Parav Pandit 
> > > > Cc: Alex Williamson ; Kirti Wankhede
> > > > ; k...@vger.kernel.org; linux-
> > > > ker...@vger.kernel.org; c...@nvidia.com
> > > > Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
> > > >
> > > > On Wed, 14 Aug 2019 05:54:36 +
> > > > Parav Pandit  wrote:
> > > >  
> > > > > > > I get that part. I prefer to remove the UUID itself from the
> > > > > > > structure and therefore removing this API makes lot more sense?  
> > > > > >
> > > > > > Mdev and support tools around mdev are based on UUIDs because
> > > > > > it's defined in the documentation.
> > > > > When we introduce newer device naming scheme, it will update the
> > > > > documentation also.
> > > > > May be that is the time to move to .rst format too.
> > > >
> > > > You are aware that there are existing tools that expect a uuid
> > > > naming scheme, right?
> > > >  
> > > Yes, Alex mentioned too.
> > > The good tool that I am aware of is [1], which is 4 months old. Not sure
> > > if it is part of any distros yet.
> > >
> > > README also says that it is 'early in development'. So we have scope
> > > to improve it for non-UUID names, but let's discuss that more below.
> > 
> > The up-to-date reference for mdevctl is
> > https://github.com/mdevctl/mdevctl. There is currently an effort to get this
> > packaged in Fedora.
> >   
> Awesome.
> 
> > >  
> > > > >  
> > > > > > I don't think it's as simple as saying "voila, UUID dependencies
> > > > > > are removed, users are free to use arbitrary strings".  We'd
> > > > > > need to create some kind of naming policy, what characters are
> > > > > > allowed so that we can potentially expand the creation parameters
> > > > > > as has been proposed a couple times, how do we deal with
> > > > > > collisions and races, and why should we make such a change when
> > > > > > a UUID is a perfectly reasonable device name.  Thanks,
> > > > > >  
> > > > > Sure, we should define a policy on device naming to be more relaxed.
> > > > > We have enough examples in-kernel.
> > > > > Few that I am aware of are netdev (vxlan, macvlan, ipvlan, lot
> > > > > more), rdma etc which has arbitrary device names and ID based
> > > > > device names.
> > > > >
> > > > > Collisions and race is already taken care today in the mdev core.
> > > > > Same unique device names continue.
> > > >
> > > > I'm still completely missing a rationale _why_ uuids are supposedly
> > > > bad/restricting/etc.  
> > > There is nothing bad about uuid based naming.
> > > It's just too long a name to derive phys_port_name of a netdev.
> > > In details below.
> > >
> > > For a given mdev of networking type, we would like to have
> > > (a) representor netdevice [2]
> > > (b) associated devlink port [3]
> > >
> > > Currently these representor netdevice exist only for the PCIe SR-IOV VFs.
> > > It is further getting extended for mdev without SR-IOV.
> > >
> > > Each of the devlink port is attached to representor netdevice [4].
> > >
> > > This netdevice phys_port_name should be a unique derived from some
> > > property of mdev.
> > > Udev/systemd uses phys

Re: [PATCH v2 0/2] Simplify mtty driver and mdev core

2019-08-13 Thread Alex Williamson
On Tue, 13 Aug 2019 16:28:53 +
Parav Pandit  wrote:

> > -Original Message-
> > From: Alex Williamson 
> > Sent: Tuesday, August 13, 2019 8:23 PM
> > To: Parav Pandit 
> > Cc: Kirti Wankhede ; k...@vger.kernel.org; linux-
> > ker...@vger.kernel.org; coh...@redhat.com; c...@nvidia.com
> > Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
> > 
> > On Tue, 13 Aug 2019 14:40:02 +
> > Parav Pandit  wrote:
> >   
> > > > -Original Message-
> > > > From: Kirti Wankhede 
> > > > Sent: Monday, August 12, 2019 5:06 PM
> > > > To: Alex Williamson ; Parav Pandit
> > > > 
> > > > Cc: k...@vger.kernel.org; linux-kernel@vger.kernel.org;
> > > > coh...@redhat.com; c...@nvidia.com
> > > > Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
> > > >
> > > >
> > > >
> > > > On 8/9/2019 4:32 AM, Alex Williamson wrote:  
> > > > > On Thu,  8 Aug 2019 09:12:53 -0500 Parav Pandit
> > > > >  wrote:
> > > > >  
> > > > >> Currently mtty sample driver uses mdev state and UUID in
> > > > >> convoluted way to generate an interrupt.
> > > > >> It uses several translations from mdev_state to mdev_device to mdev
> > > > >> uuid.
> > > > >> After which it does linear search of long uuid comparison to
> > > > >> find out mdev_state in mtty_trigger_interrupt().
> > > > >> mdev_state is already available while generating interrupt from
> > > > >> which all such translations are done to reach back to mdev_state.
> > > > >>
> > > > >> These translations are done during interrupt generation path.
> > > > >> This is unnecessary and redundant.
> > > > >
> > > > > Is the interrupt handling efficiency of this particular sample
> > > > > driver really relevant, or is its purpose more to illustrate the
> > > > > API and provide a proof of concept?  If we go to the trouble to
> > > > > > optimize the sample driver and remove this interface from the API,
> > > > > > what do we lose?
> > > > >
> > > > > This interface was added via commit:
> > > > >
> > > > > 99e3123e3d72 vfio-mdev: Make mdev_device private and abstract
> > > > > interfaces
> > > > >
> > > > > Where the goal was to create a more formal interface and abstract
> > > > > driver access to the struct mdev_device.  In part this served to
> > > > > make out-of-tree mdev vendor drivers more supportable; the object
> > > > > is considered opaque and access is provided via an API rather than
> > > > > through direct structure fields.
> > > > >
> > > > > I believe that the NVIDIA GRID mdev driver does make use of this
> > > > > interface and it's likely included in the sample driver
> > > > > specifically so that there is an in-kernel user for it (ie.
> > > > > specifically to avoid it being removed so casually).  An
> > > > > > interesting feature of the NVIDIA mdev driver is that I believe it
> > > > > > has portions that run in userspace.
> > > > > As we know, mdevs are named with a UUID, so I can imagine there
> > > > > are some efficiencies to be gained in having direct access to the
> > > > > UUID for a device when interacting with userspace, rather than
> > > > > repeatedly parsing it from a device name.  
> > > >
> > > > That's right.
> > > >  
> > > > >  Is that really something we want to make more difficult in order
> > > > > to optimize a sample driver?  Knowing that an mdev device uses a
> > > > > > UUID for its name, as tools like libvirt and mdevctl expect, is
> > > > > it really worthwhile to remove such a trivial API?
> > > > >  
> > > > >> Hence,
> > > > >> Patch-1 simplifies mtty sample driver to directly use mdev_state.
> > > > >>
> > > > >> Patch-2, Since no production driver uses mdev_uuid(), simplifies
> > > > >> and removes redandant mdev_uuid() exported symbol.  
> > > > >
> > > > > s/no production driver/no in-kernel production driver/
> > > > >
> > > > > I'd be interested to hear how the NVIDIA f

Re: [PATCH v2 0/2] Simplify mtty driver and mdev core

2019-08-13 Thread Alex Williamson
On Tue, 13 Aug 2019 14:40:02 +
Parav Pandit  wrote:

> > -Original Message-
> > From: Kirti Wankhede 
> > Sent: Monday, August 12, 2019 5:06 PM
> > To: Alex Williamson ; Parav Pandit
> > 
> > Cc: k...@vger.kernel.org; linux-kernel@vger.kernel.org; coh...@redhat.com;
> > c...@nvidia.com
> > Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
> > 
> > 
> > 
> > On 8/9/2019 4:32 AM, Alex Williamson wrote:  
> > > On Thu,  8 Aug 2019 09:12:53 -0500
> > > Parav Pandit  wrote:
> > >  
> > >> Currently mtty sample driver uses mdev state and UUID in a convoluted
> > >> way to generate an interrupt.
> > >> It uses several translations from mdev_state to mdev_device to mdev uuid.
> > >> After which it does linear search of long uuid comparison to find
> > >> out mdev_state in mtty_trigger_interrupt().
> > >> mdev_state is already available while generating interrupt from which
> > >> all such translations are done to reach back to mdev_state.
> > >>
> > >> These translations are done during interrupt generation path.
> > >> This is unnecessary and redundant.
> > >
> > > Is the interrupt handling efficiency of this particular sample driver
> > > really relevant, or is its purpose more to illustrate the API and
> > > provide a proof of concept?  If we go to the trouble to optimize the
> > > sample driver and remove this interface from the API, what do we lose?
> > >
> > > This interface was added via commit:
> > >
> > > 99e3123e3d72 vfio-mdev: Make mdev_device private and abstract
> > > interfaces
> > >
> > > Where the goal was to create a more formal interface and abstract
> > > driver access to the struct mdev_device.  In part this served to make
> > > out-of-tree mdev vendor drivers more supportable; the object is
> > > considered opaque and access is provided via an API rather than
> > > through direct structure fields.
> > >
> > > I believe that the NVIDIA GRID mdev driver does make use of this
> > > interface and it's likely included in the sample driver specifically
> > > so that there is an in-kernel user for it (ie. specifically to avoid
> > > it being removed so casually).  An interesting feature of the NVIDIA
> > > mdev driver is that I believe it has portions that run in userspace.
> > > As we know, mdevs are named with a UUID, so I can imagine there are
> > > some efficiencies to be gained in having direct access to the UUID for
> > > a device when interacting with userspace, rather than repeatedly
> > > parsing it from a device name.  
> > 
> > That's right.
> >   
> > >  Is that really something we want to make more difficult in order to
> > > optimize a sample driver?  Knowing that an mdev device uses a UUID for
> > > > its name, as tools like libvirt and mdevctl expect, is it really
> > > worthwhile to remove such a trivial API?
> > >  
> > >> Hence,
> > >> Patch-1 simplifies mtty sample driver to directly use mdev_state.
> > >>
> > >> Patch-2, Since no production driver uses mdev_uuid(), simplifies and
> > >> removes redundant mdev_uuid() exported symbol.
> > >
> > > s/no production driver/no in-kernel production driver/
> > >
> > > I'd be interested to hear how the NVIDIA folks make use of this API
> > > interface.  Thanks,
> > >  
> > 
> > Yes, NVIDIA mdev driver does use this interface. I don't agree on removing
> > mdev_uuid() interface.
> >   
> We need to ask Greg or Linus about the kernel policy on whether an API
> should exist without an in-kernel driver. We don't add such APIs in
> netdev, rdma and possibly other subsystems. Where can we find this
> mdev driver in-tree?

We probably would not have added the API only for an out of tree
driver, but we do have a sample driver that uses it, even if it's
rather convoluted.  The sample driver is showing an example of using the
API, which is rather its purpose more so than absolutely efficient
interrupt handling.  Also, let's not overstate what this particular
API callback provides, it's simply access to the uuid of the device,
which is a fundamental property of a mediated device.  This API was
added simply to provide data abstraction, allowing the struct
mdev_device to be opaque to vendor drivers.  Thanks,

Alex
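
For anyone skimming the thread, the data abstraction being described can be
sketched roughly as below.  This is not the mdev core code; the names
example_guid, example_mdev, example_mdev_uuid() and example_uuid_matches()
are made up for illustration, and the struct layout is an assumption.  The
point is only that the "vendor" side reads the UUID through an accessor and
never dereferences the structure itself.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's 16-byte UUID type. */
typedef struct {
	unsigned char b[16];
} example_guid;

/*
 * "Core" side: the only code that is supposed to know this layout.
 * In a real split the definition would live in a private header.
 */
struct example_mdev {
	example_guid uuid;
	void *drvdata;
};

/* Accessor in the spirit of mdev_uuid(): hands out a read-only view. */
const example_guid *example_mdev_uuid(const struct example_mdev *mdev)
{
	assert(mdev != NULL);
	return &mdev->uuid;
}

/* "Vendor driver" side: uses the accessor, not the struct fields. */
int example_uuid_matches(const struct example_mdev *mdev,
			 const example_guid *want)
{
	return memcmp(example_mdev_uuid(mdev), want, sizeof(*want)) == 0;
}
```

With this shape the core can reorder or extend struct example_mdev without
touching callers, which is the supportability argument made above.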


Re: [PATCH 7/7] vfio_pci: Use PCI_STD_NUM_BARS in loops instead of PCI_STD_RESOURCE_END

2019-08-12 Thread Alex Williamson
On Mon, 12 Aug 2019 15:02:34 -0500
Bjorn Helgaas  wrote:

> On Sun, Aug 11, 2019 at 06:08:04PM +0300, Denis Efremov wrote:
> > This patch refactors the loop condition scheme from
> > 'i <= PCI_STD_RESOURCE_END' to 'i < PCI_STD_NUM_BARS'.
> > 
> > Signed-off-by: Denis Efremov 
> > ---
> >  drivers/vfio/pci/vfio_pci.c | 4 ++--
> >  drivers/vfio/pci/vfio_pci_config.c  | 2 +-
> >  drivers/vfio/pci/vfio_pci_private.h | 4 ++--
> >  3 files changed, 5 insertions(+), 5 deletions(-)
> > 
> > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> > index 703948c9fbe1..13f5430e3f3c 100644
> > --- a/drivers/vfio/pci/vfio_pci.c
> > +++ b/drivers/vfio/pci/vfio_pci.c
> > @@ -115,7 +115,7 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
> >  
> > INIT_LIST_HEAD(&vdev->dummy_resources_list);
> >  
> > -   for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) {
> > +   for (bar = 0; bar < PCI_STD_NUM_BARS; bar++) {
> > res = vdev->pdev->resource + bar;  
> 
> PCI_STD_RESOURCES is indeed 0, but since the original went to the
> trouble of avoiding that assumption, I would probably do this:
> 
> for (bar = 0; bar < PCI_STD_NUM_BARS; bar++) {
> res = vdev->pdev->resource + bar + PCI_STD_RESOURCES;
> 
> or maybe even this:
> 
> res = &vdev->pdev->resource[bar + PCI_STD_RESOURCES];
> 
> which is more common outside vfio.  But I wouldn't change to using the
> &dev->resource[] form if other vfio code that you're *not* changing
> uses the dev->resource + bar form.

I don't think we have any other instances like that, so the latter form
is fine with me if it's more broadly used.  I do spot one use of [bar]
in drivers/vfio/pci/vfio_pci_rdwr.c that could also take on this form
to avoid the same assumption though.  Thanks,

Alex
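
To make the indexing form under discussion concrete, here is a small
standalone sketch.  It is not the vfio code: the EX_* constants and the
ex_* types are local stand-ins (EX_NUM_RESOURCES in particular is an
arbitrary value for the example), and ex_count_assigned_bars() is a
hypothetical helper.  It shows the loop running over a BAR count while the
array index keeps the explicit PCI_STD_RESOURCES-style offset, so nothing
depends on that offset being 0.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins for the kernel constants named in the thread. */
#define EX_PCI_STD_RESOURCES	0	/* index of the first standard BAR */
#define EX_PCI_STD_NUM_BARS	6	/* number of standard BARs */
#define EX_NUM_RESOURCES	8	/* hypothetical: BARs plus two extras */

struct ex_resource {
	unsigned long start;
};

struct ex_pci_dev {
	struct ex_resource resource[EX_NUM_RESOURCES];
};

/* Count standard BARs that have a non-zero start address. */
size_t ex_count_assigned_bars(const struct ex_pci_dev *pdev)
{
	size_t assigned = 0;
	int bar;

	assert(pdev != NULL);

	for (bar = 0; bar < EX_PCI_STD_NUM_BARS; bar++) {
		/* Offset by the base index instead of assuming it is 0. */
		const struct ex_resource *res =
			&pdev->resource[bar + EX_PCI_STD_RESOURCES];

		if (res->start)
			assigned++;
	}

	return assigned;
}
```

Entries past the last standard BAR (index 6 and 7 here) are never visited,
which is the same bound the 'i < PCI_STD_NUM_BARS' condition expresses.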

> > if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP))
> > @@ -399,7 +399,7 @@ static void vfio_pci_disable(struct vfio_pci_device *vdev)
> >  
> > vfio_config_free(vdev);
> >  
> > -   for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) {
> > +   for (bar = 0; bar < PCI_STD_NUM_BARS; bar++) {
> > if (!vdev->barmap[bar])
> > continue;
> > pci_iounmap(pdev, vdev->barmap[bar]);
> > diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
> > index f0891bd8444c..6035a2961160 100644
> > --- a/drivers/vfio/pci/vfio_pci_config.c
> > +++ b/drivers/vfio/pci/vfio_pci_config.c
> > @@ -455,7 +455,7 @@ static void vfio_bar_fixup(struct vfio_pci_device *vdev)
> >  
> > bar = (__le32 *)&vdev->vconfig[PCI_BASE_ADDRESS_0];
> >  
> > -   for (i = PCI_STD_RESOURCES; i <= PCI_STD_RESOURCE_END; i++, bar++) {
> > +   for (i = 0; i < PCI_STD_NUM_BARS; i++, bar++) {
> > if (!pci_resource_start(pdev, i)) {
> > *bar = 0; /* Unmapped by host = unimplemented to user */
> > continue;
> > diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> > index ee6ee91718a4..8a2c7607d513 100644
> > --- a/drivers/vfio/pci/vfio_pci_private.h
> > +++ b/drivers/vfio/pci/vfio_pci_private.h
> > @@ -86,8 +86,8 @@ struct vfio_pci_reflck {
> >  
> >  struct vfio_pci_device {
> > struct pci_dev  *pdev;
> > -   void __iomem    *barmap[PCI_STD_RESOURCE_END + 1];
> > -   bool            bar_mmap_supported[PCI_STD_RESOURCE_END + 1];
> > +   void __iomem    *barmap[PCI_STD_NUM_BARS];
> > +   bool            bar_mmap_supported[PCI_STD_NUM_BARS];
> > u8  *pci_config_map;
> > u8  *vconfig;
> > struct perm_bits*msi_perm;
> > -- 
> > 2.21.0
> >   



Re: [PATCH v2 0/2] Simplify mtty driver and mdev core

2019-08-08 Thread Alex Williamson
On Thu,  8 Aug 2019 09:12:53 -0500
Parav Pandit  wrote:

> Currently mtty sample driver uses mdev state and UUID in a convoluted way to
> generate an interrupt.
> It uses several translations from mdev_state to mdev_device to mdev uuid.
> After which it does linear search of long uuid comparison to
> find out mdev_state in mtty_trigger_interrupt().
> mdev_state is already available while generating interrupt from which all
> such translations are done to reach back to mdev_state.
> 
> These translations are done during interrupt generation path.
> This is unnecessary and redundant.

Is the interrupt handling efficiency of this particular sample driver
really relevant, or is its purpose more to illustrate the API and
provide a proof of concept?  If we go to the trouble to optimize the
sample driver and remove this interface from the API, what do we lose?

This interface was added via commit:

99e3123e3d72 vfio-mdev: Make mdev_device private and abstract interfaces

Where the goal was to create a more formal interface and abstract
driver access to the struct mdev_device.  In part this served to make
out-of-tree mdev vendor drivers more supportable; the object is
considered opaque and access is provided via an API rather than through
direct structure fields.

I believe that the NVIDIA GRID mdev driver does make use of this
interface and it's likely included in the sample driver specifically so
that there is an in-kernel user for it (ie. specifically to avoid it
being removed so casually).  An interesting feature of the NVIDIA mdev
driver is that I believe it has portions that run in userspace.  As we
know, mdevs are named with a UUID, so I can imagine there are some
efficiencies to be gained in having direct access to the UUID for a
device when interacting with userspace, rather than repeatedly parsing
it from a device name.  Is that really something we want to make more
difficult in order to optimize a sample driver?  Knowing that an mdev
device uses a UUID for its name, as tools like libvirt and mdevctl
expect, is it really worthwhile to remove such a trivial API?

> Hence,
> Patch-1 simplifies mtty sample driver to directly use mdev_state.
> 
> Patch-2, Since no production driver uses mdev_uuid(), simplifies and
> removes redundant mdev_uuid() exported symbol.

s/no production driver/no in-kernel production driver/

I'd be interested to hear how the NVIDIA folks make use of this API
interface.  Thanks,

Alex

> ---
> Changelog:
> v1->v2:
>  - Corrected email of Kirti
>  - Updated cover letter commit log to address comment from Cornelia
>  - Added Reviewed-by tag
> v0->v1:
>  - Updated commit log
> 
> Parav Pandit (2):
>   vfio-mdev/mtty: Simplify interrupt generation
>   vfio/mdev: Removed unused and redundant API for mdev UUID
> 
>  drivers/vfio/mdev/mdev_core.c |  6 --
>  include/linux/mdev.h  |  1 -
>  samples/vfio-mdev/mtty.c  | 39 +++
>  3 files changed, 8 insertions(+), 38 deletions(-)
> 



Re: [RFC PATCH v4 2/6] vfio: Introduce vGPU display irq type

2019-08-02 Thread Alex Williamson
On Fri, 2 Aug 2019 15:35:31 +0200
"kra...@redhat.com"  wrote:

>   Hi,
> 
> > > > > Couldn't you expose this as another capability within the IRQ_INFO
> > > > > return data?  If you were to define it as a macro, I assume that
> > > > > means it would be hard coded, in which case this probably becomes an
> > > > > Intel specific IRQ, rather than what appears to be framed as a
> > > > > generic graphics IRQ extension.  A new capability could instead
> > > > > allow the vendor to specify their own value, where we could define
> > > > > how userspace should interpret and make use of this value.
> > > > > Thanks,
> > > Good suggestion. Currently, vfio_irq_info is used to save one irq
> > > info. What we need here is to use it to save several events info.
> > > Maybe we could figure out a general layout of this capability so that
> > > it can be leveraged by others, not only for display irq/events.  
> > 
> > You could also expose a device specific IRQ with count > 1 (ie. similar
> > to MSI/X) and avoid munging the eventfd value, which is not something
> > we do elsewhere, at least in vfio.  Thanks,  
> 
> Well, the basic idea is to use the eventfd value to signal the kind of
> changes which did happen, similar to IRQ status register bits.
> 
> So, when the guest changes the primary plane, the mdev driver notes
> this.  Same with the cursor plane.  On vblank (when the guest's update
> is actually applied) the mdev driver wakes the eventfd and uses the
> eventfd value to signal whether the primary plane or cursor plane or
> both did change.
> 
> Then userspace knows which planes need an update without an extra
> VFIO_DEVICE_QUERY_GFX_PLANE roundtrip to the kernel.
> 
> Alternatively we could have one eventfd for each change type.  But given
> that these changes are typically applied at the same time (vblank) we
> would have multiple eventfds being signaled at the same time.  Which
> doesn't look ideal to me ...

Good point, looking at the bits in the eventfd value seems better than
a flood of concurrent interrupts.  Thanks,

Alex
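
A rough sketch of the "status bits in the eventfd value" idea, for
illustration only.  Nothing here is from the proposed uapi: the
EX_GFX_EVENT_* bit assignments, struct ex_pending and the ex_* helpers are
all hypothetical.  The driver side accumulates change bits and produces one
value per vblank; the userspace side tests bits in the value it read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit assignments, one bit per kind of change. */
#define EX_GFX_EVENT_PRIMARY	(UINT64_C(1) << 0)
#define EX_GFX_EVENT_CURSOR	(UINT64_C(1) << 1)

/* Driver-side bookkeeping of changes noted since the last vblank. */
struct ex_pending {
	uint64_t bits;
};

/* Record that a plane changed; repeated changes collapse into one bit. */
void ex_note_change(struct ex_pending *p, uint64_t event)
{
	assert(p != NULL);
	p->bits |= event;
}

/* At vblank: return the value to signal and clear the pending set. */
uint64_t ex_vblank_value(struct ex_pending *p)
{
	uint64_t value;

	assert(p != NULL);
	value = p->bits;
	p->bits = 0;
	return value;
}

/* Userspace-side decoding of the value that was read. */
int ex_primary_changed(uint64_t value)
{
	return (value & EX_GFX_EVENT_PRIMARY) != 0;
}

int ex_cursor_changed(uint64_t value)
{
	return (value & EX_GFX_EVENT_CURSOR) != 0;
}
```

One thing the sketch glosses over: a plain eventfd adds each written value
to its counter, so if two vblank values are written before userspace reads,
the bits are summed rather than OR'd and can no longer be told apart.  Any
real design would need to account for that.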


Re: [PATCH] vfio: re-arrange vfio region definitions

2019-07-31 Thread Alex Williamson
On Wed, 31 Jul 2019 20:47:07 +0200
Auger Eric  wrote:

> Hi Connie,
> 
> On 7/17/19 1:49 PM, Cornelia Huck wrote:
> > It is easy to miss already defined region types. Let's re-arrange
> > the definitions a bit and add more comments to make it hopefully
> > a bit clearer.
> > 
> > No functional change.
> > 
> > Signed-off-by: Cornelia Huck 
> > ---
> >  include/uapi/linux/vfio.h | 19 ---
> >  1 file changed, 12 insertions(+), 7 deletions(-)
> > 
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index 8f10748dac79..d9bcf40240be 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -295,15 +295,23 @@ struct vfio_region_info_cap_type {
> > __u32 subtype;  /* type specific */
> >  };
> >  
> > +/*
> > + * List of region types, global per bus driver.
> > + * If you introduce a new type, please add it here.
> > + */
> > +
> > +/* PCI region type containing a PCI vendor part */
> >  #define VFIO_REGION_TYPE_PCI_VENDOR_TYPE   (1 << 31)
> >  #define VFIO_REGION_TYPE_PCI_VENDOR_MASK   (0xffff)
> > +#define VFIO_REGION_TYPE_GFX               (1)
> > +#define VFIO_REGION_TYPE_CCW               (2)
> >  
> > -/* 8086 Vendor sub-types */
> > +/* 8086 vendor PCI sub-types */
> >  #define VFIO_REGION_SUBTYPE_INTEL_IGD_OPREGION (1)
> >  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG (2)
> >  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG  (3)
> >  
> > -#define VFIO_REGION_TYPE_GFX               (1)
> > +/* GFX sub-types */
> >  #define VFIO_REGION_SUBTYPE_GFX_EDID       (1)
> >  
> >  /**
> > @@ -353,20 +361,17 @@ struct vfio_region_gfx_edid {
> >  #define VFIO_DEVICE_GFX_LINK_STATE_DOWN  2
> >  };
> >  
> > -#define VFIO_REGION_TYPE_CCW   (2)
> >  /* ccw sub-types */
> >  #define VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD  (1)
> >  
> > +/* 10de vendor PCI sub-types */
> >  /*
> > - * 10de vendor sub-type
> > - *
> >   * NVIDIA GPU NVlink2 RAM is coherent RAM mapped onto the host address space.
> >   */
> >  #define VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2_RAM (1)
> >  
> > +/* 1014 vendor PCI sub-types*/
> >  /*
> > - * 1014 vendor sub-type  
> Maybe the 10de vendor sub-type and 1014 vendor sub-type could be put
> just after /* 8086 vendor PCI sub-types */
> 
> More generally if it were possible to leave the subtypes close to their
> parent type too, this would be beneficial I think.
> 
> Besides that, it seems sensible to put all those definitions together.

Any sort of consolidation or grouping is an improvement here, thanks
for taking this on, Connie!  I haven't started my branch yet for v5.4,
but if you want to iterate to something agreeable, I'll happily take
the end product :)  The original patch here looks like a good degree of
consolidating the type definitions and improving consistency without
moving large chunks of code.  Thanks,

Alex
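
For readers unfamiliar with how the type values in the quoted header fit
together, a standalone sketch follows.  The EX_* macros mirror the values
in the patch (with the vendor mask taken to be 0xffff, as in the upstream
header); the ex_* helper functions are hypothetical and exist only to show
the encoding: bit 31 marks a PCI vendor type and the low 16 bits carry the
PCI vendor ID, while small integers such as 1 (GFX) and 2 (CCW) are
non-vendor types.

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of the values from the quoted header. */
#define EX_REGION_TYPE_PCI_VENDOR_TYPE	(UINT32_C(1) << 31)
#define EX_REGION_TYPE_PCI_VENDOR_MASK	(UINT32_C(0xffff))
#define EX_REGION_TYPE_GFX		(UINT32_C(1))
#define EX_REGION_TYPE_CCW		(UINT32_C(2))

/* Build the region type for a PCI vendor specific region. */
uint32_t ex_pci_vendor_region_type(uint16_t vendor)
{
	return EX_REGION_TYPE_PCI_VENDOR_TYPE | vendor;
}

/* Does this type carry a PCI vendor ID? */
int ex_is_pci_vendor_type(uint32_t type)
{
	return (type & EX_REGION_TYPE_PCI_VENDOR_TYPE) != 0;
}

/* Extract the PCI vendor ID; only valid for vendor types. */
uint16_t ex_region_vendor(uint32_t type)
{
	assert(ex_is_pci_vendor_type(type));
	return (uint16_t)(type & EX_REGION_TYPE_PCI_VENDOR_MASK);
}
```

The sub-type numbers (for example 1 for the Intel IGD OpRegion and 1 for
NVLink2 RAM) are only meaningful together with the type, which is why
grouping each sub-type list next to its parent type helps readability.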


Re: [RFC PATCH v4 2/6] vfio: Introduce vGPU display irq type

2019-07-22 Thread Alex Williamson
On Tue, 23 Jul 2019 01:08:19 +
"Zhang, Tina"  wrote:

> > -Original Message-
> > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > Sent: Tuesday, July 23, 2019 3:41 AM
> > To: Lu, Kechen 
> > Cc: intel-gvt-...@lists.freedesktop.org; k...@vger.kernel.org; linux-
> > ker...@vger.kernel.org; Zhang, Tina ;
> > kra...@redhat.com; zhen...@linux.intel.com; Lv, Zhiyuan
> > ; Wang, Zhi A ; Tian, Kevin
> > ; Yuan, Hang 
> > Subject: Re: [RFC PATCH v4 2/6] vfio: Introduce vGPU display irq type
> > 
> > On Mon, 22 Jul 2019 05:28:35 +
> > "Lu, Kechen"  wrote:
> >   
> > > Hi,
> > >  
> > > > -Original Message-
> > > > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > > > Sent: Saturday, July 20, 2019 12:25 AM
> > > > To: Lu, Kechen 
> > > > Cc: intel-gvt-...@lists.freedesktop.org; k...@vger.kernel.org; linux-
> > > > ker...@vger.kernel.org; Zhang, Tina ;
> > > > kra...@redhat.com; zhen...@linux.intel.com; Lv, Zhiyuan
> > > > ; Wang, Zhi A ; Tian,
> > > > Kevin ; Yuan, Hang 
> > > > Subject: Re: [RFC PATCH v4 2/6] vfio: Introduce vGPU display irq
> > > > type
> > > >
> > > > On Thu, 18 Jul 2019 23:56:36 +0800
> > > > Kechen Lu  wrote:
> > > >  
> > > > > From: Tina Zhang 
> > > > >
> > > > > Introduce vGPU specific irq type VFIO_IRQ_TYPE_GFX, and
> > > > > VFIO_IRQ_SUBTYPE_GFX_DISPLAY_IRQ as the subtype for vGPU display
> > > > >
> > > > > Signed-off-by: Tina Zhang 
> > > > > ---
> > > > >  include/uapi/linux/vfio.h | 3 +++
> > > > >  1 file changed, 3 insertions(+)
> > > > >
> > > > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > > > index be6adab4f759..df28b17a6e2e 100644
> > > > > --- a/include/uapi/linux/vfio.h
> > > > > +++ b/include/uapi/linux/vfio.h
> > > > > @@ -469,6 +469,9 @@ struct vfio_irq_info_cap_type {
> > > > > >   __u32 subtype;  /* type specific */
> > > > > >  };
> > > > > >
> > > > > > +#define VFIO_IRQ_TYPE_GFX                (1)
> > > > > > +#define VFIO_IRQ_SUBTYPE_GFX_DISPLAY_IRQ (1)
> > > > > +  
> > > >
> > > > > Please include a description defining exactly what this IRQ is
> > > > > intended to signal.
> > > > For instance, if another vGPU vendor wanted to implement this in
> > > > their driver and didn't have the QEMU code for reference to what it
> > > > does with the IRQ, what would they need to know?  Thanks,
> > > >
> > > > Alex
> > > >  
> > >
> > > > Yes, that makes more sense. I'll add the description for it at next
> > > > version patch.
> > >
> > > > BTW, may I have one more question? In the current design ideas, we
> > > > partitioned the vGPU display eventfd counted 8-byte value into at most
> > > > 8 events to deliver multiple display events, so we need different
> > > > increment counter value to differentiate the events. As this is the
> > > > exposed thing the QEMU has to know, we plan to add a macro here
> > > > VFIO_IRQ_SUBTYPE_GFX_DISPLAY_EVENTFD_BASE_SHIFT to make sure the
> > > > partitions shift in 1 byte, does it make sense putting here? Looking
> > > > forward to your and Gerd's comments. Thanks!
> > 
> > > Couldn't you expose this as another capability within the IRQ_INFO return
> > > data?  If you were to define it as a macro, I assume that means it would be
> > > hard coded, in which case this probably becomes an Intel specific IRQ,
> > > rather than what appears to be framed as a generic graphics IRQ extension.
> > > A new capability could instead allow the vendor to specify their own value,
> > > where we could define how userspace should interpret and make use of this
> > > value.  Thanks,
> Good suggestion. Currently, vfio_irq_info is used to save one irq
> info. What we need here is to use it to save several events info.
> Maybe we could figure out a general layout of this capability so that
> it can be leveraged by others, not only for display irq/events.

You could also expose a device specific IRQ with count > 1 (ie. similar
to MSI/X) and avoid munging the eventfd value, which is not something
we do elsewhere, at least in vfio.  Thanks,

Alex


Re: [RFC PATCH v4 2/6] vfio: Introduce vGPU display irq type

2019-07-22 Thread Alex Williamson
On Mon, 22 Jul 2019 05:28:35 +
"Lu, Kechen"  wrote:

> Hi, 
> 
> > -Original Message-
> > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > Sent: Saturday, July 20, 2019 12:25 AM
> > To: Lu, Kechen 
> > Cc: intel-gvt-...@lists.freedesktop.org; k...@vger.kernel.org; linux-
> > ker...@vger.kernel.org; Zhang, Tina ;
> > kra...@redhat.com; zhen...@linux.intel.com; Lv, Zhiyuan
> > ; Wang, Zhi A ; Tian, Kevin
> > ; Yuan, Hang 
> > Subject: Re: [RFC PATCH v4 2/6] vfio: Introduce vGPU display irq type
> > 
> > On Thu, 18 Jul 2019 23:56:36 +0800
> > Kechen Lu  wrote:
> >   
> > > From: Tina Zhang 
> > >
> > > Introduce vGPU specific irq type VFIO_IRQ_TYPE_GFX, and
> > > VFIO_IRQ_SUBTYPE_GFX_DISPLAY_IRQ as the subtype for vGPU display
> > >
> > > Signed-off-by: Tina Zhang 
> > > ---
> > >  include/uapi/linux/vfio.h | 3 +++
> > >  1 file changed, 3 insertions(+)
> > >
> > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > index be6adab4f759..df28b17a6e2e 100644
> > > --- a/include/uapi/linux/vfio.h
> > > +++ b/include/uapi/linux/vfio.h
> > > @@ -469,6 +469,9 @@ struct vfio_irq_info_cap_type {
> > >   __u32 subtype;  /* type specific */
> > >  };
> > >
> > > +#define VFIO_IRQ_TYPE_GFX                (1)
> > > +#define VFIO_IRQ_SUBTYPE_GFX_DISPLAY_IRQ (1)
> > > +  
> > 
> > Please include a description defining exactly what this IRQ is intended to
> > signal.  For instance, if another vGPU vendor wanted to implement this in
> > their driver and didn't have the QEMU code for reference to what it does
> > with the IRQ, what would they need to know?  Thanks,
> > 
> > Alex
> >   
> 
> Yes, that makes more sense. I'll add the description for it at next version
> patch.
> 
> BTW, may I have one more question? In the current design ideas, we
> partitioned the vGPU display eventfd counted 8-byte value into at most 8
> events to deliver multiple display events, so we need different increment
> counter value to differentiate the events. As this is the exposed thing the
> QEMU has to know, we plan to add a macro here
> VFIO_IRQ_SUBTYPE_GFX_DISPLAY_EVENTFD_BASE_SHIFT to make sure the partitions
> shift in 1 byte, does it make sense putting here? Looking forward to your
> and Gerd's comments. Thanks!

Couldn't you expose this as another capability within the IRQ_INFO
return data?  If you were to define it as a macro, I assume that means
it would be hard coded, in which case this probably becomes an Intel
specific IRQ, rather than what appears to be framed as a generic
graphics IRQ extension.  A new capability could instead allow the
vendor to specify their own value, where we could define how userspace
should interpret and make use of this value.  Thanks,

Alex
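
To illustrate what the question above is describing, here is a sketch of
the partitioning as I read it: the 8-byte eventfd counter is split into up
to 8 one-byte fields, and each event type increments its own field.  The
constant names, the shift of 8 bits per event and the ex_* helpers are
assumptions made for this example, not the proposed uapi.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: one byte of the 64-bit counter per event type. */
#define EX_EVENTFD_BASE_SHIFT	8
#define EX_EVENTFD_MAX_EVENTS	8

/* Value the driver would add to the counter to signal one event. */
uint64_t ex_event_increment(unsigned int event)
{
	assert(event < EX_EVENTFD_MAX_EVENTS);
	return UINT64_C(1) << (event * EX_EVENTFD_BASE_SHIFT);
}

/* How many times the given event fired, per the value userspace read. */
unsigned int ex_event_count(uint64_t value, unsigned int event)
{
	assert(event < EX_EVENTFD_MAX_EVENTS);
	return (unsigned int)((value >> (event * EX_EVENTFD_BASE_SHIFT)) & 0xff);
}
```

This also shows why the shift has to be known to userspace: a field that
reaches 256 unread events carries into its neighbour, and a consumer that
assumes a different shift decodes the wrong events.  Exposing the value
through a capability, as suggested, would let each vendor report its own
layout instead of hard coding it.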


Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver

2019-07-19 Thread Alex Williamson
On Fri, 12 Jul 2019 12:55:27 +
"Liu, Yi L"  wrote:

> Hi Alex,
> 
> > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > Sent: Friday, July 12, 2019 3:08 AM
> > To: Liu, Yi L 
> > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > 
> > On Thu, 11 Jul 2019 12:27:26 +
> > "Liu, Yi L"  wrote:
> >   
> > > Hi Alex,
> > >  
> > > > > From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On
> > > > > Behalf Of Alex Williamson
> > > > Sent: Friday, July 5, 2019 11:55 PM
> > > > To: Liu, Yi L 
> > > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > > >
> > > > On Thu, 4 Jul 2019 09:11:02 +
> > > > "Liu, Yi L"  wrote:
> > > >  
> > > > > Hi Alex,
> > > > >  
> > > > > > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > > > > > Sent: Thursday, July 4, 2019 1:22 AM
> > > > > > To: Liu, Yi L 
> > > > > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver  
> > > [...]  
> > > > >  
> > > > > > > It's really unfortunate that we don't have the mdev inheriting the
> > > > > > > iommu group of the iommu_device so that userspace can really
> > > > > > > understand this relationship.  A separate group makes sense for the
> > > > > > > aux-domain case, and is (I guess) not a significant issue in the
> > > > > > > case of a singleton iommu_device group, but it's pretty awkward
> > > > > > > here.  Perhaps this is something we should correct in design of
> > > > > > > iommu backed mdevs.
> > > > >
> > > > > > Yeah, for aux-domain case, it is not significant issue as aux-domain
> > > > > > essentially means singleton iommu_device group. And in early time,
> > > > > > when designing the support for wrap pci as a mdev, we also considered
> > > > > > to let vfio-mdev-pci to reuse iommu_device group. But this results in
> > > > > > an iommu backed group includes mdev and physical devices, which might
> > > > > > also be strange. Do you think it is valuable to reconsider it?
> > > >
> > > > From a group perspective, the cleanest solution would seem to be that
> > > > IOMMU backed mdevs w/o aux domain support should inherit the IOMMU
> > > > group of the iommu_device,  
> > >
> > > > A confirm here. Regards to inherit the IOMMU group of iommu_device, do
> > > > you mean mdev device should be added to the IOMMU group of iommu_device
> > > > or maintain a parent and inheritor relationship within vfio? I guess you
> > > > mean the latter one? :-)
> > 
> > I was thinking the former, I'm not sure what the latter implies.  There
> > is no hierarchy within or between IOMMU groups, it's simply a set of
> > devices.  
> 
> I have a concern on adding the mdev device to the iommu_group of
> iommu_device. In such configuration, a iommu backed group includes
> mdev devices and physical devices. Then it might be necessary to advertise
> the mdev info to the in-kernel software which want to loop all devices within
> such an iommu_group. An example I can see is the virtual SVA threads in
> community. e.g. for a guest pasid bind, the changes below loops all the
> devices within an iommu_group, and each loop will call into vendor iommu
> driver with a device structure passed in. It is quite possible that vendor
> iommu driver need to get something behind a physical device (e.g.
> intel_iommu structure). For a physical device, it is fine. While for mdev
> device, it would be a problem if no mdev info advertised to iommu driver. :-(
> Although we have agreement that PASID support should be disabled for
> devices which are from non-singleton group. But I don't feel like to rely on
> such assumptions when designing software flows. Also, it's just an example,
> we have no idea if there will be more similar flows which require to loop all
> devices in an iommu group in future. May be we want to avoid adding a mdev
> to an iommu backed group. :-) More replies to you response below.
> 
> +static long vfio_iommu_type1_bind_gpasid(struct vfio_iommu *iommu,
> + void __use

Re: [RFC PATCH v4 2/6] vfio: Introduce vGPU display irq type

2019-07-19 Thread Alex Williamson
On Thu, 18 Jul 2019 23:56:36 +0800
Kechen Lu  wrote:

> From: Tina Zhang 
> 
> Introduce vGPU specific irq type VFIO_IRQ_TYPE_GFX, and
> VFIO_IRQ_SUBTYPE_GFX_DISPLAY_IRQ as the subtype for vGPU display
> 
> Signed-off-by: Tina Zhang 
> ---
>  include/uapi/linux/vfio.h | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index be6adab4f759..df28b17a6e2e 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -469,6 +469,9 @@ struct vfio_irq_info_cap_type {
>   __u32 subtype;  /* type specific */
>  };
>  
> +#define VFIO_IRQ_TYPE_GFX                (1)
> +#define VFIO_IRQ_SUBTYPE_GFX_DISPLAY_IRQ (1)
> +

Please include a description defining exactly what this IRQ is intended
to signal.  For instance, if another vGPU vendor wanted to implement
this in their driver and didn't have the QEMU code for reference to
what it does with the IRQ, what would they need to know?  Thanks,

Alex 

>  /**
>   * VFIO_DEVICE_SET_IRQS - _IOW(VFIO_TYPE, VFIO_BASE + 10, struct vfio_irq_set)
>   *



[GIT PULL] VFIO updates for v5.3-rc1

2019-07-16 Thread Alex Williamson
Hi Linus,

The following changes since commit 6fbc7275c7a9ba97877050335f290341a1fd8dbf:

  Linux 5.2-rc7 (2019-06-30 11:25:36 +0800)

are available in the Git repository at:

  git://github.com/awilliam/linux-vfio.git tags/vfio-v5.3-rc1

for you to fetch changes up to 1e4d09d2212d9e230b967f57bc8df463527dbd75:

  mdev: Send uevents around parent device registration (2019-07-11 13:26:52 -0600)


VFIO updates for v5.3-rc1

 - Static symbol cleanup in mdev samples (Kefeng Wang)

 - Use vma help in nvlink code (Peng Hao)

 - Remove unused code in mbochs sample (YueHaibing)

 - Send uevents around mdev registration (Alex Williamson)


Alex Williamson (1):
  mdev: Send uevents around parent device registration

Kefeng Wang (1):
  vfio-mdev/samples: make some symbols static

Peng Hao (1):
  vfio: vfio_pci_nvlink2: use a vma helper function

YueHaibing (1):
  sample/mdev/mbochs: remove set but not used variable 'mdev_state'

 drivers/vfio/mdev/mdev_core.c   |  9 +++
 drivers/vfio/pci/vfio_pci_nvlink2.c |  3 +--
 samples/vfio-mdev/mbochs.c  |  3 ---
 samples/vfio-mdev/mtty.c| 47 +++--
 4 files changed, 34 insertions(+), 28 deletions(-)


Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver

2019-07-11 Thread Alex Williamson
On Thu, 11 Jul 2019 12:27:26 +
"Liu, Yi L"  wrote:

> Hi Alex,
> 
> > From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf
> > Of Alex Williamson
> > Sent: Friday, July 5, 2019 11:55 PM
> > To: Liu, Yi L 
> > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > 
> > On Thu, 4 Jul 2019 09:11:02 +
> > "Liu, Yi L"  wrote:
> >   
> > > Hi Alex,
> > >  
> > > > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > > > Sent: Thursday, July 4, 2019 1:22 AM
> > > > To: Liu, Yi L 
> > > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver  
> [...]
> > >  
> > > > It's really unfortunate that we don't have the mdev inheriting the
> > > > iommu group of the iommu_device so that userspace can really understand
> > > > this relationship.  A separate group makes sense for the aux-domain
> > > > case, and is (I guess) not a significant issue in the case of a
> > > > singleton iommu_device group, but it's pretty awkward here.  Perhaps
> > > > this is something we should correct in design of iommu backed mdevs.  
> > >
> > > Yeah, for aux-domain case, it is not significant issue as aux-domain
> > > essentially means singleton iommu_device group. And in early time, when
> > > designing the support for wrap pci as a mdev, we also considered to let
> > > vfio-mdev-pci to reuse iommu_device group. But this results in an iommu
> > > backed group includes mdev and physical devices, which might also be
> > > strange. Do you think it is valuable to reconsider it?
> > 
> > From a group perspective, the cleanest solution would seem to be that
> > IOMMU backed mdevs w/o aux domain support should inherit the IOMMU
> > group of the iommu_device,  
> 
> To confirm: regarding inheriting the IOMMU group of the iommu_device, do
> you mean the mdev device should be added to the IOMMU group of the
> iommu_device, or that a parent and inheritor relationship should be
> maintained within vfio? I guess you mean the latter one? :-)

I was thinking the former, I'm not sure what the latter implies.  There
is no hierarchy within or between IOMMU groups, it's simply a set of
devices.  Maybe what you're getting at is that vfio needs to understand
that the mdev is a child of the endpoint device in its determination of
whether the group is viable.  That's true, but we can also have IOMMU
groups composed of SR-IOV VFs along with their parent PF if the root of
the IOMMU group is (for example) a downstream switch port above the PF.
So we can't simply look at the parent/child relationship within the
group, we somehow need to know that the parent device sharing the IOMMU
group is operating in host kernel space on behalf of the mdev.
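
The viability rule described above can be sketched in plain C. This is only an illustration under stated assumptions, not vfio code: the enum, the struct and sketch_group_viable() are hypothetical names, and SKETCH_HOST_FOR_MDEV stands for the not-yet-existing knowledge that a host driver is operating on behalf of an mdev.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical binding states for a device in an IOMMU group. */
enum sketch_binding {
	SKETCH_UNBOUND,		/* no driver bound */
	SKETCH_VFIO,		/* bound to a vfio bus driver */
	SKETCH_HOST,		/* bound to a native host driver */
	SKETCH_HOST_FOR_MDEV,	/* host driver acting on behalf of an mdev */
};

struct sketch_dev {
	enum sketch_binding binding;
};

/*
 * Treat a group as viable only if no member is driven by a plain host
 * driver.  A parent/child relationship alone is not consulted, since a
 * PF and its VFs can share a group without the PF serving an mdev.
 */
static bool sketch_group_viable(const struct sketch_dev *devs, size_t n)
{
	assert(devs != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (devs[i].binding == SKETCH_HOST)
			return false;
	}
	return true;
}
```

With this model, a group holding the mdev parent plus a vfio-bound device is viable, while the same group with one ordinary host-driven device is not.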
 
> > but I think the barrier here is that we have
> > a difficult time determining if the group is "viable" in that case.
> > For example a group where one device is bound to a native host driver
> > and the other device bound to a vfio driver would typically be
> > considered non-viable as it breaks the isolation guarantees.  However  
> 
> Yes, this is how vfio guarantees the isolation before allowing a user to
> add a group to a vfio container and so on.
> 
> > I think in this configuration, the parent device is effectively
> > participating in the isolation and "donating" its iommu group on behalf
> > of the mdev device.  I don't think we can simultaneously use that iommu
> > group for any other purpose.   
> 
> Agree. At least host cannot make use of the iommu group any more in such
> configuration.
> 
> > I'm sure we could come up with a way for
> > vfio-core to understand this relationship and add it to the white list,  
> 
> In this configuration the host driver still exists, while we want to let the
> mdev device somehow "own" the iommu backed DMA isolation capability. So one
> possible way may be calling vfio_add_group_dev(), which creates a vfio_device
> instance for the iommu_device in vfio.c, when creating an iommu backed mdev.
> Then the iommu group is fairly viable.

"fairly viable" ;)  It's a correct use of the term, it's a little funny
though, as "fairly" can mean reasonably/sufficiently/adequately as well
as what I think is the intended sense here, equivalent to justly.

That's an interesting idea to do an implicit vfio_add_group_dev() on
the iommu_device in this case, if you've worked through how that could
play out, it'd be interesting to see.

> > I wonder though how confusing this might be to users

[PATCH v3] mdev: Send uevents around parent device registration

2019-07-10 Thread Alex Williamson
This allows udev to trigger rules when a parent device is registered
or unregistered from mdev.

Reviewed-by: Cornelia Huck 
Signed-off-by: Alex Williamson 
---

v3: Add Connie's R-b
Add comment clarifying expected device requirements for unreg

 drivers/vfio/mdev/mdev_core.c |9 +
 1 file changed, 9 insertions(+)

diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
index ae23151442cb..23976db6c6c7 100644
--- a/drivers/vfio/mdev/mdev_core.c
+++ b/drivers/vfio/mdev/mdev_core.c
@@ -146,6 +146,8 @@ int mdev_register_device(struct device *dev, const struct mdev_parent_ops *ops)
 {
 	int ret;
 	struct mdev_parent *parent;
+	char *env_string = "MDEV_STATE=registered";
+	char *envp[] = { env_string, NULL };
 
 	/* check for mandatory ops */
 	if (!ops || !ops->create || !ops->remove || !ops->supported_type_groups)
@@ -197,6 +199,8 @@ int mdev_register_device(struct device *dev, const struct mdev_parent_ops *ops)
 	mutex_unlock(&parent_list_lock);
 
 	dev_info(dev, "MDEV: Registered\n");
+	kobject_uevent_env(&dev->kobj, KOBJ_CHANGE, envp);
+
 	return 0;
 
 add_dev_err:
@@ -220,6 +224,8 @@ EXPORT_SYMBOL(mdev_register_device);
 void mdev_unregister_device(struct device *dev)
 {
 	struct mdev_parent *parent;
+	char *env_string = "MDEV_STATE=unregistered";
+	char *envp[] = { env_string, NULL };
 
 	mutex_lock(&parent_list_lock);
 	parent = __find_parent_device(dev);
@@ -243,6 +249,9 @@ void mdev_unregister_device(struct device *dev)
 	up_write(&parent->unreg_sem);
 
 	mdev_put_parent(parent);
+
+	/* We still have the caller's reference to use for the uevent */
+	kobject_uevent_env(&dev->kobj, KOBJ_CHANGE, envp);
 }
 EXPORT_SYMBOL(mdev_unregister_device);
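
For a userspace consumer, the KOBJ_CHANGE events above arrive as a payload of NUL-separated "KEY=VALUE" strings. The following is a minimal sketch of how a listener might extract MDEV_STATE from such a payload; sketch_uevent_get() is a hypothetical helper, not part of udev or the kernel, and the code that receives the payload is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Look up key in a buffer of NUL-separated "KEY=VALUE" strings.
 * Returns a pointer to the value inside buf, or NULL if the key is
 * absent.  An unterminated trailing entry is ignored.
 */
static const char *sketch_uevent_get(const char *buf, size_t len,
				     const char *key)
{
	size_t klen = strlen(key);
	size_t pos = 0;

	assert(buf != NULL);

	while (pos < len) {
		const char *entry = buf + pos;
		const char *end = memchr(entry, '\0', len - pos);
		size_t elen;

		if (!end)
			break;
		elen = (size_t)(end - entry);

		/* Match "key=" exactly, so "MDEV" does not match "MDEV_STATE". */
		if (elen > klen && entry[klen] == '=' &&
		    strncmp(entry, key, klen) == 0)
			return entry + klen + 1;

		pos += elen + 1;
	}
	return NULL;
}
```

A caller would compare the returned value against "registered" or "unregistered", the two strings set by the patch.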
 



Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver

2019-07-05 Thread Alex Williamson
On Thu, 4 Jul 2019 09:11:02 +
"Liu, Yi L"  wrote:

> Hi Alex,
> 
> > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > Sent: Thursday, July 4, 2019 1:22 AM
> > To: Liu, Yi L 
> > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > 
> > On Wed, 3 Jul 2019 08:25:25 +
> > "Liu, Yi L"  wrote:
> >   
> > > Hi Alex,
> > >
> > > Thanks for the comments. Have four inline responses below. And one
> > > of them need your further help. :-)  
> 
> [...]
> 
> > > > > >  
> > > > > > > > > > used iommu_attach_device() rather than iommu_attach_group()
> > > > > > > > > > for non-aux mdev iommu_device.  Is there a requirement that
> > > > > > > > > > the mdev parent device is in a singleton iommu group?  
> > > > > > > > >
> > > > > > > > > > I don't think there should be such a limitation. Per my
> > > > > > > > > > understanding, vfio-mdev-pci should also be able to bind to
> > > > > > > > > > devices which share an iommu group with other devices.
> > > > > > > > > > vfio-pci works well for such devices. And since the two
> > > > > > > > > > drivers share most of the code, I think vfio-mdev-pci should
> > > > > > > > > > naturally support it as well.  
> > > > > > > >
> > > > > > > > Yes, the difference though is that vfio.c knows when devices are
> > > > > > > > > in the same group, while for mdev vfio.c only knows about the
> > > > > > > > non-iommu backed group, not the group that is actually used for
> > > > > > > > the iommu backing.  So we either need to enlighten vfio.c or
> > > > > > > > further abstract those details in vfio_iommu_type1.c.  
> > > > > > >
> > > > > > > Not sure if it is necessary to introduce more changes to vfio.c or
> > > > > > > vfio_iommu_type1.c. If it's only for the scenario which two
> > > > > > > devices share an iommu_group, I guess it could be supported by
> > > > > > > using __iommu_attach_device() which has no device counting for the
> > > > > > > group. But maybe I missed something here. It would be great if you
> > > > > > > can elaborate a bit for it. :-)  
> > > > > >
> > > > > > We need to use the group semantics, there's a reason
> > > > > > __iommu_attach_device() is not exposed, it's an internal helper.  I
> > > > > > think there's no way around that we need to somewhere track the
> > > > > > actual group we're attaching to and have the smarts to re-use it for
> > > > > > other devices in the same group.  
> > > > >
> > > > > Hmmm, exposing __iommu_attach_device() is not good, let's forget it.
> > > > > :-)
> > > > >  
> > > > > > > > > > If this is a simplification, then vfio-mdev-pci should not
> > > > > > > > > > bind to devices where this is violated since there's no way
> > > > > > > > > > to use the device.  Can we support it though?  
> > > > > > > > >
> > > > > > > > > yeah, I think we need to support it.  
> > >
> > > I've already made vfio-mdev-pci driver work for non-singleton iommu
> > > group. e.g. for devices in a single iommu group, I can bind the devices
> > > to either vfio-pci or vfio-mdev-pci and then passthru them to a VM. And
> > > it will fail if user tries to passthru a vfio-mdev-pci device via vfio-pci
> > > manner "-device vfio-pci,host=01:00.1". In other words, vfio-mdev-pci
> > > device can only passthru via
> > > "-device vfio-pci,sysfsdev=/sys/bus/mdev/devices/UUID". This is what
> > > we expect.
> > >
> > > However, I encountered a problem when trying to prevent user from
> > > passthru these devices to different VMs. I've tried in my side, and I
> > > can passthru vfio-pci device and vfio-mdev-pci device to different
> > > VMs. But actually this operation should be failed. If all the devices
> > > are bound to vfio-pci, Qemu will open iommu backed group. So
> > > Qemu can check if a given group has already been used by an

Re: [RFC v2] vhost: introduce mdev based hardware vhost backend

2019-07-05 Thread Alex Williamson
On Thu, 4 Jul 2019 14:21:34 +0800
Tiwei Bie  wrote:

> On Thu, Jul 04, 2019 at 12:31:48PM +0800, Jason Wang wrote:
> > On 2019/7/3 下午9:08, Tiwei Bie wrote:  
> > > On Wed, Jul 03, 2019 at 08:16:23PM +0800, Jason Wang wrote:  
> > > > On 2019/7/3 下午7:52, Tiwei Bie wrote:  
> > > > > On Wed, Jul 03, 2019 at 06:09:51PM +0800, Jason Wang wrote:  
> > > > > > On 2019/7/3 下午5:13, Tiwei Bie wrote:  
> > > > > > > Details about this can be found here:
> > > > > > > 
> > > > > > > https://lwn.net/Articles/750770/
> > > > > > > 
> > > > > > > What's new in this version
> > > > > > > ==
> > > > > > > 
> > > > > > > A new VFIO device type is introduced - vfio-vhost. This addressed
> > > > > > > > some comments from here: https://patchwork.ozlabs.org/cover/984763/
> > > > > > > 
> > > > > > > Below is the updated device interface:
> > > > > > > 
> > > > > > > Currently, there are two regions of this device: 1) CONFIG_REGION
> > > > > > > (VFIO_VHOST_CONFIG_REGION_INDEX), which can be used to setup the
> > > > > > > device; 2) NOTIFY_REGION (VFIO_VHOST_NOTIFY_REGION_INDEX), which
> > > > > > > can be used to notify the device.
> > > > > > > 
> > > > > > > 1. CONFIG_REGION
> > > > > > > 
> > > > > > > > The region described by CONFIG_REGION is the main control
> > > > > > > > interface. Messages will be written to or read from this region.
> > > > > > > 
> > > > > > > The message type is determined by the `request` field in message
> > > > > > > header. The message size is encoded in the message header too.
> > > > > > > The message format looks like this:
> > > > > > > 
> > > > > > > struct vhost_vfio_op {
> > > > > > >   __u64 request;
> > > > > > >   __u32 flags;
> > > > > > >   /* Flag values: */
> > > > > > > #define VHOST_VFIO_NEED_REPLY 0x1 /* Whether need reply */
> > > > > > >   __u32 size;
> > > > > > >   union {
> > > > > > >   __u64 u64;
> > > > > > >   struct vhost_vring_state state;
> > > > > > >   struct vhost_vring_addr addr;
> > > > > > >   } payload;
> > > > > > > };
> > > > > > > 
> > > > > > > The existing vhost-kernel ioctl cmds are reused as the message
> > > > > > > requests in above structure.  
> > > > > > > Still the same comment as for V1: what's the advantage of inventing
> > > > > > > a new protocol?  
> > > > > I'm trying to make it work in VFIO's way..
> > > > >   
> > > > > > I believe either of the following should be better:
> > > > > > 
> > > > > > > - using vhost ioctl, we can start from SET_VRING_KICK/SET_VRING_CALL
> > > > > > >   and extend it with e.g. a notify region. The advantage is that all
> > > > > > >   existing userspace programs could be reused without modification
> > > > > > >   (or with minimal modification). And the vhost API hides lots of
> > > > > > >   details that need not be understood by the application (e.g. in
> > > > > > >   the case of containers).  
> > > > > Do you mean reusing vhost's ioctl on VFIO device fd directly,
> > > > > or introducing another mdev driver (i.e. vhost_mdev instead of
> > > > > using the existing vfio_mdev) for mdev device?  
> > > > Can we simply add them into ioctl of mdev_parent_ops?  
> > > Right, either way, these ioctls have to be and just need to be
> > > added in the ioctl of the mdev_parent_ops. But another thing we
> > > also need to consider is that which file descriptor the userspace
> > > will do the ioctl() on. So I'm wondering do you mean let the
> > > userspace do the ioctl() on the VFIO device fd of the mdev
> > > device?
> > >   
> > 
> > Yes.  
> 
> Got it! I'm not sure what Alex's opinion on this is. If we all
> agree with this, I can do it in this way.
> 
> > Is there any other way btw?  
> 
> Just a quick thought.. Maybe totally a bad idea. I was thinking
> whether it would be odd to do non-VFIO's ioctls on VFIO's device
> fd. So I was wondering whether it's possible to allow binding
> another mdev driver (e.g. vhost_mdev) to the supported mdev
> devices. The new mdev driver, vhost_mdev, can provide similar
> ways to let userspace open the mdev device and do the vhost ioctls
> on it. To distinguish with the vfio_mdev compatible mdev devices,
> the device API of the new vhost_mdev compatible mdev devices
> might be e.g. "vhost-net" for net?
> 
> So in VFIO case, the device will be for passthru directly. And
> in VHOST case, the device can be used to accelerate the existing
> virtualized devices.
> 
> What do you think?

VFIO really can't prevent vendor specific ioctls on the device file
descriptor for mdevs, but a) we'd want to be sure the ioctl address
space can't collide with ioctls we'd use for vfio defined purposes and
b) maybe the VFIO user API isn't what you want in the first place if
you intend to mostly/entirely ignore the defined ioctl set and replace
them with your own.  In the case of the latter, you're also not getting
the advantages of the existing VFIO userspace code, so why expose a
VFIO device at all.
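
Point a) above, keeping vendor ioctls out of the space vfio uses, can be illustrated with the standard ioctl decoding macros. This is a sketch only: the SKETCH_* constants mirror VFIO_TYPE and VFIO_BASE from linux/vfio.h and VHOST_VIRTIO from linux/vhost.h, redefined locally so the example does not depend on those headers, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/ioctl.h>

/* Local mirrors of VFIO_TYPE / VFIO_BASE (linux/vfio.h). */
#define SKETCH_VFIO_TYPE	';'
#define SKETCH_VFIO_BASE	100
/* Local mirror of VHOST_VIRTIO (linux/vhost.h). */
#define SKETCH_VHOST_TYPE	0xAF

/* True if cmd decodes into the type/number range vfio ioctls use. */
static bool sketch_cmd_in_vfio_space(unsigned int cmd)
{
	return _IOC_TYPE(cmd) == SKETCH_VFIO_TYPE &&
	       _IOC_NR(cmd) >= SKETCH_VFIO_BASE;
}

/* Small usage demo: one vfio-style and one vhost-style command number. */
static void sketch_demo(void)
{
	/* Same encoding as VFIO_DEVICE_GET_INFO. */
	assert(sketch_cmd_in_vfio_space(_IO(SKETCH_VFIO_TYPE,
					    SKETCH_VFIO_BASE + 7)));
	/* Same encoding as VHOST_SET_OWNER. */
	assert(!sketch_cmd_in_vfio_space(_IO(SKETCH_VHOST_TYPE, 0x01)));
}
```

Since the existing vhost ioctls use a different type byte, they would not collide by number; that leaves point b), whether a VFIO device fd is the right place for them at all.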

The mdev interface does provide a general 

Re: [PATCH v7 2/6] vfio/type1: Check reserve region conflict and update iova list

2019-07-03 Thread Alex Williamson
On Wed, 26 Jun 2019 16:12:44 +0100
Shameer Kolothum  wrote:

> This retrieves the reserved regions associated with dev group and
> checks for conflicts with any existing dma mappings. Also update
> the iova list excluding the reserved regions.
> 
> Reserved regions with type IOMMU_RESV_DIRECT_RELAXABLE are
> excluded from above checks as they are considered as directly
> mapped regions which are known to be relaxable.
> 
> Signed-off-by: Shameer Kolothum 
> ---
>  drivers/vfio/vfio_iommu_type1.c | 96 +
>  1 file changed, 96 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 970d1ec06aed..b6bfdfa16c33 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -1559,6 +1641,7 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>  	phys_addr_t resv_msi_base;
>  	struct iommu_domain_geometry geo;
>  	LIST_HEAD(iova_copy);
> +	LIST_HEAD(group_resv_regions);
>  
>  	mutex_lock(&iommu->lock);
>  
> @@ -1644,6 +1727,13 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>  		goto out_detach;
>  	}
>  
> +	iommu_get_group_resv_regions(iommu_group, &group_resv_regions);

This can fail and should have an error case.  I assume we'd fail the
group attach on failure.  Thanks,

Alex
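
The error handling requested above might look roughly like the following userspace mock. It is a sketch under assumptions: sketch_get_resv_regions() is a hypothetical stand-in for a call that returns a negative errno on failure, and struct sketch_state only models the "attached" side effect that the out_detach path has to undo.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a reserved-region lookup that can fail. */
static int sketch_get_resv_regions(bool fail)
{
	return fail ? -ENOMEM : 0;
}

struct sketch_state {
	bool attached;
};

static int sketch_attach_group(struct sketch_state *st, bool resv_fails)
{
	int ret;

	assert(st != NULL);
	st->attached = true;		/* earlier attach step succeeded */

	ret = sketch_get_resv_regions(resv_fails);
	if (ret)
		goto out_detach;	/* fail the group attach */

	return 0;

out_detach:
	st->attached = false;		/* undo the earlier step */
	return ret;
}
```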


Re: [RFC v2] vhost: introduce mdev based hardware vhost backend

2019-07-03 Thread Alex Williamson
On Wed,  3 Jul 2019 17:13:39 +0800
Tiwei Bie  wrote:
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 8f10748dac79..6c5718ab7eeb 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -201,6 +201,7 @@ struct vfio_device_info {
>  #define VFIO_DEVICE_FLAGS_AMBA  (1 << 3) /* vfio-amba device */
>  #define VFIO_DEVICE_FLAGS_CCW(1 << 4)/* vfio-ccw device */
>  #define VFIO_DEVICE_FLAGS_AP (1 << 5)/* vfio-ap device */
> +#define VFIO_DEVICE_FLAGS_VHOST  (1 << 6)/* vfio-vhost device */
>   __u32   num_regions;/* Max region index + 1 */
>   __u32   num_irqs;   /* Max IRQ index + 1 */
>  };
> @@ -217,6 +218,7 @@ struct vfio_device_info {
>  #define VFIO_DEVICE_API_AMBA_STRING  "vfio-amba"
>  #define VFIO_DEVICE_API_CCW_STRING   "vfio-ccw"
>  #define VFIO_DEVICE_API_AP_STRING"vfio-ap"
> +#define VFIO_DEVICE_API_VHOST_STRING "vfio-vhost"
>  
>  /**
>   * VFIO_DEVICE_GET_REGION_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 8,
> @@ -573,6 +575,23 @@ enum {
>   VFIO_CCW_NUM_IRQS
>  };
>  
> +/*
> + * The vfio-vhost bus driver makes use of the following fixed region and
> + * IRQ index mapping. Unimplemented regions return a size of zero.
> + * Unimplemented IRQ types return a count of zero.
> + */
> +
> +enum {
> + VFIO_VHOST_CONFIG_REGION_INDEX,
> + VFIO_VHOST_NOTIFY_REGION_INDEX,
> + VFIO_VHOST_NUM_REGIONS
> +};
> +
> +enum {
> + VFIO_VHOST_VQ_IRQ_INDEX,
> + VFIO_VHOST_NUM_IRQS
> +};
> +

Note that the vfio API has evolved a bit since vfio-pci started this
way, with fixed indexes for pre-defined region types.  We now support
device specific regions which can be identified by a capability within
the REGION_INFO ioctl return data.  This allows a bit more flexibility,
at the cost of complexity, but the infrastructure already exists in
kernel and QEMU to make it relatively easy.  I think we'll have the
same support for interrupts soon too.  If you continue to pursue the
vfio-vhost direction you might want to consider these before committing
to fixed indexes.  Thanks,

Alex
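
To illustrate the capability-based alternative to fixed indexes: the REGION_INFO return data can carry a chain of capability headers, each with an id and the offset of the next entry. The sketch below walks such a chain. struct sketch_cap_header is a local mirror of struct vfio_info_cap_header from linux/vfio.h, sketch_find_cap() is a hypothetical helper, and the refusal to follow non-increasing offsets is an extra safety assumption of this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of struct vfio_info_cap_header (linux/vfio.h). */
struct sketch_cap_header {
	uint16_t id;		/* capability identifier */
	uint16_t version;	/* capability version */
	uint32_t next;		/* offset of next capability, 0 ends chain */
};

/*
 * Walk the chain starting at cap_offset.  Offsets are relative to the
 * start of buf.  Returns the offset of the first capability whose id
 * matches, or 0 if none is found.
 */
static uint32_t sketch_find_cap(const uint8_t *buf, size_t len,
				uint32_t cap_offset, uint16_t id)
{
	uint32_t off = cap_offset;

	assert(buf != NULL);

	while (off != 0 &&
	       (size_t)off + sizeof(struct sketch_cap_header) <= len) {
		struct sketch_cap_header hdr;

		memcpy(&hdr, buf + off, sizeof(hdr));
		if (hdr.id == id)
			return off;
		if (hdr.next <= off)	/* end of chain, or a bad link */
			break;
		off = hdr.next;
	}
	return 0;
}
```

A user would call this with the id of the region type capability and then read the type and subtype that follow the header, rather than relying on a fixed region index.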


Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver

2019-07-03 Thread Alex Williamson
On Wed, 3 Jul 2019 08:25:25 +
"Liu, Yi L"  wrote:

> Hi Alex,
> 
> Thanks for the comments. Have four inline responses below. And one
> of them need your further help. :-)
> .
> > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > Sent: Friday, June 28, 2019 11:08 PM
> > To: Liu, Yi L 
> > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > 
> > On Mon, 24 Jun 2019 08:20:38 +
> > "Liu, Yi L"  wrote:
> >   
> > > Hi Alex,
> > >  
> > > > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > > > Sent: Friday, June 21, 2019 11:58 PM
> > > > To: Liu, Yi L 
> > > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > > >
> > > > On Fri, 21 Jun 2019 10:23:10 +
> > > > "Liu, Yi L"  wrote:
> > > >  
> > > > > Hi Alex,
> > > > >  
> > > > > > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > > > > > Sent: Friday, June 21, 2019 5:08 AM
> > > > > > To: Liu, Yi L 
> > > > > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > > > > >
> > > > > > On Thu, 20 Jun 2019 13:00:34 + "Liu, Yi L"
> > > > > >  wrote:
> > > > > >  
> > > > > > > Hi Alex,
> > > > > > >  
> > > > > > > > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > > > > > > > Sent: Thursday, June 20, 2019 12:27 PM
> > > > > > > > To: Liu, Yi L 
> > > > > > > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci
> > > > > > > > driver
> > > > > > > >
> > > > > > > > On Sat,  8 Jun 2019 21:21:11 +0800 Liu Yi L
> > > > > > > >  wrote:
> > > > > > > >  
> > > > > > > > > This patch adds sample driver named vfio-mdev-pci. It is
> > > > > > > > > to wrap a PCI device as a mediated device. For a pci
> > > > > > > > > device, once bound to vfio-mdev-pci driver, user space
> > > > > > > > > access of this device will go through vfio mdev framework.
> > > > > > > > > The usage of the device follows mdev management method.
> > > > > > > > > > e.g. user should create a mdev before exposing the device
> > > > > > > > > > to user-space.  
> > > > > [...]  
> > > > > > >  
> > > > > > > > However, the patch below just makes the mdev interface
> > > > > > > > behave correctly, I can't make it work on my system because
> > > > > > > > commit 7bd50f0cd2fd ("vfio/type1: Add domain at(de)taching
> > > > > > > > group helpers")  
> > > > > > >
> > > > > > > What error did you encounter. I tested the patch with a device
> > > > > > > in a singleton iommu group. I'm also searching a proper
> > > > > > > machine with multiple devices in an iommu group and test it.  
> > > > > >
> > > > > > In vfio_iommu_type1, iommu backed mdev devices use the
> > > > > > iommu_attach_device() interface, which includes:
> > > > > >
> > > > > > if (iommu_group_device_count(group) != 1)
> > > > > > goto out_unlock;
> > > > > >
> > > > > > So it's impossible to use with non-singleton groups currently.  
> > > > >
> > > > > Hmmm, I think it is no longer good to use iommu_attach_device()
> > > > > for iommu backed mdev devices now. In this flow, the purpose here
> > > > > is to attach a device to a domain and no need to check whether the
> > > > > device is in a singleton iommu group. I think it would be better
> > > > > to use __iommu_attach_device() instead of iommu_attach_device().  
> > > >
> > > > > That's static and unexported; it's intentionally not an exposed
> > > > interface.  We can't attach devices in the same group to separate
> > > > domains allocated through iommu_domain_alloc(), this would violate
> > > > the iommu group isolation principles.  
> > >
> > > > Got it. :-) Then not good to expose such interface. Bu

Re: [PATCH v2] mdev: Send uevents around parent device registration

2019-07-02 Thread Alex Williamson
On Tue, 2 Jul 2019 23:34:30 +0530
Kirti Wankhede  wrote:

> On 7/2/2019 8:13 PM, Alex Williamson wrote:
> > On Tue, 2 Jul 2019 19:10:17 +0530
> > Kirti Wankhede  wrote:
> >   
> >> On 7/2/2019 6:38 PM, Alex Williamson wrote:  
> >>> On Tue, 2 Jul 2019 18:17:41 +0530
> >>> Kirti Wankhede  wrote:
> >>> 
> >>>> On 7/2/2019 12:43 PM, Parav Pandit wrote:
> >>>>>
> >>>>>   
> >>>>>> -Original Message-
> >>>>>> From: linux-kernel-ow...@vger.kernel.org On Behalf Of Alex Williamson
> >>>>>> Sent: Tuesday, July 2, 2019 11:12 AM
> >>>>>> To: Kirti Wankhede 
> >>>>>> Cc: coh...@redhat.com; k...@vger.kernel.org; linux-kernel@vger.kernel.org
> >>>>>> Subject: Re: [PATCH v2] mdev: Send uevents around parent device registration
> >>>>>>
> >>>>>> On Tue, 2 Jul 2019 10:25:04 +0530
> >>>>>> Kirti Wankhede  wrote:
> >>>>>>  
> >>>>>>> On 7/2/2019 1:34 AM, Alex Williamson wrote:  
> >>>>>>>> On Mon, 1 Jul 2019 23:20:35 +0530
> >>>>>>>> Kirti Wankhede  wrote:
> >>>>>>>>  
> >>>>>>>>> On 7/1/2019 10:54 PM, Alex Williamson wrote:  
> >>>>>>>>>> On Mon, 1 Jul 2019 22:43:10 +0530
> >>>>>>>>>> Kirti Wankhede  wrote:
> >>>>>>>>>>  
> >>>>>>>>>>> On 7/1/2019 8:24 PM, Alex Williamson wrote:  
> >>>>>>>>>>>> This allows udev to trigger rules when a parent device is
> >>>>>>>>>>>> registered or unregistered from mdev.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Signed-off-by: Alex Williamson 
> >>>>>>>>>>>> ---
> >>>>>>>>>>>>
> >>>>>>>>>>>> v2: Don't remove the dev_info(), Kirti requested they stay and
> >>>>>>>>>>>> removing them is only tangential to the goal of this change.
> >>>>>>>>>>>>  
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks.
> >>>>>>>>>>>
> >>>>>>>>>>>  
> >>>>>>>>>>>>  drivers/vfio/mdev/mdev_core.c |8 
> >>>>>>>>>>>>  1 file changed, 8 insertions(+)
> >>>>>>>>>>>>
> >>>>>>>>>>>> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> >>>>>>>>>>>> index ae23151442cb..7fb268136c62 100644
> >>>>>>>>>>>> --- a/drivers/vfio/mdev/mdev_core.c
> >>>>>>>>>>>> +++ b/drivers/vfio/mdev/mdev_core.c
> >>>>>>>>>>>> @@ -146,6 +146,8 @@ int mdev_register_device(struct device *dev, const struct mdev_parent_ops *ops)
> >>>>>>>>>>>>  {
> >>>>>>>>>>>>  	int ret;
> >>>>>>>>>>>>  	struct mdev_parent *parent;
> >>>>>>>>>>>> +	char *env_string = "MDEV_STATE=registered";
> >>>>>>>>>>>> +	char *envp[] = { env_string, NULL };
> >>>>>>>>>>>>
> >>>>>>>>>>>>  	/* check for mandatory ops */
> >>>>>>>>>>>>  	if (!ops || !ops->create || !ops->remove || !ops->supported_type_groups)
> >>>>>>>>>>>> @@ -197,6 +199,8 @@ int mdev_register_device(struct device *dev, const struct mdev_parent_ops *ops)
> >>>>>>>>>>>>  	mutex_unlock(&parent_list_lock);
> >>>>>>>>>>>>
> >>>>>>>>>>>>  	dev_info(dev, "MDEV: Registered\n");

Re: [PATCH -next] sample/mdev/mbochs: remove set but not used variable 'mdev_state'

2019-07-02 Thread Alex Williamson
On Sat, 25 May 2019 21:53:49 +0800
YueHaibing  wrote:

> Fixes gcc '-Wunused-but-set-variable' warning:
> 
> samples/vfio-mdev/mbochs.c: In function mbochs_ioctl:
> samples/vfio-mdev/mbochs.c:1188:21: warning: variable mdev_state set but not used [-Wunused-but-set-variable]
> 
> It's not used any more since commit 104c7405a64d ("vfio:
> add edid support to mbochs sample driver")
> 
> Signed-off-by: YueHaibing 
> ---
>  samples/vfio-mdev/mbochs.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/samples/vfio-mdev/mbochs.c b/samples/vfio-mdev/mbochs.c
> index b038aa9f5a70..ac5c8c17b1ff 100644
> --- a/samples/vfio-mdev/mbochs.c
> +++ b/samples/vfio-mdev/mbochs.c
> @@ -1185,9 +1185,6 @@ static long mbochs_ioctl(struct mdev_device *mdev, unsigned int cmd,
>  {
>   int ret = 0;
>   unsigned long minsz, outsz;
> - struct mdev_state *mdev_state;
> -
> - mdev_state = mdev_get_drvdata(mdev);
>  
>   switch (cmd) {
>   case VFIO_DEVICE_GET_INFO:

Applied to vfio next branch for 5.3.  Thanks!

Alex

