Re: Ping? Re: [PATCH rc] kvm: Prevent compiling virt/kvm/vfio.c unless VFIO is selected

2023-11-29 Thread Alex Williamson
On Wed, 29 Nov 2023 10:31:03 -0800
Sean Christopherson  wrote:

> On Wed, Nov 29, 2023, Jason Gunthorpe wrote:
> > On Tue, Nov 28, 2023 at 06:21:42PM -0800, Sean Christopherson wrote:  
> > > diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> > > index 454e9295970c..a65b2513f8cd 100644
> > > --- a/include/linux/vfio.h
> > > +++ b/include/linux/vfio.h
> > > @@ -289,16 +289,12 @@ void vfio_combine_iova_ranges(struct rb_root_cached *root, u32 cur_nodes,
> > >  /*
> > >   * External user API
> > >   */
> > > -#if IS_ENABLED(CONFIG_VFIO_GROUP)
> > >  struct iommu_group *vfio_file_iommu_group(struct file *file);
> > > +
> > > +#if IS_ENABLED(CONFIG_VFIO_GROUP)
> > >  bool vfio_file_is_group(struct file *file);
> > >  bool vfio_file_has_dev(struct file *file, struct vfio_device *device);
> > >  #else
> > > -static inline struct iommu_group *vfio_file_iommu_group(struct file *file)
> > > -{
> > > -   return NULL;
> > > -}
> > > -
> > >  static inline bool vfio_file_is_group(struct file *file)
> > >  {
> > > return false;
> > >   
> > 
> > So you symbol_get() a symbol that can never be defined? Still says to
> > me the Kconfig needs fixing :|  
> 
> Yeah, I completely agree, and if KVM didn't already rely on this horrific
> behavior and there wasn't a more complete overhaul in-flight, I wouldn't
> suggest this.
> 
> I'll send the KVM Kconfig/Makefile cleanups from my "Hide KVM internals
> from others" series separately (which is still the bulk of the series) so
> as to prioritize getting the cleanups landed.
> 

Seems we have agreement and confirmation of the fix above as an
interim, do you want to post it formally and I can pick it up for
v6.7-rc?  Thanks,

Alex



Re: [PATCH 02/26] vfio: Move KVM get/put helpers to colocate it with other KVM related code

2023-09-28 Thread Alex Williamson
On Fri, 15 Sep 2023 17:30:54 -0700
Sean Christopherson  wrote:

> Move the definitions of vfio_device_get_kvm_safe() and vfio_device_put_kvm()
> down in vfio_main.c to colocate them with other KVM-specific functions,
> e.g. to allow wrapping them all with a single CONFIG_KVM check.
> 
> Signed-off-by: Sean Christopherson 
> ---
>  drivers/vfio/vfio_main.c | 104 +++
>  1 file changed, 52 insertions(+), 52 deletions(-)


Reviewed-by: Alex Williamson 

 
> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> index 80e39f7a6d8f..6368eed7b7b2 100644
> --- a/drivers/vfio/vfio_main.c
> +++ b/drivers/vfio/vfio_main.c
> @@ -381,58 +381,6 @@ void vfio_unregister_group_dev(struct vfio_device *device)
>  }
>  EXPORT_SYMBOL_GPL(vfio_unregister_group_dev);
>  
> -#if IS_ENABLED(CONFIG_KVM)
> -void vfio_device_get_kvm_safe(struct vfio_device *device, struct kvm *kvm)
> -{
> - void (*pfn)(struct kvm *kvm);
> - bool (*fn)(struct kvm *kvm);
> - bool ret;
> -
> - lockdep_assert_held(&device->dev_set->lock);
> -
> - if (!kvm)
> - return;
> -
> - pfn = symbol_get(kvm_put_kvm);
> - if (WARN_ON(!pfn))
> - return;
> -
> - fn = symbol_get(kvm_get_kvm_safe);
> - if (WARN_ON(!fn)) {
> - symbol_put(kvm_put_kvm);
> - return;
> - }
> -
> - ret = fn(kvm);
> - symbol_put(kvm_get_kvm_safe);
> - if (!ret) {
> - symbol_put(kvm_put_kvm);
> - return;
> - }
> -
> - device->put_kvm = pfn;
> - device->kvm = kvm;
> -}
> -
> -void vfio_device_put_kvm(struct vfio_device *device)
> -{
> - lockdep_assert_held(&device->dev_set->lock);
> -
> - if (!device->kvm)
> - return;
> -
> - if (WARN_ON(!device->put_kvm))
> - goto clear;
> -
> - device->put_kvm(device->kvm);
> - device->put_kvm = NULL;
> - symbol_put(kvm_put_kvm);
> -
> -clear:
> - device->kvm = NULL;
> -}
> -#endif
> -
>  /* true if the vfio_device has open_device() called but not close_device() */
>  static bool vfio_assert_device_open(struct vfio_device *device)
>  {
> @@ -1354,6 +1302,58 @@ bool vfio_file_enforced_coherent(struct file *file)
>  }
>  EXPORT_SYMBOL_GPL(vfio_file_enforced_coherent);
>  
> +#if IS_ENABLED(CONFIG_KVM)
> +void vfio_device_get_kvm_safe(struct vfio_device *device, struct kvm *kvm)
> +{
> + void (*pfn)(struct kvm *kvm);
> + bool (*fn)(struct kvm *kvm);
> + bool ret;
> +
> + lockdep_assert_held(&device->dev_set->lock);
> +
> + if (!kvm)
> + return;
> +
> + pfn = symbol_get(kvm_put_kvm);
> + if (WARN_ON(!pfn))
> + return;
> +
> + fn = symbol_get(kvm_get_kvm_safe);
> + if (WARN_ON(!fn)) {
> + symbol_put(kvm_put_kvm);
> + return;
> + }
> +
> + ret = fn(kvm);
> + symbol_put(kvm_get_kvm_safe);
> + if (!ret) {
> + symbol_put(kvm_put_kvm);
> + return;
> + }
> +
> + device->put_kvm = pfn;
> + device->kvm = kvm;
> +}
> +
> +void vfio_device_put_kvm(struct vfio_device *device)
> +{
> + lockdep_assert_held(&device->dev_set->lock);
> +
> + if (!device->kvm)
> + return;
> +
> + if (WARN_ON(!device->put_kvm))
> + goto clear;
> +
> + device->put_kvm(device->kvm);
> + device->put_kvm = NULL;
> + symbol_put(kvm_put_kvm);
> +
> +clear:
> + device->kvm = NULL;
> +}
> +#endif
> +
>  static void vfio_device_file_set_kvm(struct file *file, struct kvm *kvm)
>  {
>   struct vfio_device_file *df = file->private_data;



Re: [PATCH 05/26] vfio: KVM: Pass get/put helpers from KVM to VFIO, don't do circular lookup

2023-09-28 Thread Alex Williamson
On Fri, 15 Sep 2023 17:30:57 -0700
Sean Christopherson  wrote:

> Explicitly pass KVM's get/put helpers to VFIO when attaching a VM to
> VFIO instead of having VFIO do a symbol lookup back into KVM.  Having both
> KVM and VFIO do symbol lookups increases the overall complexity and places
> an unnecessary dependency on KVM (from VFIO) without adding any value.
> 
> Signed-off-by: Sean Christopherson 
> ---
>  drivers/vfio/vfio.h  |  2 ++
>  drivers/vfio/vfio_main.c | 74 +++-
>  include/linux/vfio.h |  4 ++-
>  virt/kvm/vfio.c  |  9 +++--
>  4 files changed, 47 insertions(+), 42 deletions(-)


Reviewed-by: Alex Williamson 

 
> diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
> index a1f741365075..eec51c7ee822 100644
> --- a/drivers/vfio/vfio.h
> +++ b/drivers/vfio/vfio.h
> @@ -19,6 +19,8 @@ struct vfio_container;
>  
>  struct vfio_kvm_reference {
>   struct kvm  *kvm;
> + bool(*get_kvm)(struct kvm *kvm);
> + void(*put_kvm)(struct kvm *kvm);
>   spinlock_t  lock;
>  };
>  
> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> index e77e8c6aae2f..1f58ab6dbcd2 100644
> --- a/drivers/vfio/vfio_main.c
> +++ b/drivers/vfio/vfio_main.c
> @@ -16,7 +16,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
>  #include 
>  #include 
> @@ -1306,38 +1305,22 @@ EXPORT_SYMBOL_GPL(vfio_file_enforced_coherent);
>  void vfio_device_get_kvm_safe(struct vfio_device *device,
> struct vfio_kvm_reference *ref)
>  {
> - void (*pfn)(struct kvm *kvm);
> - bool (*fn)(struct kvm *kvm);
> - bool ret;
> -
>   lockdep_assert_held(&device->dev_set->lock);
>  
> + /*
> +  * Note!  The "kvm" and "put_kvm" pointers *must* be transferred to the
> +  * device so that the device can put its reference to KVM.  KVM can
> +  * invoke vfio_device_set_kvm() to detach from VFIO, i.e. nullify all
> +  * pointers in @ref, even if a device holds a reference to KVM!  That
> +  * also means that detaching KVM from VFIO only prevents "new" devices
> +  * from using KVM, it doesn't invalidate KVM references in existing
> +  * devices.
> +  */
>   spin_lock(&ref->lock);
> -
> - if (!ref->kvm)
> - goto out;
> -
> - pfn = symbol_get(kvm_put_kvm);
> - if (WARN_ON(!pfn))
> - goto out;
> -
> - fn = symbol_get(kvm_get_kvm_safe);
> - if (WARN_ON(!fn)) {
> - symbol_put(kvm_put_kvm);
> - goto out;
> + if (ref->kvm && ref->get_kvm(ref->kvm)) {
> + device->kvm = ref->kvm;
> + device->put_kvm = ref->put_kvm;
>   }
> -
> - ret = fn(ref->kvm);
> - symbol_put(kvm_get_kvm_safe);
> - if (!ret) {
> - symbol_put(kvm_put_kvm);
> - goto out;
> - }
> -
> - device->put_kvm = pfn;
> - device->kvm = ref->kvm;
> -
> -out:
>   spin_unlock(&ref->lock);
>  }
>  
> @@ -1353,28 +1336,37 @@ void vfio_device_put_kvm(struct vfio_device *device)
>  
>   device->put_kvm(device->kvm);
>   device->put_kvm = NULL;
> - symbol_put(kvm_put_kvm);
> -
>  clear:
>   device->kvm = NULL;
>  }
>  
>  static void vfio_device_set_kvm(struct vfio_kvm_reference *ref,
> - struct kvm *kvm)
> + struct kvm *kvm,
> + bool (*get_kvm)(struct kvm *kvm),
> + void (*put_kvm)(struct kvm *kvm))
>  {
> + if (WARN_ON_ONCE(kvm && (!get_kvm || !put_kvm)))
> + return;
> +
>   spin_lock(&ref->lock);
>   ref->kvm = kvm;
> + ref->get_kvm = get_kvm;
> + ref->put_kvm = put_kvm;
>   spin_unlock(&ref->lock);
>  }
>  
> -static void vfio_group_set_kvm(struct vfio_group *group, struct kvm *kvm)
> +static void vfio_group_set_kvm(struct vfio_group *group, struct kvm *kvm,
> +bool (*get_kvm)(struct kvm *kvm),
> +void (*put_kvm)(struct kvm *kvm))
>  {
>  #if IS_ENABLED(CONFIG_VFIO_GROUP)
> - vfio_device_set_kvm(&group->kvm_ref, kvm);
> + vfio_device_set_kvm(&group->kvm_ref, kvm, get_kvm, put_kvm);
>  #endif
>  }
>  
> -static void vfio_device_file_set_kvm(struct file *file, struct kvm *kvm)
> +static void vfio_device_file_set_kvm(struct file *file, struct kvm *kvm,
> +  bool (*get_

Re: [PATCH 03/26] virt: Declare and define vfio_file_set_kvm() iff CONFIG_KVM is enabled

2023-09-28 Thread Alex Williamson
On Fri, 15 Sep 2023 17:30:55 -0700
Sean Christopherson  wrote:

> Hide vfio_file_set_kvm() and its unique helpers if KVM is not enabled;
> nothing else in the kernel (or out of the kernel) should be using a
> KVM-specific helper.
> 
> Signed-off-by: Sean Christopherson 
> ---
>  drivers/vfio/vfio_main.c | 2 +-
>  include/linux/vfio.h | 2 ++
>  2 files changed, 3 insertions(+), 1 deletion(-)


As Jason noted, s/virt/vfio/ in title.

Reviewed-by: Alex Williamson 

 
> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> index 6368eed7b7b2..124cc88966a7 100644
> --- a/drivers/vfio/vfio_main.c
> +++ b/drivers/vfio/vfio_main.c
> @@ -1352,7 +1352,6 @@ void vfio_device_put_kvm(struct vfio_device *device)
>  clear:
>   device->kvm = NULL;
>  }
> -#endif
>  
>  static void vfio_device_file_set_kvm(struct file *file, struct kvm *kvm)
>  {
> @@ -1388,6 +1387,7 @@ void vfio_file_set_kvm(struct file *file, struct kvm *kvm)
>   vfio_device_file_set_kvm(file, kvm);
>  }
>  EXPORT_SYMBOL_GPL(vfio_file_set_kvm);
> +#endif
>  
>  /*
>   * Sub-module support
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 454e9295970c..e80955de266c 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -311,7 +311,9 @@ static inline bool vfio_file_has_dev(struct file *file, struct vfio_device *devi
>  #endif
>  bool vfio_file_is_valid(struct file *file);
>  bool vfio_file_enforced_coherent(struct file *file);
> +#if IS_ENABLED(CONFIG_KVM)
>  void vfio_file_set_kvm(struct file *file, struct kvm *kvm);
> +#endif
>  
>  #define VFIO_PIN_PAGES_MAX_ENTRIES   (PAGE_SIZE/sizeof(unsigned long))
>  



Re: [PATCH 01/26] vfio: Wrap KVM helpers with CONFIG_KVM instead of CONFIG_HAVE_KVM

2023-09-28 Thread Alex Williamson
On Fri, 15 Sep 2023 17:30:53 -0700
Sean Christopherson  wrote:

> Wrap the helpers for getting references to KVM instances with a check on
> CONFIG_KVM being enabled, not on CONFIG_HAVE_KVM being defined.  PPC does
> NOT select HAVE_KVM, despite obviously supporting KVM, and guarding code
> to get references to KVM based on whether or not the architecture supports
> KVM is nonsensical.
> 
> Drop the guard around linux/kvm_host.h entirely; conditionally including a
> generic header is completely unnecessary.
> 
> Signed-off-by: Sean Christopherson 
> ---
>  drivers/vfio/vfio.h  | 2 +-
>  drivers/vfio/vfio_main.c | 4 +---
>  2 files changed, 2 insertions(+), 4 deletions(-)


Reviewed-by: Alex Williamson 


> diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
> index 307e3f29b527..c26d1ad68105 100644
> --- a/drivers/vfio/vfio.h
> +++ b/drivers/vfio/vfio.h
> @@ -434,7 +434,7 @@ static inline void vfio_virqfd_exit(void)
>  }
>  #endif
>  
> -#ifdef CONFIG_HAVE_KVM
> +#if IS_ENABLED(CONFIG_KVM)
>  void vfio_device_get_kvm_safe(struct vfio_device *device, struct kvm *kvm);
>  void vfio_device_put_kvm(struct vfio_device *device);
>  #else
> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> index 40732e8ed4c6..80e39f7a6d8f 100644
> --- a/drivers/vfio/vfio_main.c
> +++ b/drivers/vfio/vfio_main.c
> @@ -16,9 +16,7 @@
>  #include 
>  #include 
>  #include 
> -#ifdef CONFIG_HAVE_KVM
>  #include <linux/kvm_host.h>
> -#endif
>  #include 
>  #include 
>  #include 
> @@ -383,7 +381,7 @@ void vfio_unregister_group_dev(struct vfio_device *device)
>  }
>  EXPORT_SYMBOL_GPL(vfio_unregister_group_dev);
>  
> -#ifdef CONFIG_HAVE_KVM
> +#if IS_ENABLED(CONFIG_KVM)
>  void vfio_device_get_kvm_safe(struct vfio_device *device, struct kvm *kvm)
>  {
>   void (*pfn)(struct kvm *kvm);



Re: [PATCH 04/26] vfio: Add struct to hold KVM assets and dedup group vs. iommufd code

2023-09-28 Thread Alex Williamson
On Fri, 15 Sep 2023 17:30:56 -0700
Sean Christopherson  wrote:

> Add a struct to hold the KVM assets needed to manage and pass along KVM
> references to VFIO devices.  Providing a common struct deduplicates the
> group vs. iommufd code, and will make it easier to rework the attachment
> logic so that VFIO doesn't have to do a symbol lookup to retrieve the
> get/put helpers from KVM.
> 
> Signed-off-by: Sean Christopherson 
> ---
>  drivers/vfio/device_cdev.c |  9 +---
>  drivers/vfio/group.c   | 18 ++--
>  drivers/vfio/vfio.h| 22 +--
>  drivers/vfio/vfio_main.c   | 43 +++---
>  4 files changed, 45 insertions(+), 47 deletions(-)


Reviewed-by: Alex Williamson 

 
> diff --git a/drivers/vfio/device_cdev.c b/drivers/vfio/device_cdev.c
> index e75da0a70d1f..e484d6d6400a 100644
> --- a/drivers/vfio/device_cdev.c
> +++ b/drivers/vfio/device_cdev.c
> @@ -46,13 +46,6 @@ int vfio_device_fops_cdev_open(struct inode *inode, struct file *filep)
>   return ret;
>  }
>  
> -static void vfio_df_get_kvm_safe(struct vfio_device_file *df)
> -{
> - spin_lock(&df->kvm_ref_lock);
> - vfio_device_get_kvm_safe(df->device, df->kvm);
> - spin_unlock(&df->kvm_ref_lock);
> -}
> -
>  long vfio_df_ioctl_bind_iommufd(struct vfio_device_file *df,
>   struct vfio_device_bind_iommufd __user *arg)
>  {
> @@ -99,7 +92,7 @@ long vfio_df_ioctl_bind_iommufd(struct vfio_device_file *df,
>* a reference.  This reference is held until device closed.
>* Save the pointer in the device for use by drivers.
>*/
> - vfio_df_get_kvm_safe(df);
> + vfio_device_get_kvm_safe(df->device, &df->kvm_ref);
>  
>   ret = vfio_df_open(df);
>   if (ret)
> diff --git a/drivers/vfio/group.c b/drivers/vfio/group.c
> index 610a429c6191..756e47ff4cf0 100644
> --- a/drivers/vfio/group.c
> +++ b/drivers/vfio/group.c
> @@ -157,13 +157,6 @@ static int vfio_group_ioctl_set_container(struct vfio_group *group,
>   return ret;
>  }
>  
> -static void vfio_device_group_get_kvm_safe(struct vfio_device *device)
> -{
> - spin_lock(&device->group->kvm_ref_lock);
> - vfio_device_get_kvm_safe(device, device->group->kvm);
> - spin_unlock(&device->group->kvm_ref_lock);
> -}
> -
>  static int vfio_df_group_open(struct vfio_device_file *df)
>  {
>   struct vfio_device *device = df->device;
> @@ -184,7 +177,7 @@ static int vfio_df_group_open(struct vfio_device_file *df)
>* the pointer in the device for use by drivers.
>*/
>   if (device->open_count == 0)
> - vfio_device_group_get_kvm_safe(device);
> + vfio_device_get_kvm_safe(device, &device->group->kvm_ref);
>  
>   df->iommufd = device->group->iommufd;
>   if (df->iommufd && vfio_device_is_noiommu(device) && device->open_count == 0) {
> @@ -560,7 +553,7 @@ static struct vfio_group *vfio_group_alloc(struct iommu_group *iommu_group,
>  
>   refcount_set(&group->drivers, 1);
>   mutex_init(&group->group_lock);
> - spin_lock_init(&group->kvm_ref_lock);
> + spin_lock_init(&group->kvm_ref.lock);
>   INIT_LIST_HEAD(&group->device_list);
>   mutex_init(&group->device_lock);
>   group->iommu_group = iommu_group;
> @@ -884,13 +877,6 @@ bool vfio_group_enforced_coherent(struct vfio_group *group)
>   return ret;
>  }
>  
> -void vfio_group_set_kvm(struct vfio_group *group, struct kvm *kvm)
> -{
> - spin_lock(&group->kvm_ref_lock);
> - group->kvm = kvm;
> - spin_unlock(&group->kvm_ref_lock);
> -}
> -
>  /**
>   * vfio_file_has_dev - True if the VFIO file is a handle for device
>   * @file: VFIO file to check
> diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
> index c26d1ad68105..a1f741365075 100644
> --- a/drivers/vfio/vfio.h
> +++ b/drivers/vfio/vfio.h
> @@ -12,18 +12,23 @@
>  #include 
>  #include 
>  
> +struct kvm;
>  struct iommufd_ctx;
>  struct iommu_group;
>  struct vfio_container;
>  
> +struct vfio_kvm_reference {
> + struct kvm  *kvm;
> + spinlock_t  lock;
> +};
> +
>  struct vfio_device_file {
>   struct vfio_device *device;
>   struct vfio_group *group;
>  
>   u8 access_granted;
>   u32 devid; /* only valid when iommufd is valid */
> - spinlock_t kvm_ref_lock; /* protect kvm field */
> - struct kvm *kvm;
> + struct vfio_kvm_reference kvm_ref;
>   struct iommufd_ctx *iommufd; /* protected by struct vfio_device_set::lock */
>  };
>  
> @@ -88,11 +93,10 @@ s

Re: [PATCH 06/26] KVM: Drop CONFIG_KVM_VFIO and just look at KVM+VFIO

2023-09-28 Thread Alex Williamson
On Fri, 15 Sep 2023 17:30:58 -0700
Sean Christopherson  wrote:

> Drop KVM's KVM_VFIO Kconfig, and instead compile in VFIO support if
> and only if VFIO itself is enabled.  Similar to the recent change to have
> VFIO stop looking at HAVE_KVM, compiling in support for talking to VFIO
> just because the architecture supports VFIO is nonsensical.
> 
> This fixes a bug where RISC-V doesn't select KVM_VFIO, i.e. would silently
> fail to connect KVM and VFIO, even though RISC-V supports VFIO.  The
> bug is benign as the only driver in all of Linux that actually uses the
> KVM reference provided by VFIO is KVM-GT, which is x86/Intel specific.
> 
> Signed-off-by: Sean Christopherson 
> ---
>  arch/arm64/kvm/Kconfig   | 1 -
>  arch/powerpc/kvm/Kconfig | 1 -
>  arch/s390/kvm/Kconfig| 1 -
>  arch/x86/kvm/Kconfig | 1 -
>  virt/kvm/Kconfig | 3 ---
>  virt/kvm/Makefile.kvm| 4 +++-
>  virt/kvm/vfio.h  | 2 +-
>  7 files changed, 4 insertions(+), 9 deletions(-)


Reviewed-by: Alex Williamson 


> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
> index 83c1e09be42e..2b5c332f157d 100644
> --- a/arch/arm64/kvm/Kconfig
> +++ b/arch/arm64/kvm/Kconfig
> @@ -28,7 +28,6 @@ menuconfig KVM
>   select KVM_MMIO
>   select KVM_GENERIC_DIRTYLOG_READ_PROTECT
>   select KVM_XFER_TO_GUEST_WORK
> - select KVM_VFIO
>   select HAVE_KVM_EVENTFD
>   select HAVE_KVM_IRQFD
>   select HAVE_KVM_DIRTY_RING_ACQ_REL
> diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
> index 902611954200..c4beb49c0eb2 100644
> --- a/arch/powerpc/kvm/Kconfig
> +++ b/arch/powerpc/kvm/Kconfig
> @@ -22,7 +22,6 @@ config KVM
>   select PREEMPT_NOTIFIERS
>   select HAVE_KVM_EVENTFD
>   select HAVE_KVM_VCPU_ASYNC_IOCTL
> - select KVM_VFIO
>   select IRQ_BYPASS_MANAGER
>   select HAVE_KVM_IRQ_BYPASS
>   select INTERVAL_TREE
> diff --git a/arch/s390/kvm/Kconfig b/arch/s390/kvm/Kconfig
> index 45fdf2a9b2e3..459d536116a6 100644
> --- a/arch/s390/kvm/Kconfig
> +++ b/arch/s390/kvm/Kconfig
> @@ -31,7 +31,6 @@ config KVM
>   select HAVE_KVM_IRQ_ROUTING
>   select HAVE_KVM_INVALID_WAKEUPS
>   select HAVE_KVM_NO_POLL
> - select KVM_VFIO
>   select INTERVAL_TREE
>   select MMU_NOTIFIER
>   help
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index ed90f148140d..0f01e5600b5f 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -45,7 +45,6 @@ config KVM
>   select HAVE_KVM_NO_POLL
>   select KVM_XFER_TO_GUEST_WORK
>   select KVM_GENERIC_DIRTYLOG_READ_PROTECT
> - select KVM_VFIO
>   select INTERVAL_TREE
>   select HAVE_KVM_PM_NOTIFIER if PM
>   select KVM_GENERIC_HARDWARE_ENABLING
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 484d0873061c..f0be3b55cea6 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -59,9 +59,6 @@ config HAVE_KVM_MSI
>  config HAVE_KVM_CPU_RELAX_INTERCEPT
> bool
>  
> -config KVM_VFIO
> -   bool
> -
>  config HAVE_KVM_INVALID_WAKEUPS
> bool
>  
> diff --git a/virt/kvm/Makefile.kvm b/virt/kvm/Makefile.kvm
> index 2c27d5d0c367..29373b59d89a 100644
> --- a/virt/kvm/Makefile.kvm
> +++ b/virt/kvm/Makefile.kvm
> @@ -6,7 +6,9 @@
>  KVM ?= ../../../virt/kvm
>  
>  kvm-y := $(KVM)/kvm_main.o $(KVM)/eventfd.o $(KVM)/binary_stats.o
> -kvm-$(CONFIG_KVM_VFIO) += $(KVM)/vfio.o
> +ifdef CONFIG_VFIO
> +kvm-y += $(KVM)/vfio.o
> +endif
>  kvm-$(CONFIG_KVM_MMIO) += $(KVM)/coalesced_mmio.o
>  kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
>  kvm-$(CONFIG_HAVE_KVM_IRQ_ROUTING) += $(KVM)/irqchip.o
> diff --git a/virt/kvm/vfio.h b/virt/kvm/vfio.h
> index e130a4a03530..af475a323965 100644
> --- a/virt/kvm/vfio.h
> +++ b/virt/kvm/vfio.h
> @@ -2,7 +2,7 @@
>  #ifndef __KVM_VFIO_H
>  #define __KVM_VFIO_H
>  
> -#ifdef CONFIG_KVM_VFIO
> +#if IS_ENABLED(CONFIG_KVM) && IS_ENABLED(CONFIG_VFIO)
>  int kvm_vfio_ops_init(void);
>  void kvm_vfio_ops_exit(void);
>  #else


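[Editorial aside on the Makefile.kvm hunk: CONFIG_VFIO is a tristate, so when KVM is built-in but VFIO=m, the usual `kvm-$(CONFIG_VFIO)` idiom would expand to `kvm-m`, a list kbuild ignores for a built-in kvm, while `ifdef CONFIG_VFIO` fires for both =y and =m. A standalone sketch with hypothetical file and object names:]

```shell
# Demonstrate the two idioms against a tristate symbol set to "m".
cat > demo.mk <<'EOF'
CONFIG_VFIO := m
kvm-$(CONFIG_VFIO) += via-subst.o
ifdef CONFIG_VFIO
kvm-y += via-ifdef.o
endif
$(info kvm-y is $(kvm-y))
$(info kvm-m is $(kvm-m))
all: ;
EOF
make -s -f demo.mk
```

With VFIO=m, the object lands in `kvm-y` only via the `ifdef` form; the substitution form files it under the ignored `kvm-m` list.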

Re: [PATCH 0/2] eventfd: simplify signal helpers

2023-07-17 Thread Alex Williamson
On Mon, 17 Jul 2023 19:12:16 -0300
Jason Gunthorpe  wrote:

> On Mon, Jul 17, 2023 at 01:08:31PM -0600, Alex Williamson wrote:
> 
> > What would that mechanism be?  We've been iterating on getting the
> > serialization and buffering correct, but I don't know of another means
> > that combines the notification with a value, so we'd likely end up with
> > an eventfd only for notification and a separate ring buffer for
> > notification values.  
> 
> All FDs do this. You just have to make a FD with custom
> file_operations that does what this wants. The uAPI shouldn't be able
> to tell if the FD is backing it with an eventfd or otherwise. Have the
> kernel return the FD instead of accepting it. Follow the basic design
> of eg mlx5vf_save_fops

Sure, userspace could poll on any fd and read a value from it, but at
that point we're essentially duplicating a lot of what eventfd provides
for a minor(?) semantic difference over how the counter value is
interpreted.  Using an actual eventfd allows the ACPI notification to
work as just another interrupt index within the existing vfio IRQ uAPI.
Thanks,

Alex



Re: [PATCH 0/2] eventfd: simplify signal helpers

2023-07-17 Thread Alex Williamson
On Mon, 17 Jul 2023 10:29:34 +0200
Grzegorz Jaszczyk  wrote:

> On Fri, 14 Jul 2023 at 09:05, Christian Brauner wrote:
> >
> > On Thu, Jul 13, 2023 at 11:10:54AM -0600, Alex Williamson wrote:  
> > > On Thu, 13 Jul 2023 12:05:36 +0200
> > > Christian Brauner  wrote:
> > >  
> > > > Hey everyone,
> > > >
> > > > This simplifies the eventfd_signal() and eventfd_signal_mask() helpers
> > > > by removing the count argument which is effectively unused.  
> > >
> > > We have a patch under review which does in fact make use of the
> > > signaling value:
> > >
> > > https://lore.kernel.org/all/20230630155936.3015595-1-...@semihalf.com/  
> >
> > Huh, thanks for the link.
> >
> > Quoting from
> > https://patchwork.kernel.org/project/kvm/patch/20230307220553.631069-1-...@semihalf.com/#25266856
> >  
> > > Reading an eventfd returns an 8-byte value, we generally only use it
> > > as a counter, but it's been discussed previously and IIRC, it's possible
> > > to use that value as a notification value.  
> >
> > So the goal is to pipe a specific value through eventfd? But it is
> > explicitly a counter. The whole thing is written around a counter and
> > each write and signal adds to the counter.
> >
> > The consequences are pretty well described in the cover letter of
> > v6 https://lore.kernel.org/all/20230630155936.3015595-1-...@semihalf.com/
> >  
> > > Since the eventfd counter is used as ACPI notification value
> > > placeholder, the eventfd signaling needs to be serialized in order to
> > > not end up with notification values being coalesced. Therefore ACPI
> > > notification values are buffered and signalized one by one, when the
> > > previous notification value has been consumed.  
> >
> > But isn't this a good indication that you really don't want an eventfd
> > but something that's explicitly designed to associate specific data with
> > a notification? Using eventfd in that manner requires serialization,
> > buffering, and enforces ordering.

What would that mechanism be?  We've been iterating on getting the
serialization and buffering correct, but I don't know of another means
that combines the notification with a value, so we'd likely end up with
an eventfd only for notification and a separate ring buffer for
notification values.

As this series demonstrates, the current in-kernel users only increment
the counter and most userspace likely discards the counter value, which
makes the counter largely a waste.  While perhaps unconventional,
there's no requirement that the counter may only be incremented by one,
nor any restriction that I see in how userspace must interpret the
counter value.

As I understand the ACPI notification proposal that Grzegorz links
below, a notification with an interpreted value allows for a more
> direct userspace implementation when dealing with a series of discrete
> notification-with-value events.  Thanks,

Alex

> > I have no skin in the game aside from having to drop this conversion
> > which I'm fine to do if there are actually users for this btu really,
> > that looks a lot like abusing an api that really wasn't designed for
> > this.  
> 
> https://patchwork.kernel.org/project/kvm/patch/20230307220553.631069-1-...@semihalf.com/
> was posted at the beginning of March and one of the main things we've
> discussed was the mechanism for propagating the ACPI notification value.
> We've ended up with eventfd as the best mechanism and have actually been
> using it from v2. I really do not want to waste this effort, I think
> we are quite advanced with v6 now. Additionally we didn't actually
> modify any part of eventfd support that was in place, we only used it
> in a specific (and discussed beforehand) way.



Re: [PATCH 0/2] eventfd: simplify signal helpers

2023-07-13 Thread Alex Williamson
On Thu, 13 Jul 2023 12:05:36 +0200
Christian Brauner  wrote:

> Hey everyone,
> 
> This simplifies the eventfd_signal() and eventfd_signal_mask() helpers
> by removing the count argument which is effectively unused.

We have a patch under review which does in fact make use of the
signaling value:

https://lore.kernel.org/all/20230630155936.3015595-1-...@semihalf.com/

Thanks,
Alex



Re: [PATCH v4 0/7] introduce vm_flags modifier functions

2023-03-17 Thread Alex Williamson
On Fri, 17 Mar 2023 12:08:32 -0700
Suren Baghdasaryan  wrote:

> On Tue, Mar 14, 2023 at 1:11 PM Alex Williamson
>  wrote:
> >
> > On Thu, 26 Jan 2023 11:37:45 -0800
> > Suren Baghdasaryan  wrote:
> >  
> > > This patchset was originally published as a part of per-VMA locking [1] and
> > > was split after a suggestion that it's viable on its own and to facilitate
> > > the review process. It is now a prerequisite for the next version of the
> > > per-VMA lock patchset, which reuses vm_flags modifier functions to lock the
> > > VMA when vm_flags are being updated.
> > >
> > > VMA vm_flags modifications are usually done under exclusive mmap_lock
> > > protection because this attribute affects other decisions like VMA merging
> > > or splitting and races should be prevented. Introduce vm_flags modifier
> > > functions to enforce correct locking.
> > >
> > > The patchset applies cleanly over mm-unstable branch of mm tree.  
> >
> > With this series, vfio-pci developed a bunch of warnings around not
> > holding the mmap_lock write semaphore while calling
> > io_remap_pfn_range() from our fault handler, vfio_pci_mmap_fault().
> >
> > I suspect vdpa has the same issue for their use of remap_pfn_range()
> > from their fault handler, JasonW, MST, FYI.
> >
> > It also looks like gru_fault() would have the same issue, Dimitri.
> >
> > In all cases, we're preemptively setting vm_flags to what
> > remap_pfn_range_notrack() uses, so I thought we were safe here as I
> > specifically remember trying to avoid changing vm_flags from the
> > fault handler.  But apparently that doesn't take into account
> > track_pfn_remap() where VM_PAT comes into play.
> >
> > The reason for using remap_pfn_range() on fault in vfio-pci is that
> > we're mapping device MMIO to userspace, where that MMIO can be disabled
> > and we'd rather zap the mapping when that occurs so that we can sigbus
> > the user rather than allow the user to trigger potentially fatal bus
> > errors on the host.
> >
> > Peter Xu has suggested offline that a non-lazy approach to reinsert the
> > mappings might be more in line with mm expectations relative to touching
> > vm_flags during fault.  What's the right solution here?  Can the fault
> > handling be salvaged, is proactive remapping the right approach, or is
> > there something better?  Thanks,  
> 
> Hi Alex,
> If in your case it's safe to change vm_flags without holding exclusive
> mmap_lock, maybe you can use __vm_flags_mod() the way I used it in
> https://lore.kernel.org/all/20230126193752.297968-7-sur...@google.com,
> while explaining why this should be safe?

Hi Suren,

Thanks for the reply, but I'm not sure I'm following.  Are you
suggesting a bool arg added to io_remap_pfn_range(), or some new
variant of that function to conditionally use __vm_flags_mod() in place
of vm_flags_set() across the call chain?  Thanks,

Alex



Re: [PATCH v4 0/7] introduce vm_flags modifier functions

2023-03-14 Thread Alex Williamson
On Thu, 26 Jan 2023 11:37:45 -0800
Suren Baghdasaryan  wrote:

> This patchset was originally published as a part of per-VMA locking [1] and
> was split after a suggestion that it's viable on its own and to facilitate
> the review process. It is now a prerequisite for the next version of the per-VMA
> lock patchset, which reuses vm_flags modifier functions to lock the VMA when
> vm_flags are being updated.
> 
> VMA vm_flags modifications are usually done under exclusive mmap_lock
> protection because this attribute affects other decisions like VMA merging
> or splitting and races should be prevented. Introduce vm_flags modifier
> functions to enforce correct locking.
> 
> The patchset applies cleanly over mm-unstable branch of mm tree.

With this series, vfio-pci developed a bunch of warnings around not
holding the mmap_lock write semaphore while calling
io_remap_pfn_range() from our fault handler, vfio_pci_mmap_fault().

I suspect vdpa has the same issue for their use of remap_pfn_range()
from their fault handler, JasonW, MST, FYI.

It also looks like gru_fault() would have the same issue, Dimitri.

In all cases, we're preemptively setting vm_flags to what
remap_pfn_range_notrack() uses, so I thought we were safe here as I
specifically remember trying to avoid changing vm_flags from the
fault handler.  But apparently that doesn't take into account
track_pfn_remap() where VM_PAT comes into play.

The reason for using remap_pfn_range() on fault in vfio-pci is that
we're mapping device MMIO to userspace, where that MMIO can be disabled
and we'd rather zap the mapping when that occurs so that we can sigbus
the user rather than allow the user to trigger potentially fatal bus
errors on the host.

Peter Xu has suggested offline that a non-lazy approach to reinsert the
mappings might be more in line with mm expectations relative to touching
vm_flags during fault.  What's the right solution here?  Can the fault
handling be salvaged, is proactive remapping the right approach, or is
there something better?  Thanks,

Alex



Re: [PATCH v2 0/4] Reenable VFIO support on POWER systems

2023-03-06 Thread Alex Williamson
On Mon, 6 Mar 2023 18:35:22 -0600 (CST)
Timothy Pearson  wrote:

> - Original Message -
> > From: "Alex Williamson" 
> > To: "Timothy Pearson" 
> > Cc: "kvm" , "linuxppc-dev" 
> > 
> > Sent: Monday, March 6, 2023 5:46:07 PM
> > Subject: Re: [PATCH v2 0/4] Reenable VFIO support on POWER systems  
> 
> > On Mon, 6 Mar 2023 11:29:53 -0600 (CST)
> > Timothy Pearson  wrote:
> >   
> >> This patch series reenables VFIO support on POWER systems.  It
> >> is based on Alexey Kardashevskiys's patch series, rebased and
> >> successfully tested under QEMU with a Marvell PCIe SATA controller
> >> on a POWER9 Blackbird host.
> >> 
> >> Alexey Kardashevskiy (3):
> >>   powerpc/iommu: Add "borrowing" iommu_table_group_ops
> >>   powerpc/pci_64: Init pcibios subsys a bit later
> >>   powerpc/iommu: Add iommu_ops to report capabilities and allow blocking
> >> domains
> >> 
> >> Timothy Pearson (1):
> >>   Add myself to MAINTAINERS for Power VFIO support
> >> 
> >>  MAINTAINERS   |   5 +
> >>  arch/powerpc/include/asm/iommu.h  |   6 +-
> >>  arch/powerpc/include/asm/pci-bridge.h |   7 +
> >>  arch/powerpc/kernel/iommu.c   | 246 +-
> >>  arch/powerpc/kernel/pci_64.c  |   2 +-
> >>  arch/powerpc/platforms/powernv/pci-ioda.c |  36 +++-
> >>  arch/powerpc/platforms/pseries/iommu.c|  27 +++
> >>  arch/powerpc/platforms/pseries/pseries.h  |   4 +
> >>  arch/powerpc/platforms/pseries/setup.c|   3 +
> >>  drivers/vfio/vfio_iommu_spapr_tce.c   |  96 ++---
> >>  10 files changed, 338 insertions(+), 94 deletions(-)
> >>   
> > 
> > For vfio and MAINTAINERS portions,
> > 
> > Acked-by: Alex Williamson 
> > 
> > I'll note though that spapr_tce_take_ownership() looks like it copied a
> > bug from the old tce_iommu_take_ownership() where tbl and tbl->it_map
> > are tested before calling iommu_take_ownership() but not in the unwind
> > loop, ie. tables we might have skipped on setup are unconditionally
> > released on unwind.  Thanks,
> > 
> > Alex  
> 
> Thanks for that.  I'll put together a patch to get rid of that
> potential bug that can be applied after this series is merged, unless
> you'd rather I resubmit a v3 with the issue fixed?

Follow-up fix is fine by me.  Thanks,

Alex



Re: [PATCH v2 0/4] Reenable VFIO support on POWER systems

2023-03-06 Thread Alex Williamson
On Mon, 6 Mar 2023 11:29:53 -0600 (CST)
Timothy Pearson  wrote:

> This patch series reenables VFIO support on POWER systems.  It
> is based on Alexey Kardashevskiys's patch series, rebased and
> successfully tested under QEMU with a Marvell PCIe SATA controller
> on a POWER9 Blackbird host.
> 
> Alexey Kardashevskiy (3):
>   powerpc/iommu: Add "borrowing" iommu_table_group_ops
>   powerpc/pci_64: Init pcibios subsys a bit later
>   powerpc/iommu: Add iommu_ops to report capabilities and allow blocking
> domains
> 
> Timothy Pearson (1):
>   Add myself to MAINTAINERS for Power VFIO support
> 
>  MAINTAINERS   |   5 +
>  arch/powerpc/include/asm/iommu.h  |   6 +-
>  arch/powerpc/include/asm/pci-bridge.h |   7 +
>  arch/powerpc/kernel/iommu.c   | 246 +-
>  arch/powerpc/kernel/pci_64.c  |   2 +-
>  arch/powerpc/platforms/powernv/pci-ioda.c |  36 +++-
>  arch/powerpc/platforms/pseries/iommu.c|  27 +++
>  arch/powerpc/platforms/pseries/pseries.h  |   4 +
>  arch/powerpc/platforms/pseries/setup.c|   3 +
>  drivers/vfio/vfio_iommu_spapr_tce.c   |  96 ++---
>  10 files changed, 338 insertions(+), 94 deletions(-)
> 

For vfio and MAINTAINERS portions,

Acked-by: Alex Williamson 

I'll note though that spapr_tce_take_ownership() looks like it copied a
bug from the old tce_iommu_take_ownership() where tbl and tbl->it_map
are tested before calling iommu_take_ownership() but not in the unwind
loop, ie. tables we might have skipped on setup are unconditionally
released on unwind.  Thanks,

Alex



[PATCH] vfio/pci: Revert nvlink removal uAPI breakage

2021-05-04 Thread Alex Williamson
Revert the uAPI changes from the below commit with notice that these
regions and capabilities are no longer provided.

Fixes: b392a1989170 ("vfio/pci: remove vfio_pci_nvlink2")
Reported-by: Greg Kurz 
Signed-off-by: Alex Williamson 
---

Greg (Kurz), please double check this resolves the issue.  Thanks!

 include/uapi/linux/vfio.h |   46 +
 1 file changed, 42 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 34b1f53a3901..ef33ea002b0b 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -333,10 +333,21 @@ struct vfio_region_info_cap_type {
 #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG  (3)
 
 /* 10de vendor PCI sub-types */
-/* subtype 1 was VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2_RAM, don't use */
+/*
+ * NVIDIA GPU NVlink2 RAM is coherent RAM mapped onto the host address space.
+ *
+ * Deprecated, region no longer provided
+ */
+#define VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2_RAM (1)
 
 /* 1014 vendor PCI sub-types */
-/* subtype 1 was VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD, don't use */
+/*
+ * IBM NPU NVlink2 ATSD (Address Translation Shootdown) register of NPU
+ * to do TLB invalidation on a GPU.
+ *
+ * Deprecated, region no longer provided
+ */
+#define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD   (1)
 
 /* sub-types for VFIO_REGION_TYPE_GFX */
 #define VFIO_REGION_SUBTYPE_GFX_EDID(1)
@@ -630,9 +641,36 @@ struct vfio_device_migration_info {
  */
 #define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE 3
 
-/* subtype 4 was VFIO_REGION_INFO_CAP_NVLINK2_SSATGT, don't use */
+/*
+ * Capability with compressed real address (aka SSA - small system address)
+ * where GPU RAM is mapped on a system bus. Used by a GPU for DMA routing
+ * and by the userspace to associate a NVLink bridge with a GPU.
+ *
+ * Deprecated, capability no longer provided
+ */
+#define VFIO_REGION_INFO_CAP_NVLINK2_SSATGT4
+
+struct vfio_region_info_cap_nvlink2_ssatgt {
+   struct vfio_info_cap_header header;
+   __u64 tgt;
+};
 
-/* subtype 5 was VFIO_REGION_INFO_CAP_NVLINK2_LNKSPD, don't use */
+/*
+ * Capability with an NVLink link speed. The value is read by
+ * the NVlink2 bridge driver from the bridge's "ibm,nvlink-speed"
+ * property in the device tree. The value is fixed in the hardware
+ * and failing to provide the correct value results in the link
+ * not working with no indication from the driver why.
+ *
+ * Deprecated, capability no longer provided
+ */
+#define VFIO_REGION_INFO_CAP_NVLINK2_LNKSPD5
+
+struct vfio_region_info_cap_nvlink2_lnkspd {
+   struct vfio_info_cap_header header;
+   __u32 link_speed;
+   __u32 __pad;
+};
 
 /**
  * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,




Re: remove the nvlink2 pci_vfio subdriver v2

2021-05-04 Thread Alex Williamson
On Tue, 4 May 2021 16:11:31 +0200
Greg Kurz  wrote:

> On Tue, 4 May 2021 15:30:15 +0200
> Greg Kroah-Hartman  wrote:
> 
> > On Tue, May 04, 2021 at 03:20:34PM +0200, Greg Kurz wrote:  
> > > On Tue, 4 May 2021 14:59:07 +0200
> > > Greg Kroah-Hartman  wrote:
> > >   
> > > > On Tue, May 04, 2021 at 02:22:36PM +0200, Greg Kurz wrote:  
> > > > > On Fri, 26 Mar 2021 07:13:09 +0100
> > > > > Christoph Hellwig  wrote:
> > > > >   
> > > > > > Hi all,
> > > > > > 
> > > > > > the nvlink2 vfio subdriver is a weird beast.  It supports a hardware
> > > > > > feature without any open source component - what would normally be
> > > > > > the normal open source userspace that we require for kernel drivers,
> > > > > > although in this particular case user space could of course be a
> > > > > > kernel driver in a VM.  It also happens to be a complete mess that
> > > > > > does not properly bind to PCI IDs, is hacked into the vfio_pci 
> > > > > > driver
> > > > > > and also pulls in over 1000 lines of code always built into powerpc
> > > > > > kernels that have Power NV support enabled.  Because of all these
> > > > > > issues and the lack of breaking userspace when it is removed I think
> > > > > > the best idea is to simply kill it.
> > > > > > 
> > > > > > Changes since v1:
> > > > > >  - document the removed subtypes as reserved
> > > > > >  - add the ACK from Greg
> > > > > > 
> > > > > > Diffstat:
> > > > > >  arch/powerpc/platforms/powernv/npu-dma.c |  705 
> > > > > > ---
> > > > > >  b/arch/powerpc/include/asm/opal.h|3 
> > > > > >  b/arch/powerpc/include/asm/pci-bridge.h  |1 
> > > > > >  b/arch/powerpc/include/asm/pci.h |7 
> > > > > >  b/arch/powerpc/platforms/powernv/Makefile|2 
> > > > > >  b/arch/powerpc/platforms/powernv/opal-call.c |2 
> > > > > >  b/arch/powerpc/platforms/powernv/pci-ioda.c  |  185 ---
> > > > > >  b/arch/powerpc/platforms/powernv/pci.c   |   11 
> > > > > >  b/arch/powerpc/platforms/powernv/pci.h   |   17 
> > > > > >  b/arch/powerpc/platforms/pseries/pci.c   |   23 
> > > > > >  b/drivers/vfio/pci/Kconfig   |6 
> > > > > >  b/drivers/vfio/pci/Makefile  |1 
> > > > > >  b/drivers/vfio/pci/vfio_pci.c|   18 
> > > > > >  b/drivers/vfio/pci/vfio_pci_private.h|   14 
> > > > > >  b/include/uapi/linux/vfio.h  |   38 -  
> > > > > 
> > > > > 
> > > > > Hi Christoph,
> > > > > 
> > > > > FYI, these uapi changes break build of QEMU.  
> > > > 
> > > > What uapi changes?
> > > >   
> > > 
> > > All macros and structure definitions that are being removed
> > > from include/uapi/linux/vfio.h by patch 1.
> > >   
> > > > What exactly breaks?
> > > >   
> > > 
> > > These macros and types are used by the current QEMU code base.
> > > Next time the QEMU source tree updates its copy of the kernel
> > > headers, the compilation of affected code will fail.  
> > 
> > So does QEMU use this api that is being removed, or does it just have
> > some odd build artifacts of the uapi things?
> >   
> 
> These are region subtypes definition and associated capabilities.
> QEMU basically gets information on VFIO regions from the kernel
> driver and for those regions with a nvlink2 subtype, it tries
> to extract some more nvlink2 related info.


Urgh, let's put the uapi header back in place with a deprecation
notice.  Userspace should never have a dependency on the existence of a
given region, but clearly will have code to parse the data structure
describing that region.  I'll post a patch.  Thanks,

Alex



Re: [PATCH 1/2] vfio/pci: remove vfio_pci_nvlink2

2021-04-12 Thread Alex Williamson
On Mon, 12 Apr 2021 19:41:41 +1000
Michael Ellerman  wrote:

> Alex Williamson  writes:
> > On Fri, 26 Mar 2021 07:13:10 +0100
> > Christoph Hellwig  wrote:
> >  
> >> This driver never had any open userspace (which for VFIO would include
> >> VM kernel drivers) that use it, and thus should never have been added
> >> by our normal userspace ABI rules.
> >> 
> >> Signed-off-by: Christoph Hellwig 
> >> Acked-by: Greg Kroah-Hartman 
> >> ---
> >>  drivers/vfio/pci/Kconfig|   6 -
> >>  drivers/vfio/pci/Makefile   |   1 -
> >>  drivers/vfio/pci/vfio_pci.c |  18 -
> >>  drivers/vfio/pci/vfio_pci_nvlink2.c | 490 
> >>  drivers/vfio/pci/vfio_pci_private.h |  14 -
> >>  include/uapi/linux/vfio.h   |  38 +--
> >>  6 files changed, 4 insertions(+), 563 deletions(-)
> >>  delete mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c  
> >
> > Hearing no objections, applied to vfio next branch for v5.13.  Thanks,  
> 
> Looks like you only took patch 1?
> 
> I can't take patch 2 on its own, that would break the build.
> 
> Do you want to take both patches? There's currently no conflicts against
> my tree. It's possible one could appear before the v5.13 merge window,
> though it would probably just be something minor.
> 
> Or I could apply both patches to my tree, which means patch 1 would
> appear as two commits in the git history, but that's not a big deal.

I've already got a conflict in my next branch with patch 1, so it's
best to go through my tree.  Seems like a shared branch would be
easiest to allow you to merge and manage potential conflicts against
patch 2, I've pushed a branch here:

https://github.com/awilliam/linux-vfio.git v5.13/vfio/nvlink

Thanks,
Alex



Re: [PATCH 1/2] vfio/pci: remove vfio_pci_nvlink2

2021-04-06 Thread Alex Williamson
On Fri, 26 Mar 2021 07:13:10 +0100
Christoph Hellwig  wrote:

> This driver never had any open userspace (which for VFIO would include
> VM kernel drivers) that use it, and thus should never have been added
> by our normal userspace ABI rules.
> 
> Signed-off-by: Christoph Hellwig 
> Acked-by: Greg Kroah-Hartman 
> ---
>  drivers/vfio/pci/Kconfig|   6 -
>  drivers/vfio/pci/Makefile   |   1 -
>  drivers/vfio/pci/vfio_pci.c |  18 -
>  drivers/vfio/pci/vfio_pci_nvlink2.c | 490 
>  drivers/vfio/pci/vfio_pci_private.h |  14 -
>  include/uapi/linux/vfio.h   |  38 +--
>  6 files changed, 4 insertions(+), 563 deletions(-)
>  delete mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c

Hearing no objections, applied to vfio next branch for v5.13.  Thanks,

Alex



Re: [PATCH 1/2] vfio/pci: remove vfio_pci_nvlink2

2021-03-22 Thread Alex Williamson
On Mon, 22 Mar 2021 16:01:54 +0100
Christoph Hellwig  wrote:
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 8ce36c1d53ca11..db7e782419d5d9 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -332,19 +332,6 @@ struct vfio_region_info_cap_type {
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG   (2)
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG(3)
>  
> -/* 10de vendor PCI sub-types */
> -/*
> - * NVIDIA GPU NVlink2 RAM is coherent RAM mapped onto the host address space.
> - */
> -#define VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2_RAM   (1)
> -
> -/* 1014 vendor PCI sub-types */
> -/*
> - * IBM NPU NVlink2 ATSD (Address Translation Shootdown) register of NPU
> - * to do TLB invalidation on a GPU.
> - */
> -#define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD (1)
> -
>  /* sub-types for VFIO_REGION_TYPE_GFX */
>  #define VFIO_REGION_SUBTYPE_GFX_EDID(1)
>  
> @@ -637,33 +624,6 @@ struct vfio_device_migration_info {
>   */
>  #define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE   3
>  
> -/*
> - * Capability with compressed real address (aka SSA - small system address)
> - * where GPU RAM is mapped on a system bus. Used by a GPU for DMA routing
> - * and by the userspace to associate a NVLink bridge with a GPU.
> - */
> -#define VFIO_REGION_INFO_CAP_NVLINK2_SSATGT  4
> -
> -struct vfio_region_info_cap_nvlink2_ssatgt {
> - struct vfio_info_cap_header header;
> - __u64 tgt;
> -};
> -
> -/*
> - * Capability with an NVLink link speed. The value is read by
> - * the NVlink2 bridge driver from the bridge's "ibm,nvlink-speed"
> - * property in the device tree. The value is fixed in the hardware
> - * and failing to provide the correct value results in the link
> - * not working with no indication from the driver why.
> - */
> -#define VFIO_REGION_INFO_CAP_NVLINK2_LNKSPD  5
> -
> -struct vfio_region_info_cap_nvlink2_lnkspd {
> - struct vfio_info_cap_header header;
> - __u32 link_speed;
> - __u32 __pad;
> -};
> -
>  /**
>   * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,
>   *   struct vfio_irq_info)

I'll leave any attempt to defend keeping this code to Alexey, but
minimally these region sub-types and capability IDs should probably be
reserved to avoid breaking whatever userspace might exist to consume
these.  Our ID space is sufficiently large that we don't need to
recycle them any time soon.  Thanks,

Alex



Re: [PATCH kernel v2] vfio/pci/nvlink2: Do not attempt NPU2 setup on POWER8NVL NPU

2020-12-03 Thread Alex Williamson
On Sun, 22 Nov 2020 18:39:50 +1100
Alexey Kardashevskiy  wrote:

> We execute certain NPU2 setup code (such as mapping an LPID to a device
> in NPU2) unconditionally if an Nvlink bridge is detected. However this
> cannot succeed on POWER8NVL machines as the init helpers return an error
> other than ENODEV, which means the device is there and setup failed, so
> vfio_pci_enable() fails and pass through is not possible.
> 
> This changes the two NPU2 related init helpers to return -ENODEV if
> there is no "memory-region" device tree property as this is
> the distinction between NPU and NPU2.
> 
> Tested on
> - POWER9 pvr=004e1201, Ubuntu 19.04 host, Ubuntu 18.04 vm,
>   NVIDIA GV100 10de:1db1 driver 418.39
> - POWER8 pvr=004c0100, RHEL 7.6 host, Ubuntu 16.10 vm,
>   NVIDIA P100 10de:15f9 driver 396.47
> 
> Fixes: 7f92891778df ("vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] 
> subdriver")
 
> Cc: stable@vger.kernel.org # 5.0
> Signed-off-by: Alexey Kardashevskiy 
> ---
> Changes:
> v2:
> * updated commit log with tested configs and replaced P8+ with POWER8NVL for 
> clarity
> ---
>  drivers/vfio/pci/vfio_pci_nvlink2.c | 7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)

Thanks, applies to vfio next branch for v5.11.

Alex


> 
> diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c 
> b/drivers/vfio/pci/vfio_pci_nvlink2.c
> index 65c61710c0e9..9adcf6a8f888 100644
> --- a/drivers/vfio/pci/vfio_pci_nvlink2.c
> +++ b/drivers/vfio/pci/vfio_pci_nvlink2.c
> @@ -231,7 +231,7 @@ int vfio_pci_nvdia_v100_nvlink2_init(struct 
> vfio_pci_device *vdev)
>   return -EINVAL;
>  
>   if (of_property_read_u32(npu_node, "memory-region", &mem_phandle))
> - return -EINVAL;
> + return -ENODEV;
>  
>   mem_node = of_find_node_by_phandle(mem_phandle);
>   if (!mem_node)
> @@ -393,7 +393,7 @@ int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev)
>   int ret;
>   struct vfio_pci_npu2_data *data;
>   struct device_node *nvlink_dn;
> - u32 nvlink_index = 0;
> + u32 nvlink_index = 0, mem_phandle = 0;
>   struct pci_dev *npdev = vdev->pdev;
>   struct device_node *npu_node = pci_device_to_OF_node(npdev);
>   struct pci_controller *hose = pci_bus_to_host(npdev->bus);
> @@ -408,6 +408,9 @@ int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev)
>   if (!pnv_pci_get_gpu_dev(vdev->pdev))
>   return -ENODEV;
>  
> + if (of_property_read_u32(npu_node, "memory-region", &mem_phandle))
> + return -ENODEV;
> +
>   /*
>* NPU2 normally has 8 ATSD registers (for concurrency) and 6 links
>* so we can allocate one register per link, using nvlink index as



Re: [PATCH v2 1/1] vfio-pci/nvlink2: Allow fallback to ibm,mmio-atsd[0]

2020-04-01 Thread Alex Williamson
On Tue, 31 Mar 2020 15:12:46 +1100
Sam Bobroff  wrote:

> Older versions of skiboot only provide a single value in the device
> tree property "ibm,mmio-atsd", even when multiple Address Translation
> Shoot Down (ATSD) registers are present. This prevents NVLink2 devices
> (other than the first) from being used with vfio-pci because vfio-pci
> expects to be able to assign a dedicated ATSD register to each NVLink2
> device.
> 
> However, ATSD registers can be shared among devices. This change
> allows vfio-pci to fall back to sharing the register at index 0 if
> necessary.
> 
> Fixes: 7f92891778df ("vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] 
> subdriver")
> Signed-off-by: Sam Bobroff 
> ---
> Patch set v2:
> Patch 1/1: vfio-pci/nvlink2: Allow fallback to ibm,mmio-atsd[0]
> - Removed unnecessary warning.
> - Added Fixes tag.
> 
> Patch set v1:
> Patch 1/1: vfio-pci/nvlink2: Allow fallback to ibm,mmio-atsd[0]
> 
>  drivers/vfio/pci/vfio_pci_nvlink2.c | 10 --
>  1 file changed, 8 insertions(+), 2 deletions(-)

Applied to vfio next branch for v5.7 with Alexey's review.  Thanks,

Alex

> diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c 
> b/drivers/vfio/pci/vfio_pci_nvlink2.c
> index f2983f0f84be..ae2af590e501 100644
> --- a/drivers/vfio/pci/vfio_pci_nvlink2.c
> +++ b/drivers/vfio/pci/vfio_pci_nvlink2.c
> @@ -420,8 +420,14 @@ int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev)
>  
>   if (of_property_read_u64_index(hose->dn, "ibm,mmio-atsd", nvlink_index,
>   &mmio_atsd)) {
> - dev_warn(&vdev->pdev->dev, "No available ATSD found\n");
> - mmio_atsd = 0;
> + if (of_property_read_u64_index(hose->dn, "ibm,mmio-atsd", 0,
> + &mmio_atsd)) {
> + dev_warn(&vdev->pdev->dev, "No available ATSD found\n");
> + mmio_atsd = 0;
> + } else {
> + dev_warn(&vdev->pdev->dev,
> +  "Using fallback ibm,mmio-atsd[0] for ATSD.\n");
> + }
>   }
>  
>   if (of_property_read_u64(npu_node, "ibm,device-tgt-addr", )) {



Re: [PATCH kernel 5/5] vfio/spapr_tce: Advertise and allow a huge DMA windows at 4GB

2020-02-20 Thread Alex Williamson
On Tue, 18 Feb 2020 18:36:50 +1100
Alexey Kardashevskiy  wrote:

> So far the only option for a big 64bit DMA window was a window located
> at 0x800... (1<<59) which creates problems for devices
> supporting smaller DMA masks.
> 
> This exploits a POWER9 PHB option to allow the second DMA window to map
> at 0 and advertises it with a 4GB offset to avoid overlap with
> the default 32bit window.
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
>  include/uapi/linux/vfio.h   |  2 ++
>  drivers/vfio/vfio_iommu_spapr_tce.c | 10 --
>  2 files changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 9e843a147ead..c7f89d47335a 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -831,9 +831,11 @@ struct vfio_iommu_spapr_tce_info {
>   __u32 argsz;
>   __u32 flags;
>  #define VFIO_IOMMU_SPAPR_INFO_DDW(1 << 0)/* DDW supported */
> +#define VFIO_IOMMU_SPAPR_INFO_DDW_START  (1 << 1)/* DDW offset */
>   __u32 dma32_window_start;   /* 32 bit window start (bytes) */
>   __u32 dma32_window_size;/* 32 bit window size (bytes) */
>   struct vfio_iommu_spapr_tce_ddw_info ddw;
> + __u64 dma64_window_start;
>  };
>  
>  #define VFIO_IOMMU_SPAPR_TCE_GET_INFO_IO(VFIO_TYPE, VFIO_BASE + 12)
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
> b/drivers/vfio/vfio_iommu_spapr_tce.c
> index 16b3adc508db..4f22be3c4aa2 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -691,7 +691,7 @@ static long tce_iommu_create_window(struct tce_container 
> *container,
>   container->tables[num] = tbl;
>  
>   /* Return start address assigned by platform in create_table() */
> - *start_addr = tbl->it_offset << tbl->it_page_shift;
> + *start_addr = tbl->it_dmaoff << tbl->it_page_shift;
>  
>   return 0;
>  
> @@ -842,7 +842,13 @@ static long tce_iommu_ioctl(void *iommu_data,
>   info.ddw.levels = table_group->max_levels;
>   }
>  
> - ddwsz = offsetofend(struct vfio_iommu_spapr_tce_info, ddw);
> + ddwsz = offsetofend(struct vfio_iommu_spapr_tce_info,
> + dma64_window_start);

This breaks existing users, now they no longer get the ddw struct
unless their argsz also includes the new dma64 window field.

> +
> + if (info.argsz >= ddwsz) {
> + info.flags |= VFIO_IOMMU_SPAPR_INFO_DDW_START;
> + info.dma64_window_start = table_group->tce64_start;
> + }

This is inconsistent with ddw where we set the flag regardless of
argsz, but obviously only provide the field to the user if they've
provided room for it.  Thanks,

Alex

>  
>   if (info.argsz >= ddwsz)
>   minsz = ddwsz;



Re: [PATCH kernel] vfio/spapr/nvlink2: Skip unpinning pages on error exit

2020-01-10 Thread Alex Williamson
On Mon, 23 Dec 2019 12:09:27 +1100
Alexey Kardashevskiy  wrote:

> The nvlink2 subdriver for IBM Witherspoon machines preregisters
> GPU memory in the IOMMI API so KVM TCE code can map this memory
> for DMA as well. This is done by mm_iommu_newdev() called from
> vfio_pci_nvgpu_regops::mmap.
> 
> In an unlikely event of failure the data->mem remains NULL and
> since mm_iommu_put() (which unregisters the region and unpins memory
> if that was regular memory) does not expect mem==NULL, it should not be
> called.
> 
> This adds a check to only call mm_iommu_put() for a valid data->mem.
> 
> Fixes: 7f92891778df ("vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] 
> subdriver")
> Signed-off-by: Alexey Kardashevskiy 
> ---
>  drivers/vfio/pci/vfio_pci_nvlink2.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c 
> b/drivers/vfio/pci/vfio_pci_nvlink2.c
> index f2983f0f84be..3f5f8198a6bb 100644
> --- a/drivers/vfio/pci/vfio_pci_nvlink2.c
> +++ b/drivers/vfio/pci/vfio_pci_nvlink2.c
> @@ -97,8 +97,10 @@ static void vfio_pci_nvgpu_release(struct vfio_pci_device 
> *vdev,
>  
>   /* If there were any mappings at all... */
>   if (data->mm) {
> - ret = mm_iommu_put(data->mm, data->mem);
> - WARN_ON(ret);
> + if (data->mem) {
> + ret = mm_iommu_put(data->mm, data->mem);
> + WARN_ON(ret);
> + }
>  
>   mmdrop(data->mm);
>   }

Applied to vfio next branch for v5.6.  Thanks,

Alex



Re: [PATCH v7 09/24] vfio, mm: fix get_user_pages_remote() and FOLL_LONGTERM

2019-11-21 Thread Alex Williamson
On Wed, 20 Nov 2019 23:13:39 -0800
John Hubbard  wrote:

> As it says in the updated comment in gup.c: current FOLL_LONGTERM
> behavior is incompatible with FAULT_FLAG_ALLOW_RETRY because of the
> FS DAX check requirement on vmas.
> 
> However, the corresponding restriction in get_user_pages_remote() was
> slightly stricter than is actually required: it forbade all
> FOLL_LONGTERM callers, but we can actually allow FOLL_LONGTERM callers
> that do not set the "locked" arg.
> 
> Update the code and comments accordingly, and update the VFIO caller
> to take advantage of this, fixing a bug as a result: the VFIO caller
> is logically a FOLL_LONGTERM user.
> 
> Also, remove an unnessary pair of calls that were releasing and
> reacquiring the mmap_sem. There is no need to avoid holding mmap_sem
> just in order to call page_to_pfn().
> 
> Also, move the DAX check ("if a VMA is DAX, don't allow long term
> pinning") from the VFIO call site, all the way into the internals
> of get_user_pages_remote() and __gup_longterm_locked(). That is:
> get_user_pages_remote() calls __gup_longterm_locked(), which in turn
> calls check_dax_vmas(). It's lightly explained in the comments as well.
> 
> Thanks to Jason Gunthorpe for pointing out a clean way to fix this,
> and to Dan Williams for helping clarify the DAX refactoring.
> 
> Reviewed-by: Jason Gunthorpe 
> Reviewed-by: Ira Weiny 
> Suggested-by: Jason Gunthorpe 
> Cc: Dan Williams 
> Cc: Jerome Glisse 
> Signed-off-by: John Hubbard 
> ---
>  drivers/vfio/vfio_iommu_type1.c | 30 +-
>  mm/gup.c| 27 ++-
>  2 files changed, 27 insertions(+), 30 deletions(-)

Tested with device assignment and Intel mdev vGPU assignment with QEMU
userspace:

Tested-by: Alex Williamson 
Acked-by: Alex Williamson 

Feel free to include for 19/24 as well.  Thanks,

Alex

> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index d864277ea16f..c7a111ad9975 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -340,7 +340,6 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned 
> long vaddr,
>  {
>   struct page *page[1];
>   struct vm_area_struct *vma;
> - struct vm_area_struct *vmas[1];
>   unsigned int flags = 0;
>   int ret;
>  
> @@ -348,33 +347,14 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned 
> long vaddr,
>   flags |= FOLL_WRITE;
>  
>   down_read(&mm->mmap_sem);
> - if (mm == current->mm) {
> - ret = get_user_pages(vaddr, 1, flags | FOLL_LONGTERM, page,
> -  vmas);
> - } else {
> - ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
> - vmas, NULL);
> - /*
> -  * The lifetime of a vaddr_get_pfn() page pin is
> -  * userspace-controlled. In the fs-dax case this could
> -  * lead to indefinite stalls in filesystem operations.
> -  * Disallow attempts to pin fs-dax pages via this
> -  * interface.
> -  */
> - if (ret > 0 && vma_is_fsdax(vmas[0])) {
> - ret = -EOPNOTSUPP;
> - put_page(page[0]);
> - }
> - }
> - up_read(&mm->mmap_sem);
> -
> + ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags | FOLL_LONGTERM,
> + page, NULL, NULL);
>   if (ret == 1) {
>   *pfn = page_to_pfn(page[0]);
> - return 0;
> + ret = 0;
> + goto done;
>   }
>  
> - down_read(&mm->mmap_sem);
> -
>   vaddr = untagged_addr(vaddr);
>  
>   vma = find_vma_intersection(mm, vaddr, vaddr + 1);
> @@ -384,7 +364,7 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned 
> long vaddr,
>   if (is_invalid_reserved_pfn(*pfn))
>   ret = 0;
>   }
> -
> +done:
>   up_read(&mm->mmap_sem);
>   return ret;
>  }
> diff --git a/mm/gup.c b/mm/gup.c
> index 14fcdc502166..cce2c9676853 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -29,6 +29,13 @@ struct follow_page_context {
>   unsigned int page_mask;
>  };
>  
> +static __always_inline long __gup_longterm_locked(struct task_struct *tsk,
> +   struct mm_struct *mm,
> +   unsigned long start,
> +   unsigned long nr_pages,
> +   struct page **pages,
> 

Re: [PATCH kernel] vfio/spapr_tce: Fix incorrect tce_iommu_group memory free

2019-08-23 Thread Alex Williamson
On Mon, 19 Aug 2019 11:51:17 +1000
Alexey Kardashevskiy  wrote:

> The @tcegrp variable is used in 1) a loop over attached groups
> 2) it stores a pointer to a newly allocated tce_iommu_group if 1) found
> nothing. However the error handler does not distinguish how we got there
> and incorrectly releases memory for a found+incompatible group.
> 
> This fixes it by adding another error handling case.
> 
> Fixes: 0bd971676e68 ("powerpc/powernv/npu: Add compound IOMMU groups")
> Signed-off-by: Alexey Kardashevskiy 
> ---

Applied to vfio next branch with Paul's R-b.  Thanks,

Alex

> 
> The bug is there since 2157e7b82f3b but it would not appear in practice
> before 0bd971676e68, hence that "Fixes". Or it still should be
> 157e7b82f3b ("vfio: powerpc/spapr: Register memory and define IOMMU v2")
> ?
> 
> Found it when tried adding a "compound PE" (GPU + NPUs) to a container
> with a passed through xHCI host. The compatibility test (->create_table
> should be equal) treats them as incompatible, which might be a bug (or
> we are just suboptimal here) on its own.
> 
> ---
>  drivers/vfio/vfio_iommu_spapr_tce.c | 9 +
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
> b/drivers/vfio/vfio_iommu_spapr_tce.c
> index 8ce9ad21129f..babef8b00daf 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -1234,7 +1234,7 @@ static long tce_iommu_take_ownership_ddw(struct 
> tce_container *container,
>  static int tce_iommu_attach_group(void *iommu_data,
>   struct iommu_group *iommu_group)
>  {
> - int ret;
> + int ret = 0;
>   struct tce_container *container = iommu_data;
>   struct iommu_table_group *table_group;
>   struct tce_iommu_group *tcegrp = NULL;
> @@ -1287,13 +1287,13 @@ static int tce_iommu_attach_group(void *iommu_data,
>   !table_group->ops->release_ownership) {
>   if (container->v2) {
>   ret = -EPERM;
> - goto unlock_exit;
> + goto free_exit;
>   }
>   ret = tce_iommu_take_ownership(container, table_group);
>   } else {
>   if (!container->v2) {
>   ret = -EPERM;
> - goto unlock_exit;
> + goto free_exit;
>   }
>   ret = tce_iommu_take_ownership_ddw(container, table_group);
>   if (!tce_groups_attached(container) && !container->tables[0])
> @@ -1305,10 +1305,11 @@ static int tce_iommu_attach_group(void *iommu_data,
>   list_add(&tcegrp->next, &container->group_list);
>   }
>  
> -unlock_exit:
> +free_exit:
>   if (ret && tcegrp)
>   kfree(tcegrp);
>  
> +unlock_exit:
>   mutex_unlock(&container->lock);
>  
>   return ret;



Re: [PATCH kernel] vfio/spapr_tce: Fix incorrect tce_iommu_group memory free

2019-08-23 Thread Alex Williamson
On Fri, 23 Aug 2019 15:32:41 +1000
Paul Mackerras  wrote:

> On Mon, Aug 19, 2019 at 11:51:17AM +1000, Alexey Kardashevskiy wrote:
> > The @tcegrp variable is used in 1) a loop over attached groups
> > 2) it stores a pointer to a newly allocated tce_iommu_group if 1) found
> > nothing. However the error handler does not distinguish how we got there
> > and incorrectly releases memory for a found+incompatible group.
> > 
> > This fixes it by adding another error handling case.
> > 
> > Fixes: 0bd971676e68 ("powerpc/powernv/npu: Add compound IOMMU groups")
> > Signed-off-by: Alexey Kardashevskiy   
> 
> Good catch.  This is potentially nasty since it is a double free.
> Alex, are you going to take this, or would you prefer it goes via
> Michael Ellerman's tree?
> 
> Reviewed-by: Paul Mackerras 

I can take it, I've got it queued, but was hoping for an ack/review by
you or David.  I'll add the R-b and push it out to my next branch.
Thanks,

Alex


Re: [PATCH v3] mm: add account_locked_vm utility function

2019-06-03 Thread Alex Williamson
On Wed, 29 May 2019 16:50:19 -0400
Daniel Jordan  wrote:

> locked_vm accounting is done roughly the same way in five places, so
> unify them in a helper.
> 
> Include the helper's caller in the debug print to distinguish between
> callsites.
> 
> Error codes stay the same, so user-visible behavior does too.  The one
> exception is that the -EPERM case in tce_account_locked_vm is removed
> because Alexey has never seen it triggered.
> 
> Signed-off-by: Daniel Jordan 
> Tested-by: Alexey Kardashevskiy 
> Cc: Alan Tull 
> Cc: Alex Williamson 
> Cc: Andrew Morton 
> Cc: Benjamin Herrenschmidt 
> Cc: Christoph Lameter 
> Cc: Christophe Leroy 
> Cc: Davidlohr Bueso 
> Cc: Ira Weiny 
> Cc: Jason Gunthorpe 
> Cc: Mark Rutland 
> Cc: Michael Ellerman 
> Cc: Moritz Fischer 
> Cc: Paul Mackerras 
> Cc: Steve Sistare 
> Cc: Wu Hao 
> Cc: linux...@kvack.org
> Cc: k...@vger.kernel.org
> Cc: kvm-...@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-f...@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> ---
> v3:
>  - uninline account_locked_vm (Andrew)
>  - fix doc comment (Ira)
>  - retain down_write_killable in vfio type1 (Alex)
>  - leave Alexey's T-b since the code is the same aside from uninlining
>  - sanity tested with vfio type1, sanity-built on ppc
> 
>  arch/powerpc/kvm/book3s_64_vio.c | 44 ++--
>  arch/powerpc/mm/book3s64/iommu_api.c | 41 ++-
>  drivers/fpga/dfl-afu-dma-region.c| 53 ++--
>  drivers/vfio/vfio_iommu_spapr_tce.c  | 54 ++--
>  drivers/vfio/vfio_iommu_type1.c  | 17 +--
>  include/linux/mm.h   |  4 ++
>  mm/util.c| 75 
>  7 files changed, 98 insertions(+), 190 deletions(-)

I tend to prefer adding a negative rather than converting to absolute
and passing a bool for inc/dec, but it all seems equivalent, so for
vfio parts

Acked-by: Alex Williamson 

> 
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c 
> b/arch/powerpc/kvm/book3s_64_vio.c
> index 66270e07449a..768b645c7edf 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -30,6 +30,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -56,43 +57,6 @@ static unsigned long kvmppc_stt_pages(unsigned long 
> tce_pages)
>   return tce_pages + ALIGN(stt_bytes, PAGE_SIZE) / PAGE_SIZE;
>  }
>  
> -static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc)
> -{
> - long ret = 0;
> -
> - if (!current || !current->mm)
> - return ret; /* process exited */
> -
> - down_write(&current->mm->mmap_sem);
> -
> - if (inc) {
> - unsigned long locked, lock_limit;
> -
> - locked = current->mm->locked_vm + stt_pages;
> - lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> - if (locked > lock_limit && !capable(CAP_IPC_LOCK))
> - ret = -ENOMEM;
> - else
> - current->mm->locked_vm += stt_pages;
> - } else {
> - if (WARN_ON_ONCE(stt_pages > current->mm->locked_vm))
> - stt_pages = current->mm->locked_vm;
> -
> - current->mm->locked_vm -= stt_pages;
> - }
> -
> - pr_debug("[%d] RLIMIT_MEMLOCK KVM %c%ld %ld/%ld%s\n", current->pid,
> - inc ? '+' : '-',
> - stt_pages << PAGE_SHIFT,
> - current->mm->locked_vm << PAGE_SHIFT,
> - rlimit(RLIMIT_MEMLOCK),
> - ret ? " - exceeded" : "");
> -
> - up_write(&current->mm->mmap_sem);
> -
> - return ret;
> -}
> -
>  static void kvm_spapr_tce_iommu_table_free(struct rcu_head *head)
>  {
>   struct kvmppc_spapr_tce_iommu_table *stit = container_of(head,
> @@ -302,7 +266,7 @@ static int kvm_spapr_tce_release(struct inode *inode, 
> struct file *filp)
>  
>   kvm_put_kvm(stt->kvm);
>  
> - kvmppc_account_memlimit(
> + account_locked_vm(current->mm,
>   kvmppc_stt_pages(kvmppc_tce_pages(stt->size)), false);
>   call_rcu(&stt->rcu, release_spapr_tce_table);
>  
> @@ -327,7 +291,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>   return -EINVAL;
>  
>   npages = kvmppc_tce_pages(size);
> - ret = kvmppc_account_memlimit(kvmppc_stt_pages(npages), true);
> + ret = account_locked_vm(current->mm, kvmppc_stt_pages(npages), true);
>   if (ret)
>   r
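The core of the unified helper discussed in this thread can be sketched in userspace C. This is a toy model, not the kernel implementation: the mm, the limit, and the capability check are stand-ins for `mm->locked_vm`, `rlimit(RLIMIT_MEMLOCK)`, and `capable(CAP_IPC_LOCK)`:

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the account_locked_vm() logic: bump a per-mm locked-page
 * counter, failing with -ENOMEM when the increment would exceed the
 * memlock limit and the caller cannot bypass the rlimit; on decrement,
 * clamp at zero the same way the kernel does after its WARN_ON_ONCE. */

struct toy_mm {
	unsigned long locked_vm;	/* pages currently locked */
};

#define ENOMEM_ERR (-12)

static int account_locked_vm(struct toy_mm *mm, unsigned long pages,
			     bool inc, unsigned long lock_limit,
			     bool bypass_rlim)
{
	if (pages == 0 || !mm)
		return 0;

	if (inc) {
		unsigned long locked = mm->locked_vm + pages;

		if (locked > lock_limit && !bypass_rlim)
			return ENOMEM_ERR;
		mm->locked_vm = locked;
	} else {
		if (pages > mm->locked_vm)	/* would underflow */
			pages = mm->locked_vm;
		mm->locked_vm -= pages;
	}
	return 0;
}
```

In the kernel the whole body runs under `mmap_sem` held for write; the sketch omits the locking.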

Re: [PATCH v2] mm: add account_locked_vm utility function

2019-05-29 Thread Alex Williamson
On Tue, 28 May 2019 11:04:24 -0400
Daniel Jordan  wrote:

> On Sat, May 25, 2019 at 02:51:18PM -0700, Andrew Morton wrote:
> > On Fri, 24 May 2019 13:50:45 -0400 Daniel Jordan 
> >  wrote:
> >   
> > > locked_vm accounting is done roughly the same way in five places, so
> > > unify them in a helper.  Standardize the debug prints, which vary
> > > slightly, but include the helper's caller to disambiguate between
> > > callsites.
> > > 
> > > Error codes stay the same, so user-visible behavior does too.  The one
> > > exception is that the -EPERM case in tce_account_locked_vm is removed
> > > because Alexey has never seen it triggered.
> > > 
> > > ...
> > >
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -1564,6 +1564,25 @@ long get_user_pages_unlocked(unsigned long start, 
> > > unsigned long nr_pages,
> > >  int get_user_pages_fast(unsigned long start, int nr_pages,
> > >   unsigned int gup_flags, struct page **pages);
> > >  
> > > +int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool 
> > > inc,
> > > + struct task_struct *task, bool bypass_rlim);
> > > +
> > > +static inline int account_locked_vm(struct mm_struct *mm, unsigned long 
> > > pages,
> > > + bool inc)
> > > +{
> > > + int ret;
> > > +
> > > + if (pages == 0 || !mm)
> > > + return 0;
> > > +
> > > > + down_write(&mm->mmap_sem);
> > > + ret = __account_locked_vm(mm, pages, inc, current,
> > > +   capable(CAP_IPC_LOCK));
> > > > + up_write(&mm->mmap_sem);
> > > +
> > > + return ret;
> > > +}  
> > 
> > That's quite a mouthful for an inlined function.  How about uninlining
> > the whole thing and fiddling drivers/vfio/vfio_iommu_type1.c to suit. 
> > I wonder why it does down_write_killable and whether it really needs
> > to...  
> 
> Sure, I can uninline it.  vfio changelogs don't show a particular reason for
> _killable[1].  Maybe Alex has something to add.  Otherwise I'll respin without
> it since the simplification seems worth removing _killable.
> 
> [1] 0cfef2b7410b ("vfio/type1: Remove locked page accounting workqueue")

A userspace vfio driver maps DMA via an ioctl through this path, so I
believe I used killable here just to be friendly that it could be
interrupted and we could fall out with an errno if it were stuck here.
No harm, no foul, the user's mapping is aborted and unwound.  If we're
deadlocked or seriously contended on mmap_sem, maybe we're already in
trouble, but it seemed like a valid and low hanging use case for
killable.  Thanks,

Alex


Re: [PATCH] vfio-pci/nvlink2: Fix potential VMA leak

2019-05-07 Thread Alex Williamson
On Tue, 7 May 2019 09:01:45 +0200
Greg Kurz  wrote:

> On Tue, 7 May 2019 11:52:44 +1000
> Sam Bobroff  wrote:
> 
> > On Mon, May 06, 2019 at 03:58:45PM -0600, Alex Williamson wrote:  
> > > On Fri, 19 Apr 2019 17:37:17 +0200
> > > Greg Kurz  wrote:
> > > 
> > > > If vfio_pci_register_dev_region() fails then we should rollback
> > > > previous changes, ie. unmap the ATSD registers.
> > > > 
> > > > Signed-off-by: Greg Kurz 
> > > > ---
> > > 
> > > Applied to vfio next branch for v5.2 with Alexey's R-b.  Thanks!
> > > 
> > > Alex
> > 
> > Should this have a fixes tag? e.g.:
> > Fixes: 7f92891778df ("vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] 
> > subdriver")
> >   
> 
> Oops... you're right.
> 
> Alex, can you add the above tag ?

Added.  Thanks,

Alex


Re: [PATCH] vfio-pci/nvlink2: Fix potential VMA leak

2019-05-06 Thread Alex Williamson
On Fri, 19 Apr 2019 17:37:17 +0200
Greg Kurz  wrote:

> If vfio_pci_register_dev_region() fails then we should rollback
> previous changes, ie. unmap the ATSD registers.
> 
> Signed-off-by: Greg Kurz 
> ---

Applied to vfio next branch for v5.2 with Alexey's R-b.  Thanks!

Alex

>  drivers/vfio/pci/vfio_pci_nvlink2.c |2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c 
> b/drivers/vfio/pci/vfio_pci_nvlink2.c
> index 32f695ffe128..50fe3c4f7feb 100644
> --- a/drivers/vfio/pci/vfio_pci_nvlink2.c
> +++ b/drivers/vfio/pci/vfio_pci_nvlink2.c
> @@ -472,6 +472,8 @@ int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev)
>   return 0;
>  
>  free_exit:
> + if (data->base)
> + memunmap(data->base);
>   kfree(data);
>  
>   return ret;
> 



Re: [PATCH kernel v3] powerpc/powernv: Isolate NVLinks between GV100GL on Witherspoon

2019-04-30 Thread Alex Williamson
On Tue, 30 Apr 2019 16:14:35 +1000
Alexey Kardashevskiy  wrote:

> On 30/04/2019 15:45, Alistair Popple wrote:
> > Alexey,
> >   
> > +void pnv_try_isolate_nvidia_v100(struct pci_dev *bridge)
> > +{
> > +   u32 mask, val;
> > +   void __iomem *bar0_0, *bar0_12, *bar0_a0;
> > +   struct pci_dev *pdev;
> > +   u16 cmd = 0, cmdmask = PCI_COMMAND_MEMORY;
> > +
> > +   if (!bridge->subordinate)
> > +   return;
> > +
> > +   pdev = list_first_entry_or_null(&bridge->subordinate->devices,
> > +   struct pci_dev, bus_list);
> > +   if (!pdev)
> > +   return;
> > +
> > +   if (pdev->vendor != PCI_VENDOR_ID_NVIDIA)  
> > 
> > Don't you also need to check the PCIe devid to match only [PV]100 devices 
> > as 
> > well? I doubt there's any guarantee these registers will remain the same 
> > for 
> > all future (or older) NVIDIA devices.  
> 
> 
> I do not have the complete list of IDs and I already saw 3 different
> device ids and this only works for machines with ibm,npu/gpu/nvlinks
> properties so for now it works and for the future we are hoping to
> either have an open source nvidia driver or some small minidriver (also
> from nvidia, or may be a spec allowing us to write one) to allow
> topology discovery on the host so we would not depend on the skiboot's
> powernv DT.
> 
> > IMHO this should really be done in the device driver in the guest. A 
> > malcious 
> > guest could load a modified driver that doesn't do this, but that should 
> > not 
> > compromise other guests which presumably load a non-compromised driver that 
> > disables the links on that guests GPU. However I guess in practice what you 
> > have here should work equally well.  
> 
> Doing it in the guest means a good guest needs to have an updated
> driver, we do not really want to depend on this. The idea of IOMMU
> groups is that the hypervisor provides isolation irrespective to what
> the guest does.

+1 It's not the user/guest driver's responsibility to maintain the
isolation of the device.  Thanks,

Alex

> Also vfio+qemu+slof needs to convey the nvlink topology to the guest,
> seems like an unnecessary complication.
> 
> 
> 
> > - Alistair
> >   
> > +   return;
> > +
> > +   mask = nvlinkgpu_get_disable_mask(>dev);
> > +   if (!mask)
> > +   return;
> > +
> > +   bar0_0 = pci_iomap_range(pdev, 0, 0, 0x1);
> > +   if (!bar0_0) {
> > +   pci_err(pdev, "Error mapping BAR0 @0\n");
> > +   return;
> > +   }
> > +   bar0_12 = pci_iomap_range(pdev, 0, 0x12, 0x1);
> > +   if (!bar0_12) {
> > +   pci_err(pdev, "Error mapping BAR0 @12\n");
> > +   goto bar0_0_unmap;
> > +   }
> > +   bar0_a0 = pci_iomap_range(pdev, 0, 0xA0, 0x1);
> > +   if (!bar0_a0) {
> > +   pci_err(pdev, "Error mapping BAR0 @A0\n");
> > +   goto bar0_12_unmap;
> > +   }  
> 
>  Is it really necessary to do three separate ioremaps vs one that would
>  cover them all here?  I suspect you're just sneaking in PAGE_SIZE with
>  the 0x1 size mappings anyway.  Seems like it would simplify setup,
>  error reporting, and cleanup to to ioremap to the PAGE_ALIGN'd range
>  of the highest register accessed. Thanks,  
> >>>
> >>> Sure I can map it once, I just do not see the point in mapping/unmapping
> >>> all 0xa1>>16=161 system pages for a very short period of time while
> >>> we know precisely that we need just 3 pages.
> >>>
> >>> Repost?  
> >>
> >> Ping?
> >>
> >> Can this go in as it is (i.e. should I ping Michael) or this needs
> >> another round? It would be nice to get some formal acks. Thanks,
> >>  
>  Alex
>   
> > +
> > +   pci_restore_state(pdev);
> > +   pci_read_config_word(pdev, PCI_COMMAND, &cmd);
> > +   if ((cmd & cmdmask) != cmdmask)
> > +   pci_write_config_word(pdev, PCI_COMMAND, cmd | cmdmask);
> > +
> > +   /*
> > +* The sequence is from "Tesla P100 and V100 SXM2 NVLink 
> > Isolation on
> > +* Multi-Tenant Systems".
> > +* The register names are not provided there either, hence raw 
> > values.
> > +*/
> > +   iowrite32(0x4, bar0_12 + 0x4C);
> > +   iowrite32(0x2, bar0_12 + 0x2204);
> > +   val = ioread32(bar0_0 + 0x200);
> > +   val |= 0x0200;
> > +   iowrite32(val, bar0_0 + 0x200);
> > +   val = ioread32(bar0_a0 + 0x148);
> > +   val |= mask;
> > +   iowrite32(val, bar0_a0 + 0x148);
> > +
> > +   if ((cmd | cmdmask) != cmd)
> > +   pci_write_config_word(pdev, PCI_COMMAND, cmd);
> > +
> > +   

Re: [PATCH kernel v3] powerpc/powernv: Isolate NVLinks between GV100GL on Witherspoon

2019-04-11 Thread Alex Williamson
On Thu, 11 Apr 2019 16:48:44 +1000
Alexey Kardashevskiy  wrote:

> The NVIDIA V100 SXM2 GPUs are connected to the CPU via PCIe links and
> (on POWER9) NVLinks. In addition to that, GPUs themselves have direct
> peer-to-peer NVLinks in groups of 2 to 4 GPUs with no buffers/latches
> between GPUs.
> 
> Because of these interconnected NVLinks, the POWERNV platform puts such
> interconnected GPUs to the same IOMMU group. However users want to pass
> GPUs through individually which requires separate IOMMU groups.
> 
> Thankfully V100 GPUs implement an interface to disable arbitrary links
> by programming link disabling mask via the GPU's BAR0. Once a link is
> disabled, it only can be enabled after performing the secondary bus reset
> (SBR) on the GPU. Since these GPUs do not advertise any other type of
> reset, it is reset by the platform's SBR handler.
> 
> This adds an extra step to the POWERNV's SBR handler to block NVLinks to
> GPUs which do not belong to the same group as the GPU being reset.
> 
> This adds a new "isolate_nvlink" kernel parameter to force GPU isolation;
> when enabled, every GPU gets placed in its own IOMMU group. The new
> parameter is off by default to preserve the existing behaviour.
> 
> Before isolating:
> [nvdbg ~]$ nvidia-smi topo -m
> GPU0GPU1GPU2CPU Affinity
> GPU0 X  NV2 NV2 0-0
> GPU1NV2  X  NV2 0-0
> GPU2NV2 NV2  X  0-0
> 
> After isolating:
> [nvdbg ~]$ nvidia-smi topo -m
> GPU0GPU1GPU2CPU Affinity
> GPU0 X  PHB PHB 0-0
> GPU1PHB  X  PHB 0-0
> GPU2PHB PHB  X  0-0
> 
> Where:
>   X= Self
>   PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically 
> the CPU)
>   NV#  = Connection traversing a bonded set of # NVLinks
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
> Changes:
> v3:
> * added pci_err() for failed ioremap
> * reworked commit log
> 
> v2:
> * this is rework of [PATCH kernel RFC 0/2] vfio, powerpc/powernv: Isolate 
> GV100GL
> but this time it is contained in the powernv platform
> ---
>  arch/powerpc/platforms/powernv/Makefile  |   2 +-
>  arch/powerpc/platforms/powernv/pci.h |   1 +
>  arch/powerpc/platforms/powernv/eeh-powernv.c |   1 +
>  arch/powerpc/platforms/powernv/npu-dma.c |  24 +++-
>  arch/powerpc/platforms/powernv/nvlinkgpu.c   | 137 +++
>  5 files changed, 162 insertions(+), 3 deletions(-)
>  create mode 100644 arch/powerpc/platforms/powernv/nvlinkgpu.c
> 
> diff --git a/arch/powerpc/platforms/powernv/Makefile 
> b/arch/powerpc/platforms/powernv/Makefile
> index da2e99efbd04..60a10d3b36eb 100644
> --- a/arch/powerpc/platforms/powernv/Makefile
> +++ b/arch/powerpc/platforms/powernv/Makefile
> @@ -6,7 +6,7 @@ obj-y += opal-msglog.o opal-hmi.o 
> opal-power.o opal-irqchip.o
>  obj-y+= opal-kmsg.o opal-powercap.o opal-psr.o 
> opal-sensor-groups.o
>  
>  obj-$(CONFIG_SMP)+= smp.o subcore.o subcore-asm.o
> -obj-$(CONFIG_PCI)+= pci.o pci-ioda.o npu-dma.o pci-ioda-tce.o
> +obj-$(CONFIG_PCI)+= pci.o pci-ioda.o npu-dma.o pci-ioda-tce.o nvlinkgpu.o
>  obj-$(CONFIG_CXL_BASE)   += pci-cxl.o
>  obj-$(CONFIG_EEH)+= eeh-powernv.o
>  obj-$(CONFIG_PPC_SCOM)   += opal-xscom.o
> diff --git a/arch/powerpc/platforms/powernv/pci.h 
> b/arch/powerpc/platforms/powernv/pci.h
> index 8e36da379252..9fd3f391482c 100644
> --- a/arch/powerpc/platforms/powernv/pci.h
> +++ b/arch/powerpc/platforms/powernv/pci.h
> @@ -250,5 +250,6 @@ extern void pnv_pci_unlink_table_and_group(struct 
> iommu_table *tbl,
>  extern void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
>   void *tce_mem, u64 tce_size,
>   u64 dma_offset, unsigned int page_shift);
> +extern void pnv_try_isolate_nvidia_v100(struct pci_dev *gpdev);
>  
>  #endif /* __POWERNV_PCI_H */
> diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c 
> b/arch/powerpc/platforms/powernv/eeh-powernv.c
> index f38078976c5d..464b097d9635 100644
> --- a/arch/powerpc/platforms/powernv/eeh-powernv.c
> +++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
> @@ -937,6 +937,7 @@ void pnv_pci_reset_secondary_bus(struct pci_dev *dev)
>   pnv_eeh_bridge_reset(dev, EEH_RESET_HOT);
>   pnv_eeh_bridge_reset(dev, EEH_RESET_DEACTIVATE);
>   }
> + pnv_try_isolate_nvidia_v100(dev);
>  }
>  
>  static void pnv_eeh_wait_for_pending(struct pci_dn *pdn, const char *type,
> diff --git a/arch/powerpc/platforms/powernv/npu-dma.c 
> b/arch/powerpc/platforms/powernv/npu-dma.c
> index dc23d9d2a7d9..d4f9ee6222b5 100644
> --- a/arch/powerpc/platforms/powernv/npu-dma.c
> +++ b/arch/powerpc/platforms/powernv/npu-dma.c
> @@ -22,6 +22,23 @@
>  
>  #include "pci.h"
>  
> +static bool isolate_nvlink;
> +
> +static int __init parse_isolate_nvlink(char *p)
> +{
> + bool val;
> +
> + if (!p)
> + val = true;
> + else if (kstrtobool(p, &val))
> +

Re: [RFC PATCH kernel v2] powerpc/powernv: Isolate NVLinks between GV100GL on Witherspoon

2019-04-04 Thread Alex Williamson
On Thu,  4 Apr 2019 16:23:24 +1100
Alexey Kardashevskiy  wrote:

> The NVIDIA V100 SXM2 GPUs are connected to the CPU via PCIe links and
> (on POWER9) NVLinks. In addition to that, GPUs themselves have direct
> peer to peer NVLinks in groups of 2 to 4 GPUs. At the moment the POWERNV
> platform puts all interconnected GPUs to the same IOMMU group.
> 
> However the user may want to pass individual GPUs to the userspace so
> in order to do so we need to put them into separate IOMMU groups and
> cut off the interconnects.
> 
> Thankfully V100 GPUs implement an interface to do so by programming link
> disabling mask to BAR0 of a GPU. Once a link is disabled in a GPU using
> this interface, it cannot be re-enabled until the secondary bus reset is
> issued to the GPU.
> 
> This adds an extra step to the secondary bus reset handler (the one used
> for such GPUs) to block NVLinks to GPUs which do not belong to the same
> group as the GPU being reset.
> 
> This adds a new "isolate_nvlink" kernel parameter to allow GPU isolation;
> when enabled, every GPU gets its own IOMMU group. The new parameter is off
> by default to preserve the existing behaviour.
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
> Changes:
> v2:
> * this is rework of [PATCH kernel RFC 0/2] vfio, powerpc/powernv: Isolate 
> GV100GL
> but this time it is contained in the powernv platform
> ---
>  arch/powerpc/platforms/powernv/Makefile  |   2 +-
>  arch/powerpc/platforms/powernv/pci.h |   1 +
>  arch/powerpc/platforms/powernv/eeh-powernv.c |   1 +
>  arch/powerpc/platforms/powernv/npu-dma.c |  24 +++-
>  arch/powerpc/platforms/powernv/nvlinkgpu.c   | 131 +++
>  5 files changed, 156 insertions(+), 3 deletions(-)
>  create mode 100644 arch/powerpc/platforms/powernv/nvlinkgpu.c
> 
> diff --git a/arch/powerpc/platforms/powernv/Makefile 
> b/arch/powerpc/platforms/powernv/Makefile
> index da2e99efbd04..60a10d3b36eb 100644
> --- a/arch/powerpc/platforms/powernv/Makefile
> +++ b/arch/powerpc/platforms/powernv/Makefile
> @@ -6,7 +6,7 @@ obj-y += opal-msglog.o opal-hmi.o 
> opal-power.o opal-irqchip.o
>  obj-y+= opal-kmsg.o opal-powercap.o opal-psr.o 
> opal-sensor-groups.o
>  
>  obj-$(CONFIG_SMP)+= smp.o subcore.o subcore-asm.o
> -obj-$(CONFIG_PCI)+= pci.o pci-ioda.o npu-dma.o pci-ioda-tce.o
> +obj-$(CONFIG_PCI)+= pci.o pci-ioda.o npu-dma.o pci-ioda-tce.o nvlinkgpu.o
>  obj-$(CONFIG_CXL_BASE)   += pci-cxl.o
>  obj-$(CONFIG_EEH)+= eeh-powernv.o
>  obj-$(CONFIG_PPC_SCOM)   += opal-xscom.o
> diff --git a/arch/powerpc/platforms/powernv/pci.h 
> b/arch/powerpc/platforms/powernv/pci.h
> index 8e36da379252..9fd3f391482c 100644
> --- a/arch/powerpc/platforms/powernv/pci.h
> +++ b/arch/powerpc/platforms/powernv/pci.h
> @@ -250,5 +250,6 @@ extern void pnv_pci_unlink_table_and_group(struct 
> iommu_table *tbl,
>  extern void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
>   void *tce_mem, u64 tce_size,
>   u64 dma_offset, unsigned int page_shift);
> +extern void pnv_try_isolate_nvidia_v100(struct pci_dev *gpdev);
>  
>  #endif /* __POWERNV_PCI_H */
> diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c 
> b/arch/powerpc/platforms/powernv/eeh-powernv.c
> index f38078976c5d..464b097d9635 100644
> --- a/arch/powerpc/platforms/powernv/eeh-powernv.c
> +++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
> @@ -937,6 +937,7 @@ void pnv_pci_reset_secondary_bus(struct pci_dev *dev)
>   pnv_eeh_bridge_reset(dev, EEH_RESET_HOT);
>   pnv_eeh_bridge_reset(dev, EEH_RESET_DEACTIVATE);
>   }
> + pnv_try_isolate_nvidia_v100(dev);
>  }
>  
>  static void pnv_eeh_wait_for_pending(struct pci_dn *pdn, const char *type,
> diff --git a/arch/powerpc/platforms/powernv/npu-dma.c 
> b/arch/powerpc/platforms/powernv/npu-dma.c
> index dc23d9d2a7d9..017eae8197e7 100644
> --- a/arch/powerpc/platforms/powernv/npu-dma.c
> +++ b/arch/powerpc/platforms/powernv/npu-dma.c
> @@ -529,6 +529,23 @@ static void pnv_comp_attach_table_group(struct npu_comp 
> *npucomp,
>   ++npucomp->pe_num;
>  }
>  
> +static bool isolate_nvlink;
> +
> +static int __init parse_isolate_nvlink(char *p)
> +{
> + bool val;
> +
> + if (!p)
> + val = true;
> + else if (kstrtobool(p, &val))
> + return -EINVAL;
> +
> + isolate_nvlink = val;
> +
> + return 0;
> +}
> +early_param("isolate_nvlink", parse_isolate_nvlink);
> +
>  struct iommu_table_group *pnv_try_setup_npu_table_group(struct pnv_ioda_pe 
> *pe)
>  {
>   struct iommu_table_group *table_group;
> @@ -549,7 +566,7 @@ struct iommu_table_group 
> *pnv_try_setup_npu_table_group(struct pnv_ioda_pe *pe)
>  
>   hose = pci_bus_to_host(npdev->bus);
>  
> - if (hose->npu) {
> + if (hose->npu && !isolate_nvlink) {
>   table_group = &hose->npu->npucomp.table_group;
>  
>   if (!table_group->group) {
> @@ -559,7 +576,10 @@ struct iommu_table_group 
> 
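The `isolate_nvlink` handler in the quoted patch follows the standard early_param shape: a bare parameter means "true", otherwise the string is parsed with kstrtobool(). A userspace sketch, where `parse_bool()` mimics only the subset of kstrtobool() behaviour the handler relies on:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Sketch of the parse_isolate_nvlink() early_param handler.
 * parse_bool() is a toy stand-in for kstrtobool(), accepting the
 * common 1/0/y/n/on/off spellings; the kernel version also matches
 * by first character. */

static bool isolate_nvlink;

static int parse_bool(const char *s, bool *res)
{
	if (!strcmp(s, "1") || !strcmp(s, "y") || !strcmp(s, "on")) {
		*res = true;
		return 0;
	}
	if (!strcmp(s, "0") || !strcmp(s, "n") || !strcmp(s, "off")) {
		*res = false;
		return 0;
	}
	return -1;		/* -EINVAL in the kernel */
}

static int parse_isolate_nvlink(const char *p)
{
	bool val;

	if (!p)			/* bare "isolate_nvlink" on the command line */
		val = true;
	else if (parse_bool(p, &val))
		return -1;	/* unparseable: leave the setting untouched */

	isolate_nvlink = val;
	return 0;
}
```

In the kernel this function is registered with `early_param("isolate_nvlink", ...)`, so it runs before the PHB setup that decides the IOMMU group layout.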

Re: [PATCH kernel RFC 2/2] vfio-pci-nvlink2: Implement interconnect isolation

2019-03-22 Thread Alex Williamson
On Fri, 22 Mar 2019 14:08:38 +1100
David Gibson  wrote:

> On Thu, Mar 21, 2019 at 12:19:34PM -0600, Alex Williamson wrote:
> > On Thu, 21 Mar 2019 10:56:00 +1100
> > David Gibson  wrote:
> >   
> > > On Wed, Mar 20, 2019 at 01:09:08PM -0600, Alex Williamson wrote:  
> > > > On Wed, 20 Mar 2019 15:38:24 +1100
> > > > David Gibson  wrote:
> > > > 
> > > > > On Tue, Mar 19, 2019 at 10:36:19AM -0600, Alex Williamson wrote:
> > > > > > On Fri, 15 Mar 2019 19:18:35 +1100
> > > > > > Alexey Kardashevskiy  wrote:
> > > > > >   
> > > > > > > The NVIDIA V100 SXM2 GPUs are connected to the CPU via PCIe links 
> > > > > > > and
> > > > > > > (on POWER9) NVLinks. In addition to that, GPUs themselves have 
> > > > > > > direct
> > > > > > > peer to peer NVLinks in groups of 2 to 4 GPUs. At the moment the 
> > > > > > > POWERNV
> > > > > > > platform puts all interconnected GPUs to the same IOMMU group.
> > > > > > > 
> > > > > > > However the user may want to pass individual GPUs to the 
> > > > > > > userspace so
> > > > > > > in order to do so we need to put them into separate IOMMU groups 
> > > > > > > and
> > > > > > > cut off the interconnects.
> > > > > > > 
> > > > > > > Thankfully V100 GPUs implement an interface to do so by programming 
> > > > > > > link
> > > > > > > disabling mask to BAR0 of a GPU. Once a link is disabled in a GPU 
> > > > > > > using
> > > > > > > this interface, it cannot be re-enabled until the secondary bus 
> > > > > > > reset is
> > > > > > > issued to the GPU.
> > > > > > > 
> > > > > > > This defines a reset_done() handler for V100 NVlink2 device which
> > > > > > > determines what links need to be disabled. This relies on presence
> > > > > > > of the new "ibm,nvlink-peers" device tree property of a GPU 
> > > > > > > telling which
> > > > > > > PCI peers it is connected to (which includes NVLink bridges or 
> > > > > > > peer GPUs).
> > > > > > > 
> > > > > > > This does not change the existing behaviour and instead adds
> > > > > > > a new "isolate_nvlink" kernel parameter to allow such isolation.
> > > > > > > 
> > > > > > > The alternative approaches would be:
> > > > > > > 
> > > > > > > 1. do this in the system firmware (skiboot) but for that we would 
> > > > > > > need
> > > > > > > to tell skiboot via an additional OPAL call whether or not we 
> > > > > > > want this
> > > > > > > isolation - skiboot is unaware of IOMMU groups.
> > > > > > > 
> > > > > > > 2. do this in the secondary bus reset handler in the POWERNV 
> > > > > > > platform -
> > > > > > > the problem with that is at that point the device is not enabled, 
> > > > > > > i.e.
> > > > > > > config space is not restored so we need to enable the device 
> > > > > > > (i.e. MMIO
> > > > > > > bit in CMD register + program valid address to BAR0) in order to 
> > > > > > > disable
> > > > > > > links and then perhaps undo all this initialization to bring the 
> > > > > > > device
> > > > > > > back to the state where pci_try_reset_function() expects it to 
> > > > > > > be.  
> > > > > > 
> > > > > > The trouble seems to be that this approach only maintains the 
> > > > > > isolation
> > > > > > exposed by the IOMMU group when vfio-pci is the active driver for 
> > > > > > the
> > > > > > device.  IOMMU groups can be used by any driver and the IOMMU core 
> > > > > > is
> > > > > > incorporating groups in various ways.  
> > > > > 
> > > > > I don't think that reasoning is quite right.  An IOMMU group doesn't
> > > > > necessarily represent devices which *are* isolated, just devices whi

Re: [PATCH kernel RFC 2/2] vfio-pci-nvlink2: Implement interconnect isolation

2019-03-21 Thread Alex Williamson
On Thu, 21 Mar 2019 10:56:00 +1100
David Gibson  wrote:

> On Wed, Mar 20, 2019 at 01:09:08PM -0600, Alex Williamson wrote:
> > On Wed, 20 Mar 2019 15:38:24 +1100
> > David Gibson  wrote:
> >   
> > > On Tue, Mar 19, 2019 at 10:36:19AM -0600, Alex Williamson wrote:  
> > > > On Fri, 15 Mar 2019 19:18:35 +1100
> > > > Alexey Kardashevskiy  wrote:
> > > > 
> > > > > The NVIDIA V100 SXM2 GPUs are connected to the CPU via PCIe links and
> > > > > (on POWER9) NVLinks. In addition to that, GPUs themselves have direct
> > > > > peer to peer NVLinks in groups of 2 to 4 GPUs. At the moment the 
> > > > > POWERNV
> > > > > platform puts all interconnected GPUs to the same IOMMU group.
> > > > > 
> > > > > However the user may want to pass individual GPUs to the userspace so
> > > > > in order to do so we need to put them into separate IOMMU groups and
> > > > > cut off the interconnects.
> > > > > 
> > > > > Thankfully V100 GPUs implement an interface to do so by programming link
> > > > > disabling mask to BAR0 of a GPU. Once a link is disabled in a GPU 
> > > > > using
> > > > > this interface, it cannot be re-enabled until the secondary bus reset 
> > > > > is
> > > > > issued to the GPU.
> > > > > 
> > > > > This defines a reset_done() handler for V100 NVlink2 device which
> > > > > determines what links need to be disabled. This relies on presence
> > > > > of the new "ibm,nvlink-peers" device tree property of a GPU telling 
> > > > > which
> > > > > PCI peers it is connected to (which includes NVLink bridges or peer 
> > > > > GPUs).
> > > > > 
> > > > > This does not change the existing behaviour and instead adds
> > > > > a new "isolate_nvlink" kernel parameter to allow such isolation.
> > > > > 
> > > > > The alternative approaches would be:
> > > > > 
> > > > > 1. do this in the system firmware (skiboot) but for that we would need
> > > > > to tell skiboot via an additional OPAL call whether or not we want 
> > > > > this
> > > > > isolation - skiboot is unaware of IOMMU groups.
> > > > > 
> > > > > 2. do this in the secondary bus reset handler in the POWERNV platform 
> > > > > -
> > > > > the problem with that is at that point the device is not enabled, i.e.
> > > > > config space is not restored so we need to enable the device (i.e. 
> > > > > MMIO
> > > > > bit in CMD register + program valid address to BAR0) in order to 
> > > > > disable
> > > > > links and then perhaps undo all this initialization to bring the 
> > > > > device
> > > > > back to the state where pci_try_reset_function() expects it to be.
> > > > 
> > > > The trouble seems to be that this approach only maintains the isolation
> > > > exposed by the IOMMU group when vfio-pci is the active driver for the
> > > > device.  IOMMU groups can be used by any driver and the IOMMU core is
> > > > incorporating groups in various ways.
> > > 
> > > I don't think that reasoning is quite right.  An IOMMU group doesn't
> > > necessarily represent devices which *are* isolated, just devices which
> > > *can be* isolated.  There are plenty of instances when we don't need
> > > to isolate devices in different IOMMU groups: passing both groups to
> > > the same guest or userspace VFIO driver for example, or indeed when
> > > both groups are owned by regular host kernel drivers.
> > > 
> > > In at least some of those cases we also don't want to isolate the
> > > devices when we don't have to, usually for performance reasons.  
> > 
> > I see IOMMU groups as representing the current isolation of the device,
> > not just the possible isolation.  If there are ways to break down that
> > isolation then ideally the group would be updated to reflect it.  The
> > ACS disable patches seem to support this, at boot time we can choose to
> > disable ACS at certain points in the topology to favor peer-to-peer
> > performance over isolation.  This is then reflected in the group
> > composition, because even though ACS *can be* enabled at the given
> > isolation points, it's intentionally not with this option.  Whether or
&

Re: [PATCH kernel RFC 2/2] vfio-pci-nvlink2: Implement interconnect isolation

2019-03-20 Thread Alex Williamson
On Wed, 20 Mar 2019 15:38:24 +1100
David Gibson  wrote:

> On Tue, Mar 19, 2019 at 10:36:19AM -0600, Alex Williamson wrote:
> > On Fri, 15 Mar 2019 19:18:35 +1100
> > Alexey Kardashevskiy  wrote:
> >   
> > > The NVIDIA V100 SXM2 GPUs are connected to the CPU via PCIe links and
> > > (on POWER9) NVLinks. In addition to that, GPUs themselves have direct
> > > peer to peer NVLinks in groups of 2 to 4 GPUs. At the moment the POWERNV
> > > platform puts all interconnected GPUs to the same IOMMU group.
> > > 
> > > However the user may want to pass individual GPUs to the userspace so
> > > in order to do so we need to put them into separate IOMMU groups and
> > > cut off the interconnects.
> > > 
> > > Thankfully V100 GPUs implement an interface to do so by programming link
> > > disabling mask to BAR0 of a GPU. Once a link is disabled in a GPU using
> > > this interface, it cannot be re-enabled until the secondary bus reset is
> > > issued to the GPU.
> > > 
> > > This defines a reset_done() handler for V100 NVlink2 device which
> > > determines what links need to be disabled. This relies on presence
> > > of the new "ibm,nvlink-peers" device tree property of a GPU telling which
> > > PCI peers it is connected to (which includes NVLink bridges or peer GPUs).
> > > 
> > > This does not change the existing behaviour and instead adds
> > > a new "isolate_nvlink" kernel parameter to allow such isolation.
> > > 
> > > The alternative approaches would be:
> > > 
> > > 1. do this in the system firmware (skiboot) but for that we would need
> > > to tell skiboot via an additional OPAL call whether or not we want this
> > > isolation - skiboot is unaware of IOMMU groups.
> > > 
> > > 2. do this in the secondary bus reset handler in the POWERNV platform -
> > > the problem with that is at that point the device is not enabled, i.e.
> > > config space is not restored so we need to enable the device (i.e. MMIO
> > > bit in CMD register + program valid address to BAR0) in order to disable
> > > links and then perhaps undo all this initialization to bring the device
> > > back to the state where pci_try_reset_function() expects it to be.  
> > 
> > The trouble seems to be that this approach only maintains the isolation
> > exposed by the IOMMU group when vfio-pci is the active driver for the
> > device.  IOMMU groups can be used by any driver and the IOMMU core is
> > incorporating groups in various ways.  
> 
> I don't think that reasoning is quite right.  An IOMMU group doesn't
> necessarily represent devices which *are* isolated, just devices which
> *can be* isolated.  There are plenty of instances when we don't need
> to isolate devices in different IOMMU groups: passing both groups to
> the same guest or userspace VFIO driver for example, or indeed when
> both groups are owned by regular host kernel drivers.
> 
> In at least some of those cases we also don't want to isolate the
> devices when we don't have to, usually for performance reasons.

I see IOMMU groups as representing the current isolation of the device,
not just the possible isolation.  If there are ways to break down that
isolation then ideally the group would be updated to reflect it.  The
ACS disable patches seem to support this, at boot time we can choose to
disable ACS at certain points in the topology to favor peer-to-peer
performance over isolation.  This is then reflected in the group
composition, because even though ACS *can be* enabled at the given
isolation points, it's intentionally not with this option.  Whether or
not a given user who owns multiple devices needs that isolation is
really beside the point, the user can choose to connect groups via IOMMU
mappings or reconfigure the system to disable ACS and potentially more
direct routing.  The IOMMU groups are still accurately reflecting the
topology and IOMMU based isolation.

> > So, if there's a device specific
> > way to configure the isolation reported in the group, which requires
> > some sort of active management against things like secondary bus
> > resets, then I think we need to manage it above the attached endpoint
> > driver.  
> 
> The problem is that above the endpoint driver, we don't actually have
> enough information about what should be isolated.  For VFIO we want to
> isolate things if they're in different containers, for most regular
> host kernel drivers we don't need to isolate at all (although we might
> as well when it doesn't have a cost).

This idea that we only want to isolate things if they're in different
containers is 

Re: [PATCH kernel RFC 2/2] vfio-pci-nvlink2: Implement interconnect isolation

2019-03-19 Thread Alex Williamson
On Fri, 15 Mar 2019 19:18:35 +1100
Alexey Kardashevskiy  wrote:

> The NVIDIA V100 SXM2 GPUs are connected to the CPU via PCIe links and
> (on POWER9) NVLinks. In addition to that, GPUs themselves have direct
> peer to peer NVLinks in groups of 2 to 4 GPUs. At the moment the POWERNV
> platform puts all interconnected GPUs to the same IOMMU group.
> 
> However the user may want to pass individual GPUs to the userspace so
> in order to do so we need to put them into separate IOMMU groups and
> cut off the interconnects.
> 
> Thankfully V100 GPUs implement an interface to do so by programming link
> disabling mask to BAR0 of a GPU. Once a link is disabled in a GPU using
> this interface, it cannot be re-enabled until the secondary bus reset is
> issued to the GPU.
> 
> This defines a reset_done() handler for V100 NVlink2 device which
> determines what links need to be disabled. This relies on presence
> of the new "ibm,nvlink-peers" device tree property of a GPU telling which
> PCI peers it is connected to (which includes NVLink bridges or peer GPUs).
> 
> This does not change the existing behaviour and instead adds
> a new "isolate_nvlink" kernel parameter to allow such isolation.
> 
> The alternative approaches would be:
> 
> 1. do this in the system firmware (skiboot) but for that we would need
> to tell skiboot via an additional OPAL call whether or not we want this
> isolation - skiboot is unaware of IOMMU groups.
> 
> 2. do this in the secondary bus reset handler in the POWERNV platform -
> the problem with that is at that point the device is not enabled, i.e.
> config space is not restored so we need to enable the device (i.e. MMIO
> bit in CMD register + program valid address to BAR0) in order to disable
> links and then perhaps undo all this initialization to bring the device
> back to the state where pci_try_reset_function() expects it to be.

The trouble seems to be that this approach only maintains the isolation
exposed by the IOMMU group when vfio-pci is the active driver for the
device.  IOMMU groups can be used by any driver and the IOMMU core is
incorporating groups in various ways.  So, if there's a device specific
way to configure the isolation reported in the group, which requires
some sort of active management against things like secondary bus
resets, then I think we need to manage it above the attached endpoint
driver.  Ideally I'd see this as a set of PCI quirks so that we might
leverage it beyond POWER platforms.  I'm not sure how we get past the
reliance on device tree properties that we won't have on other
platforms though, if only NVIDIA could at least open a spec addressing
the discovery and configuration of NVLink registers on their
devices :-\  Thanks,

Alex


Re: [PATCH kernel] vfio/spapr_tce: Skip unsetting already unset table

2019-02-19 Thread Alex Williamson
On Wed, 13 Feb 2019 11:18:21 +1100
Alexey Kardashevskiy  wrote:

> On 13/02/2019 07:52, Alex Williamson wrote:
> > On Mon, 11 Feb 2019 18:49:17 +1100
> > Alexey Kardashevskiy  wrote:
> >   
> >> VFIO TCE IOMMU v2 owns IOMMU tables so when detach a IOMMU group from
> >> a container, we need to unset those from a group so we call unset_window()
> >> so do we unconditionally. We also unset tables when removing a DMA window  
> > 
> > Patch looks ok, but this first sentence trails off into a bit of a word
> > salad.  Care to refine a bit?  Thanks,  
> 
> Fair comment, sorry for the salad. How about this?
> 
> ===
> VFIO TCE IOMMU v2 owns IOMMU tables. When we detach an IOMMU group from
> a container, we need to unset these tables from the group which we do by
> calling unset_window(). We also unset tables when removing a DMA window
> via the VFIO_IOMMU_SPAPR_TCE_REMOVE ioctl.
> ===


Applied to vfio next branch with updated commit log and David's R-b.
Thanks,

Alex

> >   
> >> via the VFIO_IOMMU_SPAPR_TCE_REMOVE ioctl.
> >>
> >> The window removal checks if the table actually exists (hidden inside
> >> tce_iommu_find_table()) but the group detaching does not so the user
> >> may see duplicating messages:
> >> pci 0009:03 : [PE# fd] Removing DMA window #0
> >> pci 0009:03 : [PE# fd] Removing DMA window #1
> >> pci 0009:03 : [PE# fd] Removing DMA window #0
> >> pci 0009:03 : [PE# fd] Removing DMA window #1
> >>
> >> At the moment this is not a problem as the second invocation
> >> of unset_window() writes zeroes to the HW registers again and exits early
> >> as there is no table.
> >>
> >> Signed-off-by: Alexey Kardashevskiy 
> >> ---
> >>
> >> When doing VFIO PCI hot unplug, first we remove the DMA window and
> >> set container->tables[num] - this is a first couple of messages.
> >> Then we detach the group and then we see another couple of the same
> >> messages which confused myself.
> >> ---
> >>  drivers/vfio/vfio_iommu_spapr_tce.c | 3 ++-
> >>  1 file changed, 2 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
> >> b/drivers/vfio/vfio_iommu_spapr_tce.c
> >> index c424913..8dbb270 100644
> >> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> >> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> >> @@ -1235,7 +1235,8 @@ static void tce_iommu_release_ownership_ddw(struct 
> >> tce_container *container,
> >>}
> >>  
> >>for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i)
> >> -  table_group->ops->unset_window(table_group, i);
> >> +  if (container->tables[i])
> >> +  table_group->ops->unset_window(table_group, i);
> >>  
> >>table_group->ops->release_ownership(table_group);
> >>  }  
> >   
> 



Re: [PATCH 1/5] vfio/type1: use pinned_vm instead of locked_vm to account pinned pages

2019-02-13 Thread Alex Williamson
On Tue, 12 Feb 2019 19:26:50 -0500
Daniel Jordan  wrote:

> On Tue, Feb 12, 2019 at 11:41:10AM -0700, Alex Williamson wrote:
> > Daniel Jordan  wrote:  
> > > On Mon, Feb 11, 2019 at 03:56:20PM -0700, Jason Gunthorpe wrote:  
> > > > I haven't looked at this super closely, but how does this stuff work?
> > > > 
> > > > do_mlock doesn't touch pinned_vm, and this doesn't touch locked_vm...
> > > > 
> > > > Shouldn't all this be 'if (locked_vm + pinned_vm < RLIMIT_MEMLOCK)' ?
> > > >
> > > > Otherwise MEMLOCK is really doubled..
> > > 
> > > So this has been a problem for some time, but it's not as easy as adding 
> > > them
> > > together, see [1][2] for a start.
> > > 
> > > The locked_vm/pinned_vm issue definitely needs fixing, but all this 
> > > series is
> > > trying to do is account to the right counter.  
> 
> Thanks for taking a look, Alex.
> 
> > This still makes me nervous because we have userspace dependencies on
> > setting process locked memory.  
> 
> Could you please expand on this?  Trying to get more context.

VFIO is a userspace driver interface and the pinned/locked page
accounting we're doing here is trying to prevent a user from exceeding
their locked memory limits.  Thus a VM management tool or unprivileged
userspace driver needs to have appropriate locked memory limits
configured for their use case.  Currently we do not have a unified
accounting scheme, so if a page is mlock'd by the user and also mapped
through VFIO for DMA, it's accounted twice, these both increment
locked_vm and userspace needs to manage that.  If pinned memory
and locked memory are now two separate buckets and we're only comparing
one of them against the locked memory limit, then it seems we have
effectively doubled the user's locked memory for this use case, as
Jason questioned.  The user could mlock one page and DMA map another,
they're both "locked", but now they only take one slot in each bucket.

If we continue forward with using a separate bucket here, userspace
could infer that accounting is unified and lower the user's locked
memory limit, or exploit the gap that their effective limit might
actually exceed system memory.  In the former case, if we do eventually
correct to compare the total of the combined buckets against the user's
locked memory limits, we'll break users that have adapted their locked
memory limits to meet the apparent needs.  In the latter case, the
inconsistent accounting is potentially an attack vector.
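The double-count Alex describes can be made concrete in a few lines. This is a hypothetical userspace sketch, not kernel code: `within_memlock_limit` is an invented helper name, and the point is only that the two buckets must be summed before comparing against the single RLIMIT_MEMLOCK budget (in pages, i.e. after `rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT`):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch of the combined check: compare the sum of both
 * buckets against the one RLIMIT_MEMLOCK budget, so a page that is
 * mlock'd plus a page that is DMA-pinned cannot together exceed the
 * limit.  All quantities are in pages; names are illustrative, not
 * the kernel's API. */
static bool within_memlock_limit(long locked_vm, long pinned_vm,
				 long npages, long limit_pages)
{
	return locked_vm + pinned_vm + npages <= limit_pages;
}
```

With separate per-bucket checks, the second case below would wrongly pass, which is exactly the "effectively doubled" limit being discussed.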

> > There's a user visible difference if we
> > account for them in the same bucket vs separate.  Perhaps we're
> > counting in the wrong bucket now, but if we "fix" that and userspace
> > adapts, how do we ever go back to accounting both mlocked and pinned
> > memory combined against rlimit?  Thanks,  
> 
> PeterZ posted an RFC that addresses this point[1].  It kept pinned_vm and
> locked_vm accounting separate, but allowed the two to be added safely to be
> compared against RLIMIT_MEMLOCK.

Unless I'm incorrect in the concerns above, I don't see how we can
convert vfio before this occurs.
 
> Anyway, until some solution is agreed on, are there objections to converting
> locked_vm to an atomic, to avoid user-visible changes, instead of switching
> locked_vm users to pinned_vm?

Seems that as long as we have separate buckets that are compared
individually to rlimit that we've got problems, it's just a matter of
where they're exposed based on which bucket is used for which
interface.  Thanks,

Alex
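The conversion under discussion replaces mmap_sem-protected `locked_vm` arithmetic with lock-free `atomic64` updates of `pinned_vm`. A minimal userspace sketch of that "add, check, undo on failure" pattern using C11 atomics, with `capable` and the page-granular `limit` standing in for `capable(CAP_IPC_LOCK)` and `rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT` (names are illustrative):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace model of the pattern the series applies to mm->pinned_vm:
 * optimistically add, then undo the addition if the new total exceeds
 * the limit and the caller lacks the capability override. */
static bool try_pin(atomic_long *pinned_vm, long npages, long limit,
		    bool capable)
{
	/* atomic_fetch_add returns the old value; compute the new total */
	long pinned = atomic_fetch_add(pinned_vm, npages) + npages;

	if (pinned > limit && !capable) {
		atomic_fetch_sub(pinned_vm, npages);	/* undo on failure */
		return false;
	}
	return true;
}
```

Note the brief window in which the counter transiently exceeds the limit before the undo; that is inherent to the lock-free form and harmless for accounting, but is one reason the locked form held mmap_sem across the whole check.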


Re: [PATCH kernel] vfio/spapr_tce: Skip unsetting already unset table

2019-02-12 Thread Alex Williamson
On Mon, 11 Feb 2019 18:49:17 +1100
Alexey Kardashevskiy  wrote:

> VFIO TCE IOMMU v2 owns IOMMU tables so when detach a IOMMU group from
> a container, we need to unset those from a group so we call unset_window()
> so do we unconditionally. We also unset tables when removing a DMA window

Patch looks ok, but this first sentence trails off into a bit of a word
salad.  Care to refine a bit?  Thanks,

Alex

> via the VFIO_IOMMU_SPAPR_TCE_REMOVE ioctl.
> 
> The window removal checks if the table actually exists (hidden inside
> tce_iommu_find_table()) but the group detaching does not so the user
> may see duplicating messages:
> pci 0009:03 : [PE# fd] Removing DMA window #0
> pci 0009:03 : [PE# fd] Removing DMA window #1
> pci 0009:03 : [PE# fd] Removing DMA window #0
> pci 0009:03 : [PE# fd] Removing DMA window #1
> 
> At the moment this is not a problem as the second invocation
> of unset_window() writes zeroes to the HW registers again and exits early
> as there is no table.
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
> 
> When doing VFIO PCI hot unplug, first we remove the DMA window and
> set container->tables[num] - this is a first couple of messages.
> Then we detach the group and then we see another couple of the same
> messages which confused myself.
> ---
>  drivers/vfio/vfio_iommu_spapr_tce.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
> b/drivers/vfio/vfio_iommu_spapr_tce.c
> index c424913..8dbb270 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -1235,7 +1235,8 @@ static void tce_iommu_release_ownership_ddw(struct 
> tce_container *container,
>   }
>  
>   for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i)
> - table_group->ops->unset_window(table_group, i);
> + if (container->tables[i])
> + table_group->ops->unset_window(table_group, i);
>  
>   table_group->ops->release_ownership(table_group);
>  }
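The one-line fix above is an idempotent-teardown guard: only windows that were actually set get torn down, so a second teardown pass (detach after a window has already been removed via ioctl) becomes a no-op instead of re-logging "Removing DMA window" messages. A minimal standalone model, with invented types and a counter standing in for the log messages:

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_TABLES 2	/* stands in for IOMMU_TABLE_GROUP_MAX_TABLES */

/* Illustrative model, not the kernel's structures. */
struct table { bool present; };

static int removals;	/* counts unset calls; stands in for the log lines */

static void unset_window(struct table *t)
{
	t->present = false;
	removals++;
}

static void release_ownership(struct table *tables, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (tables[i].present)	/* the fix: skip already-unset tables */
			unset_window(&tables[i]);
}
```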



Re: [PATCH 2/5] vfio/spapr_tce: use pinned_vm instead of locked_vm to account pinned pages

2019-02-12 Thread Alex Williamson
On Tue, 12 Feb 2019 17:56:18 +1100
Alexey Kardashevskiy  wrote:

> On 12/02/2019 09:44, Daniel Jordan wrote:
> > Beginning with bc3e53f682d9 ("mm: distinguish between mlocked and pinned
> > pages"), locked and pinned pages are accounted separately.  The SPAPR
> > TCE VFIO IOMMU driver accounts pinned pages to locked_vm; use pinned_vm
> > instead.
> > 
> > pinned_vm recently became atomic and so no longer relies on mmap_sem
> > held as writer: delete.
> > 
> > Signed-off-by: Daniel Jordan 
> > ---
> >  Documentation/vfio.txt  |  6 +--
> >  drivers/vfio/vfio_iommu_spapr_tce.c | 64 ++---
> >  2 files changed, 33 insertions(+), 37 deletions(-)
> > 
> > diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
> > index f1a4d3c3ba0b..fa37d65363f9 100644
> > --- a/Documentation/vfio.txt
> > +++ b/Documentation/vfio.txt
> > @@ -308,7 +308,7 @@ This implementation has some specifics:
> > currently there is no way to reduce the number of calls. In order to 
> > make
> > things faster, the map/unmap handling has been implemented in real mode
> > which provides an excellent performance which has limitations such as
> > -   inability to do locked pages accounting in real time.
> > +   inability to do pinned pages accounting in real time.
> >  
> >  4) According to sPAPR specification, A Partitionable Endpoint (PE) is an 
> > I/O
> > subtree that can be treated as a unit for the purposes of partitioning 
> > and
> > @@ -324,7 +324,7 @@ This implementation has some specifics:
> > returns the size and the start of the DMA window on the PCI bus.
> >  
> > VFIO_IOMMU_ENABLE
> > -   enables the container. The locked pages accounting
> > +   enables the container. The pinned pages accounting
> > is done at this point. This lets user first to know what
> > the DMA window is and adjust rlimit before doing any real job.

I don't know of a ulimit only covering pinned pages, so for
documentation it seems more correct to continue referring to this as
locked page accounting.

> > @@ -454,7 +454,7 @@ This implementation has some specifics:
> >  
> > PPC64 paravirtualized guests generate a lot of map/unmap requests,
> > and the handling of those includes pinning/unpinning pages and updating
> > -   mm::locked_vm counter to make sure we do not exceed the rlimit.
> > +   mm::pinned_vm counter to make sure we do not exceed the rlimit.
> > The v2 IOMMU splits accounting and pinning into separate operations:
> >  
> > - VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY 
> > ioctls
> > diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
> > b/drivers/vfio/vfio_iommu_spapr_tce.c
> > index c424913324e3..f47e020dc5e4 100644
> > --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> > +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> > @@ -34,9 +34,11 @@
> >  static void tce_iommu_detach_group(void *iommu_data,
> > struct iommu_group *iommu_group);
> >  
> > -static long try_increment_locked_vm(struct mm_struct *mm, long npages)
> > +static long try_increment_pinned_vm(struct mm_struct *mm, long npages)
> >  {
> > -   long ret = 0, locked, lock_limit;
> > +   long ret = 0;
> > +   s64 pinned;
> > +   unsigned long lock_limit;
> >  
> > if (WARN_ON_ONCE(!mm))
> > return -EPERM;
> > @@ -44,39 +46,33 @@ static long try_increment_locked_vm(struct mm_struct 
> > *mm, long npages)
> > if (!npages)
> > return 0;
> >  
> > -   down_write(&mm->mmap_sem);
> > -   locked = mm->locked_vm + npages;
> > +   pinned = atomic64_add_return(npages, &mm->pinned_vm);
> > lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > -   if (locked > lock_limit && !capable(CAP_IPC_LOCK))
> > +   if (pinned > lock_limit && !capable(CAP_IPC_LOCK)) {
> > ret = -ENOMEM;
> > -   else
> > -   mm->locked_vm += npages;
> > +   atomic64_sub(npages, &mm->pinned_vm);
> > +   }
> >  
> > -   pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%ld%s\n", current->pid,
> > +   pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%lu%s\n", current->pid,
> > npages << PAGE_SHIFT,
> > -   mm->locked_vm << PAGE_SHIFT,
> > -   rlimit(RLIMIT_MEMLOCK),
> > -   ret ? " - exceeded" : "");
> > -
> > -   up_write(&mm->mmap_sem);
> > +   atomic64_read(&mm->pinned_vm) << PAGE_SHIFT,
> > +   rlimit(RLIMIT_MEMLOCK), ret ? " - exceeded" : "");
> >  
> > return ret;
> >  }
> >  
> > -static void decrement_locked_vm(struct mm_struct *mm, long npages)
> > +static void decrement_pinned_vm(struct mm_struct *mm, long npages)
> >  {
> > if (!mm || !npages)
> > return;
> >  
> > -   down_write(&mm->mmap_sem);
> > -   if (WARN_ON_ONCE(npages > mm->locked_vm))
> > -   npages = mm->locked_vm;
> > -   mm->locked_vm -= npages;
> > -   pr_debug("[%d] RLIMIT_MEMLOCK -%ld %ld/%ld\n", current->pid,
> > +   if (WARN_ON_ONCE(npages > 

Re: [PATCH 1/5] vfio/type1: use pinned_vm instead of locked_vm to account pinned pages

2019-02-12 Thread Alex Williamson
On Mon, 11 Feb 2019 18:11:53 -0500
Daniel Jordan  wrote:

> On Mon, Feb 11, 2019 at 03:56:20PM -0700, Jason Gunthorpe wrote:
> > On Mon, Feb 11, 2019 at 05:44:33PM -0500, Daniel Jordan wrote:  
> > > @@ -266,24 +267,15 @@ static int vfio_lock_acct(struct vfio_dma *dma, 
> > > long npage, bool async)
> > >   if (!mm)
> > >   return -ESRCH; /* process exited */
> > >  
> > > - ret = down_write_killable(&mm->mmap_sem);
> > > - if (!ret) {
> > > - if (npage > 0) {
> > > - if (!dma->lock_cap) {
> > > - unsigned long limit;
> > > -
> > > - limit = task_rlimit(dma->task,
> > > - RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > + pinned_vm = atomic64_add_return(npage, &mm->pinned_vm);
> > >  
> > > - if (mm->locked_vm + npage > limit)
> > > - ret = -ENOMEM;
> > > - }
> > > + if (npage > 0 && !dma->lock_cap) {
> > > + unsigned long limit = task_rlimit(dma->task, RLIMIT_MEMLOCK) >>
> > > + PAGE_SHIFT;
> > 
> > I haven't looked at this super closely, but how does this stuff work?
> > 
> > do_mlock doesn't touch pinned_vm, and this doesn't touch locked_vm...
> > 
> > Shouldn't all this be 'if (locked_vm + pinned_vm < RLIMIT_MEMLOCK)' ?
> >
> > Otherwise MEMLOCK is really doubled..  
> 
> So this has been a problem for some time, but it's not as easy as adding them
> together, see [1][2] for a start.
> 
> The locked_vm/pinned_vm issue definitely needs fixing, but all this series is
> trying to do is account to the right counter.

This still makes me nervous because we have userspace dependencies on
setting process locked memory.  There's a user visible difference if we
account for them in the same bucket vs separate.  Perhaps we're
counting in the wrong bucket now, but if we "fix" that and userspace
adapts, how do we ever go back to accounting both mlocked and pinned
memory combined against rlimit?  Thanks,

Alex


Re: [PATCH kernel] vfio-pci/nvlink2: Fix ancient gcc warnings

2019-01-23 Thread Alex Williamson
On Wed, 23 Jan 2019 15:07:11 +1100
Alexey Kardashevskiy  wrote:

> Using the {0} construct as a generic initializer is perfectly fine in C,
> however due to a bug in old gcc there is a warning:
> 
>   + /kisskb/src/drivers/vfio/pci/vfio_pci_nvlink2.c: warning: (near
> initialization for 'cap.header') [-Wmissing-braces]:  => 181:9
> 
> Since for whatever reason we still want to compile the modern kernel
> with such an old gcc without warnings, this changes the capabilities
> initialization.
> 
> The gcc bugzilla: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53119
> 
> Signed-off-by: Alexey Kardashevskiy 

Added Fixes: and Reported-by: tags.

Applied to for-linus branch for v5.0.  Thanks,

Alex

> ---
>  drivers/vfio/pci/vfio_pci_nvlink2.c | 30 ++---
>  1 file changed, 15 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c 
> b/drivers/vfio/pci/vfio_pci_nvlink2.c
> index 054a2cf..91d945b 100644
> --- a/drivers/vfio/pci/vfio_pci_nvlink2.c
> +++ b/drivers/vfio/pci/vfio_pci_nvlink2.c
> @@ -178,11 +178,11 @@ static int vfio_pci_nvgpu_add_capability(struct 
> vfio_pci_device *vdev,
>   struct vfio_pci_region *region, struct vfio_info_cap *caps)
>  {
>   struct vfio_pci_nvgpu_data *data = region->data;
> - struct vfio_region_info_cap_nvlink2_ssatgt cap = { 0 };
> -
> - cap.header.id = VFIO_REGION_INFO_CAP_NVLINK2_SSATGT;
> - cap.header.version = 1;
> - cap.tgt = data->gpu_tgt;
> + struct vfio_region_info_cap_nvlink2_ssatgt cap = {
> + .header.id = VFIO_REGION_INFO_CAP_NVLINK2_SSATGT,
> + .header.version = 1,
> + .tgt = data->gpu_tgt
> + };
>  
>   return vfio_info_add_capability(caps, &cap.header, sizeof(cap));
>  }
> @@ -365,18 +365,18 @@ static int vfio_pci_npu2_add_capability(struct 
> vfio_pci_device *vdev,
>   struct vfio_pci_region *region, struct vfio_info_cap *caps)
>  {
>   struct vfio_pci_npu2_data *data = region->data;
> - struct vfio_region_info_cap_nvlink2_ssatgt captgt = { 0 };
> - struct vfio_region_info_cap_nvlink2_lnkspd capspd = { 0 };
> + struct vfio_region_info_cap_nvlink2_ssatgt captgt = {
> + .header.id = VFIO_REGION_INFO_CAP_NVLINK2_SSATGT,
> + .header.version = 1,
> + .tgt = data->gpu_tgt
> + };
> + struct vfio_region_info_cap_nvlink2_lnkspd capspd = {
> + .header.id = VFIO_REGION_INFO_CAP_NVLINK2_LNKSPD,
> + .header.version = 1,
> + .link_speed = data->link_speed
> + };
>   int ret;
>  
> - captgt.header.id = VFIO_REGION_INFO_CAP_NVLINK2_SSATGT;
> - captgt.header.version = 1;
> - captgt.tgt = data->gpu_tgt;
> -
> - capspd.header.id = VFIO_REGION_INFO_CAP_NVLINK2_LNKSPD;
> - capspd.header.version = 1;
> - capspd.link_speed = data->link_speed;
> -
>   ret = vfio_info_add_capability(caps, &captgt.header, sizeof(captgt));
>   if (ret)
>   return ret;



Re: [PATCH kernel] vfio-pci/nvlink2: Fix ancient gcc warnings

2019-01-22 Thread Alex Williamson
Hi Geert,

The below patch comes about from the build regressions and improvements
list you've sent out, but something doesn't add up that we'd be testing
with an old compiler where initialization with { 0 } generates a
"missing braces around initialization" warning.  Is this really the
case or are we missing something here?  There's no harm that I can see
with Alexey's fix, but are these really just false positives from a
compiler bug that we should selectively ignore if the "fix" is less
clean?  Thanks,

Alex

On Wed, 23 Jan 2019 15:07:11 +1100
Alexey Kardashevskiy  wrote:

> Using the {0} construct as a generic initializer is perfectly fine in C,
> however due to a bug in old gcc there is a warning:
> 
>   + /kisskb/src/drivers/vfio/pci/vfio_pci_nvlink2.c: warning: (near
> initialization for 'cap.header') [-Wmissing-braces]:  => 181:9
> 
> Since for whatever reason we still want to compile the modern kernel
> with such an old gcc without warnings, this changes the capabilities
> initialization.
> 
> The gcc bugzilla: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53119
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
>  drivers/vfio/pci/vfio_pci_nvlink2.c | 30 ++---
>  1 file changed, 15 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c 
> b/drivers/vfio/pci/vfio_pci_nvlink2.c
> index 054a2cf..91d945b 100644
> --- a/drivers/vfio/pci/vfio_pci_nvlink2.c
> +++ b/drivers/vfio/pci/vfio_pci_nvlink2.c
> @@ -178,11 +178,11 @@ static int vfio_pci_nvgpu_add_capability(struct 
> vfio_pci_device *vdev,
>   struct vfio_pci_region *region, struct vfio_info_cap *caps)
>  {
>   struct vfio_pci_nvgpu_data *data = region->data;
> - struct vfio_region_info_cap_nvlink2_ssatgt cap = { 0 };
> -
> - cap.header.id = VFIO_REGION_INFO_CAP_NVLINK2_SSATGT;
> - cap.header.version = 1;
> - cap.tgt = data->gpu_tgt;
> + struct vfio_region_info_cap_nvlink2_ssatgt cap = {
> + .header.id = VFIO_REGION_INFO_CAP_NVLINK2_SSATGT,
> + .header.version = 1,
> + .tgt = data->gpu_tgt
> + };
>  
>   return vfio_info_add_capability(caps, &cap.header, sizeof(cap));
>  }
> @@ -365,18 +365,18 @@ static int vfio_pci_npu2_add_capability(struct 
> vfio_pci_device *vdev,
>   struct vfio_pci_region *region, struct vfio_info_cap *caps)
>  {
>   struct vfio_pci_npu2_data *data = region->data;
> - struct vfio_region_info_cap_nvlink2_ssatgt captgt = { 0 };
> - struct vfio_region_info_cap_nvlink2_lnkspd capspd = { 0 };
> + struct vfio_region_info_cap_nvlink2_ssatgt captgt = {
> + .header.id = VFIO_REGION_INFO_CAP_NVLINK2_SSATGT,
> + .header.version = 1,
> + .tgt = data->gpu_tgt
> + };
> + struct vfio_region_info_cap_nvlink2_lnkspd capspd = {
> + .header.id = VFIO_REGION_INFO_CAP_NVLINK2_LNKSPD,
> + .header.version = 1,
> + .link_speed = data->link_speed
> + };
>   int ret;
>  
> - captgt.header.id = VFIO_REGION_INFO_CAP_NVLINK2_SSATGT;
> - captgt.header.version = 1;
> - captgt.tgt = data->gpu_tgt;
> -
> - capspd.header.id = VFIO_REGION_INFO_CAP_NVLINK2_LNKSPD;
> - capspd.header.version = 1;
> - capspd.link_speed = data->link_speed;
> -
>   ret = vfio_info_add_capability(caps, &captgt.header, sizeof(captgt));
>   if (ret)
>   return ret;



Re: [PATCH kernel v7 20/20] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] subdriver

2018-12-20 Thread Alex Williamson
On Fri, 21 Dec 2018 12:50:00 +1100
Alexey Kardashevskiy  wrote:

> On 21/12/2018 12:37, Alex Williamson wrote:
> > On Fri, 21 Dec 2018 12:23:16 +1100
> > Alexey Kardashevskiy  wrote:
> >   
> >> On 21/12/2018 03:46, Alex Williamson wrote:  
> >>> On Thu, 20 Dec 2018 19:23:50 +1100
> >>> Alexey Kardashevskiy  wrote:
> >>> 
> >>>> POWER9 Witherspoon machines come with 4 or 6 V100 GPUs which are not
> >>>> pluggable PCIe devices but still have PCIe links which are used
> >>>> for config space and MMIO. In addition to that the GPUs have 6 NVLinks
> >>>> which are connected to other GPUs and the POWER9 CPU. POWER9 chips
> >>>> have a special unit on a die called an NPU which is an NVLink2 host bus
> >>>> adapter with p2p connections to 2 to 3 GPUs, 3 or 2 NVLinks to each.
> >>>> These systems also support ATS (address translation services) which is
> >>>> a part of the NVLink2 protocol. Such GPUs also share on-board RAM
> >>>> (16GB or 32GB) to the system via the same NVLink2 so a CPU has
> >>>> cache-coherent access to a GPU RAM.
> >>>>
> >>>> This exports GPU RAM to the userspace as a new VFIO device region. This
> >>>> preregisters the new memory as device memory as it might be used for DMA.
> >>>> This inserts pfns from the fault handler as the GPU memory is not onlined
> >>>> until the vendor driver is loaded and trained the NVLinks so doing this
> >>>> earlier causes low level errors which we fence in the firmware so
> >>>> it does not hurt the host system but still better be avoided; for the 
> >>>> same
> >>>> reason this does not map GPU RAM into the host kernel (usual thing for
> >>>> emulated access otherwise).
> >>>>
> >>>> This exports an ATSD (Address Translation Shootdown) register of NPU 
> >>>> which
> >>>> allows TLB invalidations inside GPU for an operating system. The register
> >>>> conveniently occupies a single 64k page. It is also presented to
> >>>> the userspace as a new VFIO device region. One NPU has 8 ATSD registers,
> >>>> each of them can be used for TLB invalidation in a GPU linked to this 
> >>>> NPU.
> >>>> This allocates one ATSD register per an NVLink bridge allowing passing
> >>>> up to 6 registers. Due to the host firmware bug (just recently fixed),
> >>>> only 1 ATSD register per NPU was actually advertised to the host system
> >>>> so this passes that alone register via the first NVLink bridge device in
> >>>> the group which is still enough as QEMU collects them all back and
> >>>> presents to the guest via vPHB to mimic the emulated NPU PHB on the host.
> >>>>
> >>>> In order to provide the userspace with the information about 
> >>>> GPU-to-NVLink
> >>>> connections, this exports an additional capability called "tgt"
> >>>> (which is an abbreviated host system bus address). The "tgt" property
> >>>> tells the GPU its own system address and allows the guest driver to
> >>>> conglomerate the routing information so each GPU knows how to get 
> >>>> directly
> >>>> to the other GPUs.
> >>>>
> >>>> For ATS to work, the nest MMU (an NVIDIA block in a P9 CPU) needs to
> >>>> know LPID (a logical partition ID or a KVM guest hardware ID in other
> >>>> words) and PID (a memory context ID of a userspace process, not to be
> >>>> confused with a linux pid). This assigns a GPU to LPID in the NPU and
> >>>> this is why this adds a listener for KVM on an IOMMU group. A PID comes
> >>>> via NVLink from a GPU and NPU uses a PID wildcard to pass it through.
> >>>>
> >>>> This requires coherent memory and ATSD to be available on the host as
> >>>> the GPU vendor only supports configurations with both features enabled
> >>>> and other configurations are known not to work. Because of this and
> >>>> because of the ways the features are advertised to the host system
> >>>> (which is a device tree with very platform specific properties),
> >>>> this requires enabled POWERNV platform.
> >>>>
> >>>> The V100 GPUs do not advertise any of these capabilities via the config
> >>>> space and there are more than j

Re: [PATCH kernel v7 20/20] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] subdriver

2018-12-20 Thread Alex Williamson
On Fri, 21 Dec 2018 12:23:16 +1100
Alexey Kardashevskiy  wrote:

> On 21/12/2018 03:46, Alex Williamson wrote:
> > On Thu, 20 Dec 2018 19:23:50 +1100
> > Alexey Kardashevskiy  wrote:
> >   
> >> POWER9 Witherspoon machines come with 4 or 6 V100 GPUs which are not
> >> pluggable PCIe devices but still have PCIe links which are used
> >> for config space and MMIO. In addition to that the GPUs have 6 NVLinks
> >> which are connected to other GPUs and the POWER9 CPU. POWER9 chips
> >> have a special unit on a die called an NPU which is an NVLink2 host bus
> >> adapter with p2p connections to 2 to 3 GPUs, 3 or 2 NVLinks to each.
> >> These systems also support ATS (address translation services) which is
> >> a part of the NVLink2 protocol. Such GPUs also share on-board RAM
> >> (16GB or 32GB) to the system via the same NVLink2 so a CPU has
> >> cache-coherent access to a GPU RAM.
> >>
> >> This exports GPU RAM to the userspace as a new VFIO device region. This
> >> preregisters the new memory as device memory as it might be used for DMA.
> >> This inserts pfns from the fault handler as the GPU memory is not onlined
> >> until the vendor driver is loaded and trained the NVLinks so doing this
> >> earlier causes low level errors which we fence in the firmware so
> >> it does not hurt the host system but still better be avoided; for the same
> >> reason this does not map GPU RAM into the host kernel (usual thing for
> >> emulated access otherwise).
> >>
> >> This exports an ATSD (Address Translation Shootdown) register of NPU which
> >> allows TLB invalidations inside GPU for an operating system. The register
> >> conveniently occupies a single 64k page. It is also presented to
> >> the userspace as a new VFIO device region. One NPU has 8 ATSD registers,
> >> each of them can be used for TLB invalidation in a GPU linked to this NPU.
> >> This allocates one ATSD register per an NVLink bridge allowing passing
> >> up to 6 registers. Due to the host firmware bug (just recently fixed),
> >> only 1 ATSD register per NPU was actually advertised to the host system
> >> so this passes that alone register via the first NVLink bridge device in
> >> the group which is still enough as QEMU collects them all back and
> >> presents to the guest via vPHB to mimic the emulated NPU PHB on the host.
> >>
> >> In order to provide the userspace with the information about GPU-to-NVLink
> >> connections, this exports an additional capability called "tgt"
> >> (which is an abbreviated host system bus address). The "tgt" property
> >> tells the GPU its own system address and allows the guest driver to
> >> conglomerate the routing information so each GPU knows how to get directly
> >> to the other GPUs.
> >>
> >> For ATS to work, the nest MMU (an NVIDIA block in a P9 CPU) needs to
> >> know LPID (a logical partition ID or a KVM guest hardware ID in other
> >> words) and PID (a memory context ID of a userspace process, not to be
> >> confused with a linux pid). This assigns a GPU to LPID in the NPU and
> >> this is why this adds a listener for KVM on an IOMMU group. A PID comes
> >> via NVLink from a GPU and NPU uses a PID wildcard to pass it through.
> >>
> >> This requires coherent memory and ATSD to be available on the host as
> >> the GPU vendor only supports configurations with both features enabled
> >> and other configurations are known not to work. Because of this and
> >> because of the ways the features are advertised to the host system
> >> (which is a device tree with very platform specific properties),
> >> this requires enabled POWERNV platform.
> >>
> >> The V100 GPUs do not advertise any of these capabilities via the config
> >> space and there are more than just one device ID so this relies on
> >> the platform to tell whether these GPUs have special abilities such as
> >> NVLinks.
> >>
> >> Signed-off-by: Alexey Kardashevskiy 
> >> ---
> >> Changes:
> >> v6.1:
> >> * fixed outdated comment about VFIO_REGION_INFO_CAP_NVLINK2_LNKSPD
> >>
> >> v6:
> >> * reworked capabilities - tgt for nvlink and gpu and link-speed
> >> for nvlink only
> >>
> >> v5:
> >> * do not memremap GPU RAM for emulation, map it only when it is needed
> >> * allocate 1 ATSD register per NVLink bridge, if none left, then expose
> >> the region with a zero size
> >>
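The "tgt" and link-speed information described in the commit message above reaches userspace through VFIO's region-info capability chain: each capability begins with a small header whose `next` field holds the byte offset of the following capability (0 terminates the chain). Below is a minimal userspace-style sketch of walking such a chain. The structures are reproduced locally with fixed-width types so the sketch compiles on its own, and `find_cap()` is a hypothetical helper, not part of any real API.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local stand-ins for the uapi structures quoted in the patch; the
 * field layout follows include/uapi/linux/vfio.h but is reproduced
 * here so this sketch is self-contained. */
struct vfio_info_cap_header {
	uint16_t id;		/* capability ID */
	uint16_t version;
	uint32_t next;		/* offset of next cap from buffer start, 0 = end */
};

#define VFIO_REGION_INFO_CAP_NVLINK2_SSATGT 4

struct vfio_region_info_cap_nvlink2_ssatgt {
	struct vfio_info_cap_header header;
	uint64_t tgt;		/* compressed system bus address of GPU RAM */
};

/* Hypothetical helper: walk the chain starting at byte offset 'first'
 * inside 'buf' and return the offset of the capability with the given
 * id, or 0 if it is absent. */
static uint32_t find_cap(const uint8_t *buf, uint32_t first, uint16_t id)
{
	uint32_t off = first;

	while (off) {
		const struct vfio_info_cap_header *hdr =
			(const struct vfio_info_cap_header *)(buf + off);

		if (hdr->id == id)
			return off;
		off = hdr->next;
	}
	return 0;
}
```

In real code the buffer would be the payload returned by VFIO_DEVICE_GET_REGION_INFO and `first` would come from `vfio_region_info.cap_offset`; here the chain is simulated in a local buffer.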

Re: [PATCH kernel v7 20/20] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] subdriver

2018-12-20 Thread Alex Williamson
On Thu, 20 Dec 2018 19:23:50 +1100
Alexey Kardashevskiy  wrote:

> POWER9 Witherspoon machines come with 4 or 6 V100 GPUs which are not
> pluggable PCIe devices but still have PCIe links which are used
> for config space and MMIO. In addition to that the GPUs have 6 NVLinks
> which are connected to other GPUs and the POWER9 CPU. POWER9 chips
> have a special unit on a die called an NPU which is an NVLink2 host bus
> adapter with p2p connections to 2 to 3 GPUs, 3 or 2 NVLinks to each.
> These systems also support ATS (address translation services) which is
> a part of the NVLink2 protocol. Such GPUs also share on-board RAM
> (16GB or 32GB) to the system via the same NVLink2 so a CPU has
> cache-coherent access to a GPU RAM.
> 
> This exports GPU RAM to the userspace as a new VFIO device region. This
> preregisters the new memory as device memory as it might be used for DMA.
> This inserts pfns from the fault handler as the GPU memory is not onlined
> until the vendor driver is loaded and trained the NVLinks so doing this
> earlier causes low level errors which we fence in the firmware so
> it does not hurt the host system but still better be avoided; for the same
> reason this does not map GPU RAM into the host kernel (usual thing for
> emulated access otherwise).
> 
> This exports an ATSD (Address Translation Shootdown) register of NPU which
> allows TLB invalidations inside GPU for an operating system. The register
> conveniently occupies a single 64k page. It is also presented to
> the userspace as a new VFIO device region. One NPU has 8 ATSD registers,
> each of them can be used for TLB invalidation in a GPU linked to this NPU.
> This allocates one ATSD register per an NVLink bridge allowing passing
> up to 6 registers. Due to the host firmware bug (just recently fixed),
> only 1 ATSD register per NPU was actually advertised to the host system
> so this passes that alone register via the first NVLink bridge device in
> the group which is still enough as QEMU collects them all back and
> presents to the guest via vPHB to mimic the emulated NPU PHB on the host.
> 
> In order to provide the userspace with the information about GPU-to-NVLink
> connections, this exports an additional capability called "tgt"
> (which is an abbreviated host system bus address). The "tgt" property
> tells the GPU its own system address and allows the guest driver to
> conglomerate the routing information so each GPU knows how to get directly
> to the other GPUs.
> 
> For ATS to work, the nest MMU (an NVIDIA block in a P9 CPU) needs to
> know LPID (a logical partition ID or a KVM guest hardware ID in other
> words) and PID (a memory context ID of a userspace process, not to be
> confused with a linux pid). This assigns a GPU to LPID in the NPU and
> this is why this adds a listener for KVM on an IOMMU group. A PID comes
> via NVLink from a GPU and NPU uses a PID wildcard to pass it through.
> 
> This requires coherent memory and ATSD to be available on the host as
> the GPU vendor only supports configurations with both features enabled
> and other configurations are known not to work. Because of this and
> because of the ways the features are advertised to the host system
> (which is a device tree with very platform specific properties),
> this requires enabled POWERNV platform.
> 
> The V100 GPUs do not advertise any of these capabilities via the config
> space and there are more than just one device ID so this relies on
> the platform to tell whether these GPUs have special abilities such as
> NVLinks.
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
> Changes:
> v6.1:
> * fixed outdated comment about VFIO_REGION_INFO_CAP_NVLINK2_LNKSPD
> 
> v6:
> * reworked capabilities - tgt for nvlink and gpu and link-speed
> for nvlink only
> 
> v5:
> * do not memremap GPU RAM for emulation, map it only when it is needed
> * allocate 1 ATSD register per NVLink bridge, if none left, then expose
> the region with a zero size
> * separate caps per device type
> * addressed AW review comments
> 
> v4:
> * added nvlink-speed to the NPU bridge capability as this turned out to
> be not a constant value
> * instead of looking at the exact device ID (which also changes from system
> to system), now this (indirectly) looks at the device tree to know
> if GPU and NPU support NVLink
> 
> v3:
> * reworded the commit log about tgt
> * added tracepoints (do we want them enabled for entire vfio-pci?)
> * added code comments
> * added write|mmap flags to the new regions
> * auto enabled VFIO_PCI_NVLINK2 config option
> * added 'tgt' capability to a GPU so QEMU can recreate ibm,npu and ibm,gpu
> references; there are required by the NVIDIA driver
> * keep notifier registered only for short time
> ---
>  drivers/vfio/pci/Makefile   |   1 +
>  drivers/vfio/pci/trace.h| 102 ++
>  drivers/vfio/pci/vfio_pci_private.h |  14 +
>  include/uapi/linux/vfio.h   |  37 +++
>  drivers/vfio/pci/vfio_pci.c |  27 +-
>  

Re: [PATCH kernel v6 20/20] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] subdriver

2018-12-19 Thread Alex Williamson
> diff --git a/drivers/vfio/pci/vfio_pci_private.h 
> b/drivers/vfio/pci/vfio_pci_private.h
> index 93c1738..127071b 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -163,4 +163,18 @@ static inline int vfio_pci_igd_init(struct 
> vfio_pci_device *vdev)
>   return -ENODEV;
>  }
>  #endif
> +#ifdef CONFIG_VFIO_PCI_NVLINK2
> +extern int vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_device *vdev);
> +extern int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev);
> +#else
> +static inline int vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_device 
> *vdev)
> +{
> + return -ENODEV;
> +}
> +
> +static inline int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev)
> +{
> + return -ENODEV;
> +}
> +#endif
>  #endif /* VFIO_PCI_PRIVATE_H */
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 8131028..22b825c 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -353,6 +353,21 @@ struct vfio_region_gfx_edid {
>  #define VFIO_DEVICE_GFX_LINK_STATE_DOWN  2
>  };
>  
> +/*
> + * 10de vendor sub-type
> + *
> + * NVIDIA GPU NVlink2 RAM is coherent RAM mapped onto the host address space.
> + */
> +#define VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2_RAM   (1)
> +
> +/*
> + * 1014 vendor sub-type
> + *
> + * IBM NPU NVlink2 ATSD (Address Translation Shootdown) register of NPU
> + * to do TLB invalidation on a GPU.
> + */
> +#define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD (1)
> +
>  /*
>   * The MSIX mappable capability informs that MSIX data of a BAR can be 
> mmapped
>   * which allows direct access to non-MSIX registers which happened to be 
> within
> @@ -363,6 +378,29 @@ struct vfio_region_gfx_edid {
>   */
>  #define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE   3
>  
> +/*
> + * Capability with compressed real address (aka SSA - small system address)
> + * where GPU RAM is mapped on a system bus. Used by a GPU for DMA routing.
> + */
> +#define VFIO_REGION_INFO_CAP_NVLINK2_SSATGT  4
> +
> +struct vfio_region_info_cap_nvlink2_ssatgt {
> + struct vfio_info_cap_header header;
> + __u64 tgt;
> +};
> +
> +/*
> + * Capability with compressed real address (aka SSA - small system address),
> + * used to match the NVLink bridge with a GPU. Also contains a link speed.
> + */

Comments carried over from previous definitions are no longer
accurate.  Thanks,

Alex

> +#define VFIO_REGION_INFO_CAP_NVLINK2_LNKSPD  5
> +
> +struct vfio_region_info_cap_nvlink2_lnkspd {
> + struct vfio_info_cap_header header;
> + __u32 link_speed;
> + __u32 __pad;
> +};
> +
>  /**
>   * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,
>   *   struct vfio_irq_info)
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 6cb70cf..67c03f2 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -302,14 +302,37 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
>   if (ret) {
>   dev_warn(&vdev->pdev->dev,
>"Failed to setup Intel IGD regions\n");
> - vfio_pci_disable(vdev);
> - return ret;
> + goto disable_exit;
> + }
> + }
> +
> + if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
> + IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
> + ret = vfio_pci_nvdia_v100_nvlink2_init(vdev);
> + if (ret && ret != -ENODEV) {
> + dev_warn(&vdev->pdev->dev,
> +  "Failed to setup NVIDIA NV2 RAM region\n");
> + goto disable_exit;
> + }
> + }
> +
> + if (pdev->vendor == PCI_VENDOR_ID_IBM &&
> + IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
> + ret = vfio_pci_ibm_npu2_init(vdev);
> + if (ret && ret != -ENODEV) {
> +     dev_warn(&vdev->pdev->dev,
> + "Failed to setup NVIDIA NV2 ATSD 
> region\n");
> + goto disable_exit;
>   }
>   }
>  
>   vfio_pci_probe_mmaps(vdev);
>  
>   return 0;
> +
> +disable_exit:
> + vfio_pci_disable(vdev);
> + return ret;
>  }
>  
>  static void vfio_pci_disable(struct vfio_pci_device *vdev)
> diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c 
> b/drivers/vfio/pci/vfio_pci_nvlink2.c
> new file mode 100644
> index 000..054a2cf
> --- /dev/null
> +++ b/drivers/vfio/pci/vfio_pci_nvlink2.c
> @@ -0,0 
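The two capability structures added to the uapi header in the quoted diff are each 16 bytes: an 8-byte `vfio_info_cap_header` followed by 8 bytes of payload, with the explicit `__pad` in the link-speed variant making that true without relying on compiler padding. A standalone compile-check sketch of those layouts (fixed-width stand-ins for the `__u16`/`__u32`/`__u64` kernel types; treat this copy as an illustration, not the real header):

```c
#include <stdint.h>

/* Mirrors include/uapi/linux/vfio.h as quoted above, using stdint
 * types so the check builds outside the kernel tree. */
struct vfio_info_cap_header {
	uint16_t id;
	uint16_t version;
	uint32_t next;
};

struct vfio_region_info_cap_nvlink2_ssatgt {
	struct vfio_info_cap_header header;
	uint64_t tgt;
};

struct vfio_region_info_cap_nvlink2_lnkspd {
	struct vfio_info_cap_header header;
	uint32_t link_speed;
	uint32_t __pad;		/* keeps both caps the same 16-byte size */
};

/* Compile-time layout checks: 8-byte header + 8-byte payload. */
_Static_assert(sizeof(struct vfio_info_cap_header) == 8,
	       "cap header layout");
_Static_assert(sizeof(struct vfio_region_info_cap_nvlink2_ssatgt) == 16,
	       "ssatgt cap layout");
_Static_assert(sizeof(struct vfio_region_info_cap_nvlink2_lnkspd) == 16,
	       "lnkspd cap layout");
```

Fixed sizes like these matter for a uapi structure because userspace indexes the capability chain by raw byte offsets.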

Re: [PATCH kernel v6 19/20] vfio_pci: Allow regions to add own capabilities

2018-12-19 Thread Alex Williamson
[cc +kvm, +lkml]

Ditto list cc comment from 18/20

On Wed, 19 Dec 2018 19:52:31 +1100
Alexey Kardashevskiy  wrote:

> VFIO regions already support region capabilities with a limited set of
> fields. However the subdriver might have to report to the userspace
> additional bits.
> 
> This adds an add_capability() hook to vfio_pci_regops.
> 
> Signed-off-by: Alexey Kardashevskiy 
> Acked-by: Alex Williamson 
> ---
> Changes:
> v3:
> * removed confusing rationale for the patch, the next patch makes
> use of it anyway
> ---
>  drivers/vfio/pci/vfio_pci_private.h | 3 +++
>  drivers/vfio/pci/vfio_pci.c | 6 ++
>  2 files changed, 9 insertions(+)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_private.h 
> b/drivers/vfio/pci/vfio_pci_private.h
> index 86aab05..93c1738 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -62,6 +62,9 @@ struct vfio_pci_regops {
>   int (*mmap)(struct vfio_pci_device *vdev,
>   struct vfio_pci_region *region,
>   struct vm_area_struct *vma);
> + int (*add_capability)(struct vfio_pci_device *vdev,
> +   struct vfio_pci_region *region,
> +   struct vfio_info_cap *caps);
>  };
>  
>  struct vfio_pci_region {
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 4a6f7c0..6cb70cf 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -763,6 +763,12 @@ static long vfio_pci_ioctl(void *device_data,
>   if (ret)
>   return ret;
>  
> + if (vdev->region[i].ops->add_capability) {
> + ret = vdev->region[i].ops->add_capability(vdev,
> + &vdev->region[i], &caps);
> + if (ret)
> + return ret;
> + }
>   }
>   }
>  
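The `add_capability()` hook added above lets a region subdriver append its own capabilities while vfio_pci_ioctl() builds the region-info reply. The sketch below shows the shape of that pattern with stand-in types so it compiles outside the kernel; the callback body and `collect_caps()` dispatcher are hypothetical, and real code would fill a capability structure and pass it to vfio_info_cap_add()/vfio_info_add_capability() instead of bumping a counter.

```c
#include <stddef.h>

/* Stand-in types; the kernel versions live in vfio_pci_private.h and
 * linux/vfio.h. Only the fields this sketch touches are kept. */
struct vfio_pci_device;			/* opaque here */
struct vfio_info_cap { int ncaps; };	/* real one holds a cap buffer */
struct vfio_pci_region;

struct vfio_pci_regops {
	int (*add_capability)(struct vfio_pci_device *vdev,
			      struct vfio_pci_region *region,
			      struct vfio_info_cap *caps);
};

struct vfio_pci_region {
	const struct vfio_pci_regops *ops;
};

/* What a subdriver's hook might look like: append one vendor cap. */
static int nvlink2_add_capability(struct vfio_pci_device *vdev,
				  struct vfio_pci_region *region,
				  struct vfio_info_cap *caps)
{
	(void)vdev;
	(void)region;
	/* Real code: build vfio_region_info_cap_nvlink2_ssatgt and add it. */
	caps->ncaps++;
	return 0;
}

static const struct vfio_pci_regops nvlink2_regops = {
	.add_capability = nvlink2_add_capability,
};

/* Mirrors the vfio_pci_ioctl() hunk: call the hook only if present,
 * so regions without extra capabilities need no boilerplate. */
static int collect_caps(struct vfio_pci_region *region,
			struct vfio_info_cap *caps)
{
	if (region->ops && region->ops->add_capability)
		return region->ops->add_capability(NULL, region, caps);
	return 0;
}
```

The optional-hook test is the key design point: existing regions keep working unchanged, and only subdrivers that export extra metadata implement the callback.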



Re: [PATCH kernel v6 18/20] vfio_pci: Allow mapping extra regions

2018-12-19 Thread Alex Williamson
[cc +kvm, +lkml]

Sorry, just noticed these are only visible on ppc lists or for those
directly cc'd.  vfio's official development list is the kvm list.  I'll
let spapr specific changes get away without copying this list, but
changes like this really need to be visible to everyone.  Thanks,

Alex

On Wed, 19 Dec 2018 19:52:30 +1100
Alexey Kardashevskiy  wrote:

> So far we only allowed mapping of MMIO BARs to the userspace. However
> there are GPUs with on-board coherent RAM accessible via side
> channels which we also want to map to the userspace. The first client
> for this is NVIDIA V100 GPU with NVLink2 direct links to a POWER9
> NPU-enabled CPU; such GPUs have 16GB RAM which is coherently mapped
> to the system address space, we are going to export these as an extra
> PCI region.
> 
> We already support extra PCI regions and this adds support for mapping
> them to the userspace.
> 
> Signed-off-by: Alexey Kardashevskiy 
> Reviewed-by: David Gibson 
> Acked-by: Alex Williamson 
> ---
> Changes:
> v2:
> * reverted one of mistakenly removed error checks
> ---
>  drivers/vfio/pci/vfio_pci_private.h | 3 +++
>  drivers/vfio/pci/vfio_pci.c | 9 +
>  2 files changed, 12 insertions(+)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_private.h 
> b/drivers/vfio/pci/vfio_pci_private.h
> index cde3b5d..86aab05 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -59,6 +59,9 @@ struct vfio_pci_regops {
> size_t count, loff_t *ppos, bool iswrite);
>   void(*release)(struct vfio_pci_device *vdev,
>  struct vfio_pci_region *region);
> + int (*mmap)(struct vfio_pci_device *vdev,
> + struct vfio_pci_region *region,
> + struct vm_area_struct *vma);
>  };
>  
>  struct vfio_pci_region {
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index fef5002..4a6f7c0 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -1130,6 +1130,15 @@ static int vfio_pci_mmap(void *device_data, struct 
> vm_area_struct *vma)
>   return -EINVAL;
>   if ((vma->vm_flags & VM_SHARED) == 0)
>   return -EINVAL;
> + if (index >= VFIO_PCI_NUM_REGIONS) {
> + int regnum = index - VFIO_PCI_NUM_REGIONS;
> + struct vfio_pci_region *region = vdev->region + regnum;
> +
> + if (region && region->ops && region->ops->mmap &&
> + (region->flags & VFIO_REGION_INFO_FLAG_MMAP))
> + return region->ops->mmap(vdev, region, vma);
> + return -EINVAL;
> + }
>   if (index >= VFIO_PCI_ROM_REGION_INDEX)
>   return -EINVAL;
>   if (!vdev->bar_mmap_supported[index])
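The hunk above routes mmap requests whose index is at or past VFIO_PCI_NUM_REGIONS to the region's own `mmap` hook. vfio-pci encodes the region index in the high bits of the file offset, so the index falls out of a shift and the extra-region number is just the index minus the count of standard regions. A small sketch of that arithmetic; the constants are reproduced from vfio_pci_private.h as of this patch and should be treated as assumptions of the sketch:

```c
#include <stdint.h>

/* As in drivers/vfio/pci/vfio_pci_private.h at the time of the patch. */
#define VFIO_PCI_OFFSET_SHIFT	40
#define VFIO_PCI_NUM_REGIONS	9	/* BARs 0-5, ROM, config, VGA */

/* Region index lives in the upper bits of the read/write/mmap offset. */
static inline uint32_t offset_to_index(uint64_t offset)
{
	return (uint32_t)(offset >> VFIO_PCI_OFFSET_SHIFT);
}

/* For device-specific regions, the subdriver's region-array slot;
 * -1 means a standard BAR/ROM/config/VGA region, not an extra one. */
static inline int index_to_regnum(uint32_t index)
{
	if (index < VFIO_PCI_NUM_REGIONS)
		return -1;
	return (int)(index - VFIO_PCI_NUM_REGIONS);
}
```

So the first extra region (e.g. the NVLink2 GPU RAM region) appears at index 9 and maps to `vdev->region[0]`, matching the `index - VFIO_PCI_NUM_REGIONS` computation in the quoted hunk.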



Re: [PATCH kernel v5 20/20] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] subdriver

2018-12-18 Thread Alex Williamson
On Thu, 13 Dec 2018 17:17:34 +1100
Alexey Kardashevskiy  wrote:

> POWER9 Witherspoon machines come with 4 or 6 V100 GPUs which are not
> pluggable PCIe devices but still have PCIe links which are used
> for config space and MMIO. In addition to that the GPUs have 6 NVLinks
> which are connected to other GPUs and the POWER9 CPU. POWER9 chips
> have a special unit on a die called an NPU which is an NVLink2 host bus
> adapter with p2p connections to 2 to 3 GPUs, 3 or 2 NVLinks to each.
> These systems also support ATS (address translation services) which is
> a part of the NVLink2 protocol. Such GPUs also share on-board RAM
> (16GB or 32GB) to the system via the same NVLink2 so a CPU has
> cache-coherent access to a GPU RAM.
> 
> This exports GPU RAM to the userspace as a new VFIO device region. This
> preregisters the new memory as device memory as it might be used for DMA.
> This inserts pfns from the fault handler as the GPU memory is not onlined
> until the vendor driver is loaded and trained the NVLinks so doing this
> earlier causes low level errors which we fence in the firmware so
> it does not hurt the host system but still better be avoided; for the same
> reason this does not map GPU RAM into the host kernel (usual thing for
> emulated access otherwise).
> 
> This exports an ATSD (Address Translation Shootdown) register of NPU which
> allows TLB invalidations inside GPU for an operating system. The register
> conveniently occupies a single 64k page. It is also presented to
> the userspace as a new VFIO device region. One NPU has 8 ATSD registers,
> each of them can be used for TLB invalidation in a GPU linked to this NPU.
> This allocates one ATSD register per an NVLink bridge allowing passing
> up to 6 registers. Due to the host firmware bug (just recently fixed),
> only 1 ATSD register per NPU was actually advertised to the host system
> so this passes that alone register via the first NVLink bridge device in
> the group which is still enough as QEMU collects them all back and
> presents to the guest via vPHB to mimic the emulated NPU PHB on the host.
> 
> In order to provide the userspace with the information about GPU-to-NVLink
> connections, this exports an additional capability called "tgt"
> (which is an abbreviated host system bus address). The "tgt" property
> tells the GPU its own system address and allows the guest driver to
> conglomerate the routing information so each GPU knows how to get directly
> to the other GPUs.
> 
> For ATS to work, the nest MMU (an NVIDIA block in a P9 CPU) needs to
> know LPID (a logical partition ID or a KVM guest hardware ID in other
> words) and PID (a memory context ID of a userspace process, not to be
> confused with a linux pid). This assigns a GPU to LPID in the NPU and
> this is why this adds a listener for KVM on an IOMMU group. A PID comes
> via NVLink from a GPU and NPU uses a PID wildcard to pass it through.
> 
> This requires coherent memory and ATSD to be available on the host as
> the GPU vendor only supports configurations with both features enabled
> and other configurations are known not to work. Because of this and
> because of the ways the features are advertised to the host system
> (which is a device tree with very platform specific properties),
> this requires enabled POWERNV platform.
> 
> The V100 GPUs do not advertise any of these capabilities via the config
> space and there are more than just one device ID so this relies on
> the platform to tell whether these GPUs have special abilities such as
> NVLinks.
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
> Changes:
> v5:
> * do not memremap GPU RAM for emulation, map it only when it is needed
> * allocate 1 ATSD register per NVLink bridge, if none left, then expose
> the region with a zero size
> * separate caps per device type
> * addressed AW review comments
> 
> v4:
> * added nvlink-speed to the NPU bridge capability as this turned out to
> be not a constant value
> * instead of looking at the exact device ID (which also changes from system
> to system), now this (indirectly) looks at the device tree to know
> if GPU and NPU support NVLink
> 
> v3:
> * reworded the commit log about tgt
> * added tracepoints (do we want them enabled for entire vfio-pci?)
> * added code comments
> * added write|mmap flags to the new regions
> * auto enabled VFIO_PCI_NVLINK2 config option
> * added 'tgt' capability to a GPU so QEMU can recreate ibm,npu and ibm,gpu
> references; there are required by the NVIDIA driver
> * keep notifier registered only for short time
> ---
>  drivers/vfio/pci/Makefile   |   1 +
>  drivers/vfio/pci/trace.h| 102 ++
>  drivers/vfio/pci/vfio_pci_private.h |  14 +
>  include/uapi/linux/vfio.h   |  39 +++
>  drivers/vfio/pci/vfio_pci.c |  27 +-
>  drivers/vfio/pci/vfio_pci_nvlink2.c | 473 
>  drivers/vfio/pci/Kconfig|   6 +
>  7 files changed, 660 insertions(+), 2 deletions(-)
>  create mode 

Re: [PATCH kernel v4 19/19] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] subdriver

2018-12-10 Thread Alex Williamson
On Tue, 11 Dec 2018 11:57:20 +1100
Alexey Kardashevskiy  wrote:

> On 11/12/2018 11:08, Alex Williamson wrote:
> > On Fri, 23 Nov 2018 16:53:04 +1100
> > Alexey Kardashevskiy  wrote:
> >   
> >> POWER9 Witherspoon machines come with 4 or 6 V100 GPUs which are not
> >> pluggable PCIe devices but still have PCIe links which are used
> >> for config space and MMIO. In addition to that the GPUs have 6 NVLinks
> >> which are connected to other GPUs and the POWER9 CPU. POWER9 chips
> >> have a special unit on a die called an NPU which is an NVLink2 host bus
> >> adapter with p2p connections to 2 to 3 GPUs, 3 or 2 NVLinks to each.
> >> These systems also support ATS (address translation services) which is
> >> a part of the NVLink2 protocol. Such GPUs also share on-board RAM
> >> (16GB or 32GB) to the system via the same NVLink2 so a CPU has
> >> cache-coherent access to a GPU RAM.
> >>
> >> This exports GPU RAM to the userspace as a new VFIO device region. This
> >> preregisters the new memory as device memory as it might be used for DMA.
> >> This inserts pfns from the fault handler as the GPU memory is not onlined
> >> until the vendor driver is loaded and trained the NVLinks so doing this
> >> earlier causes low level errors which we fence in the firmware so
> >> it does not hurt the host system but still better be avoided.
> >>
> >> This exports an ATSD (Address Translation Shootdown) register of NPU which
> >> allows TLB invalidations inside GPU for an operating system. The register
> >> conveniently occupies a single 64k page. It is also presented to
> >> the userspace as a new VFIO device region.
> >>
> >> In order to provide the userspace with the information about GPU-to-NVLink
> >> connections, this exports an additional capability called "tgt"
> >> (which is an abbreviated host system bus address). The "tgt" property
> >> tells the GPU its own system address and allows the guest driver to
> >> conglomerate the routing information so each GPU knows how to get directly
> >> to the other GPUs.
> >>
> >> For ATS to work, the nest MMU (an NVIDIA block in a P9 CPU) needs to
> >> know LPID (a logical partition ID or a KVM guest hardware ID in other
> >> words) and PID (a memory context ID of a userspace process, not to be
> >> confused with a linux pid). This assigns a GPU to LPID in the NPU and
> >> this is why this adds a listener for KVM on an IOMMU group. A PID comes
> >> via NVLink from a GPU and NPU uses a PID wildcard to pass it through.
> >>
> >> This requires coherent memory and ATSD to be available on the host as
> >> the GPU vendor only supports configurations with both features enabled
> >> and other configurations are known not to work. Because of this and
> >> because of the ways the features are advertised to the host system
> >> (which is a device tree with very platform specific properties),
> >> this requires enabled POWERNV platform.
> >>
> >> The V100 GPUs do not advertise none of these capabilities via the config  
> > 
> > s/none/any/
> >   
> >> space and there are more than just one device ID so this relies on
> >> the platform to tell whether these GPUs have special abilities such as
> >> NVLinks.
> >>
> >> Signed-off-by: Alexey Kardashevskiy 
> >> ---
> >> Changes:
> >> v4:
> >> * added nvlink-speed to the NPU bridge capability as this turned out to
> >> be not a constant value
> >> * instead of looking at the exact device ID (which also changes from system
> >> to system), now this (indirectly) looks at the device tree to know
> >> if GPU and NPU support NVLink
> >>
> >> v3:
> >> * reworded the commit log about tgt
> >> * added tracepoints (do we want them enabled for entire vfio-pci?)
> >> * added code comments
> >> * added write|mmap flags to the new regions
> >> * auto enabled VFIO_PCI_NVLINK2 config option
> >> * added 'tgt' capability to a GPU so QEMU can recreate ibm,npu and ibm,gpu
> >> references; there are required by the NVIDIA driver
> >> * keep notifier registered only for short time
> >> ---
> >>  drivers/vfio/pci/Makefile   |   1 +
> >>  drivers/vfio/pci/trace.h| 102 +++
> >>  drivers/vfio/pci/vfio_pci_private.h |   2 +
> >>  include/uapi/linux/vfio.h   |  27 ++
> >>  drivers/vfio/pci/vfio_pci.c |  37 ++-
> >

Re: [PATCH kernel v4 18/19] vfio_pci: Allow regions to add own capabilities

2018-12-10 Thread Alex Williamson
On Fri, 23 Nov 2018 16:53:03 +1100
Alexey Kardashevskiy  wrote:

> VFIO regions already support region capabilities with a limited set of
> fields. However the subdriver might have to report to the userspace
> additional bits.
> 
> This adds an add_capability() hook to vfio_pci_regops.
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
> Changes:
> v3:
> * removed confusing rationale for the patch, the next patch makes
> use of it anyway
> ---
>  drivers/vfio/pci/vfio_pci_private.h | 3 +++
>  drivers/vfio/pci/vfio_pci.c | 6 ++
>  2 files changed, 9 insertions(+)

Acked-by: Alex Williamson 

> diff --git a/drivers/vfio/pci/vfio_pci_private.h 
> b/drivers/vfio/pci/vfio_pci_private.h
> index 86aab05..93c1738 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -62,6 +62,9 @@ struct vfio_pci_regops {
>   int (*mmap)(struct vfio_pci_device *vdev,
>   struct vfio_pci_region *region,
>   struct vm_area_struct *vma);
> + int (*add_capability)(struct vfio_pci_device *vdev,
> +   struct vfio_pci_region *region,
> +   struct vfio_info_cap *caps);
>  };
>  
>  struct vfio_pci_region {
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 4a6f7c0..6cb70cf 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -763,6 +763,12 @@ static long vfio_pci_ioctl(void *device_data,
>   if (ret)
>   return ret;
>  
> + if (vdev->region[i].ops->add_capability) {
> + ret = vdev->region[i].ops->add_capability(vdev,
> + &vdev->region[i], &caps);
> + if (ret)
> + return ret;
> + }
>   }
>   }
>  



Re: [PATCH kernel v4 17/19] vfio_pci: Allow mapping extra regions

2018-12-10 Thread Alex Williamson
On Fri, 23 Nov 2018 16:53:02 +1100
Alexey Kardashevskiy  wrote:

> So far we only allowed mapping of MMIO BARs to the userspace. However
> there there are GPUs with on-board coherent RAM accessible via side

s/there there/there/

Otherwise:

Acked-by: Alex Williamson 

> channels which we also want to map to the userspace. The first client
> for this is NVIDIA V100 GPU with NVLink2 direct links to a POWER9
> NPU-enabled CPU; such GPUs have 16GB RAM which is coherently mapped
> to the system address space, we are going to export these as an extra
> PCI region.
> 
> We already support extra PCI regions and this adds support for mapping
> them to the userspace.
> 
> Signed-off-by: Alexey Kardashevskiy 
> Reviewed-by: David Gibson 
> ---
> Changes:
> v2:
> * reverted one of mistakenly removed error checks
> ---
>  drivers/vfio/pci/vfio_pci_private.h | 3 +++
>  drivers/vfio/pci/vfio_pci.c | 9 +
>  2 files changed, 12 insertions(+)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_private.h 
> b/drivers/vfio/pci/vfio_pci_private.h
> index cde3b5d..86aab05 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -59,6 +59,9 @@ struct vfio_pci_regops {
> size_t count, loff_t *ppos, bool iswrite);
>   void(*release)(struct vfio_pci_device *vdev,
>  struct vfio_pci_region *region);
> + int (*mmap)(struct vfio_pci_device *vdev,
> + struct vfio_pci_region *region,
> + struct vm_area_struct *vma);
>  };
>  
>  struct vfio_pci_region {
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index fef5002..4a6f7c0 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -1130,6 +1130,15 @@ static int vfio_pci_mmap(void *device_data, struct 
> vm_area_struct *vma)
>   return -EINVAL;
>   if ((vma->vm_flags & VM_SHARED) == 0)
>   return -EINVAL;
> + if (index >= VFIO_PCI_NUM_REGIONS) {
> + int regnum = index - VFIO_PCI_NUM_REGIONS;
> + struct vfio_pci_region *region = vdev->region + regnum;
> +
> + if (region && region->ops && region->ops->mmap &&
> + (region->flags & VFIO_REGION_INFO_FLAG_MMAP))
> + return region->ops->mmap(vdev, region, vma);
> + return -EINVAL;
> + }
>   if (index >= VFIO_PCI_ROM_REGION_INDEX)
>   return -EINVAL;
>   if (!vdev->bar_mmap_supported[index])



Re: [PATCH kernel v4 19/19] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] subdriver

2018-12-10 Thread Alex Williamson
On Fri, 23 Nov 2018 16:53:04 +1100
Alexey Kardashevskiy  wrote:

> POWER9 Witherspoon machines come with 4 or 6 V100 GPUs which are not
> pluggable PCIe devices but still have PCIe links which are used
> for config space and MMIO. In addition to that the GPUs have 6 NVLinks
> which are connected to other GPUs and the POWER9 CPU. POWER9 chips
> have a special unit on a die called an NPU which is an NVLink2 host bus
> adapter with p2p connections to 2 to 3 GPUs, 3 or 2 NVLinks to each.
> These systems also support ATS (address translation services) which is
> a part of the NVLink2 protocol. Such GPUs also share on-board RAM
> (16GB or 32GB) to the system via the same NVLink2 so a CPU has
> cache-coherent access to a GPU RAM.
> 
> This exports GPU RAM to the userspace as a new VFIO device region. This
> preregisters the new memory as device memory as it might be used for DMA.
> This inserts pfns from the fault handler as the GPU memory is not onlined
> until the vendor driver is loaded and trained the NVLinks so doing this
> earlier causes low level errors which we fence in the firmware so
> it does not hurt the host system but still better be avoided.
> 
> This exports an ATSD (Address Translation Shootdown) register of NPU which
> allows TLB invalidations inside GPU for an operating system. The register
> conveniently occupies a single 64k page. It is also presented to
> the userspace as a new VFIO device region.
> 
> In order to provide the userspace with the information about GPU-to-NVLink
> connections, this exports an additional capability called "tgt"
> (which is an abbreviated host system bus address). The "tgt" property
> tells the GPU its own system address and allows the guest driver to
> conglomerate the routing information so each GPU knows how to get directly
> to the other GPUs.
> 
> For ATS to work, the nest MMU (an NVIDIA block in a P9 CPU) needs to
> know LPID (a logical partition ID or a KVM guest hardware ID in other
> words) and PID (a memory context ID of a userspace process, not to be
> confused with a linux pid). This assigns a GPU to LPID in the NPU and
> this is why this adds a listener for KVM on an IOMMU group. A PID comes
> via NVLink from a GPU and NPU uses a PID wildcard to pass it through.
> 
> This requires coherent memory and ATSD to be available on the host as
> the GPU vendor only supports configurations with both features enabled
> and other configurations are known not to work. Because of this and
> because of the ways the features are advertised to the host system
> (which is a device tree with very platform specific properties),
> this requires enabled POWERNV platform.
> 
> The V100 GPUs do not advertise none of these capabilities via the config

s/none/any/

> space and there is more than one device ID so this relies on
> the platform to tell whether these GPUs have special abilities such as
> NVLinks.
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
> Changes:
> v4:
> * added nvlink-speed to the NPU bridge capability as this turned out
> not to be a constant value
> * instead of looking at the exact device ID (which also changes from system
> to system), now this (indirectly) looks at the device tree to know
> if GPU and NPU support NVLink
> 
> v3:
> * reworded the commit log about tgt
> * added tracepoints (do we want them enabled for entire vfio-pci?)
> * added code comments
> * added write|mmap flags to the new regions
> * auto enabled VFIO_PCI_NVLINK2 config option
> * added 'tgt' capability to a GPU so QEMU can recreate ibm,npu and ibm,gpu
> references; these are required by the NVIDIA driver
> * keep notifier registered only for short time
> ---
>  drivers/vfio/pci/Makefile   |   1 +
>  drivers/vfio/pci/trace.h| 102 +++
>  drivers/vfio/pci/vfio_pci_private.h |   2 +
>  include/uapi/linux/vfio.h   |  27 ++
>  drivers/vfio/pci/vfio_pci.c |  37 ++-
>  drivers/vfio/pci/vfio_pci_nvlink2.c | 448 
>  drivers/vfio/pci/Kconfig|   6 +
>  7 files changed, 621 insertions(+), 2 deletions(-)
>  create mode 100644 drivers/vfio/pci/trace.h
>  create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c
> 
> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> index 76d8ec0..9662c06 100644
> --- a/drivers/vfio/pci/Makefile
> +++ b/drivers/vfio/pci/Makefile
> @@ -1,5 +1,6 @@
>  
>  vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
>  vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
> +vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
>  
>  obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
...
> diff --git a/drivers/vfio/pci/vfio_pci_private.h 
> b/drivers/vfio/pci/vfio_pci_private.h
> index 93c1738..7639241 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -163,4 +163,6 @@ static inline int vfio_pci_igd_init(struct 
> vfio_pci_device *vdev)
>   return -ENODEV;
>  }
>  #endif
> +extern int 

Re: [PATCH kernel 3/3] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver

2018-10-18 Thread Alex Williamson
On Thu, 18 Oct 2018 10:37:46 -0700
Piotr Jaroszynski  wrote:

> On 10/18/18 9:55 AM, Alex Williamson wrote:
> > On Thu, 18 Oct 2018 11:31:33 +1100
> > Alexey Kardashevskiy  wrote:
> >   
> >> On 18/10/2018 08:52, Alex Williamson wrote:  
> >>> On Wed, 17 Oct 2018 12:19:20 +1100
> >>> Alexey Kardashevskiy  wrote:
> >>>  
> >>>> On 17/10/2018 06:08, Alex Williamson wrote:  
> >>>>> On Mon, 15 Oct 2018 20:42:33 +1100
> >>>>> Alexey Kardashevskiy  wrote:
> >>>>>> +
> >>>>>> +  if (pdev->vendor == PCI_VENDOR_ID_IBM &&
> >>>>>> +  pdev->device == 0x04ea) {
> >>>>>> +  ret = vfio_pci_ibm_npu2_init(vdev);
> >>>>>> +  if (ret) {
> >>>>>> +  dev_warn(&vdev->pdev->dev,
> >>>>>> +  "Failed to setup NVIDIA NV2 
> >>>>>> ATSD region\n");
> >>>>>> +  goto disable_exit;
> >>>>>>}  
> >>>>>
> >>>>> So the NPU is also actually owned by vfio-pci and assigned to the VM?  
> >>>>
> >>>> Yes. On a running system it looks like:
> >>>>
> >>>> 0007:00:00.0 Bridge: IBM Device 04ea (rev 01)
> >>>> 0007:00:00.1 Bridge: IBM Device 04ea (rev 01)
> >>>> 0007:00:01.0 Bridge: IBM Device 04ea (rev 01)
> >>>> 0007:00:01.1 Bridge: IBM Device 04ea (rev 01)
> >>>> 0007:00:02.0 Bridge: IBM Device 04ea (rev 01)
> >>>> 0007:00:02.1 Bridge: IBM Device 04ea (rev 01)
> >>>> 0035:00:00.0 PCI bridge: IBM Device 04c1
> >>>> 0035:01:00.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
> >>>> 0035:02:04.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
> >>>> 0035:02:05.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
> >>>> 0035:02:0d.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
> >>>> 0035:03:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2]
> >>>> (rev a1)
> >>>> 0035:04:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2]
> >>>> (rev a1)
> >>>> 0035:05:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2]
> >>>> (rev a1)
> >>>>
> >>>> One "IBM Device" bridge represents one NVLink2, i.e. a piece of NPU.
> >>>> They all and 3 GPUs go to the same IOMMU group and get passed through to
> >>>> a guest.
> >>>>
> >>>> The entire NPU does not have representation via sysfs as a whole though. 
> >>>>  
> >>>
> >>> So the NPU is a bridge, but it uses a normal header type so vfio-pci
> >>> will bind to it?  
> >>
> >> An NPU is a NVLink bridge, it is not PCI in any sense. We (the host
> >> powerpc firmware known as "skiboot" or "opal") have chosen to emulate a
> >> virtual bridge per 1 NVLink on the firmware level. So for each physical
> >> NPU there are 6 virtual bridges. So the NVIDIA driver does not need to
> >> know much about NPUs.
> >>  
> >>> And the ATSD register that we need on it is not
> >>> accessible through these PCI representations of the sub-pieces of the
> >>> NPU?  Thanks,  
> >>
> >> No, only via the device tree. The skiboot puts the ATSD register address
> >> to the PHB's DT property called 'ibm,mmio-atsd' of these virtual bridges.  
> > 
> > Ok, so the NPU is essential a virtual device already, mostly just a
> > stub.  But it seems that each NPU is associated to a specific GPU, how
> > is that association done?  In the use case here it seems like it's just
> > a vehicle to provide this ibm,mmio-atsd property to guest DT and the tgt
> > routing information to the GPU.  So if both of those were attached to
> > the GPU, there'd be no purpose in assigning the NPU other than it's in
> > the same IOMMU group with a type 0 header, so something needs to be
> > done with it.  If it's a virtual device, perhaps it could have a type 1
> > header so vfio wouldn't care about it, then we would only assign the
> > GPU with these extra properties, which seems easier for management
> > tools and users.  If the guest driver needs a visible NPU device, QEMU
> > could possibly emulate one to make the GPU asso

Re: [PATCH kernel 3/3] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver

2018-10-18 Thread Alex Williamson
On Thu, 18 Oct 2018 11:31:33 +1100
Alexey Kardashevskiy  wrote:

> On 18/10/2018 08:52, Alex Williamson wrote:
> > On Wed, 17 Oct 2018 12:19:20 +1100
> > Alexey Kardashevskiy  wrote:
> >   
> >> On 17/10/2018 06:08, Alex Williamson wrote:  
> >>> On Mon, 15 Oct 2018 20:42:33 +1100
> >>> Alexey Kardashevskiy  wrote:
> >>> 
> >>>> POWER9 Witherspoon machines come with 4 or 6 V100 GPUs which are not
> >>>> pluggable PCIe devices but implement PCIe links for config space and 
> >>>> MMIO.
> >>>> In addition to that the GPUs are interconnected to each other and also
> >>>> have direct links to the P9 CPU. The links are NVLink2 and provide direct
> >>>> access to the system RAM for GPUs via NPU (an NVLink2 "proxy" on P9 
> >>>> chip).
> >>>> These systems also support ATS (address translation services) which is
> >>>> a part of the NVLink2 protocol. Such GPUs also share on-board RAM
> >>>> (16GB in tested config) to the system via the same NVLink2 so a CPU has
> >>>> cache-coherent access to a GPU RAM.
> >>>>
> >>>> This exports GPU RAM to the userspace as a new PCI region. This
> >>>> preregisters the new memory as device memory as it might be used for DMA.
> >>>> This inserts pfns from the fault handler as the GPU memory is not onlined
> >>>> until the NVIDIA driver is loaded and trained the links so doing this
> >>>> earlier produces low level errors which we fence in the firmware so
> >>>> it does not hurt the host system but still better to avoid.
> >>>>
> >>>> This exports ATSD (Address Translation Shootdown) register of NPU which
> >>>> allows the guest to invalidate TLB. The register conveniently occupies
> >>>> a single 64k page. Since NPU maps the GPU memory, it has a "tgt" property
> >>>> (which is an abbreviated host system bus address). This exports the "tgt"
> >>>> as a capability so the guest can program it into the GPU so the GPU can
> >>>> know how to route DMA traffic.
> >>>
> >>> I'm not really following what "tgt" is and why it's needed.  Is the GPU
> >>> memory here different than the GPU RAM region above?  Why does the user
> >>> need the host system bus address of this "tgt" thing?  Are we not able
> >>> to relocate it in guest physical address space, does this shootdown
> >>> only work in the host physical address space and therefore we need this
> >>> offset?  Please explain, I'm confused.
> >>
> >>
> >> This "tgt" is made of:
> >> - "memory select" (bits 45, 46)
> >> - "group select" (bits 43, 44)
> >> - "chip select" (bit 42)
> >> - chip internal address (bits 0..41)
> >>
> >> These are internal to GPU and this is where GPU RAM is mapped into the
> >> GPU's real space, this fits 46 bits.
> >>
> >> On POWER9 CPU the bits are different and higher so the same memory is
> >> mapped higher on P9 CPU. Just because we can map it higher, I guess.
> >>
> >> So it is not exactly the address but this provides the exact physical
> >> location of the memory.
> >>
> >> We have a group of 3 interconnected GPUs, they got their own
> >> memory/group/chip numbers. The GPUs use ATS service to translate
> >> userspace to physical (host or guest) addresses. Now a GPU needs to know
> >> which specific link to use for a specific physical address, in other
> >> words what this physical address belongs to - a CPU or one of GPUs. This
> >> is when "tgt" is used by the GPU hardware.  
> > 
> > Clear as mud ;)   
> 
> /me is sad. I hope Piotr explained it better...

It's starting to be a bit more clear, and Piotr anticipated the
security questions I was mulling over.

> > So tgt, provided by the npu2 capability of the ATSD
> > region of the NPU tells the GPU (a completely separate device) how to
> > route to its own RAM via its NVLink interface?  How can one tgt
> > indicate the routing for multiple interfaces?  
> 
> This NVLink DMA is using direct host physical addresses (no IOMMU, no
> filtering) which come from ATS. So unless we tell the GPU its own
> address range on the host CPU, it will route trafic via CPU. And the
> driver can also discover the NVLink topology and tell each GPU physical
> addresses of peer GPUs.

I th

Re: [PATCH kernel 3/3] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver

2018-10-17 Thread Alex Williamson
On Wed, 17 Oct 2018 12:19:20 +1100
Alexey Kardashevskiy  wrote:

> On 17/10/2018 06:08, Alex Williamson wrote:
> > On Mon, 15 Oct 2018 20:42:33 +1100
> > Alexey Kardashevskiy  wrote:
> >   
> >> POWER9 Witherspoon machines come with 4 or 6 V100 GPUs which are not
> >> pluggable PCIe devices but implement PCIe links for config space and MMIO.
> >> In addition to that the GPUs are interconnected to each other and also
> >> have direct links to the P9 CPU. The links are NVLink2 and provide direct
> >> access to the system RAM for GPUs via NPU (an NVLink2 "proxy" on P9 chip).
> >> These systems also support ATS (address translation services) which is
> >> a part of the NVLink2 protocol. Such GPUs also share on-board RAM
> >> (16GB in tested config) to the system via the same NVLink2 so a CPU has
> >> cache-coherent access to a GPU RAM.
> >>
> >> This exports GPU RAM to the userspace as a new PCI region. This
> >> preregisters the new memory as device memory as it might be used for DMA.
> >> This inserts pfns from the fault handler as the GPU memory is not onlined
> >> until the NVIDIA driver is loaded and trained the links so doing this
> >> earlier produces low level errors which we fence in the firmware so
> >> it does not hurt the host system but still better to avoid.
> >>
> >> This exports ATSD (Address Translation Shootdown) register of NPU which
> >> allows the guest to invalidate TLB. The register conveniently occupies
> >> a single 64k page. Since NPU maps the GPU memory, it has a "tgt" property
> >> (which is an abbreviated host system bus address). This exports the "tgt"
> >> as a capability so the guest can program it into the GPU so the GPU can
> >> know how to route DMA traffic.
> > 
> > I'm not really following what "tgt" is and why it's needed.  Is the GPU
> > memory here different than the GPU RAM region above?  Why does the user
> > need the host system bus address of this "tgt" thing?  Are we not able
> > to relocate it in guest physical address space, does this shootdown
> > only work in the host physical address space and therefore we need this
> > offset?  Please explain, I'm confused.  
> 
> 
> This "tgt" is made of:
> - "memory select" (bits 45, 46)
> - "group select" (bits 43, 44)
> - "chip select" (bit 42)
> - chip internal address (bits 0..41)
> 
> These are internal to GPU and this is where GPU RAM is mapped into the
> GPU's real space, this fits 46 bits.
> 
> On POWER9 CPU the bits are different and higher so the same memory is
> mapped higher on P9 CPU. Just because we can map it higher, I guess.
> 
> So it is not exactly the address but this provides the exact physical
> location of the memory.
> 
> We have a group of 3 interconnected GPUs, they got their own
> memory/group/chip numbers. The GPUs use ATS service to translate
> userspace to physical (host or guest) addresses. Now a GPU needs to know
> which specific link to use for a specific physical address, in other
> words what this physical address belongs to - a CPU or one of GPUs. This
> is when "tgt" is used by the GPU hardware.

Clear as mud ;)  So tgt, provided by the npu2 capability of the ATSD
region of the NPU tells the GPU (a completely separate device) how to
route to its own RAM via its NVLink interface?  How can one tgt
indicate the routing for multiple interfaces?

> A GPU could run all the DMA traffic via the system bus indeed, just not
> as fast.
> 
> I am also struggling here and adding an Nvidia person in cc: (I should
> have done that when I posted the patches, my bad) to correct when/if I
> am wrong.
> 
> 
> 
> >
> >> For ATS to work, the nest MMU (an NVIDIA block in a P9 CPU) needs to
> >> know LPID (a logical partition ID or a KVM guest hardware ID in other
> >> words) and PID (a memory context ID of a userspace process, not to be
> >> confused with a linux pid). This assigns a GPU to LPID in the NPU and
> >> this is why this adds a listener for KVM on an IOMMU group. A PID comes
> >> via NVLink from a GPU and NPU uses a PID wildcard to pass it through.
> >>
> >> This requires coherent memory and ATSD to be available on the host as
> >> the GPU vendor only supports configurations with both features enabled
> >> and other configurations are known not to work. Because of this and
> >> because of the ways the features are advertised to the host system
> >> (which is a device tree with very platform specific properties),

Re: [PATCH kernel 3/3] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver

2018-10-16 Thread Alex Williamson
@@ -303,6 +303,12 @@ struct vfio_region_info_cap_type {
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG   (2)
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG(3)
>  
> +/* NVIDIA GPU NVlink2 RAM */
> +#define VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2_RAM   (1)
> +
> +/* IBM NPU NVlink2 ATSD */
> +#define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD (1)
> +

Please include some of the description in the commitlog here for
reference.  Also please be explicit that these are vendor defined
regions and note the numerical vendor ID associated with them.

>  /*
>   * The MSIX mappable capability informs that MSIX data of a BAR can be 
> mmapped
>   * which allows direct access to non-MSIX registers which happened to be 
> within
> @@ -313,6 +319,18 @@ struct vfio_region_info_cap_type {
>   */
>  #define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE   3
>  
> +/*
> + * Capability with compressed real address (aka SSA - small system address)
> + * where GPU RAM is mapped on a system bus. Used by a GPU for DMA routing.
> + */
> +#define VFIO_REGION_INFO_CAP_NPU24
> +
> +struct vfio_region_info_cap_npu2 {
> + struct vfio_info_cap_header header;
> + __u64 tgt;
> + /* size is defined in VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2_RAM */

But this is a capability for the IBM_NVLINK2_ATSD?  What is the
relevance of this comment?  Is this capability relevant to the RAM or
ATSD?

> +};
> +
>  /**
>   * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,
>   *   struct vfio_irq_info)
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 4a3b93e..e9afd43 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -224,6 +224,16 @@ static bool vfio_pci_nointx(struct pci_dev *pdev)
>   return false;
>  }
>  
> +int __weak vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_device *vdev)
> +{
> + return -ENODEV;
> +}
> +
> +int __weak vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev)
> +{
> + return -ENODEV;
> +}
> +
>  static int vfio_pci_enable(struct vfio_pci_device *vdev)
>  {
>   struct pci_dev *pdev = vdev->pdev;
> @@ -302,14 +312,37 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
>   if (ret) {
>   dev_warn(&vdev->pdev->dev,
>"Failed to setup Intel IGD regions\n");
> - vfio_pci_disable(vdev);
> - return ret;
> + goto disable_exit;
> + }
> + }
> +
> + if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
> + pdev->device == 0x1db1) {
> + ret = vfio_pci_nvdia_v100_nvlink2_init(vdev);
> + if (ret) {
> + dev_warn(&vdev->pdev->dev,
> +  "Failed to setup NVIDIA NV2 RAM region\n");
> + goto disable_exit;
> + }
> + }

This device ID is not unique to POWER9 Witherspoon systems, I see your
comment in the commitlog, but this is clearly going to generate a
dev_warn and failure on an x86 system with the same hardware.  Perhaps
this could be masked off with IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2) like
the IGD code above this chunk does?

> +
> + if (pdev->vendor == PCI_VENDOR_ID_IBM &&
> + pdev->device == 0x04ea) {
> + ret = vfio_pci_ibm_npu2_init(vdev);
> + if (ret) {
> + dev_warn(&vdev->pdev->dev,
> + "Failed to setup NVIDIA NV2 ATSD 
> region\n");
> + goto disable_exit;
>   }

So the NPU is also actually owned by vfio-pci and assigned to the VM?

>   }
>  
>   vfio_pci_probe_mmaps(vdev);
>  
>   return 0;
> +
> +disable_exit:
> + vfio_pci_disable(vdev);
> + return ret;
>  }
>  
>  static void vfio_pci_disable(struct vfio_pci_device *vdev)
> diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c 
> b/drivers/vfio/pci/vfio_pci_nvlink2.c
> new file mode 100644
> index 000..c9d2b55
> --- /dev/null
> +++ b/drivers/vfio/pci/vfio_pci_nvlink2.c
> @@ -0,0 +1,409 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +/*
> + * VFIO PCI NVIDIA Whitherspoon GPU support a.k.a. NVLink2.
> + *
> + * Copyright (C) 2018 IBM Corp.  All rights reserved.
> + * Author: Alexey Kardashevskiy 
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * Register an on-GPU RAM region for cacheable access.

Re: [RFC PATCH kernel] vfio/spapr_tce: Get rid of possible infinite loop

2018-10-03 Thread Alex Williamson
On Tue,  2 Oct 2018 13:22:31 +1000
Alexey Kardashevskiy  wrote:

> As a part of cleanup, the SPAPR TCE IOMMU subdriver releases preregistered
> memory. If there is a bug in memory release, the loop in
> tce_iommu_release() becomes infinite; this actually happened to me.
> 
> This makes the loop finite and prints a warning on every failure, to
> make bugs more visible.
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
>  drivers/vfio/vfio_iommu_spapr_tce.c | 10 +++---
>  1 file changed, 3 insertions(+), 7 deletions(-)

Should this have a stable/fixes tag?  Looks like it's relevant to:

4b6fad7097f8 powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown

Also, not sure who you're wanting to take this since it was sent to ppc
lists.  If it's me, let me know, otherwise

Acked-by: Alex Williamson 

Thanks,
Alex

> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
> b/drivers/vfio/vfio_iommu_spapr_tce.c
> index b1a8ab3..ece0651 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -371,6 +371,7 @@ static void tce_iommu_release(void *iommu_data)
>  {
>   struct tce_container *container = iommu_data;
>   struct tce_iommu_group *tcegrp;
> + struct tce_iommu_prereg *tcemem, *tmtmp;
>   long i;
>  
>   while (tce_groups_attached(container)) {
> @@ -393,13 +394,8 @@ static void tce_iommu_release(void *iommu_data)
>   tce_iommu_free_table(container, tbl);
>   }
>  
> - while (!list_empty(&container->prereg_list)) {
> - struct tce_iommu_prereg *tcemem;
> -
> - tcemem = list_first_entry(&container->prereg_list,
> - struct tce_iommu_prereg, next);
> - WARN_ON_ONCE(tce_iommu_prereg_free(container, tcemem));
> - }
> + list_for_each_entry_safe(tcemem, tmtmp, &container->prereg_list, next)
> + WARN_ON(tce_iommu_prereg_free(container, tcemem));
>  
>   tce_iommu_disable(container);
>   if (container->mm)



Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

2018-08-09 Thread Alex Williamson
On Thu, 9 Aug 2018 14:21:29 +1000
Alexey Kardashevskiy  wrote:

> On 08/08/2018 18:39, Alexey Kardashevskiy wrote:
> > 
> > 
> > On 02/08/2018 02:16, Alex Williamson wrote:  
> >> On Wed, 1 Aug 2018 18:37:35 +1000
> >> Alexey Kardashevskiy  wrote:
> >>  
> >>> On 01/08/2018 00:29, Alex Williamson wrote:  
> >>>> On Tue, 31 Jul 2018 14:03:35 +1000
> >>>> Alexey Kardashevskiy  wrote:
> >>>> 
> >>>>> On 31/07/2018 02:29, Alex Williamson wrote:
> >>>>>> On Mon, 30 Jul 2018 18:58:49 +1000
> >>>>>> Alexey Kardashevskiy  wrote:
> >>>>>>> After some local discussions, it was pointed out that force disabling
> >>>>>>> nvlinks won't bring us much as for an nvlink to work, both sides need 
> >>>>>>> to
> >>>>>>> enable it so malicious guests cannot penetrate good ones (or a host)
> >>>>>>> unless a good guest enabled the link but won't happen with a well
> >>>>>>> behaving guest. And if two guests became malicious, then can still 
> >>>>>>> only
> >>>>>>> harm each other, and so can they via other ways such network. This is
> >>>>>>> different from PCIe as once PCIe link is unavoidably enabled, a well
> >>>>>>> behaving device cannot firewall itself from peers as it is up to the
> >>>>>>> upstream bridge(s) now to decide the routing; with nvlink2, a GPU 
> >>>>>>> still
> >>>>>>> has means to protect itself, just like a guest can run "firewalld" for
> >>>>>>> network.
> >>>>>>>
> >>>>>>> Although it would be a nice feature to have an extra barrier between
> >>>>>>> GPUs, is inability to block the links in hypervisor still a blocker 
> >>>>>>> for
> >>>>>>> V100 pass through?  
> >>>>>>
> >>>>>> How is the NVLink configured by the guest, is it 'on'/'off' or are
> >>>>>> specific routes configured?   
> >>>>>
> >>>>> The GPU-GPU links need not be blocked and need to be enabled
> >>>>> (==trained) by a driver in the guest. There are no routes between GPUs
> >>>>> in NVLink fabric, these are direct links, it is just a switch on each
> >>>>> side, both switches need to be on for a link to work.
> >>>>
> >>>> Ok, but there is at least the possibility of multiple direct links per
> >>>> GPU, the very first diagram I find of NVlink shows 8 interconnected
> >>>> GPUs:
> >>>>
> >>>> https://www.nvidia.com/en-us/data-center/nvlink/
> >>>
> >>> Our design is like the left part of the picture but it is just a detail.
> >>
> >> Unless we can specifically identify a direct link vs a mesh link, we
> >> shouldn't be making assumptions about the degree of interconnect.
> >>
> >>>> So if each switch enables one direct, point to point link, how does the
> >>>> guest know which links to open for which peer device?
> >>>
> >>> It uses PCI config space on GPUs to discover the topology.  
> >>
> >> So do we need to virtualize this config space if we're going to
> >> virtualize the topology?
> >>  
> >>>> And of course
> >>>> since we can't see the spec, a security audit is at best hearsay :-\
> >>>
> >>> Yup, the exact discovery protocol is hidden.  
> >>
> >> It could be reverse engineered...
> >>  
> >>>>> The GPU-CPU links - the GPU bit is the same switch, the CPU NVlink state
> >>>>> is controlled via the emulated PCI bridges which I pass through together
> >>>>> with the GPU.
> >>>>
> >>>> So there's a special emulated switch, is that how the guest knows which
> >>>> GPUs it can enable NVLinks to?
> >>>
> >>> Since it only has PCI config space (there is nothing relevant in the
> >>> device tree at all), I assume (double checking with the NVIDIA folks
> >>> now) the guest driver enables them all, tests which pair works and
> >>> disables the ones which do not. This gives a malicious guest a tiny
> >>> window of opportunity to br

Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

2018-08-01 Thread Alex Williamson
On Wed, 1 Aug 2018 18:37:35 +1000
Alexey Kardashevskiy  wrote:

> On 01/08/2018 00:29, Alex Williamson wrote:
> > On Tue, 31 Jul 2018 14:03:35 +1000
> > Alexey Kardashevskiy  wrote:
> >   
> >> On 31/07/2018 02:29, Alex Williamson wrote:  
> >>> On Mon, 30 Jul 2018 18:58:49 +1000
> >>> Alexey Kardashevskiy  wrote:  
> >>>> After some local discussions, it was pointed out that force disabling
> >>>> nvlinks won't bring us much as for an nvlink to work, both sides need to
> >>>> enable it so malicious guests cannot penetrate good ones (or a host)
> >>>> unless a good guest enabled the link but won't happen with a well
> >>>> behaving guest. And if two guests became malicious, then they can still
> >>>> only harm each other, and they can do so via other ways such as the
> >>>> network. This is
> >>>> different from PCIe as once PCIe link is unavoidably enabled, a well
> >>>> behaving device cannot firewall itself from peers as it is up to the
> >>>> upstream bridge(s) now to decide the routing; with nvlink2, a GPU still
> >>>> has means to protect itself, just like a guest can run "firewalld" for
> >>>> network.
> >>>>
> >>>> Although it would be a nice feature to have an extra barrier between
> >>>> GPUs, is inability to block the links in hypervisor still a blocker for
> >>>> V100 pass through?
> >>>
> >>> How is the NVLink configured by the guest, is it 'on'/'off' or are
> >>> specific routes configured? 
> >>
> >> The GPU-GPU links need not be blocked and need to be enabled
> >> (==trained) by a driver in the guest. There are no routes between GPUs
> >> in NVLink fabric, these are direct links, it is just a switch on each
> >> side, both switches need to be on for a link to work.  
> > 
> > Ok, but there is at least the possibility of multiple direct links per
> > GPU, the very first diagram I find of NVlink shows 8 interconnected
> > GPUs:
> > 
> > https://www.nvidia.com/en-us/data-center/nvlink/  
> 
> Our design is like the left part of the picture but it is just a detail.

Unless we can specifically identify a direct link vs a mesh link, we
shouldn't be making assumptions about the degree of interconnect.
 
> > So if each switch enables one direct, point to point link, how does the
> > guest know which links to open for which peer device?  
> 
> It uses PCI config space on GPUs to discover the topology.

So do we need to virtualize this config space if we're going to
virtualize the topology?

> > And of course
> > since we can't see the spec, a security audit is at best hearsay :-\  
> 
> Yup, the exact discovery protocol is hidden.

It could be reverse engineered...

> >> The GPU-CPU links - the GPU bit is the same switch, the CPU NVlink state
> >> is controlled via the emulated PCI bridges which I pass through together
> >> with the GPU.  
> > 
> > So there's a special emulated switch, is that how the guest knows which
> > GPUs it can enable NVLinks to?  
> 
> Since it only has PCI config space (there is nothing relevant in the
> device tree at all), I assume (double checking with the NVIDIA folks
> now) the guest driver enables them all, tests which pair works and
> disables the ones which do not. This gives a malicious guest a tiny
> window of opportunity to break into a good guest. Hm :-/

Let's not minimize that window, that seems like a prime candidate for
an exploit.

> >>> If the former, then isn't a non-malicious
> >>> guest still susceptible to a malicious guest?
> >>
> >> A non-malicious guest needs to turn its switch on for a link to a GPU
> >> which belongs to a malicious guest.  
> > 
> > Actual security, or obfuscation, will we ever know...  
> >>>> If the latter, how is  
> >>> routing configured by the guest given that the guest view of the
> >>> topology doesn't match physical hardware?  Are these routes
> >>> deconfigured by device reset?  Are they part of the save/restore
> >>> state?  Thanks,
> > 
> > Still curious what happens to these routes on reset.  Can a later user
> > of a GPU inherit a device where the links are already enabled?  Thanks,  
> 
> I am told that the GPU reset disables links. As a side effect, we get an
> HMI (a hardware fault which resets the host machine) when trying to
> access the GPU RAM, which indicates that the link is down as the
> memory is only accessible via the nvlink. We have special fencing code
> in our host firmware (skiboot) to fence this memory on PCI reset so
> reading from it returns zeroes instead of HMIs.

What sort of reset is required for this?  Typically we rely on
secondary bus reset for GPUs, but it would be a problem if GPUs were to
start implementing FLR and nobody had a spec to learn that FLR maybe
didn't disable the link.  The better approach to me still seems to be
virtualizing these NVLink config registers to an extent that the user
can only enabling links where they have ownership of both ends of the
connection.  Thanks,

Alex


Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

2018-07-31 Thread Alex Williamson
On Tue, 31 Jul 2018 14:03:35 +1000
Alexey Kardashevskiy  wrote:

> On 31/07/2018 02:29, Alex Williamson wrote:
> > On Mon, 30 Jul 2018 18:58:49 +1000
> > Alexey Kardashevskiy  wrote:
> >> After some local discussions, it was pointed out that force disabling
> >> nvlinks won't bring us much as for an nvlink to work, both sides need to
> >> enable it so malicious guests cannot penetrate good ones (or a host)
> >> unless a good guest enabled the link but won't happen with a well
> >> behaving guest. And if two guests became malicious, then can still only
> >> harm each other, and so can they via other ways such network. This is
> >> different from PCIe as once PCIe link is unavoidably enabled, a well
> >> behaving device cannot firewall itself from peers as it is up to the
> >> upstream bridge(s) now to decide the routing; with nvlink2, a GPU still
> >> has means to protect itself, just like a guest can run "firewalld" for
> >> network.
> >>
> >> Although it would be a nice feature to have an extra barrier between
> >> GPUs, is inability to block the links in hypervisor still a blocker for
> >> V100 pass through?  
> > 
> > How is the NVLink configured by the guest, is it 'on'/'off' or are
> > specific routes configured?   
> 
> The GPU-GPU links need not be blocked and need to be enabled
> (==trained) by a driver in the guest. There are no routes between GPUs
> in NVLink fabric, these are direct links, it is just a switch on each
> side, both switches need to be on for a link to work.

Ok, but there is at least the possibility of multiple direct links per
GPU, the very first diagram I find of NVlink shows 8 interconnected
GPUs:

https://www.nvidia.com/en-us/data-center/nvlink/

So if each switch enables one direct, point to point link, how does the
guest know which links to open for which peer device?  And of course
since we can't see the spec, a security audit is at best hearsay :-\
 
> The GPU-CPU links - the GPU bit is the same switch, the CPU NVlink state
> is controlled via the emulated PCI bridges which I pass through together
> with the GPU.

So there's a special emulated switch, is that how the guest knows which
GPUs it can enable NVLinks to?

> > If the former, then isn't a non-malicious
> > guest still susceptible to a malicious guest?  
> 
> A non-malicious guest needs to turn its switch on for a link to a GPU
> which belongs to a malicious guest.

Actual security, or obfuscation, will we ever know...

> > If the latter, how is
> > routing configured by the guest given that the guest view of the
> > topology doesn't match physical hardware?  Are these routes
> > deconfigured by device reset?  Are they part of the save/restore
> > state?  Thanks,  

Still curious what happens to these routes on reset.  Can a later user
of a GPU inherit a device where the links are already enabled?  Thanks,

Alex


Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

2018-07-30 Thread Alex Williamson
On Mon, 30 Jul 2018 18:58:49 +1000
Alexey Kardashevskiy  wrote:

> On 11/07/2018 19:26, Alexey Kardashevskiy wrote:
> > On Tue, 10 Jul 2018 16:37:15 -0600
> > Alex Williamson  wrote:
> >   
> >> On Tue, 10 Jul 2018 14:10:20 +1000
> >> Alexey Kardashevskiy  wrote:
> >>  
> >>> On Thu, 7 Jun 2018 23:03:23 -0600
> >>> Alex Williamson  wrote:
> >>> 
> >>>> On Fri, 8 Jun 2018 14:14:23 +1000
> >>>> Alexey Kardashevskiy  wrote:
> >>>>   
> >>>>> On 8/6/18 1:44 pm, Alex Williamson wrote:    
> >>>>>> On Fri, 8 Jun 2018 13:08:54 +1000
> >>>>>> Alexey Kardashevskiy  wrote:
> >>>>>>   
> >>>>>>> On 8/6/18 8:15 am, Alex Williamson wrote:  
> >>>>>>>> On Fri, 08 Jun 2018 07:54:02 +1000
> >>>>>>>> Benjamin Herrenschmidt  wrote:
> >>>>>>>> 
> >>>>>>>>> On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:   
> >>>>>>>>>  
> >>>>>>>>>>
> >>>>>>>>>> Can we back up and discuss whether the IOMMU grouping of NVLink
> >>>>>>>>>> connected devices makes sense?  AIUI we have a PCI view of these
> >>>>>>>>>> devices and from that perspective they're isolated.  That's the 
> >>>>>>>>>> view of
> >>>>>>>>>> the device used to generate the grouping.  However, not visible to 
> >>>>>>>>>> us,
> >>>>>>>>>> these devices are interconnected via NVLink.  What isolation 
> >>>>>>>>>> properties
> >>>>>>>>>> does NVLink provide given that its entire purpose for existing 
> >>>>>>>>>> seems to
> >>>>>>>>>> be to provide a high performance link for p2p between devices? 
> >>>>>>>>>>  
> >>>>>>>>>
> >>>>>>>>> Not entire. On POWER chips, we also have an nvlink between the 
> >>>>>>>>> device
> >>>>>>>>> and the CPU which is running significantly faster than PCIe.
> >>>>>>>>>
> >>>>>>>>> But yes, there are cross-links and those should probably be 
> >>>>>>>>> accounted
> >>>>>>>>> for in the grouping.
> >>>>>>>>
> >>>>>>>> Then after we fix the grouping, can we just let the host driver 
> >>>>>>>> manage
> >>>>>>>> this coherent memory range and expose vGPUs to guests?  The use case 
> >>>>>>>> of
> >>>>>>>> assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
> >>>>>>>> convince NVIDIA to support more than a single vGPU per VM though)
> >>>>>>>> 
> >>>>>>>
> >>>>>>> These are physical GPUs, not virtual sriov-alike things they are
> >>>>>>> implementing as well elsewhere.  
> >>>>>>
> >>>>>> vGPUs as implemented on M- and P-series Teslas aren't SR-IOV like
> >>>>>> either.  That's why we have mdev devices now to implement software
> >>>>>> defined devices.  I don't have first hand experience with V-series, but
> >>>>>> I would absolutely expect a PCIe-based Tesla V100 to support vGPU. 
> >>>>>>  
> >>>>>
> >>>>> So assuming V100 can do vGPU, you are suggesting ditching this patchset 
> >>>>> and
> >>>>> using mediated vGPUs instead, correct?
> >>>>
> >>>> If it turns out that our PCIe-only-based IOMMU grouping doesn't
> >>>> account for lack of isolation on the NVLink side and we correct that,
> >>>> limiting assignment to sets of 3 interconnected GPUs, is that still a
> >>>> useful feature?  OTOH, it's entirely an NVIDIA proprietary decision
> >>>> whether they choose to support vGPU on these GPUs or whether they can
> >>>> be convinced to support multiple vGPUs per VM.
> >>>>  

Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

2018-07-10 Thread Alex Williamson
On Tue, 10 Jul 2018 14:10:20 +1000
Alexey Kardashevskiy  wrote:

> On Thu, 7 Jun 2018 23:03:23 -0600
> Alex Williamson  wrote:
> 
> > On Fri, 8 Jun 2018 14:14:23 +1000
> > Alexey Kardashevskiy  wrote:
> >   
> > > On 8/6/18 1:44 pm, Alex Williamson wrote:
> > > > On Fri, 8 Jun 2018 13:08:54 +1000
> > > > Alexey Kardashevskiy  wrote:
> > > >   
> > > >> On 8/6/18 8:15 am, Alex Williamson wrote:  
> > > >>> On Fri, 08 Jun 2018 07:54:02 +1000
> > > >>> Benjamin Herrenschmidt  wrote:
> > > >>> 
> > > >>>> On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:
> > > >>>>>
> > > >>>>> Can we back up and discuss whether the IOMMU grouping of NVLink
> > > >>>>> connected devices makes sense?  AIUI we have a PCI view of these
> > > >>>>> devices and from that perspective they're isolated.  That's the 
> > > >>>>> view of
> > > >>>>> the device used to generate the grouping.  However, not visible to 
> > > >>>>> us,
> > > >>>>> these devices are interconnected via NVLink.  What isolation 
> > > >>>>> properties
> > > >>>>> does NVLink provide given that its entire purpose for existing 
> > > >>>>> seems to
> > > >>>>> be to provide a high performance link for p2p between devices?  
> > > >>>>> 
> > > >>>>
> > > >>>> Not entire. On POWER chips, we also have an nvlink between the device
> > > >>>> and the CPU which is running significantly faster than PCIe.
> > > >>>>
> > > >>>> But yes, there are cross-links and those should probably be accounted
> > > >>>> for in the grouping.
> > > >>>
> > > >>> Then after we fix the grouping, can we just let the host driver manage
> > > >>> this coherent memory range and expose vGPUs to guests?  The use case 
> > > >>> of
> > > >>> assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
> > > >>> convince NVIDIA to support more than a single vGPU per VM though) 
> > > >>>
> > > >>
> > > >> These are physical GPUs, not virtual sriov-alike things they are
> > > >> implementing as well elsewhere.  
> > > > 
> > > > vGPUs as implemented on M- and P-series Teslas aren't SR-IOV like
> > > > either.  That's why we have mdev devices now to implement software
> > > > defined devices.  I don't have first hand experience with V-series, but
> > > > I would absolutely expect a PCIe-based Tesla V100 to support vGPU.  
> > > 
> > > So assuming V100 can do vGPU, you are suggesting ditching this patchset 
> > > and
> > > using mediated vGPUs instead, correct?
> > 
> > If it turns out that our PCIe-only-based IOMMU grouping doesn't
> > account for lack of isolation on the NVLink side and we correct that,
> > limiting assignment to sets of 3 interconnected GPUs, is that still a
> > useful feature?  OTOH, it's entirely an NVIDIA proprietary decision
> > whether they choose to support vGPU on these GPUs or whether they can
> > be convinced to support multiple vGPUs per VM.
> >   
> > > >> My current understanding is that every P9 chip in that box has some 
> > > >> NVLink2
> > > >> logic on it so each P9 is directly connected to 3 GPUs via PCIe and
> > > >> 2xNVLink2, and GPUs in that big group are interconnected by NVLink2 
> > > >> links
> > > >> as well.
> > > >>
> > > >> From small bits of information I have it seems that a GPU can perfectly
> > > >> work alone and if the NVIDIA driver does not see these interconnects
> > > >> (because we do not pass the rest of the big 3xGPU group to this 
> > > >> guest), it
> > > >> continues with a single GPU. There is an "nvidia-smi -r" big reset 
> > > >> hammer
> > > >> which simply refuses to work until all 3 GPUs are passed so there is 
> > > >> some
> > > >> distinction between passing 1 or 3 GPUs, and I am trying (as we speak) 
> > > >> to
> > > >> get a confirmation from NVIDIA that it is ok to pass just a single GPU.

Re: [PATCH kernel v2 1/2] vfio/spapr: Use IOMMU pageshift rather than pagesize

2018-06-30 Thread Alex Williamson
On Tue, 26 Jun 2018 15:59:25 +1000
Alexey Kardashevskiy  wrote:

> The size is always equal to 1 page so let's use this. Later on this will
> be used for other checks which use page shifts to check the granularity
> of access.
> 
> This should cause no behavioral change.
> 
> Reviewed-by: David Gibson 
> Signed-off-by: Alexey Kardashevskiy 
> ---
>  drivers/vfio/vfio_iommu_spapr_tce.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)

I assume a v3+ will go in through the ppc tree since the bulk of the
series is there.  For this,

Acked-by: Alex Williamson 

> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
> b/drivers/vfio/vfio_iommu_spapr_tce.c
> index 759a5bd..2da5f05 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -457,13 +457,13 @@ static void tce_iommu_unuse_page(struct tce_container 
> *container,
>  }
>  
>  static int tce_iommu_prereg_ua_to_hpa(struct tce_container *container,
> - unsigned long tce, unsigned long size,
> + unsigned long tce, unsigned long shift,
>   unsigned long *phpa, struct mm_iommu_table_group_mem_t **pmem)
>  {
>   long ret = 0;
>   struct mm_iommu_table_group_mem_t *mem;
>  
> - mem = mm_iommu_lookup(container->mm, tce, size);
> + mem = mm_iommu_lookup(container->mm, tce, 1ULL << shift);
>   if (!mem)
>   return -EINVAL;
>  
> @@ -487,7 +487,7 @@ static void tce_iommu_unuse_page_v2(struct tce_container 
> *container,
>   if (!pua)
>   return;
>  
> - ret = tce_iommu_prereg_ua_to_hpa(container, *pua, IOMMU_PAGE_SIZE(tbl),
> + ret = tce_iommu_prereg_ua_to_hpa(container, *pua, tbl->it_page_shift,
>   &hpa, &mem);
>   if (ret)
>   pr_debug("%s: tce %lx at #%lx was not cached, ret=%d\n",
> @@ -611,7 +611,7 @@ static long tce_iommu_build_v2(struct tce_container 
> *container,
>   entry + i);
>  
>   ret = tce_iommu_prereg_ua_to_hpa(container,
> - tce, IOMMU_PAGE_SIZE(tbl), &hpa, &mem);
> + tce, tbl->it_page_shift, &hpa, &mem);
>   if (ret)
>   break;
>  



Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

2018-06-07 Thread Alex Williamson
On Fri, 8 Jun 2018 14:14:23 +1000
Alexey Kardashevskiy  wrote:

> On 8/6/18 1:44 pm, Alex Williamson wrote:
> > On Fri, 8 Jun 2018 13:08:54 +1000
> > Alexey Kardashevskiy  wrote:
> >   
> >> On 8/6/18 8:15 am, Alex Williamson wrote:  
> >>> On Fri, 08 Jun 2018 07:54:02 +1000
> >>> Benjamin Herrenschmidt  wrote:
> >>> 
> >>>> On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:
> >>>>>
> >>>>> Can we back up and discuss whether the IOMMU grouping of NVLink
> >>>>> connected devices makes sense?  AIUI we have a PCI view of these
> >>>>> devices and from that perspective they're isolated.  That's the view of
> >>>>> the device used to generate the grouping.  However, not visible to us,
> >>>>> these devices are interconnected via NVLink.  What isolation properties
> >>>>> does NVLink provide given that its entire purpose for existing seems to
> >>>>> be to provide a high performance link for p2p between devices?  
> >>>>
> >>>> Not entire. On POWER chips, we also have an nvlink between the device
> >>>> and the CPU which is running significantly faster than PCIe.
> >>>>
> >>>> But yes, there are cross-links and those should probably be accounted
> >>>> for in the grouping.
> >>>
> >>> Then after we fix the grouping, can we just let the host driver manage
> >>> this coherent memory range and expose vGPUs to guests?  The use case of
> >>> assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
> >>> convince NVIDIA to support more than a single vGPU per VM though)
> >>
> >> These are physical GPUs, not virtual sriov-alike things they are
> >> implementing as well elsewhere.  
> > 
> > vGPUs as implemented on M- and P-series Teslas aren't SR-IOV like
> > either.  That's why we have mdev devices now to implement software
> > defined devices.  I don't have first hand experience with V-series, but
> > I would absolutely expect a PCIe-based Tesla V100 to support vGPU.  
> 
> So assuming V100 can do vGPU, you are suggesting ditching this patchset and
> using mediated vGPUs instead, correct?

If it turns out that our PCIe-only-based IOMMU grouping doesn't
account for lack of isolation on the NVLink side and we correct that,
limiting assignment to sets of 3 interconnected GPUs, is that still a
useful feature?  OTOH, it's entirely an NVIDIA proprietary decision
whether they choose to support vGPU on these GPUs or whether they can
be convinced to support multiple vGPUs per VM.

> >> My current understanding is that every P9 chip in that box has some NVLink2
> >> logic on it so each P9 is directly connected to 3 GPUs via PCIe and
> >> 2xNVLink2, and GPUs in that big group are interconnected by NVLink2 links
> >> as well.
> >>
> >> From small bits of information I have it seems that a GPU can perfectly
> >> work alone and if the NVIDIA driver does not see these interconnects
> >> (because we do not pass the rest of the big 3xGPU group to this guest), it
> >> continues with a single GPU. There is an "nvidia-smi -r" big reset hammer
> >> which simply refuses to work until all 3 GPUs are passed so there is some
> >> distinction between passing 1 or 3 GPUs, and I am trying (as we speak) to
> >> get a confirmation from NVIDIA that it is ok to pass just a single GPU.
> >>
> >> So we will either have 6 groups (one per GPU) or 2 groups (one per
> >> interconnected group).  
> > 
> > I'm not gaining much confidence that we can rely on isolation between
> > NVLink connected GPUs, it sounds like you're simply expecting that
> > proprietary code from NVIDIA on a proprietary interconnect from NVIDIA
> > is going to play nice and nobody will figure out how to do bad things
> > because... obfuscation?  Thanks,  
> 
> Well, we already believe that the proprietary firmware of an
> SR-IOV-capable adapter like Mellanox ConnectX is not doing bad things,
> how is this different in principle?

It seems like the scope and hierarchy are different.  Here we're
talking about exposing big discrete devices, which are peers of one
another (and have history of being reverse engineered), to userspace
drivers.  Once handed to userspace, each of those devices needs to be
considered untrusted.  In the case of SR-IOV, we typically have a
trusted host driver for the PF managing untrusted VFs.  We do rely on
some sanity in the hardware/firmware in isolating the VFs from each
other and from the PF.

Re: [RFC PATCH kernel 5/5] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver

2018-06-07 Thread Alex Williamson
On Fri, 8 Jun 2018 13:52:05 +1000
Alexey Kardashevskiy  wrote:

> On 8/6/18 1:35 pm, Alex Williamson wrote:
> > On Fri, 8 Jun 2018 13:09:13 +1000
> > Alexey Kardashevskiy  wrote:  
> >> On 8/6/18 3:04 am, Alex Williamson wrote:  
> >>> On Thu,  7 Jun 2018 18:44:20 +1000
> >>> Alexey Kardashevskiy  wrote:  
> >>>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> >>>> index 7bddf1e..38c9475 100644
> >>>> --- a/drivers/vfio/pci/vfio_pci.c
> >>>> +++ b/drivers/vfio/pci/vfio_pci.c
> >>>> @@ -306,6 +306,15 @@ static int vfio_pci_enable(struct vfio_pci_device 
> >>>> *vdev)
> >>>>  }
> >>>>  }
> >>>>  
> >>>> +if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
> >>>> +pdev->device == 0x1db1 &&
> >>>> +IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
> >>>
> >>> Can't we do better than check this based on device ID?  Perhaps PCIe
> >>> capability hints at this?
> >>
> >> A normal PCI pluggable device looks like this:
> >>
> >> root@fstn3:~# sudo lspci -vs 0000:03:00.0
> >> 0000:03:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
> >>Subsystem: NVIDIA Corporation GK210GL [Tesla K80]
> >>Flags: fast devsel, IRQ 497
> >>Memory at 3fe0 (32-bit, non-prefetchable) [disabled] [size=16M]
> >>Memory at 2000 (64-bit, prefetchable) [disabled] [size=16G]
> >>Memory at 2004 (64-bit, prefetchable) [disabled] [size=32M]
> >>Capabilities: [60] Power Management version 3
> >>Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
> >>Capabilities: [78] Express Endpoint, MSI 00
> >>Capabilities: [100] Virtual Channel
> >>Capabilities: [128] Power Budgeting 
> >>Capabilities: [420] Advanced Error Reporting
> >>Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
> >>Capabilities: [900] #19
> >>
> >>
> >> This is a NVLink v1 machine:
> >>
> >> aik@garrison1:~$ sudo lspci -vs 000a:01:00.0
> >> 000a:01:00.0 3D controller: NVIDIA Corporation Device 15fe (rev a1)
> >>Subsystem: NVIDIA Corporation Device 116b
> >>Flags: bus master, fast devsel, latency 0, IRQ 457
> >>Memory at 3fe3 (32-bit, non-prefetchable) [size=16M]
> >>Memory at 2600 (64-bit, prefetchable) [size=16G]
> >>Memory at 2604 (64-bit, prefetchable) [size=32M]
> >>Capabilities: [60] Power Management version 3
> >>Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
> >>Capabilities: [78] Express Endpoint, MSI 00
> >>Capabilities: [100] Virtual Channel
> >>Capabilities: [250] Latency Tolerance Reporting
> >>Capabilities: [258] L1 PM Substates
> >>Capabilities: [128] Power Budgeting 
> >>Capabilities: [420] Advanced Error Reporting
> >>Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
> >>Capabilities: [900] #19
> >>Kernel driver in use: nvidia
> >>Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384
> >>
> >>
> >> This is the one the patch is for:
> >>
> >> [aik@yc02goos ~]$ sudo lspci -vs 0035:03:00.0
> >> 0035:03:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2]
> >> (rev a1)
> >>Subsystem: NVIDIA Corporation Device 1212
> >>Flags: fast devsel, IRQ 82, NUMA node 8
> >>Memory at 620c28000 (32-bit, non-prefetchable) [disabled] [size=16M]
> >>Memory at 62280 (64-bit, prefetchable) [disabled] [size=16G]
> >>Memory at 62284 (64-bit, prefetchable) [disabled] [size=32M]
> >>Capabilities: [60] Power Management version 3
> >>Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
> >>Capabilities: [78] Express Endpoint, MSI 00
> >>Capabilities: [100] Virtual Channel
> >>Capabilities: [250] Latency Tolerance Reporting
> >>Capabilities: [258] L1 PM Substates
> >>Capabilities: [128] Power Budgeting 
> >>Capabilities: [420] Advanced Error Reporting
> >>Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
> >>Capabilities: [900] #19
> >>Capabilities: [ac0] #23
>

Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

2018-06-07 Thread Alex Williamson
On Fri, 8 Jun 2018 13:08:54 +1000
Alexey Kardashevskiy  wrote:

> On 8/6/18 8:15 am, Alex Williamson wrote:
> > On Fri, 08 Jun 2018 07:54:02 +1000
> > Benjamin Herrenschmidt  wrote:
> >   
> >> On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:  
> >>>
> >>> Can we back up and discuss whether the IOMMU grouping of NVLink
> >>> connected devices makes sense?  AIUI we have a PCI view of these
> >>> devices and from that perspective they're isolated.  That's the view of
> >>> the device used to generate the grouping.  However, not visible to us,
> >>> these devices are interconnected via NVLink.  What isolation properties
> >>> does NVLink provide given that its entire purpose for existing seems to
> >>> be to provide a high performance link for p2p between devices?
> >>
> >> Not entire. On POWER chips, we also have an nvlink between the device
> >> and the CPU which is running significantly faster than PCIe.
> >>
> >> But yes, there are cross-links and those should probably be accounted
> >> for in the grouping.  
> > 
> > Then after we fix the grouping, can we just let the host driver manage
> > this coherent memory range and expose vGPUs to guests?  The use case of
> > assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
> > convince NVIDIA to support more than a single vGPU per VM though)  
> 
> These are physical GPUs, not virtual sriov-alike things they are
> implementing as well elsewhere.

vGPUs as implemented on M- and P-series Teslas aren't SR-IOV like
either.  That's why we have mdev devices now to implement software
defined devices.  I don't have first hand experience with V-series, but
I would absolutely expect a PCIe-based Tesla V100 to support vGPU.

> My current understanding is that every P9 chip in that box has some NVLink2
> logic on it so each P9 is directly connected to 3 GPUs via PCIe and
> 2xNVLink2, and GPUs in that big group are interconnected by NVLink2 links
> as well.
> 
> From small bits of information I have it seems that a GPU can perfectly
> work alone and if the NVIDIA driver does not see these interconnects
> (because we do not pass the rest of the big 3xGPU group to this guest), it
> continues with a single GPU. There is an "nvidia-smi -r" big reset hammer
> which simply refuses to work until all 3 GPUs are passed so there is some
> distinction between passing 1 or 3 GPUs, and I am trying (as we speak) to
> get a confirmation from NVIDIA that it is ok to pass just a single GPU.
> 
> So we will either have 6 groups (one per GPU) or 2 groups (one per
> interconnected group).

I'm not gaining much confidence that we can rely on isolation between
NVLink connected GPUs, it sounds like you're simply expecting that
proprietary code from NVIDIA on a proprietary interconnect from NVIDIA
is going to play nice and nobody will figure out how to do bad things
because... obfuscation?  Thanks,

Alex


Re: [RFC PATCH kernel 5/5] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver

2018-06-07 Thread Alex Williamson
On Fri, 8 Jun 2018 13:09:13 +1000
Alexey Kardashevskiy  wrote:
> On 8/6/18 3:04 am, Alex Williamson wrote:
> > On Thu,  7 Jun 2018 18:44:20 +1000
> > Alexey Kardashevskiy  wrote:
> >> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> >> index 7bddf1e..38c9475 100644
> >> --- a/drivers/vfio/pci/vfio_pci.c
> >> +++ b/drivers/vfio/pci/vfio_pci.c
> >> @@ -306,6 +306,15 @@ static int vfio_pci_enable(struct vfio_pci_device 
> >> *vdev)
> >>}
> >>}
> >>  
> >> +  if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
> >> +  pdev->device == 0x1db1 &&
> >> +  IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {  
> > 
> > Can't we do better than check this based on device ID?  Perhaps PCIe
> > capability hints at this?  
> 
> A normal PCI pluggable device looks like this:
> 
> root@fstn3:~# sudo lspci -vs 0000:03:00.0
> 0000:03:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
>   Subsystem: NVIDIA Corporation GK210GL [Tesla K80]
>   Flags: fast devsel, IRQ 497
>   Memory at 3fe0 (32-bit, non-prefetchable) [disabled] [size=16M]
>   Memory at 2000 (64-bit, prefetchable) [disabled] [size=16G]
>   Memory at 2004 (64-bit, prefetchable) [disabled] [size=32M]
>   Capabilities: [60] Power Management version 3
>   Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
>   Capabilities: [78] Express Endpoint, MSI 00
>   Capabilities: [100] Virtual Channel
>   Capabilities: [128] Power Budgeting 
>   Capabilities: [420] Advanced Error Reporting
>   Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
>   Capabilities: [900] #19
> 
> 
> This is a NVLink v1 machine:
> 
> aik@garrison1:~$ sudo lspci -vs 000a:01:00.0
> 000a:01:00.0 3D controller: NVIDIA Corporation Device 15fe (rev a1)
>   Subsystem: NVIDIA Corporation Device 116b
>   Flags: bus master, fast devsel, latency 0, IRQ 457
>   Memory at 3fe3 (32-bit, non-prefetchable) [size=16M]
>   Memory at 2600 (64-bit, prefetchable) [size=16G]
>   Memory at 2604 (64-bit, prefetchable) [size=32M]
>   Capabilities: [60] Power Management version 3
>   Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
>   Capabilities: [78] Express Endpoint, MSI 00
>   Capabilities: [100] Virtual Channel
>   Capabilities: [250] Latency Tolerance Reporting
>   Capabilities: [258] L1 PM Substates
>   Capabilities: [128] Power Budgeting 
>   Capabilities: [420] Advanced Error Reporting
>   Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
>   Capabilities: [900] #19
>   Kernel driver in use: nvidia
>   Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384
> 
> 
> This is the one the patch is for:
> 
> [aik@yc02goos ~]$ sudo lspci -vs 0035:03:00.0
> 0035:03:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2]
> (rev a1)
>   Subsystem: NVIDIA Corporation Device 1212
>   Flags: fast devsel, IRQ 82, NUMA node 8
>   Memory at 620c28000 (32-bit, non-prefetchable) [disabled] [size=16M]
>   Memory at 62280 (64-bit, prefetchable) [disabled] [size=16G]
>   Memory at 62284 (64-bit, prefetchable) [disabled] [size=32M]
>   Capabilities: [60] Power Management version 3
>   Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
>   Capabilities: [78] Express Endpoint, MSI 00
>   Capabilities: [100] Virtual Channel
>   Capabilities: [250] Latency Tolerance Reporting
>   Capabilities: [258] L1 PM Substates
>   Capabilities: [128] Power Budgeting 
>   Capabilities: [420] Advanced Error Reporting
>   Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
>   Capabilities: [900] #19
>   Capabilities: [ac0] #23
>   Kernel driver in use: vfio-pci
> 
> 
> I can only see a new capability #23 which I have no idea about what it
> actually does - my latest PCIe spec is
> PCI_Express_Base_r3.1a_December7-2015.pdf and that only knows capabilities
> till #21, do you have any better spec? Does not seem promising anyway...

You could just look in include/uapi/linux/pci_regs.h and see that 23
(0x17) is a TPH Requester capability and google for that...  It's a TLP
processing hint related to cache processing for requests from system
specific interconnects.  Sounds rather promising.  Of course there's
also the vendor specific capability that might be probed if NVIDIA will
tell you what to look for and the init function you've implemented
looks for specific devicetree nodes.

Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

2018-06-07 Thread Alex Williamson
On Fri, 08 Jun 2018 10:58:54 +1000
Benjamin Herrenschmidt  wrote:

> On Thu, 2018-06-07 at 18:34 -0600, Alex Williamson wrote:
> > > We *can* allow individual GPUs to be passed through, either if somebody
> > > designs a system without cross links, or if the user is ok with the
> > > security risk as the guest driver will not enable them if it doesn't
> > > "find" both sides of them.  
> > 
> > If GPUs are not isolated and we cannot prevent them from probing each
> > other via these links, then I think we have an obligation to configure
> > grouping in a way that doesn't rely on a benevolent userspace.  Thanks,  
> 
> Well, it's a user decision, no ? Like how we used to let the user
> decide whether to pass-through things that have LSIs shared out of
> their domain.

No, users don't get to pinky swear they'll be good.  The kernel creates
IOMMU groups assuming the worst case isolation and malicious users.
Its the kernel's job to protect itself from users and to protect users
from each other.  Anything else is unsupportable.  The only way to
bypass the default grouping is to modify the kernel.  Thanks,

Alex


Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

2018-06-07 Thread Alex Williamson
On Fri, 08 Jun 2018 09:20:30 +1000
Benjamin Herrenschmidt  wrote:

> On Thu, 2018-06-07 at 16:15 -0600, Alex Williamson wrote:
> > On Fri, 08 Jun 2018 07:54:02 +1000
> > Benjamin Herrenschmidt  wrote:
> >   
> > > On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:  
> > > > 
> > > > Can we back up and discuss whether the IOMMU grouping of NVLink
> > > > connected devices makes sense?  AIUI we have a PCI view of these
> > > > devices and from that perspective they're isolated.  That's the view of
> > > > the device used to generate the grouping.  However, not visible to us,
> > > > these devices are interconnected via NVLink.  What isolation properties
> > > > does NVLink provide given that its entire purpose for existing seems to
> > > > be to provide a high performance link for p2p between devices?
> > > 
> > > Not entire. On POWER chips, we also have an nvlink between the device
> > > and the CPU which is running significantly faster than PCIe.
> > > 
> > > But yes, there are cross-links and those should probably be accounted
> > > for in the grouping.  
> > 
> > Then after we fix the grouping, can we just let the host driver manage
> > this coherent memory range and expose vGPUs to guests?  The use case of
> > assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
> > convince NVIDIA to support more than a single vGPU per VM though)
> > Thanks,  
> 
> I don't know about "vGPUs" and what nVidia may be cooking in that area.
> 
> The patched from Alexey allow for passing through the full thing, but
> they aren't trivial (there are additional issues, I'm not sure how
> covered they are, as we need to pay with the mapping attributes of
> portions of the GPU memory on the host side...).
> 
> Note: The cross-links are only per-socket so that would be 2 groups of
> 3.
> 
> We *can* allow individual GPUs to be passed through, either if somebody
> designs a system without cross links, or if the user is ok with the
> security risk as the guest driver will not enable them if it doesn't
> "find" both sides of them.

If GPUs are not isolated and we cannot prevent them from probing each
other via these links, then I think we have an obligation to configure
grouping in a way that doesn't rely on a benevolent userspace.  Thanks,

Alex


Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

2018-06-07 Thread Alex Williamson
On Fri, 08 Jun 2018 07:54:02 +1000
Benjamin Herrenschmidt  wrote:

> On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:
> > 
> > Can we back up and discuss whether the IOMMU grouping of NVLink
> > connected devices makes sense?  AIUI we have a PCI view of these
> > devices and from that perspective they're isolated.  That's the view of
> > the device used to generate the grouping.  However, not visible to us,
> > these devices are interconnected via NVLink.  What isolation properties
> > does NVLink provide given that its entire purpose for existing seems to
> > be to provide a high performance link for p2p between devices?  
> 
> Not entire. On POWER chips, we also have an nvlink between the device
> and the CPU which is running significantly faster than PCIe.
> 
> But yes, there are cross-links and those should probably be accounted
> for in the grouping.

Then after we fix the grouping, can we just let the host driver manage
this coherent memory range and expose vGPUs to guests?  The use case of
assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
convince NVIDIA to support more than a single vGPU per VM though)
Thanks,

Alex


Re: [RFC PATCH kernel 4/5] vfio_pci: Allow mapping extra regions

2018-06-07 Thread Alex Williamson
On Thu,  7 Jun 2018 18:44:19 +1000
Alexey Kardashevskiy  wrote:

What's an "extra region", -ENOCOMMITLOG

> Signed-off-by: Alexey Kardashevskiy 
> ---
>  drivers/vfio/pci/vfio_pci_private.h |  3 +++
>  drivers/vfio/pci/vfio_pci.c | 10 ++++++++--
>  2 files changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_private.h 
> b/drivers/vfio/pci/vfio_pci_private.h
> index cde3b5d..86aab05 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -59,6 +59,9 @@ struct vfio_pci_regops {
> size_t count, loff_t *ppos, bool iswrite);
>   void(*release)(struct vfio_pci_device *vdev,
>  struct vfio_pci_region *region);
> + int (*mmap)(struct vfio_pci_device *vdev,
> + struct vfio_pci_region *region,
> + struct vm_area_struct *vma);
>  };
>  
>  struct vfio_pci_region {
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 3729937..7bddf1e 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -1123,10 +1123,16 @@ static int vfio_pci_mmap(void *device_data, struct 
> vm_area_struct *vma)
>   return -EINVAL;
>   if ((vma->vm_flags & VM_SHARED) == 0)
>   return -EINVAL;
> + if (index >= VFIO_PCI_NUM_REGIONS) {
> + int regnum = index - VFIO_PCI_NUM_REGIONS;
> + struct vfio_pci_region *region = vdev->region + regnum;
> +
> + if (region && region->ops && region->ops->mmap)
> + return region->ops->mmap(vdev, region, vma);
> + return -EINVAL;
> + }
>   if (index >= VFIO_PCI_ROM_REGION_INDEX)
>   return -EINVAL;
> - if (!vdev->bar_mmap_supported[index])
> - return -EINVAL;

This seems unrelated.  Thanks,

Alex
  
>   phys_len = PAGE_ALIGN(pci_resource_len(pdev, index));
>   req_len = vma->vm_end - vma->vm_start;




Re: [RFC PATCH kernel 5/5] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver

2018-06-07 Thread Alex Williamson
On Thu,  7 Jun 2018 18:44:20 +1000
Alexey Kardashevskiy  wrote:

> Some POWER9 chips come with special NVLink2 links which provide
> cacheable memory access to the RAM physically located on NVIDIA GPU.
> This memory is presented to a host via the device tree but remains
> offline until the NVIDIA driver onlines it.
> 
> This exports the RAM to userspace as a new region so

> the NVIDIA driver in the guest can train these links and online GPU RAM.
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
>  drivers/vfio/pci/Makefile   |   1 +
>  drivers/vfio/pci/vfio_pci_private.h |   8 ++
>  include/uapi/linux/vfio.h   |   3 +
>  drivers/vfio/pci/vfio_pci.c |   9 ++
>  drivers/vfio/pci/vfio_pci_nvlink2.c | 190 
> 
>  drivers/vfio/pci/Kconfig|   4 +
>  6 files changed, 215 insertions(+)
>  create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c
> 
> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> index 76d8ec0..9662c06 100644
> --- a/drivers/vfio/pci/Makefile
> +++ b/drivers/vfio/pci/Makefile
> @@ -1,5 +1,6 @@
>  
>  vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
>  vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
> +vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
>  
>  obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
> diff --git a/drivers/vfio/pci/vfio_pci_private.h 
> b/drivers/vfio/pci/vfio_pci_private.h
> index 86aab05..7115b9b 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -160,4 +160,12 @@ static inline int vfio_pci_igd_init(struct 
> vfio_pci_device *vdev)
>   return -ENODEV;
>  }
>  #endif
> +#ifdef CONFIG_VFIO_PCI_NVLINK2
> +extern int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev);
> +#else
> +static inline int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev)
> +{
> + return -ENODEV;
> +}
> +#endif
>  #endif /* VFIO_PCI_PRIVATE_H */
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 1aa7b82..2fe8227 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -301,6 +301,9 @@ struct vfio_region_info_cap_type {
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG   (2)
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG(3)
>  
> +/* NVIDIA GPU NV2 */
> +#define VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2   (4)

You're continuing the Intel vendor ID sub-types for an NVIDIA vendor ID
subtype.  Each vendor has their own address space of sub-types.

> +
>  /*
>   * The MSIX mappable capability informs that MSIX data of a BAR can be 
> mmapped
>   * which allows direct access to non-MSIX registers which happened to be 
> within
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 7bddf1e..38c9475 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -306,6 +306,15 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
>   }
>   }
>  
> + if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
> + pdev->device == 0x1db1 &&
> + IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {

Can't we do better than check this based on device ID?  Perhaps PCIe
capability hints at this?

Is it worthwhile to continue with assigning the device in the !ENABLED
case?  For instance, maybe it would be better to provide a weak
definition of vfio_pci_nvlink2_init() that would cause us to fail here
if we don't have this device specific support enabled.  I realize
you're following the example set forth for IGD, but those regions are
optional, for better or worse.

> + ret = vfio_pci_nvlink2_init(vdev);
> + if (ret)
> + dev_warn(&vdev->pdev->dev,
> +  "Failed to setup NVIDIA NV2 RAM region\n");
> + }
> +
>   vfio_pci_probe_mmaps(vdev);
>  
>   return 0;
> diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c 
> b/drivers/vfio/pci/vfio_pci_nvlink2.c
> new file mode 100644
> index 000..451c5cb
> --- /dev/null
> +++ b/drivers/vfio/pci/vfio_pci_nvlink2.c
> @@ -0,0 +1,190 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +/*
> + * VFIO PCI NVIDIA Whitherspoon GPU support a.k.a. NVLink2.
> + *
> + * Copyright (C) 2018 IBM Corp.  All rights reserved.
> + * Author: Alexey Kardashevskiy 
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * Register an on-GPU RAM region for cacheable access.
> + *
> + * Derived from original vfio_pci_igd.c:
> + * Copyright

Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

2018-06-07 Thread Alex Williamson
On Thu,  7 Jun 2018 18:44:15 +1000
Alexey Kardashevskiy  wrote:

> Here is an rfc of some patches adding pass-through support
> for NVIDIA V100 GPU found in some POWER9 boxes.
> 
> The example P9 system has 6 GPUs, each accompanied with 2 bridges
> representing the hardware links (aka NVLink2):
> 
>  4  0004:04:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
>  5  0004:05:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
>  6  0004:06:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
>  4  0006:00:00.0 Bridge: IBM Device 04ea (rev 01)
>  4  0006:00:00.1 Bridge: IBM Device 04ea (rev 01)
>  5  0006:00:01.0 Bridge: IBM Device 04ea (rev 01)
>  5  0006:00:01.1 Bridge: IBM Device 04ea (rev 01)
>  6  0006:00:02.0 Bridge: IBM Device 04ea (rev 01)
>  6  0006:00:02.1 Bridge: IBM Device 04ea (rev 01)
> 10  0007:00:00.0 Bridge: IBM Device 04ea (rev 01)
> 10  0007:00:00.1 Bridge: IBM Device 04ea (rev 01)
> 11  0007:00:01.0 Bridge: IBM Device 04ea (rev 01)
> 11  0007:00:01.1 Bridge: IBM Device 04ea (rev 01)
> 12  0007:00:02.0 Bridge: IBM Device 04ea (rev 01)
> 12  0007:00:02.1 Bridge: IBM Device 04ea (rev 01)
> 10  0035:03:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
> 11  0035:04:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
> 12  0035:05:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
> 
> ^^ the number is an IOMMU group ID.

Can we back up and discuss whether the IOMMU grouping of NVLink
connected devices makes sense?  AIUI we have a PCI view of these
devices and from that perspective they're isolated.  That's the view of
the device used to generate the grouping.  However, not visible to us,
these devices are interconnected via NVLink.  What isolation properties
does NVLink provide given that its entire purpose for existing seems to
be to provide a high performance link for p2p between devices?
 
> Each bridge represents an additional hardware interface called "NVLink2",
> it is not a PCI link but a separate bus. The design inherits from the original
> NVLink from POWER8.
> 
> The new feature of V100 is 16GB of cache coherent memory on GPU board.
> This memory is presented to the host via the device tree and remains offline
> until the NVIDIA driver loads, trains NVLink2 (via the config space of these
> bridges above) and the nvidia-persistenced daemon then onlines it.
> The memory remains online as long as nvidia-persistenced is running, when
> it stops, it offlines the memory.
> 
> The number of GPUs suggests passing them through to a guest. However,
> in order to do so we cannot use the NVIDIA driver so we have a host with
> a 128GB window (bigger or equal to actual GPU RAM size) in a system memory
> with no page structs backing this window and we cannot touch this memory
> before the NVIDIA driver configures it in a host or a guest as
> HMI (hardware management interrupt?) occurs.

Having a lot of GPUs only suggests assignment to a guest if there's
actually isolation provided between those GPUs.  Otherwise we'd need to
assign them as one big group, which gets a lot less useful.  Thanks,

Alex

> On the example system the GPU RAM windows are located at:
> 0x0400  
> 0x0420  
> 0x0440  
> 0x2400  
> 0x2420  
> 0x2440  
> 
> So the complications are:
> 
> 1. cannot touch the GPU memory till it is trained, i.e. cannot add ptes
> to VFIO-to-userspace or guest-to-host-physical translations till
> the driver trains it (i.e. nvidia-persistenced has started), otherwise
> prefetching happens and HMI occurs; I am trying to get this changed
> somehow;
> 
> 2. since it appears as normal cache coherent memory, it will be used
> for DMA which means it has to be pinned and mapped in the host. Having
> no page structs makes it different from the usual case - we only need
> translate user addresses to host physical and map GPU RAM memory but
> pinning is not required.
> 
> This series maps GPU RAM via the GPU vfio-pci device so QEMU can then
> register this memory as a KVM memory slot and present memory nodes to
> the guest. Unless NVIDIA provides an userspace driver, this is no use
> for things like DPDK.
> 
> 
> There is another problem which the series does not address but worth
> mentioning - it is not strictly necessary to map GPU RAM to the guest
> exactly where it is in the host (I tested this to some extent), we still
> might want to represent the memory at the same offset as on the host
> which increases the size of a TCE table needed to cover such a huge
> window: (((0x2440 + 0x20) >> 16)*8)>>20 = 4556MB
> I am addressing this in a separate patchset by allocating indirect TCE
> levels on demand and using 16MB IOMMU pages in the guest as we can now
> back emulated pages with the smaller hardware ones.
> 
> 
> This is an RFC. Please comment. Thanks.
> 
> 
> 
> Alexey Kardashevskiy (5):
>   vfio/spapr_tce: Simplify page contained test
>   powerpc/iommu_context: Change 

Re: [PATCH kernel] vfio/spapr: Add trace points for map/unmap

2017-11-29 Thread Alex Williamson
On Thu, 23 Nov 2017 15:13:37 +1100
Alexey Kardashevskiy <a...@ozlabs.ru> wrote:

> On 17/11/17 17:58, Alexey Kardashevskiy wrote:
> > On 17/11/17 11:13, Alex Williamson wrote:  
> >> On Tue, 14 Nov 2017 10:47:12 +1100
> >> Alexey Kardashevskiy <a...@ozlabs.ru> wrote:
> >>  
> >>> On 27/10/17 14:00, Alexey Kardashevskiy wrote:  
> >>>> This adds trace_map/trace_unmap tracepoints to spapr driver. Type1 
> >>>> already
> >>>> uses these via the IOMMU API (iommu_map/__iommu_unmap).
> >>>>
> >>>> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
> >>
> >> Is this really legitimate to include tracepoints from a different  
> >> subsystem?  The vfio type1 backend gets these trace points by virtue of  
> >> it actually using the IOMMU API, it doesn't call them itself.  I'm kind
> >> of surprised these are actually available to be called from a module.  
> > 
> > They are explicitly exported:
> > 
> > EXPORT_TRACEPOINT_SYMBOL_GPL(map);
> > EXPORT_TRACEPOINT_SYMBOL_GPL(unmap);
> > 
> > I would think this is for things like drivers/vfio/vfio_iommu_spapr_tce.c ,
> > why else?...
> > 
> >   
> >> I suspect the way to do this is probably to define our own tracepoints
> >> in the vfio/spapr backend or insert tracepoints into the IOMMU layers
> >> that that code calls into rather than masquerading as tracepoints from
> >> a different subsystem.  Right?  
> > 
> > This makes sense too. But it is going to be just cut-n-paste and some
> > confusion -
> > /sys/kernel/debug/tracing/events/iommu/map will still be present but
> > won't work and
> > /sys/kernel/debug/tracing/events/vfio/vfio_iommu_spapr_tce/map will.

But iommu/map does work, it's just not called by the vfio spapr tce
backend, which is absolutely correct.  In fact, that's part of my
reservation about this approach is that it could be interpreted as
implying a call path that doesn't exist on this arch.

> > Judges? :)  
> 
> 
> Still nak? I discussed this locally, the conclusion was it is a matter of
> taste and this proposal is not that disgusting. Thanks.

Is that our goal now, proposals that aren't too disgusting?  I think
it's an opportunity to introduce our own tracing infrastructure and it
feels a lot more correct to do that than to generalize existing trace
points from other subsystems.  Thanks,

Alex


> >>>> ---
> >>>>
> >>>> Example:
> >>>>  qemu-system-ppc-8655  [096]   724.662740: unmap:IOMMU: 
> >>>> iova=0x3000 size=4096 unmapped_size=4096
> >>>>  qemu-system-ppc-8656  [104]   724.970912: map:  IOMMU: 
> >>>> iova=0x0800 paddr=0x7ffef7ff size=65536
> >>>> ---
> >>>>  drivers/vfio/vfio_iommu_spapr_tce.c | 12 ++--
> >>>>  1 file changed, 10 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
> >>>> b/drivers/vfio/vfio_iommu_spapr_tce.c
> >>>> index 63112c36ab2d..4531486c77c6 100644
> >>>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> >>>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> >>>> @@ -22,6 +22,7 @@
> >>>>  #include 
> >>>>  #include 
> >>>>  #include 
> >>>> +#include 
> >>>>  
> >>>>  #include 
> >>>>  #include 
> >>>> @@ -502,17 +503,19 @@ static int tce_iommu_clear(struct tce_container 
> >>>> *container,
> >>>>  struct iommu_table *tbl,
> >>>>  unsigned long entry, unsigned long pages)
> >>>>  {
> >>>> -unsigned long oldhpa;
> >>>> +unsigned long oldhpa, unmapped, firstentry = entry, totalpages 
> >>>> = pages;
> >>>>  long ret;
> >>>>  enum dma_data_direction direction;
> >>>>  
> >>>> -for ( ; pages; --pages, ++entry) {
> >>>> +for (unmapped = 0; pages; --pages, ++entry) {
> >>>>  direction = DMA_NONE;
> >>>>  oldhpa = 0;
> >>>> ret = iommu_tce_xchg(tbl, entry, &oldhpa, &direction);
> >>>>  if (ret)
> >>>>  continue;
> >>>>  
> >>>> +   ++unmapped;
> >>>> +
> >>>>  if (direction == DMA_NONE)
> >>>>  continue;
> >>>>  
> >>>> @@ -523,6 +526,9 @@ static int tce_iommu_clear(struct tce_container 
> >>>> *container,
> >>>>  
> >>>>  tce_iommu_unuse_page(container, oldhpa);
> >>>>  }
> >>>> +trace_unmap(firstentry << tbl->it_page_shift,
> >>>> +totalpages << tbl->it_page_shift,
> >>>> +unmapped << tbl->it_page_shift);
> >>>>  
> >>>>  return 0;
> >>>>  }
> >>>> @@ -965,6 +971,8 @@ static long tce_iommu_ioctl(void *iommu_data,
> >>>>  direction);
> >>>>  
> >>>>  iommu_flush_tce(tbl);
> >>>> +if (!ret)
> >>>> +trace_map(param.iova, param.vaddr, param.size);
> >>>>  
> >>>>  return ret;
> >>>>  }
> >>>> 
> >>>
> >>>  
> >>  
> > 
> >   
> 
> 



Re: [PATCH kernel] vfio/spapr: Add trace points for map/unmap

2017-11-16 Thread Alex Williamson
On Tue, 14 Nov 2017 10:47:12 +1100
Alexey Kardashevskiy  wrote:

> On 27/10/17 14:00, Alexey Kardashevskiy wrote:
> > This adds trace_map/trace_unmap tracepoints to spapr driver. Type1 already
> > uses these via the IOMMU API (iommu_map/__iommu_unmap).
> > 
> > Signed-off-by: Alexey Kardashevskiy   

Is this really legitimate to include tracepoints from a different
subsystem?  The vfio type1 backend gets these trace points by virtue of
it actually using the IOMMU API, it doesn't call them itself.  I'm kind
of surprised these are actually available to be called from a module.
I suspect the way to do this is probably to define our own tracepoints
in the vfio/spapr backend or insert tracepoints into the IOMMU layers
that that code calls into rather than masquerading as tracepoints from
a different subsystem.  Right?  Thanks,

Alex

> > ---
> > 
> > Example:
> >  qemu-system-ppc-8655  [096]   724.662740: unmap:IOMMU: 
> > iova=0x3000 size=4096 unmapped_size=4096
> >  qemu-system-ppc-8656  [104]   724.970912: map:  IOMMU: 
> > iova=0x0800 paddr=0x7ffef7ff size=65536
> > ---
> >  drivers/vfio/vfio_iommu_spapr_tce.c | 12 ++--
> >  1 file changed, 10 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
> > b/drivers/vfio/vfio_iommu_spapr_tce.c
> > index 63112c36ab2d..4531486c77c6 100644
> > --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> > +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> > @@ -22,6 +22,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  
> >  #include 
> >  #include 
> > @@ -502,17 +503,19 @@ static int tce_iommu_clear(struct tce_container 
> > *container,
> > struct iommu_table *tbl,
> > unsigned long entry, unsigned long pages)
> >  {
> > -   unsigned long oldhpa;
> > +   unsigned long oldhpa, unmapped, firstentry = entry, totalpages = pages;
> > long ret;
> > enum dma_data_direction direction;
> >  
> > -   for ( ; pages; --pages, ++entry) {
> > +   for (unmapped = 0; pages; --pages, ++entry) {
> > direction = DMA_NONE;
> > oldhpa = 0;
> > ret = iommu_tce_xchg(tbl, entry, &oldhpa, &direction);
> > if (ret)
> > continue;
> >  
> > +   ++unmapped;
> > +
> > if (direction == DMA_NONE)
> > continue;
> >  
> > @@ -523,6 +526,9 @@ static int tce_iommu_clear(struct tce_container 
> > *container,
> >  
> > tce_iommu_unuse_page(container, oldhpa);
> > }
> > +   trace_unmap(firstentry << tbl->it_page_shift,
> > +   totalpages << tbl->it_page_shift,
> > +   unmapped << tbl->it_page_shift);
> >  
> > return 0;
> >  }
> > @@ -965,6 +971,8 @@ static long tce_iommu_ioctl(void *iommu_data,
> > direction);
> >  
> > iommu_flush_tce(tbl);
> > +   if (!ret)
> > +   trace_map(param.iova, param.vaddr, param.size);
> >  
> > return ret;
> > }
> >   
> 
> 



Re: [PATCH v3 2/2] pseries/eeh: Add Pseries pcibios_bus_add_device

2017-10-13 Thread Alex Williamson
On Fri, 13 Oct 2017 07:01:48 -0500
Steven Royer  wrote:

> On 2017-10-13 06:53, Steven Royer wrote:
> > On 2017-10-12 22:34, Bjorn Helgaas wrote:  
> >> [+cc Alex, Bodong, Eli, Saeed]
> >> 
> >> On Thu, Oct 12, 2017 at 02:59:23PM -0500, Bryant G. Ly wrote:  
> >>> On 10/12/17 1:29 PM, Bjorn Helgaas wrote:  
> >>> >On Thu, Oct 12, 2017 at 03:09:53PM +1100, Michael Ellerman wrote:  
> >>> >>Bjorn Helgaas  writes:
> >>> >>  
> >>> >>>On Fri, Sep 22, 2017 at 09:19:28AM -0500, Bryant G. Ly wrote:  
> >>> This patch adds the machine dependent call for
> >>> pcibios_bus_add_device, since the previous patch
> >>> separated the calls out between the PowerNV and PowerVM.
> >>> 
> >>> The difference here is that for the PowerVM environment
> >>> we do not want match_driver set because in this environment
> >>> we do not want the VF device drivers to load immediately, due to
> >>> firmware loading the device node when VF device is assigned to the
> >>> logical partition.
> >>> 
> >>> This patch will depend on the patch linked below, which is under
> >>> review.
> >>> 
> >>> https://patchwork.kernel.org/patch/9882915/
> >>> 
> >>> Signed-off-by: Bryant G. Ly 
> >>> Signed-off-by: Juan J. Alvarez 
> >>> ---
> >>>   arch/powerpc/platforms/pseries/eeh_pseries.c | 24 
> >>>  
> >>>   1 file changed, 24 insertions(+)
> >>> 
> >>> diff --git a/arch/powerpc/platforms/pseries/eeh_pseries.c 
> >>> b/arch/powerpc/platforms/pseries/eeh_pseries.c
> >>> index 6b812ad990e4..45946ee90985 100644
> >>> --- a/arch/powerpc/platforms/pseries/eeh_pseries.c
> >>> +++ b/arch/powerpc/platforms/pseries/eeh_pseries.c
> >>> @@ -64,6 +64,27 @@ static unsigned char 
> >>> slot_errbuf[RTAS_ERROR_LOG_MAX];
> >>>   static DEFINE_SPINLOCK(slot_errbuf_lock);
> >>>   static int eeh_error_buf_size;
> >>> +void pseries_pcibios_bus_add_device(struct pci_dev *pdev)
> >>> +{
> >>> + struct pci_dn *pdn = pci_get_pdn(pdev);
> >>> +
> >>> + if (!pdev->is_virtfn)
> >>> + return;
> >>> +
> >>> + pdn->device_id  =  pdev->device;
> >>> + pdn->vendor_id  =  pdev->vendor;
> >>> + pdn->class_code =  pdev->class;
> >>> +
> >>> + /*
> >>> +  * The following operations will fail if VF's sysfs files
> >>> +  * aren't created or its resources aren't finalized.
> >>> +  */
> >>> + eeh_add_device_early(pdn);
> >>> + eeh_add_device_late(pdev);
> >>> + eeh_sysfs_add_device(pdev);
> >>> + pdev->match_driver = -1;  
> >>> >>>match_driver is a bool, which should be assigned "true" or "false".  
> >>> >>Above he mentioned a dependency on:
> >>> >>
> >>> >>   [04/10] PCI: extend pci device match_driver state
> >>> >>   https://patchwork.kernel.org/patch/9882915/
> >>> >>
> >>> >>
> >>> >>Which makes it an int.  
> >>> >Oh, right, I missed that, thanks.
> >>> >  
> >>> >>Or has that patch been rejected or something?  
> >>> >I haven't *rejected* it, but it's low on my priority list, so you
> >>> >shouldn't depend on it unless it adds functionality you really need.
> >>> >If I did apply that particular patch, I would want some rework because
> >>> >it currently obfuscates the match_driver logic.  There's no clue when
> >>> >reading the code what -1/0/1 mean.  
> >>> So do you prefer enum's? - If so I can make a change for that.  
> >>> >Apparently here you *do* want the "-1 means the PCI core will never
> >>> >set match_driver to 1" functionality, so maybe you do depend on it.  
> >>> We depend on the patch because we want that ability to never set
> >>> match_driver,
> >>> for SRIOV on PowerVM.  
> >> 
> >> Is this really new PowerVM-specific functionality?  ISTR recent 
> >> discussions
> >> about inhibiting driver binding in a generic way, e.g.,
> >> http://lkml.kernel.org/r/1490022874-54718-1-git-send-email-bod...@mellanox.com
> >>   
> >>> >If that's the case, how to you ever bind a driver to these VFs?  The
> >>> >changelog says you don't want VF drivers to load *immediately*, so I
> >>> >assume you do want them to load eventually.
> >>> >  
> >>> The VF's that get dynamically created within the configure SR-IOV
> >>> call, on the Pseries Platform, wont be matched with a driver. - We
> >>> do not want it to match.
> >>> 
> >>> The Power Hypervisor will load the VFs. The VF's will get
> >>> assigned(by the user) via the HMC or Novalink in this environment
> >>> which will then trigger PHYP to load the VF device node to the
> >>> device tree.  
> >> 
> >> I don't know what it means for the Hypervisor to "load the VFs."  Can
> >> you explain that in PCI-speak?
> >> 
> >> The things I know about are:
> >> 
> >>   - we set PCI_SRIOV_CTRL_VFE in the PF, which enables VFs
> >>   - now the VFs respond to 

Re: [PATCH kernel v3] vfio/spapr: Add cond_resched() for huge updates

2017-09-29 Thread Alex Williamson
On Thu, 28 Sep 2017 19:16:12 +1000
Alexey Kardashevskiy  wrote:

> Clearing very big IOMMU tables can trigger soft lockups. This adds
> cond_resched() to allow the scheduler to do context switching when
> it decides to.
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
> 
> The testcase is POWER9 box with 264GB guest, 4 VFIO devices from
> independent IOMMU groups, 64K IOMMU pages. This configuration produces
> 4325376 TCE entries, each entry update incurs 4 OPAL calls to update
> an individual PE TCE cache; this produced lockups for more than 20s.
> Reducing table size to 4194304 (i.e. 256GB guest) or removing one
> of 4 VFIO devices makes the problem go away.
> 
> ---
> Changes:
> v3:
> * cond_resched() checks for should_resched() so we just call cond_resched()
> and let the cpu scheduler decide whether to switch or not
> 
> v2:
> * replaced with time based solution
> ---
>  drivers/vfio/vfio_iommu_spapr_tce.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
> b/drivers/vfio/vfio_iommu_spapr_tce.c
> index 63112c36ab2d..759a5bdd40e1 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -507,6 +507,8 @@ static int tce_iommu_clear(struct tce_container 
> *container,
>   enum dma_data_direction direction;
>  
>   for ( ; pages; --pages, ++entry) {
> + cond_resched();
> +
>   direction = DMA_NONE;
>   oldhpa = 0;
> ret = iommu_tce_xchg(tbl, entry, &oldhpa, &direction);

This looks fine to me, I've applied it to my local next branch for
v4.15.  I'll push that branch next week, once I can rebase to
4.14-rc3.  Thanks,

Alex


Re: [RFC PATCH v5 0/5] vfio-pci: Add support for mmapping MSI-X table

2017-08-17 Thread Alex Williamson
On Thu, 17 Aug 2017 10:56:35 +
David Laight <david.lai...@aculab.com> wrote:

> From: Alex Williamson
> > Sent: 16 August 2017 17:56  
> ...
> > Firmware pissing match...  Processors running with 8k or less page size
> > fall within the recommendations of the PCI spec for register alignment
> > of MMIO regions of the device and this whole problem becomes less of an
> > issue.  
> 
> Actually if qemu is causing the MSI-X table accesses to fault, why doesn't
> it just lie to the guest about the physical address of the MSI-X table?
> Then mmio access to anything in the same physical page will just work.

That's an interesting idea, but now you need to add a BAR for the
virtualized vector table, but you'll also need to support extending a
BAR because there won't necessarily be a BAR available to add.  Of
course PCI requires natural alignment of BARs, thus an extra few bytes
on the end doubles the BAR size.  So also hope that if we need to
extend a BAR that there's a relatively small one available.  In either
case you're changing the layout of the device from what the driver might
expect.  We try pretty hard with device assignment to leave things in
the same place as they appear on bare metal, perhaps removing things,
but not actually moving things.  It might work in the majority of
cases, but it seems a bit precarious overall.  Thanks,

Alex


Re: [RFC PATCH v5 0/5] vfio-pci: Add support for mmapping MSI-X table

2017-08-16 Thread Alex Williamson
On Wed, 16 Aug 2017 10:35:49 +1000
Benjamin Herrenschmidt <b...@kernel.crashing.org> wrote:

> On Tue, 2017-08-15 at 10:37 -0600, Alex Williamson wrote:
> > Of course I don't think either of those are worth imposing a
> > performance penalty where we don't otherwise need one.  However, if we
> > look at a VM scenario where the guest is following the PCI standard for
> > programming MSI-X interrupts (ie. not POWER), we need some mechanism to
> > intercept those MMIO writes to the vector table and configure the host
> > interrupt domain of the device rather than allowing the guest direct
> > access.  This is simply part of virtualizing the device to the guest.
> > So even if the kernel allows mmap'ing the vector table, the hypervisor
> > needs to trap it, so the mmap isn't required or used anyway.  It's only
> > when you define a non-PCI standard for your guest to program
> > interrupts, as POWER has done, and can therefore trust that the
> > hypervisor does not need to trap on the vector table that having that
> > mmap'able vector table becomes fully useful.  AIUI, ARM supports 64k
> > pages too... does ARM have any strategy that would actually make it
> > possible to make use of an mmap covering the vector table?  Thanks,  
> 
> WTF  Alex, can you stop once and for all with all that "POWER is
> not standard" bullshit please ? It's completely wrong.

As you've stated, the MSI-X vector table on POWER is currently updated
via a hypercall.  POWER is overall PCI compliant (I assume), but the
guest does not directly modify the vector table in MMIO space of the
device.  This is important...

> This has nothing to do with PCIe standard !

Yes, it actually does, because if the guest relies on the vector table
to be virtualized then it doesn't particularly matter whether the
vfio-pci kernel driver allows that portion of device MMIO space to be
directly accessed or mapped because QEMU needs for it to be trapped in
order to provide that virtualization.

I'm not knocking POWER, it's a smart thing for virtualization to have
defined this hypercall which negates the need for vector table
virtualization and allows efficient mapping of the device.  On other
platform, it's not necessarily practical given the broad base of legacy
guests supported where we'd never get agreement to implement this as
part of the platform spec... if there even was such a thing.  Maybe we
could provide the hypercall and dynamically enable direct vector table
mapping (disabling vector table virtualization) only if the hypercall
is used.

> The PCIe standard says strictly *nothing* whatsoever about how an OS
> obtains the magic address/values to put in the device and how the PCIe
> host bridge may do appropriate fitering.

And now we've jumped the tracks...  The only way the platform specific
address/data values become important is if we allow direct access to
the vector table AND now we're formulating how the user/guest might
write to it directly.  Otherwise the virtualization of the vector
table, or paravirtualization via hypercall provides the translation
where the host and guest address/data pairs can operate in completely
different address spaces.

> There is nothing on POWER that prevents the guest from writing the MSI-
> X address/data by hand. The problem isn't who writes the values or even
> how. The problem breaks down into these two things that are NOT covered
> by any aspect of the PCIe standard:

You've moved on to a different problem, I think everyone aside from
POWER is still back at the problem where who writes the vector table
values is a forefront problem.
 
>   1- The OS needs to obtain address/data values for an MSI that will
> "work" for the device.
> 
>   2- The HW+HV needs to prevent collateral damage caused by a device
> issuing stores to incorrect address or with incorrect data. Now *this*
> is necessary for *ANY* kind of DMA whether it's an MSI or something
> else anyway.
> 
> Now, the filtering done by qemu is NOT a reasonable way to handle 2)
> and whatever excluse about "making it harder" doesn't fly a meter when
> it comes to security. Making it "harder to break accidentally" I also
> don't buy, people don't just randomly put things in their MSI-X tables
> "accidentally", that stuff works or doesn't.

As I said before, I'm not willing to preserve the weak attributes that
blocking direct vector table access provides over pursuing a more
performant interface, but I also don't think their value is absolute
zero either.

> That leaves us with 1). Now this is purely a platform specific matters,
> not a spec matter. Once the HW has a way to enforce you can only
> generate "allowed" MSIs it becomes a matter of having some FW mechanism
> that can be used to inform the OS wh

Re: [RFC PATCH v5 0/5] vfio-pci: Add support for mmapping MSI-X table

2017-08-15 Thread Alex Williamson
On Mon, 14 Aug 2017 14:12:33 +0100
Robin Murphy  wrote:

> On 14/08/17 10:45, Alexey Kardashevskiy wrote:
> > Folks,
> > 
> > Is there anything to change besides those compiler errors and David's
> comment in 5/5? Or is the whole patchset too bad? Thanks.  
> 
> While I now understand it's not the low-level thing I first thought it
> was, so my reasoning has changed, personally I don't like this approach
> any more than the previous one - it still smells of abusing external
> APIs to pass information from one part of VFIO to another (and it has
> the same conceptual problem of attributing something to interrupt
> sources that is actually a property of the interrupt target).
> 
> Taking a step back, though, why does vfio-pci perform this check in the
> first place? If a malicious guest already has control of a device, any
> kind of interrupt spoofing it could do by fiddling with the MSI-X
> message address/data it could simply do with a DMA write anyway, so the
> security argument doesn't stand up in general (sure, not all PCIe
> devices may be capable of arbitrary DMA, but that seems like more of a
> tenuous security-by-obscurity angle to me). Besides, with Type1 IOMMU
> the fact that we've let a device be assigned at all means that this is
> already a non-issue (because either the hardware provides isolation or
> the user has explicitly accepted the consequences of an unsafe
> configuration) - from patch #4 that's apparently the same for SPAPR TCE,
> in which case it seems this flag doesn't even need to be propagated and
> could simply be assumed always.
> 
> On the other hand, if the check is not so much to mitigate malicious
> guests attacking the system as to prevent dumb guests breaking
> themselves (e.g. if some or all of the MSI-X capability is actually
> emulated), then allowing things to sometimes go wrong on the grounds of
> an irrelevant hardware feature doesn't seem correct :/

While the theoretical security provided by preventing direct access to
the MSI-X vector table may be mostly a matter of obfuscation, in
practice, I think it changes the problem of creating arbitrary DMA
writes from a generic, trivial, PCI spec based exercise to a more device
specific challenge.  I do however have evidence that there are
consumers of the vfio API who would have attempted to program device
interrupts by directly manipulating the vector table had they not been
prevented from doing so and contacting me to learn about the SET_IRQ
ioctl.  Therefore I think the behavior also contributes to making the
overall API more difficult to use incorrectly.

Of course I don't think either of those are worth imposing a
performance penalty where we don't otherwise need one.  However, if we
look at a VM scenario where the guest is following the PCI standard for
programming MSI-X interrupts (ie. not POWER), we need some mechanism to
intercept those MMIO writes to the vector table and configure the host
interrupt domain of the device rather than allowing the guest direct
access.  This is simply part of virtualizing the device to the guest.
So even if the kernel allows mmap'ing the vector table, the hypervisor
needs to trap it, so the mmap isn't required or used anyway.  It's only
when you define a non-PCI standard for your guest to program
interrupts, as POWER has done, and can therefore trust that the
hypervisor does not need to trap on the vector table that having that
mmap'able vector table becomes fully useful.  AIUI, ARM supports 64k
pages too... does ARM have any strategy that would actually make it
possible to make use of an mmap covering the vector table?  Thanks,

Alex


Re: [PATCH v2] include/linux/vfio.h: Guard powerpc-specific functions with CONFIG_VFIO_SPAPR_EEH

2017-07-26 Thread Alex Williamson
On Tue, 18 Jul 2017 14:22:20 -0300
Murilo Opsfelder Araujo  wrote:

> When CONFIG_EEH=y and CONFIG_VFIO_SPAPR_EEH=n, build fails with the
> following:
> 
> drivers/vfio/pci/vfio_pci.o: In function `.vfio_pci_release':
> vfio_pci.c:(.text+0xa98): undefined reference to 
> `.vfio_spapr_pci_eeh_release'
> drivers/vfio/pci/vfio_pci.o: In function `.vfio_pci_open':
> vfio_pci.c:(.text+0x1420): undefined reference to 
> `.vfio_spapr_pci_eeh_open'
> 
> In this case, vfio_pci.c should use the empty definitions of
> vfio_spapr_pci_eeh_open and vfio_spapr_pci_eeh_release functions.
> 
> This patch fixes it by guarding these function definitions with
> CONFIG_VFIO_SPAPR_EEH, the symbol that controls whether vfio_spapr_eeh.c is
> built, which is where the non-empty versions of these functions are. We need 
> to
> make use of IS_ENABLED() macro because CONFIG_VFIO_SPAPR_EEH is a tristate
> option.
> 
> This issue was found during a randconfig build. Logs are here:
> 
> http://kisskb.ellerman.id.au/kisskb/buildresult/12982362/
> 
> Signed-off-by: Murilo Opsfelder Araujo 
> ---

Applied to my for-linus branch with David and Alexey's R-b for v4.13.
Thanks,

Alex

> 
> Changes from v1:
> - Rebased on top of next-20170718.
> 
>  include/linux/vfio.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 586809a..a47b985 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -152,7 +152,7 @@ extern int vfio_set_irqs_validate_and_prepare(struct 
> vfio_irq_set *hdr,
> size_t *data_size);
> 
>  struct pci_dev;
> -#ifdef CONFIG_EEH
> +#if IS_ENABLED(CONFIG_VFIO_SPAPR_EEH)
>  extern void vfio_spapr_pci_eeh_open(struct pci_dev *pdev);
>  extern void vfio_spapr_pci_eeh_release(struct pci_dev *pdev);
>  extern long vfio_spapr_iommu_eeh_ioctl(struct iommu_group *group,
> @@ -173,7 +173,7 @@ static inline long vfio_spapr_iommu_eeh_ioctl(struct 
> iommu_group *group,
>  {
>   return -ENOTTY;
>  }
> -#endif /* CONFIG_EEH */
> +#endif /* CONFIG_VFIO_SPAPR_EEH */
> 
>  /*
>   * IRQfd - generic
> --
> 2.9.4



Re: [PATCH v2] include/linux/vfio.h: Guard powerpc-specific functions with CONFIG_VFIO_SPAPR_EEH

2017-07-25 Thread Alex Williamson
[cc +Alexey, David]

Any comments from the usual suspects for vfio/spapr?  Thanks,

Alex

On Tue, 25 Jul 2017 10:56:38 -0300
Murilo Opsfelder Araújo  wrote:

> On 07/18/2017 02:22 PM, Murilo Opsfelder Araujo wrote:
> > When CONFIG_EEH=y and CONFIG_VFIO_SPAPR_EEH=n, build fails with the
> > following:
> > 
> > drivers/vfio/pci/vfio_pci.o: In function `.vfio_pci_release':
> > vfio_pci.c:(.text+0xa98): undefined reference to 
> > `.vfio_spapr_pci_eeh_release'
> > drivers/vfio/pci/vfio_pci.o: In function `.vfio_pci_open':
> > vfio_pci.c:(.text+0x1420): undefined reference to 
> > `.vfio_spapr_pci_eeh_open'
> > 
> > In this case, vfio_pci.c should use the empty definitions of
> > vfio_spapr_pci_eeh_open and vfio_spapr_pci_eeh_release functions.
> > 
> > This patch fixes it by guarding these function definitions with
> > CONFIG_VFIO_SPAPR_EEH, the symbol that controls whether vfio_spapr_eeh.c is
> > built, which is where the non-empty versions of these functions are. We 
> > need to
> > make use of IS_ENABLED() macro because CONFIG_VFIO_SPAPR_EEH is a tristate
> > option.
> > 
> > This issue was found during a randconfig build. Logs are here:
> > 
> > http://kisskb.ellerman.id.au/kisskb/buildresult/12982362/
> > 
> > Signed-off-by: Murilo Opsfelder Araujo 
> > ---
> > 
> > Changes from v1:
> > - Rebased on top of next-20170718.  
> 
> Hi, Alex.
> 
> Are you applying this?
> 
> Thanks!
> 



Re: [PATCH kernel 0/3 REPOST] vfio-pci: Add support for mmapping MSI-X table

2017-06-29 Thread Alex Williamson
On Wed, 28 Jun 2017 17:27:32 +1000
Alexey Kardashevskiy <a...@ozlabs.ru> wrote:

> On 24/06/17 01:17, Alex Williamson wrote:
> > On Fri, 23 Jun 2017 15:06:37 +1000
> > Alexey Kardashevskiy <a...@ozlabs.ru> wrote:
> >   
> >> On 23/06/17 07:11, Alex Williamson wrote:  
> >>> On Thu, 15 Jun 2017 15:48:42 +1000
> >>> Alexey Kardashevskiy <a...@ozlabs.ru> wrote:
> >>> 
> >>>> Here is a patchset which Yongji was working on before
> >>>> leaving IBM LTC. Since we still want to have this functionality
> >>>> in the kernel (DPDK is the first user), here is a rebase
> >>>> on the current upstream.
> >>>>
> >>>>
> >>>> Current vfio-pci implementation disallows to mmap the page
> >>>> containing MSI-X table in case that users can write directly
> >>>> to MSI-X table and generate an incorrect MSIs.
> >>>>
> >>>> However, this will cause some performance issue when there
> >>>> are some critical device registers in the same page as the
> >>>> MSI-X table. We have to handle the mmio access to these
> >>>> registers in QEMU emulation rather than in guest.
> >>>>
> >>>> To solve this issue, this series allows to expose MSI-X table
> >>>> to userspace when hardware enables the capability of interrupt
> >>>> remapping which can ensure that a given PCI device can only
> >>>> shoot the MSIs assigned for it. And we introduce a new bus_flags
> >>>> PCI_BUS_FLAGS_MSI_REMAP to test this capability on PCI side
> >>>> for different archs.
> >>>>
> >>>> The patch 3 are based on the proposed patchset[1].
> >>>>
> >>>> Changelog
> >>>> v3:
> >>>> - rebased on the current upstream
> >>>
> >>> There's something not forthcoming here, the last version I see from
> >>> Yongji is this one:
> >>>
> >>> https://lists.linuxfoundation.org/pipermail/iommu/2016-June/017245.html
> >>>
> >>> Which was a 6-patch series where patches 2-4 tried to apply
> >>> PCI_BUS_FLAGS_MSI_REMAP for cases that supported other platforms.  That
> >>> doesn't exist here, so it's not simply a rebase.  Patch 1/ seems to
> >>> equate this new flag to the IOMMU capability IOMMU_CAP_INTR_REMAP, but
> >>> nothing is done here to match them together.  That patch also mentions
> >>> the work Eric has done for similar features on ARM, but again those
> >>> patches are dropped.  It seems like an incomplete feature now.  Thanks,   
> >>>  
> >>
> >>
> >> Thanks! I suspected this is not the latest but could not find anything
> >> better than we use internally for tests, and I could not reach Yongji for
> >> comments whether this was the latest update.
> >>
> >> As I am reading the patches, I notice that the "msi remap" term is used all
> >> over the place. While this remapping capability may be the case for x86/arm
> >> (and therefore the IOMMU_CAP_INTR_REMAP flag makes sense), powernv does not
> >> do remapping but provides hardware isolation. When we are allowing MSIX BAR
> >> mapping to the userspace - the isolation is what we really care about. Will
> >> it make sense to rename PCI_BUS_FLAGS_MSI_REMAP to
> >> PCI_BUS_FLAGS_MSI_ISOLATED ?  
> > 
> > I don't have a strong opinion either way, so long as it's fully
> > described what the flag indicates.
> >   
> >> Another thing - the patchset enables PCI_BUS_FLAGS_MSI_REMAP when IOMMU
> >> just advertises IOMMU_CAP_INTR_REMAP, not necessarily uses it, should the
> >> patchset actually look at something like irq_remapping_enabled in
> >> drivers/iommu/amd_iommu.c instead?  
> > 
> > Interrupt remapping being enabled is implicit in IOMMU_CAP_INTR_REMAP,
> > neither intel or amd iommu export the capability unless enabled.
> > Nobody cares if it's supported but not enabled.  Thanks,  
> 
> 
> As I am reading the current drivers/vfio/vfio_iommu_type1.c, it feels like
> MSIX BAR mappings can always be allowed for the type1 IOMMU as
> vfio_iommu_type1_attach_group() performs this check:
> 
> msi_remap = resv_msi ? irq_domain_check_msi_remap() :
> iommu_capable(bus, IOMMU_CAP_INTR_REMAP);
> 
> and simply does not proceed if MSI remap is not supported. Is that correct
> or I miss something here? Thanks.

The MSI code in type1 has absolutely nothing to do with BAR mappings.
That's looking at how MSI is handled by the IOMMU, whether it needs a
reserved mapping area and whether MSI writes have source ID
validation.  Thanks,

Alex


Re: [PATCH kernel 0/3 REPOST] vfio-pci: Add support for mmapping MSI-X table

2017-06-23 Thread Alex Williamson
On Fri, 23 Jun 2017 15:06:37 +1000
Alexey Kardashevskiy <a...@ozlabs.ru> wrote:

> On 23/06/17 07:11, Alex Williamson wrote:
> > On Thu, 15 Jun 2017 15:48:42 +1000
> > Alexey Kardashevskiy <a...@ozlabs.ru> wrote:
> >   
> >> Here is a patchset which Yongji was working on before
> >> leaving IBM LTC. Since we still want to have this functionality
> >> in the kernel (DPDK is the first user), here is a rebase
> >> on the current upstream.
> >>
> >>
> >> Current vfio-pci implementation disallows to mmap the page
> >> containing MSI-X table in case that users can write directly
> >> to MSI-X table and generate an incorrect MSIs.
> >>
> >> However, this will cause some performance issue when there
> >> are some critical device registers in the same page as the
> >> MSI-X table. We have to handle the mmio access to these
> >> registers in QEMU emulation rather than in guest.
> >>
> >> To solve this issue, this series allows to expose MSI-X table
> >> to userspace when hardware enables the capability of interrupt
> >> remapping which can ensure that a given PCI device can only
> >> shoot the MSIs assigned for it. And we introduce a new bus_flags
> >> PCI_BUS_FLAGS_MSI_REMAP to test this capability on PCI side
> >> for different archs.
> >>
> >> The patch 3 are based on the proposed patchset[1].
> >>
> >> Changelog
> >> v3:
> >> - rebased on the current upstream  
> > 
> > There's something not forthcoming here, the last version I see from
> > Yongji is this one:
> > 
> > https://lists.linuxfoundation.org/pipermail/iommu/2016-June/017245.html
> > 
> > Which was a 6-patch series where patches 2-4 tried to apply
> > PCI_BUS_FLAGS_MSI_REMAP for cases that supported other platforms.  That
> > doesn't exist here, so it's not simply a rebase.  Patch 1/ seems to
> > equate this new flag to the IOMMU capability IOMMU_CAP_INTR_REMAP, but
> > nothing is done here to match them together.  That patch also mentions
> > the work Eric has done for similar features on ARM, but again those
> > patches are dropped.  It seems like an incomplete feature now.  Thanks,  
> 
> 
> Thanks! I suspected this is not the latest but could not find anything
> better than we use internally for tests, and I could not reach Yongji for
> comments whether this was the latest update.
> 
> As I am reading the patches, I notice that the "msi remap" term is used all
> over the place. While this remapping capability may be the case for x86/arm
> (and therefore the IOMMU_CAP_INTR_REMAP flag makes sense), powernv does not
> do remapping but provides hardware isolation. When we are allowing MSIX BAR
> mapping to the userspace - the isolation is what we really care about. Will
> it make sense to rename PCI_BUS_FLAGS_MSI_REMAP to
> PCI_BUS_FLAGS_MSI_ISOLATED ?

I don't have a strong opinion either way, so long as it's fully
described what the flag indicates.

> Another thing - the patchset enables PCI_BUS_FLAGS_MSI_REMAP when IOMMU
> just advertises IOMMU_CAP_INTR_REMAP, not necessarily uses it, should the
> patchset actually look at something like irq_remapping_enabled in
> drivers/iommu/amd_iommu.c instead?

Interrupt remapping being enabled is implicit in IOMMU_CAP_INTR_REMAP,
neither intel or amd iommu export the capability unless enabled.
Nobody cares if it's supported but not enabled.  Thanks,

Alex

> >> v2:
> >> - Make the commit log more clear
> >> - Replace pci_bus_check_msi_remapping() with pci_bus_msi_isolated()
> >>   so that we could clearly know what the function does
> >> - Set PCI_BUS_FLAGS_MSI_REMAP in pci_create_root_bus() instead
> >>   of iommu_bus_notifier()
> >> - Reserve VFIO_REGION_INFO_FLAG_CAPS when we allow to mmap MSI-X
> >>   table so that we can know whether we allow to mmap MSI-X table
> >>   in QEMU
> >>
> >> [1] 
> >> https://www.mail-archive.com/linux-kernel%40vger.kernel.org/msg1138820.html
> >>
> >>
> >> This is based on sha1
> >> 63f700aab4c1 Linus Torvalds "Merge tag 'xtensa-20170612' of 
> >> git://github.com/jcmvbkbc/linux-xtensa".
> >>
> >> Please comment. Thanks.
> >>
> >>
> >>
> >> Yongji Xie (3):
> >>   PCI: Add a new PCI_BUS_FLAGS_MSI_REMAP flag
> >>   pci-ioda: Set PCI_BUS_FLAGS_MSI_REMAP for IODA host bridge
> >>   vfio-pci: Allow to expose MSI-X table to userspace if interrupt
> >> remapping is enabled
> >>
> >>  include/linux/pci.h   |  1 +
> >>  arch/powerpc/platforms/powernv/pci-ioda.c |  8 
> >>  drivers/vfio/pci/vfio_pci.c   | 18 +++---
> >>  drivers/vfio/pci/vfio_pci_rdwr.c  |  3 ++-
> >>  4 files changed, 26 insertions(+), 4 deletions(-)
> >>  
> >   
> 
> 



Re: [PATCH kernel 0/3 REPOST] vfio-pci: Add support for mmapping MSI-X table

2017-06-22 Thread Alex Williamson
On Thu, 15 Jun 2017 15:48:42 +1000
Alexey Kardashevskiy  wrote:

> Here is a patchset which Yongji was working on before
> leaving IBM LTC. Since we still want to have this functionality
> in the kernel (DPDK is the first user), here is a rebase
> on the current upstream.
> 
> 
> Current vfio-pci implementation disallows to mmap the page
> containing MSI-X table in case that users can write directly
> to MSI-X table and generate an incorrect MSIs.
> 
> However, this will cause some performance issue when there
> are some critical device registers in the same page as the
> MSI-X table. We have to handle the mmio access to these
> registers in QEMU emulation rather than in guest.
> 
> To solve this issue, this series allows to expose MSI-X table
> to userspace when hardware enables the capability of interrupt
> remapping which can ensure that a given PCI device can only
> shoot the MSIs assigned for it. And we introduce a new bus_flags
> PCI_BUS_FLAGS_MSI_REMAP to test this capability on PCI side
> for different archs.
> 
> The patch 3 are based on the proposed patchset[1].
> 
> Changelog
> v3:
> - rebased on the current upstream

There's something not forthcoming here, the last version I see from
Yongji is this one:

https://lists.linuxfoundation.org/pipermail/iommu/2016-June/017245.html

Which was a 6-patch series where patches 2-4 tried to apply
PCI_BUS_FLAGS_MSI_REMAP for cases that supported other platforms.  That
doesn't exist here, so it's not simply a rebase.  Patch 1/ seems to
equate this new flag to the IOMMU capability IOMMU_CAP_INTR_REMAP, but
nothing is done here to match them together.  That patch also mentions
the work Eric has done for similar features on ARM, but again those
patches are dropped.  It seems like an incomplete feature now.  Thanks,

Alex

> v2:
> - Make the commit log more clear
> - Replace pci_bus_check_msi_remapping() with pci_bus_msi_isolated()
>   so that we could clearly know what the function does
> - Set PCI_BUS_FLAGS_MSI_REMAP in pci_create_root_bus() instead
>   of iommu_bus_notifier()
> - Reserve VFIO_REGION_INFO_FLAG_CAPS when we allow to mmap MSI-X
>   table so that we can know whether we allow to mmap MSI-X table
>   in QEMU
> 
> [1] 
> https://www.mail-archive.com/linux-kernel%40vger.kernel.org/msg1138820.html
> 
> 
> This is based on sha1
> 63f700aab4c1 Linus Torvalds "Merge tag 'xtensa-20170612' of 
> git://github.com/jcmvbkbc/linux-xtensa".
> 
> Please comment. Thanks.
> 
> 
> 
> Yongji Xie (3):
>   PCI: Add a new PCI_BUS_FLAGS_MSI_REMAP flag
>   pci-ioda: Set PCI_BUS_FLAGS_MSI_REMAP for IODA host bridge
>   vfio-pci: Allow to expose MSI-X table to userspace if interrupt
> remapping is enabled
> 
>  include/linux/pci.h   |  1 +
>  arch/powerpc/platforms/powernv/pci-ioda.c |  8 
>  drivers/vfio/pci/vfio_pci.c   | 18 +++---
>  drivers/vfio/pci/vfio_pci_rdwr.c  |  3 ++-
>  4 files changed, 26 insertions(+), 4 deletions(-)
> 



Re: [PATCH guest kernel] vfio/powerpc/spapr_tce: Enforce IOMMU type compatibility check

2017-04-04 Thread Alex Williamson
On Tue, 4 Apr 2017 20:12:45 +1000
Alexey Kardashevskiy <a...@ozlabs.ru> wrote:

> On 25/03/17 23:25, Alexey Kardashevskiy wrote:
> > On 25/03/17 07:29, Alex Williamson wrote:  
> >> On Fri, 24 Mar 2017 17:44:06 +1100
> >> Alexey Kardashevskiy <a...@ozlabs.ru> wrote:
> >>  
> >>> The existing SPAPR TCE driver advertises both VFIO_SPAPR_TCE_IOMMU and
> >>> VFIO_SPAPR_TCE_v2_IOMMU types to the userspace and the userspace usually
> >>> picks the v2.
> >>>
> >>> Normally the userspace would create a container, attach an IOMMU group
> >>> to it and only then set the IOMMU type (which would normally be v2).
> >>>
> >>> However a specific IOMMU group may not support v2, in other words
> >>> it may not implement set_window/unset_window/take_ownership/
> >>> release_ownership and such a group should not be attached to
> >>> a v2 container.
> >>>
> >>> This adds extra checks that a new group can do what the selected IOMMU
> >>> type suggests. The userspace can then test the return value from
> >>> ioctl(VFIO_SET_IOMMU, VFIO_SPAPR_TCE_v2_IOMMU) and try
> >>> VFIO_SPAPR_TCE_IOMMU.
> >>>
> >>> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
> >>> ---
> >>>
> >>> This is one of the patches needed to do nested VFIO - for either
> >>> second level guest or DPDK running in a guest.
> >>> ---
> >>>  drivers/vfio/vfio_iommu_spapr_tce.c | 8 
> >>>  1 file changed, 8 insertions(+)  
> >>
> >> I'm not sure I understand why you're labeling this "guest kernel", is a  
> > 
> > 
> > That is my script :)
> >   
> >> VM the only case where we can have combinations that only a subset of
> >> the groups might support v2?
> > 
> > powernv (non-virtualized, and it runs HV KVM) host provides v2-capable
> > groups, they all the same, and a pseries host (which normally runs as a
> > guest but it can do nested KVM as well - it is called PR KVM) can do only
> > v1 (after this patch, without it - no vfio at all).
> >   
> >> What terrible things happen when such a
> >> combination is created?  
> > 
> > There is no mixture at the moment, I just needed a way to tell userspace
> > that a group cannot do v2.
> >   
> >> The fix itself seems sane, but I'm trying to
> >> figure out whether it should be marked for stable, should go in for
> >> v4.11, or be queued for v4.12.  Thanks,  
> > 
> > No need for stable.  
> 
> 
> So what is the next step with this patch?

Unless there are objections or further comments, I'll put this in my
next branch for v4.12, probably this week.  Thanks,

Alex

> >>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
> >>> b/drivers/vfio/vfio_iommu_spapr_tce.c
> >>> index cf3de91fbfe7..a7d811524092 100644
> >>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> >>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> >>> @@ -1335,8 +1335,16 @@ static int tce_iommu_attach_group(void *iommu_data,
> >>>  
> >>>   if (!table_group->ops || !table_group->ops->take_ownership ||
> >>>   !table_group->ops->release_ownership) {
> >>> + if (container->v2) {
> >>> + ret = -EPERM;
> >>> + goto unlock_exit;
> >>> + }
> >>>   ret = tce_iommu_take_ownership(container, table_group);
> >>>   } else {
> >>> + if (!container->v2) {
> >>> + ret = -EPERM;
> >>> + goto unlock_exit;
> >>> + }
> >>>   ret = tce_iommu_take_ownership_ddw(container, table_group);
> >>>   if (!tce_groups_attached(container) && !container->tables[0])
> >>>   container->def_window_pending = true;  
> >>  
> > 
> >   
> 
> 



Re: [PATCH guest kernel] vfio/powerpc/spapr_tce: Enforce IOMMU type compatibility check

2017-03-24 Thread Alex Williamson
On Fri, 24 Mar 2017 17:44:06 +1100
Alexey Kardashevskiy  wrote:

> The existing SPAPR TCE driver advertises both VFIO_SPAPR_TCE_IOMMU and
> VFIO_SPAPR_TCE_v2_IOMMU types to the userspace and the userspace usually
> picks the v2.
> 
> Normally the userspace would create a container, attach an IOMMU group
> to it and only then set the IOMMU type (which would normally be v2).
> 
> However a specific IOMMU group may not support v2, in other words
> it may not implement set_window/unset_window/take_ownership/
> release_ownership and such a group should not be attached to
> a v2 container.
> 
> This adds extra checks that a new group can do what the selected IOMMU
> type suggests. The userspace can then test the return value from
> ioctl(VFIO_SET_IOMMU, VFIO_SPAPR_TCE_v2_IOMMU) and try
> VFIO_SPAPR_TCE_IOMMU.
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
> 
> This is one of the patches needed to do nested VFIO - for either
> second level guest or DPDK running in a guest.
> ---
>  drivers/vfio/vfio_iommu_spapr_tce.c | 8 
>  1 file changed, 8 insertions(+)

I'm not sure I understand why you're labeling this "guest kernel", is a
VM the only case where we can have combinations that only a subset of
the groups might support v2?  What terrible things happen when such a
combination is created?  The fix itself seems sane, but I'm trying to
figure out whether it should be marked for stable, should go in for
v4.11, or be queued for v4.12.  Thanks,

Alex

> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
> b/drivers/vfio/vfio_iommu_spapr_tce.c
> index cf3de91fbfe7..a7d811524092 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -1335,8 +1335,16 @@ static int tce_iommu_attach_group(void *iommu_data,
>  
>   if (!table_group->ops || !table_group->ops->take_ownership ||
>   !table_group->ops->release_ownership) {
> + if (container->v2) {
> + ret = -EPERM;
> + goto unlock_exit;
> + }
>   ret = tce_iommu_take_ownership(container, table_group);
>   } else {
> + if (!container->v2) {
> + ret = -EPERM;
> + goto unlock_exit;
> + }
>   ret = tce_iommu_take_ownership_ddw(container, table_group);
>   if (!tce_groups_attached(container) && !container->tables[0])
>   container->def_window_pending = true;



Re: [PATCH kernel v11 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO

2017-03-24 Thread Alex Williamson
xt)
>  #ifdef CONFIG_PPC_BOOK3S_64
>   case KVM_CAP_SPAPR_TCE:
>   case KVM_CAP_SPAPR_TCE_64:
> + /* fallthrough */
> + case KVM_CAP_SPAPR_TCE_VFIO:
>   case KVM_CAP_PPC_RTAS:
>   case KVM_CAP_PPC_FIXUP_HCALL:
>   case KVM_CAP_PPC_ENABLE_HCALL:
> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> index d32f239eb471..37d9118fd84b 100644
> --- a/virt/kvm/vfio.c
> +++ b/virt/kvm/vfio.c
> @@ -20,6 +20,10 @@
>  #include 
>  #include "vfio.h"
>  
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +#include 
> +#endif
> +
>  struct kvm_vfio_group {
>   struct list_head node;
>   struct vfio_group *vfio_group;
> @@ -89,6 +93,47 @@ static bool kvm_vfio_group_is_coherent(struct vfio_group 
> *vfio_group)
>   return ret > 0;
>  }
>  
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
> +{
> + int (*fn)(struct vfio_group *);
> + int ret = -EINVAL;
> +
> + fn = symbol_get(vfio_external_user_iommu_id);
> + if (!fn)
> + return ret;
> +
> + ret = fn(vfio_group);
> +
> + symbol_put(vfio_external_user_iommu_id);
> +
> + return ret;
> +}
> +
> +static struct iommu_group *kvm_vfio_group_get_iommu_group(
> + struct vfio_group *group)
> +{
> + int group_id = kvm_vfio_external_user_iommu_id(group);
> +
> + if (group_id < 0)
> + return NULL;
> +
> + return iommu_group_get_by_id(group_id);
> +}
> +
> +static void kvm_spapr_tce_release_vfio_group(struct kvm *kvm,
> + struct vfio_group *vfio_group)
> +{
> + struct iommu_group *grp = kvm_vfio_group_get_iommu_group(vfio_group);
> +
> + if (WARN_ON_ONCE(!grp))
> + return;
> +
> + kvm_spapr_tce_release_iommu_group(kvm, grp);
> + iommu_group_put(grp);
> +}
> +#endif
> +
>  /*
>   * Groups can use the same or different IOMMU domains.  If the same then
>   * adding a new group may change the coherency of groups we've previously
> @@ -211,6 +256,9 @@ static int kvm_vfio_set_group(struct kvm_device *dev, 
> long attr, u64 arg)
>  
>   mutex_unlock(&kv->lock);
>  
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> + kvm_spapr_tce_release_vfio_group(dev->kvm, vfio_group);
> +#endif
>   kvm_vfio_group_set_kvm(vfio_group, NULL);
>  
>   kvm_vfio_group_put_external_user(vfio_group);
> @@ -218,6 +266,57 @@ static int kvm_vfio_set_group(struct kvm_device *dev, 
> long attr, u64 arg)
>   kvm_vfio_update_coherency(dev);
>  
>   return ret;
> +
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> + case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: {
> + struct kvm_vfio_spapr_tce param;
> + struct kvm_vfio *kv = dev->private;
> + struct vfio_group *vfio_group;
> + struct kvm_vfio_group *kvg;
> + struct fd f;
> + struct iommu_group *grp;
> +
> + if (copy_from_user(&param, (void __user *)arg,
> + sizeof(struct kvm_vfio_spapr_tce)))
> + return -EFAULT;
> +
> + f = fdget(param.groupfd);
> + if (!f.file)
> + return -EBADF;
> +
> + vfio_group = kvm_vfio_group_get_external_user(f.file);
> + fdput(f);
> +
> + if (IS_ERR(vfio_group))
> + return PTR_ERR(vfio_group);
> +
> + grp = kvm_vfio_group_get_iommu_group(vfio_group);
> + if (WARN_ON_ONCE(!grp)) {
> + kvm_vfio_group_put_external_user(vfio_group);
> + return -EIO;
> + }
> +
> + ret = -ENOENT;
> +
> + mutex_lock(&kv->lock);
> +
> + list_for_each_entry(kvg, &kv->group_list, node) {
> + if (kvg->vfio_group != vfio_group)
> + continue;
> +
> + ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
> + param.tablefd, grp);
> + break;
> + }
> +
> + mutex_unlock(&kv->lock);
> +
> + iommu_group_put(grp);
> + kvm_vfio_group_put_external_user(vfio_group);
> +
> + return ret;
> + }
> +#endif /* CONFIG_SPAPR_TCE_IOMMU */
>   }
>  
>   return -ENXIO;
> @@ -242,6 +341,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
>   switch (attr->attr) {
>   case KVM_DEV_VFIO_GROUP_ADD:
>   case KVM_DEV_VFIO_GROUP_DEL:
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> + case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE:
> +#endif
>   return 0;
>   }
>  
> @@ -257,6 +359,9 @@ static void kvm_vfio_destroy(struct kvm_device *dev)
>   struct kvm_vfio_group *kvg, *tmp;
>  
>   list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> + kvm_spapr_tce_release_vfio_group(dev->kvm, kvg->vfio_group);
> +#endif
>   kvm_vfio_group_set_kvm(kvg->vfio_group, NULL);
>   kvm_vfio_group_put_external_user(kvg->vfio_group);
>   list_del(&kvg->node);


Acked-by: Alex Williamson <alex.william...@redhat.com>


Re: [PATCH kernel v11 04/10] powerpc/vfio_spapr_tce: Add reference counting to iommu_table

2017-03-24 Thread Alex Williamson
pc/platforms/pseries/vio.c
> +++ b/arch/powerpc/platforms/pseries/vio.c
> @@ -1318,7 +1318,7 @@ static void vio_dev_release(struct device *dev)
>   struct iommu_table *tbl = get_iommu_table_base(dev);
>  
>   if (tbl)
> - iommu_free_table(tbl, of_node_full_name(dev->of_node));
> + iommu_tce_table_put(tbl);
>   of_node_put(dev->of_node);
>   kfree(to_vio_dev(dev));
>  }
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
> b/drivers/vfio/vfio_iommu_spapr_tce.c
> index fbec7348a7e5..8031d3a55a17 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -680,7 +680,7 @@ static void tce_iommu_free_table(struct tce_container 
> *container,
>   unsigned long pages = tbl->it_allocated_size >> PAGE_SHIFT;
>  
>   tce_iommu_userspace_view_free(tbl, container->mm);
> - iommu_free_table(tbl, "");
> + iommu_tce_table_put(tbl);
>   decrement_locked_vm(container->mm, pages);
>  }
>  


Acked-by: Alex Williamson <alex.william...@redhat.com>


Re: [PATCH kernel v10 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO

2017-03-21 Thread Alex Williamson
On Fri, 17 Mar 2017 16:09:59 +1100
Alexey Kardashevskiy  wrote:

> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO
> without passing them to user space which saves time on switching
> to user space and back.
> 
> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
> KVM tries to handle a TCE request in the real mode, if failed
> it passes the request to the virtual mode to complete the operation.
> If a virtual mode handler fails, the request is passed to
> the user space; this is not expected to happen though.
> 
> To avoid dealing with page use counters (which is tricky in real mode),
> this only accelerates SPAPR TCE IOMMU v2 clients which are required
> to pre-register the userspace memory. The very first TCE request will
> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
> of the TCE table (iommu_table::it_userspace) is not allocated till
> the very first mapping happens and we cannot call vmalloc in real mode.
> 
> If we fail to update a hardware IOMMU table for an unexpected reason, we just
> clear it and move on as there is nothing really we can do about it -
> for example, if we hot plug a VFIO device to a guest, existing TCE tables
> will be mirrored automatically to the hardware and there is no interface
> to report to the guest about possible failures.
> 
> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
> and associates a physical IOMMU table with the SPAPR TCE table (which
> is a guest view of the hardware IOMMU table). The iommu_table object
> is cached and referenced so we do not have to look up for it in real mode.
> 
> This does not implement the UNSET counterpart as there is no use for it -
> once the acceleration is enabled, the existing userspace won't
> disable it unless a VFIO container is destroyed; this adds necessary
> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
> 
> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> space.
> 
> This adds a real mode version of WARN_ON_ONCE() as the generic version
> causes problems with rcu_sched. Since we are testing what vmalloc_to_phys()
> returns in the code, this also adds a check for the already existing
> vmalloc_to_phys() call in kvmppc_rm_h_put_tce_indirect().
> 
> This finally makes use of vfio_external_user_iommu_id() which was
> introduced quite some time ago and was considered for removal.
> 
> Tests show that this patch increases transmission speed from 220MB/s
> to 750..1020MB/s on 10Gb network (Chelsio CXGB3 10Gb ethernet card).
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
> Changes:
> v10:
> * fixed leaking references in virt/kvm/vfio.c
> * moved code to helpers - kvm_vfio_group_get_iommu_group, 
> kvm_spapr_tce_release_vfio_group
> * fixed possible race between referencing table and destroying it via
> VFIO add/remove window ioctls()
> 
> v9:
> * removed referencing a group in KVM, only referencing iommu_table's now
> * fixed a reference leak in KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE handler
> * fixed typo in vfio.txt
> * removed @argsz and @flags from struct kvm_vfio_spapr_tce
> 
> v8:
> * changed all (!pua) checks to return H_TOO_HARD as ioctl() is supposed
> to handle them
> * changed vmalloc_to_phys() callers to return H_HARDWARE
> * changed real mode iommu_tce_xchg_rm() callers to return H_TOO_HARD
> and added a comment about this in the code
> * changed virtual mode iommu_tce_xchg() callers to return H_HARDWARE
> and do WARN_ON
> * added WARN_ON_ONCE_RM(!rmap) in kvmppc_rm_h_put_tce_indirect() to
> have all vmalloc_to_phys() callsites covered
> 
> v7:
> * added realmode-friendly WARN_ON_ONCE_RM
> 
> v6:
> * changed handling of errors returned by kvmppc_(rm_)tce_iommu_(un)map()
> * moved kvmppc_gpa_to_ua() to TCE validation
> 
> v5:
> * changed error codes in multiple places
> * added a bunch of WARN_ON() in places which should not really happen
> * added a check that an iommu table is not already attached to a LIOBN
> * dropped explicit calls to iommu_tce_clear_param_check/
> iommu_tce_put_param_check as kvmppc_tce_validate/kvmppc_ioba_validate
> call them anyway (since the previous patch)
> * if we fail to update a hardware IOMMU table for an unexpected reason,
> this just clears the entry
> 
> v4:
> * added a note to the commit log about allowing multiple updates of
> the same IOMMU table;
> * instead of checking whether any memory was preregistered, this
> returns H_TOO_HARD if a specific page was not;
> * fixed comments from v3 about error handling in many places;
> * simplified TCE handlers and merged IOMMU parts inline - for example,
> there used to be kvmppc_h_put_tce_iommu(), now it is merged into
> kvmppc_h_put_tce(); this allows to check IOBA boundaries against
> the first attached table only (makes the code simpler);
> 
> v3:
> * simplified not to use VFIO group notifiers
> * 

Re: [PATCH kernel v9 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO

2017-03-16 Thread Alex Williamson
On Thu, 16 Mar 2017 18:09:32 +1100
Alexey Kardashevskiy  wrote:

> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
> without passing them to user space, which saves time on switching
> to user space and back.
> 
> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
> KVM tries to handle a TCE request in real mode; if that fails,
> it passes the request to virtual mode to complete the operation.
> If the virtual mode handler fails, the request is passed to
> user space; this is not expected to happen though.
> 
> To avoid dealing with page use counters (which is tricky in real mode),
> this only accelerates SPAPR TCE IOMMU v2 clients which are required
> to pre-register the userspace memory. The very first TCE request will
> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
> of the TCE table (iommu_table::it_userspace) is not allocated till
> the very first mapping happens and we cannot call vmalloc in real mode.
> 
> If we fail to update a hardware IOMMU table for an unexpected reason, we just
> clear it and move on as there is nothing really we can do about it -
> for example, if we hot plug a VFIO device to a guest, existing TCE tables
> will be mirrored automatically to the hardware and there is no interface
> to report to the guest about possible failures.
> 
> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
> and associates a physical IOMMU table with the SPAPR TCE table (which
> is a guest view of the hardware IOMMU table). The iommu_table object
> is cached and referenced so we do not have to look it up in real mode.
> 
> This does not implement the UNSET counterpart as there is no use for it -
> once the acceleration is enabled, the existing userspace won't
> disable it unless a VFIO container is destroyed; this adds necessary
> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
> 
> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> space.
> 
> This adds a real mode version of WARN_ON_ONCE() as the generic version
> causes problems with rcu_sched. Since we are testing what vmalloc_to_phys()
> returns in the code, this also adds a check for the already existing
> vmalloc_to_phys() call in kvmppc_rm_h_put_tce_indirect().
> 
> This finally makes use of vfio_external_user_iommu_id() which was
> introduced quite some time ago and was considered for removal.
> 
> Tests show that this patch increases transmission speed from 220MB/s
> to 750..1020MB/s on 10Gb network (Chelsio CXGB3 10Gb ethernet card).
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
> Changes:
> v9:
> * removed referencing a group in KVM, only referencing iommu_table's now
> * fixed a reference leak in KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE handler
> * fixed typo in vfio.txt
> * removed @argsz and @flags from struct kvm_vfio_spapr_tce
> 
> v8:
> * changed all (!pua) checks to return H_TOO_HARD as ioctl() is supposed
> to handle them
> * changed vmalloc_to_phys() callers to return H_HARDWARE
> * changed real mode iommu_tce_xchg_rm() callers to return H_TOO_HARD
> and added a comment about this in the code
> * changed virtual mode iommu_tce_xchg() callers to return H_HARDWARE
> and do WARN_ON
> * added WARN_ON_ONCE_RM(!rmap) in kvmppc_rm_h_put_tce_indirect() to
> have all vmalloc_to_phys() callsites covered
> 
> v7:
> * added realmode-friendly WARN_ON_ONCE_RM
> 
> v6:
> * changed handling of errors returned by kvmppc_(rm_)tce_iommu_(un)map()
> * moved kvmppc_gpa_to_ua() to TCE validation
> 
> v5:
> * changed error codes in multiple places
> * added a bunch of WARN_ON() in places which should not really happen
> * added a check that an iommu table is not already attached to a LIOBN
> * dropped explicit calls to iommu_tce_clear_param_check/
> iommu_tce_put_param_check as kvmppc_tce_validate/kvmppc_ioba_validate
> call them anyway (since the previous patch)
> * if we fail to update a hardware IOMMU table for an unexpected reason,
> this just clears the entry
> 
> v4:
> * added a note to the commit log about allowing multiple updates of
> the same IOMMU table;
> * instead of checking whether any memory was preregistered, this
> returns H_TOO_HARD if a specific page was not;
> * fixed comments from v3 about error handling in many places;
> * simplified TCE handlers and merged IOMMU parts inline - for example,
> there used to be kvmppc_h_put_tce_iommu(), now it is merged into
> kvmppc_h_put_tce(); this allows to check IOBA boundaries against
> the first attached table only (makes the code simpler);
> 
> v3:
> * simplified not to use VFIO group notifiers
> * reworked cleanup, should be cleaner/simpler now
> 
> v2:
> * reworked to use new VFIO notifiers
> * now same iommu_table may appear in the list several times, to be fixed later
> ---
>  Documentation/virtual/kvm/devices/vfio.txt |  18 +-
>  

Re: [PATCH kernel v8 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO

2017-03-15 Thread Alex Williamson
On Thu, 16 Mar 2017 00:21:07 +1100
Alexey Kardashevskiy <a...@ozlabs.ru> wrote:

> On 15/03/17 08:05, Alex Williamson wrote:
> > On Fri, 10 Mar 2017 14:53:37 +1100
> > Alexey Kardashevskiy <a...@ozlabs.ru> wrote:
> >   
> >> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> >> and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
> >> without passing them to user space, which saves time on switching
> >> to user space and back.
> >>
> >> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
> >> KVM tries to handle a TCE request in real mode; if that fails,
> >> it passes the request to virtual mode to complete the operation.
> >> If the virtual mode handler fails, the request is passed to
> >> user space; this is not expected to happen though.
> >>
> >> To avoid dealing with page use counters (which is tricky in real mode),
> >> this only accelerates SPAPR TCE IOMMU v2 clients which are required
> >> to pre-register the userspace memory. The very first TCE request will
> >> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
> >> of the TCE table (iommu_table::it_userspace) is not allocated till
> >> the very first mapping happens and we cannot call vmalloc in real mode.
> >>
> >> If we fail to update a hardware IOMMU table for an unexpected reason, we just
> >> clear it and move on as there is nothing really we can do about it -
> >> for example, if we hot plug a VFIO device to a guest, existing TCE tables
> >> will be mirrored automatically to the hardware and there is no interface
> >> to report to the guest about possible failures.
> >>
> >> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
> >> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
> >> and associates a physical IOMMU table with the SPAPR TCE table (which
> >> is a guest view of the hardware IOMMU table). The iommu_table object
> >> is cached and referenced so we do not have to look it up in real mode.
> >>
> >> This does not implement the UNSET counterpart as there is no use for it -
> >> once the acceleration is enabled, the existing userspace won't
> >> disable it unless a VFIO container is destroyed; this adds necessary
> >> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
> >>
> >> As this creates a descriptor per IOMMU table-LIOBN couple (called
> >> kvmppc_spapr_tce_iommu_table), it is possible to have several
> >> descriptors with the same iommu_table (hardware IOMMU table) attached
> >> to the same LIOBN; we do not remove duplicates though as
> >> iommu_table_ops::exchange does not just update a TCE entry (which is
> >> shared among IOMMU groups) but also invalidates the TCE cache
> >> (one per IOMMU group).
> >>
> >> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> >> space.
> >>
> >> This adds a real mode version of WARN_ON_ONCE() as the generic version
> >> causes problems with rcu_sched. Since we are testing what vmalloc_to_phys()
> >> returns in the code, this also adds a check for the already existing
> >> vmalloc_to_phys() call in kvmppc_rm_h_put_tce_indirect().
> >>
> >> This finally makes use of vfio_external_user_iommu_id() which was
> >> introduced quite some time ago and was considered for removal.
> >>
> >> Tests show that this patch increases transmission speed from 220MB/s
> >> to 750..1020MB/s on 10Gb network (Chelsio CXGB3 10Gb ethernet card).
> >>
> >> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
> >> ---
> >> Changes:
> >> v8:
> >> * changed all (!pua) checks to return H_TOO_HARD as ioctl() is supposed
> >> to handle them
> >> * changed vmalloc_to_phys() callers to return H_HARDWARE
> >> * changed real mode iommu_tce_xchg_rm() callers to return H_TOO_HARD
> >> and added a comment about this in the code
> >> * changed virtual mode iommu_tce_xchg() callers to return H_HARDWARE
> >> and do WARN_ON
> >> * added WARN_ON_ONCE_RM(!rmap) in kvmppc_rm_h_put_tce_indirect() to
> >> have all vmalloc_to_phys() callsites covered
> >>
> >> v7:
> >> * added realmode-friendly WARN_ON_ONCE_RM
> >>
> >> v6:
> >> * changed handling of errors returned by kvmppc_(rm_)tce_iommu_(un)map()
> >> * moved kvmppc_gpa_to_ua() to TCE validation
> >>
> >> v5:
> >> * changed error co

Re: [PATCH kernel v8 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO

2017-03-15 Thread Alex Williamson
On Wed, 15 Mar 2017 15:40:14 +1100
David Gibson  wrote:
> > > diff --git a/arch/powerpc/kvm/book3s_64_vio.c 
> > > b/arch/powerpc/kvm/book3s_64_vio.c
> > > index e96a4590464c..be18cda01e1b 100644
> > > --- a/arch/powerpc/kvm/book3s_64_vio.c
> > > +++ b/arch/powerpc/kvm/book3s_64_vio.c
> > > @@ -28,6 +28,10 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > +#include 
> > > +#include 
> > > +#include 
> > > +#include 
> > >  
> > >  #include 
> > >  #include 
> > > @@ -40,6 +44,36 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > +#include 
> > > +
> > > +static void kvm_vfio_group_put_external_user(struct vfio_group 
> > > *vfio_group)
> > > +{
> > > + void (*fn)(struct vfio_group *);
> > > +
> > > + fn = symbol_get(vfio_group_put_external_user);
> > > + if (WARN_ON(!fn))
> > > + return;
> > > +
> > > + fn(vfio_group);
> > > +
> > > + symbol_put(vfio_group_put_external_user);
> > > +}
> > > +
> > > +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
> > > +{
> > > + int (*fn)(struct vfio_group *);
> > > + int ret = -1;
> > > +
> > > + fn = symbol_get(vfio_external_user_iommu_id);
> > > + if (!fn)
> > > + return ret;
> > > +
> > > + ret = fn(vfio_group);
> > > +
> > > + symbol_put(vfio_external_user_iommu_id);
> > > +
> > > + return ret;
> > > +}  
> > 
> > 
> > Ugh.  This feels so wrong.  Why can't you have kvm-vfio pass the
> > iommu_group?  Why do you need to hold this additional vfio_group
> > reference?  
> 
> Keeping the vfio_group reference makes sense to me, since we don't
> want the vfio context for the group to go away while it's attached to
> the LIOBN.

But there's already a reference for that, it's taken by
KVM_DEV_VFIO_GROUP_ADD and held until KVM_DEV_VFIO_GROUP_DEL.  Both the
DEL path and the cleanup path call kvm_spapr_tce_release_iommu_group()
before releasing that reference, so it seems entirely redundant.

> However, going via the iommu_id rather than just having an interface
> to directly grab the iommu group from the vfio_group seems bizarre to
> me.  I'm ok with cleaning that up later, however.

We have kvm_spapr_tce_attach_iommu_group() and
kvm_spapr_tce_release_iommu_group(), but both take a vfio_group, not an
iommu_group as a parameter.  I don't particularly have a problem with
the vfio_group -> iommu ID -> iommu_group, but if we drop the extra
vfio_group reference and pass the iommu_group itself to these functions
then we can keep all the symbol reference stuff in the kvm-vfio glue
layer.  Thanks,

Alex


Re: [PATCH kernel v8 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO

2017-03-14 Thread Alex Williamson
On Fri, 10 Mar 2017 14:53:37 +1100
Alexey Kardashevskiy  wrote:

> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
> without passing them to user space, which saves time on switching
> to user space and back.
> 
> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
> KVM tries to handle a TCE request in real mode; if that fails,
> it passes the request to virtual mode to complete the operation.
> If the virtual mode handler fails, the request is passed to
> user space; this is not expected to happen though.
> 
> To avoid dealing with page use counters (which is tricky in real mode),
> this only accelerates SPAPR TCE IOMMU v2 clients which are required
> to pre-register the userspace memory. The very first TCE request will
> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
> of the TCE table (iommu_table::it_userspace) is not allocated till
> the very first mapping happens and we cannot call vmalloc in real mode.
> 
> If we fail to update a hardware IOMMU table for an unexpected reason, we just
> clear it and move on as there is nothing really we can do about it -
> for example, if we hot plug a VFIO device to a guest, existing TCE tables
> will be mirrored automatically to the hardware and there is no interface
> to report to the guest about possible failures.
> 
> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
> and associates a physical IOMMU table with the SPAPR TCE table (which
> is a guest view of the hardware IOMMU table). The iommu_table object
> is cached and referenced so we do not have to look it up in real mode.
> 
> This does not implement the UNSET counterpart as there is no use for it -
> once the acceleration is enabled, the existing userspace won't
> disable it unless a VFIO container is destroyed; this adds necessary
> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
> 
> As this creates a descriptor per IOMMU table-LIOBN couple (called
> kvmppc_spapr_tce_iommu_table), it is possible to have several
> descriptors with the same iommu_table (hardware IOMMU table) attached
> to the same LIOBN; we do not remove duplicates though as
> iommu_table_ops::exchange does not just update a TCE entry (which is
> shared among IOMMU groups) but also invalidates the TCE cache
> (one per IOMMU group).
> 
> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> space.
> 
> This adds a real mode version of WARN_ON_ONCE() as the generic version
> causes problems with rcu_sched. Since we are testing what vmalloc_to_phys()
> returns in the code, this also adds a check for the already existing
> vmalloc_to_phys() call in kvmppc_rm_h_put_tce_indirect().
> 
> This finally makes use of vfio_external_user_iommu_id() which was
> introduced quite some time ago and was considered for removal.
> 
> Tests show that this patch increases transmission speed from 220MB/s
> to 750..1020MB/s on 10Gb network (Chelsio CXGB3 10Gb ethernet card).
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
> Changes:
> v8:
> * changed all (!pua) checks to return H_TOO_HARD as ioctl() is supposed
> to handle them
> * changed vmalloc_to_phys() callers to return H_HARDWARE
> * changed real mode iommu_tce_xchg_rm() callers to return H_TOO_HARD
> and added a comment about this in the code
> * changed virtual mode iommu_tce_xchg() callers to return H_HARDWARE
> and do WARN_ON
> * added WARN_ON_ONCE_RM(!rmap) in kvmppc_rm_h_put_tce_indirect() to
> have all vmalloc_to_phys() callsites covered
> 
> v7:
> * added realmode-friendly WARN_ON_ONCE_RM
> 
> v6:
> * changed handling of errors returned by kvmppc_(rm_)tce_iommu_(un)map()
> * moved kvmppc_gpa_to_ua() to TCE validation
> 
> v5:
> * changed error codes in multiple places
> * added a bunch of WARN_ON() in places which should not really happen
> * added a check that an iommu table is not already attached to a LIOBN
> * dropped explicit calls to iommu_tce_clear_param_check/
> iommu_tce_put_param_check as kvmppc_tce_validate/kvmppc_ioba_validate
> call them anyway (since the previous patch)
> * if we fail to update a hardware IOMMU table for an unexpected reason,
> this just clears the entry
> 
> v4:
> * added a note to the commit log about allowing multiple updates of
> the same IOMMU table;
> * instead of checking whether any memory was preregistered, this
> returns H_TOO_HARD if a specific page was not;
> * fixed comments from v3 about error handling in many places;
> * simplified TCE handlers and merged IOMMU parts inline - for example,
> there used to be kvmppc_h_put_tce_iommu(), now it is merged into
> kvmppc_h_put_tce(); this allows to check IOBA boundaries against
> the first attached table only (makes the code simpler);
> 
> v3:
> * simplified not to use VFIO group notifiers
> * reworked cleanup, should be cleaner/simpler now
> 
> v2:
> * reworked to use new 

Re: [PATCH kernel v8 04/10] powerpc/vfio_spapr_tce: Add reference counting to iommu_table

2017-03-14 Thread Alex Williamson
ci-ioda.c
> index 7916d0cb05fe..ec3e565de511 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1425,7 +1425,7 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev 
> *dev, struct pnv_ioda_pe
>   iommu_group_put(pe->table_group.group);
>   BUG_ON(pe->table_group.group);
>   }
> - iommu_free_table(tbl, of_node_full_name(dev->dev.of_node));
> + iommu_table_put(tbl);
>  }
>  
>  static void pnv_ioda_release_vf_PE(struct pci_dev *pdev)
> @@ -2226,7 +2226,7 @@ static void pnv_pci_ioda1_setup_dma_pe(struct pnv_phb 
> *phb,
>   __free_pages(tce_mem, get_order(tce32_segsz * segs));
>   if (tbl) {
>   pnv_pci_unlink_table_and_group(tbl, &pe->table_group);
> - iommu_free_table(tbl, "pnv");
> + iommu_table_put(tbl);
>   }
>  }
>  
> @@ -2322,7 +2322,7 @@ static long pnv_pci_ioda2_create_table(struct 
> iommu_table_group *table_group,
>   bus_offset, page_shift, window_size,
>   levels, tbl);
>   if (ret) {
> - iommu_free_table(tbl, "pnv");
> + iommu_table_put(tbl);
>   return ret;
>   }
>  
> @@ -2366,7 +2366,7 @@ static long pnv_pci_ioda2_setup_default_config(struct 
> pnv_ioda_pe *pe)
>   if (rc) {
>   pe_err(pe, "Failed to configure 32-bit TCE table, err %ld\n",
>   rc);
> - iommu_free_table(tbl, "");
> + iommu_table_put(tbl);
>   return rc;
>   }
>  
> @@ -2454,7 +2454,7 @@ static void pnv_ioda2_take_ownership(struct 
> iommu_table_group *table_group)
>   pnv_pci_ioda2_unset_window(&pe->table_group, 0);
>   if (pe->pbus)
>   pnv_ioda_setup_bus_dma(pe, pe->pbus, false);
> - iommu_free_table(tbl, "pnv");
> + iommu_table_put(tbl);
>  }
>  
>  static void pnv_ioda2_release_ownership(struct iommu_table_group 
> *table_group)
> @@ -3427,7 +3427,7 @@ static void pnv_pci_ioda1_release_pe_dma(struct 
> pnv_ioda_pe *pe)
>   }
>  
>   free_pages(tbl->it_base, get_order(tbl->it_size << 3));
> - iommu_free_table(tbl, "pnv");
> + iommu_table_put(tbl);
>  }
>  
>  static void pnv_pci_ioda2_release_pe_dma(struct pnv_ioda_pe *pe)
> @@ -3454,7 +3454,7 @@ static void pnv_pci_ioda2_release_pe_dma(struct 
> pnv_ioda_pe *pe)
>   }
>  
>   pnv_pci_ioda2_table_free_pages(tbl);
> - iommu_free_table(tbl, "pnv");
> + iommu_table_put(tbl);
>  }
>  
>  static void pnv_ioda_free_pe_seg(struct pnv_ioda_pe *pe,
> diff --git a/arch/powerpc/platforms/powernv/pci.c 
> b/arch/powerpc/platforms/powernv/pci.c
> index a43f22dc069e..9b2bdcad51ba 100644
> --- a/arch/powerpc/platforms/powernv/pci.c
> +++ b/arch/powerpc/platforms/powernv/pci.c
> @@ -767,6 +767,7 @@ struct iommu_table *pnv_pci_table_alloc(int nid)
>  
>   tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL, nid);
>   INIT_LIST_HEAD_RCU(&tbl->it_group_list);
> + kref_init(&tbl->it_kref);
>  
>   return tbl;
>  }
> diff --git a/arch/powerpc/platforms/pseries/iommu.c 
> b/arch/powerpc/platforms/pseries/iommu.c
> index 0a733ddae926..a713e20311b8 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -74,6 +74,7 @@ static struct iommu_table_group 
> *iommu_pseries_alloc_group(int node)
>   goto fail_exit;
>  
>   INIT_LIST_HEAD_RCU(&tbl->it_group_list);
> + kref_init(&tbl->it_kref);
>   tgl->table_group = table_group;
>   list_add_rcu(&tgl->next, &tbl->it_group_list);
>  
> @@ -115,7 +116,7 @@ static void iommu_pseries_free_group(struct 
> iommu_table_group *table_group,
>   BUG_ON(table_group->group);
>   }
>  #endif
> - iommu_free_table(tbl, node_name);
> + iommu_table_put(tbl);
>  
>   kfree(table_group);
>  }
> diff --git a/arch/powerpc/platforms/pseries/vio.c 
> b/arch/powerpc/platforms/pseries/vio.c
> index 720493932486..744d639da92c 100644
> --- a/arch/powerpc/platforms/pseries/vio.c
> +++ b/arch/powerpc/platforms/pseries/vio.c
> @@ -1318,7 +1318,7 @@ static void vio_dev_release(struct device *dev)
>   struct iommu_table *tbl = get_iommu_table_base(dev);
>  
>   if (tbl)
> - iommu_free_table(tbl, of_node_full_name(dev->of_node));
> + iommu_table_put(tbl);
>   of_node_put(dev->of_node);
>   kfree(to_vio_dev(dev));
>  }
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
> b/drivers/vfio/vfio_iommu_spapr_tce.c
> index fbec7348a7e5..4f6ca9d80ead 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -680,7 +680,7 @@ static void tce_iommu_free_table(struct tce_container 
> *container,
>   unsigned long pages = tbl->it_allocated_size >> PAGE_SHIFT;
>  
>   tce_iommu_userspace_view_free(tbl, container->mm);
> - iommu_free_table(tbl, "");
> + iommu_table_put(tbl);
>   decrement_locked_vm(container->mm, pages);
>  }
>  

Acked-by: Alex Williamson <alex.william...@redhat.com>


Re: [PATCH kernel v8 03/10] powerpc/iommu/vfio_spapr_tce: Cleanup iommu_table disposal

2017-03-14 Thread Alex Williamson
On Fri, 10 Mar 2017 14:53:30 +1100
Alexey Kardashevskiy <a...@ozlabs.ru> wrote:

> At the moment iommu_table can be disposed by either calling
> iommu_table_free() directly or it_ops::free(); the only implementation
> of free() is in IODA2 - pnv_ioda2_table_free() - and it calls
> iommu_table_free() anyway.
> 
> As we are going to have reference counting on tables, we need an unified
> way of disposing tables.
> 
> This moves it_ops::free() call into iommu_free_table() and makes use
> of the latter. The free() callback now handles only platform-specific
> data.
> 
> As from now on the iommu_free_table() calls it_ops->free(), we need
> to have it_ops initialized before calling iommu_free_table() so this
> moves this initialization in pnv_pci_ioda2_create_table().
> 
> This should cause no behavioral change.
> 
> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
> Reviewed-by: David Gibson <da...@gibson.dropbear.id.au>
> ---
> Changes:
> v5:
> * moved "tbl->it_ops = &pnv_ioda2_iommu_ops" earlier and updated
> the commit log
> ---
>  arch/powerpc/kernel/iommu.c   |  4 
>  arch/powerpc/platforms/powernv/pci-ioda.c | 10 --
>  drivers/vfio/vfio_iommu_spapr_tce.c   |  2 +-
>  3 files changed, 9 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index 9bace5df05d5..bc142d87130f 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -719,6 +719,9 @@ void iommu_free_table(struct iommu_table *tbl, const char 
> *node_name)
>   if (!tbl)
>   return;
>  
> + if (tbl->it_ops->free)
> + tbl->it_ops->free(tbl);
> +
>   if (!tbl->it_map) {
>   kfree(tbl);
>   return;
> @@ -745,6 +748,7 @@ void iommu_free_table(struct iommu_table *tbl, const char 
> *node_name)
>   /* free table */
>   kfree(tbl);
>  }
> +EXPORT_SYMBOL_GPL(iommu_free_table);

A slightly cringe worthy generically named export in arch code.

>  
>  /* Creates TCEs for a user provided buffer.  The user buffer must be
>   * contiguous real kernel storage (not vmalloc).  The address passed here
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
> b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 69c40b43daa3..7916d0cb05fe 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1425,7 +1425,6 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev 
> *dev, struct pnv_ioda_pe
>   iommu_group_put(pe->table_group.group);
>   BUG_ON(pe->table_group.group);
>   }
> - pnv_pci_ioda2_table_free_pages(tbl);
>   iommu_free_table(tbl, of_node_full_name(dev->dev.of_node));
>  }
>  
> @@ -2041,7 +2040,6 @@ static void pnv_ioda2_tce_free(struct iommu_table *tbl, 
> long index,
>  static void pnv_ioda2_table_free(struct iommu_table *tbl)
>  {
>   pnv_pci_ioda2_table_free_pages(tbl);
> - iommu_free_table(tbl, "pnv");
>  }
>  
>  static struct iommu_table_ops pnv_ioda2_iommu_ops = {
> @@ -2318,6 +2316,8 @@ static long pnv_pci_ioda2_create_table(struct 
> iommu_table_group *table_group,
>   if (!tbl)
>   return -ENOMEM;
>  
> + tbl->it_ops = &pnv_ioda2_iommu_ops;
> +
>   ret = pnv_pci_ioda2_table_alloc_pages(nid,
>   bus_offset, page_shift, window_size,
>   levels, tbl);
> @@ -2326,8 +2326,6 @@ static long pnv_pci_ioda2_create_table(struct 
> iommu_table_group *table_group,
>   return ret;
>   }
>  
> - tbl->it_ops = &pnv_ioda2_iommu_ops;
> -
>   *ptbl = tbl;
>  
>   return 0;
> @@ -2368,7 +2366,7 @@ static long pnv_pci_ioda2_setup_default_config(struct 
> pnv_ioda_pe *pe)
>   if (rc) {
>   pe_err(pe, "Failed to configure 32-bit TCE table, err %ld\n",
>   rc);
> - pnv_ioda2_table_free(tbl);
> + iommu_free_table(tbl, "");
>   return rc;
>   }
>  
> @@ -2456,7 +2454,7 @@ static void pnv_ioda2_take_ownership(struct 
> iommu_table_group *table_group)
>   pnv_pci_ioda2_unset_window(&pe->table_group, 0);
>   if (pe->pbus)
>   pnv_ioda_setup_bus_dma(pe, pe->pbus, false);
> - pnv_ioda2_table_free(tbl);
> + iommu_free_table(tbl, "pnv");
>  }
>  
>  static void pnv_ioda2_release_ownership(struct iommu_table_group 
> *table_group)
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
> b/drivers/vfio/vfio_iommu_spapr_tce.c
> index cf3de91

Re: [PATCH kernel v8 00/10] powerpc/kvm/vfio: Enable in-kernel acceleration

2017-03-14 Thread Alex Williamson
On Tue, 14 Mar 2017 11:55:33 +1100
David Gibson  wrote:

> On Tue, Mar 14, 2017 at 11:54:03AM +1100, Alexey Kardashevskiy wrote:
> > On 10/03/17 15:48, David Gibson wrote:  
> > > On Fri, Mar 10, 2017 at 02:53:27PM +1100, Alexey Kardashevskiy wrote:  
> > >> This is my current queue of patches to add acceleration of TCE
> > >> updates in KVM.
> > >>
> > >> This is based on Linus'es tree sha1 c1aa905a304e.  

Hmm, sure about that?  03/10 doesn't apply.

> > > 
> > > I think we're finally there - I've now sent an R-b for all patches.  
> > 
> > Thanks for the patience.
> > 
> > 
> > I suppose in order to proceed now I need an ack from Alex, correct?  
> 
> That, or simply for him to merge it.

Given the diffstat, I'd guess you're looking for acks from me and maybe
Paolo, but it looks like it should be merged through ppc trees.  Thanks,

Alex

> > >>
> > >> Please comment. Thanks.
> > >>
> > >> Changes:
> > >> v8:
> > >> * kept fixing oddities with error handling in 10/10
> > >>
> > >> v7:
> > >> * added realmode's WARN_ON_ONCE_RM in arch/powerpc/kvm/book3s_64_vio_hv.c
> > >>
> > >> v6:
> > >> * reworked the last patch in terms of error handling and parameters 
> > >> checking
> > >>
> > >> v5:
> > >> * replaced "KVM: PPC: Separate TCE validation from update" with
> > >> "KVM: PPC: iommu: Unify TCE checking"
> > >> * changed already reviewed "powerpc/iommu/vfio_spapr_tce: Cleanup 
> > >> iommu_table disposal"
> > >> * reworked "KVM: PPC: VFIO: Add in-kernel acceleration for VFIO"
> > >> * more details in individual commit logs
> > >>
> > >> v4:
> > >> * addressed comments from v3
> > >> * updated subject lines with correct component names
> > >> * regrouped the patchset in order:
> > >>  - powerpc fixes;
> > >>  - vfio_spapr_tce driver fixes;
> > >>  - KVM/PPC fixes;
> > >>  - KVM+PPC+VFIO;
> > >> * everything except last 2 patches have "Reviewed-By: David"
> > >>
> > >> v3:
> > >> * there was no full repost, only last patch was posted
> > >>
> > >> v2:
> > >> * 11/11 reworked to use new notifiers, it is rather RFC as it still has
> > >> a issue;
> > >> * got 09/11, 10/11 to use notifiers in 11/11;
> > >> * added rb: David to most of patches and added a comment in 05/11.
> > >>
> > >> Alexey Kardashevskiy (10):
> > >>   powerpc/mmu: Add real mode support for IOMMU preregistered memory
> > >>   powerpc/powernv/iommu: Add real mode version of
> > >> iommu_table_ops::exchange()
> > >>   powerpc/iommu/vfio_spapr_tce: Cleanup iommu_table disposal
> > >>   powerpc/vfio_spapr_tce: Add reference counting to iommu_table
> > >>   KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number
> > >>   KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently
> > >>   KVM: PPC: Pass kvm* to kvmppc_find_table()
> > >>   KVM: PPC: Use preregistered memory API to access TCE list
> > >>   KVM: PPC: iommu: Unify TCE checking
> > >>   KVM: PPC: VFIO: Add in-kernel acceleration for VFIO
> > >>
> > >>  Documentation/virtual/kvm/devices/vfio.txt |  22 +-
> > >>  arch/powerpc/include/asm/iommu.h   |  32 ++-
> > >>  arch/powerpc/include/asm/kvm_host.h|   8 +
> > >>  arch/powerpc/include/asm/kvm_ppc.h |  12 +-
> > >>  arch/powerpc/include/asm/mmu_context.h |   4 +
> > >>  include/uapi/linux/kvm.h   |   9 +
> > >>  arch/powerpc/kernel/iommu.c|  86 +---
> > >>  arch/powerpc/kvm/book3s_64_vio.c   | 330 
> > >> -
> > >>  arch/powerpc/kvm/book3s_64_vio_hv.c| 303 
> > >> ++
> > >>  arch/powerpc/kvm/powerpc.c |   2 +
> > >>  arch/powerpc/mm/mmu_context_iommu.c|  39 
> > >>  arch/powerpc/platforms/powernv/pci-ioda.c  |  46 ++--
> > >>  arch/powerpc/platforms/powernv/pci.c   |   1 +
> > >>  arch/powerpc/platforms/pseries/iommu.c |   3 +-
> > >>  arch/powerpc/platforms/pseries/vio.c   |   2 +-
> > >>  drivers/vfio/vfio_iommu_spapr_tce.c|   2 +-
> > >>  virt/kvm/vfio.c|  60 ++
> > >>  arch/powerpc/kvm/Kconfig   |   1 +
> > >>  18 files changed, 855 insertions(+), 107 deletions(-)
> > >>  
> > >   
> > 
> >   
> 
> 
> 
> 


