RE: [PATCH 3/3] vfio: Report PASID capability via VFIO_DEVICE_FEATURE ioctl

2023-11-26 Thread Duan, Zhenzhong



>-Original Message-
>From: Liu, Yi L 
>Sent: Monday, November 27, 2023 2:39 PM
>Subject: [PATCH 3/3] vfio: Report PASID capability via VFIO_DEVICE_FEATURE
>ioctl
>
>This reports the PASID capability data to userspace via VFIO_DEVICE_FEATURE,
>so userspace can probe the PASID capability through it. This differs from
>other capabilities, which are reported to userspace when the user reads the
>device's PCI configuration space. There are two reasons for this.
>
> - First, QEMU by default exposes all available PCI capabilities in the
>   vfio-pci config space to the guest as read-only, so adding the PASID
>   capability to the vfio-pci config space would expose it to the guest
>   automatically, even on an old QEMU that does not actually support it.
>
> - Second, the PASID capability does not exist on VFs (a VF shares the cap
>   of its PF). Creating a virtual PASID capability in the vfio-pci config
>   space requires finding a hole to place it, which may need device-specific
>   knowledge to avoid conflicts with device-specific registers such as
>   hidden bits in the VF config space. It is simpler to move this burden to
>   the VMM than to maintain a quirk system in the kernel.
>
>Suggested-by: Alex Williamson 
>Signed-off-by: Yi Liu 
>---
> drivers/vfio/pci/vfio_pci_core.c | 47 
> include/uapi/linux/vfio.h| 13 +
> 2 files changed, 60 insertions(+)
>
>diff --git a/drivers/vfio/pci/vfio_pci_core.c 
>b/drivers/vfio/pci/vfio_pci_core.c
>index 1929103ee59a..8038aa45500e 100644
>--- a/drivers/vfio/pci/vfio_pci_core.c
>+++ b/drivers/vfio/pci/vfio_pci_core.c
>@@ -1495,6 +1495,51 @@ static int vfio_pci_core_feature_token(struct
>vfio_device *device, u32 flags,
>   return 0;
> }
>
>+static int vfio_pci_core_feature_pasid(struct vfio_device *device, u32 flags,
>+ struct vfio_device_feature_pasid __user
>*arg,
>+ size_t argsz)
>+{
>+  struct vfio_pci_core_device *vdev =
>+  container_of(device, struct vfio_pci_core_device, vdev);
>+  struct vfio_device_feature_pasid pasid = { 0 };
>+  struct pci_dev *pdev = vdev->pdev;
>+  u32 capabilities = 0;
>+  int ret;
>+
>+  /* We do not support SET of the PASID capability */
>+  ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET,
>+   sizeof(pasid));
>+  if (ret != 1)
>+  return ret;
>+
>+  /*
>+   * Need to go to the PF if the device is a VF, as a VF shares
>+   * its PF's PASID capability.
>+   */
>+  if (pdev->is_virtfn)
>+  pdev = pci_physfn(pdev);
>+
>+  if (!pdev->pasid_enabled)
>+  goto out;

Does a PF bound to VFIO have PASID enabled by default?
Isn't it the guest kernel's responsibility to enable the PASID cap of an assigned PF?

Thanks
Zhenzhong

>+
>+#ifdef CONFIG_PCI_PASID
>+  pci_read_config_dword(pdev, pdev->pasid_cap + PCI_PASID_CAP,
>+&capabilities);
>+#endif
>+
>+  if (capabilities & PCI_PASID_CAP_EXEC)
>+  pasid.capabilities |= VFIO_DEVICE_PASID_CAP_EXEC;
>+  if (capabilities & PCI_PASID_CAP_PRIV)
>+  pasid.capabilities |= VFIO_DEVICE_PASID_CAP_PRIV;
>+
>+  pasid.width = (capabilities >> 8) & 0x1f;
>+
>+out:
>+  if (copy_to_user(arg, &pasid, sizeof(pasid)))
>+  return -EFAULT;
>+  return 0;
>+}
>+
> int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
>   void __user *arg, size_t argsz)
> {
>@@ -1508,6 +1553,8 @@ int vfio_pci_core_ioctl_feature(struct vfio_device
>*device, u32 flags,
>   return vfio_pci_core_pm_exit(device, flags, arg, argsz);
>   case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
>   return vfio_pci_core_feature_token(device, flags, arg, argsz);
>+  case VFIO_DEVICE_FEATURE_PASID:
>+  return vfio_pci_core_feature_pasid(device, flags, arg, argsz);
>   default:
>   return -ENOTTY;
>   }
>diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>index 495193629029..8326faf8622b 100644
>--- a/include/uapi/linux/vfio.h
>+++ b/include/uapi/linux/vfio.h
>@@ -1512,6 +1512,19 @@ struct vfio_device_feature_bus_master {
> };
> #define VFIO_DEVICE_FEATURE_BUS_MASTER 10
>
>+/**
>+ * Upon VFIO_DEVICE_FEATURE_GET, return the PASID capability for the device.
>+ * Zero width means no support for PASID.
>+ */
>+struct vfio_device_feature_pasid {
>+  __u16 capabilities;
>+#define VFIO_DEVICE_PASID_CAP_EXEC(1 << 0)
>+#define VFIO_DEVICE_PASID_CAP_PRIV(1 << 1)
>+  __u8 width;
>+  __u8 __reserved;
>+};
>+#define VFIO_DEVICE_FEATURE_PASID 11
>+
> /*  API for Type1 VFIO IOMMU  */
>
> /**
>--
>2.34.1




RE: [PATCH 2/3] vfio: Add VFIO_DEVICE_PASID_[AT|DE]TACH_IOMMUFD_PT

2023-11-26 Thread Duan, Zhenzhong



>-Original Message-
>From: Liu, Yi L 
>Sent: Monday, November 27, 2023 2:39 PM
>Subject: [PATCH 2/3] vfio: Add
>VFIO_DEVICE_PASID_[AT|DE]TACH_IOMMUFD_PT
>
>This adds ioctls for userspace to attach/detach a given pasid of a vfio
>device to/from an IOAS/HWPT.
>
>Signed-off-by: Yi Liu 
>---
> drivers/vfio/device_cdev.c | 45 +++
> drivers/vfio/vfio.h|  4 +++
> drivers/vfio/vfio_main.c   |  8 ++
> include/uapi/linux/vfio.h  | 55 ++
> 4 files changed, 112 insertions(+)
>
>diff --git a/drivers/vfio/device_cdev.c b/drivers/vfio/device_cdev.c
>index e75da0a70d1f..c2ac7ed44537 100644
>--- a/drivers/vfio/device_cdev.c
>+++ b/drivers/vfio/device_cdev.c
>@@ -210,6 +210,51 @@ int vfio_df_ioctl_detach_pt(struct vfio_device_file *df,
>   return 0;
> }
>
>+int vfio_df_ioctl_pasid_attach_pt(struct vfio_device_file *df,
>+struct vfio_device_pasid_attach_iommufd_pt
>__user *arg)
>+{
>+  struct vfio_device *device = df->device;
>+  struct vfio_device_pasid_attach_iommufd_pt attach;
>+  unsigned long minsz;
>+  int ret;
>+
>+  minsz = offsetofend(struct vfio_device_pasid_attach_iommufd_pt, pt_id);
>+
>+  if (copy_from_user(&attach, arg, minsz))
>+  return -EFAULT;
>+
>+  if (attach.argsz < minsz || attach.flags)
>+  return -EINVAL;
>+
>+  mutex_lock(&device->dev_set->lock);
>+  ret = device->ops->pasid_attach_ioas(device, attach.pasid,
>&attach.pt_id);
>+  mutex_unlock(&device->dev_set->lock);
>+
>+  return ret;
>+}
>+
>+int vfio_df_ioctl_pasid_detach_pt(struct vfio_device_file *df,
>+struct vfio_device_pasid_detach_iommufd_pt
>__user *arg)
>+{
>+  struct vfio_device *device = df->device;
>+  struct vfio_device_pasid_detach_iommufd_pt detach;
>+  unsigned long minsz;
>+
>+  minsz = offsetofend(struct vfio_device_pasid_detach_iommufd_pt, flags);

The pasid field isn't copied; should minsz be computed up to pasid here?

Thanks
Zhenzhong
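
(For reference, a minimal sketch of the change the comment above points at,
assuming the detach struct carries argsz/flags/pasid like the attach variant:
minsz would need to extend through pasid so the field is actually copied from
userspace.)

	minsz = offsetofend(struct vfio_device_pasid_detach_iommufd_pt, pasid);

	if (copy_from_user(&detach, arg, minsz))
		return -EFAULT;

	if (detach.argsz < minsz || detach.flags)
		return -EINVAL;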



[PATCH 3/3] vfio: Report PASID capability via VFIO_DEVICE_FEATURE ioctl

2023-11-26 Thread Yi Liu
This reports the PASID capability data to userspace via VFIO_DEVICE_FEATURE,
so userspace can probe the PASID capability through it. This differs from
other capabilities, which are reported to userspace when the user reads the
device's PCI configuration space. There are two reasons for this.

 - First, QEMU by default exposes all available PCI capabilities in the
   vfio-pci config space to the guest as read-only, so adding the PASID
   capability to the vfio-pci config space would expose it to the guest
   automatically, even on an old QEMU that does not actually support it.

 - Second, the PASID capability does not exist on VFs (a VF shares the cap
   of its PF). Creating a virtual PASID capability in the vfio-pci config
   space requires finding a hole to place it, which may need device-specific
   knowledge to avoid conflicts with device-specific registers such as
   hidden bits in the VF config space. It is simpler to move this burden to
   the VMM than to maintain a quirk system in the kernel.

Suggested-by: Alex Williamson 
Signed-off-by: Yi Liu 
---
 drivers/vfio/pci/vfio_pci_core.c | 47 
 include/uapi/linux/vfio.h| 13 +
 2 files changed, 60 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 1929103ee59a..8038aa45500e 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1495,6 +1495,51 @@ static int vfio_pci_core_feature_token(struct 
vfio_device *device, u32 flags,
return 0;
 }
 
+static int vfio_pci_core_feature_pasid(struct vfio_device *device, u32 flags,
+  struct vfio_device_feature_pasid __user 
*arg,
+  size_t argsz)
+{
+   struct vfio_pci_core_device *vdev =
+   container_of(device, struct vfio_pci_core_device, vdev);
+   struct vfio_device_feature_pasid pasid = { 0 };
+   struct pci_dev *pdev = vdev->pdev;
+   u32 capabilities = 0;
+   int ret;
+
+   /* We do not support SET of the PASID capability */
+   ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET,
+sizeof(pasid));
+   if (ret != 1)
+   return ret;
+
+   /*
+* Need to go to the PF if the device is a VF, as a VF shares
+* its PF's PASID capability.
+*/
+   if (pdev->is_virtfn)
+   pdev = pci_physfn(pdev);
+
+   if (!pdev->pasid_enabled)
+   goto out;
+
+#ifdef CONFIG_PCI_PASID
+   pci_read_config_dword(pdev, pdev->pasid_cap + PCI_PASID_CAP,
+ &capabilities);
+#endif
+
+   if (capabilities & PCI_PASID_CAP_EXEC)
+   pasid.capabilities |= VFIO_DEVICE_PASID_CAP_EXEC;
+   if (capabilities & PCI_PASID_CAP_PRIV)
+   pasid.capabilities |= VFIO_DEVICE_PASID_CAP_PRIV;
+
+   pasid.width = (capabilities >> 8) & 0x1f;
+
+out:
+   if (copy_to_user(arg, &pasid, sizeof(pasid)))
+   return -EFAULT;
+   return 0;
+}
+
 int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
void __user *arg, size_t argsz)
 {
@@ -1508,6 +1553,8 @@ int vfio_pci_core_ioctl_feature(struct vfio_device 
*device, u32 flags,
return vfio_pci_core_pm_exit(device, flags, arg, argsz);
case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
return vfio_pci_core_feature_token(device, flags, arg, argsz);
+   case VFIO_DEVICE_FEATURE_PASID:
+   return vfio_pci_core_feature_pasid(device, flags, arg, argsz);
default:
return -ENOTTY;
}
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 495193629029..8326faf8622b 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1512,6 +1512,19 @@ struct vfio_device_feature_bus_master {
 };
 #define VFIO_DEVICE_FEATURE_BUS_MASTER 10
 
+/**
+ * Upon VFIO_DEVICE_FEATURE_GET, return the PASID capability for the device.
+ * Zero width means no support for PASID.
+ */
+struct vfio_device_feature_pasid {
+   __u16 capabilities;
+#define VFIO_DEVICE_PASID_CAP_EXEC (1 << 0)
+#define VFIO_DEVICE_PASID_CAP_PRIV (1 << 1)
+   __u8 width;
+   __u8 __reserved;
+};
+#define VFIO_DEVICE_FEATURE_PASID 11
+
 /*  API for Type1 VFIO IOMMU  */
 
 /**
-- 
2.34.1
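
For context, a minimal userspace sketch of probing this feature through the
standard VFIO_DEVICE_FEATURE wrapper (the device fd handling around it is
assumed; struct and flag names follow the uapi addition above):

#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int probe_pasid_cap(int device_fd)
{
	__u8 buf[sizeof(struct vfio_device_feature) +
		 sizeof(struct vfio_device_feature_pasid)] = {};
	struct vfio_device_feature *feature = (void *)buf;
	struct vfio_device_feature_pasid *pasid = (void *)feature->data;

	feature->argsz = sizeof(buf);
	feature->flags = VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_PASID;

	if (ioctl(device_fd, VFIO_DEVICE_FEATURE, feature))
		return -1;

	if (!pasid->width) {
		printf("PASID not supported\n");
		return 0;
	}

	printf("PASID width %u%s%s\n", pasid->width,
	       pasid->capabilities & VFIO_DEVICE_PASID_CAP_EXEC ? ", exec" : "",
	       pasid->capabilities & VFIO_DEVICE_PASID_CAP_PRIV ? ", priv" : "");
	return 0;
}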




[PATCH 2/3] vfio: Add VFIO_DEVICE_PASID_[AT|DE]TACH_IOMMUFD_PT

2023-11-26 Thread Yi Liu
This adds ioctls for userspace to attach/detach a given pasid of a vfio
device to/from an IOAS/HWPT.

Signed-off-by: Yi Liu 
---
 drivers/vfio/device_cdev.c | 45 +++
 drivers/vfio/vfio.h|  4 +++
 drivers/vfio/vfio_main.c   |  8 ++
 include/uapi/linux/vfio.h  | 55 ++
 4 files changed, 112 insertions(+)

diff --git a/drivers/vfio/device_cdev.c b/drivers/vfio/device_cdev.c
index e75da0a70d1f..c2ac7ed44537 100644
--- a/drivers/vfio/device_cdev.c
+++ b/drivers/vfio/device_cdev.c
@@ -210,6 +210,51 @@ int vfio_df_ioctl_detach_pt(struct vfio_device_file *df,
return 0;
 }
 
+int vfio_df_ioctl_pasid_attach_pt(struct vfio_device_file *df,
+ struct vfio_device_pasid_attach_iommufd_pt 
__user *arg)
+{
+   struct vfio_device *device = df->device;
+   struct vfio_device_pasid_attach_iommufd_pt attach;
+   unsigned long minsz;
+   int ret;
+
+   minsz = offsetofend(struct vfio_device_pasid_attach_iommufd_pt, pt_id);
+
+   if (copy_from_user(&attach, arg, minsz))
+   return -EFAULT;
+
+   if (attach.argsz < minsz || attach.flags)
+   return -EINVAL;
+
+   mutex_lock(&device->dev_set->lock);
+   ret = device->ops->pasid_attach_ioas(device, attach.pasid, 
&attach.pt_id);
+   mutex_unlock(&device->dev_set->lock);
+
+   return ret;
+}
+
+int vfio_df_ioctl_pasid_detach_pt(struct vfio_device_file *df,
+ struct vfio_device_pasid_detach_iommufd_pt 
__user *arg)
+{
+   struct vfio_device *device = df->device;
+   struct vfio_device_pasid_detach_iommufd_pt detach;
+   unsigned long minsz;
+
+   minsz = offsetofend(struct vfio_device_pasid_detach_iommufd_pt, flags);
+
+   if (copy_from_user(&detach, arg, minsz))
+   return -EFAULT;
+
+   if (detach.argsz < minsz || detach.flags)
+   return -EINVAL;
+
+   mutex_lock(&device->dev_set->lock);
+   device->ops->pasid_detach_ioas(device, detach.pasid);
+   mutex_unlock(&device->dev_set->lock);
+
+   return 0;
+}
+
 static char *vfio_device_devnode(const struct device *dev, umode_t *mode)
 {
return kasprintf(GFP_KERNEL, "vfio/devices/%s", dev_name(dev));
diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
index 307e3f29b527..d228cdb6b345 100644
--- a/drivers/vfio/vfio.h
+++ b/drivers/vfio/vfio.h
@@ -353,6 +353,10 @@ int vfio_df_ioctl_attach_pt(struct vfio_device_file *df,
struct vfio_device_attach_iommufd_pt __user *arg);
 int vfio_df_ioctl_detach_pt(struct vfio_device_file *df,
struct vfio_device_detach_iommufd_pt __user *arg);
+int vfio_df_ioctl_pasid_attach_pt(struct vfio_device_file *df,
+ struct vfio_device_pasid_attach_iommufd_pt 
__user *arg);
+int vfio_df_ioctl_pasid_detach_pt(struct vfio_device_file *df,
+ struct vfio_device_pasid_detach_iommufd_pt 
__user *arg);
 
 #if IS_ENABLED(CONFIG_VFIO_DEVICE_CDEV)
 void vfio_init_device_cdev(struct vfio_device *device);
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 8d4995ada74a..ff50c239873d 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -1240,6 +1240,14 @@ static long vfio_device_fops_unl_ioctl(struct file 
*filep,
case VFIO_DEVICE_DETACH_IOMMUFD_PT:
ret = vfio_df_ioctl_detach_pt(df, uptr);
goto out;
+
+   case VFIO_DEVICE_PASID_ATTACH_IOMMUFD_PT:
+   ret = vfio_df_ioctl_pasid_attach_pt(df, uptr);
+   goto out;
+
+   case VFIO_DEVICE_PASID_DETACH_IOMMUFD_PT:
+   ret = vfio_df_ioctl_pasid_detach_pt(df, uptr);
+   goto out;
}
}
 
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 94b3badefde3..495193629029 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -977,6 +977,61 @@ struct vfio_device_detach_iommufd_pt {
 
 #define VFIO_DEVICE_DETACH_IOMMUFD_PT  _IO(VFIO_TYPE, VFIO_BASE + 20)
 
+/*
+ * VFIO_DEVICE_PASID_ATTACH_IOMMUFD_PT - _IOW(VFIO_TYPE, VFIO_BASE + 21,
+ *   struct 
vfio_device_pasid_attach_iommufd_pt)
+ * @argsz: User filled size of this data.
+ * @flags: Must be 0.
+ * @pasid: The pasid to be attached.
+ * @pt_id: Input the target id which can represent an ioas or a hwpt
+ * allocated via the iommufd subsystem.
+ * Output the input ioas id or the attached hwpt id, which could
+ * be the specified hwpt itself or a hwpt automatically created
+ * for the specified ioas by the kernel during the attachment.
+ *
+ * Associate a pasid (of a cdev device) with an address space within the
+ * bound iommufd. Undo by VFIO_DEVICE_PASID_DETACH_IOMMUFD_PT or dev
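
For context, a rough sketch of issuing the attach ioctl described above from
userspace (assumes the device fd is already bound to an iommufd and pt_id
names an IOAS or hwpt allocated there; field layout per the uapi comment,
error handling trimmed):

#include <sys/ioctl.h>
#include <linux/vfio.h>

static int pasid_attach(int device_fd, __u32 pasid, __u32 pt_id)
{
	struct vfio_device_pasid_attach_iommufd_pt attach = {
		.argsz = sizeof(attach),
		.flags = 0,
		.pasid = pasid,
		.pt_id = pt_id,	/* an IOAS or hwpt id from the iommufd */
	};

	return ioctl(device_fd, VFIO_DEVICE_PASID_ATTACH_IOMMUFD_PT, &attach);
}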

[PATCH 1/3] vfio-iommufd: Support pasid [at|de]tach for physical VFIO devices

2023-11-26 Thread Yi Liu
From: Kevin Tian 

This adds pasid_[at|de]tach_ioas ops for attaching a hwpt to a pasid of a
device, and the helpers for it. For now, only vfio-pci supports pasid
attach/detach.

Signed-off-by: Kevin Tian 
Signed-off-by: Yi Liu 
---
 drivers/vfio/iommufd.c  | 48 +
 drivers/vfio/pci/vfio_pci.c |  2 ++
 include/linux/vfio.h| 11 +
 3 files changed, 61 insertions(+)

diff --git a/drivers/vfio/iommufd.c b/drivers/vfio/iommufd.c
index 82eba6966fa5..43a702b9b4d3 100644
--- a/drivers/vfio/iommufd.c
+++ b/drivers/vfio/iommufd.c
@@ -119,6 +119,7 @@ int vfio_iommufd_physical_bind(struct vfio_device *vdev,
if (IS_ERR(idev))
return PTR_ERR(idev);
vdev->iommufd_device = idev;
+   xa_init(&vdev->pasid_pts);
return 0;
 }
 EXPORT_SYMBOL_GPL(vfio_iommufd_physical_bind);
@@ -127,6 +128,17 @@ void vfio_iommufd_physical_unbind(struct vfio_device *vdev)
 {
lockdep_assert_held(&vdev->dev_set->lock);
 
+   if (!xa_empty(&vdev->pasid_pts)) {
+   void *entry;
+   unsigned long index;
+
+   xa_for_each(&vdev->pasid_pts, index, entry) {
+   xa_erase(&vdev->pasid_pts, index);
+   iommufd_device_pasid_detach(vdev->iommufd_device, 
index);
+   }
+   xa_destroy(&vdev->pasid_pts);
+   }
+
if (vdev->iommufd_attached) {
iommufd_device_detach(vdev->iommufd_device);
vdev->iommufd_attached = false;
@@ -168,6 +180,42 @@ void vfio_iommufd_physical_detach_ioas(struct vfio_device 
*vdev)
 }
 EXPORT_SYMBOL_GPL(vfio_iommufd_physical_detach_ioas);
 
+int vfio_iommufd_physical_pasid_attach_ioas(struct vfio_device *vdev,
+   u32 pasid, u32 *pt_id)
+{
+   void *entry;
+   int rc;
+
+   lockdep_assert_held(&vdev->dev_set->lock);
+
+   if (WARN_ON(!vdev->iommufd_device))
+   return -EINVAL;
+
+   entry = xa_load(&vdev->pasid_pts, pasid);
+   if (xa_is_value(entry))
+   rc = iommufd_device_pasid_replace(vdev->iommufd_device, pasid, 
pt_id);
+   else
+   rc = iommufd_device_pasid_attach(vdev->iommufd_device, pasid, 
pt_id);
+   if (rc)
+   return rc;
+   xa_store(&vdev->pasid_pts, pasid, xa_mk_value(*pt_id), GFP_KERNEL);
+   return 0;
+}
+EXPORT_SYMBOL_GPL(vfio_iommufd_physical_pasid_attach_ioas);
+
+void vfio_iommufd_physical_pasid_detach_ioas(struct vfio_device *vdev, u32 
pasid)
+{
+   lockdep_assert_held(&vdev->dev_set->lock);
+
+   if (WARN_ON(!vdev->iommufd_device) ||
+   !xa_is_value(xa_load(&vdev->pasid_pts, pasid)))
+   return;
+
+   iommufd_device_pasid_detach(vdev->iommufd_device, pasid);
+   xa_erase(&vdev->pasid_pts, pasid);
+}
+EXPORT_SYMBOL_GPL(vfio_iommufd_physical_pasid_detach_ioas);
+
 /*
  * The emulated standard ops mean that vfio_device is going to use the
  * "mdev path" and will call vfio_pin_pages()/vfio_dma_rw(). Drivers using this
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index cb5b7f865d58..e0198851ffd2 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -142,6 +142,8 @@ static const struct vfio_device_ops vfio_pci_ops = {
.unbind_iommufd = vfio_iommufd_physical_unbind,
.attach_ioas= vfio_iommufd_physical_attach_ioas,
.detach_ioas= vfio_iommufd_physical_detach_ioas,
+   .pasid_attach_ioas  = vfio_iommufd_physical_pasid_attach_ioas,
+   .pasid_detach_ioas  = vfio_iommufd_physical_pasid_detach_ioas,
 };
 
 static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 454e9295970c..7b06d1bc7cb3 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -66,6 +66,7 @@ struct vfio_device {
void (*put_kvm)(struct kvm *kvm);
 #if IS_ENABLED(CONFIG_IOMMUFD)
struct iommufd_device *iommufd_device;
+   struct xarray pasid_pts;
u8 iommufd_attached:1;
 #endif
u8 cdev_opened:1;
@@ -83,6 +84,8 @@ struct vfio_device {
  *  bound iommufd. Undo in unbind_iommufd if @detach_ioas is not
  *  called.
  * @detach_ioas: Opposite of attach_ioas
+ * @pasid_attach_ioas: The pasid variation of attach_ioas
+ * @pasid_detach_ioas: Opposite of pasid_attach_ioas
  * @open_device: Called when the first file descriptor is opened for this 
device
  * @close_device: Opposite of open_device
  * @read: Perform read(2) on device file descriptor
@@ -107,6 +110,8 @@ struct vfio_device_ops {
void(*unbind_iommufd)(struct vfio_device *vdev);
int (*attach_ioas)(struct vfio_device *vdev, u32 *pt_id);
void(*detach_ioas)(struct vfio_device *vdev);
+   int (*pasid_attach_ioas)(struct vfio_device *vdev, u32 pasid, u32 
*pt_id);
+   void(*pasid_detach_ioas)(struct vfio_device *v

[PATCH 0/3] vfio-pci support pasid attach/detach

2023-11-26 Thread Yi Liu
This adds the pasid attach/detach uAPIs for userspace to attach/detach
a PASID of a device to/from a given ioas/hwpt. Only the vfio-pci driver is
enabled in this series. After this series, PASID-capable devices bound to
vfio-pci can report the PASID capability to userspace and the VM, enabling
PASID usages such as Shared Virtual Addressing (SVA).

This series first adds the helpers for pasid attach in the vfio core, then
adds the device cdev ioctls for pasid attach/detach, and finally exposes the
device PASID capability to userspace. It depends on the iommufd pasid
attach/detach series [1].

Complete code can be found at [2]; it was tested with a draft QEMU branch [3].

[1] 
https://lore.kernel.org/linux-iommu/20231127063428.127436-1-yi.l@intel.com/
[2] https://github.com/yiliu1765/iommufd/tree/iommufd_pasid
[3] 
https://github.com/yiliu1765/qemu/tree/zhenzhong/wip/iommufd_nesting_rfcv1%2Bpasid

Change log:

v1:
 - Report PASID capability via VFIO_DEVICE_FEATURE (Alex)

rfc: 
https://lore.kernel.org/linux-iommu/20230926093121.18676-1-yi.l@intel.com/

Regards,
Yi Liu

Kevin Tian (1):
  vfio-iommufd: Support pasid [at|de]tach for physical VFIO devices

Yi Liu (2):
  vfio: Add VFIO_DEVICE_PASID_[AT|DE]TACH_IOMMUFD_PT
  vfio: Report PASID capability via VFIO_DEVICE_FEATURE ioctl

 drivers/vfio/device_cdev.c   | 45 +
 drivers/vfio/iommufd.c   | 48 ++
 drivers/vfio/pci/vfio_pci.c  |  2 +
 drivers/vfio/pci/vfio_pci_core.c | 47 ++
 drivers/vfio/vfio.h  |  4 ++
 drivers/vfio/vfio_main.c |  8 
 include/linux/vfio.h | 11 ++
 include/uapi/linux/vfio.h| 68 
 8 files changed, 233 insertions(+)

-- 
2.34.1
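
As a rough orientation for how the pieces of this series are meant to be
driven from userspace, a hedged sketch of the overall cdev flow (fd and
error handling trimmed; the first two steps use existing iommufd/vfio cdev
uAPIs, the last two use the ioctls added here; the detach struct layout is
assumed to be argsz/flags/pasid):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>
#include <linux/vfio.h>

static void pasid_flow_sketch(const char *vfio_cdev_path, __u32 pasid)
{
	int iommufd = open("/dev/iommu", O_RDWR);
	int devfd = open(vfio_cdev_path, O_RDWR);

	/* 1. Bind the device cdev to the iommufd (existing uAPI). */
	struct vfio_device_bind_iommufd bind = {
		.argsz = sizeof(bind), .iommufd = iommufd,
	};
	ioctl(devfd, VFIO_DEVICE_BIND_IOMMUFD, &bind);

	/*
	 * 2. Allocate an IOAS to back the PASID's address space (existing
	 *    uAPI). A RID-level VFIO_DEVICE_ATTACH_IOMMUFD_PT attach is
	 *    elided here.
	 */
	struct iommu_ioas_alloc alloc = { .size = sizeof(alloc) };
	ioctl(iommufd, IOMMU_IOAS_ALLOC, &alloc);

	/* 3. Probe the PASID capability via VFIO_DEVICE_FEATURE (patch 3/3),
	 *    then attach the pasid to the IOAS (patch 2/3). */
	struct vfio_device_pasid_attach_iommufd_pt attach = {
		.argsz = sizeof(attach), .pasid = pasid,
		.pt_id = alloc.out_ioas_id,
	};
	ioctl(devfd, VFIO_DEVICE_PASID_ATTACH_IOMMUFD_PT, &attach);

	/* 4. Detach when done. */
	struct vfio_device_pasid_detach_iommufd_pt detach = {
		.argsz = sizeof(detach), .pasid = pasid,
	};
	ioctl(devfd, VFIO_DEVICE_PASID_DETACH_IOMMUFD_PT, &detach);
}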




[PATCH 8/8] iommu/vt-d: Add set_dev_pasid callback for nested domain

2023-11-26 Thread Yi Liu
From: Lu Baolu 

This allows the upper layers to set a nested-type domain for a PASID of a
device if the PASID feature is supported by the IOMMU hardware.

The set_dev_pasid callback already exists for non-nested domains, so this
only needs to add it for nested domains.

Signed-off-by: Lu Baolu 
Signed-off-by: Yi Liu 
---
 drivers/iommu/intel/nested.c | 47 
 1 file changed, 47 insertions(+)

diff --git a/drivers/iommu/intel/nested.c b/drivers/iommu/intel/nested.c
index 44ad48db7ea0..f6f687750104 100644
--- a/drivers/iommu/intel/nested.c
+++ b/drivers/iommu/intel/nested.c
@@ -68,6 +68,52 @@ static int intel_nested_attach_dev(struct iommu_domain 
*domain,
return 0;
 }
 
+static int intel_nested_set_dev_pasid(struct iommu_domain *domain,
+ struct device *dev, ioasid_t pasid)
+{
+   struct device_domain_info *info = dev_iommu_priv_get(dev);
+   struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+   struct intel_iommu *iommu = info->iommu;
+   struct dev_pasid_info *dev_pasid;
+   unsigned long flags;
+   int ret = 0;
+
+   if (!pasid_supported(iommu))
+   return -EOPNOTSUPP;
+
+   if (iommu->agaw < dmar_domain->s2_domain->agaw)
+   return -EINVAL;
+
+   ret = prepare_domain_attach_device(&dmar_domain->s2_domain->domain, 
dev);
+   if (ret)
+   return ret;
+
+   dev_pasid = kzalloc(sizeof(*dev_pasid), GFP_KERNEL);
+   if (!dev_pasid)
+   return -ENOMEM;
+
+   ret = domain_attach_iommu(dmar_domain, iommu);
+   if (ret)
+   goto err_free;
+
+   ret = intel_pasid_setup_nested(iommu, dev, pasid, dmar_domain);
+   if (ret)
+   goto err_detach_iommu;
+
+   dev_pasid->dev = dev;
+   dev_pasid->pasid = pasid;
+   spin_lock_irqsave(&dmar_domain->lock, flags);
+   list_add(&dev_pasid->link_domain, &dmar_domain->dev_pasids);
+   spin_unlock_irqrestore(&dmar_domain->lock, flags);
+
+   return 0;
+err_detach_iommu:
+   domain_detach_iommu(dmar_domain, iommu);
+err_free:
+   kfree(dev_pasid);
+   return ret;
+}
+
 static void intel_nested_domain_free(struct iommu_domain *domain)
 {
kfree(to_dmar_domain(domain));
@@ -128,6 +174,7 @@ static int intel_nested_cache_invalidate_user(struct 
iommu_domain *domain,
 
 static const struct iommu_domain_ops intel_nested_domain_ops = {
.attach_dev = intel_nested_attach_dev,
+   .set_dev_pasid  = intel_nested_set_dev_pasid,
.free   = intel_nested_domain_free,
.cache_invalidate_user  = intel_nested_cache_invalidate_user,
 };
-- 
2.34.1
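
For orientation, a hedged sketch of how this callback gets exercised: the
generic pasid attach entry point in the iommu core routes to the domain's
set_dev_pasid op, so for an Intel nested domain it lands in the function
added above (names other than iommu_attach_device_pasid() are illustrative):

static int example_attach_nested_pasid(struct iommu_domain *nested_domain,
				       struct device *dev, ioasid_t pasid)
{
	/*
	 * iommu_attach_device_pasid() ends up calling
	 * nested_domain->ops->set_dev_pasid(), i.e.
	 * intel_nested_set_dev_pasid() for an Intel nested domain.
	 */
	return iommu_attach_device_pasid(nested_domain, dev, pasid);
}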




[PATCH 7/8] iommufd/selftest: Add coverage for iommufd pasid attach/detach

2023-11-26 Thread Yi Liu
This tests iommufd pasid attach/replace/detach.

Signed-off-by: Yi Liu 
---
 tools/testing/selftests/iommu/iommufd.c   | 172 ++
 .../selftests/iommu/iommufd_fail_nth.c|  28 ++-
 tools/testing/selftests/iommu/iommufd_utils.h |  78 
 3 files changed, 274 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/iommu/iommufd.c 
b/tools/testing/selftests/iommu/iommufd.c
index 2781d5bc6309..4dfd2d33c0a0 100644
--- a/tools/testing/selftests/iommu/iommufd.c
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -2221,4 +2221,176 @@ TEST_F(vfio_compat_mock_domain, huge_map)
}
 }
 
+FIXTURE(iommufd_device_pasid)
+{
+   int fd;
+   uint32_t ioas_id;
+   uint32_t hwpt_id;
+   uint32_t stdev_id;
+   uint32_t device_id;
+};
+
+FIXTURE_SETUP(iommufd_device_pasid)
+{
+   self->fd = open("/dev/iommu", O_RDWR);
+   ASSERT_NE(-1, self->fd);
+   test_ioctl_ioas_alloc(&self->ioas_id);
+
+   test_cmd_mock_domain(self->ioas_id, &self->stdev_id,
+&self->hwpt_id, &self->device_id);
+}
+
+FIXTURE_TEARDOWN(iommufd_device_pasid)
+{
+   teardown_iommufd(self->fd, _metadata);
+}
+
+TEST_F(iommufd_device_pasid, pasid_attach)
+{
+   if (self->device_id) {
+   struct iommu_hwpt_selftest data = {
+   .iotlb =  IOMMU_TEST_IOTLB_DEFAULT,
+   };
+   uint32_t nested_hwpt_id[2] = {};
+   uint32_t parent_hwpt_id = 0;
+   uint32_t pasid = 100;
+   bool result;
+
+   /* Allocate two nested hwpts sharing one common parent hwpt */
+   test_cmd_hwpt_alloc(self->device_id, self->ioas_id,
+   IOMMU_HWPT_ALLOC_NEST_PARENT,
+   &parent_hwpt_id);
+
+   test_cmd_hwpt_alloc_nested(self->device_id, parent_hwpt_id, 0,
+  &nested_hwpt_id[0],
+  IOMMU_HWPT_DATA_SELFTEST,
+  &data, sizeof(data));
+   test_cmd_hwpt_alloc_nested(self->device_id, parent_hwpt_id, 0,
+  &nested_hwpt_id[1],
+  IOMMU_HWPT_DATA_SELFTEST,
+  &data, sizeof(data));
+
+   /*
+* Attach ioas to pasid 100, should succeed, domain should
+* be valid.
+*/
+   test_cmd_pasid_attach(pasid, self->ioas_id);
+   ASSERT_EQ(0,
+ test_cmd_pasid_check_domain(self->fd, self->stdev_id,
+ pasid, self->hwpt_id,
+ &result));
+   EXPECT_EQ(1, result);
+
+   /*
+* Try attach pasid 100 with self->ioas_id, should succeed
+* as it is the same with existing hwpt.
+*/
+   test_cmd_pasid_attach(pasid, self->ioas_id);
+
+   /*
+* Try attach pasid 100 with another hwpt, should FAIL
+* as attach does not allow overwrite, use REPLACE instead.
+*/
+   test_err_cmd_pasid_attach(EINVAL, pasid, nested_hwpt_id[0]);
+
+   /*
+* Detach hwpt from pasid 100, and check if the pasid 100
+* has null domain. Should be done before the next attach.
+*/
+   test_cmd_pasid_detach(pasid);
+   ASSERT_EQ(0,
+ test_cmd_pasid_check_domain(self->fd, self->stdev_id,
+ pasid, 0, &result));
+   EXPECT_EQ(1, result);
+
+   /*
+* Attach nested hwpt to pasid 100, should succeed, domain
+* should be valid.
+*/
+   test_cmd_pasid_attach(pasid, nested_hwpt_id[0]);
+   ASSERT_EQ(0,
+ test_cmd_pasid_check_domain(self->fd, self->stdev_id,
+ pasid, nested_hwpt_id[0],
+ &result));
+   EXPECT_EQ(1, result);
+
+   /*
+* Detach hwpt from pasid 100, and check if the pasid 100
+* has null domain
+*/
+   test_cmd_pasid_detach(pasid);
+   ASSERT_EQ(0,
+ test_cmd_pasid_check_domain(self->fd, self->stdev_id,
+ pasid, 0, &result));
+   EXPECT_EQ(1, result);
+
+   /* Replace tests */
+   pasid = 200;
+
+   /*
+* Replace pasid 200 without attaching it first, should
+* fail with -EINVAL.
+*/
+   test_err

[PATCH 6/8] iommufd/selftest: Add test ops to test pasid attach/detach

2023-11-26 Thread Yi Liu
This adds four test ops for pasid attach/replace/detach testing. There are
ops to attach/detach a pasid, and also an op to check the attached domain
of a pasid.

Signed-off-by: Yi Liu 
---
 drivers/iommu/iommufd/iommufd_test.h |  24 ++
 drivers/iommu/iommufd/selftest.c | 116 +++
 2 files changed, 140 insertions(+)

diff --git a/drivers/iommu/iommufd/iommufd_test.h 
b/drivers/iommu/iommufd/iommufd_test.h
index 35d6207a96dd..67c2a8508b92 100644
--- a/drivers/iommu/iommufd/iommufd_test.h
+++ b/drivers/iommu/iommufd/iommufd_test.h
@@ -22,6 +22,10 @@ enum {
IOMMU_TEST_OP_MOCK_DOMAIN_FLAGS,
IOMMU_TEST_OP_DIRTY,
IOMMU_TEST_OP_MD_CHECK_IOTLB,
+   IOMMU_TEST_OP_PASID_ATTACH,
+   IOMMU_TEST_OP_PASID_REPLACE,
+   IOMMU_TEST_OP_PASID_DETACH,
+   IOMMU_TEST_OP_PASID_CHECK_DOMAIN,
 };
 
 enum {
@@ -126,6 +130,26 @@ struct iommu_test_cmd {
__u32 id;
__u32 iotlb;
} check_iotlb;
+   struct {
+   __u32 pasid;
+   __u32 pt_id;
+   /* @id is stdev_id for IOMMU_TEST_OP_PASID_ATTACH */
+   } pasid_attach;
+   struct {
+   __u32 pasid;
+   __u32 pt_id;
+   /* @id is stdev_id for IOMMU_TEST_OP_PASID_ATTACH */
+   } pasid_replace;
+   struct {
+   __u32 pasid;
+   /* @id is stdev_id for IOMMU_TEST_OP_PASID_DETACH */
+   } pasid_detach;
+   struct {
+   __u32 pasid;
+   __u32 hwpt_id;
+   __u64 out_result_ptr;
+   /* @id is stdev_id for IOMMU_TEST_OP_HWPT_GET_DOMAIN */
+   } pasid_check;
};
__u32 last;
 };
diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index 9b9fd3f50264..a3139ad534a1 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -1340,6 +1340,114 @@ static int iommufd_test_dirty(struct iommufd_ucmd 
*ucmd, unsigned int mockpt_id,
return rc;
 }
 
+static int iommufd_test_pasid_attach(struct iommufd_ucmd *ucmd,
+struct iommu_test_cmd *cmd)
+{
+   struct selftest_obj *sobj;
+   int rc;
+
+   sobj = iommufd_test_get_self_test_device(ucmd->ictx, cmd->id);
+   if (IS_ERR(sobj))
+   return PTR_ERR(sobj);
+
+   rc = iommufd_device_pasid_attach(sobj->idev.idev,
+cmd->pasid_attach.pasid,
+&cmd->pasid_attach.pt_id);
+   iommufd_put_object(&sobj->obj);
+   return rc;
+}
+
+static int iommufd_test_pasid_replace(struct iommufd_ucmd *ucmd,
+ struct iommu_test_cmd *cmd)
+{
+   struct selftest_obj *sobj;
+   int rc;
+
+   sobj = iommufd_test_get_self_test_device(ucmd->ictx, cmd->id);
+   if (IS_ERR(sobj))
+   return PTR_ERR(sobj);
+
+   rc = iommufd_device_pasid_replace(sobj->idev.idev,
+ cmd->pasid_attach.pasid,
+ &cmd->pasid_attach.pt_id);
+   iommufd_put_object(&sobj->obj);
+   return rc;
+}
+
+static int iommufd_test_pasid_detach(struct iommufd_ucmd *ucmd,
+struct iommu_test_cmd *cmd)
+{
+   struct selftest_obj *sobj;
+
+   sobj = iommufd_test_get_self_test_device(ucmd->ictx, cmd->id);
+   if (IS_ERR(sobj))
+   return PTR_ERR(sobj);
+
+   iommufd_device_pasid_detach(sobj->idev.idev,
+   cmd->pasid_detach.pasid);
+   iommufd_put_object(&sobj->obj);
+   return 0;
+}
+
+static inline struct iommufd_hw_pagetable *
+iommufd_get_hwpt(struct iommufd_ucmd *ucmd, u32 id)
+{
+   struct iommufd_object *pt_obj;
+
+   pt_obj = iommufd_get_object(ucmd->ictx, id, IOMMUFD_OBJ_ANY);
+   if (IS_ERR(pt_obj))
+   return ERR_CAST(pt_obj);
+
+   if (pt_obj->type != IOMMUFD_OBJ_HWPT_NESTED &&
+   pt_obj->type != IOMMUFD_OBJ_HWPT_PAGING) {
+   iommufd_put_object(pt_obj);
+   return ERR_PTR(-EINVAL);
+   }
+
+   return container_of(pt_obj, struct iommufd_hw_pagetable, obj);
+}
+
+static int iommufd_test_pasid_check_domain(struct iommufd_ucmd *ucmd,
+  struct iommu_test_cmd *cmd)
+{
+   struct iommu_domain *attached_domain, *expect_domain = NULL;
+   struct iommufd_hw_pagetable *hwpt = NULL;
+   struct selftest_obj *sobj;
+   struct mock_dev *mdev;
+   bool result;
+   int rc = 0;
+
+   sobj = iommufd_test_get_self_test_device(ucmd->ictx, cmd->id);
+   if (IS_ERR(sobj))
+   return PTR_ERR(sobj);
+
+   mdev = sobj->idev.mock_dev;
+
+   attached_domain 

[PATCH 5/8] iommufd/selftest: Add a helper to get test device

2023-11-26 Thread Yi Liu
The selftest device (sobj->type == TYPE_IDEV) needs to be fetched in
multiple places, so add a helper for it.

Signed-off-by: Yi Liu 
---
 drivers/iommu/iommufd/selftest.c | 32 +---
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index 221a206bf38f..9b9fd3f50264 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -720,29 +720,39 @@ static int iommufd_test_mock_domain(struct iommufd_ucmd 
*ucmd,
return rc;
 }
 
-/* Replace the mock domain with a manually allocated hw_pagetable */
-static int iommufd_test_mock_domain_replace(struct iommufd_ucmd *ucmd,
-   unsigned int device_id, u32 pt_id,
-   struct iommu_test_cmd *cmd)
+static struct selftest_obj *
+iommufd_test_get_self_test_device(struct iommufd_ctx *ictx, u32 id)
 {
struct iommufd_object *dev_obj;
struct selftest_obj *sobj;
-   int rc;
 
/*
 * Prefer to use the OBJ_SELFTEST because the destroy_rwsem will ensure
 * it doesn't race with detach, which is not allowed.
 */
-   dev_obj =
-   iommufd_get_object(ucmd->ictx, device_id, IOMMUFD_OBJ_SELFTEST);
+   dev_obj = iommufd_get_object(ictx, id, IOMMUFD_OBJ_SELFTEST);
if (IS_ERR(dev_obj))
-   return PTR_ERR(dev_obj);
+   return (struct selftest_obj *)dev_obj;
 
sobj = container_of(dev_obj, struct selftest_obj, obj);
if (sobj->type != TYPE_IDEV) {
-   rc = -EINVAL;
-   goto out_dev_obj;
+   iommufd_put_object(dev_obj);
+   return ERR_PTR(-EINVAL);
}
+   return sobj;
+}
+
+/* Replace the mock domain with a manually allocated hw_pagetable */
+static int iommufd_test_mock_domain_replace(struct iommufd_ucmd *ucmd,
+   unsigned int device_id, u32 pt_id,
+   struct iommu_test_cmd *cmd)
+{
+   struct selftest_obj *sobj;
+   int rc;
+
+   sobj = iommufd_test_get_self_test_device(ucmd->ictx, device_id);
+   if (IS_ERR(sobj))
+   return PTR_ERR(sobj);
 
rc = iommufd_device_replace(sobj->idev.idev, &pt_id);
if (rc)
@@ -752,7 +762,7 @@ static int iommufd_test_mock_domain_replace(struct 
iommufd_ucmd *ucmd,
rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
 
 out_dev_obj:
-   iommufd_put_object(dev_obj);
+   iommufd_put_object(&sobj->obj);
return rc;
 }
 
-- 
2.34.1




[PATCH 3/8] iommufd: Support attach/replace hwpt per pasid

2023-11-26 Thread Yi Liu
From: Kevin Tian 

This introduces three APIs for device drivers to manage pasid attach/
replace/detach.

int iommufd_device_pasid_attach(struct iommufd_device *idev,
u32 pasid, u32 *pt_id);
int iommufd_device_pasid_replace(struct iommufd_device *idev,
 u32 pasid, u32 *pt_id);
void iommufd_device_pasid_detach(struct iommufd_device *idev,
 u32 pasid);

pasid operations have different implications compared to device
operations:

 - No connection to iommufd_group, since pasid is a device capability
   and can be enabled only in a singleton group;

 - no reserved regions per pasid, otherwise the SVA architecture would
   already be broken (the CPU address space doesn't account for device
   reserved regions);

 - accordingly, no sw_msi trick;

 - immediate_attach is not supported, with the expectation that the
   arm-smmu driver will have removed that requirement before supporting
   this pasid operation. This avoids an unnecessary change in
   iommufd_hw_pagetable_alloc() to carry the pasid from device.c.

With the above differences, this puts all pasid-related logic into a new
pasid.c file.

Cache coherency enforcement is still applied to pasid operations, since
it concerns memory accesses after the page table walk (regardless of
whether the walk is per RID or per PASID).

Since the attach is per PASID, this introduces a pasid_hwpts xarray to
track the per-pasid attach data.

Signed-off-by: Kevin Tian 
Signed-off-by: Yi Liu 
---
 drivers/iommu/iommufd/Makefile  |   1 +
 drivers/iommu/iommufd/device.c  |  17 ++-
 drivers/iommu/iommufd/iommufd_private.h |  15 +++
 drivers/iommu/iommufd/pasid.c   | 138 
 include/linux/iommufd.h |   6 ++
 5 files changed, 176 insertions(+), 1 deletion(-)
 create mode 100644 drivers/iommu/iommufd/pasid.c

diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
index 34b446146961..4b4d516b025c 100644
--- a/drivers/iommu/iommufd/Makefile
+++ b/drivers/iommu/iommufd/Makefile
@@ -6,6 +6,7 @@ iommufd-y := \
ioas.o \
main.o \
pages.o \
+   pasid.o \
vfio_compat.o
 
 iommufd-$(CONFIG_IOMMUFD_TEST) += selftest.o
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index 0992d9d46af9..a7574d4d5ffa 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -136,6 +136,7 @@ void iommufd_device_destroy(struct iommufd_object *obj)
struct iommufd_device *idev =
container_of(obj, struct iommufd_device, obj);
 
+   WARN_ON(!xa_empty(&idev->pasid_hwpts));
iommu_device_release_dma_owner(idev->dev);
iommufd_put_group(idev->igroup);
if (!iommufd_selftest_is_mock_dev(idev->dev))
@@ -216,6 +217,8 @@ struct iommufd_device *iommufd_device_bind(struct 
iommufd_ctx *ictx,
/* igroup refcount moves into iommufd_device */
idev->igroup = igroup;
 
+   xa_init(&idev->pasid_hwpts);
+
/*
 * If the caller fails after this success it must call
 * iommufd_unbind_device() which is safe since we hold this refcount.
@@ -534,7 +537,17 @@ iommufd_device_do_replace(struct iommufd_device *idev,
 static struct iommufd_hw_pagetable *do_attach(struct iommufd_device *idev,
struct iommufd_hw_pagetable *hwpt, struct attach_data *data)
 {
-   return data->attach_fn(idev, hwpt);
+   if (data->pasid == IOMMU_PASID_INVALID) {
+   BUG_ON((data->attach_fn != iommufd_device_do_attach) &&
+  (data->attach_fn != iommufd_device_do_replace));
+   return data->attach_fn(idev, hwpt);
+   } else {
+   BUG_ON((data->pasid_attach_fn !=
+   iommufd_device_pasid_do_attach) &&
+  (data->pasid_attach_fn !=
+   iommufd_device_pasid_do_replace));
+   return data->pasid_attach_fn(idev, data->pasid, hwpt);
+   }
 }
 
 /*
@@ -684,6 +697,7 @@ int iommufd_device_attach(struct iommufd_device *idev, u32 
*pt_id)
int rc;
struct attach_data data = {
.attach_fn = &iommufd_device_do_attach,
+   .pasid = IOMMU_PASID_INVALID,
};
 
rc = iommufd_device_change_pt(idev, pt_id, &data);
@@ -718,6 +732,7 @@ int iommufd_device_replace(struct iommufd_device *idev, u32 
*pt_id)
 {
struct attach_data data = {
.attach_fn = &iommufd_device_do_replace,
+   .pasid = IOMMU_PASID_INVALID,
};
 
return iommufd_device_change_pt(idev, pt_id, &data);
diff --git a/drivers/iommu/iommufd/iommufd_private.h 
b/drivers/iommu/iommufd/iommufd_private.h
index 24fee2c37ce8..d37b7d0bfffe 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -349,6 +349,7 @@ struct iommufd_device {
struct list_head group_item;
/* always the physical device */
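
For orientation, a hedged sketch of how a consuming kernel driver might
drive the three APIs above (essentially what the vfio-iommufd glue in the
companion vfio series does; 'idev' is the iommufd_device obtained at bind
time, and the caller tracks whether the pasid already has an attachment;
function names here are illustrative):

static int example_pasid_set_pt(struct iommufd_device *idev, u32 pasid,
				u32 *pt_id, bool already_attached)
{
	/* Attach for the first hwpt, replace for any subsequent change. */
	if (already_attached)
		return iommufd_device_pasid_replace(idev, pasid, pt_id);
	return iommufd_device_pasid_attach(idev, pasid, pt_id);
}

static void example_pasid_clear_pt(struct iommufd_device *idev, u32 pasid)
{
	iommufd_device_pasid_detach(idev, pasid);
}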

[PATCH 4/8] iommufd/selftest: Add set_dev_pasid and remove_dev_pasid in mock iommu

2023-11-26 Thread Yi Liu
The two callbacks are needed to make the pasid attach/detach path complete
for the mock device. A nop is enough for set_dev_pasid; a domain type check
in remove_dev_pasid is also helpful.

Signed-off-by: Yi Liu 
---
 drivers/iommu/iommufd/selftest.c | 28 
 1 file changed, 28 insertions(+)

diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index f2d4599f701b..221a206bf38f 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -440,6 +440,31 @@ static struct iommu_device *mock_probe_device(struct 
device *dev)
return &mock_iommu_device;
 }
 
+static void mock_iommu_remove_dev_pasid(struct device *dev, ioasid_t pasid)
+{
+   struct iommu_domain *domain;
+
+   /* Domain type specific cleanup: */
+   domain = iommu_get_domain_for_dev_pasid(dev, pasid, 0);
+   if (domain) {
+   switch (domain->type) {
+   case IOMMU_DOMAIN_NESTED:
+   case IOMMU_DOMAIN_UNMANAGED:
+   break;
+   default:
+   /* should never reach here */
+   WARN_ON(1);
+   break;
+   }
+   }
+}
+
+static int mock_domain_set_dev_pasid_nop(struct iommu_domain *domain,
+struct device *dev, ioasid_t pasid)
+{
+   return 0;
+}
+
 static const struct iommu_ops mock_ops = {
/*
 * IOMMU_DOMAIN_BLOCKED cannot be returned from def_domain_type()
@@ -455,6 +480,7 @@ static const struct iommu_ops mock_ops = {
.capable = mock_domain_capable,
.device_group = generic_device_group,
.probe_device = mock_probe_device,
+   .remove_dev_pasid = mock_iommu_remove_dev_pasid,
.default_domain_ops =
&(struct iommu_domain_ops){
.free = mock_domain_free,
@@ -462,6 +488,7 @@ static const struct iommu_ops mock_ops = {
.map_pages = mock_domain_map_pages,
.unmap_pages = mock_domain_unmap_pages,
.iova_to_phys = mock_domain_iova_to_phys,
+   .set_dev_pasid = mock_domain_set_dev_pasid_nop,
},
 };
 
@@ -519,6 +546,7 @@ static struct iommu_domain_ops domain_nested_ops = {
.free = mock_domain_free_nested,
.attach_dev = mock_domain_nop_attach,
.cache_invalidate_user = mock_domain_cache_invalidate_user,
+   .set_dev_pasid = mock_domain_set_dev_pasid_nop,
 };
 
 static inline struct iommufd_hw_pagetable *
-- 
2.34.1




[PATCH 2/8] iommufd: replace attach_fn with a structure

2023-11-26 Thread Yi Liu
Most of the core logic before the actual device attach/replace operation
can be shared with pasid attach/replace. Create a new structure so that
more information (e.g. pasid) can later be added along with the attach_fn.

Signed-off-by: Kevin Tian 
Signed-off-by: Yi Liu 
---
 drivers/iommu/iommufd/device.c  | 35 -
 drivers/iommu/iommufd/iommufd_private.h |  8 ++
 2 files changed, 30 insertions(+), 13 deletions(-)

diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index 59d3a07300d9..0992d9d46af9 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -531,8 +531,11 @@ iommufd_device_do_replace(struct iommufd_device *idev,
return ERR_PTR(rc);
 }
 
-typedef struct iommufd_hw_pagetable *(*attach_fn)(
-   struct iommufd_device *idev, struct iommufd_hw_pagetable *hwpt);
+static struct iommufd_hw_pagetable *do_attach(struct iommufd_device *idev,
+   struct iommufd_hw_pagetable *hwpt, struct attach_data *data)
+{
+   return data->attach_fn(idev, hwpt);
+}
 
 /*
  * When automatically managing the domains we search for a compatible domain in
@@ -542,7 +545,7 @@ typedef struct iommufd_hw_pagetable *(*attach_fn)(
 static struct iommufd_hw_pagetable *
 iommufd_device_auto_get_domain(struct iommufd_device *idev,
   struct iommufd_ioas *ioas, u32 *pt_id,
-  attach_fn do_attach)
+  struct attach_data *data)
 {
/*
 * iommufd_hw_pagetable_attach() is called by
@@ -551,7 +554,7 @@ iommufd_device_auto_get_domain(struct iommufd_device *idev,
 * to use the immediate_attach path as it supports drivers that can't
 * directly allocate a domain.
 */
-   bool immediate_attach = do_attach == iommufd_device_do_attach;
+   bool immediate_attach = data->attach_fn == iommufd_device_do_attach;
struct iommufd_hw_pagetable *destroy_hwpt;
struct iommufd_hwpt_paging *hwpt_paging;
struct iommufd_hw_pagetable *hwpt;
@@ -569,7 +572,7 @@ iommufd_device_auto_get_domain(struct iommufd_device *idev,
hwpt = &hwpt_paging->common;
if (!iommufd_lock_obj(&hwpt->obj))
continue;
-   destroy_hwpt = (*do_attach)(idev, hwpt);
+   destroy_hwpt = do_attach(idev, hwpt, data);
if (IS_ERR(destroy_hwpt)) {
iommufd_put_object(&hwpt->obj);
/*
@@ -596,7 +599,7 @@ iommufd_device_auto_get_domain(struct iommufd_device *idev,
hwpt = &hwpt_paging->common;
 
if (!immediate_attach) {
-   destroy_hwpt = (*do_attach)(idev, hwpt);
+   destroy_hwpt = do_attach(idev, hwpt, data);
if (IS_ERR(destroy_hwpt))
goto out_abort;
} else {
@@ -617,8 +620,8 @@ iommufd_device_auto_get_domain(struct iommufd_device *idev,
return destroy_hwpt;
 }
 
-static int iommufd_device_change_pt(struct iommufd_device *idev, u32 *pt_id,
-   attach_fn do_attach)
+int iommufd_device_change_pt(struct iommufd_device *idev, u32 *pt_id,
+struct attach_data *data)
 {
struct iommufd_hw_pagetable *destroy_hwpt;
struct iommufd_object *pt_obj;
@@ -633,7 +636,7 @@ static int iommufd_device_change_pt(struct iommufd_device 
*idev, u32 *pt_id,
struct iommufd_hw_pagetable *hwpt =
container_of(pt_obj, struct iommufd_hw_pagetable, obj);
 
-   destroy_hwpt = (*do_attach)(idev, hwpt);
+   destroy_hwpt = do_attach(idev, hwpt, data);
if (IS_ERR(destroy_hwpt))
goto out_put_pt_obj;
break;
@@ -643,7 +646,7 @@ static int iommufd_device_change_pt(struct iommufd_device 
*idev, u32 *pt_id,
container_of(pt_obj, struct iommufd_ioas, obj);
 
destroy_hwpt = iommufd_device_auto_get_domain(idev, ioas, pt_id,
- do_attach);
+ data);
if (IS_ERR(destroy_hwpt))
goto out_put_pt_obj;
break;
@@ -679,8 +682,11 @@ static int iommufd_device_change_pt(struct iommufd_device 
*idev, u32 *pt_id,
 int iommufd_device_attach(struct iommufd_device *idev, u32 *pt_id)
 {
int rc;
+   struct attach_data data = {
+   .attach_fn = &iommufd_device_do_attach,
+   };
 
-   rc = iommufd_device_change_pt(idev, pt_id, &iommufd_device_do_attach);
+   rc = iommufd_device_change_pt(idev, pt_id, &data);
if (rc)
return rc;
 
@@ -710,8 +716,11 @@ EXPORT_SYMBOL_NS_GPL(iommufd_device_attach, IOMMUFD);
  */
 int iommufd_device_replace(struct iommufd_device *idev, u32 *pt_id)
 {
-   return iommufd_dev
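
For reference, a hedged sketch of the structure this patch introduces (the
field set is inferred from the call sites above and the pasid additions in
the follow-up patch; the actual definition lives in iommufd_private.h):

struct attach_data {
	struct iommufd_hw_pagetable *(*attach_fn)(
		struct iommufd_device *idev, struct iommufd_hw_pagetable *hwpt);
	/* Added by the pasid patch: */
	struct iommufd_hw_pagetable *(*pasid_attach_fn)(
		struct iommufd_device *idev, u32 pasid,
		struct iommufd_hw_pagetable *hwpt);
	u32 pasid;	/* IOMMU_PASID_INVALID for the non-pasid paths */
};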

[PATCH 1/8] iommu: Introduce a replace API for device pasid

2023-11-26 Thread Yi Liu
From: Lu Baolu 

Provide a high-level API to allow replacing one domain with another for a
specific pasid of a device. This is similar to
iommu_group_replace_domain() and is also expected to be used only by
IOMMUFD.

Signed-off-by: Lu Baolu 
Signed-off-by: Yi Liu 
---
 drivers/iommu/iommu-priv.h |  2 +
 drivers/iommu/iommu.c  | 82 +++---
 2 files changed, 70 insertions(+), 14 deletions(-)

diff --git a/drivers/iommu/iommu-priv.h b/drivers/iommu/iommu-priv.h
index 2024a2313348..5c32637f6325 100644
--- a/drivers/iommu/iommu-priv.h
+++ b/drivers/iommu/iommu-priv.h
@@ -19,6 +19,8 @@ static inline const struct iommu_ops *dev_iommu_ops(struct 
device *dev)
 
 int iommu_group_replace_domain(struct iommu_group *group,
   struct iommu_domain *new_domain);
+int iommu_replace_device_pasid(struct iommu_domain *domain,
+  struct device *dev, ioasid_t pasid);
 
 int iommu_device_register_bus(struct iommu_device *iommu,
  const struct iommu_ops *ops, struct bus_type *bus,
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 9d573e971aff..ec213ebd5ecc 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -3430,6 +3430,27 @@ static void __iommu_remove_group_pasid(struct 
iommu_group *group,
}
 }
 
+static int __iommu_group_attach_pasid(struct iommu_domain *domain,
+ struct iommu_group *group, ioasid_t pasid)
+{
+   void *curr;
+   int ret;
+
+   lockdep_assert_held(&group->mutex);
+
+   curr = xa_cmpxchg(&group->pasid_array, pasid, NULL, domain, GFP_KERNEL);
+   if (curr)
+   return xa_err(curr) ? : -EBUSY;
+
+   ret = __iommu_set_group_pasid(domain, group, pasid);
+   if (ret) {
+   __iommu_remove_group_pasid(group, pasid);
+   xa_erase(&group->pasid_array, pasid);
+   }
+
+   return ret;
+}
+
 /*
  * iommu_attach_device_pasid() - Attach a domain to pasid of device
  * @domain: the iommu domain.
@@ -3453,19 +3474,9 @@ int iommu_attach_device_pasid(struct iommu_domain 
*domain,
return -ENODEV;
 
mutex_lock(&group->mutex);
-   curr = xa_cmpxchg(&group->pasid_array, pasid, NULL, domain, GFP_KERNEL);
-   if (curr) {
-   ret = xa_err(curr) ? : -EBUSY;
-   goto out_unlock;
-   }
-
-   ret = __iommu_set_group_pasid(domain, group, pasid);
-   if (ret) {
-   __iommu_remove_group_pasid(group, pasid);
-   xa_erase(&group->pasid_array, pasid);
-   }
-out_unlock:
+   ret = __iommu_group_attach_pasid(domain, group, pasid);
mutex_unlock(&group->mutex);
+
return ret;
 }
 EXPORT_SYMBOL_GPL(iommu_attach_device_pasid);
@@ -3479,8 +3490,8 @@ EXPORT_SYMBOL_GPL(iommu_attach_device_pasid);
  * The @domain must have been attached to @pasid of the @dev with
  * iommu_attach_device_pasid().
  */
-void iommu_detach_device_pasid(struct iommu_domain *domain, struct device *dev,
-  ioasid_t pasid)
+void iommu_detach_device_pasid(struct iommu_domain *domain,
+  struct device *dev, ioasid_t pasid)
 {
/* Caller must be a probed driver on dev */
struct iommu_group *group = dev->iommu_group;
@@ -3492,6 +3503,49 @@ void iommu_detach_device_pasid(struct iommu_domain 
*domain, struct device *dev,
 }
 EXPORT_SYMBOL_GPL(iommu_detach_device_pasid);
 
+/**
+ * iommu_replace_device_pasid - replace the domain that a pasid is attached to
+ * @domain: new IOMMU domain to replace with
+ * @dev: the physical device
+ * @pasid: pasid that will be attached to the new domain
+ *
+ * This API allows the pasid to switch domains. Returns 0 on success, or an
+ * error. The pasid rolls back to the old domain on failure. The caller
+ * could call iommu_detach_device_pasid() before freeing the old domain in
+ * order to avoid a use-after-free.
+ */
+int iommu_replace_device_pasid(struct iommu_domain *domain,
+  struct device *dev, ioasid_t pasid)
+{
+   struct iommu_group *group = dev->iommu_group;
+   struct iommu_domain *old_domain;
+   int ret;
+
+   if (!domain)
+   return -EINVAL;
+
+   if (!group)
+   return -ENODEV;
+
+   mutex_lock(&group->mutex);
+   __iommu_remove_group_pasid(group, pasid);
+   old_domain = xa_erase(&group->pasid_array, pasid);
+   ret = __iommu_group_attach_pasid(domain, group, pasid);
+   if (ret)
+   goto err_rollback;
+   mutex_unlock(&group->mutex);
+
+   return 0;
+
+err_rollback:
+   if (old_domain)
+   __iommu_group_attach_pasid(old_domain, group, pasid);
+   mutex_unlock(&group->mutex);
+
+   return ret;
+}
+EXPORT_SYMBOL_NS_GPL(iommu_replace_device_pasid, IOMMUFD_INTERNAL);
+
 /*
  * iommu_get_domain_for_dev_pasid() - Retrieve domain for @pasid of @dev
  *

[PATCH 0/8] iommufd support pasid attach/replace

2023-11-26 Thread Yi Liu
PASID (Process Address Space ID) is a PCIe extension to tag the DMA
transactions from a physical device, and most modern IOMMU hardware
supports PASID-granular address translation. So a PASID-capable device
can be attached to multiple hwpts (a.k.a. domains), with each attachment
tagged by a pasid.

This series first adds a missing iommu API to replace the domain for a
pasid, then adds iommufd APIs for device drivers to attach/replace/detach
a pasid to/from a hwpt per userspace's request, and adds selftests to
validate the iommufd APIs.

pasid attach/replace is mandatory on Intel VT-d, given that the PASID table
resides in physical address space and hence must be managed by the kernel,
both for supporting vSVA and the coming SIOV. It is optional on ARM/AMD,
which allow configuring the PASID/CD table either in host physical address
space or nested on top of a GPA address space. This series only adds VT-d
support as the minimal requirement.

Complete code can be found in below link:

https://github.com/yiliu1765/iommufd/tree/iommufd_pasid

Change log:

v1:
 - Implement iommu_replace_device_pasid() to fall back to the original domain
   if the replacement fails (Kevin)
 - Add a check in do_attach() to pick the corresponding attach_fn per the
   pasid value.

rfc: 
https://lore.kernel.org/linux-iommu/20230926092651.17041-1-yi.l@intel.com/

Regards,
Yi Liu

Kevin Tian (1):
  iommufd: Support attach/replace hwpt per pasid

Lu Baolu (2):
  iommu: Introduce a replace API for device pasid
  iommu/vt-d: Add set_dev_pasid callback for nested domain

Yi Liu (5):
  iommufd: replace attach_fn with a structure
  iommufd/selftest: Add set_dev_pasid and remove_dev_pasid in mock iommu
  iommufd/selftest: Add a helper to get test device
  iommufd/selftest: Add test ops to test pasid attach/detach
  iommufd/selftest: Add coverage for iommufd pasid attach/detach

 drivers/iommu/intel/nested.c  |  47 +
 drivers/iommu/iommu-priv.h|   2 +
 drivers/iommu/iommu.c |  82 ++--
 drivers/iommu/iommufd/Makefile|   1 +
 drivers/iommu/iommufd/device.c|  50 +++--
 drivers/iommu/iommufd/iommufd_private.h   |  23 +++
 drivers/iommu/iommufd/iommufd_test.h  |  24 +++
 drivers/iommu/iommufd/pasid.c | 138 ++
 drivers/iommu/iommufd/selftest.c  | 176 --
 include/linux/iommufd.h   |   6 +
 tools/testing/selftests/iommu/iommufd.c   | 172 +
 .../selftests/iommu/iommufd_fail_nth.c|  28 ++-
 tools/testing/selftests/iommu/iommufd_utils.h |  78 
 13 files changed, 785 insertions(+), 42 deletions(-)
 create mode 100644 drivers/iommu/iommufd/pasid.c

-- 
2.34.1




Re: [PATCH ipsec-next v1 6/7] bpf: selftests: test_tunnel: Disable CO-RE relocations

2023-11-26 Thread Yonghong Song



On 11/27/23 12:44 AM, Yonghong Song wrote:


On 11/26/23 8:52 PM, Eduard Zingerman wrote:

On Sun, 2023-11-26 at 18:04 -0600, Daniel Xu wrote:
[...]

Tbh I'm not sure. This test passes with preserve_static_offset
because it suppresses preserve_access_index. In general clang
translates bitfield access to a set of IR statements like:

   C:
 struct foo {
   unsigned _;
   unsigned a:1;
   ...
 };
 ... foo->a ...

   IR:
 %a = getelementptr inbounds %struct.foo, ptr %0, i32 0, i32 1
 %bf.load = load i8, ptr %a, align 4
 %bf.clear = and i8 %bf.load, 1
 %bf.cast = zext i8 %bf.clear to i32

With preserve_static_offset the getelementptr+load are replaced by a
single statement which is preserved as-is till code generation,
thus load with align 4 is preserved.

On the other hand, I'm not sure that clang guarantees that loads or
stores used for bitfield access are always aligned according to
verifier expectations.

I think we should check if there are some clang knobs that prevent
generation of unaligned memory access. I'll take a look.

Is there a reason to prefer fixing this in the compiler? I'm not opposed to
it, but the downside of a compiler fix is that it takes years to propagate
and sprinkles ifdefs into the code.

Would it be possible to have an analogue of BPF_CORE_READ_BITFIELD()?

Well, the contraption below passes verification, tunnel selftest
appears to work. I might have messed up some shifts in the macro, 
though.


I didn't test it. But from high level it should work.



Still, if clang picks an unlucky BYTE_{OFFSET,SIZE} for a particular
field, the access might be unaligned.


clang should pick a sensible BYTE_SIZE/BYTE_OFFSET to meet
alignment requirement. This is also required for BPF_CORE_READ_BITFIELD.



---

diff --git a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c 
b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c

index 3065a716544d..41cd913ac7ff 100644
--- a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
+++ b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
@@ -9,6 +9,7 @@
  #include "vmlinux.h"
  #include 
  #include 
+#include 
  #include "bpf_kfuncs.h"
  #include "bpf_tracing_net.h"
  @@ -144,6 +145,38 @@ int ip6gretap_get_tunnel(struct __sk_buff *skb)
  return TC_ACT_OK;
  }
  +#define BPF_CORE_WRITE_BITFIELD(s, field, new_val) ({    \
+    void *p = (void *)s + __CORE_RELO(s, field, BYTE_OFFSET);    \
+    unsigned byte_size = __CORE_RELO(s, field, BYTE_SIZE);    \
+    unsigned lshift = __CORE_RELO(s, field, LSHIFT_U64); \
+    unsigned rshift = __CORE_RELO(s, field, RSHIFT_U64); \
+    unsigned bit_size = (rshift - lshift);    \
+    unsigned long long nval, val, hi, lo;    \
+    \
+    asm volatile("" : "=r"(p) : "0"(p));    \


Use asm volatile("" : "+r"(p)) ?


+    \
+    switch (byte_size) {    \
+    case 1: val = *(unsigned char *)p; break;    \
+    case 2: val = *(unsigned short *)p; break;    \
+    case 4: val = *(unsigned int *)p; break;    \
+    case 8: val = *(unsigned long long *)p; break;    \
+    }    \
+    hi = val >> (bit_size + rshift);    \
+    hi <<= bit_size + rshift;    \
+    lo = val << (bit_size + lshift);    \
+    lo >>= bit_size + lshift;    \
+    nval = new_val;    \
+    nval <<= lshift;    \
+    nval >>= rshift;    \
+    val = hi | nval | lo;    \
+    switch (byte_size) {    \
+    case 1: *(unsigned char *)p  = val; break;    \
+    case 2: *(unsigned short *)p = val; break;    \
+    case 4: *(unsigned int *)p   = val; break;    \
+    case 8: *(unsigned long long *)p = val; break;    \
+    }    \
+})


I think this should be put in the libbpf public header files, but I'm not
sure where. bpf_core_read.h, although it is a CORE write?

But on the other hand, this is a uapi struct bitfield write; strictly
speaking, a CORE write is really unnecessary here. It would be great if we
could relieve users from dealing with such unnecessary CORE writes. In that
sense, for this particular case, I would prefer rewriting the code using
byte-level stores...

or preserve_static_offset to clearly mean to undo bitfield CORE ...

[...]
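
For reference, usage of the proposed macro in the tunnel prog would look
roughly like the following, mirroring how BPF_CORE_READ_BITFIELD is used
(field names are assumed from the uapi erspan struct; illustrative only):

	struct erspan_metadata md = {};
	__u8 direction = 1, hwid = 7;

	/* instead of:  md.u.md2.dir = direction;  md.u.md2.hwid = hwid; */
	BPF_CORE_WRITE_BITFIELD(&md.u.md2, dir, direction);
	BPF_CORE_WRITE_BITFIELD(&md.u.md2, hwid, hwid);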




Re: [PATCH ipsec-next v1 6/7] bpf: selftests: test_tunnel: Disable CO-RE relocations

2023-11-26 Thread Yonghong Song



On 11/26/23 8:52 PM, Eduard Zingerman wrote:

On Sun, 2023-11-26 at 18:04 -0600, Daniel Xu wrote:
[...]

Tbh I'm not sure. This test passes with preserve_static_offset
because it suppresses preserve_access_index. In general clang
translates bitfield access to a set of IR statements like:

   C:
 struct foo {
   unsigned _;
   unsigned a:1;
   ...
 };
 ... foo->a ...

   IR:
 %a = getelementptr inbounds %struct.foo, ptr %0, i32 0, i32 1
 %bf.load = load i8, ptr %a, align 4
 %bf.clear = and i8 %bf.load, 1
 %bf.cast = zext i8 %bf.clear to i32

With preserve_static_offset the getelementptr+load are replaced by a
single statement which is preserved as-is till code generation,
thus load with align 4 is preserved.

On the other hand, I'm not sure that clang guarantees that loads or
stores used for bitfield access are always aligned according to
verifier expectations.

I think we should check if there are some clang knobs that prevent
generation of unaligned memory access. I'll take a look.

Is there a reason to prefer fixing this in the compiler? I'm not opposed to
it, but the downside of a compiler fix is that it takes years to propagate
and sprinkles ifdefs into the code.

Would it be possible to have an analogue of BPF_CORE_READ_BITFIELD()?

Well, the contraption below passes verification, tunnel selftest
appears to work. I might have messed up some shifts in the macro, though.


I didn't test it, but from a high level it should work.



Still, if clang picks an unlucky BYTE_{OFFSET,SIZE} for a particular
field, the access might be unaligned.


clang should pick a sensible BYTE_SIZE/BYTE_OFFSET to meet the
alignment requirement. This is also required for BPF_CORE_READ_BITFIELD.



---

diff --git a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c 
b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
index 3065a716544d..41cd913ac7ff 100644
--- a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
+++ b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
@@ -9,6 +9,7 @@
  #include "vmlinux.h"
  #include 
  #include 
+#include 
  #include "bpf_kfuncs.h"
  #include "bpf_tracing_net.h"
  
@@ -144,6 +145,38 @@ int ip6gretap_get_tunnel(struct __sk_buff *skb)

return TC_ACT_OK;
  }
  
+#define BPF_CORE_WRITE_BITFIELD(s, field, new_val) ({			\

+   void *p = (void *)s + __CORE_RELO(s, field, BYTE_OFFSET);   \
+   unsigned byte_size = __CORE_RELO(s, field, BYTE_SIZE);  \
+   unsigned lshift = __CORE_RELO(s, field, LSHIFT_U64);\
+   unsigned rshift = __CORE_RELO(s, field, RSHIFT_U64);\
+   unsigned bit_size = (rshift - lshift);  \
+   unsigned long long nval, val, hi, lo;   \
+   \
+   asm volatile("" : "=r"(p) : "0"(p));  \


Use asm volatile("" : "+r"(p)) ?


+   \
+   switch (byte_size) {\
+   case 1: val = *(unsigned char *)p; break;   \
+   case 2: val = *(unsigned short *)p; break;  \
+   case 4: val = *(unsigned int *)p; break;\
+   case 8: val = *(unsigned long long *)p; break;  \
+   }   \
+   hi = val >> (bit_size + rshift);  \
+   hi <<= bit_size + rshift; \
+   lo = val << (bit_size + lshift);  \
+   lo >>= bit_size + lshift; \
+   nval = new_val; \
+   nval <<= lshift;  \
+   nval >>= rshift;  \
+   val = hi | nval | lo;   \
+   switch (byte_size) {\
+   case 1: *(unsigned char *)p  = val; break;  \
+   case 2: *(unsigned short *)p = val; break;  \
+   case 4: *(unsigned int *)p   = val; break;  \
+   case 8: *(unsigned long long *)p = val; break;  \
+   }   \
+})


I think this should be put in a libbpf public header file, but I am not sure
where to put it. bpf_core_read.h, although it is a CORE write?

But on the other hand, this is a uapi struct bitfield write;
strictly speaking, a CORE write is really unnecessary here. It
would be great if we could relieve users from dealing with
such unnecessary CORE writes. In that sense, for this particular
case, I would prefer rewriting the code by using byte-level
stores...


+
  SEC("tc")
  int erspan_set_tunnel(struct __sk_buff *skb)
  {
@@ -173,9 +206,9 @@ in
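
For reference, a minimal sketch of the byte-level store idea suggested above
(the helper name, byte offset, shift and mask are made-up placeholders; the
real erspan_md2 bit layout is endian-dependent and is not reproduced here):

/* Clear-and-merge one sub-byte field with a single aligned 1-byte
 * load and a single aligned 1-byte store, instead of relying on a
 * compiler-generated bitfield access.
 */
static inline void set_bitfield_u8(void *base, unsigned int byte_off,
				   unsigned char mask, unsigned int shift,
				   unsigned char val)
{
	unsigned char *p = (unsigned char *)base + byte_off;
	unsigned char b = *p;			/* aligned 1-byte load  */

	b &= ~mask;				/* clear the field bits */
	b |= (unsigned char)(val << shift) & mask;	/* merge new value */
	*p = b;					/* aligned 1-byte store */
}

With the real layout known, md.u.md2.dir and friends could then be written
with three such calls instead of compiler-generated bitfield stores.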

Re: [PATCH ipsec-next v1 6/7] bpf: selftests: test_tunnel: Disable CO-RE relocations

2023-11-26 Thread Yonghong Song



On 11/26/23 3:14 PM, Eduard Zingerman wrote:

On Sat, 2023-11-25 at 20:22 -0800, Yonghong Song wrote:
[...]

--- a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
+++ b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
@@ -6,7 +6,10 @@
* modify it under the terms of version 2 of the GNU General Public
* License as published by the Free Software Foundation.
*/
-#define BPF_NO_PRESERVE_ACCESS_INDEX
+#if __has_attribute(preserve_static_offset)
+struct __attribute__((preserve_static_offset)) erspan_md2;
+struct __attribute__((preserve_static_offset)) erspan_metadata;
+#endif
   #include "vmlinux.h"

[...]

   int bpf_skb_get_fou_encap(struct __sk_buff *skb_ctx,
@@ -174,9 +177,13 @@ int erspan_set_tunnel(struct __sk_buff *skb)
  __u8 hwid = 7;
   
  md.version = 2;

+#if __has_attribute(preserve_static_offset)
  md.u.md2.dir = direction;
  md.u.md2.hwid = hwid & 0xf;
  md.u.md2.hwid_upper = (hwid >> 4) & 0x3;
+#else
+   /* Change bit-field store to byte(s)-level stores. */
+#endif
   #endif
   
  ret = bpf_skb_set_tunnel_opt(skb, &md, sizeof(md));




Eduard, could you double check whether this is a valid use case
to solve this kind of issue with preserve_static_offset attribute?

Tbh I'm not sure. This test passes with preserve_static_offset
because it suppresses preserve_access_index. In general clang
translates bitfield access to a set of IR statements like:

   C:
 struct foo {
   unsigned _;
   unsigned a:1;
   ...
 };
 ... foo->a ...

   IR:
 %a = getelementptr inbounds %struct.foo, ptr %0, i32 0, i32 1
 %bf.load = load i8, ptr %a, align 4
 %bf.clear = and i8 %bf.load, 1
 %bf.cast = zext i8 %bf.clear to i32

With preserve_static_offset the getelementptr+load are replaced by a
single statement which is preserved as-is till code generation,
thus load with align 4 is preserved.

On the other hand, I'm not sure that clang guarantees that load or
stores used for bitfield access would be always aligned according to
verifier expectations.


I think it should be true. The frontend does alignment analysis based on
types (packed vs. unpacked) and assigns each load/store a proper
alignment (like 'align 4' in the above). 'align 4' truly means
the load itself is 4-byte aligned. Otherwise, it would be very confusing
for arches which do not support unaligned memory access (e.g. BPF).



I think we should check if there are some clang knobs that prevent
generation of unaligned memory access. I'll take a look.






Re: [PATCH net-next 00/38] Conver all net selftests to run in unique namespace

2023-11-26 Thread David Ahern
On 11/24/23 2:26 AM, Hangbin Liu wrote:
> As Guillaume pointed out, many selftests create namespaces with very common
> names (like "client" or "server") or even (partially) run directly in
> init_net.
> This makes these tests prone to failure if another namespace with the same
> name already exists. It also makes it impossible to run several instances
> of these tests in parallel.
> 
> This patch set converts all the net selftests to run in unique namespaces,
> so we can update the selftest framework to run all tests in their own
> namespaces in parallel. After the update, we only need to wait for the test
> which needs the longest time.
> 
> ]# per_test_logging=1 time ./run_kselftest.sh -n -c net
> TAP version 13
> # selftests: net: reuseport_bpf_numa
> not ok 3 selftests: net: reuseport_bpf_numa # exit=1
> # selftests: net: reuseport_bpf_cpu
> not ok 2 selftests: net: reuseport_bpf_cpu # exit=1
> # selftests: net: reuseport_dualstack
> not ok 4 selftests: net: reuseport_dualstack # exit=1
> # selftests: net: reuseaddr_conflict
> ok 5 selftests: net: reuseaddr_conflict
> 
> ...
> 
> # selftests: net: test_vxlan_mdb.sh
> ok 90 selftests: net: test_vxlan_mdb.sh
> # selftests: net: fib_nexthops.sh
> not ok 41 selftests: net: fib_nexthops.sh # exit=1
> # selftests: net: fcnal-test.sh
> not ok 36 selftests: net: fcnal-test.sh # exit=1
> 
> real	55m1.238s
> user	12m10.350s
> sys	22m17.432s
> 
> 

I have not looked at the details of each script change, but as long as
no test is broken by the change I am fine with the intent.

Acked-by: David Ahern 





Re: [PATCH ipsec-next v1 6/7] bpf: selftests: test_tunnel: Disable CO-RE relocations

2023-11-26 Thread Eduard Zingerman
On Sun, 2023-11-26 at 18:04 -0600, Daniel Xu wrote:
[...]
> > Tbh I'm not sure. This test passes with preserve_static_offset
> > because it suppresses preserve_access_index. In general clang
> > translates bitfield access to a set of IR statements like:
> > 
> >   C:
> > struct foo {
> >   unsigned _;
> >   unsigned a:1;
> >   ...
> > };
> > ... foo->a ...
> > 
> >   IR:
> > %a = getelementptr inbounds %struct.foo, ptr %0, i32 0, i32 1
> > %bf.load = load i8, ptr %a, align 4
> > %bf.clear = and i8 %bf.load, 1
> > %bf.cast = zext i8 %bf.clear to i32
> > 
> > With preserve_static_offset the getelementptr+load are replaced by a
> > single statement which is preserved as-is till code generation,
> > thus load with align 4 is preserved.
> > 
> > On the other hand, I'm not sure that clang guarantees that load or
> > stores used for bitfield access would be always aligned according to
> > verifier expectations.
> > 
> > I think we should check if there are some clang knobs that prevent
> > generation of unaligned memory access. I'll take a look.
> 
> Is there a reason to prefer fixing in compiler? I'm not opposed to it,
> but the downside to compiler fix is it takes years to propagate and
> sprinkles ifdefs into the code.
>
> Would it be possible to have an analogue of BPF_CORE_READ_BITFIELD()?

Well, the contraption below passes verification, tunnel selftest
appears to work. I might have messed up some shifts in the macro, though.

Still, if clang picks an unlucky BYTE_{OFFSET,SIZE} for a particular
field, the access might be unaligned.

---

diff --git a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c 
b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
index 3065a716544d..41cd913ac7ff 100644
--- a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
+++ b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
@@ -9,6 +9,7 @@
 #include "vmlinux.h"
 #include 
 #include 
+#include 
 #include "bpf_kfuncs.h"
 #include "bpf_tracing_net.h"
 
@@ -144,6 +145,38 @@ int ip6gretap_get_tunnel(struct __sk_buff *skb)
return TC_ACT_OK;
 }
 
+#define BPF_CORE_WRITE_BITFIELD(s, field, new_val) ({  \
+   void *p = (void *)s + __CORE_RELO(s, field, BYTE_OFFSET);   \
+   unsigned byte_size = __CORE_RELO(s, field, BYTE_SIZE);  \
+   unsigned lshift = __CORE_RELO(s, field, LSHIFT_U64);\
+   unsigned rshift = __CORE_RELO(s, field, RSHIFT_U64);\
+   unsigned bit_size = (rshift - lshift);  \
+   unsigned long long nval, val, hi, lo;   \
+   \
+   asm volatile("" : "=r"(p) : "0"(p));\
+   \
+   switch (byte_size) {\
+   case 1: val = *(unsigned char *)p; break;   \
+   case 2: val = *(unsigned short *)p; break;  \
+   case 4: val = *(unsigned int *)p; break;\
+   case 8: val = *(unsigned long long *)p; break;  \
+   }   \
+   hi = val >> (bit_size + rshift);\
+   hi <<= bit_size + rshift;   \
+   lo = val << (bit_size + lshift);\
+   lo >>= bit_size + lshift;   \
+   nval = new_val; \
+   nval <<= lshift;\
+   nval >>= rshift;\
+   val = hi | nval | lo;   \
+   switch (byte_size) {\
+   case 1: *(unsigned char *)p  = val; break;  \
+   case 2: *(unsigned short *)p = val; break;  \
+   case 4: *(unsigned int *)p   = val; break;  \
+   case 8: *(unsigned long long *)p = val; break;  \
+   }   \
+})
+
 SEC("tc")
 int erspan_set_tunnel(struct __sk_buff *skb)
 {
@@ -173,9 +206,9 @@ int erspan_set_tunnel(struct __sk_buff *skb)
__u8 hwid = 7;
 
md.version = 2;
-   md.u.md2.dir = direction;
-   md.u.md2.hwid = hwid & 0xf;
-   md.u.md2.hwid_upper = (hwid >> 4) & 0x3;
+   BPF_CORE_WRITE_BITFIELD(&md.u.md2, dir, direction);
+   BPF_CORE_WRITE_BITFIELD(&md.u.md2, hwid, (hwid & 0xf));
+   BPF_CORE_WRITE_BITFIELD(&md.u.md2, hwid_upper, (hwid >> 4) & 0x3);
 #endif
 
ret = bpf_skb_set_tunnel_opt(skb, &md, sizeof(md));
@@ -214,8 +247,9 @@ int erspan_get_tunnel(struct __sk_buff *skb)
bpf_printk("\tindex %x\n", ind

Re: [PATCH ipsec-next v1 6/7] bpf: selftests: test_tunnel: Disable CO-RE relocations

2023-11-26 Thread Daniel Xu
Hi,

On Sun, Nov 26, 2023 at 10:14:21PM +0200, Eduard Zingerman wrote:
> On Sat, 2023-11-25 at 20:22 -0800, Yonghong Song wrote:
> [...]
> > --- a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
> > +++ b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
> > @@ -6,7 +6,10 @@
> >* modify it under the terms of version 2 of the GNU General Public
> >* License as published by the Free Software Foundation.
> >*/
> > -#define BPF_NO_PRESERVE_ACCESS_INDEX
> > +#if __has_attribute(preserve_static_offset)
> > +struct __attribute__((preserve_static_offset)) erspan_md2;
> > +struct __attribute__((preserve_static_offset)) erspan_metadata;
> > +#endif
> >   #include "vmlinux.h"
> [...]
> >   int bpf_skb_get_fou_encap(struct __sk_buff *skb_ctx,
> > @@ -174,9 +177,13 @@ int erspan_set_tunnel(struct __sk_buff *skb)
> >  __u8 hwid = 7;
> >   
> >  md.version = 2;
> > +#if __has_attribute(preserve_static_offset)
> >  md.u.md2.dir = direction;
> >  md.u.md2.hwid = hwid & 0xf;
> >  md.u.md2.hwid_upper = (hwid >> 4) & 0x3;
> > +#else
> > +   /* Change bit-field store to byte(s)-level stores. */
> > +#endif
> >   #endif
> >   
> >  ret = bpf_skb_set_tunnel_opt(skb, &md, sizeof(md));
> > 
> > 
> > 
> > Eduard, could you double check whether this is a valid use case
> > to solve this kind of issue with preserve_static_offset attribute?
> 
> Tbh I'm not sure. This test passes with preserve_static_offset
> because it suppresses preserve_access_index. In general clang
> translates bitfield access to a set of IR statements like:
> 
>   C:
> struct foo {
>   unsigned _;
>   unsigned a:1;
>   ...
> };
> ... foo->a ...
> 
>   IR:
> %a = getelementptr inbounds %struct.foo, ptr %0, i32 0, i32 1
> %bf.load = load i8, ptr %a, align 4
> %bf.clear = and i8 %bf.load, 1
> %bf.cast = zext i8 %bf.clear to i32
> 
> With preserve_static_offset the getelementptr+load are replaced by a
> single statement which is preserved as-is till code generation,
> thus load with align 4 is preserved.
> 
> On the other hand, I'm not sure that clang guarantees that load or
> stores used for bitfield access would be always aligned according to
> verifier expectations.
> 
> I think we should check if there are some clang knobs that prevent
> generation of unaligned memory access. I'll take a look.

Is there a reason to prefer fixing in compiler? I'm not opposed to it,
but the downside to compiler fix is it takes years to propagate and
sprinkles ifdefs into the code.

Would it be possible to have an analogue of BPF_CORE_READ_BITFIELD()?

Thanks,
Daniel



Re: [PATCH ipsec-next v1 6/7] bpf: selftests: test_tunnel: Disable CO-RE relocations

2023-11-26 Thread Eduard Zingerman
On Sat, 2023-11-25 at 20:22 -0800, Yonghong Song wrote:
[...]
> --- a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
> +++ b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
> @@ -6,7 +6,10 @@
>* modify it under the terms of version 2 of the GNU General Public
>* License as published by the Free Software Foundation.
>*/
> -#define BPF_NO_PRESERVE_ACCESS_INDEX
> +#if __has_attribute(preserve_static_offset)
> +struct __attribute__((preserve_static_offset)) erspan_md2;
> +struct __attribute__((preserve_static_offset)) erspan_metadata;
> +#endif
>   #include "vmlinux.h"
[...]
>   int bpf_skb_get_fou_encap(struct __sk_buff *skb_ctx,
> @@ -174,9 +177,13 @@ int erspan_set_tunnel(struct __sk_buff *skb)
>  __u8 hwid = 7;
>   
>  md.version = 2;
> +#if __has_attribute(preserve_static_offset)
>  md.u.md2.dir = direction;
>  md.u.md2.hwid = hwid & 0xf;
>  md.u.md2.hwid_upper = (hwid >> 4) & 0x3;
> +#else
> +   /* Change bit-field store to byte(s)-level stores. */
> +#endif
>   #endif
>   
>  ret = bpf_skb_set_tunnel_opt(skb, &md, sizeof(md));
> 
> 
> 
> Eduard, could you double check whether this is a valid use case
> to solve this kind of issue with preserve_static_offset attribute?

Tbh I'm not sure. This test passes with preserve_static_offset
because it suppresses preserve_access_index. In general clang
translates bitfield access to a set of IR statements like:

  C:
struct foo {
  unsigned _;
  unsigned a:1;
  ...
};
... foo->a ...

  IR:
%a = getelementptr inbounds %struct.foo, ptr %0, i32 0, i32 1
%bf.load = load i8, ptr %a, align 4
%bf.clear = and i8 %bf.load, 1
%bf.cast = zext i8 %bf.clear to i32

With preserve_static_offset the getelementptr+load are replaced by a
single statement which is preserved as-is till code generation,
thus load with align 4 is preserved.

On the other hand, I'm not sure that clang guarantees that load or
stores used for bitfield access would be always aligned according to
verifier expectations.

I think we should check if there are some clang knobs that prevent
generation of unaligned memory access. I'll take a look.



Re: [RFC PATCH 0/5] RISC-V: Add dynamic TSO support

2023-11-26 Thread Guo Ren
On Fri, Nov 24, 2023 at 12:54:30PM +0100, Peter Zijlstra wrote:
> On Fri, Nov 24, 2023 at 12:04:09PM +0100, Jonas Oberhauser wrote:
> 
> > > I think ARM64 approached this problem by adding the
> > > load-acquire/store-release instructions and for TSO based code,
> > > translate into those (eg. x86 -> arm64 transpilers).
> > 
> > 
> > Although those instructions have slightly stronger ordering constraints.
> > 
> > I have heard rumors that the apple chips also have a register that can be
> > set at runtime.
I can believe the rumor, smart design! Thanks for sharing.

> 
> Oh, I thought they made do with the load-acquire/store-release thingies.
> But to be fair, I haven't been paying *that* much attention to the apple
> stuff.
> 
> I did read about how they fudged some of the x86 flags thing.
> 
> > And there are some IBM machines that have a setting, but not sure how it is
> > controlled.
> 
> Cute, I'm assuming this is the Power series (s390 already being TSO)? I
> wasn't aware they had this.
> 
> > > IIRC Risc-V actually has such instructions as well, so *why* are you
> > > doing this?!?!
> > 
> > 
> > Unfortunately, at least last time I checked RISC-V still hadn't gotten such
> > instructions.
> > What they have is the *semantics* of the instructions, but no actual opcodes
> > to encode them.
> 
> Well, that sucks..
> 
> > I argued for them in the RISC-V memory group, but it was considered to be
> > outside the scope of that group.
> > 
> > Transpiling with sufficient DMB ISH to get the desired ordering is really
> > bad for performance.
> 
> Ha!, quite dreadful I would imagine.
> 
> > That is not to say that linux should support this. Perhaps linux should
> > pressure RISC-V into supporting implicit barriers instead.
> 
> I'm not sure I count for much in this regard, but yeah, that sounds like
> a plan :-)
> 



Re: [PATCH 2/3] tools/nolibc: add support for getrlimit/setrlimit

2023-11-26 Thread Thomas Weißschuh
On 2023-11-26 10:28:28+0100, Willy Tarreau wrote:
> Hi Thomas,
> 
> > +int test_rlimit(void)
> > +{
> > +   struct rlimit rlim = {
> > +   .rlim_cur = 1 << 20,
> > +   .rlim_max = 1 << 20,
> > +   };
> > +   int ret;
> > +
> > +   ret = setrlimit(RLIMIT_CORE, &rlim);
> > +   if (ret)
> > +   return -1;
> > +
> > +   rlim.rlim_cur = 0;
> > +   rlim.rlim_max = 0;
> > +
> > +   ret = getrlimit(RLIMIT_CORE, &rlim);
> > +   if (ret)
> > +   return -1;
> > +
> > +   if (rlim.rlim_cur != 1 << 20)
> > +   return -1;
> > +   if (rlim.rlim_max != 1 << 20)
> > +   return -1;
> 
> I think you should use two different values here for cur and max so
> that you can also detect stupid API bugs such as a union being used
> instead of a struct, or copy-pastes in the implementation etc. For
> example using 1<<20 and 1<<21 should do the trick.

Good point, I incorporated the suggestion.
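
For reference, a sketch of the revised check with distinct soft and hard
limits as suggested, written against the nolibc-test context quoted above
(the version actually pushed may differ in details):

int test_rlimit(void)
{
	struct rlimit rlim = {
		.rlim_cur = 1 << 20,	/* soft limit */
		.rlim_max = 1 << 21,	/* hard limit, deliberately different */
	};
	int ret;

	ret = setrlimit(RLIMIT_CORE, &rlim);
	if (ret)
		return -1;

	rlim.rlim_cur = 0;
	rlim.rlim_max = 0;

	ret = getrlimit(RLIMIT_CORE, &rlim);
	if (ret)
		return -1;

	/* A swapped or unioned cur/max would now be caught. */
	if (rlim.rlim_cur != 1 << 20)
		return -1;
	if (rlim.rlim_max != 1 << 21)
		return -1;

	return 0;
}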

> Otherwise Ack-by me for the whole series, of course.

Thanks!

FYI I retested and pushed the series.


Thomas



Re: [PATCH 2/3] tools/nolibc: add support for getrlimit/setrlimit

2023-11-26 Thread Willy Tarreau
Hi Thomas,

> +int test_rlimit(void)
> +{
> + struct rlimit rlim = {
> + .rlim_cur = 1 << 20,
> + .rlim_max = 1 << 20,
> + };
> + int ret;
> +
> + ret = setrlimit(RLIMIT_CORE, &rlim);
> + if (ret)
> + return -1;
> +
> + rlim.rlim_cur = 0;
> + rlim.rlim_max = 0;
> +
> + ret = getrlimit(RLIMIT_CORE, &rlim);
> + if (ret)
> + return -1;
> +
> + if (rlim.rlim_cur != 1 << 20)
> + return -1;
> + if (rlim.rlim_max != 1 << 20)
> + return -1;

I think you should use two different values here for cur and max so
that you can also detect stupid API bugs such as a union being used
instead of a struct, or copy-pastes in the implementation etc. For
example using 1<<20 and 1<<21 should do the trick.

Otherwise Ack-by me for the whole series, of course.

Thanks,
Willy