Re: [RFC PATCH v4 20/20] vdpa: Add custom IOTLB translations to SVQ

2021-10-12 Thread Jason Wang


在 2021/10/1 下午3:06, Eugenio Pérez 写道:

Use translations added in VhostIOVATree in SVQ.

Now every element also needs to store its previous address, so VirtQueue
can consume the elements properly. This adds a little overhead per VQ
element, since more memory must be allocated to stash them. As a possible
optimization, this allocation could be avoided when the descriptor is a
single buffer rather than a chain, but that is left undone.

TODO: the IOVA range should be queried beforehand, with logic added to
fail when the memory listener or SVQ tries to map a GPA outside of it.

Signed-off-by: Eugenio Pérez
---
  hw/virtio/vhost-shadow-virtqueue.h |   4 +-
  hw/virtio/vhost-shadow-virtqueue.c | 130 -
  hw/virtio/vhost-vdpa.c |  40 -
  hw/virtio/trace-events |   1 +
  4 files changed, 152 insertions(+), 23 deletions(-)



Think hard about the whole logic. This is safe since the qemu memory map
will fail if the guest submits an invalid IOVA.


Then I wonder if we can do something much simpler:

1) Use qemu VA as IOVA, but only map the VA that belongs to the guest
2) Then we don't need any IOVA tree here; what we need is to just map the
vring and use qemu VA without any translation
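
Concretely, that might look like the sketch below, where each guest-backed
region is mapped with its own qemu VA as the device IOVA (the
vhost_vdpa_dma_map() signature is the one in hw/virtio/vhost-vdpa.c; the
wrapper itself is illustrative):

    /*
     * Identity-map a guest-backed region so IOVA == qemu VA: the device
     * can then be handed qemu pointers directly and SVQ needs no IOVA
     * tree.
     */
    static int vhost_vdpa_map_va(struct vhost_vdpa *v, void *vaddr,
                                 hwaddr size, bool readonly)
    {
        hwaddr iova = (hwaddr)(uintptr_t)vaddr;

        return vhost_vdpa_dma_map(v, iova, size, vaddr, readonly);
    }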


Thanks


Re: [RFC PATCH v4 17/20] vhost: Use VRING_AVAIL_F_NO_INTERRUPT at device call on shadow virtqueue

2021-10-12 Thread Jason Wang


在 2021/10/1 下午3:06, Eugenio Pérez 写道:

Signed-off-by: Eugenio Pérez 



Commit log please.

Thanks



---
  hw/virtio/vhost-shadow-virtqueue.c | 24 +++-
  1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/hw/virtio/vhost-shadow-virtqueue.c 
b/hw/virtio/vhost-shadow-virtqueue.c
index 775f8d36a0..2fd0bab75d 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -60,6 +60,9 @@ typedef struct VhostShadowVirtqueue {
  
  /* Next head to consume from device */

  uint16_t used_idx;
+
+/* Cache for the exposed notification flag */
+bool notification;
  } VhostShadowVirtqueue;
  
  /* If the device is using some of these, SVQ cannot communicate */

@@ -105,6 +108,24 @@ bool vhost_svq_valid_device_features(uint64_t 
*dev_features)
  return r;
  }
  
+static void vhost_svq_set_notification(VhostShadowVirtqueue *svq, bool enable)

+{
+uint16_t notification_flag;
+
+if (svq->notification == enable) {
+return;
+}
+
+notification_flag = cpu_to_le16(VRING_AVAIL_F_NO_INTERRUPT);
+
+svq->notification = enable;
+if (enable) {
+svq->vring.avail->flags &= ~notification_flag;
+} else {
+svq->vring.avail->flags |= notification_flag;
+}
+}
+
  static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
  const struct iovec *iovec,
  size_t num, bool more_descs, bool write)
@@ -273,7 +294,7 @@ static void vhost_svq_handle_call_no_test(EventNotifier *n)
  do {
  unsigned i = 0;
  
-/* TODO: Use VRING_AVAIL_F_NO_INTERRUPT */

+vhost_svq_set_notification(svq, false);
  while (true) {
  g_autofree VirtQueueElement *elem = vhost_svq_get_buf(svq);
  if (!elem) {
@@ -286,6 +307,7 @@ static void vhost_svq_handle_call_no_test(EventNotifier *n)
  
  virtqueue_flush(vq, i);

  event_notifier_set(&svq->guest_call_notifier);
+vhost_svq_set_notification(svq, true);
  } while (vhost_svq_more_used(svq));
  }
  



Re: [RFC PATCH v4 16/20] vhost: Check for device VRING_USED_F_NO_NOTIFY at shadow virtqueue kick

2021-10-12 Thread Jason Wang


在 2021/10/1 下午3:05, Eugenio Pérez 写道:

Signed-off-by: Eugenio Pérez 
---
  hw/virtio/vhost-shadow-virtqueue.c | 11 ++-
  1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/hw/virtio/vhost-shadow-virtqueue.c 
b/hw/virtio/vhost-shadow-virtqueue.c
index df7e6fa3ec..775f8d36a0 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -173,6 +173,15 @@ static void vhost_svq_add(VhostShadowVirtqueue *svq, 
VirtQueueElement *elem)
  svq->ring_id_maps[qemu_head] = elem;
  }
  
+static void vhost_svq_kick(VhostShadowVirtqueue *svq)

+{
+/* Make sure we are reading updated device flag */



I guess this would be better:

    /* We need to expose available array entries before checking used
 * flags. */

(Borrowed from the kernel code.)

Thanks



+smp_mb();
+if (!(svq->vring.used->flags & VRING_USED_F_NO_NOTIFY)) {
+event_notifier_set(&svq->kick_notifier);
+}
+}
+
  /* Handle guest->device notifications */
  static void vhost_handle_guest_kick(EventNotifier *n)
  {
@@ -197,7 +206,7 @@ static void vhost_handle_guest_kick(EventNotifier *n)
  }
  
  vhost_svq_add(svq, elem);

-event_notifier_set(&svq->kick_notifier);
+vhost_svq_kick(svq);
  }
  
  virtio_queue_set_notification(svq->vq, true);
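
For reference, the pattern being borrowed is in Linux's
drivers/virtio/virtio_ring.c, virtqueue_kick_prepare_split(); from memory
it reads roughly like this:

    /* We need to expose available array entries before checking avail
     * event. */
    virtio_mb(vq->weak_barriers);

    ...

    needs_kick = !(vq->split.vring.used->flags &
                   cpu_to_virtio16(_vq->vdev, VRING_USED_F_NO_NOTIFY));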



Re: [RFC PATCH v4 15/20] vhost: Shadow virtqueue buffers forwarding

2021-10-12 Thread Jason Wang


在 2021/10/1 下午3:05, Eugenio Pérez 写道:

Initial version of the shadow virtqueue that actually forwards buffers.
There is no IOMMU support at the moment; that will be addressed in future
patches of this series. Since all vhost-vdpa devices use a forced IOMMU,
this means that SVQ is not usable at this point of the series on any
device.

For simplicity it only supports modern devices, which expect the vring
in little endian, with a split ring and no event idx or indirect
descriptors. Support for them will not be added in this series.

It reuses the VirtQueue code for the device part. The driver part is
based on Linux's virtio_ring driver, but with stripped functionality
and optimizations so it's easier to review. Later commits add simpler
ones.

SVQ uses VIRTIO_CONFIG_S_DEVICE_STOPPED to pause the device and
retrieve its status (the next available idx the device was going to
consume) race-free. It can later reset the device to replace the vring
addresses etc. When SVQ starts, qemu can resume consuming the guest's
driver ring from that state, without the guest noticing.

This status bit VIRTIO_CONFIG_S_DEVICE_STOPPED is currently being
discussed in VirtIO, and is implemented in qemu VirtIO-net devices in
previous commits.

Removal of the _S_DEVICE_STOPPED bit (in other words, resuming the
device) can be done in the future if a use case arises. At this moment we
can just rely on resetting the full device.
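
A sketch of that stop-and-query sequence, using the existing vhost-vdpa
ioctls (the exact ordering here is illustrative, not lifted from the
patch):

    uint8_t status;
    struct vhost_vring_state state = { .index = vq_index };

    /* Quiesce the device so the avail index cannot move under us */
    ioctl(device_fd, VHOST_VDPA_GET_STATUS, &status);
    status |= VIRTIO_CONFIG_S_DEVICE_STOPPED;
    ioctl(device_fd, VHOST_VDPA_SET_STATUS, &status);

    /* Now race-free: the next avail idx the device was going to consume */
    ioctl(device_fd, VHOST_GET_VRING_BASE, &state);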

Signed-off-by: Eugenio Pérez 
---
  qapi/net.json  |   2 +-
  hw/virtio/vhost-shadow-virtqueue.c | 237 -
  hw/virtio/vhost-vdpa.c | 109 -
  3 files changed, 337 insertions(+), 11 deletions(-)

diff --git a/qapi/net.json b/qapi/net.json
index fe546b0e7c..1f4a55f2c5 100644
--- a/qapi/net.json
+++ b/qapi/net.json
@@ -86,7 +86,7 @@
  #
  # @name: the device name of the VirtIO device
  #
-# @enable: true to use the alternate shadow VQ notifications
+# @enable: true to use the alternate shadow VQ buffer forwarding path
  #
  # Returns: Error if failure, or 'no error' for success.
  #
diff --git a/hw/virtio/vhost-shadow-virtqueue.c 
b/hw/virtio/vhost-shadow-virtqueue.c
index 34e159d4fd..df7e6fa3ec 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -10,6 +10,7 @@
  #include "qemu/osdep.h"
  #include "hw/virtio/vhost-shadow-virtqueue.h"
  #include "hw/virtio/vhost.h"
+#include "hw/virtio/virtio-access.h"
  
  #include "standard-headers/linux/vhost_types.h"
  
@@ -44,15 +45,135 @@ typedef struct VhostShadowVirtqueue {
  
  /* Virtio device */

  VirtIODevice *vdev;
+
+/* Map for returning guest's descriptors */
+VirtQueueElement **ring_id_maps;
+
+/* Next head to expose to device */
+uint16_t avail_idx_shadow;
+
+/* Next free descriptor */
+uint16_t free_head;
+
+/* Last seen used idx */
+uint16_t shadow_used_idx;
+
+/* Next head to consume from device */
+uint16_t used_idx;



Let's use "last_used_idx" as the kernel driver does.



  } VhostShadowVirtqueue;
  
  /* If the device is using some of these, SVQ cannot communicate */

  bool vhost_svq_valid_device_features(uint64_t *dev_features)
  {
-return true;
+uint64_t b;
+bool r = true;
+
+for (b = VIRTIO_TRANSPORT_F_START; b <= VIRTIO_TRANSPORT_F_END; ++b) {
+switch (b) {
+case VIRTIO_F_NOTIFY_ON_EMPTY:
+case VIRTIO_F_ANY_LAYOUT:
+/* SVQ is fine with this feature */
+continue;
+
+case VIRTIO_F_ACCESS_PLATFORM:
+/* SVQ needs this feature disabled. Can't continue */



The code can explain what it does; we need a comment to explain why.



+if (*dev_features & BIT_ULL(b)) {
+clear_bit(b, dev_features);
+r = false;
+}
+break;
+
+case VIRTIO_F_VERSION_1:
+/* SVQ needs this feature, so can't continue */



A comment to explain why SVQ needs this feature.



+if (!(*dev_features & BIT_ULL(b))) {
+set_bit(b, dev_features);
+r = false;
+}
+continue;
+
+default:
+/*
+ * SVQ must disable this feature, let's hope the device is fine
+ * without it.
+ */
+if (*dev_features & BIT_ULL(b)) {
+clear_bit(b, dev_features);
+}
+}
+}
+
+return r;
+}



Let's move this to patch 14.
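
For what it's worth, comments along these lines would answer both review
questions above (the wording is illustrative, not from the patch):

    case VIRTIO_F_ACCESS_PLATFORM:
        /* SVQ places its vring in qemu's VA space and maps it for the
         * device directly, so an IOMMU translating the device's accesses
         * would break SVQ's mappings: the feature must be off. */
        ...
    case VIRTIO_F_VERSION_1:
        /* SVQ only implements the little-endian split ring of VirtIO
         * 1.0+, so the device must offer VERSION_1. */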



+
+static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
+const struct iovec *iovec,
+size_t num, bool more_descs, bool write)
+{
+uint16_t i = svq->free_head, last = svq->free_head;
+unsigned n;
+uint16_t flags = write ? cpu_to_le16(VRING_DESC_F_WRITE) : 0;
+vring_desc_t *descs = svq->vring.desc;
+
+if (num == 0) {
+return;
+}
+
+for (n = 0; n < num; n++) {
+if (more_descs || (n + 1 < num)) {
+descs[i].fl
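
The message is cut off above; split-ring writers of this shape typically
continue along these lines (a sketch reusing the function's variable
names, not the patch's actual code):

            descs[i].flags = flags | cpu_to_le16(VRING_DESC_F_NEXT);
        } else {
            descs[i].flags = flags;
        }
        descs[i].addr = cpu_to_le64((hwaddr)(uintptr_t)iovec[n].iov_base);
        descs[i].len = cpu_to_le32(iovec[n].iov_len);

        last = i;
        i = cpu_to_le16(descs[i].next);   /* walk the free-descriptor list */
    }

    /* Remember the next free descriptor for the following add */
    svq->free_head = le16_to_cpu(descs[last].next);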

Re: [RFC PATCH v4 13/20] vdpa: Save host and guest features

2021-10-12 Thread Jason Wang


在 2021/10/1 下午3:05, Eugenio Pérez 写道:

Those are needed for SVQ: Host ones are needed to check if SVQ knows
how to talk with the device and for feature negotiation, and guest ones
to know if SVQ can talk with it.

Signed-off-by: Eugenio Pérez 
---
  include/hw/virtio/vhost-vdpa.h |  2 ++
  hw/virtio/vhost-vdpa.c | 31 ---
  2 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index fddac248b3..9044ae694b 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -26,6 +26,8 @@ typedef struct vhost_vdpa {
  int device_fd;
  uint32_t msg_type;
  MemoryListener listener;
+uint64_t host_features;
+uint64_t guest_features;



Any reason that we can't use the features stored in VirtioDevice?

Thanks



  bool shadow_vqs_enabled;
  GPtrArray *shadow_vqs;
  struct vhost_dev *dev;
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 6c5f4c98b8..a057e8277d 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -439,10 +439,19 @@ static int vhost_vdpa_set_mem_table(struct vhost_dev *dev,
  return 0;
  }
  
-static int vhost_vdpa_set_features(struct vhost_dev *dev,

-   uint64_t features)
+/**
+ * Internal set_features() that follows vhost/VirtIO protocol for that
+ */
+static int vhost_vdpa_backend_set_features(struct vhost_dev *dev,
+   uint64_t features)
  {
+struct vhost_vdpa *v = dev->opaque;
+
  int ret;
+if (v->host_features & BIT_ULL(VIRTIO_F_QUEUE_STATE)) {
+features |= BIT_ULL(VIRTIO_F_QUEUE_STATE);
+}
+
  trace_vhost_vdpa_set_features(dev, features);
  ret = vhost_vdpa_call(dev, VHOST_SET_FEATURES, &features);
  uint8_t status = 0;
@@ -455,6 +464,17 @@ static int vhost_vdpa_set_features(struct vhost_dev *dev,
  return !(status & VIRTIO_CONFIG_S_FEATURES_OK);
  }
  
+/**

+ * Exposed vhost set features
+ */
+static int vhost_vdpa_set_features(struct vhost_dev *dev,
+   uint64_t features)
+{
+struct vhost_vdpa *v = dev->opaque;
+v->guest_features = features;
+return vhost_vdpa_backend_set_features(dev, features);
+}
+
  static int vhost_vdpa_set_backend_cap(struct vhost_dev *dev)
  {
  uint64_t features;
@@ -673,12 +693,17 @@ static int vhost_vdpa_set_vring_call(struct vhost_dev 
*dev,
  }
  
  static int vhost_vdpa_get_features(struct vhost_dev *dev,

- uint64_t *features)
+   uint64_t *features)
  {
  int ret;
  
  ret = vhost_vdpa_call(dev, VHOST_GET_FEATURES, features);

  trace_vhost_vdpa_get_features(dev, *features);
+
+if (ret == 0) {
+struct vhost_vdpa *v = dev->opaque;
+v->host_features = *features;
+}
  return ret;
  }
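
For what it's worth, VirtIODevice already tracks both sides of the
negotiation, so a sketch of Jason's suggestion (field names are from
include/hw/virtio/virtio.h) could be as simple as:

    VirtIODevice *vdev = dev->vdev;

    uint64_t host_features  = vdev->host_features;   /* device-offered   */
    uint64_t guest_features = vdev->guest_features;  /* guest-negotiated */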
  



Re: [RFC PATCH v4 12/20] virtio: Add vhost_shadow_vq_get_vring_addr

2021-10-12 Thread Jason Wang


在 2021/10/1 下午3:05, Eugenio Pérez 写道:

It reports the shadow virtqueue address from qemu virtual address space



I think both the title and the commit log need more tweaks. Looking at
the code, what it actually does is introduce the vring into SVQ.





Signed-off-by: Eugenio Pérez 
---
  hw/virtio/vhost-shadow-virtqueue.h |  4 +++
  hw/virtio/vhost-shadow-virtqueue.c | 50 ++
  2 files changed, 54 insertions(+)

diff --git a/hw/virtio/vhost-shadow-virtqueue.h 
b/hw/virtio/vhost-shadow-virtqueue.h
index 237cfceb9c..2df3d117f5 100644
--- a/hw/virtio/vhost-shadow-virtqueue.h
+++ b/hw/virtio/vhost-shadow-virtqueue.h
@@ -16,6 +16,10 @@ typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
  
  EventNotifier *vhost_svq_get_svq_call_notifier(VhostShadowVirtqueue *svq);

  void vhost_svq_set_guest_call_notifier(VhostShadowVirtqueue *svq, int 
call_fd);
+void vhost_svq_get_vring_addr(const VhostShadowVirtqueue *svq,
+  struct vhost_vring_addr *addr);
+size_t vhost_svq_driver_area_size(const VhostShadowVirtqueue *svq);
+size_t vhost_svq_device_area_size(const VhostShadowVirtqueue *svq);
  
  bool vhost_svq_start(struct vhost_dev *dev, unsigned idx,

   VhostShadowVirtqueue *svq);
diff --git a/hw/virtio/vhost-shadow-virtqueue.c 
b/hw/virtio/vhost-shadow-virtqueue.c
index 3fe129cf63..5c1899f6af 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -18,6 +18,9 @@
  
  /* Shadow virtqueue to relay notifications */

  typedef struct VhostShadowVirtqueue {
+/* Shadow vring */
+struct vring vring;
+
  /* Shadow kick notifier, sent to vhost */
  EventNotifier kick_notifier;
  /* Shadow call notifier, sent to vhost */
@@ -38,6 +41,9 @@ typedef struct VhostShadowVirtqueue {
  
  /* Virtio queue shadowing */

  VirtQueue *vq;
+
+/* Virtio device */
+VirtIODevice *vdev;
  } VhostShadowVirtqueue;
  
  /* Forward guest notifications */

@@ -93,6 +99,35 @@ void vhost_svq_set_guest_call_notifier(VhostShadowVirtqueue 
*svq, int call_fd)
  event_notifier_init_fd(&svq->guest_call_notifier, call_fd);
  }
  
+/*

+ * Get the shadow vq vring address.
+ * @svq Shadow virtqueue
+ * @addr Destination to store address
+ */
+void vhost_svq_get_vring_addr(const VhostShadowVirtqueue *svq,
+  struct vhost_vring_addr *addr)
+{
+addr->desc_user_addr = (uint64_t)svq->vring.desc;
+addr->avail_user_addr = (uint64_t)svq->vring.avail;
+addr->used_user_addr = (uint64_t)svq->vring.used;
+}
+
+size_t vhost_svq_driver_area_size(const VhostShadowVirtqueue *svq)
+{
+uint16_t vq_idx = virtio_get_queue_index(svq->vq);
+size_t desc_size = virtio_queue_get_desc_size(svq->vdev, vq_idx);
+size_t avail_size = virtio_queue_get_avail_size(svq->vdev, vq_idx);
+
+return ROUND_UP(desc_size + avail_size, qemu_real_host_page_size);



Is this round up required by the spec?



+}
+
+size_t vhost_svq_device_area_size(const VhostShadowVirtqueue *svq)
+{
+uint16_t vq_idx = virtio_get_queue_index(svq->vq);
+size_t used_size = virtio_queue_get_used_size(svq->vdev, vq_idx);
+return ROUND_UP(used_size, qemu_real_host_page_size);
+}
+
  /*
   * Restore the vhost guest to host notifier, i.e., disables svq effect.
   */
@@ -178,6 +213,10 @@ void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
  VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
  {
  int vq_idx = dev->vq_index + idx;
+unsigned num = virtio_queue_get_num(dev->vdev, vq_idx);
+size_t desc_size = virtio_queue_get_desc_size(dev->vdev, vq_idx);
+size_t driver_size;
+size_t device_size;
  g_autofree VhostShadowVirtqueue *svq = g_new0(VhostShadowVirtqueue, 1);
  int r;
  
@@ -196,6 +235,15 @@ VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)

  }
  
  svq->vq = virtio_get_queue(dev->vdev, vq_idx);

+svq->vdev = dev->vdev;
+driver_size = vhost_svq_driver_area_size(svq);
+device_size = vhost_svq_device_area_size(svq);
+svq->vring.num = num;
+svq->vring.desc = qemu_memalign(qemu_real_host_page_size, driver_size);
+svq->vring.avail = (void *)((char *)svq->vring.desc + desc_size);
+memset(svq->vring.desc, 0, driver_size);



Any reason for using a contiguous area for both desc and avail?

Thanks



+svq->vring.used = qemu_memalign(qemu_real_host_page_size, device_size);
+memset(svq->vring.used, 0, device_size);
  event_notifier_set_handler(&svq->call_notifier,
 vhost_svq_handle_call);
  return g_steal_pointer(&svq);
@@ -215,5 +263,7 @@ void vhost_svq_free(VhostShadowVirtqueue *vq)
  event_notifier_cleanup(&vq->kick_notifier);
  event_notifier_set_handler(&vq->call_notifier, NULL);
  event_notifier_cleanup(&vq->call_notifier);
+qemu_vfree(vq->vring.desc);
+qemu_vfree(vq->vring.used);
  g_free(vq);
  }
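
On the earlier ROUND_UP question: the sizes themselves follow the virtio
1.1 split-ring layout, and the rounding appears to be for host-page-
granular DMA mapping rather than anything the spec requires. For
reference, the spec layout for a ring of num entries works out to:

    /* Split virtqueue sizes per the virtio 1.1 spec; no page alignment
     * is demanded by the spec itself. */
    size_t desc_size  = 16 * num;      /* struct vring_desc[num]             */
    size_t avail_size = 6 + 2 * num;   /* flags, idx, ring[num], used_event  */
    size_t used_size  = 6 + 8 * num;   /* flags, idx, ring[num], avail_event */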



Re: [RFC PATCH v4 11/20] vhost: Route host->guest notification through shadow virtqueue

2021-10-12 Thread Jason Wang


在 2021/10/1 下午3:05, Eugenio Pérez 写道:

This will make qemu aware of the device used buffers, allowing it to
write the guest memory with its contents if needed.

Since the use of vhost_virtqueue_start can unmask and discard call
events, vhost_virtqueue_start should be modified in one of these ways:
* Split it in two: one that uses all the logic to start a queue with no
   side effects for the guest, and another that actually assumes that
   the guest has just started the device. Vdpa should use just the
   former.
* Actually store and check whether the guest notifier is masked, and do
   it conditionally.
* Leave it as is, and duplicate all the logic in vhost-vdpa.



Btw, the log is not clear. I guess this patch goes for method 3; if so,
we need to explain that, and why.


Thanks




Signed-off-by: Eugenio Pérez



Re: [RFC PATCH v4 11/20] vhost: Route host->guest notification through shadow virtqueue

2021-10-12 Thread Jason Wang


在 2021/10/1 下午3:05, Eugenio Pérez 写道:

This will make qemu aware of the device used buffers, allowing it to
write the guest memory with its contents if needed.

Since the use of vhost_virtqueue_start can unmask and discard call
events, vhost_virtqueue_start should be modified in one of these ways:
* Split it in two: one that uses all the logic to start a queue with no
   side effects for the guest, and another that actually assumes that
   the guest has just started the device. Vdpa should use just the
   former.
* Actually store and check whether the guest notifier is masked, and do
   it conditionally.
* Leave it as is, and duplicate all the logic in vhost-vdpa.

Signed-off-by: Eugenio Pérez 
---
  hw/virtio/vhost-shadow-virtqueue.c | 19 +++
  hw/virtio/vhost-vdpa.c | 38 +-
  2 files changed, 56 insertions(+), 1 deletion(-)

diff --git a/hw/virtio/vhost-shadow-virtqueue.c 
b/hw/virtio/vhost-shadow-virtqueue.c
index 21dc99ab5d..3fe129cf63 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -53,6 +53,22 @@ static void vhost_handle_guest_kick(EventNotifier *n)
  event_notifier_set(&svq->kick_notifier);
  }
  
+/* Forward vhost notifications */

+static void vhost_svq_handle_call_no_test(EventNotifier *n)
+{
+VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
+ call_notifier);
+
+event_notifier_set(&svq->guest_call_notifier);
+}
+
+static void vhost_svq_handle_call(EventNotifier *n)
+{
+if (likely(event_notifier_test_and_clear(n))) {
+vhost_svq_handle_call_no_test(n);
+}
+}
+
  /*
   * Obtain the SVQ call notifier, where vhost device notifies SVQ that there
   * exists pending used buffers.
@@ -180,6 +196,8 @@ VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, 
int idx)
  }
  
  svq->vq = virtio_get_queue(dev->vdev, vq_idx);

+event_notifier_set_handler(&svq->call_notifier,
+   vhost_svq_handle_call);
  return g_steal_pointer(&svq);
  
  err_init_call_notifier:

@@ -195,6 +213,7 @@ err_init_kick_notifier:
  void vhost_svq_free(VhostShadowVirtqueue *vq)
  {
  event_notifier_cleanup(&vq->kick_notifier);
+event_notifier_set_handler(&vq->call_notifier, NULL);
  event_notifier_cleanup(&vq->call_notifier);
  g_free(vq);
  }
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index bc34de2439..6c5f4c98b8 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -712,13 +712,40 @@ static bool vhost_vdpa_svq_start_vq(struct vhost_dev 
*dev, unsigned idx)
  {
  struct vhost_vdpa *v = dev->opaque;
  VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, idx);
-return vhost_svq_start(dev, idx, svq);
+EventNotifier *vhost_call_notifier = vhost_svq_get_svq_call_notifier(svq);
+struct vhost_vring_file vhost_call_file = {
+.index = idx + dev->vq_index,
+.fd = event_notifier_get_fd(vhost_call_notifier),
+};
+int r;
+bool b;
+
+/* Set shadow vq -> guest notifier */
+assert(v->call_fd[idx]);



We need to avoid the assert() here. In which case can we hit it?



+vhost_svq_set_guest_call_notifier(svq, v->call_fd[idx]);
+
+b = vhost_svq_start(dev, idx, svq);
+if (unlikely(!b)) {
+return false;
+}
+
+/* Set device -> SVQ notifier */
+r = vhost_vdpa_set_vring_dev_call(dev, &vhost_call_file);
+if (unlikely(r)) {
+error_report("vhost_vdpa_set_vring_call for shadow vq failed");
+return false;
+}



Similar to kick, do we need to set_vring_call() before vhost_svq_start()?



+
+/* Check for pending calls */
+event_notifier_set(vhost_call_notifier);



Interesting, can this result in a spurious interrupt?



+return true;
  }
  
  static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)

  {
  struct vhost_dev *hdev = v->dev;
  unsigned n;
+int r;
  
  if (enable == v->shadow_vqs_enabled) {

  return hdev->nvqs;
@@ -752,9 +779,18 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa 
*v, bool enable)
  if (!enable) {
  /* Disable all queues or clean up failed start */
  for (n = 0; n < v->shadow_vqs->len; ++n) {
+struct vhost_vring_file file = {
+.index = vhost_vdpa_get_vq_index(hdev, n),
+.fd = v->call_fd[n],
+};
+
+r = vhost_vdpa_set_vring_call(hdev, &file);
+assert(r == 0);
+
  unsigned vq_idx = vhost_vdpa_get_vq_index(hdev, n);
  VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, n);
  vhost_svq_stop(hdev, n, svq);
+/* TODO: This can unmask or override call fd! */



I don't get this comment. Does this mean the current code can't work 
with mask_notifiers? If yes, this is something we need to fix.


Thanks



  vhost_virtqueue_start(hdev, hdev

Re: [RFC PATCH v4 10/20] vhost-vdpa: Take into account SVQ in vhost_vdpa_set_vring_call

2021-10-12 Thread Jason Wang


在 2021/10/1 下午3:05, Eugenio Pérez 写道:

Signed-off-by: Eugenio Pérez 
---
  hw/virtio/vhost-vdpa.c | 17 ++---
  1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 57a857444a..bc34de2439 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -649,16 +649,27 @@ static int vhost_vdpa_set_vring_kick(struct vhost_dev 
*dev,
  return vhost_vdpa_call(dev, VHOST_SET_VRING_KICK, file);
  }
  
+static int vhost_vdpa_set_vring_dev_call(struct vhost_dev *dev,

+ struct vhost_vring_file *file)
+{
+trace_vhost_vdpa_set_vring_call(dev, file->index, file->fd);
+return vhost_vdpa_call(dev, VHOST_SET_VRING_CALL, file);
+}
+
  static int vhost_vdpa_set_vring_call(struct vhost_dev *dev,
 struct vhost_vring_file *file)
  {
  struct vhost_vdpa *v = dev->opaque;
  int vdpa_idx = vhost_vdpa_get_vq_index(dev, file->index);
  
-trace_vhost_vdpa_set_vring_call(dev, file->index, file->fd);

-
  v->call_fd[vdpa_idx] = file->fd;
-return vhost_vdpa_call(dev, VHOST_SET_VRING_CALL, file);
+if (v->shadow_vqs_enabled) {
+VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, vdpa_idx);
+vhost_svq_set_guest_call_notifier(svq, file->fd);
+return 0;
+} else {
+return vhost_vdpa_set_vring_dev_call(dev, file);
+}



I feel like we should do the same for kick fd.

Thanks



  }
  
  static int vhost_vdpa_get_features(struct vhost_dev *dev,
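
Mirroring that for the kick fd might look like the sketch below
(vhost_svq_set_guest_kick_fd() is an assumed helper, not part of the
series as posted):

    static int vhost_vdpa_set_vring_kick(struct vhost_dev *dev,
                                         struct vhost_vring_file *file)
    {
        struct vhost_vdpa *v = dev->opaque;
        int vdpa_idx = vhost_vdpa_get_vq_index(dev, file->index);

        if (v->shadow_vqs_enabled) {
            /* Route the guest's kick into SVQ instead of the device */
            VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs,
                                                          vdpa_idx);
            vhost_svq_set_guest_kick_fd(svq, file->fd); /* assumed helper */
            return 0;
        }
        return vhost_vdpa_call(dev, VHOST_SET_VRING_KICK, file);
    }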



Re: [RFC PATCH v4 09/20] vdpa: Save call_fd in vhost-vdpa

2021-10-12 Thread Jason Wang


在 2021/10/1 下午3:05, Eugenio Pérez 写道:

We need to know it to switch to Shadow VirtQueue.

Signed-off-by: Eugenio Pérez 
---
  include/hw/virtio/vhost-vdpa.h | 2 ++
  hw/virtio/vhost-vdpa.c | 5 +
  2 files changed, 7 insertions(+)

diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index 48aae59d8e..fddac248b3 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -30,6 +30,8 @@ typedef struct vhost_vdpa {
  GPtrArray *shadow_vqs;
  struct vhost_dev *dev;
  QLIST_ENTRY(vhost_vdpa) entry;
+/* File descriptor the device uses to call VM/SVQ */
+int call_fd[VIRTIO_QUEUE_MAX];



Any reason we don't do this for kick_fd, or why
virtio_queue_get_guest_notifier() can't work here? This needs a comment
or a note in the commit log.


I think we need to have a consistent way to handle both kick and call fd.

Thanks



  VhostVDPAHostNotifier notifier[VIRTIO_QUEUE_MAX];
  } VhostVDPA;
  
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c

index 36c954a779..57a857444a 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -652,7 +652,12 @@ static int vhost_vdpa_set_vring_kick(struct vhost_dev *dev,
  static int vhost_vdpa_set_vring_call(struct vhost_dev *dev,
 struct vhost_vring_file *file)
  {
+struct vhost_vdpa *v = dev->opaque;
+int vdpa_idx = vhost_vdpa_get_vq_index(dev, file->index);
+
  trace_vhost_vdpa_set_vring_call(dev, file->index, file->fd);
+
+v->call_fd[vdpa_idx] = file->fd;
  return vhost_vdpa_call(dev, VHOST_SET_VRING_CALL, file);
  }
  



Re: [RFC PATCH v4 08/20] vhost: Route guest->host notification through shadow virtqueue

2021-10-12 Thread Jason Wang


在 2021/10/1 下午3:05, Eugenio Pérez 写道:

Shadow virtqueue notification forwarding is disabled when vhost_dev
stops, so the code flow follows the usual cleanup.

Also, host notifiers must be disabled at SVQ start,



Any reason for this?



and they will not be started if SVQ was enabled while the device was
stopped. This is trivial to address, but it is left out for simplicity
at this moment.



It looks to me like this patch also contains the following logic:

1) codes to enable svq

2) codes to let svq to be enabled from QMP.

I think they need to be split out; we may end up with the following
series of patches:


1) svq skeleton with enable/disable
2) route host notifier to svq
3) route guest notifier to svq
4) codes to enable svq
5) enable svq via QMP




Signed-off-by: Eugenio Pérez 
---
  qapi/net.json  |   2 +-
  hw/virtio/vhost-shadow-virtqueue.h |   8 ++
  include/hw/virtio/vhost-vdpa.h |   4 +
  hw/virtio/vhost-shadow-virtqueue.c | 138 -
  hw/virtio/vhost-vdpa.c | 116 +++-
  5 files changed, 264 insertions(+), 4 deletions(-)

diff --git a/qapi/net.json b/qapi/net.json
index a2c30fd455..fe546b0e7c 100644
--- a/qapi/net.json
+++ b/qapi/net.json
@@ -88,7 +88,7 @@
  #
  # @enable: true to use the alternate shadow VQ notifications
  #
-# Returns: Always error, since SVQ is not implemented at the moment.
+# Returns: Error if failure, or 'no error' for success.
  #
  # Since: 6.2
  #
diff --git a/hw/virtio/vhost-shadow-virtqueue.h 
b/hw/virtio/vhost-shadow-virtqueue.h
index 27ac6388fa..237cfceb9c 100644
--- a/hw/virtio/vhost-shadow-virtqueue.h
+++ b/hw/virtio/vhost-shadow-virtqueue.h
@@ -14,6 +14,14 @@
  
  typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
  
+EventNotifier *vhost_svq_get_svq_call_notifier(VhostShadowVirtqueue *svq);



Let's move this function to another patch since it's unrelated to the 
guest->host routing.




+void vhost_svq_set_guest_call_notifier(VhostShadowVirtqueue *svq, int call_fd);
+
+bool vhost_svq_start(struct vhost_dev *dev, unsigned idx,
+ VhostShadowVirtqueue *svq);
+void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
+VhostShadowVirtqueue *svq);
+
  VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx);
  
  void vhost_svq_free(VhostShadowVirtqueue *vq);

diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index 0d565bb5bd..48aae59d8e 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -12,6 +12,8 @@
  #ifndef HW_VIRTIO_VHOST_VDPA_H
  #define HW_VIRTIO_VHOST_VDPA_H
  
+#include 

+
  #include "qemu/queue.h"
  #include "hw/virtio/virtio.h"
  
@@ -24,6 +26,8 @@ typedef struct vhost_vdpa {

  int device_fd;
  uint32_t msg_type;
  MemoryListener listener;
+bool shadow_vqs_enabled;
+GPtrArray *shadow_vqs;
  struct vhost_dev *dev;
  QLIST_ENTRY(vhost_vdpa) entry;
  VhostVDPAHostNotifier notifier[VIRTIO_QUEUE_MAX];
diff --git a/hw/virtio/vhost-shadow-virtqueue.c 
b/hw/virtio/vhost-shadow-virtqueue.c
index c4826a1b56..21dc99ab5d 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -9,9 +9,12 @@
  
  #include "qemu/osdep.h"

  #include "hw/virtio/vhost-shadow-virtqueue.h"
+#include "hw/virtio/vhost.h"
+
+#include "standard-headers/linux/vhost_types.h"
  
  #include "qemu/error-report.h"

-#include "qemu/event_notifier.h"
+#include "qemu/main-loop.h"
  
  /* Shadow virtqueue to relay notifications */

  typedef struct VhostShadowVirtqueue {
@@ -19,14 +22,146 @@ typedef struct VhostShadowVirtqueue {
  EventNotifier kick_notifier;
  /* Shadow call notifier, sent to vhost */
  EventNotifier call_notifier;
+
+/*
+ * Borrowed virtqueue's guest to host notifier.
+ * Borrowing it into this event notifier allows us to register it on the
+ * event loop and access the associated shadow virtqueue easily. If we
+ * used the VirtQueue, we would not have an easy way to retrieve it.
+ *
+ * So the shadow virtqueue must not clean it up, or we would lose the
+ * VirtQueue's one.
+ */
+EventNotifier host_notifier;
+
+/* Guest's call notifier, where SVQ calls guest. */
+EventNotifier guest_call_notifier;



To be consistent, let's simply use "guest_notifier" here.



+
+/* Virtio queue shadowing */
+VirtQueue *vq;
  } VhostShadowVirtqueue;
  
+/* Forward guest notifications */

+static void vhost_handle_guest_kick(EventNotifier *n)
+{
+VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
+ host_notifier);
+
+if (unlikely(!event_notifier_test_and_clear(n))) {
+return;
+}



Is there a chance that we may stop the processing of available buffers 
during the svq enabling? There could be no kick from the guest in this case.




+
+event_notifier_set(&svq->kick_notifier);
+}
+
+/*
+ * Obtain the SVQ call notifier, where vhost device no

Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-12 Thread Dan Williams
On Tue, Oct 12, 2021 at 2:28 PM Andi Kleen  wrote:
[..]
> >> But how do you debug the kernel then? Making the early boot
> >> undebuggable seems like just bad policy to me.
> > I am not proposing making the early boot undebuggable.
>
>
> That's the implication of moving the policy into initrd.
>
>
> If only initrd can authorize then it won't be possible to authorize
> before initrd, thus the early console won't work.

Again, the proposal is that the allow-list is limited to just enough
devices to startup and debug the initramfs and no more. Everything
else can be dynamic, and this allows for a powerful custom override
interface without needing to debate additional ABI like command line
overrides, and minimizes future changes to this kernel-internal
allow-list.
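
As a sketch of what "just enough to boot" could mean (the entries are
illustrative, not a proposed list):

    /* Built-in allow-list: only what is needed to reach and debug the
     * initramfs; everything else is authorized dynamically from there. */
    static const char * const bootstrap_allow_list[] __initconst = {
        "virtio-console",   /* early output                */
        "virtio-blk",       /* root/initramfs block device */
        "virtio-net",       /* network boot and debug      */
    };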


Re: [PATCH v5 16/16] x86/tdx: Add cmdline option to force use of ioremap_host_shared

2021-10-12 Thread Michael S. Tsirkin
On Tue, Oct 12, 2021 at 02:18:01PM -0700, Andi Kleen wrote:
> 
> > Interesting. VT-d tradeoffs ... what are they?
> 
> The connection to the device is not encrypted and also not authenticated.
> 
> This is different than even talking to the (untrusted) host through shared
> memory, where you at least still have a common key.

Well it's different sure enough but how is talking to host less secure?
Cold boot attacks and such?

> > Allowing hypervisor to write into BIOS looks like it will
> > trivially lead to code execution, won't it?
> 
> This is not about BIOS code executing. While the guest firmware runs it is
> protected of course. This is for BIOS structures like ACPI tables that are
> mapped by Linux. While AML can run byte code it can normally not write to
> arbitrary memory.

I thought you basically create an OperationRegion of SystemMemory type,
and off you go. Maybe the OSPM in Linux is clever and protects
some memory, I wouldn't know.

> The risk is more that all the Linux code dealing with this hasn't been
> hardened to deal with malicious input.
> 
> -Andi


-- 
MST



Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-12 Thread Andi Kleen




But that was due to performance problems in hot paths. Nothing of this
applies here.

It applies because a new API that individual driver authors must adopt is
being proposed, and that's an ongoing maintenance burden that might be
mitigated by hiding that implementation detail from leaf drivers.


Right now we're only talking about 2 places to change, and none of those 
are actually in individual drivers, but in the virtio generic code and 
in the MSI code.


While there might be drivers in the future that do it directly, it will
always be the exception; normal drivers don't have to deal with this.





For me it seems both very straightforward and simple (but then I'm biased)

You seem to be having a difficult time iterating this proposal toward
consensus. I don't think the base principles are being contested as
much as the semantics, scope, and need for the proposed API that is in
the purview of all leaf driver developers.

Right now there is no leaf driver changed at all.



I'd rather see more concerted efforts focused/limited core changes
rather than leaf driver changes until there is a clearer definition of
hardened.

A hardened driver is a driver that

- Had a similar security (not API) oriented review of its IO operations
(mainly MMIO access, but also PCI config space) as a non-privileged user
interface (like an ioctl). That review should be focused on memory safety.

- Had some fuzzing on these IO interfaces using to-be-released tools.

What is the intersection of ioremap() users that are outside of the
proposed probe authorization regime AND want confidential computing
support?



Right now it's zero I believe.

That is, there is other low-level code that sets memory shared, but it
uses other mechanisms, not ioremap.




Also there is the more practical problem that some devices are needed
for booting. For example in TDX we can't print something to the console
with this mechanism, so you would never get any output before the
initrd. Just seems like a nightmare for debugging anything. There really
needs to be an authorization mechanism that works reasonably early.

I can see a point of having user space overrides though, but we need to
have a sane kernel default that works early.

Right, as I suggested [1], just enough early authorization to
bootstrap/debug initramfs and then that can authorize the remainder.

But how do you debug the kernel then? Making the early boot undebuggable
seems like just bad policy to me.

I am not proposing making the early boot undebuggable.



That's the implication of moving the policy into initrd.


If only initrd can authorize then it won't be possible to authorize 
before initrd, thus the early console won't work.


-Andi




Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-12 Thread Andi Kleen



On 10/12/2021 2:18 PM, Michael S. Tsirkin wrote:

On Tue, Oct 12, 2021 at 02:14:44PM -0700, Dan Williams wrote:

Especially in this case where the virtio use case being
opted-in is *already* in a path that has been authorized by the
device-filter policy engine.

That's a good point. Andi, how about setting a per-device flag
if its ID has been allowed and then making pci_iomap create
a shared mapping transparently?


Yes for pci_iomap we could do that.

If someone uses raw ioremap without a device it won't work, but I don't 
think that's the case for virtio at least.


I suppose we could solve that problem if it actually happens.

-Andi
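
A minimal sketch of that per-device-flag idea (ioremap_host_shared() is
from this series; the wrapper name, the authorization predicate, and the
plumbing are assumptions):

    void __iomem *pci_iomap_shared_aware(struct pci_dev *dev, int bar,
                                         unsigned long maxlen)
    {
        resource_size_t start = pci_resource_start(dev, bar);
        resource_size_t len = pci_resource_len(dev, bar);

        if (len && maxlen && len > maxlen)
            len = maxlen;

        /* The ID already passed the device filter: share transparently */
        if (pci_dev_authorized(dev))                /* assumed predicate */
            return ioremap_host_shared(start, len); /* from this series  */
        return ioremap(start, len);
    }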



Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-12 Thread Michael S. Tsirkin
On Tue, Oct 12, 2021 at 02:14:44PM -0700, Dan Williams wrote:
> Especially in this case where the virtio use case being
> opted-in is *already* in a path that has been authorized by the
> device-filter policy engine.

That's a good point. Andi, how about setting a per-device flag
if its ID has been allowed and then making pci_iomap create
a shared mapping transparently?

-- 
MST



Re: [PATCH v5 16/16] x86/tdx: Add cmdline option to force use of ioremap_host_shared

2021-10-12 Thread Andi Kleen




Interesting. VT-d tradeoffs ... what are they?


The connection to the device is not encrypted and also not authenticated.

This is different than even talking to the (untrusted) host through
shared memory, where you at least still have a common key.



Allowing hypervisor to write into BIOS looks like it will
trivially lead to code execution, won't it?


This is not about BIOS code executing. While the guest firmware runs it 
is protected of course. This is for BIOS structures like ACPI tables 
that are mapped by Linux. While AML can run byte code it can normally 
not write to arbitrary memory.


The risk is more that all the Linux code dealing with this hasn't been 
hardened to deal with malicious input.


-Andi



Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-12 Thread Dan Williams
On Tue, Oct 12, 2021 at 11:35 AM Andi Kleen  wrote:
>
>
> > The "better safe-than-sorry" argument is hard to build consensus
> > around. The spectre mitigations ran into similar problems where the
> > community rightly wanted to see the details and instrument the
> > problematic paths rather than blanket sprinkle lfence "just to be
> > safe".
>
> But that was due to performance problems in hot paths. Nothing of this
> applies here.

It applies because a new API that individual driver authors must adopt is
being proposed, and that's an ongoing maintenance burden that might be
mitigated by hiding that implementation detail from leaf drivers.

>
> > In this case the rules about when a driver is suitably
> > "hardened" are vague and the overlapping policy engines are confusing.
>
> What is confusing exactly?

Multiple places to go to enable functionality. The device-filter
firewall policy can collide with the ioremap access control policy.

> For me it seems both very straightforward and simple (but then I'm biased)

You seem to be having a difficult time iterating this proposal toward
consensus. I don't think the base principles are being contested as
much as the semantics, scope, and need for the proposed API that is in
the purview of all leaf driver developers.

> The policy is:
>
> - Have an allow list at driver registration.
>
> - Have an additional opt-in for MMIO mappings (and potentially config
> space, but that's not currently there) to cover init calls completely.

The proliferation of policy engines and driver special casing is the
issue. Especially in this case where the virtio use case being
opted-in is *already* in a path that has been authorized by the
device-filter policy engine. I.e. why special case the ioremap() in
virtio to be additionally authorized when the device has already been
authorized to probe? Put another way, the easiest driver API change to
merge would be no additional changes in leaf drivers.

>
> >
> > I'd rather see more concerted efforts focused/limited core changes
> > rather than leaf driver changes until there is a clearer definition of
> > hardened.
>
> A hardened driver is a driver that
>
> - Had a similar security (not API) oriented review of its IO operations
> (mainly MMIO access, but also PCI config space) as a non-privileged user
> interface (like an ioctl). That review should be focused on memory safety.
>
> - Had some fuzzing on these IO interfaces using to-be-released tools.

What is the intersection of ioremap() users that are outside of the
proposed probe authorization regime AND want confidential computing
support?

> Right now it's only three virtio drivers (console, net, block)
>
> Really it's no different than what we do for every new unprivileged user
> interface.
>
>
> > I.e. instead of jumping to the assertion that fixing up
> > these init-path vulnerabilities are too big to fix, dig to the next
> > level to provide more evidence that per-driver opt-in is the only
> > viable option.
> >
> > For example, how many of these problematic paths are built-in to the
> > average kernel config?
>
> I don't think arguments from "the average kernel config" (if such a
> thing even exists) are useful. That would be just hand waving.

I'm trying to bridge to your contention that this enabling cannot
rely on custom kernel configs and must offer protection on the same
kernel image that might ship in the host, but let's set this aside and
focus on when and where leaf drivers need to adopt a new API.

> > A strawman might be to add a sprinkling error
> > exits in the module_init() of the problematic drivers, and only fail
> > if the module is built-in, and let modprobe policy handle the rest.
>
>
> That would be already hundreds of changes. I have no idea how such a
> thing could be maintained or sustained either.
>
> Really I don't even see how these alternatives can be considered. Tree
> sweeps should always be last resort. They're a pain for everyone. But
> here they're casually thrown around as alternatives to straight forward
> one or two line changes.

If it looked straightforward I'm not sure we would be having this
discussion, I think it's reasonable to ask if this is a per-driver
opt-in responsibility that must be added in addition to probe
authorization.

> >> Default policy in user space just seems to be a bad idea here. Who
> >> should know if a driver is hardened other than the kernel? Maintaining
> >> the list somewhere else just doesn't make sense to me.
> > I do not understand the maintenance burden correlation of where the
> > policy is driven vs where the list is maintained?
>
> All the hardening and auditing happens in the kernel tree. So it seems
> the natural place to store the result is in the kernel tree.
>
> But there's no single package for initrd, so you would need custom
> configurations for all the supported distros.
>
> Also we're really arguing about a list that currently has three entries.
>
>
> >   Even if I agreed
> > with the contention that out-of-t

Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-12 Thread Michael S. Tsirkin
On Tue, Oct 12, 2021 at 06:36:16PM +, Reshetova, Elena wrote:
> > The 5.15 tree has something like ~2.4k IO accesses (including MMIO and
> > others) in init functions that also register drivers (thanks Elena for
> > the number)
> 
> To provide more numbers on this. What I can see so far from a smatch-based
> analysis, we have 409 __init style functions (.probe & builtin/module_
> _platform_driver_probe excluded) for 5.15 with allyesconfig.

I don't think we care about allyesconfig at all though.
Just don't do that.
How about allmodconfig? This is closer to what distros actually do.

-- 
MST



Re: [PATCH v5 16/16] x86/tdx: Add cmdline option to force use of ioremap_host_shared

2021-10-12 Thread Michael S. Tsirkin
On Tue, Oct 12, 2021 at 10:55:20AM -0700, Andi Kleen wrote:
> 
> > I mean ... it's already widespread.
> 
> 
> I meant widespread usage with confidential guests.
> 
> > If we support it with TDX
> > it will be used with TDX.
> 
> It has some security trade-offs. The main reason to use TDX is security.
> Also when people take the VT-d trade-offs they might be ok with the BIOS
> trade-offs too.
> 
> -Andi

Interesting. VT-d tradeoffs ... what are they?
Allowing hypervisor to write into BIOS looks like it will
trivially lead to code execution, won't it?

-- 
MST



Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-12 Thread Andi Kleen



On 10/12/2021 12:13 PM, Dan Williams wrote:

On Tue, Oct 12, 2021 at 11:57 AM Reshetova, Elena
 wrote:



I suspect the true number is even higher because that doesn't include IO
inside calls to other modules and indirect pointers, correct?

Actually everything should be included. Smatch has cross-function db and
I am using it for getting the call chains and it follows function pointers.
Also since I am starting from a list of individual read IOs, every single
base read IO in drivers/* should be covered as far as I can see. But if it uses
some weird IO wrappers then the actual list might be higher.

Why analyze individual IO calls? I thought the goal here was to
disable entire classes of ioremap() users?


This is everything that would need to be moved somewhere else if we
didn't disable entire classes of ioremap users.


-Andi



Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-12 Thread Dan Williams
On Tue, Oct 12, 2021 at 11:57 AM Reshetova, Elena
 wrote:
>
>
> > I suspect the true number is even higher because that doesn't include IO
> > inside calls to other modules and indirect pointers, correct?
>
> Actually everything should be included. Smatch has cross-function db and
> I am using it for getting the call chains and it follows function pointers.
> Also since I am starting from a list of individual read IOs, every single
> base read IO in drivers/* should be covered as far as I can see. But if it 
> uses
> some weird IO wrappers then the actual list might be higher.

Why analyze individual IO calls? I thought the goal here was to
disable entire classes of ioremap() users?


Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-12 Thread Andi Kleen



On 10/12/2021 11:36 AM, Reshetova, Elena wrote:

The 5.15 tree has something like ~2.4k IO accesses (including MMIO and
others) in init functions that also register drivers (thanks Elena for
the number)

To provide more numbers on this: what I can see so far from a smatch-based
analysis, we have 409 __init style functions (.probe & builtin/module_
_platform_driver_probe excluded) for 5.15 with allyesconfig. The number of
distinct individual IO reads (MSRs included) is much higher than 2.4k, in
the range of 30k, because quite often there is more than a single IO read
in the same source function. The full list of accesses and the possible
call paths is too huge to look at manually, but here is the list of the
409 functions if anyone wants to take a look:



Thanks Elena.


I suspect the true number is even higher because that doesn't include IO 
inside calls to other modules and indirect pointers, correct?



-Andi



Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-12 Thread Andi Kleen



On 10/11/2021 10:31 PM, Christoph Hellwig wrote:

On Mon, Oct 11, 2021 at 03:09:09PM -0400, Michael S. Tsirkin wrote:

The reason we have trouble is that it's not clear what the API means
outside the realm of TDX.
If we really, truly want an API that says "ioremap and it's a hardened
driver" then I guess ioremap_hardened_driver is what you want.

Yes.  And why would be we ioremap the BIOS anyway?  It is not I/O memory
in any of the senses we generally use ioremap for.


I/O memory is anything outside the kernel memory map.

-Andi




Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-12 Thread Andi Kleen




The "better safe-than-sorry" argument is hard to build consensus
around. The spectre mitigations ran into similar problems where the
community rightly wanted to see the details and instrument the
problematic paths rather than blanket sprinkle lfence "just to be
safe".


But that was due to performance problems in hot paths. Nothing of this 
applies here.



In this case the rules about when a driver is suitably
"hardened" are vague and the overlapping policy engines are confusing.


What is confusing exactly?

For me it seems both very straightforward and simple (but then I'm biased)

The policy is:

- Have an allow list at driver registration.

- Have an additional opt-in for MMIO mappings (and potentially config 
space, but that's not currently there) to cover init calls completely.
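
A minimal sketch of that two-level policy, with every name hypothetical
except ioremap_host_shared() from this series:

    /* Level 1: allow list consulted when the driver registers */
    int cc_driver_register(struct device_driver *drv)
    {
        if (cc_guest_filter_active() && !driver_allowed(drv->name))
            return -EPERM;          /* unaudited drivers never bind */
        return driver_register(drv);
    }

    /* Level 2: explicit opt-in at MMIO map time, covering init paths */
    void __iomem *map_device_mmio(phys_addr_t start, size_t len)
    {
        return ioremap_host_shared(start, len);
    }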




I'd rather see more concerted efforts focused/limited core changes
rather than leaf driver changes until there is a clearer definition of
hardened.


A hardened driver is a driver that

- Had a similar security (not API) oriented review of its IO operations
(mainly MMIO access, but also PCI config space) as a non-privileged user
interface (like an ioctl). That review should be focused on memory safety.


- Had some fuzzing on these IO interfaces using to-be-released tools.

Right now it's only three virtio drivers (console, net, block)

Really it's no different than what we do for every new unprivileged user 
interface.




I.e. instead of jumping to the assertion that fixing up
these init-path vulnerabilities are too big to fix, dig to the next
level to provide more evidence that per-driver opt-in is the only
viable option.

For example, how many of these problematic paths are built-in to the
average kernel config?


I don't think arguments from "the average kernel config" (if such a
thing even exists) are useful. That would be just hand waving.




A strawman might be to add a sprinkling error
exits in the module_init() of the problematic drivers, and only fail
if the module is built-in, and let modprobe policy handle the rest.



That would be already hundreds of changes. I have no idea how such a 
thing could be maintained or sustained either.


Really I don't even see how these alternatives can be considered. Tree
sweeps should always be a last resort. They're a pain for everyone. But
here they're casually thrown around as alternatives to straightforward
one or two line changes.








Default policy in user space just seems to be a bad idea here. Who
should know if a driver is hardened other than the kernel? Maintaining
the list somewhere else just doesn't make sense to me.

I do not understand the maintenance burden correlation of where the
policy is driven vs where the list is maintained?


All the hardening and auditing happens in the kernel tree. So it seems 
the natural place to store the result is in the kernel tree.


But there's no single package for initrd, so you would need custom 
configurations for all the supported distros.


Also we're really arguing about a list that currently has three entries.



  Even if I agreed
with the contention that out-of-tree userspace would have a hard time
tracking the "hardened" driver list there is still an in-tree
userspace path to explore. E.g. perf maintains lists of things tightly
coupled to the kernel, this authorized device list seems to be in the
same category of data.


You mean the event list? perf is in the kernel tree, so it's maintained 
together with the kernel.


But we don't have a kernel initrd.






Also there is the more practical problem that some devices are needed
for booting. For example in TDX we can't print something to the console
with this mechanism, so you would never get any output before the
initrd. Just seems like a nightmare for debugging anything. There really
needs to be an authorization mechanism that works reasonably early.

I can see a point of having user space overrides though, but we need to
have a sane kernel default that works early.

Right, as I suggested [1], just enough early authorization to
bootstrap/debug initramfs and then that can authorize the remainder.


But how do you debug the kernel then? Making the early boot undebuggable
seems like just bad policy to me.


And if you fix it for the console, why not add the two more entries for
virtio net and block too?



-Andi



Re: [PATCH v5 16/16] x86/tdx: Add cmdline option to force use of ioremap_host_shared

2021-10-12 Thread Andi Kleen




I mean ... it's already widespread.



I meant widespread usage with confidential guests.


If we support it with TDX
it will be used with TDX.


It has some security trade-offs. The main reason to use TDX is security.
Also when people take the VT-d trade-offs they might be ok with the BIOS
trade-offs too.


-Andi



Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-12 Thread Dan Williams
On Sun, Oct 10, 2021 at 3:11 PM Andi Kleen  wrote:
>
>
> On 10/9/2021 1:39 PM, Dan Williams wrote:
> > On Sat, Oct 9, 2021 at 2:53 AM Michael S. Tsirkin  wrote:
> >> On Fri, Oct 08, 2021 at 05:37:07PM -0700, Kuppuswamy Sathyanarayanan wrote:
> >>> From: Andi Kleen 
> >>>
> >>> For Confidential VM guests like TDX, the host is untrusted and hence
> >>> the devices emulated by the host or any data coming from the host
> >>> cannot be trusted. So the drivers that interact with the outside world
> >>> have to be hardened by sharing memory with host on need basis
> >>> with proper hardening fixes.
> >>>
> >>> For the PCI driver case, to share the memory with the host add
> >>> pci_iomap_host_shared() and pci_iomap_host_shared_range() APIs.
> >>>
> >>> Signed-off-by: Andi Kleen 
> >>> Signed-off-by: Kuppuswamy Sathyanarayanan 
> >>> 
> >> So I proposed to make all pci mappings shared, eliminating the need
> >> to patch drivers.
> >>
> >> To which Andi replied
> >>  One problem with removing the ioremap opt-in is that
> >>  it's still possible for drivers to get at devices without going 
> >> through probe.
> >>
> >> To which Greg replied:
> >> https://lore.kernel.org/all/yvxbnj431yiww...@kroah.com/
> >>  If there are in-kernel PCI drivers that do not do this, they need 
> >> to be
> >>  fixed today.
> >>
> >> Can you guys resolve the differences here?
> > I agree with you and Greg here. If a driver is accessing hardware
> > resources outside of the bind lifetime of one of the devices it
> > supports, and in a way that neither modrobe-policy nor
> > device-authorization -policy infrastructure can block, that sounds
> > like a bug report.
>
> The 5.15 tree has something like ~2.4k IO accesses (including MMIO and
> others) in init functions that also register drivers (thanks Elena for
> the number)
>
> Some are probably old drivers that could be fixed, but it's quite a few
> legitimate cases. For example for platform or ISA drivers that's the
> only way they can be implemented because they often have no other
> enumeration mechanism. For PCI drivers it's rarer, but also still can
> happen. One example that comes to mind here is the x86 Intel uncore
> drivers, which support a mix of MSR, ioremap and PCI config space
> accesses all from the same driver. This particular example can (and
> should be) fixed in other ways, but similar things also happen in other
> drivers, and they're not all broken. Even for the broken ones they're
> usually for some crufty old devices that has very few users, so it's
> likely untestable in practice.
>
> My point is just that the ecosystem of devices that Linux supports is
> messy enough that there are legitimate exceptions to the "first IO
> only in the probe call" rule.
>
> And we can't just fix them all. Even if we could it would be hard to
> maintain.
>
> Using a "firewall model" hooking into a few strategic points like we're
> proposing here is much saner for everyone.
>
> Now we can argue about the details. Right now what we're proposing has
> some redundancies: it has both a device model filter and low level
> filter for ioremap (this patch and some others). The low level filter is
> for catching issues that don't clearly fit into the
> "enumeration<->probe" model. You could call that redundant, but I would
> call it defense in depth or better safe than sorry. In theory it would
> be enough to have the low level opt-in only, but that would have the
> drawback that if something gets enumerated after all, you would have all
> kinds of weird device driver failures and in some cases even killed
> guests. So I think it makes sense to have both.

The "better safe-than-sorry" argument is hard to build consensus
around. The spectre mitigations ran into similar problems where the
community rightly wanted to see the details and instrument the
problematic paths rather than blanket sprinkle lfence "just to be
safe". In this case the rules about when a driver is suitably
"hardened" are vague and the overlapping policy engines are confusing.

I'd rather see concerted effort focused on limited core changes
rather than leaf driver changes until there is a clearer definition of
"hardened". I.e. instead of jumping to the assertion that these
init-path vulnerabilities are too big to fix, dig to the next
level to provide more evidence that per-driver opt-in is the only
viable option.

For example, how many of these problematic paths are built-in to the
average kernel config? A strawman might be to add a sprinkling of error
exits in the module_init() of the problematic drivers, fail only
if the module is built-in, and let modprobe policy handle the rest.
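
A rough sketch of that strawman (illustrative only; the foo_* names and
the confidential-guest helper are assumptions, not existing kernel code):

#include <linux/errno.h>
#include <linux/module.h>
#include <linux/pci.h>

static struct pci_driver foo_pci_driver;	/* driver details omitted in this sketch */

/* Placeholder for a real platform check; how to detect a confidential
 * guest is exactly what is being debated in this thread. */
static bool guest_is_confidential(void)
{
	return false;
}

static int __init foo_init(void)
{
	/* Built-in: no modprobe policy can intervene, so fail early. */
	if (IS_BUILTIN(CONFIG_FOO) && guest_is_confidential())
		return -EPERM;

	/* Modular: defer to userspace modprobe/authorization policy. */
	return pci_register_driver(&foo_pci_driver);
}
module_init(foo_init);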

>
>
> > Fix those drivers instead of sprinkling
> > ioremap_shared in select places and with unclear rules about when a
> > driver is allowed to do "shared" mappings.
>
> Only add it when the driver has been audited and hardened.
>
> But I agree we need a documented process for this. I will work on
> some doc

WorldCist'22 - 10th World Conference on Information Systems and Technologies | Montenegro

2021-10-12 Thread IT
* Conference listed in CORE Ranking

** Google Scholar H5-Index = 23

*** Best papers selected for SCI/SSCI journals



--

WorldCIST'22 - 10th World Conference on Information Systems and Technologies

12-14 April 2022, Budva, Montenegro

http://worldcist.org 

---


The WorldCist'22 - 10th World Conference on Information Systems and 
Technologies, to be held in Budva, Montenegro, 12-14 April 2022, is a global 
forum for researchers and practitioners to present and discuss the most recent 
innovations, trends, results, experiences and concerns in the several 
perspectives of Information Systems and Technologies.

We are pleased to invite you to submit your papers to WorldCist'22. All 
submissions will be reviewed on the basis of relevance, originality, importance 
and clarity.



TOPICS

Submitted papers should be related to one or more of the main themes proposed 
for the Conference:

A) Information and Knowledge Management (IKM);

B) Organizational Models and Information Systems (OMIS);

C) Software and Systems Modeling (SSM);

D) Software Systems, Architectures, Applications and Tools (SSAAT);

E) Multimedia Systems and Applications (MSA);

F) Computer Networks, Mobility and Pervasive Systems (CNMPS);

G) Intelligent and Decision Support Systems (IDSS);

H) Big Data Analytics and Applications (BDAA);

I) Human-Computer Interaction (HCI);

J) Ethics, Computers and Security (ECS);

K) Health Informatics (HIS);

L) Information Technologies in Education (ITE);

M) Technologies for Biomedical Applications (TBA);

N) Information Technologies in Radiocommunications (ITR);



TYPES of SUBMISSIONS and DECISIONS

Four types of papers can be submitted:

Full paper: Finished or consolidated R&D works, to be included in one of the 
Conference themes. These papers are assigned a 10-page limit.

Short paper: Ongoing works with relevant preliminary results, open to 
discussion. These papers are assigned a 7-page limit.

Poster paper: Initial work with relevant ideas, open to discussion. These 
papers are assigned a 4-page limit.

Company paper: Companies' papers that show practical experience, R&D, tools, 
etc., focused on some topics of the conference. These papers are assigned a 
4-page limit.

Submitted papers must comply with the format of Advances in Intelligent Systems 
and Computing Series (see Instructions for Authors at Springer Website), be 
written in English, must not have been published before, not be under review 
for any other conference or publication and not include any information leading 
to the authors’ identification. Therefore, the authors’ names, affiliations and 
bibliographic references should not be included in the version for evaluation 
by the Program Committee. This information should only be included in the 
camera-ready version, saved in Word or Latex format and also in PDF format. 
These files must be accompanied by the Consent to Publish form filled out, in a 
ZIP file, and uploaded at the conference management system.

All papers will be subjected to a “double-blind review” by at least two members 
of the Program Committee.

Based on Program Committee evaluation, a paper can be rejected or accepted by 
the Conference Chairs. In the latter case, it can be accepted as the type 
originally submitted or as another type. Thus, full papers can be accepted as 
short papers or poster papers only. Similarly, short papers can be accepted as 
poster papers only.

Poster papers and Company papers are not published in the Conference 
Proceedings, being only presented and discussed. The authors of accepted poster 
papers should build and print a poster to be exhibited during the Conference. 
This poster must follow an A1 or A2 vertical format. The Conference includes 
Work Sessions where these posters are presented and orally discussed, with a 7 
minute limit per poster.

The authors of accepted Full papers will have 15 minutes to present their work 
in a Conference Work Session; approximately 5 minutes of discussion will follow 
each presentation. The authors of accepted Short papers and Company papers will 
have 11 minutes to present their work in a Conference Work Session; 
approximately 4 minutes of discussion will follow each presentation.



PUBLICATION & INDEXING

To ensure that a full paper or short paper is published, poster paper or 
company paper is presented, at least one of the authors must be fully 
registered by the 8th of January 2022, and the paper must comply with the 
suggested layout and page-limit. Additionally, all recommended changes must be 
addressed by the authors before they submit the camera-ready version.

No more than one paper per registration will be published. An extra fee must be 
paid for publication of additional papers, with a ma


Re: [PATCH mlx5-next 1/7] RDMA/mlx5: Don't set desc_size in user mr

2021-10-12 Thread Leon Romanovsky
On Tue, Oct 12, 2021 at 11:04:33AM -0300, Jason Gunthorpe wrote:
> On Tue, Oct 12, 2021 at 04:57:16PM +0300, Aharon Landau wrote:
> > 
> > On 10/12/2021 3:52 PM, Jason Gunthorpe wrote:
> > > On Tue, Oct 12, 2021 at 01:26:29PM +0300, Leon Romanovsky wrote:
> > > > From: Aharon Landau 
> > > > 
> > > > reg_create() is used for user MRs only and should not set desc_size at
> > > > all. Attempting to set it causes the following trace:
> > > > 
> > > > BUG: unable to handle page fault for address: 0008
> > > > PGD 0 P4D 0
> > > > Oops:  [#1] SMP PTI
> > > > CPU: 5 PID: 890 Comm: ib_write_bw Not tainted 5.15.0-rc4+ #47
> > > > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
> > > > rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
> > > > RIP: 0010:mlx5_ib_dereg_mr+0x14/0x3b0 [mlx5_ib]
> > > > Code: 48 63 cd 4c 89 f7 48 89 0c 24 e8 37 30 03 e1 48 8b 0c 24 eb a0 90 
> > > > 0f 1f 44 00 00 41 56 41 55 41 54 55 53 48 89 fb 48 83 ec 30 <48> 8b 2f 
> > > > 65 48 8b 04 25 28 00 00 00 48 89 44 24 28 31 c0 8b 87 c8
> > > > RSP: 0018:88811afa3a60 EFLAGS: 00010286
> > > > RAX: 001c RBX: 0008 RCX: 
> > > > RDX:  RSI:  RDI: 0008
> > > > RBP: 0008 R08:  R09: c000f7ff
> > > > R10: 88811afa38f8 R11: 88811afa38f0 R12: a02c7ac0
> > > > R13:  R14: 88811afa3cd8 R15: 88810772fa00
> > > > FS:  7f47b9080740() GS:88852cd4() 
> > > > knlGS:
> > > > CS:  0010 DS:  ES:  CR0: 80050033
> > > > CR2: 0008 CR3: 00010761e003 CR4: 00370ea0
> > > > DR0:  DR1:  DR2: 
> > > > DR3:  DR6: fffe0ff0 DR7: 0400
> > > > Call Trace:
> > > >   mlx5_ib_free_odp_mr+0x95/0xc0 [mlx5_ib]
> > > >   mlx5_ib_dereg_mr+0x128/0x3b0 [mlx5_ib]
> > > >   ib_dereg_mr_user+0x45/0xb0 [ib_core]
> > > >   ? xas_load+0x8/0x80
> > > >   destroy_hw_idr_uobject+0x1a/0x50 [ib_uverbs]
> > > >   uverbs_destroy_uobject+0x2f/0x150 [ib_uverbs]
> > > >   uobj_destroy+0x3c/0x70 [ib_uverbs]
> > > >   ib_uverbs_cmd_verbs+0x467/0xb00 [ib_uverbs]
> > > >   ? uverbs_finalize_object+0x60/0x60 [ib_uverbs]
> > > >   ? ttwu_queue_wakelist+0xa9/0xe0
> > > >   ? pty_write+0x85/0x90
> > > >   ? file_tty_write.isra.33+0x214/0x330
> > > >   ? process_echoes+0x60/0x60
> > > >   ib_uverbs_ioctl+0xa7/0x110 [ib_uverbs]
> > > >   __x64_sys_ioctl+0x10d/0x8e0
> > > >   ? vfs_write+0x17f/0x260
> > > >   do_syscall_64+0x3c/0x80
> > > >   entry_SYSCALL_64_after_hwframe+0x44/0xae
> > > > 
> > > > Fixes: a639e66703ee ("RDMA/mlx5: Zero out ODP related items in the 
> > > > mlx5_ib_mr")
> > > Can you explain why this is crashing?
> > > 
> > > reg_create isn't used on the ODP implicit children path.
> > > 
> > > Jason
> > It is not an implicit ODP flow; therefore, mr->implicit_children shouldn't be
> > set.
> 
> It should always be initialized. That seems to be the bug here, add the
> missing xa_init as well as delete the extra desc_size set:

I would expect such a change in mlx5_ib_init_odp_mr().

> 
> diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
> index b4d2322e9ca564..46626e0fe08905 100644
> --- a/drivers/infiniband/hw/mlx5/mr.c
> +++ b/drivers/infiniband/hw/mlx5/mr.c
> @@ -1525,6 +1525,7 @@ static struct ib_mr *create_user_odp_mr(struct ib_pd 
> *pd, u64 start, u64 length,
> ib_umem_release(&odp->umem);
> return ERR_CAST(mr);
> }
> +   xa_init(&mr->implicit_children);
>  
> odp->private = mr;
> err = mlx5r_store_odp_mkey(dev, &mr->mmkey);
> 
> Jason
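
For readers following along, a condensed illustration of the failure
mode (simplified demo_* code, not the actual mlx5_ib implementation):

#include <linux/xarray.h>

struct demo_mr {
	struct xarray implicit_children;
};

static void demo_create(struct demo_mr *mr)
{
	xa_init(&mr->implicit_children);	/* must run on every create path */
}

static void demo_dereg(struct demo_mr *mr)
{
	unsigned long idx;
	void *child;

	/* dereg walks the tree unconditionally, even for MRs that can
	 * never have children; an uninitialized xarray faults here. */
	xa_for_each(&mr->implicit_children, idx, child)
		;
	xa_destroy(&mr->implicit_children);
}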
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH mlx5-next 0/7] Clean MR key use across mlx5_* modules

2021-10-12 Thread Michael S. Tsirkin
On Tue, Oct 12, 2021 at 01:26:28PM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky 
> 
> Hi,
> 
> This is a cleanup series of mlx5_* MR mkey management.
> 
> Thanks


Looks fine superficially

Acked-by: Michael S. Tsirkin 

> Aharon Landau (7):
>   RDMA/mlx5: Don't set desc_size in user mr
>   RDMA/mlx5: Remove iova from struct mlx5_core_mkey
>   RDMA/mlx5: Remove size from struct mlx5_core_mkey
>   RDMA/mlx5: Remove pd from struct mlx5_core_mkey
>   RDMA/mlx5: Replace struct mlx5_core_mkey by u32 key
>   RDMA/mlx5: Move struct mlx5_core_mkey to mlx5_ib
>   RDMA/mlx5: Attach ndescs to mlx5_ib_mkey
> 
>  drivers/infiniband/hw/mlx5/devx.c | 13 +--
>  drivers/infiniband/hw/mlx5/devx.h |  2 +-
>  drivers/infiniband/hw/mlx5/mlx5_ib.h  | 31 ---
>  drivers/infiniband/hw/mlx5/mr.c   | 82 +--
>  drivers/infiniband/hw/mlx5/odp.c  | 38 +++--
>  drivers/infiniband/hw/mlx5/wr.c   | 10 +--
>  .../mellanox/mlx5/core/diag/fw_tracer.c   |  6 +-
>  .../mellanox/mlx5/core/diag/fw_tracer.h   |  2 +-
>  .../mellanox/mlx5/core/diag/rsc_dump.c| 10 +--
>  drivers/net/ethernet/mellanox/mlx5/core/en.h  |  2 +-
>  .../net/ethernet/mellanox/mlx5/core/en/ptp.c  |  2 +-
>  .../net/ethernet/mellanox/mlx5/core/en/trap.c |  2 +-
>  .../ethernet/mellanox/mlx5/core/en_common.c   |  6 +-
>  .../net/ethernet/mellanox/mlx5/core/en_main.c | 13 ++-
>  .../ethernet/mellanox/mlx5/core/fpga/conn.c   | 10 +--
>  .../ethernet/mellanox/mlx5/core/fpga/core.h   |  2 +-
>  drivers/net/ethernet/mellanox/mlx5/core/mr.c  | 27 +++---
>  .../mellanox/mlx5/core/steering/dr_icm_pool.c | 10 +--
>  .../mellanox/mlx5/core/steering/dr_send.c | 11 ++-
>  .../mellanox/mlx5/core/steering/dr_types.h|  2 +-
>  drivers/vdpa/mlx5/core/mlx5_vdpa.h|  8 +-
>  drivers/vdpa/mlx5/core/mr.c   |  8 +-
>  drivers/vdpa/mlx5/core/resources.c| 13 +--
>  drivers/vdpa/mlx5/net/mlx5_vnet.c |  2 +-
>  include/linux/mlx5/driver.h   | 30 ++-
>  25 files changed, 147 insertions(+), 195 deletions(-)
> 
> -- 
> 2.31.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH mlx5-next 6/7] RDMA/mlx5: Move struct mlx5_core_mkey to mlx5_ib

2021-10-12 Thread Leon Romanovsky
From: Aharon Landau 

Move the mlx5_core_mkey struct to mlx5_ib, as mlx5_core doesn't use it
at this point.

Signed-off-by: Aharon Landau 
Reviewed-by: Shay Drory 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/hw/mlx5/devx.c|  2 +-
 drivers/infiniband/hw/mlx5/mlx5_ib.h | 25 +++--
 drivers/infiniband/hw/mlx5/mr.c  | 12 +---
 drivers/infiniband/hw/mlx5/odp.c |  8 
 include/linux/mlx5/driver.h  | 13 -
 5 files changed, 29 insertions(+), 31 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/devx.c 
b/drivers/infiniband/hw/mlx5/devx.c
index 465ea835f854..2778b10ffd48 100644
--- a/drivers/infiniband/hw/mlx5/devx.c
+++ b/drivers/infiniband/hw/mlx5/devx.c
@@ -1293,7 +1293,7 @@ static int devx_handle_mkey_indirect(struct devx_obj *obj,
 void *in, void *out)
 {
struct mlx5_ib_devx_mr *devx_mr = &obj->devx_mr;
-   struct mlx5_core_mkey *mkey;
+   struct mlx5_ib_mkey *mkey;
void *mkc;
u8 key;
 
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h 
b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index cf8b0653f0ce..ef6087a9f93b 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -628,6 +628,19 @@ struct mlx5_user_mmap_entry {
u32 page_idx;
 };
 
+enum mlx5_mkey_type {
+   MLX5_MKEY_MR = 1,
+   MLX5_MKEY_MW,
+   MLX5_MKEY_INDIRECT_DEVX,
+};
+
+struct mlx5_ib_mkey {
+   u32 key;
+   enum mlx5_mkey_type type;
+   struct wait_queue_head wait;
+   refcount_t usecount;
+};
+
 #define MLX5_IB_MTT_PRESENT (MLX5_IB_MTT_READ | MLX5_IB_MTT_WRITE)
 
 #define MLX5_IB_DM_MEMIC_ALLOWED_ACCESS (IB_ACCESS_LOCAL_WRITE   |\
@@ -646,7 +659,7 @@ struct mlx5_user_mmap_entry {
 
 struct mlx5_ib_mr {
struct ib_mr ibmr;
-   struct mlx5_core_mkey mmkey;
+   struct mlx5_ib_mkey mmkey;
 
/* User MR data */
struct mlx5_cache_ent *cache_ent;
@@ -722,12 +735,12 @@ static inline bool is_dmabuf_mr(struct mlx5_ib_mr *mr)
 
 struct mlx5_ib_mw {
	struct ib_mw		ibmw;
-   struct mlx5_core_mkey   mmkey;
+   struct mlx5_ib_mkey mmkey;
int ndescs;
 };
 
 struct mlx5_ib_devx_mr {
-   struct mlx5_core_mkey   mmkey;
+   struct mlx5_ib_mkey mmkey;
int ndescs;
 };
 
@@ -1605,7 +1618,7 @@ static inline bool mlx5_ib_can_reconfig_with_umr(struct 
mlx5_ib_dev *dev,
 }
 
 static inline int mlx5r_store_odp_mkey(struct mlx5_ib_dev *dev,
-  struct mlx5_core_mkey *mmkey)
+  struct mlx5_ib_mkey *mmkey)
 {
refcount_set(&mmkey->usecount, 1);
 
@@ -1614,14 +1627,14 @@ static inline int mlx5r_store_odp_mkey(struct 
mlx5_ib_dev *dev,
 }
 
 /* deref an mkey that can participate in ODP flow */
-static inline void mlx5r_deref_odp_mkey(struct mlx5_core_mkey *mmkey)
+static inline void mlx5r_deref_odp_mkey(struct mlx5_ib_mkey *mmkey)
 {
if (refcount_dec_and_test(&mmkey->usecount))
wake_up(&mmkey->wait);
 }
 
 /* deref an mkey that can participate in ODP flow and wait for release */
-static inline void mlx5r_deref_wait_odp_mkey(struct mlx5_core_mkey *mmkey)
+static inline void mlx5r_deref_wait_odp_mkey(struct mlx5_ib_mkey *mmkey)
 {
mlx5r_deref_odp_mkey(mmkey);
wait_event(mmkey->wait, refcount_read(&mmkey->usecount) == 0);
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 2260b0298965..675be5b1de9c 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -89,9 +89,8 @@ static void set_mkc_access_pd_addr_fields(void *mkc, int acc, 
u64 start_addr,
MLX5_SET64(mkc, mkc, start_addr, start_addr);
 }
 
-static void
-assign_mkey_variant(struct mlx5_ib_dev *dev, struct mlx5_core_mkey *mkey,
-   u32 *in)
+static void assign_mkey_variant(struct mlx5_ib_dev *dev,
+   struct mlx5_ib_mkey *mkey, u32 *in)
 {
u8 key = atomic_inc_return(&dev->mkey_var);
void *mkc;
@@ -101,9 +100,8 @@ assign_mkey_variant(struct mlx5_ib_dev *dev, struct 
mlx5_core_mkey *mkey,
mkey->key = key;
 }
 
-static int
-mlx5_ib_create_mkey(struct mlx5_ib_dev *dev, struct mlx5_core_mkey *mkey,
-   u32 *in, int inlen)
+static int mlx5_ib_create_mkey(struct mlx5_ib_dev *dev,
+  struct mlx5_ib_mkey *mkey, u32 *in, int inlen)
 {
int ret;
 
@@ -117,7 +115,7 @@ mlx5_ib_create_mkey(struct mlx5_ib_dev *dev, struct 
mlx5_core_mkey *mkey,
 
 static int
 mlx5_ib_create_mkey_cb(struct mlx5_ib_dev *dev,
-  struct mlx5_core_mkey *mkey,
+  struct mlx5_ib_mkey *mkey,
   struct mlx5_async_ctx *async_ctx,
   u32 *in, int inlen, u32 *out, int outlen,
   struct mlx5_async_work *conte
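
As background for the mmkey fields carried over in this patch, here is a
condensed sketch of the refcount-plus-waitqueue teardown pattern the
store/deref helpers above implement (demo_* names are illustrative,
simplified from the hunks):

#include <linux/refcount.h>
#include <linux/wait.h>

struct demo_mkey {
	refcount_t usecount;
	struct wait_queue_head wait;
};

static void demo_mkey_store(struct demo_mkey *mkey)
{
	refcount_set(&mkey->usecount, 1);	/* creator's reference */
	init_waitqueue_head(&mkey->wait);
}

static void demo_mkey_get(struct demo_mkey *mkey)
{
	refcount_inc(&mkey->usecount);		/* e.g. a page-fault handler */
}

static void demo_mkey_deref(struct demo_mkey *mkey)
{
	if (refcount_dec_and_test(&mkey->usecount))
		wake_up(&mkey->wait);		/* last user wakes the destroyer */
}

static void demo_mkey_deref_wait(struct demo_mkey *mkey)
{
	demo_mkey_deref(mkey);			/* drop the creator's reference */
	wait_event(mkey->wait, refcount_read(&mkey->usecount) == 0);
	/* now no concurrent user can still reference this mkey number */
}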

[PATCH mlx5-next 7/7] RDMA/mlx5: Attach ndescs to mlx5_ib_mkey

2021-10-12 Thread Leon Romanovsky
From: Aharon Landau 

Generalize the use of ndescs by adding it to mlx5_ib_mkey.

Signed-off-by: Aharon Landau 
Reviewed-by: Shay Drory 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/hw/mlx5/devx.c| 10 --
 drivers/infiniband/hw/mlx5/devx.h|  2 +-
 drivers/infiniband/hw/mlx5/mlx5_ib.h |  8 +---
 drivers/infiniband/hw/mlx5/mr.c  | 24 
 drivers/infiniband/hw/mlx5/odp.c | 22 ++
 drivers/infiniband/hw/mlx5/wr.c  | 10 +-
 6 files changed, 25 insertions(+), 51 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/devx.c 
b/drivers/infiniband/hw/mlx5/devx.c
index 2778b10ffd48..2baecf0367e5 100644
--- a/drivers/infiniband/hw/mlx5/devx.c
+++ b/drivers/infiniband/hw/mlx5/devx.c
@@ -1292,18 +1292,16 @@ static int devx_handle_mkey_indirect(struct devx_obj 
*obj,
 struct mlx5_ib_dev *dev,
 void *in, void *out)
 {
-   struct mlx5_ib_devx_mr *devx_mr = &obj->devx_mr;
-   struct mlx5_ib_mkey *mkey;
+   struct mlx5_ib_mkey *mkey = &obj->mkey;
void *mkc;
u8 key;
 
-   mkey = &devx_mr->mmkey;
mkc = MLX5_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry);
key = MLX5_GET(mkc, mkc, mkey_7_0);
mkey->key = mlx5_idx_to_mkey(
MLX5_GET(create_mkey_out, out, mkey_index)) | key;
mkey->type = MLX5_MKEY_INDIRECT_DEVX;
-   devx_mr->ndescs = MLX5_GET(mkc, mkc, translations_octword_size);
+   mkey->ndescs = MLX5_GET(mkc, mkc, translations_octword_size);
init_waitqueue_head(&mkey->wait);
 
return mlx5r_store_odp_mkey(dev, mkey);
@@ -1381,13 +1379,13 @@ static int devx_obj_cleanup(struct ib_uobject *uobject,
dev = mlx5_udata_to_mdev(&attrs->driver_udata);
if (obj->flags & DEVX_OBJ_FLAGS_INDIRECT_MKEY &&
xa_erase(&obj->ib_dev->odp_mkeys,
-mlx5_base_mkey(obj->devx_mr.mmkey.key)))
+mlx5_base_mkey(obj->mkey.key)))
/*
 * The pagefault_single_data_segment() does commands against
 * the mmkey, we must wait for that to stop before freeing the
 * mkey, as another allocation could get the same mkey #.
 */
-   mlx5r_deref_wait_odp_mkey(&obj->devx_mr.mmkey);
+   mlx5r_deref_wait_odp_mkey(&obj->mkey);
 
if (obj->flags & DEVX_OBJ_FLAGS_DCT)
ret = mlx5_core_destroy_dct(obj->ib_dev, &obj->core_dct);
diff --git a/drivers/infiniband/hw/mlx5/devx.h 
b/drivers/infiniband/hw/mlx5/devx.h
index 1f69866aed16..ee2213275fd6 100644
--- a/drivers/infiniband/hw/mlx5/devx.h
+++ b/drivers/infiniband/hw/mlx5/devx.h
@@ -16,7 +16,7 @@ struct devx_obj {
u32 dinbox[MLX5_MAX_DESTROY_INBOX_SIZE_DW];
u32 flags;
union {
-   struct mlx5_ib_devx_mr  devx_mr;
+   struct mlx5_ib_mkey mkey;
	struct mlx5_core_dct	core_dct;
struct mlx5_core_cq core_cq;
u32 flow_counter_bulk_size;
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h 
b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index ef6087a9f93b..ed173af8ae75 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -637,6 +637,7 @@ enum mlx5_mkey_type {
 struct mlx5_ib_mkey {
u32 key;
enum mlx5_mkey_type type;
+   int ndescs;
struct wait_queue_head wait;
refcount_t usecount;
 };
@@ -681,7 +682,6 @@ struct mlx5_ib_mr {
void *descs_alloc;
dma_addr_t desc_map;
int max_descs;
-   int ndescs;
int desc_size;
int access_mode;
 
@@ -736,12 +736,6 @@ static inline bool is_dmabuf_mr(struct mlx5_ib_mr *mr)
 struct mlx5_ib_mw {
	struct ib_mw		ibmw;
struct mlx5_ib_mkey mmkey;
-   int ndescs;
-};
-
-struct mlx5_ib_devx_mr {
-   struct mlx5_ib_mkey mmkey;
-   int ndescs;
 };
 
 struct mlx5_ib_umr_context {
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 675be5b1de9c..0c4a5ac2544c 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -2335,7 +2335,7 @@ int mlx5_ib_alloc_mw(struct ib_mw *ibmw, struct ib_udata 
*udata)
 
mw->mmkey.type = MLX5_MKEY_MW;
ibmw->rkey = mw->mmkey.key;
-   mw->ndescs = ndescs;
+   mw->mmkey.ndescs = ndescs;
 
resp.response_length =
min(offsetofend(typeof(resp), response_length), udata->outlen);
@@ -2431,7 +2431,7 @@ mlx5_ib_map_pa_mr_sg_pi(struct ib_mr *ibmr, struct 
scatterlist *data_sg,
mr->meta_length = 0;
if (data_sg_nents == 1) {

[PATCH mlx5-next 5/7] RDMA/mlx5: Replace struct mlx5_core_mkey by u32 key

2021-10-12 Thread Leon Romanovsky
From: Aharon Landau 

In mlx5_core and vdpa there is no use of mlx5_core_mkey members except
for the key itself.

As preparation for moving mlx5_core_mkey to mlx5_ib, the occurrences of
struct mlx5_core_mkey in all modules except for mlx5_ib are replaced by
a u32 key.

Signed-off-by: Aharon Landau 
Reviewed-by: Shay Drory 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/hw/mlx5/mr.c   | 21 --
 drivers/infiniband/hw/mlx5/odp.c  |  2 +-
 .../mellanox/mlx5/core/diag/fw_tracer.c   |  6 ++---
 .../mellanox/mlx5/core/diag/fw_tracer.h   |  2 +-
 .../mellanox/mlx5/core/diag/rsc_dump.c| 10 -
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  2 +-
 .../net/ethernet/mellanox/mlx5/core/en/ptp.c  |  2 +-
 .../net/ethernet/mellanox/mlx5/core/en/trap.c |  2 +-
 .../ethernet/mellanox/mlx5/core/en_common.c   |  6 ++---
 .../net/ethernet/mellanox/mlx5/core/en_main.c | 13 +--
 .../ethernet/mellanox/mlx5/core/fpga/conn.c   | 10 -
 .../ethernet/mellanox/mlx5/core/fpga/core.h   |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/mr.c  | 22 +--
 .../mellanox/mlx5/core/steering/dr_icm_pool.c | 10 -
 .../mellanox/mlx5/core/steering/dr_send.c | 11 +-
 .../mellanox/mlx5/core/steering/dr_types.h|  2 +-
 drivers/vdpa/mlx5/core/mlx5_vdpa.h|  8 +++
 drivers/vdpa/mlx5/core/mr.c   |  8 +++
 drivers/vdpa/mlx5/core/resources.c|  8 +++
 drivers/vdpa/mlx5/net/mlx5_vnet.c |  2 +-
 include/linux/mlx5/driver.h   | 14 +---
 21 files changed, 82 insertions(+), 81 deletions(-)
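
For orientation, the net effect on the mlx5_core entry points is roughly
the following; these prototypes are inferred from the callers changed
below, so treat them as a reading aid rather than authoritative headers:

/* Before the series: mlx5_core traffics in the whole struct */
int mlx5_core_create_mkey(struct mlx5_core_dev *dev,
			  struct mlx5_core_mkey *mkey, u32 *in, int inlen);

/* After this patch: only the raw 32-bit key crosses the module boundary */
int mlx5_core_create_mkey(struct mlx5_core_dev *dev, u32 *mkey,
			  u32 *in, int inlen);
int mlx5_core_destroy_mkey(struct mlx5_core_dev *dev, u32 mkey);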

diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 1450c2f0da00..2260b0298965 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -105,8 +105,14 @@ static int
 mlx5_ib_create_mkey(struct mlx5_ib_dev *dev, struct mlx5_core_mkey *mkey,
u32 *in, int inlen)
 {
+   int ret;
+
assign_mkey_variant(dev, mkey, in);
-   return mlx5_core_create_mkey(dev->mdev, mkey, in, inlen);
+   ret = mlx5_core_create_mkey(dev->mdev, &mkey->key, in, inlen);
+   if (!ret)
+   init_waitqueue_head(&mkey->wait);
+
+   return ret;
 }
 
 static int
@@ -134,7 +140,7 @@ static int destroy_mkey(struct mlx5_ib_dev *dev, struct 
mlx5_ib_mr *mr)
 {
WARN_ON(xa_load(&dev->odp_mkeys, mlx5_base_mkey(mr->mmkey.key)));
 
-   return mlx5_core_destroy_mkey(dev->mdev, &mr->mmkey);
+   return mlx5_core_destroy_mkey(dev->mdev, mr->mmkey.key);
 }
 
 static void create_mkey_callback(int status, struct mlx5_async_work *context)
@@ -261,10 +267,11 @@ static struct mlx5_ib_mr *create_cache_mr(struct 
mlx5_cache_ent *ent)
goto free_in;
}
 
-   err = mlx5_core_create_mkey(ent->dev->mdev, &mr->mmkey, in, inlen);
+   err = mlx5_core_create_mkey(ent->dev->mdev, &mr->mmkey.key, in, inlen);
if (err)
goto free_mr;
 
+   init_waitqueue_head(&mr->mmkey.wait);
mr->mmkey.type = MLX5_MKEY_MR;
WRITE_ONCE(ent->dev->cache.last_add, jiffies);
spin_lock_irq(&ent->lock);
@@ -291,7 +298,7 @@ static void remove_cache_mr_locked(struct mlx5_cache_ent 
*ent)
ent->available_mrs--;
ent->total_mrs--;
spin_unlock_irq(&ent->lock);
-   mlx5_core_destroy_mkey(ent->dev->mdev, &mr->mmkey);
+   mlx5_core_destroy_mkey(ent->dev->mdev, mr->mmkey.key);
kfree(mr);
spin_lock_irq(&ent->lock);
 }
@@ -651,7 +658,7 @@ static void clean_keys(struct mlx5_ib_dev *dev, int c)
ent->available_mrs--;
ent->total_mrs--;
spin_unlock_irq(&ent->lock);
-   mlx5_core_destroy_mkey(dev->mdev, &mr->mmkey);
+   mlx5_core_destroy_mkey(dev->mdev, mr->mmkey.key);
}
 
list_for_each_entry_safe(mr, tmp_mr, &del_list, list) {
@@ -2350,7 +2357,7 @@ int mlx5_ib_alloc_mw(struct ib_mw *ibmw, struct ib_udata 
*udata)
return 0;
 
 free_mkey:
-   mlx5_core_destroy_mkey(dev->mdev, &mw->mmkey);
+   mlx5_core_destroy_mkey(dev->mdev, mw->mmkey.key);
 free:
kfree(in);
return err;
@@ -2369,7 +2376,7 @@ int mlx5_ib_dealloc_mw(struct ib_mw *mw)
 */
mlx5r_deref_wait_odp_mkey(&mmw->mmkey);
 
-   return mlx5_core_destroy_mkey(dev->mdev, &mmw->mmkey);
+   return mlx5_core_destroy_mkey(dev->mdev, mmw->mmkey.key);
 }
 
 int mlx5_ib_check_mr_status(struct ib_mr *ibmr, u32 check_mask,
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index 3b24cfc96d36..8ffd3ea8db7c 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -909,7 +909,7 @@ static int pagefault_single_data_segment(struct mlx5_ib_dev 
*dev,
pklm = (struct mlx5_klm *)MLX5_ADDR_OF(query_mkey_out, out,
  

[PATCH mlx5-next 3/7] RDMA/mlx5: Remove size from struct mlx5_core_mkey

2021-10-12 Thread Leon Romanovsky
From: Aharon Landau 

mkey->size is already stored in ibmr->length, no need to store it here.

Signed-off-by: Aharon Landau 
Reviewed-by: Shay Drory 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/hw/mlx5/devx.c| 1 -
 drivers/infiniband/hw/mlx5/mr.c  | 5 ++---
 drivers/net/ethernet/mellanox/mlx5/core/mr.c | 1 -
 drivers/vdpa/mlx5/core/resources.c   | 1 -
 include/linux/mlx5/driver.h  | 1 -
 5 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/devx.c 
b/drivers/infiniband/hw/mlx5/devx.c
index 439b62e23afb..3d416850bba8 100644
--- a/drivers/infiniband/hw/mlx5/devx.c
+++ b/drivers/infiniband/hw/mlx5/devx.c
@@ -1303,7 +1303,6 @@ static int devx_handle_mkey_indirect(struct devx_obj *obj,
mkey->key = mlx5_idx_to_mkey(
MLX5_GET(create_mkey_out, out, mkey_index)) | key;
mkey->type = MLX5_MKEY_INDIRECT_DEVX;
-   mkey->size = MLX5_GET64(mkc, mkc, len);
mkey->pd = MLX5_GET(mkc, mkc, pd);
devx_mr->ndescs = MLX5_GET(mkc, mkc, translations_octword_size);
init_waitqueue_head(&mkey->wait);
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index e264c6be38f8..6f831598b987 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -968,7 +968,6 @@ static struct mlx5_ib_mr *alloc_cacheable_mr(struct ib_pd 
*pd,
 
mr->ibmr.pd = pd;
mr->umem = umem;
-   mr->mmkey.size = umem->length;
mr->mmkey.pd = to_mpd(pd)->pdn;
mr->page_shift = order_base_2(page_size);
set_mr_fields(dev, mr, umem->length, access_flags, iova);
@@ -1080,7 +1079,7 @@ static void *mlx5_ib_create_xlt_wr(struct mlx5_ib_mr *mr,
wr->wr.opcode = MLX5_IB_WR_UMR;
wr->pd = mr->ibmr.pd;
wr->mkey = mr->mmkey.key;
-   wr->length = mr->mmkey.size;
+   wr->length = mr->ibmr.length;
wr->virt_addr = mr->ibmr.iova;
wr->access_flags = mr->access_flags;
wr->page_shift = mr->page_shift;
@@ -1768,7 +1767,7 @@ static int umr_rereg_pas(struct mlx5_ib_mr *mr, struct 
ib_pd *pd,
 
mr->ibmr.length = new_umem->length;
mr->ibmr.iova = iova;
-   mr->mmkey.size = new_umem->length;
+   mr->ibmr.length = new_umem->length;
mr->page_shift = order_base_2(page_size);
mr->umem = new_umem;
err = mlx5_ib_update_mr_pas(mr, upd_flags);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mr.c 
b/drivers/net/ethernet/mellanox/mlx5/core/mr.c
index d239d559994f..b5dd44944265 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/mr.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/mr.c
@@ -52,7 +52,6 @@ int mlx5_core_create_mkey(struct mlx5_core_dev *dev,
 
mkc = MLX5_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry);
mkey_index = MLX5_GET(create_mkey_out, lout, mkey_index);
-   mkey->size = MLX5_GET64(mkc, mkc, len);
mkey->key = (u32)mlx5_mkey_variant(mkey->key) | 
mlx5_idx_to_mkey(mkey_index);
mkey->pd = MLX5_GET(mkc, mkc, pd);
init_waitqueue_head(&mkey->wait);
diff --git a/drivers/vdpa/mlx5/core/resources.c 
b/drivers/vdpa/mlx5/core/resources.c
index 14d4314cdc29..d3d8b8b4e377 100644
--- a/drivers/vdpa/mlx5/core/resources.c
+++ b/drivers/vdpa/mlx5/core/resources.c
@@ -215,7 +215,6 @@ int mlx5_vdpa_create_mkey(struct mlx5_vdpa_dev *mvdev, 
struct mlx5_core_mkey *mk
 
mkc = MLX5_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry);
mkey_index = MLX5_GET(create_mkey_out, lout, mkey_index);
-   mkey->size = MLX5_GET64(mkc, mkc, len);
mkey->key |= mlx5_idx_to_mkey(mkey_index);
mkey->pd = MLX5_GET(mkc, mkc, pd);
return 0;
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 573c0cbb0879..cba157689ca8 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -364,7 +364,6 @@ enum {
 };
 
 struct mlx5_core_mkey {
-   u64 size;
u32 key;
u32 pd;
u32 type;
-- 
2.31.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH mlx5-next 4/7] RDMA/mlx5: Remove pd from struct mlx5_core_mkey

2021-10-12 Thread Leon Romanovsky
From: Aharon Landau 

There is no read of mkey->pd, only write. Remove it.

Signed-off-by: Aharon Landau 
Reviewed-by: Shay Drory 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/hw/mlx5/devx.c| 1 -
 drivers/infiniband/hw/mlx5/mr.c  | 3 ---
 drivers/net/ethernet/mellanox/mlx5/core/mr.c | 3 ---
 drivers/vdpa/mlx5/core/resources.c   | 3 ---
 include/linux/mlx5/driver.h  | 1 -
 5 files changed, 11 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/devx.c 
b/drivers/infiniband/hw/mlx5/devx.c
index 3d416850bba8..465ea835f854 100644
--- a/drivers/infiniband/hw/mlx5/devx.c
+++ b/drivers/infiniband/hw/mlx5/devx.c
@@ -1303,7 +1303,6 @@ static int devx_handle_mkey_indirect(struct devx_obj *obj,
mkey->key = mlx5_idx_to_mkey(
MLX5_GET(create_mkey_out, out, mkey_index)) | key;
mkey->type = MLX5_MKEY_INDIRECT_DEVX;
-   mkey->pd = MLX5_GET(mkc, mkc, pd);
devx_mr->ndescs = MLX5_GET(mkc, mkc, translations_octword_size);
init_waitqueue_head(&mkey->wait);
 
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 6f831598b987..1450c2f0da00 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -968,7 +968,6 @@ static struct mlx5_ib_mr *alloc_cacheable_mr(struct ib_pd 
*pd,
 
mr->ibmr.pd = pd;
mr->umem = umem;
-   mr->mmkey.pd = to_mpd(pd)->pdn;
mr->page_shift = order_base_2(page_size);
set_mr_fields(dev, mr, umem->length, access_flags, iova);
 
@@ -1712,7 +1711,6 @@ static int umr_rereg_pd_access(struct mlx5_ib_mr *mr, 
struct ib_pd *pd,
return err;
 
mr->access_flags = access_flags;
-   mr->mmkey.pd = to_mpd(pd)->pdn;
return 0;
 }
 
@@ -1757,7 +1755,6 @@ static int umr_rereg_pas(struct mlx5_ib_mr *mr, struct 
ib_pd *pd,
 
if (flags & IB_MR_REREG_PD) {
mr->ibmr.pd = pd;
-   mr->mmkey.pd = to_mpd(pd)->pdn;
upd_flags |= MLX5_IB_UPD_XLT_PD;
}
if (flags & IB_MR_REREG_ACCESS) {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mr.c 
b/drivers/net/ethernet/mellanox/mlx5/core/mr.c
index b5dd44944265..6e99fd166f98 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/mr.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/mr.c
@@ -41,7 +41,6 @@ int mlx5_core_create_mkey(struct mlx5_core_dev *dev,
 {
u32 lout[MLX5_ST_SZ_DW(create_mkey_out)] = {};
u32 mkey_index;
-   void *mkc;
int err;
 
MLX5_SET(create_mkey_in, in, opcode, MLX5_CMD_OP_CREATE_MKEY);
@@ -50,10 +49,8 @@ int mlx5_core_create_mkey(struct mlx5_core_dev *dev,
if (err)
return err;
 
-   mkc = MLX5_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry);
mkey_index = MLX5_GET(create_mkey_out, lout, mkey_index);
mkey->key = (u32)mlx5_mkey_variant(mkey->key) | 
mlx5_idx_to_mkey(mkey_index);
-   mkey->pd = MLX5_GET(mkc, mkc, pd);
init_waitqueue_head(&mkey->wait);
 
mlx5_core_dbg(dev, "out 0x%x, mkey 0x%x\n", mkey_index, mkey->key);
diff --git a/drivers/vdpa/mlx5/core/resources.c 
b/drivers/vdpa/mlx5/core/resources.c
index d3d8b8b4e377..72b2d80e75b0 100644
--- a/drivers/vdpa/mlx5/core/resources.c
+++ b/drivers/vdpa/mlx5/core/resources.c
@@ -203,7 +203,6 @@ int mlx5_vdpa_create_mkey(struct mlx5_vdpa_dev *mvdev, 
struct mlx5_core_mkey *mk
 {
u32 lout[MLX5_ST_SZ_DW(create_mkey_out)] = {};
u32 mkey_index;
-   void *mkc;
int err;
 
MLX5_SET(create_mkey_in, in, opcode, MLX5_CMD_OP_CREATE_MKEY);
@@ -213,10 +212,8 @@ int mlx5_vdpa_create_mkey(struct mlx5_vdpa_dev *mvdev, 
struct mlx5_core_mkey *mk
if (err)
return err;
 
-   mkc = MLX5_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry);
mkey_index = MLX5_GET(create_mkey_out, lout, mkey_index);
mkey->key |= mlx5_idx_to_mkey(mkey_index);
-   mkey->pd = MLX5_GET(mkc, mkc, pd);
return 0;
 }
 
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index cba157689ca8..31e19f06bc3a 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -365,7 +365,6 @@ enum {
 
 struct mlx5_core_mkey {
u32 key;
-   u32 pd;
u32 type;
struct wait_queue_head wait;
refcount_t usecount;
-- 
2.31.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH mlx5-next 2/7] RDMA/mlx5: Remove iova from struct mlx5_core_mkey

2021-10-12 Thread Leon Romanovsky
From: Aharon Landau 

iova is already stored in ibmr->iova, no need to store it here.

Signed-off-by: Aharon Landau 
Reviewed-by: Shay Drory 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/hw/mlx5/devx.c|  1 -
 drivers/infiniband/hw/mlx5/mr.c  | 16 
 drivers/infiniband/hw/mlx5/odp.c |  8 
 drivers/net/ethernet/mellanox/mlx5/core/mr.c |  1 -
 drivers/vdpa/mlx5/core/resources.c   |  1 -
 include/linux/mlx5/driver.h  |  1 -
 6 files changed, 12 insertions(+), 16 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/devx.c 
b/drivers/infiniband/hw/mlx5/devx.c
index 7c920be2921d..439b62e23afb 100644
--- a/drivers/infiniband/hw/mlx5/devx.c
+++ b/drivers/infiniband/hw/mlx5/devx.c
@@ -1303,7 +1303,6 @@ static int devx_handle_mkey_indirect(struct devx_obj *obj,
mkey->key = mlx5_idx_to_mkey(
MLX5_GET(create_mkey_out, out, mkey_index)) | key;
mkey->type = MLX5_MKEY_INDIRECT_DEVX;
-   mkey->iova = MLX5_GET64(mkc, mkc, start_addr);
mkey->size = MLX5_GET64(mkc, mkc, len);
mkey->pd = MLX5_GET(mkc, mkc, pd);
devx_mr->ndescs = MLX5_GET(mkc, mkc, translations_octword_size);
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 0ffe7f5b77b6..e264c6be38f8 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -904,12 +904,13 @@ static struct mlx5_cache_ent 
*mr_cache_ent_from_order(struct mlx5_ib_dev *dev,
 }
 
 static void set_mr_fields(struct mlx5_ib_dev *dev, struct mlx5_ib_mr *mr,
- u64 length, int access_flags)
+ u64 length, int access_flags, u64 iova)
 {
mr->ibmr.lkey = mr->mmkey.key;
mr->ibmr.rkey = mr->mmkey.key;
mr->ibmr.length = length;
mr->ibmr.device = &dev->ib_dev;
+   mr->ibmr.iova = iova;
mr->access_flags = access_flags;
 }
 
@@ -967,11 +968,10 @@ static struct mlx5_ib_mr *alloc_cacheable_mr(struct ib_pd 
*pd,
 
mr->ibmr.pd = pd;
mr->umem = umem;
-   mr->mmkey.iova = iova;
mr->mmkey.size = umem->length;
mr->mmkey.pd = to_mpd(pd)->pdn;
mr->page_shift = order_base_2(page_size);
-   set_mr_fields(dev, mr, umem->length, access_flags);
+   set_mr_fields(dev, mr, umem->length, access_flags, iova);
 
return mr;
 }
@@ -1081,7 +1081,7 @@ static void *mlx5_ib_create_xlt_wr(struct mlx5_ib_mr *mr,
wr->pd = mr->ibmr.pd;
wr->mkey = mr->mmkey.key;
wr->length = mr->mmkey.size;
-   wr->virt_addr = mr->mmkey.iova;
+   wr->virt_addr = mr->ibmr.iova;
wr->access_flags = mr->access_flags;
wr->page_shift = mr->page_shift;
wr->xlt_size = sg->length;
@@ -1336,7 +1336,7 @@ static struct mlx5_ib_mr *reg_create(struct ib_pd *pd, 
struct ib_umem *umem,
}
mr->mmkey.type = MLX5_MKEY_MR;
mr->umem = umem;
-   set_mr_fields(dev, mr, umem->length, access_flags);
+   set_mr_fields(dev, mr, umem->length, access_flags, iova);
kvfree(in);
 
mlx5_ib_dbg(dev, "mkey = 0x%x\n", mr->mmkey.key);
@@ -1383,7 +1383,7 @@ static struct ib_mr *mlx5_ib_get_dm_mr(struct ib_pd *pd, 
u64 start_addr,
 
kfree(in);
 
-   set_mr_fields(dev, mr, length, acc);
+   set_mr_fields(dev, mr, length, acc, start_addr);
 
return &mr->ibmr;
 
@@ -1767,7 +1767,7 @@ static int umr_rereg_pas(struct mlx5_ib_mr *mr, struct 
ib_pd *pd,
}
 
mr->ibmr.length = new_umem->length;
-   mr->mmkey.iova = iova;
+   mr->ibmr.iova = iova;
mr->mmkey.size = new_umem->length;
mr->page_shift = order_base_2(page_size);
mr->umem = new_umem;
@@ -1847,7 +1847,7 @@ struct ib_mr *mlx5_ib_rereg_user_mr(struct ib_mr *ib_mr, 
int flags, u64 start,
mr->umem = NULL;
atomic_sub(ib_umem_num_pages(umem), &dev->mdev->priv.reg_pages);
 
-   return create_real_mr(new_pd, umem, mr->mmkey.iova,
+   return create_real_mr(new_pd, umem, mr->ibmr.iova,
  new_access_flags);
}
 
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index 77890a85fc2d..3b24cfc96d36 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -430,7 +430,7 @@ static struct mlx5_ib_mr *implicit_get_child_mr(struct 
mlx5_ib_mr *imr,
mr->umem = &odp->umem;
mr->ibmr.lkey = mr->mmkey.key;
mr->ibmr.rkey = mr->mmkey.key;
-   mr->mmkey.iova = idx * MLX5_IMR_MTT_SIZE;
+   mr->ibmr.iova = idx * MLX5_IMR_MTT_SIZE;
mr->parent = imr;
odp->private = mr;
 
@@ -500,7 +500,7 @@ struct mlx5_ib_mr *mlx5_ib_alloc_implicit_mr(struct 
mlx5_ib_pd *pd,
}
 
imr->ibmr.pd = &pd->ibpd;
-   imr->mmkey.iova = 0;
+   imr->ibmr.iova = 0;
imr->umem = &umem_odp->umem;
imr->ibmr.lkey = imr->mmkey.key;

[PATCH mlx5-next 1/7] RDMA/mlx5: Don't set desc_size in user mr

2021-10-12 Thread Leon Romanovsky
From: Aharon Landau 

reg_create() is used for user MRs only and should not set desc_size at
all. Attempting to set it causes the following trace:

BUG: unable to handle page fault for address: 0008
PGD 0 P4D 0
Oops:  [#1] SMP PTI
CPU: 5 PID: 890 Comm: ib_write_bw Not tainted 5.15.0-rc4+ #47
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:mlx5_ib_dereg_mr+0x14/0x3b0 [mlx5_ib]
Code: 48 63 cd 4c 89 f7 48 89 0c 24 e8 37 30 03 e1 48 8b 0c 24 eb a0 90 0f 1f 
44 00 00 41 56 41 55 41 54 55 53 48 89 fb 48 83 ec 30 <48> 8b 2f 65 48 8b 04 25 
28 00 00 00 48 89 44 24 28 31 c0 8b 87 c8
RSP: 0018:88811afa3a60 EFLAGS: 00010286
RAX: 001c RBX: 0008 RCX: 
RDX:  RSI:  RDI: 0008
RBP: 0008 R08:  R09: c000f7ff
R10: 88811afa38f8 R11: 88811afa38f0 R12: a02c7ac0
R13:  R14: 88811afa3cd8 R15: 88810772fa00
FS:  7f47b9080740() GS:88852cd4() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 0008 CR3: 00010761e003 CR4: 00370ea0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
 mlx5_ib_free_odp_mr+0x95/0xc0 [mlx5_ib]
 mlx5_ib_dereg_mr+0x128/0x3b0 [mlx5_ib]
 ib_dereg_mr_user+0x45/0xb0 [ib_core]
 ? xas_load+0x8/0x80
 destroy_hw_idr_uobject+0x1a/0x50 [ib_uverbs]
 uverbs_destroy_uobject+0x2f/0x150 [ib_uverbs]
 uobj_destroy+0x3c/0x70 [ib_uverbs]
 ib_uverbs_cmd_verbs+0x467/0xb00 [ib_uverbs]
 ? uverbs_finalize_object+0x60/0x60 [ib_uverbs]
 ? ttwu_queue_wakelist+0xa9/0xe0
 ? pty_write+0x85/0x90
 ? file_tty_write.isra.33+0x214/0x330
 ? process_echoes+0x60/0x60
 ib_uverbs_ioctl+0xa7/0x110 [ib_uverbs]
 __x64_sys_ioctl+0x10d/0x8e0
 ? vfs_write+0x17f/0x260
 do_syscall_64+0x3c/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Fixes: a639e66703ee ("RDMA/mlx5: Zero out ODP related items in the mlx5_ib_mr")
Signed-off-by: Aharon Landau 
Reviewed-by: Maor Gottlieb 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/hw/mlx5/mr.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 9091de381e83..0ffe7f5b77b6 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -1335,7 +1335,6 @@ static struct mlx5_ib_mr *reg_create(struct ib_pd *pd, 
struct ib_umem *umem,
goto err_2;
}
mr->mmkey.type = MLX5_MKEY_MR;
-   mr->desc_size = sizeof(struct mlx5_mtt);
mr->umem = umem;
set_mr_fields(dev, mr, umem->length, access_flags);
kvfree(in);
-- 
2.31.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH mlx5-next 0/7] Clean MR key use across mlx5_* modules

2021-10-12 Thread Leon Romanovsky
From: Leon Romanovsky 

Hi,

This is a cleanup series of mlx5_* MR mkey management.

Thanks

Aharon Landau (7):
  RDMA/mlx5: Don't set desc_size in user mr
  RDMA/mlx5: Remove iova from struct mlx5_core_mkey
  RDMA/mlx5: Remove size from struct mlx5_core_mkey
  RDMA/mlx5: Remove pd from struct mlx5_core_mkey
  RDMA/mlx5: Replace struct mlx5_core_mkey by u32 key
  RDMA/mlx5: Move struct mlx5_core_mkey to mlx5_ib
  RDMA/mlx5: Attach ndescs to mlx5_ib_mkey

 drivers/infiniband/hw/mlx5/devx.c | 13 +--
 drivers/infiniband/hw/mlx5/devx.h |  2 +-
 drivers/infiniband/hw/mlx5/mlx5_ib.h  | 31 ---
 drivers/infiniband/hw/mlx5/mr.c   | 82 +--
 drivers/infiniband/hw/mlx5/odp.c  | 38 +++--
 drivers/infiniband/hw/mlx5/wr.c   | 10 +--
 .../mellanox/mlx5/core/diag/fw_tracer.c   |  6 +-
 .../mellanox/mlx5/core/diag/fw_tracer.h   |  2 +-
 .../mellanox/mlx5/core/diag/rsc_dump.c| 10 +--
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  2 +-
 .../net/ethernet/mellanox/mlx5/core/en/ptp.c  |  2 +-
 .../net/ethernet/mellanox/mlx5/core/en/trap.c |  2 +-
 .../ethernet/mellanox/mlx5/core/en_common.c   |  6 +-
 .../net/ethernet/mellanox/mlx5/core/en_main.c | 13 ++-
 .../ethernet/mellanox/mlx5/core/fpga/conn.c   | 10 +--
 .../ethernet/mellanox/mlx5/core/fpga/core.h   |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/mr.c  | 27 +++---
 .../mellanox/mlx5/core/steering/dr_icm_pool.c | 10 +--
 .../mellanox/mlx5/core/steering/dr_send.c | 11 ++-
 .../mellanox/mlx5/core/steering/dr_types.h|  2 +-
 drivers/vdpa/mlx5/core/mlx5_vdpa.h|  8 +-
 drivers/vdpa/mlx5/core/mr.c   |  8 +-
 drivers/vdpa/mlx5/core/resources.c| 13 +--
 drivers/vdpa/mlx5/net/mlx5_vnet.c |  2 +-
 include/linux/mlx5/driver.h   | 30 ++-
 25 files changed, 147 insertions(+), 195 deletions(-)

-- 
2.31.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 0/9] More virtio hardening

2021-10-12 Thread Jason Wang
On Tue, Oct 12, 2021 at 3:03 PM Michael S. Tsirkin  wrote:
>
> On Tue, Oct 12, 2021 at 02:43:13PM +0800, Jason Wang wrote:
> > On Tue, Oct 12, 2021 at 2:35 PM Michael S. Tsirkin  wrote:
> > >
> > > On Tue, Oct 12, 2021 at 02:11:10PM +0800, Jason Wang wrote:
> > > > On Tue, Oct 12, 2021 at 1:44 PM Michael S. Tsirkin  
> > > > wrote:
> > > > >
> > > > > On Tue, Oct 12, 2021 at 10:43:57AM +0800, Jason Wang wrote:
> > > > > > On Mon, Oct 11, 2021 at 8:36 PM Michael S. Tsirkin 
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Mon, Oct 11, 2021 at 03:36:51PM +0800, Jason Wang wrote:
> > > > > > > > On Tue, Oct 5, 2021 at 3:42 PM Michael S. Tsirkin 
> > > > > > > >  wrote:
> > > > > > > > >
> > > > > > > > > On Mon, Sep 13, 2021 at 01:53:44PM +0800, Jason Wang wrote:
> > > > > > > > > > Hi All:
> > > > > > > > > >
> > > > > > > > > > This series tries to do more hardening for virtio.
> > > > > > > > > >
> > > > > > > > > > patch 1 validates the num_queues for the virtio-blk device.
> > > > > > > > > > patch 2-4 validates max_nr_ports for the virtio-console device.
> > > > > > > > > > patch 5-7 harden virtio-pci interrupts to make sure no
> > > > > > > > > > unexpected interrupt handler is triggered. If this makes
> > > > > > > > > > sense we can do similar things in other transport drivers.
> > > > > > > > > > patch 8-9 validate used ring length.
> > > > > > > > > >
> > > > > > > > > > Smoke tested on blk/net with packed=on/off and
> > > > > > > > > > iommu_platform=on/off.
> > > > > > > > > >
> > > > > > > > > > Please review.
> > > > > > > > > >
> > > > > > > > > > Thanks
> > > > > > > > >
> > > > > > > > > So I poked at console at least, and I think I see
> > > > > > > > > an issue: if interrupt handler queues a work/bh,
> > > > > > > > > then it can still run while reset is in progress.
> > > > > > > >
> > > > > > > > Looks like a bug which is unrelated to the hardening?
> > > > > > >
> > > > > > > Won't preventing use after free be relevant?
> > > > > >
> > > > > > Oh right.
> > > > > >
> > > > > > > I frankly don't know what hardening means then.
> > > > > > > > E.g the driver
> > > > > > > > should sync with work/bh before reset.
> > > > > > >
> > > > > > > No, there's no way to fix it ATM without extra locks and state 
> > > > > > > which I
> > > > > > > think we should strive to avoid or make it generic, not 
> > > > > > > per-driver,
> > > > > > > since sync before reset is useless, new interrupts will just 
> > > > > > > arrive and
> > > > > > > queue more work. And a sync after reset is too late since driver 
> > > > > > > will
> > > > > > > try to add buffers.
> > > > > >
> > > > > > Can we do something like
> > > > > >
> > > > > > 1) disable interrupt
> > > > > > 2) sync bh
> > > > > >
> > > > > > Or I guess this is somehow you meant in the following steps.
> > > > >
> > > > > So that would mean a new API to disable vq interrupts.
> > > > > reset will re-enable.
> > > > > E.g. virtqueue_cancel_cb_before_reset()?
> > > > >
> > > > > Then drivers can sync, then reset.
> > > > > This means maintaining more state though, which I don't like.
> > > > >
> > > > > An alternative is something like this:
> > > > >
> > > > > static void (*virtio_flush_device)(struct virtio_device *dev);
> > > > >
> > > > > void virtio_reset_device(struct virtio_device *dev, 
> > > > > virtio_flush_device flush)
> > > > > {
> > > > > might_sleep();
> > > > > if (flush) {
> > > > > dev->config->disable_interrupts(dev);
> > > > > flush(dev);
> > > > > dev->config->reset(dev);
> > > > > dev->config->enable_interrupts(dev);
> > > >
> > > > I wonder whether this is needed. As done in this series,
> > > > enable_interrupt should be done in virtio_device_ready().
> > > >
> > > > Others should work.
> > > >
> > > > > } else {
> > > > > dev->config->reset(dev);
> > > > > }
> > > > > }
> > > > >
> > > > > I have patches wrapping all reset calls in virtio_reset_device
> > > > > (without the flush parameter but that's easy to tweak).
> > > >
> > > > Does it work if I post V2 and you post those patches on top?
> > >
> > > The reset things? Sure.
> >
> > Ok.
> >
> > >
> > > > >
> > > > >
> > > > > > >
> > > > > > > Maybe we can break device. Two issues with that
> > > > > > > - drivers might not be ready to handle add_buf failures
> > > > > > > - restore needs to unbreak then and we don't have a way to do 
> > > > > > > that yet
> > > > > > >
> > > > > > > So .. careful reading of all device drivers and hoping we don't 
> > > > > > > mess
> > > > > > > things up even more ... here we come.
> > > > > >
> > > > > > Yes.
> > > > >
> > > > > The biggest issue with this trick is drivers not handling add_buf
> > > > > errors; adding a failure path here risks creating memory leaks.
> > > > > OTOH with e.g. bounce buffers maybe it's possible for add buf to
> > > > > fail anyway?
> > > >
> > > > I'm not

Re: [PATCH 0/9] More virtio hardening

2021-10-12 Thread Michael S. Tsirkin
On Tue, Oct 12, 2021 at 02:43:13PM +0800, Jason Wang wrote:
> On Tue, Oct 12, 2021 at 2:35 PM Michael S. Tsirkin  wrote:
> >
> > On Tue, Oct 12, 2021 at 02:11:10PM +0800, Jason Wang wrote:
> > > On Tue, Oct 12, 2021 at 1:44 PM Michael S. Tsirkin  
> > > wrote:
> > > >
> > > > On Tue, Oct 12, 2021 at 10:43:57AM +0800, Jason Wang wrote:
> > > > > On Mon, Oct 11, 2021 at 8:36 PM Michael S. Tsirkin  
> > > > > wrote:
> > > > > >
> > > > > > On Mon, Oct 11, 2021 at 03:36:51PM +0800, Jason Wang wrote:
> > > > > > > On Tue, Oct 5, 2021 at 3:42 PM Michael S. Tsirkin 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Mon, Sep 13, 2021 at 01:53:44PM +0800, Jason Wang wrote:
> > > > > > > > > Hi All:
> > > > > > > > >
> > > > > > > > > This series tries to do more hardening for virtio.
> > > > > > > > >
> > > > > > > > > patch 1 validates the num_queues for the virtio-blk device.
> > > > > > > > > patch 2-4 validates max_nr_ports for the virtio-console device.
> > > > > > > > > patch 5-7 harden virtio-pci interrupts to make sure no
> > > > > > > > > unexpected interrupt handler is triggered. If this makes sense
> > > > > > > > > we can do similar things in other transport drivers.
> > > > > > > > > patch 8-9 validate used ring length.
> > > > > > > > >
> > > > > > > > > Smoke tested on blk/net with packed=on/off and
> > > > > > > > > iommu_platform=on/off.
> > > > > > > > >
> > > > > > > > > Please review.
> > > > > > > > >
> > > > > > > > > Thanks
> > > > > > > >
> > > > > > > > So I poked at console at least, and I think I see
> > > > > > > > an issue: if interrupt handler queues a work/bh,
> > > > > > > > then it can still run while reset is in progress.
> > > > > > >
> > > > > > > Looks like a bug which is unrelated to the hardening?
> > > > > >
> > > > > > Won't preventing use after free be relevant?
> > > > >
> > > > > Oh right.
> > > > >
> > > > > > I frankly don't know what hardening means then.
> > > > > > > E.g the driver
> > > > > > > should sync with work/bh before reset.
> > > > > >
> > > > > > No, there's no way to fix it ATM without extra locks and state 
> > > > > > which I
> > > > > > think we should strive to avoid or make it generic, not per-driver,
> > > > > > since sync before reset is useless, new interrupts will just arrive 
> > > > > > and
> > > > > > queue more work. And a sync after reset is too late since driver 
> > > > > > will
> > > > > > try to add buffers.
> > > > >
> > > > > Can we do something like
> > > > >
> > > > > 1) disable interrupt
> > > > > 2) sync bh
> > > > >
> > > > > Or I guess this is somehow you meant in the following steps.
> > > >
> > > > So that would mean a new API to disable vq interrupts.
> > > > reset will re-enable.
> > > > E.g. virtqueue_cancel_cb_before_reset()?
> > > >
> > > > Then drivers can sync, then reset.
> > > > This means maintaining more state though, which I don't like.
> > > >
> > > > An alternative is something like this:
> > > >
> > > > static void (*virtio_flush_device)(struct virtio_device *dev);
> > > >
> > > > void virtio_reset_device(struct virtio_device *dev, virtio_flush_device 
> > > > flush)
> > > > {
> > > > might_sleep();
> > > > if (flush) {
> > > > dev->config->disable_interrupts(dev);
> > > > flush(dev);
> > > > dev->config->reset(dev);
> > > > dev->config->enable_interrupts(dev);
> > >
> > > I wonder whether this is needed. As done in this series,
> > > enable_interrupt should be done in virtio_device_ready().
> > >
> > > Others should work.
> > >
> > > > } else {
> > > > dev->config->reset(dev);
> > > > }
> > > > }
> > > >
> > > > I have patches wrapping all reset calls in virtio_reset_device
> > > > (without the flush parameter but that's easy to tweak).
> > >
> > > Does it work if I post V2 and you post those patches on top?
> >
> > The reset things? Sure.
> 
> Ok.
> 
> >
> > > >
> > > >
> > > > > >
> > > > > > Maybe we can break device. Two issues with that
> > > > > > - drivers might not be ready to handle add_buf failures
> > > > > > - restore needs to unbreak then and we don't have a way to do that 
> > > > > > yet
> > > > > >
> > > > > > So .. careful reading of all device drivers and hoping we don't mess
> > > > > > things up even more ... here we come.
> > > > >
> > > > > Yes.
> > > >
> > > > The biggest issue with this trick is drivers not handling add_buf
> > > > errors; adding a failure path here risks creating memory leaks.
> > > > OTOH with e.g. bounce buffers maybe it's possible for add buf to
> > > > fail anyway?
> > >
> > > I'm not sure I get this; a simple git grep told me that at least the return
> > > values of add_inbuf() were all checked.
> > >
> > > Thanks
> >
> > Checked locally, but the error is not always handled all the way
> > to the top. E.g. add_port in console returns an error code,
> > but that is never checked. Well, console is a mess generally.
> 
> I see. I c
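
To make the proposed reset/flush flow concrete, here is a driver-side
sketch. Note that virtio_reset_device() with a flush callback is only a
proposal in this thread, and the ports_device/control_work names are
borrowed from virtio_console purely for illustration:

#include <linux/virtio.h>
#include <linux/workqueue.h>

/* Assumes the proposed signature from the thread:
 *   void virtio_reset_device(struct virtio_device *dev,
 *                            virtio_flush_device flush);
 */

struct ports_device {			/* reduced from virtio_console.c */
	struct work_struct control_work;
};

static void virtcons_flush(struct virtio_device *vdev)
{
	struct ports_device *portdev = vdev->priv;

	/* Called after vq interrupts are disabled, so nothing can queue
	 * new work; drain whatever the old interrupts already queued. */
	cancel_work_sync(&portdev->control_work);
}

static void virtcons_remove(struct virtio_device *vdev)
{
	/* disable interrupts -> flush -> reset, per the proposal above */
	virtio_reset_device(vdev, virtcons_flush);

	/* ... then tear down ports and delete the vqs as before ... */
}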