Re: [PATCH v2] vhost: introduce mdev based hardware backend

2019-10-23 Thread Tiwei Bie
On Wed, Oct 23, 2019 at 03:25:00PM +0800, Jason Wang wrote:
> On 2019/10/23 下午3:07, Tiwei Bie wrote:
> > On Wed, Oct 23, 2019 at 01:46:23PM +0800, Jason Wang wrote:
> > > On 2019/10/23 上午11:02, Tiwei Bie wrote:
> > > > On Tue, Oct 22, 2019 at 09:30:16PM +0800, Jason Wang wrote:
> > > > > On 2019/10/22 下午5:52, Tiwei Bie wrote:
> > > > > > This patch introduces a mdev based hardware vhost backend.
> > > > > > This backend is built on top of the same abstraction used
> > > > > > in virtio-mdev and provides a generic vhost interface for
> > > > > > userspace to accelerate the virtio devices in guest.
> > > > > > 
> > > > > > This backend is implemented as a mdev device driver on top
> > > > > > of the same mdev device ops used in virtio-mdev but using
> > > > > > a different mdev class id, and it will register the device
> > > > > > as a VFIO device for userspace to use. Userspace can setup
> > > > > > the IOMMU with the existing VFIO container/group APIs and
> > > > > > then get the device fd with the device name. After getting
> > > > > > the device fd of this device, userspace can use vhost ioctls
> > > > > > to setup the backend.
> > > > > > 
> > > > > > Signed-off-by: Tiwei Bie 
> > > > > > ---
> > > > > > This patch depends on below series:
> > > > > > https://lkml.org/lkml/2019/10/17/286
> > > > > > 
> > > > > > v1 -> v2:
> > > > > > - Replace _SET_STATE with _SET_STATUS (MST);
> > > > > > - Check status bits at each step (MST);
> > > > > > - Report the max ring size and max number of queues (MST);
> > > > > > - Add missing MODULE_DEVICE_TABLE (Jason);
> > > > > > - Only support the network backend w/o multiqueue for now;
> > > > > Any idea on how to extend it to support devices other than net? I 
> > > > > think we
> > > > > want a generic API or an API that could be made generic in the future.
> > > > > 
> > > > > Do we want to e.g. have a generic vhost mdev for all kinds of devices,
> > > > > or introduce e.g. vhost-net-mdev and vhost-scsi-mdev?
> > > > One possible way is to do what vhost-user does. I.e. Apart from
> > > > the generic ring, features, ... related ioctls, we also introduce
> > > > device specific ioctls when we need them. As vhost-mdev just needs
> > > > to forward configs between parent and userspace and even won't
> > > > cache any info when possible,
> > > 
> > > So it looks to me this is only possible if we expose e.g set_config and
> > > get_config to userspace.
> > The set_config and get_config interfaces don't really cover all of the
> > device-specific settings. We also have the ctrlq in virtio-net.
> 
> 
> Yes, but it could be processed by the existing API, couldn't it? Just set the
> ctrl vq address and let the parent deal with that.

I mean how to expose ctrlq related settings to userspace?

> 
> 
> > 
> > > 
> > > > I think it might be better to do
> > > > this in one generic vhost-mdev module.
> > > 
> > > Looking at the definitions of VhostUserRequest in QEMU, it mixes the
> > > generic API with device-specific APIs. If we want to go this way (a generic
> > > vhost-mdev), more questions need to be answered:
> > > 
> > > 1) How could userspace know which type of vhost it would use? Do we need to
> > > expose the virtio subsystem device for userspace in this case?
> > > 
> > > 2) That generic vhost-mdev module still needs to filter out unsupported
> > > ioctls for a specific type. E.g. if it probes a net device, it should refuse
> > > APIs for other types. This is in fact a vhost-mdev-net, just not modularized
> > > on top of vhost-mdev.
> > > 
> > > 
> > > > > > - Some minor fixes and improvements;
> > > > > > - Rebase on top of virtio-mdev series v4;
> > [...]
> > > > > > +
> > > > > > +static long vhost_mdev_get_features(struct vhost_mdev *m, u64 
> > > > > > __user *featurep)
> > > > > > +{
> > > > > > +   if (copy_to_user(featurep, &m->features, sizeof(m->features)))
> > > > > > +   return -EFAULT;
> > > >

Re: [PATCH v2] vhost: introduce mdev based hardware backend

2019-10-23 Thread Tiwei Bie
On Wed, Oct 23, 2019 at 01:46:23PM +0800, Jason Wang wrote:
> On 2019/10/23 上午11:02, Tiwei Bie wrote:
> > On Tue, Oct 22, 2019 at 09:30:16PM +0800, Jason Wang wrote:
> > > On 2019/10/22 下午5:52, Tiwei Bie wrote:
> > > > This patch introduces a mdev based hardware vhost backend.
> > > > This backend is built on top of the same abstraction used
> > > > in virtio-mdev and provides a generic vhost interface for
> > > > userspace to accelerate the virtio devices in guest.
> > > > 
> > > > This backend is implemented as a mdev device driver on top
> > > > of the same mdev device ops used in virtio-mdev but using
> > > > a different mdev class id, and it will register the device
> > > > as a VFIO device for userspace to use. Userspace can setup
> > > > the IOMMU with the existing VFIO container/group APIs and
> > > > then get the device fd with the device name. After getting
> > > > the device fd of this device, userspace can use vhost ioctls
> > > > to setup the backend.
> > > > 
> > > > Signed-off-by: Tiwei Bie 
> > > > ---
> > > > This patch depends on below series:
> > > > https://lkml.org/lkml/2019/10/17/286
> > > > 
> > > > v1 -> v2:
> > > > - Replace _SET_STATE with _SET_STATUS (MST);
> > > > - Check status bits at each step (MST);
> > > > - Report the max ring size and max number of queues (MST);
> > > > - Add missing MODULE_DEVICE_TABLE (Jason);
> > > > - Only support the network backend w/o multiqueue for now;
> > > 
> > > Any idea on how to extend it to support devices other than net? I think we
> > > want a generic API or an API that could be made generic in the future.
> > > 
> > > Do we want to e.g. have a generic vhost mdev for all kinds of devices, or
> > > introduce e.g. vhost-net-mdev and vhost-scsi-mdev?
> > One possible way is to do what vhost-user does. I.e. Apart from
> > the generic ring, features, ... related ioctls, we also introduce
> > device specific ioctls when we need them. As vhost-mdev just needs
> > to forward configs between parent and userspace and even won't
> > cache any info when possible,
> 
> 
> So it looks to me this is only possible if we expose e.g set_config and
> get_config to userspace.

The set_config and get_config interfaces don't really cover all of the
device-specific settings. We also have the ctrlq in virtio-net.

> 
> 
> > I think it might be better to do
> > this in one generic vhost-mdev module.
> 
> 
> Looking at the definitions of VhostUserRequest in QEMU, it mixes the generic
> API with device-specific APIs. If we want to go this way (a generic
> vhost-mdev), more questions need to be answered:
> 
> 1) How could userspace know which type of vhost it would use? Do we need to
> expose the virtio subsystem device for userspace in this case?
> 
> 2) That generic vhost-mdev module still needs to filter out unsupported
> ioctls for a specific type. E.g. if it probes a net device, it should refuse
> APIs for other types. This is in fact a vhost-mdev-net, just not modularized
> on top of vhost-mdev.
> 
> 
> > 
> > > 
> > > > - Some minor fixes and improvements;
> > > > - Rebase on top of virtio-mdev series v4;
[...]
> > > > +
> > > > +static long vhost_mdev_get_features(struct vhost_mdev *m, u64 __user 
> > > > *featurep)
> > > > +{
> > > > +   if (copy_to_user(featurep, &m->features, sizeof(m->features)))
> > > > +   return -EFAULT;
> > > 
> > > As discussed in previous version do we need to filter out MQ feature here?
> > I think it's more straightforward to let the parent drivers
> > filter out the unsupported features. Otherwise it would be tricky
> > when we want to add more features to the vhost-mdev module,
> 
> 
> It's as simple as removing the feature from the blacklist, isn't it?

It's not really that easy. It may break the old drivers.
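For illustration, filtering inside vhost-mdev would look roughly like the
sketch below; the whitelist macro is made up for this example and is not
part of the patch:

/* Hypothetical whitelist of feature bits vhost-mdev itself can handle.
 * Anything the parent offers beyond this would be masked out before
 * being reported to userspace.
 */
#define VHOST_MDEV_SUPPORTED_FEATURES \
	((1ULL << VIRTIO_F_VERSION_1) | (1ULL << VIRTIO_F_IOMMU_PLATFORM))

static long vhost_mdev_get_features(struct vhost_mdev *m, u64 __user *featurep)
{
	u64 features = m->features & VHOST_MDEV_SUPPORTED_FEATURES;

	if (copy_to_user(featurep, &features, sizeof(features)))
		return -EFAULT;
	return 0;
}

The concern is that such a mask has to track vhost-mdev's own capabilities: a
parent that offers e.g. VIRTIO_NET_F_MQ and relies on the mask would suddenly
see MQ exposed to userspace the moment the mask is extended, which is how old
drivers could break.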

> 
> 
> > i.e. if
> > the parent drivers may expose unsupported features and rely on
> > vhost-mdev to filter them out, these features will be exposed
> > to userspace automatically when they are enabled in vhost-mdev
> > in the future.
> 
> 
> The issue is that only vhost-mdev knows its own limitations. E.g. in
> this patch, vhost-mdev only implements a subset of the transport API, but the
> parent doesn't know about that.
> 
> Still taking MQ as an example, there's no way (or no need) for the parent to
> know that vhost-mdev does not support MQ.

The mdev is a MDEV_CLAS

Re: [PATCH v2] vhost: introduce mdev based hardware backend

2019-10-22 Thread Tiwei Bie
On Tue, Oct 22, 2019 at 09:30:16PM +0800, Jason Wang wrote:
> On 2019/10/22 下午5:52, Tiwei Bie wrote:
> > This patch introduces a mdev based hardware vhost backend.
> > This backend is built on top of the same abstraction used
> > in virtio-mdev and provides a generic vhost interface for
> > userspace to accelerate the virtio devices in guest.
> > 
> > This backend is implemented as a mdev device driver on top
> > of the same mdev device ops used in virtio-mdev but using
> > a different mdev class id, and it will register the device
> > as a VFIO device for userspace to use. Userspace can setup
> > the IOMMU with the existing VFIO container/group APIs and
> > then get the device fd with the device name. After getting
> > the device fd of this device, userspace can use vhost ioctls
> > to setup the backend.
> > 
> > Signed-off-by: Tiwei Bie 
> > ---
> > This patch depends on below series:
> > https://lkml.org/lkml/2019/10/17/286
> > 
> > v1 -> v2:
> > - Replace _SET_STATE with _SET_STATUS (MST);
> > - Check status bits at each step (MST);
> > - Report the max ring size and max number of queues (MST);
> > - Add missing MODULE_DEVICE_TABLE (Jason);
> > - Only support the network backend w/o multiqueue for now;
> 
> 
> Any idea on how to extend it to support devices other than net? I think we
> want a generic API or an API that could be made generic in the future.
> 
> Do we want to e.g. have a generic vhost mdev for all kinds of devices, or
> introduce e.g. vhost-net-mdev and vhost-scsi-mdev?

One possible way is to do what vhost-user does. I.e., apart from
the generic ring, features, ... related ioctls, we also introduce
device-specific ioctls when we need them. As vhost-mdev just needs
to forward configs between the parent and userspace, and won't even
cache any info when possible, I think it might be better to do
this in one generic vhost-mdev module.
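To make that concrete, below is a rough sketch of how a device-specific
request could sit beside the generic ioctls in a single vhost-mdev module.
The VHOST_NET_MDEV_SET_MAC ioctl and the MAC handling are made up for this
sketch, and set_config() stands for the parent config op discussed elsewhere
in this thread (assumed here to take (mdev, offset, buf, len)):

/* Sketch only: a net-specific request handled by the generic vhost-mdev
 * module by forwarding the config to the parent, without caching it.
 */
static long vhost_mdev_net_set_mac(struct vhost_mdev *m, void __user *argp)
{
	const struct virtio_mdev_device_ops *ops = mdev_get_dev_ops(m->mdev);
	u8 mac[6];

	if (copy_from_user(mac, argp, sizeof(mac)))
		return -EFAULT;

	/* No state kept in vhost-mdev: hand the config to the parent. */
	ops->set_config(m->mdev, offsetof(struct virtio_net_config, mac),
			mac, sizeof(mac));
	return 0;
}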

> 
> 
> > - Some minor fixes and improvements;
> > - Rebase on top of virtio-mdev series v4;
> > 
> > RFC v4 -> v1:
> > - Implement vhost-mdev as a mdev device driver directly and
> >connect it to VFIO container/group. (Jason);
> > - Pass ring addresses as GPAs/IOVAs in vhost-mdev to avoid
> >meaningless HVA->GPA translations (Jason);
> > 
> > RFC v3 -> RFC v4:
> > - Build vhost-mdev on top of the same abstraction used by
> >virtio-mdev (Jason);
> > - Introduce vhost fd and pass VFIO fd via SET_BACKEND ioctl (MST);
> > 
> > RFC v2 -> RFC v3:
> > - Reuse vhost's ioctls instead of inventing a VFIO regions/irqs
> >based vhost protocol on top of vfio-mdev (Jason);
> > 
> > RFC v1 -> RFC v2:
> > - Introduce a new VFIO device type to build a vhost protocol
> >on top of vfio-mdev;
> > 
> >   drivers/vfio/mdev/mdev_core.c |  12 +
> >   drivers/vhost/Kconfig |   9 +
> >   drivers/vhost/Makefile|   3 +
> >   drivers/vhost/mdev.c  | 415 ++
> >   include/linux/mdev.h  |   3 +
> >   include/uapi/linux/vhost.h|  13 ++
> >   6 files changed, 455 insertions(+)
> >   create mode 100644 drivers/vhost/mdev.c
> > 
> > diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> > index 5834f6b7c7a5..2963f65e6648 100644
> > --- a/drivers/vfio/mdev/mdev_core.c
> > +++ b/drivers/vfio/mdev/mdev_core.c
> > @@ -69,6 +69,18 @@ void mdev_set_virtio_ops(struct mdev_device *mdev,
> >   }
> >   EXPORT_SYMBOL(mdev_set_virtio_ops);
> > +/* Specify the vhost device ops for the mdev device, this
> > + * must be called during create() callback for vhost mdev device.
> > + */
> > +void mdev_set_vhost_ops(struct mdev_device *mdev,
> > +   const struct virtio_mdev_device_ops *vhost_ops)
> > +{
> > +   WARN_ON(mdev->class_id);
> > +   mdev->class_id = MDEV_CLASS_ID_VHOST;
> > +   mdev->device_ops = vhost_ops;
> > +}
> > +EXPORT_SYMBOL(mdev_set_vhost_ops);
> > +
> >   const void *mdev_get_dev_ops(struct mdev_device *mdev)
> >   {
> > return mdev->device_ops;
> > diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
> > index 3d03ccbd1adc..7b5c2f655af7 100644
> > --- a/drivers/vhost/Kconfig
> > +++ b/drivers/vhost/Kconfig
> > @@ -34,6 +34,15 @@ config VHOST_VSOCK
> > To compile this driver as a module, choose M here: the module will be 
> > called
> > vhost_vsock.
> > +config VHOST_MDEV
> > +   tristate "Vhost driver for Mediated devices"
> > +   depends on EVENTFD &

[PATCH v2] vhost: introduce mdev based hardware backend

2019-10-22 Thread Tiwei Bie
This patch introduces a mdev based hardware vhost backend.
This backend is built on top of the same abstraction used
in virtio-mdev and provides a generic vhost interface for
userspace to accelerate the virtio devices in guest.

This backend is implemented as a mdev device driver on top
of the same mdev device ops used in virtio-mdev but using
a different mdev class id, and it will register the device
as a VFIO device for userspace to use. Userspace can setup
the IOMMU with the existing VFIO container/group APIs and
then get the device fd with the device name. After getting
the device fd of this device, userspace can use vhost ioctls
to setup the backend.

Signed-off-by: Tiwei Bie 
---
This patch depends on below series:
https://lkml.org/lkml/2019/10/17/286

v1 -> v2:
- Replace _SET_STATE with _SET_STATUS (MST);
- Check status bits at each step (MST);
- Report the max ring size and max number of queues (MST);
- Add missing MODULE_DEVICE_TABLE (Jason);
- Only support the network backend w/o multiqueue for now;
- Some minor fixes and improvements;
- Rebase on top of virtio-mdev series v4;

RFC v4 -> v1:
- Implement vhost-mdev as a mdev device driver directly and
  connect it to VFIO container/group. (Jason);
- Pass ring addresses as GPAs/IOVAs in vhost-mdev to avoid
  meaningless HVA->GPA translations (Jason);

RFC v3 -> RFC v4:
- Build vhost-mdev on top of the same abstraction used by
  virtio-mdev (Jason);
- Introduce vhost fd and pass VFIO fd via SET_BACKEND ioctl (MST);

RFC v2 -> RFC v3:
- Reuse vhost's ioctls instead of inventing a VFIO regions/irqs
  based vhost protocol on top of vfio-mdev (Jason);

RFC v1 -> RFC v2:
- Introduce a new VFIO device type to build a vhost protocol
  on top of vfio-mdev;

 drivers/vfio/mdev/mdev_core.c |  12 +
 drivers/vhost/Kconfig |   9 +
 drivers/vhost/Makefile|   3 +
 drivers/vhost/mdev.c  | 415 ++
 include/linux/mdev.h  |   3 +
 include/uapi/linux/vhost.h|  13 ++
 6 files changed, 455 insertions(+)
 create mode 100644 drivers/vhost/mdev.c

diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
index 5834f6b7c7a5..2963f65e6648 100644
--- a/drivers/vfio/mdev/mdev_core.c
+++ b/drivers/vfio/mdev/mdev_core.c
@@ -69,6 +69,18 @@ void mdev_set_virtio_ops(struct mdev_device *mdev,
 }
 EXPORT_SYMBOL(mdev_set_virtio_ops);
 
+/* Specify the vhost device ops for the mdev device, this
+ * must be called during create() callback for vhost mdev device.
+ */
+void mdev_set_vhost_ops(struct mdev_device *mdev,
+   const struct virtio_mdev_device_ops *vhost_ops)
+{
+   WARN_ON(mdev->class_id);
+   mdev->class_id = MDEV_CLASS_ID_VHOST;
+   mdev->device_ops = vhost_ops;
+}
+EXPORT_SYMBOL(mdev_set_vhost_ops);
+
 const void *mdev_get_dev_ops(struct mdev_device *mdev)
 {
return mdev->device_ops;
diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index 3d03ccbd1adc..7b5c2f655af7 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -34,6 +34,15 @@ config VHOST_VSOCK
To compile this driver as a module, choose M here: the module will be 
called
vhost_vsock.
 
+config VHOST_MDEV
+   tristate "Vhost driver for Mediated devices"
+   depends on EVENTFD && VFIO && VFIO_MDEV
+   select VHOST
+   default n
+   ---help---
+   Say M here to enable the vhost_mdev module for use with
+   the mediated device based hardware vhost accelerators.
+
 config VHOST
tristate
---help---
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index 6c6df24f770c..ad9c0f8c6d8c 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -10,4 +10,7 @@ vhost_vsock-y := vsock.o
 
 obj-$(CONFIG_VHOST_RING) += vringh.o
 
+obj-$(CONFIG_VHOST_MDEV) += vhost_mdev.o
+vhost_mdev-y := mdev.o
+
 obj-$(CONFIG_VHOST) += vhost.o
diff --git a/drivers/vhost/mdev.c b/drivers/vhost/mdev.c
new file mode 100644
index ..5f9cae61018c
--- /dev/null
+++ b/drivers/vhost/mdev.c
@@ -0,0 +1,415 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2018-2019 Intel Corporation.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "vhost.h"
+
+/* Currently, only network backend w/o multiqueue is supported. */
+#define VHOST_MDEV_VQ_MAX  2
+
+struct vhost_mdev {
+   /* The lock is to protect this structure. */
+   struct mutex mutex;
+   struct vhost_dev dev;
+   struct vhost_virtqueue *vqs;
+   int nvqs;
+   u64 status;
+   u64 features;
+   u64 acked_features;
+   bool opened;
+   struct mdev_device *mdev;
+};
+
+static void handle_vq_kick(struct vhost_work *work)
+{
+   struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
+ poll.work);
+   struct vhost_mdev

Re: [PATCH] vhost: introduce mdev based hardware backend

2019-09-27 Thread Tiwei Bie
On Fri, Sep 27, 2019 at 03:14:42PM +0800, Jason Wang wrote:
> On 2019/9/27 下午12:54, Tiwei Bie wrote:
> > > > +
> > > > +   /*
> > > > +* In vhost-mdev, userspace should pass ring addresses
> > > > +* in guest physical addresses when IOMMU is disabled or
> > > > +* IOVAs when IOMMU is enabled.
> > > > +*/
> > > A question here, consider we're using noiommu mode. If guest physical
> > > address is passed here, how can a device use that?
> > > 
> > > I believe you meant "host physical address" here? And it also has the
> > > implication that the HPA should be contiguous (e.g. using hugetlbfs).
> > The comment is talking about the virtual IOMMU (i.e. iotlb in vhost).
> > It should be rephrased to cover the noiommu case as well. Thanks for
> > spotting this.
> 
> 
> So the question still stands: if a GPA is passed, how can it be used by the
> virtio-mdev device?

Sorry if I didn't make it clear..
Of course, GPA can't be passed in noiommu mode.


> 
> Thanks
> 


Re: [PATCH] vhost: introduce mdev based hardware backend

2019-09-26 Thread Tiwei Bie
On Fri, Sep 27, 2019 at 11:46:06AM +0800, Jason Wang wrote:
> On 2019/9/26 下午12:54, Tiwei Bie wrote:
> > +
> > +static long vhost_mdev_start(struct vhost_mdev *m)
> > +{
> > +   struct mdev_device *mdev = m->mdev;
> > +   const struct virtio_mdev_device_ops *ops = mdev_get_dev_ops(mdev);
> > +   struct virtio_mdev_callback cb;
> > +   struct vhost_virtqueue *vq;
> > +   int idx;
> > +
> > +   ops->set_features(mdev, m->acked_features);
> > +
> > +   mdev_add_status(mdev, VIRTIO_CONFIG_S_FEATURES_OK);
> > +   if (!(mdev_get_status(mdev) & VIRTIO_CONFIG_S_FEATURES_OK))
> > +   goto reset;
> > +
> > +   for (idx = 0; idx < m->nvqs; idx++) {
> > +   vq = &m->vqs[idx];
> > +
> > +   if (!vq->desc || !vq->avail || !vq->used)
> > +   break;
> > +
> > +   if (ops->set_vq_state(mdev, idx, vq->last_avail_idx))
> > +   goto reset;
> 
> 
> If we do set_vq_state() in SET_VRING_BASE, we won't need this step here.

Yeah, I plan to do it in the next version.
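Concretely, that would mean something like the sketch below, pushing the
index to the parent as soon as userspace sets it instead of replaying it in
vhost_mdev_start(); the names follow this patch, but the flow itself is only
an assumption about the next version:

static long vhost_mdev_set_vring_base(struct vhost_mdev *m, void __user *argp)
{
	const struct virtio_mdev_device_ops *ops = mdev_get_dev_ops(m->mdev);
	struct vhost_vring_state s;

	if (copy_from_user(&s, argp, sizeof(s)))
		return -EFAULT;
	if (s.index >= m->nvqs)
		return -ENOBUFS;

	m->vqs[s.index].last_avail_idx = s.num;
	/* Hand the ring state to the parent right away. */
	return ops->set_vq_state(m->mdev, s.index, s.num);
}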

> 
> 
> > +
> > +   /*
> > +* In vhost-mdev, userspace should pass ring addresses
> > +* in guest physical addresses when IOMMU is disabled or
> > +* IOVAs when IOMMU is enabled.
> > +*/
> 
> 
> A question here, consider we're using noiommu mode. If guest physical
> address is passed here, how can a device use that?
> 
> I believe you meant "host physical address" here? And it also has the
> implication that the HPA should be contiguous (e.g. using hugetlbfs).

The comment is talking about the virtual IOMMU (i.e. iotlb in vhost).
It should be rephrased to cover the noiommu case as well. Thanks for
spotting this.


> > +
> > +   switch (cmd) {
> > +   case VHOST_MDEV_SET_STATE:
> > +   r = vhost_set_state(m, argp);
> > +   break;
> > +   case VHOST_GET_FEATURES:
> > +   r = vhost_get_features(m, argp);
> > +   break;
> > +   case VHOST_SET_FEATURES:
> > +   r = vhost_set_features(m, argp);
> > +   break;
> > +   case VHOST_GET_VRING_BASE:
> > +   r = vhost_get_vring_base(m, argp);
> > +   break;
> 
> 
> Does it mean the SET_VRING_BASE may only take effect after
> VHOST_MDEV_SET_STATE?

Yeah, in this version, SET_VRING_BASE won't set the base on the
device directly. But I plan not to delay this anymore in the next
version, to support the SET_STATUS.

> 
> 
> > +   default:
> > +   r = vhost_dev_ioctl(&m->dev, cmd, argp);
> > +   if (r == -ENOIOCTLCMD)
> > +   r = vhost_vring_ioctl(&m->dev, cmd, argp);
> > +   }
> > +
> > +   mutex_unlock(&m->mutex);
> > +   return r;
> > +}
> > +
> > +static const struct vfio_device_ops vfio_vhost_mdev_dev_ops = {
> > +   .name   = "vfio-vhost-mdev",
> > +   .open   = vhost_mdev_open,
> > +   .release= vhost_mdev_release,
> > +   .ioctl  = vhost_mdev_unlocked_ioctl,
> > +};
> > +
> > +static int vhost_mdev_probe(struct device *dev)
> > +{
> > +   struct mdev_device *mdev = mdev_from_dev(dev);
> > +   const struct virtio_mdev_device_ops *ops = mdev_get_dev_ops(mdev);
> > +   struct vhost_mdev *m;
> > +   int nvqs, r;
> > +
> > +   m = kzalloc(sizeof(*m), GFP_KERNEL | __GFP_RETRY_MAYFAIL);
> > +   if (!m)
> > +   return -ENOMEM;
> > +
> > +   mutex_init(&m->mutex);
> > +
> > +   nvqs = ops->get_queue_max(mdev);
> > +   m->nvqs = nvqs;
> 
> 
> The name could be confusing: get_queue_max() is to get the maximum number of
> entries per virtqueue supported by this device.

OK. It might be better to rename it to something like:

get_vq_num_max()

which is more consistent with the set_vq_num().

> 
> It looks to me that we need another API to query the maximum number of
> virtqueues supported by the device.

Yeah.

Thanks,
Tiwei


> 
> Thanks
> 
> 
> > +
> > +   m->vqs = kmalloc_array(nvqs, sizeof(struct vhost_virtqueue),
> > +  GFP_KERNEL);
> > +   if (!m->vqs) {
> > +   r = -ENOMEM;
> > +   goto err;
> > +   }
> > +
> > +   r = vfio_add_group_dev(dev, &vfio_vhost_mdev_dev_ops, m);
> > +   if (r)
> > +   goto err;
> > +
> > +   m->features = ops->get_features(mdev);
> > +   m->mdev = mdev;
> > +   return 0;
> > +
> > +err:
> 

Re: [PATCH] vhost: introduce mdev based hardware backend

2019-09-26 Thread Tiwei Bie
On Fri, Sep 27, 2019 at 11:51:35AM +0800, Jason Wang wrote:
> On 2019/9/27 上午11:46, Jason Wang wrote:
> > +
> > +static struct mdev_class_id id_table[] = {
> > +    { MDEV_ID_VHOST },
> > +    { 0 },
> > +};
> > +
> > +static struct mdev_driver vhost_mdev_driver = {
> > +    .name    = "vhost_mdev",
> > +    .probe    = vhost_mdev_probe,
> > +    .remove    = vhost_mdev_remove,
> > +    .id_table = id_table,
> > +};
> > +
> 
> 
> And you probably need to add MODULE_DEVICE_TABLE() as well.

Yeah, thanks!


> 
> Thanks
> 


Re: [PATCH] vhost: introduce mdev based hardware backend

2019-09-26 Thread Tiwei Bie
On Thu, Sep 26, 2019 at 09:26:22AM -0400, Michael S. Tsirkin wrote:
> On Thu, Sep 26, 2019 at 09:14:39PM +0800, Tiwei Bie wrote:
> > > 4. Does device need to limit max ring size?
> > > 5. Does device need to limit max number of queues?
> > 
> > I think so. It's helpful to have ioctls to report the max
> > ring size and max number of queues.
> 
> Also, let's not repeat the vhost net mistakes, let's lock
> everything to the order required by the virtio spec,
> checking status bits at each step.
> E.g.:
>   set backend features
>   set features
>   detect and program vqs
>   enable vqs
>   enable driver
> 
> and check status at each step to force the correct order.
> e.g. don't allow enabling vqs after driver ok, etc

Got it. Thanks a lot!

Regards,
Tiwei

> 
> -- 
> MST


Re: [PATCH] vhost: introduce mdev based hardware backend

2019-09-26 Thread Tiwei Bie
On Thu, Sep 26, 2019 at 04:35:18AM -0400, Michael S. Tsirkin wrote:
> On Thu, Sep 26, 2019 at 12:54:27PM +0800, Tiwei Bie wrote:
[...]
> > diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
> > index 40d028eed645..5afbc2f08fa3 100644
> > --- a/include/uapi/linux/vhost.h
> > +++ b/include/uapi/linux/vhost.h
> > @@ -116,4 +116,12 @@
> >  #define VHOST_VSOCK_SET_GUEST_CID  _IOW(VHOST_VIRTIO, 0x60, __u64)
> >  #define VHOST_VSOCK_SET_RUNNING   _IOW(VHOST_VIRTIO, 0x61, int)
> >  
> > +/* VHOST_MDEV specific defines */
> > +
> > +#define VHOST_MDEV_SET_STATE   _IOW(VHOST_VIRTIO, 0x70, __u64)
> > +
> > +#define VHOST_MDEV_S_STOPPED   0
> > +#define VHOST_MDEV_S_RUNNING   1
> > +#define VHOST_MDEV_S_MAX   2
> > +
> >  #endif
> 
> So assuming we have an underlying device that behaves like virtio:

I think they are really good questions/suggestions. Thanks!

> 
> 1. Should we use SET_STATUS maybe?

I like this idea. I will give it a try.

> 2. Do we want a reset ioctl?

I think it is helpful. If we use SET_STATUS, maybe we
can use it to support the reset.
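A minimal sketch of that direction is below; the v2 posting in this thread
does switch to a SET_STATUS ioctl, but the handler here is illustrative
rather than the posted code:

/* Sketch: a status ioctl where writing 0 doubles as reset, mirroring
 * the virtio device status register.
 */
static long vhost_mdev_set_status(struct vhost_mdev *m, const u8 __user *statusp)
{
	const struct virtio_mdev_device_ops *ops = mdev_get_dev_ops(m->mdev);
	u8 status;

	if (copy_from_user(&status, statusp, sizeof(status)))
		return -EFAULT;

	/* Status 0 means reset, as in the virtio spec. */
	ops->set_status(m->mdev, status);
	return 0;
}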

> 3. Do we want ability to enable rings individually?

I will make it possible at least in the vhost layer.

> 4. Does device need to limit max ring size?
> 5. Does device need to limit max number of queues?

I think so. It's helpful to have ioctls to report the max
ring size and max number of queues.

Thanks!
Tiwei


> 
> -- 
> MST


[PATCH] vhost: introduce mdev based hardware backend

2019-09-25 Thread Tiwei Bie
This patch introduces a mdev based hardware vhost backend.
This backend is built on top of the same abstraction used
in virtio-mdev and provides a generic vhost interface for
userspace to accelerate the virtio devices in guest.

This backend is implemented as a mdev device driver on top
of the same mdev device ops used in virtio-mdev but using
a different mdev class id, and it will register the device
as a VFIO device for userspace to use. Userspace can setup
the IOMMU with the existing VFIO container/group APIs and
then get the device fd with the device name. After getting
the device fd of this device, userspace can use vhost ioctls
to setup the backend.

Signed-off-by: Tiwei Bie 
---
This patch depends on below series:
https://lkml.org/lkml/2019/9/24/357

RFC v4 -> v1:
- Implement vhost-mdev as a mdev device driver directly and
  connect it to VFIO container/group. (Jason);
- Pass ring addresses as GPAs/IOVAs in vhost-mdev to avoid
  meaningless HVA->GPA translations (Jason);

RFC v3 -> RFC v4:
- Build vhost-mdev on top of the same abstraction used by
  virtio-mdev (Jason);
- Introduce vhost fd and pass VFIO fd via SET_BACKEND ioctl (MST);

RFC v2 -> RFC v3:
- Reuse vhost's ioctls instead of inventing a VFIO regions/irqs
  based vhost protocol on top of vfio-mdev (Jason);

RFC v1 -> RFC v2:
- Introduce a new VFIO device type to build a vhost protocol
  on top of vfio-mdev;

 drivers/vhost/Kconfig  |   9 +
 drivers/vhost/Makefile |   3 +
 drivers/vhost/mdev.c   | 381 +
 include/uapi/linux/vhost.h |   8 +
 4 files changed, 401 insertions(+)
 create mode 100644 drivers/vhost/mdev.c

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index 3d03ccbd1adc..decf0be8efe9 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -34,6 +34,15 @@ config VHOST_VSOCK
To compile this driver as a module, choose M here: the module will be 
called
vhost_vsock.
 
+config VHOST_MDEV
+   tristate "Vhost driver for Mediated devices"
+   depends on EVENTFD && VFIO && VFIO_MDEV
+   select VHOST
+   default n
+   ---help---
+   Say M here to enable the vhost_mdev module for use with
+   the mediated device based hardware vhost accelerators
+
 config VHOST
tristate
---help---
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index 6c6df24f770c..ad9c0f8c6d8c 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -10,4 +10,7 @@ vhost_vsock-y := vsock.o
 
 obj-$(CONFIG_VHOST_RING) += vringh.o
 
+obj-$(CONFIG_VHOST_MDEV) += vhost_mdev.o
+vhost_mdev-y := mdev.o
+
 obj-$(CONFIG_VHOST) += vhost.o
diff --git a/drivers/vhost/mdev.c b/drivers/vhost/mdev.c
new file mode 100644
index ..1c12a25b86a2
--- /dev/null
+++ b/drivers/vhost/mdev.c
@@ -0,0 +1,381 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2018-2019 Intel Corporation.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "vhost.h"
+
+struct vhost_mdev {
+   /* The lock is to protect this structure. */
+   struct mutex mutex;
+   struct vhost_dev dev;
+   struct vhost_virtqueue *vqs;
+   int nvqs;
+   u64 state;
+   u64 features;
+   u64 acked_features;
+   bool opened;
+   struct mdev_device *mdev;
+};
+
+static u8 mdev_get_status(struct mdev_device *mdev)
+{
+   const struct virtio_mdev_device_ops *ops = mdev_get_dev_ops(mdev);
+
+   return ops->get_status(mdev);
+}
+
+static void mdev_set_status(struct mdev_device *mdev, u8 status)
+{
+   const struct virtio_mdev_device_ops *ops = mdev_get_dev_ops(mdev);
+
+   return ops->set_status(mdev, status);
+}
+
+static void mdev_add_status(struct mdev_device *mdev, u8 status)
+{
+   status |= mdev_get_status(mdev);
+   mdev_set_status(mdev, status);
+}
+
+static void mdev_reset(struct mdev_device *mdev)
+{
+   mdev_set_status(mdev, 0);
+}
+
+static void handle_vq_kick(struct vhost_work *work)
+{
+   struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
+ poll.work);
+   struct vhost_mdev *m = container_of(vq->dev, struct vhost_mdev, dev);
+   const struct virtio_mdev_device_ops *ops = mdev_get_dev_ops(m->mdev);
+
+   ops->kick_vq(m->mdev, vq - m->vqs);
+}
+
+static irqreturn_t vhost_mdev_virtqueue_cb(void *private)
+{
+   struct vhost_virtqueue *vq = private;
+   struct eventfd_ctx *call_ctx = vq->call_ctx;
+
+   if (call_ctx)
+   eventfd_signal(call_ctx, 1);
+   return IRQ_HANDLED;
+}
+
+static long vhost_mdev_reset(struct vhost_mdev *m)
+{
+   struct mdev_device *mdev = m->mdev;
+
+   mdev_reset(mdev);
+   mdev_add_status(mdev, VIRTIO_CONFIG_S_ACKNOWLEDGE);
+   mdev_add_status(mdev, VIRTIO_CONFIG_S_DRIVER);
+   return 0;
+}
+
+static l

Re: [RFC v4 3/3] vhost: introduce mdev based hardware backend

2019-09-19 Thread Tiwei Bie
On Tue, Sep 17, 2019 at 03:26:30PM +0800, Jason Wang wrote:
> On 2019/9/17 上午9:02, Tiwei Bie wrote:
> > diff --git a/drivers/vhost/mdev.c b/drivers/vhost/mdev.c
> > new file mode 100644
> > index ..8c6597aff45e
> > --- /dev/null
> > +++ b/drivers/vhost/mdev.c
> > @@ -0,0 +1,462 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Copyright (C) 2018-2019 Intel Corporation.
> > + */
> > +
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +
> > +#include "vhost.h"
> > +
> > +struct vhost_mdev {
> > +   struct mutex mutex;
> > +   struct vhost_dev dev;
> > +   struct vhost_virtqueue *vqs;
> > +   int nvqs;
> > +   u64 state;
> > +   u64 features;
> > +   u64 acked_features;
> > +   struct vfio_group *vfio_group;
> > +   struct vfio_device *vfio_device;
> > +   struct mdev_device *mdev;
> > +};
> > +
> > +/*
> > + * XXX
> > + * We assume virtio_mdev.ko exposes below symbols for now, as we
> > + * don't have a proper way to access parent ops directly yet.
> > + *
> > + * virtio_mdev_readl()
> > + * virtio_mdev_writel()
> > + */
> > +extern u32 virtio_mdev_readl(struct mdev_device *mdev, loff_t off);
> > +extern void virtio_mdev_writel(struct mdev_device *mdev, loff_t off, u32 
> > val);
> 
> 
> Need to consider a better approach; I feel we should do it through some kind
> of mdev driver instead of talking to the mdev device directly.

Yeah, a better approach is really needed here.
Besides, we may want a way to allow accessing the mdev
device_ops proposed in the series below outside of the
drivers/vfio/mdev/ directory.

https://lkml.org/lkml/2019/9/12/151

I.e. allow putting mdev drivers outside the above directory.
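For reference, the later postings in this thread go in that direction: the
parent's device ops are stored on the mdev itself and reached through an
accessor, instead of calling exported virtio_mdev symbols. Roughly:

/* Accessor provided by mdev core in the later series. */
const void *mdev_get_dev_ops(struct mdev_device *mdev)
{
	return mdev->device_ops;
}

/* vhost-mdev then talks to the parent through those ops, e.g.: */
static u8 vhost_mdev_get_status(struct mdev_device *mdev)
{
	const struct virtio_mdev_device_ops *ops = mdev_get_dev_ops(mdev);

	return ops->get_status(mdev);
}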


> > +
> > +   for (queue_id = 0; queue_id < m->nvqs; queue_id++) {
> > +   vq = &m->vqs[queue_id];
> > +
> > +   if (!vq->desc || !vq->avail || !vq->used)
> > +   break;
> > +
> > +   virtio_mdev_writel(mdev, VIRTIO_MDEV_QUEUE_NUM, vq->num);
> > +
> > +   if (!vhost_translate_ring_addr(vq, (u64)vq->desc,
> > +  vhost_get_desc_size(vq, vq->num),
> > +  &addr))
> > +   return -EINVAL;
> 
> 
> Interesting, any reason for doing this kind of translation to HVA? I
> believe the address should already be an IOVA that has been mapped by VFIO.

Currently, in the software based vhost-kernel and vhost-user
backends, QEMU will pass ring addresses as HVA in SET_VRING_ADDR
ioctl when iotlb isn't enabled. If it's OK to let QEMU pass GPA
in vhost-mdev in this case, then this translation won't be needed.

Thanks,
Tiwei


Re: [RFC v4 0/3] vhost: introduce mdev based hardware backend

2019-09-19 Thread Tiwei Bie
On Fri, Sep 20, 2019 at 09:30:58AM +0800, Jason Wang wrote:
> On 2019/9/19 下午11:45, Tiwei Bie wrote:
> > On Thu, Sep 19, 2019 at 09:08:11PM +0800, Jason Wang wrote:
> > > On 2019/9/18 下午10:32, Michael S. Tsirkin wrote:
> > > > > > > So I have some questions:
> > > > > > > 
> > > > > > > 1) Compared to method 2, what's the advantage of creating a new 
> > > > > > > vhost char
> > > > > > > device? I guess it's for keep the API compatibility?
> > > > > > One benefit is that we can avoid doing vhost ioctls on
> > > > > > VFIO device fd.
> > > > > Yes, but any benefit from doing this?
> > > > It does seem a bit more modular, but it's certainly not a big deal.
> > > Ok, if we go this way, it could be as simple as provide some callback to
> > > vhost, then vhost can just forward the ioctl through parent_ops.
> > > 
> > > > > > > 2) For method 2, is there any easy way for user/admin to 
> > > > > > > distinguish e.g
> > > > > > > ordinary vfio-mdev for vhost from ordinary vfio-mdev?
> > > > > > I think device-api could be a choice.
> > > > > Ok.
> > > > > 
> > > > > 
> > > > > > > I saw you introduce
> > > > > > > ops matching helper but it's not friendly to management.
> > > > > > The ops matching helper is just to check whether a given
> > > > > > vfio-device is based on a mdev device.
> > > > > > 
> > > > > > > 3) A drawback of 1) and 2) is that it must follow vfio_device_ops 
> > > > > > > that
> > > > > > > assumes the parameter comes from userspace, it prevents support 
> > > > > > > kernel
> > > > > > > virtio drivers.
> > > > > > > 
> > > > > > > 4) So comes the idea of method 3, since it register a new 
> > > > > > > vhost-mdev driver,
> > > > > > > we can use device specific ops instead of VFIO ones, then we can 
> > > > > > > have a
> > > > > > > common API between vDPA parent and vhost-mdev/virtio-mdev drivers.
> > > > > > As the above draft shows, this requires introducing a new
> > > > > > VFIO device driver. I think Alex's opinion matters here.
> > > Just to clarify, a new type of mdev driver but provides dummy
> > > vfio_device_ops for VFIO to make container DMA ioctl work.
> > I see. Thanks! IIUC, you mean we can provide a very tiny
> > VFIO device driver in drivers/vhost/mdev.c, e.g.:
> > 
> > static int vfio_vhost_mdev_open(void *device_data)
> > {
> > if (!try_module_get(THIS_MODULE))
> > return -ENODEV;
> > return 0;
> > }
> > 
> > static void vfio_vhost_mdev_release(void *device_data)
> > {
> > module_put(THIS_MODULE);
> > }
> > 
> > static const struct vfio_device_ops vfio_vhost_mdev_dev_ops = {
> > .name   = "vfio-vhost-mdev",
> > .open   = vfio_vhost_mdev_open,
> > .release= vfio_vhost_mdev_release,
> > };
> > 
> > static int vhost_mdev_probe(struct device *dev)
> > {
> > struct mdev_device *mdev = to_mdev_device(dev);
> > 
> > ... Check the mdev device_id proposed in ...
> > ... https://lkml.org/lkml/2019/9/12/151 ...
> 
> 
> To clarify, this should be done through the id_table field in
> vhost_mdev_driver, and it should claim that it supports virtio-mdev devices
> only:
> 
> 
> static struct mdev_class_id id_table[] = {
>     { MDEV_ID_VIRTIO },
>     { 0 },
> };
> 
> 
> static struct mdev_driver vhost_mdev_driver = {
>     ...
>     .id_table = id_table,
> }

In this way, both virtio-mdev and vhost-mdev will try to
take this device. We may want a way to let vhost-mdev take this
device only when users explicitly ask for it. Or maybe we
can have a different MDEV_ID for vhost-mdev but share the device
ops with virtio-mdev.
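The second option is roughly what the later vhost-mdev postings in this
thread end up doing: the parent registers the same virtio_mdev_device_ops,
but under a vhost-specific class id, and the vhost-mdev driver only matches
that id, so virtio-mdev won't grab the device (the constant is spelled
MDEV_ID_VHOST / MDEV_CLASS_ID_VHOST in different postings). In the form it
takes there:

static struct mdev_class_id id_table[] = {
	{ MDEV_ID_VHOST },
	{ 0 },
};

static struct mdev_driver vhost_mdev_driver = {
	.name     = "vhost_mdev",
	.probe    = vhost_mdev_probe,
	.remove   = vhost_mdev_remove,
	.id_table = id_table,
};
MODULE_DEVICE_TABLE(mdev, id_table);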

> 
> 
> > 
> > return vfio_add_group_dev(dev, &vfio_vhost_mdev_dev_ops, mdev);
> 
> 
> And in vfio_vhost_mdev_ops, all it needs is to just implement the vhost-net
> ioctls and translate them to the virtio-mdev transport API (e.g. the
> device_ops I proposed, ioctls, or whatever other method).

I see, so my previous understanding is basically correct:

https://lkml.org/lkml/2019/9/17/332

I.e. we won't have a separate vhost fd and we will do all 

Re: [RFC v4 0/3] vhost: introduce mdev based hardware backend

2019-09-19 Thread Tiwei Bie
On Thu, Sep 19, 2019 at 09:08:11PM +0800, Jason Wang wrote:
> On 2019/9/18 下午10:32, Michael S. Tsirkin wrote:
> > > > > So I have some questions:
> > > > > 
> > > > > 1) Compared to method 2, what's the advantage of creating a new vhost 
> > > > > char
> > > > > device? I guess it's for keep the API compatibility?
> > > > One benefit is that we can avoid doing vhost ioctls on
> > > > VFIO device fd.
> > > Yes, but any benefit from doing this?
> > It does seem a bit more modular, but it's certainly not a big deal.
> 
> Ok, if we go this way, it could be as simple as providing some callbacks to
> vhost; then vhost can just forward the ioctls through parent_ops.
> 
> > 
> > > > > 2) For method 2, is there any easy way for user/admin to distinguish 
> > > > > e.g
> > > > > ordinary vfio-mdev for vhost from ordinary vfio-mdev?
> > > > I think device-api could be a choice.
> > > Ok.
> > > 
> > > 
> > > > > I saw you introduce
> > > > > ops matching helper but it's not friendly to management.
> > > > The ops matching helper is just to check whether a given
> > > > vfio-device is based on a mdev device.
> > > > 
> > > > > 3) A drawback of 1) and 2) is that it must follow vfio_device_ops that
> > > > > assumes the parameter comes from userspace, it prevents support kernel
> > > > > virtio drivers.
> > > > > 
> > > > > 4) So comes the idea of method 3, since it register a new vhost-mdev 
> > > > > driver,
> > > > > we can use device specific ops instead of VFIO ones, then we can have 
> > > > > a
> > > > > common API between vDPA parent and vhost-mdev/virtio-mdev drivers.
> > > > As the above draft shows, this requires introducing a new
> > > > VFIO device driver. I think Alex's opinion matters here.
> 
> Just to clarify, a new type of mdev driver that provides dummy
> vfio_device_ops for VFIO to make the container DMA ioctls work.

I see. Thanks! IIUC, you mean we can provide a very tiny
VFIO device driver in drivers/vhost/mdev.c, e.g.:

static int vfio_vhost_mdev_open(void *device_data)
{
if (!try_module_get(THIS_MODULE))
return -ENODEV;
return 0;
}

static void vfio_vhost_mdev_release(void *device_data)
{
module_put(THIS_MODULE);
}

static const struct vfio_device_ops vfio_vhost_mdev_dev_ops = {
.name   = "vfio-vhost-mdev",
.open   = vfio_vhost_mdev_open,
.release= vfio_vhost_mdev_release,
};

static int vhost_mdev_probe(struct device *dev)
{
struct mdev_device *mdev = to_mdev_device(dev);

... Check the mdev device_id proposed in ...
... https://lkml.org/lkml/2019/9/12/151 ...

return vfio_add_group_dev(dev, &vfio_vhost_mdev_dev_ops, mdev);
}

static void vhost_mdev_remove(struct device *dev)
{
vfio_del_group_dev(dev);
}

static struct mdev_driver vhost_mdev_driver = {
.name   = "vhost_mdev",
.probe  = vhost_mdev_probe,
.remove = vhost_mdev_remove,
};

So we can bind the above mdev driver to the virtio-mdev compatible
mdev devices when we want to use vhost-mdev.

After binding the above driver to the mdev device, we can set up the IOMMU
via VFIO, get the VFIO device fd of this mdev device, and pass it
to the vhost fd (/dev/vhost-mdev) with a SET_BACKEND ioctl.

Thanks,
Tiwei

> 
> Thanks
> 
> 
> > > Yes, it is.
> > > 
> > > Thanks
> > > 
> > > 


Re: [RFC v4 0/3] vhost: introduce mdev based hardware backend

2019-09-17 Thread Tiwei Bie
On Tue, Sep 17, 2019 at 11:32:03AM +0800, Jason Wang wrote:
> On 2019/9/17 上午9:02, Tiwei Bie wrote:
> > This RFC is to demonstrate below ideas,
> > 
> > a) Build vhost-mdev on top of the same abstraction defined in
> > the virtio-mdev series [1];
> > 
> > b) Introduce /dev/vhost-mdev to do vhost ioctls and support
> > setting mdev device as backend;
> > 
> > Now the userspace API looks like this:
> > 
> > - Userspace generates a compatible mdev device;
> > 
> > - Userspace opens this mdev device with VFIO API (including
> >doing IOMMU programming for this mdev device with VFIO's
> >container/group based interface);
> > 
> > - Userspace opens /dev/vhost-mdev and gets vhost fd;
> > 
> > - Userspace uses vhost ioctls to setup vhost (userspace should
> >do VHOST_MDEV_SET_BACKEND ioctl with VFIO group fd and device
> >fd first before doing other vhost ioctls);
> > 
> > Only compile test has been done for this series for now.
> 
> 
> Have a hard thought on the architecture:

Thanks a lot! Do appreciate it!

> 
> 1) Create a vhost char device, pass the vfio mdev device fd to it as a
> backend, and translate vhost-mdev ioctls to the virtio-mdev transport (e.g.
> read/write). DMA is done through the VFIO DMA mapping on the container that
> is attached.

Yeah, that's what we are doing in this series.

> 
> We have two more choices:
> 
> 2) Use vfio-mdev but do not create a vhost-mdev device; instead, just
> implement the vhost ioctls on vfio_device_ops, and translate them into the
> virtio-mdev transport or just pass the ioctls to the parent.

Yeah. Instead of introducing /dev/vhost-mdev char device, do
vhost ioctls on VFIO device fd directly. That's what we did
in RFC v3.

> 
> 3) Don't use vfio-mdev; create a new vhost-mdev driver, and during probe
> still try to add the dev to the vfio group and talk to the parent with
> device-specific ops.

If my understanding is correct, this means we need to introduce
a new VFIO device driver to replace the existing vfio-mdev driver
in our case. Below is a quick draft just to show my understanding:

#include 
#include 
#include 
#include 
#include 
#include 
#include 

#include "mdev_private.h"

/* XXX: we need a proper way to include below vhost header. */
#include "../../vhost/vhost.h"

static int vfio_vhost_mdev_open(void *device_data)
{
if (!try_module_get(THIS_MODULE))
return -ENODEV;

/* ... */
vhost_dev_init(...);

return 0;
}

static void vfio_vhost_mdev_release(void *device_data)
{
/* ... */
module_put(THIS_MODULE);
}

static long vfio_vhost_mdev_unlocked_ioctl(void *device_data,
   unsigned int cmd, unsigned long arg)
{
struct mdev_device *mdev = device_data;
struct mdev_parent *parent = mdev->parent;

/*
 * Use vhost ioctls.
 *
 * We will have a different parent_ops design.
 * And potentially, we can share the same parent_ops
 * with virtio_mdev.
 */
switch (cmd) {
case VHOST_GET_FEATURES:
parent->ops->get_features(mdev, ...);
break;
/* ... */
}

return 0;
}

static ssize_t vfio_vhost_mdev_read(void *device_data, char __user *buf,
size_t count, loff_t *ppos)
{
/* ... */
return 0;
}

static ssize_t vfio_vhost_mdev_write(void *device_data, const char __user *buf,
 size_t count, loff_t *ppos)
{
/* ... */
return 0;
}

static int vfio_vhost_mdev_mmap(void *device_data, struct vm_area_struct *vma)
{
/* ... */
return 0;
}

static const struct vfio_device_ops vfio_vhost_mdev_dev_ops = {
.name   = "vfio-vhost-mdev",
.open   = vfio_vhost_mdev_open,
.release= vfio_vhost_mdev_release,
.ioctl  = vfio_vhost_mdev_unlocked_ioctl,
.read   = vfio_vhost_mdev_read,
.write  = vfio_vhost_mdev_write,
.mmap   = vfio_vhost_mdev_mmap,
};

static int vfio_vhost_mdev_probe(struct device *dev)
{
struct mdev_device *mdev = to_mdev_device(dev);

/* ... */
return vfio_add_group_dev(dev, &vfio_vhost_mdev_dev_ops, mdev);
}

static void vfio_vhost_mdev_remove(struct device *dev)
{
/* ... */
vfio_del_group_dev(dev);
}

static struct mdev_driver vfio_vhost_mdev_driver = {
.name   = "vfio_vhost_mdev",
.probe  = vfio_vhost_mdev_probe,
.remove = vfio_vhost_mdev_remove,
};

static int __init vfio_vhost_mdev_init(void)
{
return mdev_register_driver(&vfio_vhost_mdev_driver, THIS_MODULE);
}
module_init(vfio_vhost_mdev_init)

static void __exit vfio_vhost

[RFC v4 3/3] vhost: introduce mdev based hardware backend

2019-09-16 Thread Tiwei Bie
More details about this patch can be found in the cover
letter. Only compile testing has been done so far.

Signed-off-by: Tiwei Bie 
---
 drivers/vhost/Kconfig|   9 +
 drivers/vhost/Makefile   |   3 +
 drivers/vhost/mdev.c | 462 +++
 drivers/vhost/vhost.c|  39 ++-
 drivers/vhost/vhost.h|   6 +
 include/uapi/linux/vhost.h   |  10 +
 include/uapi/linux/vhost_types.h |   5 +
 7 files changed, 528 insertions(+), 6 deletions(-)
 create mode 100644 drivers/vhost/mdev.c

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index 3d03ccbd1adc..ef9783156d2e 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -34,6 +34,15 @@ config VHOST_VSOCK
To compile this driver as a module, choose M here: the module will be 
called
vhost_vsock.
 
+config VHOST_MDEV
+   tristate "Mediated device based hardware vhost accelerator"
+   depends on EVENTFD && VFIO && VFIO_MDEV
+   select VHOST
+   default n
+   ---help---
+   Say Y here to enable the vhost_mdev module
+   for use with hardware vhost accelerators
+
 config VHOST
tristate
---help---
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index 6c6df24f770c..ad9c0f8c6d8c 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -10,4 +10,7 @@ vhost_vsock-y := vsock.o
 
 obj-$(CONFIG_VHOST_RING) += vringh.o
 
+obj-$(CONFIG_VHOST_MDEV) += vhost_mdev.o
+vhost_mdev-y := mdev.o
+
 obj-$(CONFIG_VHOST) += vhost.o
diff --git a/drivers/vhost/mdev.c b/drivers/vhost/mdev.c
new file mode 100644
index ..8c6597aff45e
--- /dev/null
+++ b/drivers/vhost/mdev.c
@@ -0,0 +1,462 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2018-2019 Intel Corporation.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "vhost.h"
+
+struct vhost_mdev {
+   struct mutex mutex;
+   struct vhost_dev dev;
+   struct vhost_virtqueue *vqs;
+   int nvqs;
+   u64 state;
+   u64 features;
+   u64 acked_features;
+   struct vfio_group *vfio_group;
+   struct vfio_device *vfio_device;
+   struct mdev_device *mdev;
+};
+
+/*
+ * XXX
+ * We assume virtio_mdev.ko exposes below symbols for now, as we
+ * don't have a proper way to access parent ops directly yet.
+ *
+ * virtio_mdev_readl()
+ * virtio_mdev_writel()
+ */
+extern u32 virtio_mdev_readl(struct mdev_device *mdev, loff_t off);
+extern void virtio_mdev_writel(struct mdev_device *mdev, loff_t off, u32 val);
+
+static u8 mdev_get_status(struct mdev_device *mdev)
+{
+   return virtio_mdev_readl(mdev, VIRTIO_MDEV_STATUS);
+}
+
+static void mdev_set_status(struct mdev_device *mdev, u8 status)
+{
+   virtio_mdev_writel(mdev, VIRTIO_MDEV_STATUS, status);
+}
+
+static void mdev_add_status(struct mdev_device *mdev, u8 status)
+{
+   status |= mdev_get_status(mdev);
+   mdev_set_status(mdev, status);
+}
+
+static void mdev_reset(struct mdev_device *mdev)
+{
+   mdev_set_status(mdev, 0);
+}
+
+static void handle_vq_kick(struct vhost_work *work)
+{
+   struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
+ poll.work);
+   struct vhost_mdev *m = container_of(vq->dev, struct vhost_mdev, dev);
+
+   virtio_mdev_writel(m->mdev, VIRTIO_MDEV_QUEUE_NOTIFY, vq - m->vqs);
+}
+
+static long vhost_mdev_start_backend(struct vhost_mdev *m)
+{
+   struct mdev_device *mdev = m->mdev;
+   u64 features = m->acked_features;
+   u64 addr;
+   struct vhost_virtqueue *vq;
+   int queue_id;
+
+   features |= 1ULL << VIRTIO_F_IOMMU_PLATFORM;
+
+   virtio_mdev_writel(mdev, VIRTIO_MDEV_DRIVER_FEATURES_SEL, 1);
+   virtio_mdev_writel(mdev, VIRTIO_MDEV_DRIVER_FEATURES,
+  (u32)(features >> 32));
+
+   virtio_mdev_writel(mdev, VIRTIO_MDEV_DRIVER_FEATURES_SEL, 0);
+   virtio_mdev_writel(mdev, VIRTIO_MDEV_DRIVER_FEATURES,
+  (u32)features);
+
+   mdev_add_status(mdev, VIRTIO_CONFIG_S_FEATURES_OK);
+   if (!(mdev_get_status(mdev) & VIRTIO_CONFIG_S_FEATURES_OK))
+   return -ENODEV;
+
+   for (queue_id = 0; queue_id < m->nvqs; queue_id++) {
+   vq = &m->vqs[queue_id];
+
+   if (!vq->desc || !vq->avail || !vq->used)
+   break;
+
+   virtio_mdev_writel(mdev, VIRTIO_MDEV_QUEUE_NUM, vq->num);
+
+   if (!vhost_translate_ring_addr(vq, (u64)vq->desc,
+  vhost_get_desc_size(vq, vq->num),
+  &addr))
+   return -EINVAL;
+
+   virtio_mdev_writel(mdev, VIRTIO_MDEV_QUEUE_DESC_LOW, addr);
+   virtio

[RFC v4 2/3] vfio: support checking vfio driver by device ops

2019-09-16 Thread Tiwei Bie
This patch introduces the support for checking the VFIO driver
by device ops. And vfio-mdev's device ops is also exported to
make it possible to check whether a VFIO device is based on a
mdev device.

Signed-off-by: Tiwei Bie 
---
 drivers/vfio/mdev/vfio_mdev.c | 3 ++-
 drivers/vfio/vfio.c   | 7 +++
 include/linux/vfio.h  | 7 +++
 3 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/mdev/vfio_mdev.c b/drivers/vfio/mdev/vfio_mdev.c
index 30964a4e0a28..e0f31c5a5db2 100644
--- a/drivers/vfio/mdev/vfio_mdev.c
+++ b/drivers/vfio/mdev/vfio_mdev.c
@@ -98,7 +98,7 @@ static int vfio_mdev_mmap(void *device_data, struct 
vm_area_struct *vma)
return parent->ops->mmap(mdev, vma);
 }
 
-static const struct vfio_device_ops vfio_mdev_dev_ops = {
+const struct vfio_device_ops vfio_mdev_dev_ops = {
.name   = "vfio-mdev",
.open   = vfio_mdev_open,
.release= vfio_mdev_release,
@@ -107,6 +107,7 @@ static const struct vfio_device_ops vfio_mdev_dev_ops = {
.write  = vfio_mdev_write,
.mmap   = vfio_mdev_mmap,
 };
+EXPORT_SYMBOL_GPL(vfio_mdev_dev_ops);
 
 static int vfio_mdev_probe(struct device *dev)
 {
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 697fd079bb3f..1145110909e4 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1806,6 +1806,13 @@ long vfio_external_check_extension(struct vfio_group 
*group, unsigned long arg)
 }
 EXPORT_SYMBOL_GPL(vfio_external_check_extension);
 
+bool vfio_device_ops_match(struct vfio_device *device,
+  const struct vfio_device_ops *ops)
+{
+   return device->ops == ops;
+}
+EXPORT_SYMBOL_GPL(vfio_device_ops_match);
+
 /**
  * Sub-module support
  */
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index e75b24fd7c5c..741c5bb567a8 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -56,6 +56,8 @@ extern struct vfio_device *vfio_device_get_from_fd(struct 
vfio_group *group,
   int device_fd);
 extern void vfio_device_put(struct vfio_device *device);
 extern void *vfio_device_data(struct vfio_device *device);
+extern bool vfio_device_ops_match(struct vfio_device *device,
+ const struct vfio_device_ops *ops);
 
 /**
  * struct vfio_iommu_driver_ops - VFIO IOMMU driver callbacks
@@ -199,4 +201,9 @@ extern int vfio_virqfd_enable(void *opaque,
  void *data, struct virqfd **pvirqfd, int fd);
 extern void vfio_virqfd_disable(struct virqfd **pvirqfd);
 
+/*
+ * VFIO device ops
+ */
+extern const struct vfio_device_ops vfio_mdev_dev_ops;
+
 #endif /* VFIO_H */
-- 
2.17.1



[RFC v4 1/3] vfio: support getting vfio device from device fd

2019-09-16 Thread Tiwei Bie
This patch introduces the support for getting VFIO device
from VFIO device fd. With this support, it's possible for
vhost to get VFIO device from the group fd and device fd
set by the userspace.

Signed-off-by: Tiwei Bie 
---
 drivers/vfio/vfio.c  | 25 +
 include/linux/vfio.h |  4 
 2 files changed, 29 insertions(+)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 388597930b64..697fd079bb3f 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -890,6 +890,31 @@ static struct vfio_device 
*vfio_device_get_from_name(struct vfio_group *group,
return device;
 }
 
+struct vfio_device *vfio_device_get_from_fd(struct vfio_group *group,
+   int device_fd)
+{
+   struct fd f;
+   struct vfio_device *it, *device = ERR_PTR(-ENODEV);
+
+   f = fdget(device_fd);
+   if (!f.file)
+   return ERR_PTR(-EBADF);
+
+   mutex_lock(&group->device_lock);
+   list_for_each_entry(it, &group->device_list, group_next) {
+   if (it == f.file->private_data) {
+   device = it;
+   vfio_device_get(device);
+   break;
+   }
+   }
+   mutex_unlock(&group->device_lock);
+
+   fdput(f);
+   return device;
+}
+EXPORT_SYMBOL_GPL(vfio_device_get_from_fd);
+
 /*
  * Caller must hold a reference to the vfio_device
  */
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index e42a711a2800..e75b24fd7c5c 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -15,6 +15,8 @@
 #include 
 #include 
 
+struct vfio_group;
+
 /**
  * struct vfio_device_ops - VFIO bus driver device callbacks
  *
@@ -50,6 +52,8 @@ extern int vfio_add_group_dev(struct device *dev,
 
 extern void *vfio_del_group_dev(struct device *dev);
 extern struct vfio_device *vfio_device_get_from_dev(struct device *dev);
+extern struct vfio_device *vfio_device_get_from_fd(struct vfio_group *group,
+  int device_fd);
 extern void vfio_device_put(struct vfio_device *device);
 extern void *vfio_device_data(struct vfio_device *device);
 
-- 
2.17.1



[RFC v4 0/3] vhost: introduce mdev based hardware backend

2019-09-16 Thread Tiwei Bie
This RFC is to demonstrate below ideas,

a) Build vhost-mdev on top of the same abstraction defined in
   the virtio-mdev series [1];

b) Introduce /dev/vhost-mdev to do vhost ioctls and support
   setting mdev device as backend;

Now the userspace API looks like this:

- Userspace generates a compatible mdev device;

- Userspace opens this mdev device with VFIO API (including
  doing IOMMU programming for this mdev device with VFIO's
  container/group based interface);

- Userspace opens /dev/vhost-mdev and gets vhost fd;

- Userspace uses vhost ioctls to setup vhost (userspace should
  do VHOST_MDEV_SET_BACKEND ioctl with VFIO group fd and device
  fd first before doing other vhost ioctls);

Only compile test has been done for this series for now.
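For completeness, a rough userspace sketch of the flow above follows; error
handling is omitted, the VFIO calls are the standard container/group API,
and the backend struct layout is only a guess at the proposed
VHOST_MDEV_SET_BACKEND argument, not a definition from this series:

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>
#include <linux/vhost.h>

static int setup_vhost_mdev(const char *group_path, const char *mdev_uuid)
{
	int container = open("/dev/vfio/vfio", O_RDWR);
	int group = open(group_path, O_RDWR);	/* e.g. "/dev/vfio/12" */
	__u64 features;

	ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
	ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1v2_IOMMU);
	/* ... VFIO_IOMMU_MAP_DMA calls to program the IOMMU ... */

	int device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, mdev_uuid);

	int vhost = open("/dev/vhost-mdev", O_RDWR);
	struct { int group_fd; int device_fd; } backend = { group, device };
	ioctl(vhost, VHOST_MDEV_SET_BACKEND, &backend);	/* proposed ioctl */

	/* From here on the usual vhost ioctls apply. */
	ioctl(vhost, VHOST_GET_FEATURES, &features);
	return vhost;
}

(The later [PATCH]/[PATCH v2] postings in this thread drop the separate
/dev/vhost-mdev fd and issue the vhost ioctls directly on the VFIO device fd.)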

RFCv3: https://patchwork.kernel.org/patch/7785/

[1] https://lkml.org/lkml/2019/9/10/135

Tiwei Bie (3):
  vfio: support getting vfio device from device fd
  vfio: support checking vfio driver by device ops
  vhost: introduce mdev based hardware backend

 drivers/vfio/mdev/vfio_mdev.c|   3 +-
 drivers/vfio/vfio.c  |  32 +++
 drivers/vhost/Kconfig|   9 +
 drivers/vhost/Makefile   |   3 +
 drivers/vhost/mdev.c | 462 +++
 drivers/vhost/vhost.c|  39 ++-
 drivers/vhost/vhost.h|   6 +
 include/linux/vfio.h |  11 +
 include/uapi/linux/vhost.h   |  10 +
 include/uapi/linux/vhost_types.h |   5 +
 10 files changed, 573 insertions(+), 7 deletions(-)
 create mode 100644 drivers/vhost/mdev.c

-- 
2.17.1



Re: [RFC PATCH 3/4] virtio: introudce a mdev based transport

2019-09-10 Thread Tiwei Bie
On Wed, Sep 11, 2019 at 10:52:03AM +0800, Jason Wang wrote:
> On 2019/9/11 上午9:47, Tiwei Bie wrote:
> > On Tue, Sep 10, 2019 at 04:19:34PM +0800, Jason Wang wrote:
> > > This patch introduces a new mdev transport for virtio. This is used to
> > > use kernel virtio driver to drive the mediated device that is capable
> > > of populating virtqueue directly.
> > > 
> > > A new virtio-mdev driver will be registered to the mdev bus, when a
> > > new virtio-mdev device is probed, it will register the device with
> > > mdev based config ops. This means, unlike the exist hardware
> > > transport, this is a software transport between mdev driver and mdev
> > > device. The transport was implemented through:
> > > 
> > > - configuration access was implemented through parent_ops->read()/write()
> > > - vq/config callback was implemented through parent_ops->ioctl()
> > > 
> > > This transport is derived from virtio MMIO protocol and was wrote for
> > > kernel driver. But for the transport itself, but the design goal is to
> > > be generic enough to support userspace driver (this part will be added
> > > in the future).
> > > 
> > > Note:
> > > - current mdev assume all the parameter of parent_ops was from
> > >userspace. This prevents us from implementing the kernel mdev
> > >driver. For a quick POC, this patch just abuse those parameter and
> > >assume the mdev device implementation will treat them as kernel
> > >pointer. This should be addressed in the formal series by extending
> > >mdev_parent_ops.
> > > - for a quick POC, I just drive the transport from MMIO, I'm pretty sure
> > > there's a lot of optimization space for this.
> > > 
> > > Signed-off-by: Jason Wang 
> > > ---
> > >   drivers/vfio/mdev/Kconfig|   7 +
> > >   drivers/vfio/mdev/Makefile   |   1 +
> > >   drivers/vfio/mdev/virtio_mdev.c  | 500 +++
> > >   include/uapi/linux/virtio_mdev.h | 131 
> > >   4 files changed, 639 insertions(+)
> > >   create mode 100644 drivers/vfio/mdev/virtio_mdev.c
> > >   create mode 100644 include/uapi/linux/virtio_mdev.h
> > > 
> > [...]
> > > diff --git a/include/uapi/linux/virtio_mdev.h 
> > > b/include/uapi/linux/virtio_mdev.h
> > > new file mode 100644
> > > index ..8040de6b960a
> > > --- /dev/null
> > > +++ b/include/uapi/linux/virtio_mdev.h
> > > @@ -0,0 +1,131 @@
> > > +/*
> > > + * Virtio mediated device driver
> > > + *
> > > + * Copyright 2019, Red Hat Corp.
> > > + *
> > > + * Based on Virtio MMIO driver by ARM Ltd, copyright ARM Ltd. 2011
> > > + *
> > > + * This header is BSD licensed so anyone can use the definitions to 
> > > implement
> > > + * compatible drivers/servers.
> > > + *
> > > + * Redistribution and use in source and binary forms, with or without
> > > + * modification, are permitted provided that the following conditions
> > > + * are met:
> > > + * 1. Redistributions of source code must retain the above copyright
> > > + *notice, this list of conditions and the following disclaimer.
> > > + * 2. Redistributions in binary form must reproduce the above copyright
> > > + *notice, this list of conditions and the following disclaimer in the
> > > + *documentation and/or other materials provided with the 
> > > distribution.
> > > + * 3. Neither the name of IBM nor the names of its contributors
> > > + *may be used to endorse or promote products derived from this 
> > > software
> > > + *without specific prior written permission.
> > > + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 
> > > ``AS IS'' AND
> > > + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
> > > + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR 
> > > PURPOSE
> > > + * ARE DISCLAIMED.  IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
> > > + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR 
> > > CONSEQUENTIAL
> > > + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE 
> > > GOODS
> > > + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> > > + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, 
> > > STRICT
> > > + * LIABILITY, OR TORT (IN

Re: [RFC PATCH 3/4] virtio: introduce a mdev based transport

2019-09-10 Thread Tiwei Bie
On Tue, Sep 10, 2019 at 04:19:34PM +0800, Jason Wang wrote:
> This patch introduces a new mdev transport for virtio. It is used to
> let the kernel virtio driver drive a mediated device that is capable
> of populating virtqueues directly.
> 
> A new virtio-mdev driver will be registered to the mdev bus; when a
> new virtio-mdev device is probed, it will register the device with
> mdev based config ops. This means, unlike the existing hardware
> transports, this is a software transport between the mdev driver and
> the mdev device. The transport is implemented through:
> 
> - configuration access is implemented through parent_ops->read()/write()
> - vq/config callbacks are implemented through parent_ops->ioctl()
> 
> This transport is derived from the virtio MMIO protocol and was written
> for the kernel driver, but the design goal is to be generic enough to
> support userspace drivers (this part will be added in the future).
> 
> Note:
> - current mdev assumes all the parameters of parent_ops come from
>   userspace. This prevents us from implementing a kernel mdev
>   driver. For a quick POC, this patch just abuses those parameters and
>   assumes the mdev device implementation will treat them as kernel
>   pointers. This should be addressed in the formal series by extending
>   mdev_parent_ops.
> - for a quick POC, I just derived the transport from MMIO; I'm pretty
>   sure there's a lot of optimization space for this.
> 
> Signed-off-by: Jason Wang 
> ---
>  drivers/vfio/mdev/Kconfig|   7 +
>  drivers/vfio/mdev/Makefile   |   1 +
>  drivers/vfio/mdev/virtio_mdev.c  | 500 +++
>  include/uapi/linux/virtio_mdev.h | 131 
>  4 files changed, 639 insertions(+)
>  create mode 100644 drivers/vfio/mdev/virtio_mdev.c
>  create mode 100644 include/uapi/linux/virtio_mdev.h
> 
[...]
> diff --git a/include/uapi/linux/virtio_mdev.h 
> b/include/uapi/linux/virtio_mdev.h
> new file mode 100644
> index ..8040de6b960a
> --- /dev/null
> +++ b/include/uapi/linux/virtio_mdev.h
> @@ -0,0 +1,131 @@
> +/*
> + * Virtio mediated device driver
> + *
> + * Copyright 2019, Red Hat Corp.
> + *
> + * Based on Virtio MMIO driver by ARM Ltd, copyright ARM Ltd. 2011
> + *
> + * This header is BSD licensed so anyone can use the definitions to implement
> + * compatible drivers/servers.
> + *
> + * Redistribution and use in source and binary forms, with or without
> + * modification, are permitted provided that the following conditions
> + * are met:
> + * 1. Redistributions of source code must retain the above copyright
> + *notice, this list of conditions and the following disclaimer.
> + * 2. Redistributions in binary form must reproduce the above copyright
> + *notice, this list of conditions and the following disclaimer in the
> + *documentation and/or other materials provided with the distribution.
> + * 3. Neither the name of IBM nor the names of its contributors
> + *may be used to endorse or promote products derived from this software
> + *without specific prior written permission.
> + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS 
> IS'' AND
> + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
> + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
> + * ARE DISCLAIMED.  IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
> + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
> + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
> + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
> + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
> + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
> + * SUCH DAMAGE.
> + */
> +#ifndef _LINUX_VIRTIO_MDEV_H
> +#define _LINUX_VIRTIO_MDEV_H
> +
> +#include 
> +#include 
> +#include 
> +
> +/*
> + * Ioctls
> + */
> +
> +struct virtio_mdev_callback {
> + irqreturn_t (*callback)(void *);
> + void *private;
> +};
> +
> +#define VIRTIO_MDEV 0xAF
> +#define VIRTIO_MDEV_SET_VQ_CALLBACK _IOW(VIRTIO_MDEV, 0x00, \
> +  struct virtio_mdev_callback)
> +#define VIRTIO_MDEV_SET_CONFIG_CALLBACK _IOW(VIRTIO_MDEV, 0x01, \
> + struct virtio_mdev_callback)
> +
> +#define VIRTIO_MDEV_DEVICE_API_STRING        "virtio-mdev"
> +
> +/*
> + * Control registers
> + */
> +
> +/* Magic value ("virt" string) - Read Only */
> +#define VIRTIO_MDEV_MAGIC_VALUE  0x000
> +
> +/* Virtio device version - Read Only */
> +#define VIRTIO_MDEV_VERSION  0x004
> +
> +/* Virtio device ID - Read Only */
> +#define VIRTIO_MDEV_DEVICE_ID        0x008
> +
> +/* Virtio vendor ID - Read Only */
> +#define VIRTIO_MDEV_VENDOR_ID        0x00c
> +
> +/* Bitmask of the features 

Re: [RFC v3] vhost: introduce mdev based hardware vhost backend

2019-09-03 Thread Tiwei Bie
On Tue, Sep 03, 2019 at 07:26:03AM -0400, Michael S. Tsirkin wrote:
> On Wed, Aug 28, 2019 at 01:37:12PM +0800, Tiwei Bie wrote:
> > Details about this can be found here:
> > 
> > https://lwn.net/Articles/750770/
> > 
> > What's new in this version
> > ==
> > 
> > There are three choices based on the discussion [1] in RFC v2:
> > 
> > > #1. We expose a VFIO device, so we can reuse the VFIO container/group
> > > based DMA API and potentially reuse a lot of VFIO code in QEMU.
> > >
> > > But in this case, we have two choices for the VFIO device interface
> > > (i.e. the interface on top of VFIO device fd):
> > >
> > > A) we may invent a new vhost protocol (as demonstrated by the code
> > >in this RFC) on VFIO device fd to make it work in VFIO's way,
> > >i.e. regions and irqs.
> > >
> > > B) Or as you proposed, instead of inventing a new vhost protocol,
> > >we can reuse most existing vhost ioctls on the VFIO device fd
> > >directly. There should be no conflicts between the VFIO ioctls
> > >(type is 0x3B) and VHOST ioctls (type is 0xAF) currently.
> > >
> > > #2. Instead of exposing a VFIO device, we may expose a VHOST device.
> > > And we will introduce a new mdev driver vhost-mdev to do this.
> > > It would be natural to reuse the existing kernel vhost interface
> > > (ioctls) on it as much as possible. But we will need to invent
> > > some APIs for DMA programming (reusing VHOST_SET_MEM_TABLE is a
> > > choice, but it's too heavy and doesn't support vIOMMU by itself).
> > 
> > This version is more like a quick PoC to try Jason's proposal on
> > reusing vhost ioctls. And the second way (#1/B) in above three
> > choices was chosen in this version to demonstrate the idea quickly.
> > 
> > Now the userspace API looks like this:
> > 
> > - VFIO's container/group based IOMMU API is used to do the
> >   DMA programming.
> > 
> > - Vhost's existing ioctls are used to setup the device.
> > 
> > And the device will report device_api as "vfio-vhost".
> > 
> > Note that, there are dirty hacks in this version. If we decide to
> > go this way, some refactoring in vhost.c/vhost.h may be needed.
> > 
> > PS. The direct mapping of the notify registers isn't implemented
> > in this version.
> > 
> > [1] https://lkml.org/lkml/2019/7/9/101
> > 
> > Signed-off-by: Tiwei Bie 
> 
> 
> 
> > +long vhost_mdev_ioctl(struct mdev_device *mdev, unsigned int cmd,
> > + unsigned long arg)
> > +{
> > +   void __user *argp = (void __user *)arg;
> > +   struct vhost_mdev *vdpa;
> > +   unsigned long minsz;
> > +   int ret = 0;
> > +
> > +   if (!mdev)
> > +   return -EINVAL;
> > +
> > +   vdpa = mdev_get_drvdata(mdev);
> > +   if (!vdpa)
> > +   return -ENODEV;
> > +
> > +   switch (cmd) {
> > +   case VFIO_DEVICE_GET_INFO:
> > +   {
> > +   struct vfio_device_info info;
> > +
> > +   minsz = offsetofend(struct vfio_device_info, num_irqs);
> > +
> > +   if (copy_from_user(&info, (void __user *)arg, minsz)) {
> > +   ret = -EFAULT;
> > +   break;
> > +   }
> > +
> > +   if (info.argsz < minsz) {
> > +   ret = -EINVAL;
> > +   break;
> > +   }
> > +
> > +   info.flags = VFIO_DEVICE_FLAGS_VHOST;
> > +   info.num_regions = 0;
> > +   info.num_irqs = 0;
> > +
> > +   if (copy_to_user((void __user *)arg, &info, minsz)) {
> > +   ret = -EFAULT;
> > +   break;
> > +   }
> > +
> > +   break;
> > +   }
> > +   case VFIO_DEVICE_GET_REGION_INFO:
> > +   case VFIO_DEVICE_GET_IRQ_INFO:
> > +   case VFIO_DEVICE_SET_IRQS:
> > +   case VFIO_DEVICE_RESET:
> > +   ret = -EINVAL;
> > +   break;
> > +
> > +   case VHOST_MDEV_SET_STATE:
> > +   ret = vhost_set_state(vdpa, argp);
> > +   break;
> > +   case VHOST_GET_FEATURES:
> > +   ret = vhost_get_features(vdpa, argp);
> > +   break;
> > +   case VHOST_SET_FEATURES:
> > +   ret = vhost_set_features(vdpa, argp);
> > +   break;
> > +   case VHOST_GET_VRING_BASE:
> > +

Re: [RFC v3] vhost: introduce mdev based hardware vhost backend

2019-09-02 Thread Tiwei Bie
On Mon, Sep 02, 2019 at 12:15:05PM +0800, Jason Wang wrote:
> On 2019/8/28 下午1:37, Tiwei Bie wrote:
> > Details about this can be found here:
> > 
> > https://lwn.net/Articles/750770/
> > 
> > What's new in this version
> > ==
> > 
> > There are three choices based on the discussion [1] in RFC v2:
> > 
> > > #1. We expose a VFIO device, so we can reuse the VFIO container/group
> > >  based DMA API and potentially reuse a lot of VFIO code in QEMU.
> > > 
> > >  But in this case, we have two choices for the VFIO device interface
> > >  (i.e. the interface on top of VFIO device fd):
> > > 
> > >  A) we may invent a new vhost protocol (as demonstrated by the code
> > > in this RFC) on VFIO device fd to make it work in VFIO's way,
> > > i.e. regions and irqs.
> > > 
> > >  B) Or as you proposed, instead of inventing a new vhost protocol,
> > > we can reuse most existing vhost ioctls on the VFIO device fd
> > > directly. There should be no conflicts between the VFIO ioctls
> > > (type is 0x3B) and VHOST ioctls (type is 0xAF) currently.
> > > 
> > > #2. Instead of exposing a VFIO device, we may expose a VHOST device.
> > >  And we will introduce a new mdev driver vhost-mdev to do this.
> > >  It would be natural to reuse the existing kernel vhost interface
> > >  (ioctls) on it as much as possible. But we will need to invent
> > >  some APIs for DMA programming (reusing VHOST_SET_MEM_TABLE is a
> > >  choice, but it's too heavy and doesn't support vIOMMU by itself).
> > This version is more like a quick PoC to try Jason's proposal on
> > reusing vhost ioctls. And the second way (#1/B) in above three
> > choices was chosen in this version to demonstrate the idea quickly.
> > 
> > Now the userspace API looks like this:
> > 
> > - VFIO's container/group based IOMMU API is used to do the
> >DMA programming.
> > 
> > - Vhost's existing ioctls are used to setup the device.
> > 
> > And the device will report device_api as "vfio-vhost".
> > 
> > Note that, there are dirty hacks in this version. If we decide to
> > go this way, some refactoring in vhost.c/vhost.h may be needed.
> > 
> > PS. The direct mapping of the notify registers isn't implemented
> >  in this version.
> > 
> > [1] https://lkml.org/lkml/2019/7/9/101
> 
> 
> Thanks for the patch, see comments inline.
> 
> 
> > 
> > Signed-off-by: Tiwei Bie 
> > ---
> >   drivers/vhost/Kconfig  |   9 +
> >   drivers/vhost/Makefile |   3 +
> >   drivers/vhost/mdev.c   | 382 +
> >   include/linux/vhost_mdev.h |  58 ++
> >   include/uapi/linux/vfio.h  |   2 +
> >   include/uapi/linux/vhost.h |   8 +
> >   6 files changed, 462 insertions(+)
> >   create mode 100644 drivers/vhost/mdev.c
> >   create mode 100644 include/linux/vhost_mdev.h
[...]
> > +
> > +   break;
> > +   }
> > +   case VFIO_DEVICE_GET_REGION_INFO:
> > +   case VFIO_DEVICE_GET_IRQ_INFO:
> > +   case VFIO_DEVICE_SET_IRQS:
> > +   case VFIO_DEVICE_RESET:
> > +   ret = -EINVAL;
> > +   break;
> > +
> > +   case VHOST_MDEV_SET_STATE:
> > +   ret = vhost_set_state(vdpa, argp);
> > +   break;
> 
> 
> So this is used to start or stop the device. This means if userspace want to
> drive a network device, the API is not 100% compatible. Any blocker for
> this? E.g for SET_BACKEND, we can pass a fd and then identify the type of
> backend.

This is a legacy from the previous RFC code. I didn't try to
get rid of it while getting this POC to work. I can try to make
the vhost ioctls fully compatible with the existing userspace
if possible.

> 
> Another question is, how can user know the type of a device?

Maybe we can introduce an attribute in $UUID/ to tell the type.
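
Just to illustrate the idea: if such an attribute existed (the
attribute name "device_type" and the exact path used below are
assumptions, not something defined by this series), userspace could
check it like this:

#include <stdio.h>
#include <string.h>

int example_read_mdev_type(const char *uuid, char *buf, size_t len)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/bus/mdev/devices/%s/device_type", uuid);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';	/* strip trailing newline */
	return 0;
}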

> 
> 
> > +   case VHOST_GET_FEATURES:
> > +   ret = vhost_get_features(vdpa, argp);
> > +   break;
> > +   case VHOST_SET_FEATURES:
> > +   ret = vhost_set_features(vdpa, argp);
> > +   break;
> > +   case VHOST_GET_VRING_BASE:
> > +   ret = vhost_get_vring_base(vdpa, argp);
> > +   break;
> > +   default:
> > +   ret = vhost_dev_ioctl(&vdpa->dev, cmd, argp);
> > +   if (ret == -ENOIOCTLCMD)
> > +   ret = vhost_vring_ioctl(&vdpa->dev, cmd, 

[RFC v3] vhost: introduce mdev based hardware vhost backend

2019-08-27 Thread Tiwei Bie
Details about this can be found here:

https://lwn.net/Articles/750770/

What's new in this version
==

There are three choices based on the discussion [1] in RFC v2:

> #1. We expose a VFIO device, so we can reuse the VFIO container/group
> based DMA API and potentially reuse a lot of VFIO code in QEMU.
>
> But in this case, we have two choices for the VFIO device interface
> (i.e. the interface on top of VFIO device fd):
>
> A) we may invent a new vhost protocol (as demonstrated by the code
>in this RFC) on VFIO device fd to make it work in VFIO's way,
>i.e. regions and irqs.
>
> B) Or as you proposed, instead of inventing a new vhost protocol,
>we can reuse most existing vhost ioctls on the VFIO device fd
>directly. There should be no conflicts between the VFIO ioctls
>(type is 0x3B) and VHOST ioctls (type is 0xAF) currently.
>
> #2. Instead of exposing a VFIO device, we may expose a VHOST device.
> And we will introduce a new mdev driver vhost-mdev to do this.
> It would be natural to reuse the existing kernel vhost interface
> (ioctls) on it as much as possible. But we will need to invent
> some APIs for DMA programming (reusing VHOST_SET_MEM_TABLE is a
> choice, but it's too heavy and doesn't support vIOMMU by itself).

This version is more like a quick PoC to try Jason's proposal on
reusing vhost ioctls. And the second way (#1/B) in above three
choices was chosen in this version to demonstrate the idea quickly.

Now the userspace API looks like this:

- VFIO's container/group based IOMMU API is used to do the
  DMA programming.

- Vhost's existing ioctls are used to setup the device.

And the device will report device_api as "vfio-vhost".
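
Not part of this patch, just for reference: with the container and
group already set up, the DMA programming on the VFIO side is the
usual type1 flow, roughly:

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Map one buffer at a given IOVA for device DMA via the type1 IOMMU API. */
int example_map_for_dma(int container_fd, void *vaddr, uint64_t iova, uint64_t size)
{
	struct vfio_iommu_type1_dma_map dma_map;

	memset(&dma_map, 0, sizeof(dma_map));
	dma_map.argsz = sizeof(dma_map);
	dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
	dma_map.vaddr = (uint64_t)(uintptr_t)vaddr;
	dma_map.iova  = iova;
	dma_map.size  = size;

	return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
}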

Note that, there are dirty hacks in this version. If we decide to
go this way, some refactoring in vhost.c/vhost.h may be needed.

PS. The direct mapping of the notify registers isn't implemented
in this version.

[1] https://lkml.org/lkml/2019/7/9/101

Signed-off-by: Tiwei Bie 
---
 drivers/vhost/Kconfig  |   9 +
 drivers/vhost/Makefile |   3 +
 drivers/vhost/mdev.c   | 382 +
 include/linux/vhost_mdev.h |  58 ++
 include/uapi/linux/vfio.h  |   2 +
 include/uapi/linux/vhost.h |   8 +
 6 files changed, 462 insertions(+)
 create mode 100644 drivers/vhost/mdev.c
 create mode 100644 include/linux/vhost_mdev.h

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index 3d03ccbd1adc..2ba54fcf43b7 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -34,6 +34,15 @@ config VHOST_VSOCK
To compile this driver as a module, choose M here: the module will be 
called
vhost_vsock.
 
+config VHOST_MDEV
+   tristate "Hardware vhost accelerator abstraction"
+   depends on EVENTFD && VFIO && VFIO_MDEV
+   select VHOST
+   default n
+   ---help---
+   Say Y here to enable the vhost_mdev module
+   for use with hardware vhost accelerators
+
 config VHOST
tristate
---help---
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index 6c6df24f770c..ad9c0f8c6d8c 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -10,4 +10,7 @@ vhost_vsock-y := vsock.o
 
 obj-$(CONFIG_VHOST_RING) += vringh.o
 
+obj-$(CONFIG_VHOST_MDEV) += vhost_mdev.o
+vhost_mdev-y := mdev.o
+
 obj-$(CONFIG_VHOST)+= vhost.o
diff --git a/drivers/vhost/mdev.c b/drivers/vhost/mdev.c
new file mode 100644
index ..6bef1d9ae2e6
--- /dev/null
+++ b/drivers/vhost/mdev.c
@@ -0,0 +1,382 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2018-2019 Intel Corporation.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "vhost.h"
+
+struct vhost_mdev {
+   struct vhost_dev dev;
+   bool opened;
+   int nvqs;
+   u64 state;
+   u64 acked_features;
+   u64 features;
+   const struct vhost_mdev_device_ops *ops;
+   struct mdev_device *mdev;
+   void *private;
+   struct vhost_virtqueue vqs[];
+};
+
+static void handle_vq_kick(struct vhost_work *work)
+{
+   struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
+ poll.work);
+   struct vhost_mdev *vdpa = container_of(vq->dev, struct vhost_mdev, dev);
+
+   vdpa->ops->notify(vdpa, vq - vdpa->vqs);
+}
+
+static int vhost_set_state(struct vhost_mdev *vdpa, u64 __user *statep)
+{
+   u64 state;
+
+   if (copy_from_user(&state, statep, sizeof(state)))
+   return -EFAULT;
+
+   if (state >= VHOST_MDEV_S_MAX)
+   return -EINVAL;
+
+   if (vdpa->state == state)
+   return 0;
+
+   mutex_lock(&vdpa->dev.mutex);
+
+   vdpa->state = state;
+
+   switch (vdpa->stat

[PATCH 2/2] vhost/test: fix build for vhost test

2019-08-27 Thread Tiwei Bie
Since vhost_exceeds_weight() was introduced, callers need to specify
the packet weight and byte weight in vhost_dev_init(). Note that the
packet weight isn't counted in this patch to keep the original behavior
unchanged.

Fixes: e82b9b0727ff ("vhost: introduce vhost_exceeds_weight()")
Cc: sta...@vger.kernel.org
Signed-off-by: Tiwei Bie 
---
 drivers/vhost/test.c | 13 +
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
index ac4f762c4f65..7804869c6a31 100644
--- a/drivers/vhost/test.c
+++ b/drivers/vhost/test.c
@@ -22,6 +22,12 @@
  * Using this limit prevents one virtqueue from starving others. */
 #define VHOST_TEST_WEIGHT 0x8
 
+/* Max number of packets transferred before requeueing the job.
+ * Using this limit prevents one virtqueue from starving others with
+ * pkts.
+ */
+#define VHOST_TEST_PKT_WEIGHT 256
+
 enum {
VHOST_TEST_VQ = 0,
VHOST_TEST_VQ_MAX = 1,
@@ -80,10 +86,8 @@ static void handle_vq(struct vhost_test *n)
}
vhost_add_used_and_signal(&n->dev, vq, head, 0);
total_len += len;
-   if (unlikely(total_len >= VHOST_TEST_WEIGHT)) {
-   vhost_poll_queue(&vq->poll);
+   if (unlikely(vhost_exceeds_weight(vq, 0, total_len)))
break;
-   }
}
 
mutex_unlock(&vq->mutex);
@@ -115,7 +119,8 @@ static int vhost_test_open(struct inode *inode, struct file 
*f)
dev = &n->dev;
vqs[VHOST_TEST_VQ] = &n->vqs[VHOST_TEST_VQ];
n->vqs[VHOST_TEST_VQ].handle_kick = handle_vq_kick;
-   vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX, UIO_MAXIOV);
+   vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX, UIO_MAXIOV,
+  VHOST_TEST_PKT_WEIGHT, VHOST_TEST_WEIGHT);
 
f->private_data = n;
 
-- 
2.17.1



[PATCH 1/2] vhost/test: fix build for vhost test

2019-08-27 Thread Tiwei Bie
Since the commit below, callers need to specify the iov_limit in
vhost_dev_init() explicitly.

Fixes: b46a0bf78ad7 ("vhost: fix OOB in get_rx_bufs()")
Cc: sta...@vger.kernel.org
Signed-off-by: Tiwei Bie 
---
 drivers/vhost/test.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
index 9e90e969af55..ac4f762c4f65 100644
--- a/drivers/vhost/test.c
+++ b/drivers/vhost/test.c
@@ -115,7 +115,7 @@ static int vhost_test_open(struct inode *inode, struct file 
*f)
dev = &n->dev;
vqs[VHOST_TEST_VQ] = &n->vqs[VHOST_TEST_VQ];
n->vqs[VHOST_TEST_VQ].handle_kick = handle_vq_kick;
-   vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX);
+   vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX, UIO_MAXIOV);
 
f->private_data = n;
 
-- 
2.17.1



Re: [RFC v2] vhost: introduce mdev based hardware vhost backend

2019-07-10 Thread Tiwei Bie
On Wed, Jul 10, 2019 at 10:26:10AM +0800, Jason Wang wrote:
> On 2019/7/9 下午2:33, Tiwei Bie wrote:
> > On Tue, Jul 09, 2019 at 10:50:38AM +0800, Jason Wang wrote:
> > > On 2019/7/8 下午2:16, Tiwei Bie wrote:
> > > > On Fri, Jul 05, 2019 at 08:49:46AM -0600, Alex Williamson wrote:
> > > > > On Thu, 4 Jul 2019 14:21:34 +0800
> > > > > Tiwei Bie  wrote:
> > > > > > On Thu, Jul 04, 2019 at 12:31:48PM +0800, Jason Wang wrote:
> > > > > > > On 2019/7/3 下午9:08, Tiwei Bie wrote:
> > > > > > > > On Wed, Jul 03, 2019 at 08:16:23PM +0800, Jason Wang wrote:
> > > > > > > > > On 2019/7/3 下午7:52, Tiwei Bie wrote:
> > > > > > > > > > On Wed, Jul 03, 2019 at 06:09:51PM +0800, Jason Wang wrote:
> > > > > > > > > > > On 2019/7/3 下午5:13, Tiwei Bie wrote:
> > > > > > > > > > > > Details about this can be found here:
> > > > > > > > > > > > 
> > > > > > > > > > > > https://lwn.net/Articles/750770/
> > > > > > > > > > > > 
> > > > > > > > > > > > What's new in this version
> > > > > > > > > > > > ==
> > > > > > > > > > > > 
> > > > > > > > > > > > A new VFIO device type is introduced - vfio-vhost. This 
> > > > > > > > > > > > addressed
> > > > > > > > > > > > some comments from 
> > > > > > > > > > > > here:https://patchwork.ozlabs.org/cover/984763/
> > > > > > > > > > > > 
> > > > > > > > > > > > Below is the updated device interface:
> > > > > > > > > > > > 
> > > > > > > > > > > > Currently, there are two regions of this device: 1) 
> > > > > > > > > > > > CONFIG_REGION
> > > > > > > > > > > > (VFIO_VHOST_CONFIG_REGION_INDEX), which can be used to 
> > > > > > > > > > > > setup the
> > > > > > > > > > > > device; 2) NOTIFY_REGION 
> > > > > > > > > > > > (VFIO_VHOST_NOTIFY_REGION_INDEX), which
> > > > > > > > > > > > can be used to notify the device.
> > > > > > > > > > > > 
> > > > > > > > > > > > 1. CONFIG_REGION
> > > > > > > > > > > > 
> > > > > > > > > > > > The region described by CONFIG_REGION is the main 
> > > > > > > > > > > > control interface.
> > > > > > > > > > > > Messages will be written to or read from this region.
> > > > > > > > > > > > 
> > > > > > > > > > > > The message type is determined by the `request` field 
> > > > > > > > > > > > in message
> > > > > > > > > > > > header. The message size is encoded in the message 
> > > > > > > > > > > > header too.
> > > > > > > > > > > > The message format looks like this:
> > > > > > > > > > > > 
> > > > > > > > > > > > struct vhost_vfio_op {
> > > > > > > > > > > > __u64 request;
> > > > > > > > > > > > __u32 flags;
> > > > > > > > > > > > /* Flag values: */
> > > > > > > > > > > >   #define VHOST_VFIO_NEED_REPLY 0x1 /* Whether need 
> > > > > > > > > > > > reply */
> > > > > > > > > > > > __u32 size;
> > > > > > > > > > > > union {
> > > > > > > > > > > > __u64 u64;
> > > > > > > > > > > > struct vhost_vring_state state;
> > > > > > > > > > > > struct vhost_vring_addr addr;
> > > > > > > > > > > > } payload;
> > > > > > > > > > > > };
> > > > &

Re: [RFC v2] vhost: introduce mdev based hardware vhost backend

2019-07-09 Thread Tiwei Bie
On Tue, Jul 09, 2019 at 10:50:38AM +0800, Jason Wang wrote:
> On 2019/7/8 下午2:16, Tiwei Bie wrote:
> > On Fri, Jul 05, 2019 at 08:49:46AM -0600, Alex Williamson wrote:
> > > On Thu, 4 Jul 2019 14:21:34 +0800
> > > Tiwei Bie  wrote:
> > > > On Thu, Jul 04, 2019 at 12:31:48PM +0800, Jason Wang wrote:
> > > > > On 2019/7/3 下午9:08, Tiwei Bie wrote:
> > > > > > On Wed, Jul 03, 2019 at 08:16:23PM +0800, Jason Wang wrote:
> > > > > > > On 2019/7/3 下午7:52, Tiwei Bie wrote:
> > > > > > > > On Wed, Jul 03, 2019 at 06:09:51PM +0800, Jason Wang wrote:
> > > > > > > > > On 2019/7/3 下午5:13, Tiwei Bie wrote:
> > > > > > > > > > Details about this can be found here:
> > > > > > > > > > 
> > > > > > > > > > https://lwn.net/Articles/750770/
> > > > > > > > > > 
> > > > > > > > > > What's new in this version
> > > > > > > > > > ==
> > > > > > > > > > 
> > > > > > > > > > A new VFIO device type is introduced - vfio-vhost. This 
> > > > > > > > > > addressed
> > > > > > > > > > some comments from 
> > > > > > > > > > here:https://patchwork.ozlabs.org/cover/984763/
> > > > > > > > > > 
> > > > > > > > > > Below is the updated device interface:
> > > > > > > > > > 
> > > > > > > > > > Currently, there are two regions of this device: 1) 
> > > > > > > > > > CONFIG_REGION
> > > > > > > > > > (VFIO_VHOST_CONFIG_REGION_INDEX), which can be used to 
> > > > > > > > > > setup the
> > > > > > > > > > device; 2) NOTIFY_REGION (VFIO_VHOST_NOTIFY_REGION_INDEX), 
> > > > > > > > > > which
> > > > > > > > > > can be used to notify the device.
> > > > > > > > > > 
> > > > > > > > > > 1. CONFIG_REGION
> > > > > > > > > > 
> > > > > > > > > > The region described by CONFIG_REGION is the main control 
> > > > > > > > > > interface.
> > > > > > > > > > Messages will be written to or read from this region.
> > > > > > > > > > 
> > > > > > > > > > The message type is determined by the `request` field in 
> > > > > > > > > > message
> > > > > > > > > > header. The message size is encoded in the message header 
> > > > > > > > > > too.
> > > > > > > > > > The message format looks like this:
> > > > > > > > > > 
> > > > > > > > > > struct vhost_vfio_op {
> > > > > > > > > > __u64 request;
> > > > > > > > > > __u32 flags;
> > > > > > > > > > /* Flag values: */
> > > > > > > > > >  #define VHOST_VFIO_NEED_REPLY 0x1 /* Whether need 
> > > > > > > > > > reply */
> > > > > > > > > > __u32 size;
> > > > > > > > > > union {
> > > > > > > > > > __u64 u64;
> > > > > > > > > > struct vhost_vring_state state;
> > > > > > > > > > struct vhost_vring_addr addr;
> > > > > > > > > > } payload;
> > > > > > > > > > };
> > > > > > > > > > 
> > > > > > > > > > The existing vhost-kernel ioctl cmds are reused as the 
> > > > > > > > > > message
> > > > > > > > > > requests in above structure.
> > > > > > > > > Still a comments like V1. What's the advantage of inventing a 
> > > > > > > > > new protocol?
> > > > > > > > I'm trying to make it work in VFIO's way..
> > > > > > > > > I believe either of the following should be better:
> > > > > > > > > 
> > > > > > > > > - using vhost ioctl, 

Re: [RFC v2] vhost: introduce mdev based hardware vhost backend

2019-07-08 Thread Tiwei Bie
On Fri, Jul 05, 2019 at 08:49:46AM -0600, Alex Williamson wrote:
> On Thu, 4 Jul 2019 14:21:34 +0800
> Tiwei Bie  wrote:
> > On Thu, Jul 04, 2019 at 12:31:48PM +0800, Jason Wang wrote:
> > > On 2019/7/3 下午9:08, Tiwei Bie wrote:  
> > > > On Wed, Jul 03, 2019 at 08:16:23PM +0800, Jason Wang wrote:  
> > > > > On 2019/7/3 下午7:52, Tiwei Bie wrote:  
> > > > > > On Wed, Jul 03, 2019 at 06:09:51PM +0800, Jason Wang wrote:  
> > > > > > > On 2019/7/3 下午5:13, Tiwei Bie wrote:  
> > > > > > > > Details about this can be found here:
> > > > > > > > 
> > > > > > > > https://lwn.net/Articles/750770/
> > > > > > > > 
> > > > > > > > What's new in this version
> > > > > > > > ==
> > > > > > > > 
> > > > > > > > A new VFIO device type is introduced - vfio-vhost. This 
> > > > > > > > addressed
> > > > > > > > some comments from 
> > > > > > > > here:https://patchwork.ozlabs.org/cover/984763/
> > > > > > > > 
> > > > > > > > Below is the updated device interface:
> > > > > > > > 
> > > > > > > > Currently, there are two regions of this device: 1) 
> > > > > > > > CONFIG_REGION
> > > > > > > > (VFIO_VHOST_CONFIG_REGION_INDEX), which can be used to setup the
> > > > > > > > device; 2) NOTIFY_REGION (VFIO_VHOST_NOTIFY_REGION_INDEX), which
> > > > > > > > can be used to notify the device.
> > > > > > > > 
> > > > > > > > 1. CONFIG_REGION
> > > > > > > > 
> > > > > > > > The region described by CONFIG_REGION is the main control 
> > > > > > > > interface.
> > > > > > > > Messages will be written to or read from this region.
> > > > > > > > 
> > > > > > > > The message type is determined by the `request` field in message
> > > > > > > > header. The message size is encoded in the message header too.
> > > > > > > > The message format looks like this:
> > > > > > > > 
> > > > > > > > struct vhost_vfio_op {
> > > > > > > > __u64 request;
> > > > > > > > __u32 flags;
> > > > > > > > /* Flag values: */
> > > > > > > > #define VHOST_VFIO_NEED_REPLY 0x1 /* Whether need reply */
> > > > > > > > __u32 size;
> > > > > > > > union {
> > > > > > > > __u64 u64;
> > > > > > > > struct vhost_vring_state state;
> > > > > > > > struct vhost_vring_addr addr;
> > > > > > > > } payload;
> > > > > > > > };
> > > > > > > > 
> > > > > > > > The existing vhost-kernel ioctl cmds are reused as the message
> > > > > > > > requests in above structure.  
> > > > > > > Still a comments like V1. What's the advantage of inventing a new 
> > > > > > > protocol?  
> > > > > > I'm trying to make it work in VFIO's way..
> > > > > >   
> > > > > > > I believe either of the following should be better:
> > > > > > > 
> > > > > > > - using vhost ioctl,  we can start from 
> > > > > > > SET_VRING_KICK/SET_VRING_CALL and
> > > > > > > extend it with e.g notify region. The advantages is that all 
> > > > > > > exist userspace
> > > > > > > program could be reused without modification (or minimal 
> > > > > > > modification). And
> > > > > > > vhost API hides lots of details that is not necessary to be 
> > > > > > > understood by
> > > > > > > application (e.g in the case of container).  
> > > > > > Do you mean reusing vhost's ioctl on VFIO device fd directly,
> > > > > > or introducing another mdev driver (i.e. vhost_mdev instead of
> > > > > > using the existing vfio_mdev) for mdev device?  
> > > > > Can we simply add them i

Re: [RFC v2] vhost: introduce mdev based hardware vhost backend

2019-07-04 Thread Tiwei Bie
On Fri, Jul 05, 2019 at 08:30:00AM +0800, Jason Wang wrote:
> On 2019/7/4 下午3:02, Tiwei Bie wrote:
> > On Thu, Jul 04, 2019 at 02:35:20PM +0800, Jason Wang wrote:
> > > On 2019/7/4 下午2:21, Tiwei Bie wrote:
> > > > On Thu, Jul 04, 2019 at 12:31:48PM +0800, Jason Wang wrote:
> > > > > On 2019/7/3 下午9:08, Tiwei Bie wrote:
> > > > > > On Wed, Jul 03, 2019 at 08:16:23PM +0800, Jason Wang wrote:
> > > > > > > On 2019/7/3 下午7:52, Tiwei Bie wrote:
> > > > > > > > On Wed, Jul 03, 2019 at 06:09:51PM +0800, Jason Wang wrote:
> > > > > > > > > On 2019/7/3 下午5:13, Tiwei Bie wrote:
> > > > > > > > > > Details about this can be found here:
> > > > > > > > > > 
> > > > > > > > > > https://lwn.net/Articles/750770/
> > > > > > > > > > 
> > > > > > > > > > What's new in this version
> > > > > > > > > > ==
> > > > > > > > > > 
> > > > > > > > > > A new VFIO device type is introduced - vfio-vhost. This 
> > > > > > > > > > addressed
> > > > > > > > > > some comments from 
> > > > > > > > > > here:https://patchwork.ozlabs.org/cover/984763/
> > > > > > > > > > 
> > > > > > > > > > Below is the updated device interface:
> > > > > > > > > > 
> > > > > > > > > > Currently, there are two regions of this device: 1) 
> > > > > > > > > > CONFIG_REGION
> > > > > > > > > > (VFIO_VHOST_CONFIG_REGION_INDEX), which can be used to 
> > > > > > > > > > setup the
> > > > > > > > > > device; 2) NOTIFY_REGION (VFIO_VHOST_NOTIFY_REGION_INDEX), 
> > > > > > > > > > which
> > > > > > > > > > can be used to notify the device.
> > > > > > > > > > 
> > > > > > > > > > 1. CONFIG_REGION
> > > > > > > > > > 
> > > > > > > > > > The region described by CONFIG_REGION is the main control 
> > > > > > > > > > interface.
> > > > > > > > > > Messages will be written to or read from this region.
> > > > > > > > > > 
> > > > > > > > > > The message type is determined by the `request` field in 
> > > > > > > > > > message
> > > > > > > > > > header. The message size is encoded in the message header 
> > > > > > > > > > too.
> > > > > > > > > > The message format looks like this:
> > > > > > > > > > 
> > > > > > > > > > struct vhost_vfio_op {
> > > > > > > > > > __u64 request;
> > > > > > > > > > __u32 flags;
> > > > > > > > > > /* Flag values: */
> > > > > > > > > >   #define VHOST_VFIO_NEED_REPLY 0x1 /* Whether need 
> > > > > > > > > > reply */
> > > > > > > > > > __u32 size;
> > > > > > > > > > union {
> > > > > > > > > > __u64 u64;
> > > > > > > > > > struct vhost_vring_state state;
> > > > > > > > > > struct vhost_vring_addr addr;
> > > > > > > > > > } payload;
> > > > > > > > > > };
> > > > > > > > > > 
> > > > > > > > > > The existing vhost-kernel ioctl cmds are reused as the 
> > > > > > > > > > message
> > > > > > > > > > requests in above structure.
> > > > > > > > > Still a comments like V1. What's the advantage of inventing a 
> > > > > > > > > new protocol?
> > > > > > > > I'm trying to make it work in VFIO's way..
> > > > > > > > 
> > > > > > > > > I believe either of the following should be better:
> > > > > > > > > 
> > > > > > > > > - using 

Re: [RFC v2] vhost: introduce mdev based hardware vhost backend

2019-07-04 Thread Tiwei Bie
On Thu, Jul 04, 2019 at 02:35:20PM +0800, Jason Wang wrote:
> On 2019/7/4 下午2:21, Tiwei Bie wrote:
> > On Thu, Jul 04, 2019 at 12:31:48PM +0800, Jason Wang wrote:
> > > On 2019/7/3 下午9:08, Tiwei Bie wrote:
> > > > On Wed, Jul 03, 2019 at 08:16:23PM +0800, Jason Wang wrote:
> > > > > On 2019/7/3 下午7:52, Tiwei Bie wrote:
> > > > > > On Wed, Jul 03, 2019 at 06:09:51PM +0800, Jason Wang wrote:
> > > > > > > On 2019/7/3 下午5:13, Tiwei Bie wrote:
> > > > > > > > Details about this can be found here:
> > > > > > > > 
> > > > > > > > https://lwn.net/Articles/750770/
> > > > > > > > 
> > > > > > > > What's new in this version
> > > > > > > > ==
> > > > > > > > 
> > > > > > > > A new VFIO device type is introduced - vfio-vhost. This 
> > > > > > > > addressed
> > > > > > > > some comments from 
> > > > > > > > here:https://patchwork.ozlabs.org/cover/984763/
> > > > > > > > 
> > > > > > > > Below is the updated device interface:
> > > > > > > > 
> > > > > > > > Currently, there are two regions of this device: 1) 
> > > > > > > > CONFIG_REGION
> > > > > > > > (VFIO_VHOST_CONFIG_REGION_INDEX), which can be used to setup the
> > > > > > > > device; 2) NOTIFY_REGION (VFIO_VHOST_NOTIFY_REGION_INDEX), which
> > > > > > > > can be used to notify the device.
> > > > > > > > 
> > > > > > > > 1. CONFIG_REGION
> > > > > > > > 
> > > > > > > > The region described by CONFIG_REGION is the main control 
> > > > > > > > interface.
> > > > > > > > Messages will be written to or read from this region.
> > > > > > > > 
> > > > > > > > The message type is determined by the `request` field in message
> > > > > > > > header. The message size is encoded in the message header too.
> > > > > > > > The message format looks like this:
> > > > > > > > 
> > > > > > > > struct vhost_vfio_op {
> > > > > > > > __u64 request;
> > > > > > > > __u32 flags;
> > > > > > > > /* Flag values: */
> > > > > > > >  #define VHOST_VFIO_NEED_REPLY 0x1 /* Whether need reply */
> > > > > > > > __u32 size;
> > > > > > > > union {
> > > > > > > > __u64 u64;
> > > > > > > > struct vhost_vring_state state;
> > > > > > > > struct vhost_vring_addr addr;
> > > > > > > > } payload;
> > > > > > > > };
> > > > > > > > 
> > > > > > > > The existing vhost-kernel ioctl cmds are reused as the message
> > > > > > > > requests in above structure.
> > > > > > > Still a comments like V1. What's the advantage of inventing a new 
> > > > > > > protocol?
> > > > > > I'm trying to make it work in VFIO's way..
> > > > > > 
> > > > > > > I believe either of the following should be better:
> > > > > > > 
> > > > > > > - using vhost ioctl,  we can start from 
> > > > > > > SET_VRING_KICK/SET_VRING_CALL and
> > > > > > > extend it with e.g notify region. The advantages is that all 
> > > > > > > exist userspace
> > > > > > > program could be reused without modification (or minimal 
> > > > > > > modification). And
> > > > > > > vhost API hides lots of details that is not necessary to be 
> > > > > > > understood by
> > > > > > > application (e.g in the case of container).
> > > > > > Do you mean reusing vhost's ioctl on VFIO device fd directly,
> > > > > > or introducing another mdev driver (i.e. vhost_mdev instead of
> > > > > > using the existing vfio_mdev) for mdev device?
> > > > > Can we simply add them into ioctl of mdev_parent_ops?
> > > &

Re: [RFC v2] vhost: introduce mdev based hardware vhost backend

2019-07-04 Thread Tiwei Bie
On Thu, Jul 04, 2019 at 12:31:48PM +0800, Jason Wang wrote:
> On 2019/7/3 下午9:08, Tiwei Bie wrote:
> > On Wed, Jul 03, 2019 at 08:16:23PM +0800, Jason Wang wrote:
> > > On 2019/7/3 下午7:52, Tiwei Bie wrote:
> > > > On Wed, Jul 03, 2019 at 06:09:51PM +0800, Jason Wang wrote:
> > > > > On 2019/7/3 下午5:13, Tiwei Bie wrote:
> > > > > > Details about this can be found here:
> > > > > > 
> > > > > > https://lwn.net/Articles/750770/
> > > > > > 
> > > > > > What's new in this version
> > > > > > ==
> > > > > > 
> > > > > > A new VFIO device type is introduced - vfio-vhost. This addressed
> > > > > > some comments from here:https://patchwork.ozlabs.org/cover/984763/
> > > > > > 
> > > > > > Below is the updated device interface:
> > > > > > 
> > > > > > Currently, there are two regions of this device: 1) CONFIG_REGION
> > > > > > (VFIO_VHOST_CONFIG_REGION_INDEX), which can be used to setup the
> > > > > > device; 2) NOTIFY_REGION (VFIO_VHOST_NOTIFY_REGION_INDEX), which
> > > > > > can be used to notify the device.
> > > > > > 
> > > > > > 1. CONFIG_REGION
> > > > > > 
> > > > > > The region described by CONFIG_REGION is the main control interface.
> > > > > > Messages will be written to or read from this region.
> > > > > > 
> > > > > > The message type is determined by the `request` field in message
> > > > > > header. The message size is encoded in the message header too.
> > > > > > The message format looks like this:
> > > > > > 
> > > > > > struct vhost_vfio_op {
> > > > > > __u64 request;
> > > > > > __u32 flags;
> > > > > > /* Flag values: */
> > > > > > #define VHOST_VFIO_NEED_REPLY 0x1 /* Whether need reply */
> > > > > > __u32 size;
> > > > > > union {
> > > > > > __u64 u64;
> > > > > > struct vhost_vring_state state;
> > > > > > struct vhost_vring_addr addr;
> > > > > > } payload;
> > > > > > };
> > > > > > 
> > > > > > The existing vhost-kernel ioctl cmds are reused as the message
> > > > > > requests in above structure.
> > > > > Still a comments like V1. What's the advantage of inventing a new 
> > > > > protocol?
> > > > I'm trying to make it work in VFIO's way..
> > > > 
> > > > > I believe either of the following should be better:
> > > > > 
> > > > > - using vhost ioctl,  we can start from SET_VRING_KICK/SET_VRING_CALL 
> > > > > and
> > > > > extend it with e.g notify region. The advantages is that all exist 
> > > > > userspace
> > > > > program could be reused without modification (or minimal 
> > > > > modification). And
> > > > > vhost API hides lots of details that is not necessary to be 
> > > > > understood by
> > > > > application (e.g in the case of container).
> > > > Do you mean reusing vhost's ioctl on VFIO device fd directly,
> > > > or introducing another mdev driver (i.e. vhost_mdev instead of
> > > > using the existing vfio_mdev) for mdev device?
> > > Can we simply add them into ioctl of mdev_parent_ops?
> > Right, either way, these ioctls have to be and just need to be
> > added in the ioctl of the mdev_parent_ops. But another thing we
> > also need to consider is that which file descriptor the userspace
> > will do the ioctl() on. So I'm wondering do you mean let the
> > userspace do the ioctl() on the VFIO device fd of the mdev
> > device?
> > 
> 
> Yes.

Got it! I'm not sure what Alex's opinion is on this. If we all
agree with this, I can do it this way.

> Is there any other way btw?

Just a quick thought... maybe it's a totally bad idea. I was thinking
whether it would be odd to do non-VFIO ioctls on VFIO's device
fd. So I was wondering whether it's possible to allow binding
another mdev driver (e.g. vhost_mdev) to the supported mdev
devices. The new mdev driver, vhost_mdev, can provide similar
ways to let userspace open the mdev device and do the vhost ioctls
on it. To distinguish from the vfio_mdev compatible mdev devices,
the device API of the new vhost_mdev compatible mdev devices
might be e.g. "vhost-net" for net?

So in the VFIO case, the device will be used for passthrough directly.
And in the VHOST case, the device can be used to accelerate the
existing virtualized devices.

What do you think?

Thanks,
Tiwei
> 
> Thanks
> 


Re: [RFC v2] vhost: introduce mdev based hardware vhost backend

2019-07-03 Thread Tiwei Bie
On Wed, Jul 03, 2019 at 12:31:57PM -0600, Alex Williamson wrote:
> On Wed,  3 Jul 2019 17:13:39 +0800
> Tiwei Bie  wrote:
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index 8f10748dac79..6c5718ab7eeb 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -201,6 +201,7 @@ struct vfio_device_info {
> >  #define VFIO_DEVICE_FLAGS_AMBA  (1 << 3)   /* vfio-amba device */
> >  #define VFIO_DEVICE_FLAGS_CCW  (1 << 4)/* vfio-ccw device */
> >  #define VFIO_DEVICE_FLAGS_AP   (1 << 5)/* vfio-ap device */
> > +#define VFIO_DEVICE_FLAGS_VHOST (1 << 6)   /* vfio-vhost device */
> > __u32   num_regions;/* Max region index + 1 */
> > __u32   num_irqs;   /* Max IRQ index + 1 */
> >  };
> > @@ -217,6 +218,7 @@ struct vfio_device_info {
> >  #define VFIO_DEVICE_API_AMBA_STRING"vfio-amba"
> >  #define VFIO_DEVICE_API_CCW_STRING "vfio-ccw"
> >  #define VFIO_DEVICE_API_AP_STRING  "vfio-ap"
> > +#define VFIO_DEVICE_API_VHOST_STRING   "vfio-vhost"
> >  
> >  /**
> >   * VFIO_DEVICE_GET_REGION_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 8,
> > @@ -573,6 +575,23 @@ enum {
> > VFIO_CCW_NUM_IRQS
> >  };
> >  
> > +/*
> > + * The vfio-vhost bus driver makes use of the following fixed region and
> > + * IRQ index mapping. Unimplemented regions return a size of zero.
> > + * Unimplemented IRQ types return a count of zero.
> > + */
> > +
> > +enum {
> > +   VFIO_VHOST_CONFIG_REGION_INDEX,
> > +   VFIO_VHOST_NOTIFY_REGION_INDEX,
> > +   VFIO_VHOST_NUM_REGIONS
> > +};
> > +
> > +enum {
> > +   VFIO_VHOST_VQ_IRQ_INDEX,
> > +   VFIO_VHOST_NUM_IRQS
> > +};
> > +
> 
> Note that the vfio API has evolved a bit since vfio-pci started this
> way, with fixed indexes for pre-defined region types.  We now support
> device specific regions which can be identified by a capability within
> the REGION_INFO ioctl return data.  This allows a bit more flexibility,
> at the cost of complexity, but the infrastructure already exists in
> kernel and QEMU to make it relatively easy.  I think we'll have the
> same support for interrupts soon too.  If you continue to pursue the
> vfio-vhost direction you might want to consider these before committing
> to fixed indexes.  Thanks,

Thanks for the details! Will give it a try!

Thanks,
Tiwei


Re: [RFC v2] vhost: introduce mdev based hardware vhost backend

2019-07-03 Thread Tiwei Bie
On Wed, Jul 03, 2019 at 08:16:23PM +0800, Jason Wang wrote:
> On 2019/7/3 下午7:52, Tiwei Bie wrote:
> > On Wed, Jul 03, 2019 at 06:09:51PM +0800, Jason Wang wrote:
> > > On 2019/7/3 下午5:13, Tiwei Bie wrote:
> > > > Details about this can be found here:
> > > > 
> > > > https://lwn.net/Articles/750770/
> > > > 
> > > > What's new in this version
> > > > ==
> > > > 
> > > > A new VFIO device type is introduced - vfio-vhost. This addressed
> > > > some comments from here: https://patchwork.ozlabs.org/cover/984763/
> > > > 
> > > > Below is the updated device interface:
> > > > 
> > > > Currently, there are two regions of this device: 1) CONFIG_REGION
> > > > (VFIO_VHOST_CONFIG_REGION_INDEX), which can be used to setup the
> > > > device; 2) NOTIFY_REGION (VFIO_VHOST_NOTIFY_REGION_INDEX), which
> > > > can be used to notify the device.
> > > > 
> > > > 1. CONFIG_REGION
> > > > 
> > > > The region described by CONFIG_REGION is the main control interface.
> > > > Messages will be written to or read from this region.
> > > > 
> > > > The message type is determined by the `request` field in message
> > > > header. The message size is encoded in the message header too.
> > > > The message format looks like this:
> > > > 
> > > > struct vhost_vfio_op {
> > > > __u64 request;
> > > > __u32 flags;
> > > > /* Flag values: */
> > > >#define VHOST_VFIO_NEED_REPLY 0x1 /* Whether need reply */
> > > > __u32 size;
> > > > union {
> > > > __u64 u64;
> > > > struct vhost_vring_state state;
> > > > struct vhost_vring_addr addr;
> > > > } payload;
> > > > };
> > > > 
> > > > The existing vhost-kernel ioctl cmds are reused as the message
> > > > requests in above structure.
> > > 
> > > Still a comments like V1. What's the advantage of inventing a new 
> > > protocol?
> > I'm trying to make it work in VFIO's way..
> > 
> > > I believe either of the following should be better:
> > > 
> > > - using vhost ioctl,  we can start from SET_VRING_KICK/SET_VRING_CALL and
> > > extend it with e.g notify region. The advantages is that all exist 
> > > userspace
> > > program could be reused without modification (or minimal modification). 
> > > And
> > > vhost API hides lots of details that is not necessary to be understood by
> > > application (e.g in the case of container).
> > Do you mean reusing vhost's ioctl on VFIO device fd directly,
> > or introducing another mdev driver (i.e. vhost_mdev instead of
> > using the existing vfio_mdev) for mdev device?
> 
> 
> Can we simply add them into ioctl of mdev_parent_ops?

Right, either way, these ioctls have to be, and just need to be,
added in the ioctl of the mdev_parent_ops. But another thing we
also need to consider is which file descriptor userspace
will do the ioctl() on. So I'm wondering, do you mean letting
userspace do the ioctl() on the VFIO device fd of the mdev
device?

> 
> 
> > 
[...]
> > > > 3. VFIO interrupt ioctl API
> > > > 
> > > > VFIO interrupt ioctl API is used to setup device interrupts.
> > > > IRQ-bypass can also be supported.
> > > > 
> > > > Currently, the data path interrupt can be configured via the
> > > > VFIO_VHOST_VQ_IRQ_INDEX with virtqueue's callfd.
> > > 
> > > How about DMA API? Do you expect to use VFIO IOMMU API or using vhost
> > > SET_MEM_TABLE? VFIO IOMMU API is more generic for sure but with
> > > SET_MEM_TABLE DMA can be done at the level of parent device which means it
> > > can work for e.g the card with on-chip IOMMU.
> > Agree. In this RFC, it assumes userspace will use VFIO IOMMU API
> > to do the DMA programming. But like what you said, there could be
> > a problem when using cards with on-chip IOMMU.
> 
> 
> Yes, another issue is SET_MEM_TABLE can not be used to update just a part of
> the table. This seems less flexible than VFIO API but it could be extended.

Agree.

> 
> 
> > 
> > > And what's the plan for vIOMMU?
> > As this RFC assumes userspace will use VFIO IOMMU API, userspace
> > just needs to follow the same way like what vfio-pci device does
> > in QEMU to

Re: [RFC v2] vhost: introduce mdev based hardware vhost backend

2019-07-03 Thread Tiwei Bie
On Wed, Jul 03, 2019 at 06:09:51PM +0800, Jason Wang wrote:
> On 2019/7/3 下午5:13, Tiwei Bie wrote:
> > Details about this can be found here:
> > 
> > https://lwn.net/Articles/750770/
> > 
> > What's new in this version
> > ==
> > 
> > A new VFIO device type is introduced - vfio-vhost. This addressed
> > some comments from here: https://patchwork.ozlabs.org/cover/984763/
> > 
> > Below is the updated device interface:
> > 
> > Currently, there are two regions of this device: 1) CONFIG_REGION
> > (VFIO_VHOST_CONFIG_REGION_INDEX), which can be used to setup the
> > device; 2) NOTIFY_REGION (VFIO_VHOST_NOTIFY_REGION_INDEX), which
> > can be used to notify the device.
> > 
> > 1. CONFIG_REGION
> > 
> > The region described by CONFIG_REGION is the main control interface.
> > Messages will be written to or read from this region.
> > 
> > The message type is determined by the `request` field in message
> > header. The message size is encoded in the message header too.
> > The message format looks like this:
> > 
> > struct vhost_vfio_op {
> > __u64 request;
> > __u32 flags;
> > /* Flag values: */
> >   #define VHOST_VFIO_NEED_REPLY 0x1 /* Whether need reply */
> > __u32 size;
> > union {
> > __u64 u64;
> > struct vhost_vring_state state;
> > struct vhost_vring_addr addr;
> > } payload;
> > };
> > 
> > The existing vhost-kernel ioctl cmds are reused as the message
> > requests in above structure.
> 
> 
> Still a comments like V1. What's the advantage of inventing a new protocol?

I'm trying to make it work in VFIO's way..

> I believe either of the following should be better:
> 
> - using vhost ioctl,  we can start from SET_VRING_KICK/SET_VRING_CALL and
> extend it with e.g notify region. The advantages is that all exist userspace
> program could be reused without modification (or minimal modification). And
> vhost API hides lots of details that is not necessary to be understood by
> application (e.g in the case of container).

Do you mean reusing vhost's ioctl on VFIO device fd directly,
or introducing another mdev driver (i.e. vhost_mdev instead of
using the existing vfio_mdev) for mdev device?

> 
> - using PCI layout, then you don't even need to re-invent notifiy region at
> all and we can pass-through them to guest.

Like you said previously, virtio has transports other than PCI,
and it would look a bit odd when using transports other than PCI.

> 
> Personally, I prefer vhost ioctl.

+1

> 
> 
> > 
[...]
> > 
> > 3. VFIO interrupt ioctl API
> > 
> > VFIO interrupt ioctl API is used to setup device interrupts.
> > IRQ-bypass can also be supported.
> > 
> > Currently, the data path interrupt can be configured via the
> > VFIO_VHOST_VQ_IRQ_INDEX with virtqueue's callfd.
> 
> 
> How about DMA API? Do you expect to use VFIO IOMMU API or using vhost
> SET_MEM_TABLE? VFIO IOMMU API is more generic for sure but with
> SET_MEM_TABLE DMA can be done at the level of parent device which means it
> can work for e.g the card with on-chip IOMMU.

Agree. In this RFC, it assumes userspace will use VFIO IOMMU API
to do the DMA programming. But like what you said, there could be
a problem when using cards with on-chip IOMMU.

> 
> And what's the plan for vIOMMU?

As this RFC assumes userspace will use the VFIO IOMMU API, userspace
just needs to follow the same approach as the vfio-pci device in
QEMU to support vIOMMU.

> 
> 
> > 
> > Signed-off-by: Tiwei Bie 
> > ---
> >   drivers/vhost/Makefile |   2 +
> >   drivers/vhost/vdpa.c   | 770 +
> >   include/linux/vdpa_mdev.h  |  72 
> >   include/uapi/linux/vfio.h  |  19 +
> >   include/uapi/linux/vhost.h |  25 ++
> >   5 files changed, 888 insertions(+)
> >   create mode 100644 drivers/vhost/vdpa.c
> >   create mode 100644 include/linux/vdpa_mdev.h
> 
> 
> We probably need some sample parent device implementation. It could be a
> software datapath like e.g we can start from virtio-net device in guest or a
> vhost/tap on host.

Yeah, something like this would be interesting!

Thanks,
Tiwei

> 
> Thanks
> 
> 
> > 


[RFC v2] vhost: introduce mdev based hardware vhost backend

2019-07-03 Thread Tiwei Bie
Details about this can be found here:

https://lwn.net/Articles/750770/

What's new in this version
==

A new VFIO device type is introduced - vfio-vhost. This addressed
some comments from here: https://patchwork.ozlabs.org/cover/984763/

Below is the updated device interface:

Currently, there are two regions of this device: 1) CONFIG_REGION
(VFIO_VHOST_CONFIG_REGION_INDEX), which can be used to setup the
device; 2) NOTIFY_REGION (VFIO_VHOST_NOTIFY_REGION_INDEX), which
can be used to notify the device.

1. CONFIG_REGION

The region described by CONFIG_REGION is the main control interface.
Messages will be written to or read from this region.

The message type is determined by the `request` field in message
header. The message size is encoded in the message header too.
The message format looks like this:

struct vhost_vfio_op {
__u64 request;
__u32 flags;
/* Flag values: */
 #define VHOST_VFIO_NEED_REPLY 0x1 /* Whether need reply */
__u32 size;
union {
__u64 u64;
struct vhost_vring_state state;
struct vhost_vring_addr addr;
} payload;
};

The existing vhost-kernel ioctl cmds are reused as the message
requests in above structure.

Each message will be written to or read from this region at offset 0:

int vhost_vfio_write(struct vhost_dev *dev, struct vhost_vfio_op *op)
{
int count = VHOST_VFIO_OP_HDR_SIZE + op->size;
struct vhost_vfio *vfio = dev->opaque;
int ret;

ret = pwrite64(vfio->device_fd, op, count, vfio->config_offset);
if (ret != count)
return -1;

return 0;
}

int vhost_vfio_read(struct vhost_dev *dev, struct vhost_vfio_op *op)
{
int count = VHOST_VFIO_OP_HDR_SIZE + op->size;
struct vhost_vfio *vfio = dev->opaque;
uint64_t request = op->request;
int ret;

ret = pread64(vfio->device_fd, op, count, vfio->config_offset);
if (ret != count || request != op->request)
return -1;

return 0;
}

It's quite straightforward to set things on the device. Just write
the message to the device directly:

int vhost_vfio_set_features(struct vhost_dev *dev, uint64_t features)
{
struct vhost_vfio_op op;

op.request = VHOST_SET_FEATURES;
op.flags = 0;
op.size = sizeof(features);
op.payload.u64 = features;

return vhost_vfio_write(dev, &op);
}

To get things from the device, two steps are needed.
Take VHOST_GET_FEATURES as an example:

int vhost_vfio_get_features(struct vhost_dev *dev, uint64_t *features)
{
struct vhost_vfio_op op;
int ret;

op.request = VHOST_GET_FEATURES;
op.flags = VHOST_VFIO_NEED_REPLY;
op.size = 0;

/* Just need to write the header */
ret = vhost_vfio_write(dev, &op);
if (ret != 0)
goto out;

/* `op` wasn't changed during write */
op.flags = 0;
op.size = sizeof(*features);

ret = vhost_vfio_read(dev, &op);
if (ret != 0)
goto out;

*features = op.payload.u64;
out:
return ret;
}

2. NOTIFY_REGION (mmap-able)

The region described by NOTIFY_REGION will be used to notify
the device.

Each queue will have a page for notification, and it can be mapped
into the VM (if the hardware also supports this), so the virtio driver
in the VM will be able to notify the device directly.
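
On the host side, mapping one queue's notification page could look
roughly like the sketch below (illustration only; it assumes the
region's mmap offset was already obtained, e.g. via
VFIO_DEVICE_GET_REGION_INFO):

#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

void *example_map_notify_page(int device_fd, uint64_t notify_offset, int queue_idx)
{
	long page_size = sysconf(_SC_PAGESIZE);

	/* One page per queue; writing to it kicks the device. */
	return mmap(NULL, page_size, PROT_READ | PROT_WRITE, MAP_SHARED,
		    device_fd, notify_offset + (uint64_t)queue_idx * page_size);
}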

The region described by NOTIFY_REGION is also writable. If
the accelerator's notification register(s) cannot be mapped to
the VM, write() can also be used to notify the device. Something
like this:

void notify_relay(void *opaque)
{
..
offset = host_page_size * queue_idx;

ret = pwrite64(vfio->device_fd, &queue_idx, sizeof(queue_idx),
vfio->notify_offset + offset);
..
}

3. VFIO interrupt ioctl API

The VFIO interrupt ioctl API is used to set up device interrupts.
IRQ bypass can also be supported.

Currently, the data path interrupt can be configured via
VFIO_VHOST_VQ_IRQ_INDEX with the virtqueue's callfd.
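
A sketch of how the callfd could be wired up through the standard
VFIO interrupt ioctl (error handling omitted; VFIO_VHOST_VQ_IRQ_INDEX
is the index introduced by this patch):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Let the device signal the virtqueue's callfd (an eventfd) on interrupt. */
int example_set_vq_irq(int device_fd, int queue_idx, int callfd)
{
	size_t sz = sizeof(struct vfio_irq_set) + sizeof(int32_t);
	struct vfio_irq_set *irq_set = calloc(1, sz);
	int ret;

	irq_set->argsz = sz;
	irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
	irq_set->index = VFIO_VHOST_VQ_IRQ_INDEX;
	irq_set->start = queue_idx;
	irq_set->count = 1;
	memcpy(irq_set->data, &callfd, sizeof(int32_t));

	ret = ioctl(device_fd, VFIO_DEVICE_SET_IRQS, irq_set);
	free(irq_set);
	return ret;
}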

Signed-off-by: Tiwei Bie 
---
 drivers/vhost/Makefile |   2 +
 drivers/vhost/vdpa.c   | 770 +
 include/linux/vdpa_mdev.h  |  72 
 include/uapi/linux/vfio.h  |  19 +
 include/uapi/linux/vhost.h |  25 ++
 5 files changed, 888 insertions(+)
 create mode 100644 drivers/vhost/vdpa.c
 create mode 100644 include/linux/vdpa_mdev.h

diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index 6c6df24f770c..cabb71095940 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -10,4 +10,6 @@ vhost_vsock-y := vsock.o
 
 obj-$(CONFIG_VHOST_RING) += vringh.o
 
+obj-$(CONFIG_VHOST_VFIO) += vdpa.o
+
 obj-$(CONFIG_VHOST)+= vhost.o
diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
new file mode 100644
index ..5c9426e2a091
--- /dev/null
+++ b/drivers/vhost/vdpa.c

Re: [PATCH] virtio: drop internal struct from UAPI

2019-02-01 Thread Tiwei Bie
On Fri, Feb 01, 2019 at 05:16:01PM -0500, Michael S. Tsirkin wrote:
> There's no reason to expose struct vring_packed in UAPI - if we do we
> won't be able to change or drop it, and it's not part of any interface.
> 
> Let's move it to virtio_ring.c
> 
> Cc: Tiwei Bie 
> Signed-off-by: Michael S. Tsirkin 
> ---
>  drivers/virtio/virtio_ring.c |  7 ++-
>  include/uapi/linux/virtio_ring.h | 10 --
>  2 files changed, 6 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 412e0c431d87..1c85b3423182 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -152,7 +152,12 @@ struct vring_virtqueue {
>   /* Available for packed ring */
>   struct {
>   /* Actual memory layout for this queue. */
> - struct vring_packed vring;
> + struct {
> + unsigned int num;
> + struct vring_packed_desc *desc;
> + struct vring_packed_desc_event *driver;
> + struct vring_packed_desc_event *device;
> + } vring;
>  
>   /* Driver ring wrap counter. */
>   bool avail_wrap_counter;
> diff --git a/include/uapi/linux/virtio_ring.h 
> b/include/uapi/linux/virtio_ring.h
> index 2414f8af26b3..4c4e24c291a5 100644
> --- a/include/uapi/linux/virtio_ring.h
> +++ b/include/uapi/linux/virtio_ring.h
> @@ -213,14 +213,4 @@ struct vring_packed_desc {
>   __le16 flags;
>  };
>  
> -struct vring_packed {
> - unsigned int num;
> -
> - struct vring_packed_desc *desc;
> -
> - struct vring_packed_desc_event *driver;
> -
> - struct vring_packed_desc_event *device;
> -};
> -
>  #endif /* _UAPI_LINUX_VIRTIO_RING_H */
> -- 
> MST

Acked-by: Tiwei Bie 


[PATCH v2] virtio: support VIRTIO_F_ORDER_PLATFORM

2019-01-23 Thread Tiwei Bie
This patch introduces the support for VIRTIO_F_ORDER_PLATFORM.
If this feature is negotiated, the driver must use the barriers
suitable for hardware devices. Otherwise, the device and driver
are assumed to be implemented in software, that is they can be
assumed to run on identical CPUs in an SMP configuration. Thus
a weaker form of memory barriers is sufficient to yield better
performance.

It is recommended that an add-in card based PCI device offers
this feature for portability. The device will fail to operate
further or will operate in a slower emulation mode if this
feature is offered but not accepted.
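
For context, a paraphrase of the vring barrier helper whose behavior
this flag selects (taken from include/linux/virtio_ring.h; details may
differ by kernel version, so treat it as a sketch): clearing
weak_barriers in the hunks below makes the core use the stronger DMA
barriers.

static inline void virtio_wmb(bool weak_barriers)
{
	if (weak_barriers)
		virt_wmb();	/* ordering against another CPU is enough */
	else
		dma_wmb();	/* ordering must be visible to a real device */
}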

Signed-off-by: Tiwei Bie 
---
v2:
- Add more explanations in commit log (MST);

 drivers/virtio/virtio_ring.c   | 8 
 include/uapi/linux/virtio_config.h | 6 ++
 2 files changed, 14 insertions(+)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index cd7e755484e3..27d3f057493e 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -1609,6 +1609,9 @@ static struct virtqueue *vring_create_virtqueue_packed(
!context;
vq->event = virtio_has_feature(vdev, VIRTIO_RING_F_EVENT_IDX);
 
+   if (virtio_has_feature(vdev, VIRTIO_F_ORDER_PLATFORM))
+   vq->weak_barriers = false;
+
vq->packed.ring_dma_addr = ring_dma_addr;
vq->packed.driver_event_dma_addr = driver_event_dma_addr;
vq->packed.device_event_dma_addr = device_event_dma_addr;
@@ -2079,6 +2082,9 @@ struct virtqueue *__vring_new_virtqueue(unsigned int 
index,
!context;
vq->event = virtio_has_feature(vdev, VIRTIO_RING_F_EVENT_IDX);
 
+   if (virtio_has_feature(vdev, VIRTIO_F_ORDER_PLATFORM))
+   vq->weak_barriers = false;
+
vq->split.queue_dma_addr = 0;
vq->split.queue_size_in_bytes = 0;
 
@@ -2213,6 +2219,8 @@ void vring_transport_features(struct virtio_device *vdev)
break;
case VIRTIO_F_RING_PACKED:
break;
+   case VIRTIO_F_ORDER_PLATFORM:
+   break;
default:
/* We don't understand this bit. */
__virtio_clear_bit(vdev, i);
diff --git a/include/uapi/linux/virtio_config.h 
b/include/uapi/linux/virtio_config.h
index 1196e1c1d4f6..ff8e7dc9d4dd 100644
--- a/include/uapi/linux/virtio_config.h
+++ b/include/uapi/linux/virtio_config.h
@@ -78,6 +78,12 @@
 /* This feature indicates support for the packed virtqueue layout. */
 #define VIRTIO_F_RING_PACKED   34
 
+/*
+ * This feature indicates that memory accesses by the driver and the
+ * device are ordered in a way described by the platform.
+ */
+#define VIRTIO_F_ORDER_PLATFORM36
+
 /*
  * Does the device support Single Root I/O Virtualization?
  */
-- 
2.17.1



Re: [PATCH] virtio: support VIRTIO_F_ORDER_PLATFORM

2019-01-23 Thread Tiwei Bie
On Tue, Jan 22, 2019 at 11:04:29PM -0500, Michael S. Tsirkin wrote:
> On Wed, Jan 23, 2019 at 01:03:46AM +0800, Tiwei Bie wrote:
> > This patch introduces the support for VIRTIO_F_ORDER_PLATFORM.
> > When this feature is negotiated, driver will use the barriers
> > suitable for hardware devices.
> > 
> > Signed-off-by: Tiwei Bie 
> 
> Could you pls add a bit more explanation in the commit log?
> E.g. which configurations are broken without this patch?
> How severe is the problem?

Sure. Will do that.

Thanks

> 
> I'm trying to decide whether this belongs in 5.0 or 5.1.
> 
> > ---
> >  drivers/virtio/virtio_ring.c   | 8 
> >  include/uapi/linux/virtio_config.h | 6 ++
> >  2 files changed, 14 insertions(+)
> > 
> > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > index cd7e755484e3..27d3f057493e 100644
> > --- a/drivers/virtio/virtio_ring.c
> > +++ b/drivers/virtio/virtio_ring.c
> > @@ -1609,6 +1609,9 @@ static struct virtqueue 
> > *vring_create_virtqueue_packed(
> > !context;
> > vq->event = virtio_has_feature(vdev, VIRTIO_RING_F_EVENT_IDX);
> >  
> > +   if (virtio_has_feature(vdev, VIRTIO_F_ORDER_PLATFORM))
> > +   vq->weak_barriers = false;
> > +
> > vq->packed.ring_dma_addr = ring_dma_addr;
> > vq->packed.driver_event_dma_addr = driver_event_dma_addr;
> > vq->packed.device_event_dma_addr = device_event_dma_addr;
> > @@ -2079,6 +2082,9 @@ struct virtqueue *__vring_new_virtqueue(unsigned int 
> > index,
> > !context;
> > vq->event = virtio_has_feature(vdev, VIRTIO_RING_F_EVENT_IDX);
> >  
> > +   if (virtio_has_feature(vdev, VIRTIO_F_ORDER_PLATFORM))
> > +   vq->weak_barriers = false;
> > +
> > vq->split.queue_dma_addr = 0;
> > vq->split.queue_size_in_bytes = 0;
> >  
> > @@ -2213,6 +2219,8 @@ void vring_transport_features(struct virtio_device 
> > *vdev)
> > break;
> > case VIRTIO_F_RING_PACKED:
> > break;
> > +   case VIRTIO_F_ORDER_PLATFORM:
> > +   break;
> > default:
> > /* We don't understand this bit. */
> > __virtio_clear_bit(vdev, i);
> > diff --git a/include/uapi/linux/virtio_config.h 
> > b/include/uapi/linux/virtio_config.h
> > index 1196e1c1d4f6..ff8e7dc9d4dd 100644
> > --- a/include/uapi/linux/virtio_config.h
> > +++ b/include/uapi/linux/virtio_config.h
> > @@ -78,6 +78,12 @@
> >  /* This feature indicates support for the packed virtqueue layout. */
> >  #define VIRTIO_F_RING_PACKED   34
> >  
> > +/*
> > + * This feature indicates that memory accesses by the driver and the
> > + * device are ordered in a way described by the platform.
> > + */
> > +#define VIRTIO_F_ORDER_PLATFORM36
> > +
> >  /*
> >   * Does the device support Single Root I/O Virtualization?
> >   */
> > -- 
> > 2.17.1


[PATCH] virtio: support VIRTIO_F_ORDER_PLATFORM

2019-01-22 Thread Tiwei Bie
This patch introduces the support for VIRTIO_F_ORDER_PLATFORM.
When this feature is negotiated, driver will use the barriers
suitable for hardware devices.

Signed-off-by: Tiwei Bie 
---
 drivers/virtio/virtio_ring.c   | 8 
 include/uapi/linux/virtio_config.h | 6 ++
 2 files changed, 14 insertions(+)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index cd7e755484e3..27d3f057493e 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -1609,6 +1609,9 @@ static struct virtqueue *vring_create_virtqueue_packed(
!context;
vq->event = virtio_has_feature(vdev, VIRTIO_RING_F_EVENT_IDX);
 
+   if (virtio_has_feature(vdev, VIRTIO_F_ORDER_PLATFORM))
+   vq->weak_barriers = false;
+
vq->packed.ring_dma_addr = ring_dma_addr;
vq->packed.driver_event_dma_addr = driver_event_dma_addr;
vq->packed.device_event_dma_addr = device_event_dma_addr;
@@ -2079,6 +2082,9 @@ struct virtqueue *__vring_new_virtqueue(unsigned int 
index,
!context;
vq->event = virtio_has_feature(vdev, VIRTIO_RING_F_EVENT_IDX);
 
+   if (virtio_has_feature(vdev, VIRTIO_F_ORDER_PLATFORM))
+   vq->weak_barriers = false;
+
vq->split.queue_dma_addr = 0;
vq->split.queue_size_in_bytes = 0;
 
@@ -2213,6 +2219,8 @@ void vring_transport_features(struct virtio_device *vdev)
break;
case VIRTIO_F_RING_PACKED:
break;
+   case VIRTIO_F_ORDER_PLATFORM:
+   break;
default:
/* We don't understand this bit. */
__virtio_clear_bit(vdev, i);
diff --git a/include/uapi/linux/virtio_config.h 
b/include/uapi/linux/virtio_config.h
index 1196e1c1d4f6..ff8e7dc9d4dd 100644
--- a/include/uapi/linux/virtio_config.h
+++ b/include/uapi/linux/virtio_config.h
@@ -78,6 +78,12 @@
 /* This feature indicates support for the packed virtqueue layout. */
 #define VIRTIO_F_RING_PACKED   34
 
+/*
+ * This feature indicates that memory accesses by the driver and the
+ * device are ordered in a way described by the platform.
+ */
+#define VIRTIO_F_ORDER_PLATFORM36
+
 /*
  * Does the device support Single Root I/O Virtualization?
  */
-- 
2.17.1



Re: [RFC 3/3] virtio_ring: use new vring flags

2018-12-08 Thread Tiwei Bie
On Fri, Dec 07, 2018 at 01:10:48PM -0500, Michael S. Tsirkin wrote:
> On Fri, Dec 07, 2018 at 04:48:42PM +0800, Tiwei Bie wrote:
> > Switch to using the _SPLIT_ and _PACKED_ variants of vring flags
> > in split ring and packed ring respectively.
> > 
> > Signed-off-by: Tiwei Bie 
> > ---
> > @@ -502,7 +505,8 @@ static inline int virtqueue_add_split(struct virtqueue 
> > *_vq,
> > }
> > }
> > /* Last one doesn't continue. */
> > -   desc[prev].flags &= cpu_to_virtio16(_vq->vdev, ~VRING_DESC_F_NEXT);
> > +   desc[prev].flags &= cpu_to_virtio16(_vq->vdev,
> > +   (u16)~BIT(VRING_SPLIT_DESC_F_NEXT));
> >  
> > if (indirect) {
> > /* Now that the indirect table is filled in, map it. */
> 
> I kind of dislike it that this forces use of a cast here.
> Kind of makes it more fragile. Let's use a temporary instead?

I tried something like this:

u16 mask = ~BIT(VRING_SPLIT_DESC_F_NEXT);

And it will still cause the warning:

warning: large integer implicitly truncated to unsigned type [-Woverflow]
  u16 mask = ~BIT(VRING_SPLIT_DESC_F_NEXT);

If the cast isn't wanted, maybe use ~(1 << VRING_SPLIT_DESC_F_NEXT)
directly?

> 
> > -- 
> > 2.17.1


Re: [RFC 2/3] virtio_ring: add VIRTIO_RING_NO_LEGACY

2018-12-08 Thread Tiwei Bie
On Fri, Dec 07, 2018 at 01:05:35PM -0500, Michael S. Tsirkin wrote:
> On Fri, Dec 07, 2018 at 04:48:41PM +0800, Tiwei Bie wrote:
> > Introduce VIRTIO_RING_NO_LEGACY to support disabling legacy
> > macros and layout definitions.
> > 
> > Suggested-by: Michael S. Tsirkin 
> > Signed-off-by: Tiwei Bie 
> > ---
> > VRING_AVAIL_ALIGN_SIZE, VRING_USED_ALIGN_SIZE and VRING_DESC_ALIGN_SIZE
> > are not pre-virtio 1.0, but can also be disabled by VIRTIO_RING_NO_LEGACY
> > in this patch, because their names are not consistent with other names.
> > Not sure whether this is a good idea. If we want this, we may also want
> > to define _SPLIT_ version for them.
> 
> I don't think it's a good idea to have alignment in there - the point of
> NO_LEGACY is to help catch bugs not to sanitize coding style IMHO.
> 
> And spec calls "legacy" the 0.X interfaces, let's not muddy the waters.

Makes sense. Thanks!

> 
> > 
> >  include/uapi/linux/virtio_ring.h | 4 
> >  1 file changed, 4 insertions(+)
> > 
> > diff --git a/include/uapi/linux/virtio_ring.h 
> > b/include/uapi/linux/virtio_ring.h
> > index 9b0c0d92ab62..192573827850 100644
> > --- a/include/uapi/linux/virtio_ring.h
> > +++ b/include/uapi/linux/virtio_ring.h
> > @@ -37,6 +37,7 @@
> >  #include 
> >  #include 
> >  
> > +#ifndef VIRTIO_RING_NO_LEGACY
> >  /*
> >   * Notice: unlike other _F_ flags, below flags are defined as shifted
> >   * values instead of shifts for compatibility.
> > @@ -51,6 +52,7 @@
> >  #define VRING_USED_F_NO_NOTIFY 1
> >  /* Same as VRING_SPLIT_AVAIL_F_NO_INTERRUPT. */
> >  #define VRING_AVAIL_F_NO_INTERRUPT 1
> > +#endif /* VIRTIO_RING_NO_LEGACY */
> >  
> >  /* Mark a buffer as continuing via the next field in split ring. */
> >  #define VRING_SPLIT_DESC_F_NEXT0
> > @@ -151,6 +153,7 @@ struct vring {
> > struct vring_used *used;
> >  };
> >  
> > +#ifndef VIRTIO_RING_NO_LEGACY
> >  /* Alignment requirements for vring elements.
> >   * When using pre-virtio 1.0 layout, these fall out naturally.
> >   */
> > @@ -203,6 +206,7 @@ static inline unsigned vring_size(unsigned int num, 
> > unsigned long align)
> >  + align - 1) & ~(align - 1))
> > + sizeof(__virtio16) * 3 + sizeof(struct vring_used_elem) * num;
> >  }
> > +#endif /* VIRTIO_RING_NO_LEGACY */
> >  
> >  /* The following is used with USED_EVENT_IDX and AVAIL_EVENT_IDX */
> >  /* Assuming a given event_idx value from the other side, if
> > -- 
> > 2.17.1


Re: [RFC 0/3] virtio_ring: define flags as shifts consistently

2018-12-08 Thread Tiwei Bie
On Fri, Dec 07, 2018 at 01:11:42PM -0500, Michael S. Tsirkin wrote:
> On Fri, Dec 07, 2018 at 04:48:39PM +0800, Tiwei Bie wrote:
> > This is a follow up of the discussion in this thread:
> > https://patchwork.ozlabs.org/patch/1001015/#2042353
> 
> How was this tested? I'd suggest building virtio
> before and after the changes, stripped binary
> should be exactly the same.

Sure, I will do the test with scripts/objdiff.

> 
> 
> > Tiwei Bie (3):
> >   virtio_ring: define flags as shifts consistently
> >   virtio_ring: add VIRTIO_RING_NO_LEGACY
> >   virtio_ring: use new vring flags
> > 
> >  drivers/virtio/virtio_ring.c | 100 ++-
> >  include/uapi/linux/virtio_ring.h |  61 +--
> >  2 files changed, 103 insertions(+), 58 deletions(-)
> > 
> > -- 
> > 2.17.1


[RFC 3/3] virtio_ring: use new vring flags

2018-12-07 Thread Tiwei Bie
Switch to using the _SPLIT_ and _PACKED_ variants of vring flags
in split ring and packed ring respectively.

Signed-off-by: Tiwei Bie 
---
 drivers/virtio/virtio_ring.c | 100 +--
 1 file changed, 59 insertions(+), 41 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index cd7e755484e3..2806f69c6c9f 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -371,17 +371,17 @@ static void vring_unmap_one_split(const struct 
vring_virtqueue *vq,
 
flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
 
-   if (flags & VRING_DESC_F_INDIRECT) {
+   if (flags & BIT(VRING_SPLIT_DESC_F_INDIRECT)) {
dma_unmap_single(vring_dma_dev(vq),
 virtio64_to_cpu(vq->vq.vdev, desc->addr),
 virtio32_to_cpu(vq->vq.vdev, desc->len),
-(flags & VRING_DESC_F_WRITE) ?
+(flags & BIT(VRING_SPLIT_DESC_F_WRITE)) ?
 DMA_FROM_DEVICE : DMA_TO_DEVICE);
} else {
dma_unmap_page(vring_dma_dev(vq),
   virtio64_to_cpu(vq->vq.vdev, desc->addr),
   virtio32_to_cpu(vq->vq.vdev, desc->len),
-  (flags & VRING_DESC_F_WRITE) ?
+  (flags & BIT(VRING_SPLIT_DESC_F_WRITE)) ?
   DMA_FROM_DEVICE : DMA_TO_DEVICE);
}
 }
@@ -481,7 +481,8 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
if (vring_mapping_error(vq, addr))
goto unmap_release;
 
-   desc[i].flags = cpu_to_virtio16(_vq->vdev, 
VRING_DESC_F_NEXT);
+   desc[i].flags = cpu_to_virtio16(_vq->vdev,
+   BIT(VRING_SPLIT_DESC_F_NEXT));
desc[i].addr = cpu_to_virtio64(_vq->vdev, addr);
desc[i].len = cpu_to_virtio32(_vq->vdev, sg->length);
prev = i;
@@ -494,7 +495,9 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
if (vring_mapping_error(vq, addr))
goto unmap_release;
 
-   desc[i].flags = cpu_to_virtio16(_vq->vdev, 
VRING_DESC_F_NEXT | VRING_DESC_F_WRITE);
+   desc[i].flags = cpu_to_virtio16(_vq->vdev,
+   BIT(VRING_SPLIT_DESC_F_NEXT) |
+   BIT(VRING_SPLIT_DESC_F_WRITE));
desc[i].addr = cpu_to_virtio64(_vq->vdev, addr);
desc[i].len = cpu_to_virtio32(_vq->vdev, sg->length);
prev = i;
@@ -502,7 +505,8 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
}
}
/* Last one doesn't continue. */
-   desc[prev].flags &= cpu_to_virtio16(_vq->vdev, ~VRING_DESC_F_NEXT);
+   desc[prev].flags &= cpu_to_virtio16(_vq->vdev,
+   (u16)~BIT(VRING_SPLIT_DESC_F_NEXT));
 
if (indirect) {
/* Now that the indirect table is filled in, map it. */
@@ -513,7 +517,7 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
goto unmap_release;
 
vq->split.vring.desc[head].flags = cpu_to_virtio16(_vq->vdev,
-   VRING_DESC_F_INDIRECT);
+   BIT(VRING_SPLIT_DESC_F_INDIRECT));
vq->split.vring.desc[head].addr = cpu_to_virtio64(_vq->vdev,
addr);
 
@@ -603,8 +607,8 @@ static bool virtqueue_kick_prepare_split(struct virtqueue 
*_vq)
  new, old);
} else {
needs_kick = !(vq->split.vring.used->flags &
-   cpu_to_virtio16(_vq->vdev,
-   VRING_USED_F_NO_NOTIFY));
+   cpu_to_virtio16(_vq->vdev,
+   BIT(VRING_SPLIT_USED_F_NO_NOTIFY)));
}
END_USE(vq);
return needs_kick;
@@ -614,7 +618,8 @@ static void detach_buf_split(struct vring_virtqueue *vq, 
unsigned int head,
 void **ctx)
 {
unsigned int i, j;
-   __virtio16 nextflag = cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_NEXT);
+   __virtio16 nextflag = cpu_to_virtio16(vq->vq.vdev,
+   BIT(VRING_SPLIT_DESC_F_NEXT));
 
/* Clear data ptr. */
vq->split.desc_state[head].data = NULL;
@@ -649,7 +654,8 @@ static void detach_buf_split(struct vring_virtqueue *vq, 
unsigned

[RFC 2/3] virtio_ring: add VIRTIO_RING_NO_LEGACY

2018-12-07 Thread Tiwei Bie
Introduce VIRTIO_RING_NO_LEGACY to support disabling legacy
macros and layout definitions.

Suggested-by: Michael S. Tsirkin 
Signed-off-by: Tiwei Bie 
---
VRING_AVAIL_ALIGN_SIZE, VRING_USED_ALIGN_SIZE and VRING_DESC_ALIGN_SIZE
are not pre-virtio 1.0, but can also be disabled by VIRTIO_RING_NO_LEGACY
in this patch, because their names are not consistent with other names.
Not sure whether this is a good idea. If we want this, we may also want
to define _SPLIT_ version for them.

 include/uapi/linux/virtio_ring.h | 4 
 1 file changed, 4 insertions(+)

diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h
index 9b0c0d92ab62..192573827850 100644
--- a/include/uapi/linux/virtio_ring.h
+++ b/include/uapi/linux/virtio_ring.h
@@ -37,6 +37,7 @@
 #include 
 #include 
 
+#ifndef VIRTIO_RING_NO_LEGACY
 /*
  * Notice: unlike other _F_ flags, below flags are defined as shifted
  * values instead of shifts for compatibility.
@@ -51,6 +52,7 @@
 #define VRING_USED_F_NO_NOTIFY 1
 /* Same as VRING_SPLIT_AVAIL_F_NO_INTERRUPT. */
 #define VRING_AVAIL_F_NO_INTERRUPT 1
+#endif /* VIRTIO_RING_NO_LEGACY */
 
 /* Mark a buffer as continuing via the next field in split ring. */
 #define VRING_SPLIT_DESC_F_NEXT0
@@ -151,6 +153,7 @@ struct vring {
struct vring_used *used;
 };
 
+#ifndef VIRTIO_RING_NO_LEGACY
 /* Alignment requirements for vring elements.
  * When using pre-virtio 1.0 layout, these fall out naturally.
  */
@@ -203,6 +206,7 @@ static inline unsigned vring_size(unsigned int num, 
unsigned long align)
 + align - 1) & ~(align - 1))
+ sizeof(__virtio16) * 3 + sizeof(struct vring_used_elem) * num;
 }
+#endif /* VIRTIO_RING_NO_LEGACY */
 
 /* The following is used with USED_EVENT_IDX and AVAIL_EVENT_IDX */
 /* Assuming a given event_idx value from the other side, if
-- 
2.17.1



[RFC 1/3] virtio_ring: define flags as shifts consistently

2018-12-07 Thread Tiwei Bie
Introduce _SPLIT_ and/or _PACKED_ variants for VRING_DESC_F_*,
VRING_AVAIL_F_NO_INTERRUPT and VRING_USED_F_NO_NOTIFY. These
variants are defined as shifts instead of shifted values for
consistency with other _F_ flags.

Suggested-by: Michael S. Tsirkin 
Signed-off-by: Tiwei Bie 
---
 include/uapi/linux/virtio_ring.h | 57 ++--
 1 file changed, 40 insertions(+), 17 deletions(-)

diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h
index 2414f8af26b3..9b0c0d92ab62 100644
--- a/include/uapi/linux/virtio_ring.h
+++ b/include/uapi/linux/virtio_ring.h
@@ -37,29 +37,52 @@
 #include 
 #include 
 
-/* This marks a buffer as continuing via the next field. */
+/*
+ * Notice: unlike other _F_ flags, below flags are defined as shifted
+ * values instead of shifts for compatibility.
+ */
+/* Same as VRING_SPLIT_DESC_F_NEXT. */
 #define VRING_DESC_F_NEXT  1
-/* This marks a buffer as write-only (otherwise read-only). */
+/* Same as VRING_SPLIT_DESC_F_WRITE. */
 #define VRING_DESC_F_WRITE 2
-/* This means the buffer contains a list of buffer descriptors. */
+/* Same as VRING_SPLIT_DESC_F_INDIRECT. */
 #define VRING_DESC_F_INDIRECT  4
-
-/*
- * Mark a descriptor as available or used in packed ring.
- * Notice: they are defined as shifts instead of shifted values.
- */
-#define VRING_PACKED_DESC_F_AVAIL  7
-#define VRING_PACKED_DESC_F_USED   15
-
-/* The Host uses this in used->flags to advise the Guest: don't kick me when
- * you add a buffer.  It's unreliable, so it's simply an optimization.  Guest
- * will still kick if it's out of buffers. */
+/* Same as VRING_SPLIT_USED_F_NO_NOTIFY. */
 #define VRING_USED_F_NO_NOTIFY 1
-/* The Guest uses this in avail->flags to advise the Host: don't interrupt me
- * when you consume a buffer.  It's unreliable, so it's simply an
- * optimization.  */
+/* Same as VRING_SPLIT_AVAIL_F_NO_INTERRUPT. */
 #define VRING_AVAIL_F_NO_INTERRUPT 1
 
+/* Mark a buffer as continuing via the next field in split ring. */
+#define VRING_SPLIT_DESC_F_NEXT0
+/* Mark a buffer as write-only (otherwise read-only) in split ring. */
+#define VRING_SPLIT_DESC_F_WRITE   1
+/* Mean the buffer contains a list of buffer descriptors in split ring. */
+#define VRING_SPLIT_DESC_F_INDIRECT2
+
+/*
+ * The Host uses this in used->flags in split ring to advise the Guest:
+ * don't kick me when you add a buffer.  It's unreliable, so it's simply
+ * an optimization.  Guest will still kick if it's out of buffers.
+ */
+#define VRING_SPLIT_USED_F_NO_NOTIFY   0
+/*
+ * The Guest uses this in avail->flags in split ring to advise the Host:
+ * don't interrupt me when you consume a buffer.  It's unreliable, so it's
+ * simply an optimization.
+ */
+#define VRING_SPLIT_AVAIL_F_NO_INTERRUPT   0
+
+/* Mark a buffer as continuing via the next field in packed ring. */
+#define VRING_PACKED_DESC_F_NEXT   0
+/* Mark a buffer as write-only (otherwise read-only) in packed ring. */
+#define VRING_PACKED_DESC_F_WRITE  1
+/* Mean the buffer contains a list of buffer descriptors in packed ring. */
+#define VRING_PACKED_DESC_F_INDIRECT   2
+
+/* Mark a descriptor as available or used in packed ring. */
+#define VRING_PACKED_DESC_F_AVAIL  7
+#define VRING_PACKED_DESC_F_USED   15
+
 /* Enable events in packed ring. */
 #define VRING_PACKED_EVENT_FLAG_ENABLE 0x0
 /* Disable events in packed ring. */
-- 
2.17.1



[RFC 0/3] virtio_ring: define flags as shifts consistently

2018-12-07 Thread Tiwei Bie
This is a follow up of the discussion in this thread:
https://patchwork.ozlabs.org/patch/1001015/#2042353

Tiwei Bie (3):
  virtio_ring: define flags as shifts consistently
  virtio_ring: add VIRTIO_RING_NO_LEGACY
  virtio_ring: use new vring flags

 drivers/virtio/virtio_ring.c | 100 ++-
 include/uapi/linux/virtio_ring.h |  61 +--
 2 files changed, 103 insertions(+), 58 deletions(-)

-- 
2.17.1



Re: [PATCH net-next v3 01/13] virtio: add packed ring types and macros

2018-11-30 Thread Tiwei Bie
On Fri, Nov 30, 2018 at 08:52:42AM -0500, Michael S. Tsirkin wrote:
> On Fri, Nov 30, 2018 at 02:01:06PM +0100, Maxime Coquelin wrote:
> > On 11/30/18 1:47 PM, Michael S. Tsirkin wrote:
> > > On Fri, Nov 30, 2018 at 05:53:40PM +0800, Tiwei Bie wrote:
> > > > On Fri, Nov 30, 2018 at 04:10:55PM +0800, Jason Wang wrote:
> > > > > 
> > > > > On 2018/11/21 下午6:03, Tiwei Bie wrote:
> > > > > > Add types and macros for packed ring.
> > > > > > 
> > > > > > Signed-off-by: Tiwei Bie 
> > > > > > ---
> > > > > >include/uapi/linux/virtio_config.h |  3 +++
> > > > > >include/uapi/linux/virtio_ring.h   | 52 
> > > > > > ++
> > > > > >2 files changed, 55 insertions(+)
> > > > > > 
> > > > > > diff --git a/include/uapi/linux/virtio_config.h 
> > > > > > b/include/uapi/linux/virtio_config.h
> > > > > > index 449132c76b1c..1196e1c1d4f6 100644
> > > > > > --- a/include/uapi/linux/virtio_config.h
> > > > > > +++ b/include/uapi/linux/virtio_config.h
> > > > > > @@ -75,6 +75,9 @@
> > > > > > */
> > > > > >#define VIRTIO_F_IOMMU_PLATFORM  33
> > > > > > +/* This feature indicates support for the packed virtqueue layout. 
> > > > > > */
> > > > > > +#define VIRTIO_F_RING_PACKED   34
> > > > > > +
> > > > > >/*
> > > > > > * Does the device support Single Root I/O Virtualization?
> > > > > > */
> > > > > > diff --git a/include/uapi/linux/virtio_ring.h 
> > > > > > b/include/uapi/linux/virtio_ring.h
> > > > > > index 6d5d5faa989b..2414f8af26b3 100644
> > > > > > --- a/include/uapi/linux/virtio_ring.h
> > > > > > +++ b/include/uapi/linux/virtio_ring.h
> > > > > > @@ -44,6 +44,13 @@
> > > > > >/* This means the buffer contains a list of buffer descriptors. 
> > > > > > */
> > > > > >#define VRING_DESC_F_INDIRECT4
> > > > > > +/*
> > > > > > + * Mark a descriptor as available or used in packed ring.
> > > > > > + * Notice: they are defined as shifts instead of shifted values.
> > > > > 
> > > > > 
> > > > > This looks inconsistent to previous flags, any reason for using 
> > > > > shifts?
> > > > 
> > > > Yeah, it was suggested to use shifts, as _F_ should be a bit
> > > > number, not a shifted value:
> > > > 
> > > > https://patchwork.ozlabs.org/patch/942296/#1989390
> > > 
> > > But let's add a _SPLIT_ variant that uses shifts consistently.
> > 
> > Maybe we could avoid adding SPLIT and PACKED, but define as follow:
> > 
> > #define VRING_DESC_F_INDIRECT_SHIFT 2
> > #define VRING_DESC_F_INDIRECT (1ull <<  VRING_DESC_F_INDIRECT_SHIFT)
> > 
> > #define VRING_DESC_F_AVAIL_SHIFT 7
> > #define VRING_DESC_F_AVAIL (1ull << VRING_DESC_F_AVAIL_SHIFT)
> > 
> > I think it would be better for consistency.
> 
> I don't think that will help. the problem is that
> most of the existing virtio code consistently uses _F_ as shifts.
> So we just need to do something about these 5 being inconsistent:
> 
> include/uapi/linux/virtio_ring.h:#define VRING_DESC_F_NEXT  1
> include/uapi/linux/virtio_ring.h:#define VRING_DESC_F_WRITE 2
> include/uapi/linux/virtio_ring.h:#define VRING_DESC_F_INDIRECT  4
> include/uapi/linux/virtio_ring.h:#define VRING_USED_F_NO_NOTIFY 1
> include/uapi/linux/virtio_ring.h:#define VRING_AVAIL_F_NO_INTERRUPT 1
> 
> I do not want all of virtio to become verbose with _SHIFT, ergo
> we need to change the above 5 to have names which are with _F_ and
> have the bit number.

How about something like this:

#define VRING_COMM_DESC_F_NEXT  0
#define VRING_COMM_DESC_F_WRITE 1
#define VRING_COMM_DESC_F_INDIRECT  2

#define VRING_SPLIT_USED_F_NO_NOTIFY0
#define VRING_SPLIT_AVAIL_F_NO_INTERRUPT0

or

#define VRING_SPLIT_DESC_F_NEXT 0
#define VRING_SPLIT_DESC_F_WRITE1
#define VRING_SPLIT_DESC_F_INDIRECT 2

#define VRING_SPLIT_USED_F_NO_NOTIFY0
#define VRING_SPLIT_AVAIL_F_NO_INTERRUPT0

#define VRING_PACKED_DESC_F_NEXT0
#define VRING_PACKED


Re: [PATCH net-next v3 01/13] virtio: add packed ring types and macros

2018-11-30 Thread Tiwei Bie
On Fri, Nov 30, 2018 at 04:10:55PM +0800, Jason Wang wrote:
> 
> On 2018/11/21 下午6:03, Tiwei Bie wrote:
> > Add types and macros for packed ring.
> > 
> > Signed-off-by: Tiwei Bie 
> > ---
> >   include/uapi/linux/virtio_config.h |  3 +++
> >   include/uapi/linux/virtio_ring.h   | 52 
> > ++
> >   2 files changed, 55 insertions(+)
> > 
> > diff --git a/include/uapi/linux/virtio_config.h 
> > b/include/uapi/linux/virtio_config.h
> > index 449132c76b1c..1196e1c1d4f6 100644
> > --- a/include/uapi/linux/virtio_config.h
> > +++ b/include/uapi/linux/virtio_config.h
> > @@ -75,6 +75,9 @@
> >*/
> >   #define VIRTIO_F_IOMMU_PLATFORM   33
> > +/* This feature indicates support for the packed virtqueue layout. */
> > +#define VIRTIO_F_RING_PACKED   34
> > +
> >   /*
> >* Does the device support Single Root I/O Virtualization?
> >*/
> > diff --git a/include/uapi/linux/virtio_ring.h 
> > b/include/uapi/linux/virtio_ring.h
> > index 6d5d5faa989b..2414f8af26b3 100644
> > --- a/include/uapi/linux/virtio_ring.h
> > +++ b/include/uapi/linux/virtio_ring.h
> > @@ -44,6 +44,13 @@
> >   /* This means the buffer contains a list of buffer descriptors. */
> >   #define VRING_DESC_F_INDIRECT 4
> > +/*
> > + * Mark a descriptor as available or used in packed ring.
> > + * Notice: they are defined as shifts instead of shifted values.
> 
> 
> This looks inconsistent to previous flags, any reason for using shifts?

Yeah, it was suggested to use shifts, as _F_ should be a bit
number, not a shifted value:

https://patchwork.ozlabs.org/patch/942296/#1989390

> 
> 
> > + */
> > +#define VRING_PACKED_DESC_F_AVAIL  7
> > +#define VRING_PACKED_DESC_F_USED   15
> > +
> >   /* The Host uses this in used->flags to advise the Guest: don't kick me 
> > when
> >* you add a buffer.  It's unreliable, so it's simply an optimization.  
> > Guest
> >* will still kick if it's out of buffers. */
> > @@ -53,6 +60,23 @@
> >* optimization.  */
> >   #define VRING_AVAIL_F_NO_INTERRUPT1
> > +/* Enable events in packed ring. */
> > +#define VRING_PACKED_EVENT_FLAG_ENABLE 0x0
> > +/* Disable events in packed ring. */
> > +#define VRING_PACKED_EVENT_FLAG_DISABLE0x1
> > +/*
> > + * Enable events for a specific descriptor in packed ring.
> > + * (as specified by Descriptor Ring Change Event Offset/Wrap Counter).
> > + * Only valid if VIRTIO_RING_F_EVENT_IDX has been negotiated.
> > + */
> > +#define VRING_PACKED_EVENT_FLAG_DESC   0x2
> 
> 
> Any reason for using _FLAG_ instead of _F_?

Yeah, it was suggested not to use _F_, as these are values and
should not have _F_:

https://patchwork.ozlabs.org/patch/942296/#1989390

Regards,
Tiwei

> 
> Thanks
> 
> 
> > +
> > +/*
> > + * Wrap counter bit shift in event suppression structure
> > + * of packed ring.
> > + */
> > +#define VRING_PACKED_EVENT_F_WRAP_CTR  15
> > +
> >   /* We support indirect buffer descriptors */
> >   #define VIRTIO_RING_F_INDIRECT_DESC   28
> > @@ -171,4 +195,32 @@ static inline int vring_need_event(__u16 event_idx, 
> > __u16 new_idx, __u16 old)
> > return (__u16)(new_idx - event_idx - 1) < (__u16)(new_idx - old);
> >   }
> > +struct vring_packed_desc_event {
> > +   /* Descriptor Ring Change Event Offset/Wrap Counter. */
> > +   __le16 off_wrap;
> > +   /* Descriptor Ring Change Event Flags. */
> > +   __le16 flags;
> > +};
> > +
> > +struct vring_packed_desc {
> > +   /* Buffer Address. */
> > +   __le64 addr;
> > +   /* Buffer Length. */
> > +   __le32 len;
> > +   /* Buffer ID. */
> > +   __le16 id;
> > +   /* The flags depending on descriptor type. */
> > +   __le16 flags;
> > +};
> > +
> > +struct vring_packed {
> > +   unsigned int num;
> > +
> > +   struct vring_packed_desc *desc;
> > +
> > +   struct vring_packed_desc_event *driver;
> > +
> > +   struct vring_packed_desc_event *device;
> > +};
> > +
> >   #endif /* _UAPI_LINUX_VIRTIO_RING_H */


Re: [virtio-dev] Re: [PATCH net-next v2 0/5] virtio: support packed ring

2018-10-11 Thread Tiwei Bie
On Thu, Oct 11, 2018 at 10:17:15AM -0400, Michael S. Tsirkin wrote:
> On Thu, Oct 11, 2018 at 10:13:31PM +0800, Tiwei Bie wrote:
> > On Thu, Oct 11, 2018 at 09:48:48AM -0400, Michael S. Tsirkin wrote:
> > > On Thu, Oct 11, 2018 at 08:12:21PM +0800, Tiwei Bie wrote:
> > > > > > But if it's not too late, I second for a OUT_OF_ORDER feature.
> > > > > > Starting from in order can have much simpler code in driver.
> > > > > > 
> > > > > > Thanks
> > > > > 
> > > > > It's tricky to change the flag polarity because of compatibility
> > > > > with legacy interfaces. Why is this such a big deal?
> > > > > 
> > > > > Let's teach drivers about IN_ORDER, then if devices
> > > > > are in order it will get enabled by default.
> > > > 
> > > > Yeah, make sense.
> > > > 
> > > > Besides, I have done some further profiling and debugging
> > > > both in kernel driver and DPDK vhost. Previously I was mislead
> > > > by a bug in vhost code. I will send a patch to fix that bug.
> > > > With that bug fixed, the performance of packed ring in the
> > > > test between kernel driver and DPDK vhost is better now.
> > > 
> > > OK, if we get a performance gain on the virtio side, we can finally
> > > upstream it. If you see that please re-post ASAP so we can
> > > put it in the next kernel release.
> > 
> > Got it, I will re-post ASAP.
> > 
> > Thanks!
> 
> 
> Pls remember to include data on performance gain in the cover letter.

Sure. I'll try to include some performance analyses.



Re: [virtio-dev] Re: [PATCH net-next v2 0/5] virtio: support packed ring

2018-10-11 Thread Tiwei Bie
On Thu, Oct 11, 2018 at 09:48:48AM -0400, Michael S. Tsirkin wrote:
> On Thu, Oct 11, 2018 at 08:12:21PM +0800, Tiwei Bie wrote:
> > > > But if it's not too late, I second for a OUT_OF_ORDER feature.
> > > > Starting from in order can have much simpler code in driver.
> > > > 
> > > > Thanks
> > > 
> > > It's tricky to change the flag polarity because of compatibility
> > > with legacy interfaces. Why is this such a big deal?
> > > 
> > > Let's teach drivers about IN_ORDER, then if devices
> > > are in order it will get enabled by default.
> > 
> > Yeah, make sense.
> > 
> > Besides, I have done some further profiling and debugging
> > both in kernel driver and DPDK vhost. Previously I was mislead
> > by a bug in vhost code. I will send a patch to fix that bug.
> > With that bug fixed, the performance of packed ring in the
> > test between kernel driver and DPDK vhost is better now.
> 
> OK, if we get a performance gain on the virtio side, we can finally
> upstream it. If you see that please re-post ASAP so we can
> put it in the next kernel release.

Got it, I will re-post ASAP.

Thanks!


> 
> -- 
> MST


Re: [virtio-dev] Re: [PATCH net-next v2 0/5] virtio: support packed ring

2018-10-11 Thread Tiwei Bie
On Wed, Oct 10, 2018 at 10:36:26AM -0400, Michael S. Tsirkin wrote:
> On Thu, Sep 13, 2018 at 05:47:29PM +0800, Jason Wang wrote:
> > On 2018年09月13日 16:59, Tiwei Bie wrote:
> > > > If what you say is true then we should take a careful look
> > > > and not supporting these generic things with packed layout.
> > > > Once we do support them it will be too late and we won't
> > > > be able to get performance back.
> > > I think it's a good point that we don't need to support
> > > everything in packed ring (especially these which would
> > > hurt the performance), as the packed ring aims at high
> > > performance. I'm also wondering about the features. Is
> > > there any possibility that we won't support the out of
> > > order processing (at least not by default) in packed ring?
> > > If I didn't miss anything, the need to support out of order
> > > processing in packed ring will make the data structure
> > > inside the driver not cache friendly which is similar to
> > > the case of the descriptor table in the split ring (the
> > > difference is that, it only happens in driver now).
> > 
> > Out of order is not the only user, DMA is another one. We don't have used
> > ring(len), so we need to maintain buffer length somewhere even for in order
> > device.
> 
> For a bunch of systems dma unmap is a nop so we do not really
> need to maintain it. It's a question of an API to detect that
> and optimize for it. I posted a proposed patch for that -
> want to try using that?

Yeah, definitely!

> 
> > But if it's not too late, I second for a OUT_OF_ORDER feature.
> > Starting from in order can have much simpler code in driver.
> > 
> > Thanks
> 
> It's tricky to change the flag polarity because of compatibility
> with legacy interfaces. Why is this such a big deal?
> 
> Let's teach drivers about IN_ORDER, then if devices
> are in order it will get enabled by default.

Yeah, makes sense.

Besides, I have done some further profiling and debugging
both in kernel driver and DPDK vhost. Previously I was misled
by a bug in vhost code. I will send a patch to fix that bug.
With that bug fixed, the performance of packed ring in the
test between kernel driver and DPDK vhost is better now.
I will send a new series soon. Thanks!

> 
> -- 
> MST


Re: [RFC v6 4/5] virtio_ring: add event idx support in packed ring

2018-06-08 Thread Tiwei Bie
On Thu, Jun 07, 2018 at 05:50:32PM +0800, Jason Wang wrote:
> On 2018年06月05日 15:40, Tiwei Bie wrote:
> >   static bool virtqueue_enable_cb_delayed_packed(struct virtqueue *_vq)
> >   {
> > struct vring_virtqueue *vq = to_vvq(_vq);
> > +   u16 bufs, used_idx, wrap_counter;
> > START_USE(vq);
> > /* We optimistically turn back on interrupts, then check if there was
> >  * more to do. */
> > +   /* Depending on the VIRTIO_RING_F_EVENT_IDX feature, we need to
> > +* either clear the flags bit or point the event index at the next
> > +* entry. Always update the event index to keep code simple. */
> > +
> 
> Maybe for packed ring, it's time to treat event index separately to avoid a
> virtio_wmb() for event idx is off.

Okay. I'll do it.

> 
> > +   /* TODO: tune this threshold */
> > +   if (vq->next_avail_idx < vq->last_used_idx)
> > +   bufs = (vq->vring_packed.num + vq->next_avail_idx -
> > +   vq->last_used_idx) * 3 / 4;
> > +   else
> > +   bufs = (vq->next_avail_idx - vq->last_used_idx) * 3 / 4;
> 
> vq->next_avail-idx could be equal to vq->last_usd_idx when the ring is full.
> Though virito-net is the only user now and it can guarantee this won't
> happen. But consider this is a core API, we should make sure it can work for
> any cases.
> 
> It looks to me that bufs is just vq->vring_packed.num - vq->num_free?

I'll try it! Thanks for the suggestion! :)
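
If that works out, the whole branch above could collapse to a single
line, something like this (untested sketch):

	bufs = (vq->vring_packed.num - vq->num_free) * 3 / 4;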

> 
> > +
> > +   wrap_counter = vq->used_wrap_counter;
> > +
> > +   used_idx = vq->last_used_idx + bufs;
> > +   if (used_idx >= vq->vring_packed.num) {
> > +   used_idx -= vq->vring_packed.num;
> > +   wrap_counter ^= 1;
> > +   }
> > +
> > +   vq->vring_packed.driver->off_wrap = cpu_to_virtio16(_vq->vdev,
> > +   used_idx | (wrap_counter << 15));
> > if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
> > -   vq->event_flags_shadow = VRING_EVENT_F_ENABLE;
> > +   /* We need to update event offset and event wrap
> > +* counter first before updating event flags. */
> > +   virtio_wmb(vq->weak_barriers);
> > +   vq->event_flags_shadow = vq->event ? VRING_EVENT_F_DESC :
> > +VRING_EVENT_F_ENABLE;
> > vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
> > vq->event_flags_shadow);
> > -   /* We need to enable interrupts first before re-checking
> > -* for more used buffers. */
> > -   virtio_mb(vq->weak_barriers);
> > }
> > +   /* We need to update event suppression structure first
> > +* before re-checking for more used buffers. */
> > +   virtio_mb(vq->weak_barriers);
> > +
> > if (more_used_packed(vq)) {
> > END_USE(vq);
> > return false;
> 
> I think what we need to to make sure the descriptor used_idx is used?
> Otherwise we may stop and restart qdisc too frequently?

Good catch! Yeah. It's not the best choice to check the
descriptor at last_used_idx. I'll try to fix this!

Best regards,
Tiwei Bie

> 
> Thanks
> 
> > -- 
> 


[PATCH] virtio: update the comments for transport features

2018-06-01 Thread Tiwei Bie
Suggested-by: Michael S. Tsirkin 
Signed-off-by: Tiwei Bie 
---
This patch is generated on top of below patch:
https://lists.oasis-open.org/archives/virtio-dev/201805/msg00212.html

 include/uapi/linux/virtio_config.h | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/virtio_config.h 
b/include/uapi/linux/virtio_config.h
index b7c1f4e7d59e..479affd903e9 100644
--- a/include/uapi/linux/virtio_config.h
+++ b/include/uapi/linux/virtio_config.h
@@ -45,9 +45,12 @@
 /* We've given up on this device. */
 #define VIRTIO_CONFIG_S_FAILED 0x80
 
-/* Some virtio feature bits (currently bits 28 through 32) are reserved for the
- * transport being used (eg. virtio_ring), the rest are per-device feature
- * bits. */
+/*
+ * Some virtio feature bits (currently bits VIRTIO_TRANSPORT_F_START
+ * through VIRTIO_TRANSPORT_F_END) are reserved for the transport being
+ * used (e.g. virtio_ring, virtio_pci etc.), the rest are per-device
+ * feature bits.
+ */
 #define VIRTIO_TRANSPORT_F_START   28
 #define VIRTIO_TRANSPORT_F_END 38
 
-- 
2.17.0
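
Not part of this patch, but as context for why the reserved range matters:
a minimal sketch of how a transport walks these bits and rejects the ones
it does not understand (mirroring the shape of vring_transport_features();
the listed cases are illustrative only):

	static void transport_features_sketch(struct virtio_device *vdev)
	{
		unsigned int i;

		for (i = VIRTIO_TRANSPORT_F_START; i < VIRTIO_TRANSPORT_F_END; i++) {
			switch (i) {
			case VIRTIO_RING_F_INDIRECT_DESC:
			case VIRTIO_RING_F_EVENT_IDX:
			case VIRTIO_F_VERSION_1:
			case VIRTIO_F_IOMMU_PLATFORM:
				break;
			default:
				/* We don't understand this bit. */
				__virtio_clear_bit(vdev, i);
			}
		}
	}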



Re: [RFC v5 2/5] virtio_ring: support creating packed ring

2018-05-28 Thread Tiwei Bie
On Tue, May 29, 2018 at 10:49:11AM +0800, Jason Wang wrote:
> On 2018年05月22日 16:16, Tiwei Bie wrote:
> > This commit introduces the support for creating packed ring.
> > All split ring specific functions are added _split suffix.
> > Some necessary stubs for packed ring are also added.
> > 
> > Signed-off-by: Tiwei Bie 
> > ---
> >   drivers/virtio/virtio_ring.c | 801 +++
> >   include/linux/virtio_ring.h  |   8 +-
> >   2 files changed, 546 insertions(+), 263 deletions(-)
> > 
> > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > index 71458f493cf8..f5ef5f42a7cf 100644
> > --- a/drivers/virtio/virtio_ring.c
> > +++ b/drivers/virtio/virtio_ring.c
> > @@ -61,11 +61,15 @@ struct vring_desc_state {
> > struct vring_desc *indir_desc;  /* Indirect descriptor, if any. */
> >   };
> > +struct vring_desc_state_packed {
> > +   int next;   /* The next desc state. */
> > +};
> > +
> >   struct vring_virtqueue {
> > struct virtqueue vq;
> > -   /* Actual memory layout for this queue */
> > -   struct vring vring;
> > +   /* Is this a packed ring? */
> > +   bool packed;
> > /* Can we use weak barriers? */
> > bool weak_barriers;
> > @@ -87,11 +91,39 @@ struct vring_virtqueue {
> > /* Last used index we've seen. */
> > u16 last_used_idx;
> > -   /* Last written value to avail->flags */
> > -   u16 avail_flags_shadow;
> > +   union {
> > +   /* Available for split ring */
> > +   struct {
> > +   /* Actual memory layout for this queue. */
> > +   struct vring vring;
> > -   /* Last written value to avail->idx in guest byte order */
> > -   u16 avail_idx_shadow;
> > +   /* Last written value to avail->flags */
> > +   u16 avail_flags_shadow;
> > +
> > +   /* Last written value to avail->idx in
> > +* guest byte order. */
> > +   u16 avail_idx_shadow;
> > +   };
> > +
> > +   /* Available for packed ring */
> > +   struct {
> > +   /* Actual memory layout for this queue. */
> > +   struct vring_packed vring_packed;
> > +
> > +   /* Driver ring wrap counter. */
> > +   u8 avail_wrap_counter;
> > +
> > +   /* Device ring wrap counter. */
> > +   u8 used_wrap_counter;
> 
> How about just using a boolean?

I can change it to bool if you want.

> 
[...]
> > -static int vring_mapping_error(const struct vring_virtqueue *vq,
> > -  dma_addr_t addr)
> > -{
> > -   if (!vring_use_dma_api(vq->vq.vdev))
> > -   return 0;
> > -
> > -   return dma_mapping_error(vring_dma_dev(vq), addr);
> > -}
> 
> It looks to me that if you keep vring_mapping_error() behind
> vring_unmap_one_split(), lots of changes would be unnecessary.
> 
[...]
> > +   }
> > +   /* That should have freed everything. */
> > +   BUG_ON(vq->vq.num_free != vq->vring.num);
> > +
> > +   END_USE(vq);
> > +   return NULL;
> > +}
> 
> I think those copy-and-paste hunks could be avoided and the diff should
> only contain the renaming of the functions. If yes, it would be very welcome,
> since otherwise it requires comparing the changes verbatim.

Michael suggested to lay out the code as:

XXX_split

XXX_packed

XXX wrappers

https://lkml.org/lkml/2018/4/13/410

That's why I moved some code.
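
To make the intended layout concrete, a hedged sketch of what the
top-level wrappers look like once the _split/_packed variants exist
(illustrative only; the real wrappers carry the full argument lists):

	static inline bool more_used(const struct vring_virtqueue *vq)
	{
		return vq->packed ? more_used_packed(vq) : more_used_split(vq);
	}

	bool virtqueue_kick_prepare(struct virtqueue *_vq)
	{
		struct vring_virtqueue *vq = to_vvq(_vq);

		return vq->packed ? virtqueue_kick_prepare_packed(_vq) :
				    virtqueue_kick_prepare_split(_vq);
	}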

> 
> > +
> > +/*
> > + * The layout for the packed ring is a continuous chunk of memory
> > + * which looks like this.
> > + *
> > + * struct vring_packed {
> > + * // The actual descriptors (16 bytes each)
> > + * struct vring_packed_desc desc[num];
> > + *
> > + * // Padding to the next align boundary.
> > + * char pad[];
> > + *
> > + * // Driver Event Suppression
> > + * struct vring_packed_desc_event driver;
> > + *
> > + * // Device Event Suppression
> > + * struct vring_packed_desc_event device;
> > + * };
> > + */
> > +static inline void vring_init_packed(struct vring_packed *vr, unsigned int 
> > num,
> > +void *p, unsigned long align)
> > +{
> > +   vr->num = num;
> > +   vr->desc = p;
> > +   vr->driver = (void *)(((uintptr_t)p + sizeof(struct vring_packed_desc)
> > +   * num + align - 1) &
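
The quoted hunk is cut off by the archive; purely as an illustration of
the align-up idiom it is computing (not the verbatim patch text), the
driver/device event structures are placed after the descriptor array,
rounded up to the requested alignment:

	static inline void vring_init_packed_sketch(struct vring_packed *vr,
						    unsigned int num, void *p,
						    unsigned long align)
	{
		vr->num = num;
		vr->desc = p;
		vr->driver = (void *)(((uintptr_t)p +
				       sizeof(struct vring_packed_desc) * num +
				       align - 1) & ~(align - 1));
		vr->device = vr->driver + 1;
	}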


Re: [RFC v5 3/5] virtio_ring: add packed ring support

2018-05-28 Thread Tiwei Bie
On Tue, May 29, 2018 at 11:18:57AM +0800, Jason Wang wrote:
> On 2018年05月22日 16:16, Tiwei Bie wrote:
[...]
> > +static void detach_buf_packed(struct vring_virtqueue *vq,
> > + unsigned int id, void **ctx)
> > +{
> > +   struct vring_desc_state_packed *state;
> > +   struct vring_packed_desc *desc;
> > +   unsigned int i;
> > +   int *next;
> > +
> > +   /* Clear data ptr. */
> > +   vq->desc_state_packed[id].data = NULL;
> > +
> > +   next = 
> > +   for (i = 0; i < vq->desc_state_packed[id].num; i++) {
> > +   state = &vq->desc_state_packed[*next];
> > +   vring_unmap_state_packed(vq, state);
> > +   next = &vq->desc_state_packed[*next].next;
> 
> Had a short discussion with Michael. It looks like the only things we care
> about here are the DMA address and length.

The length tracked by desc_state_packed[] is also necessary
when INDIRECT is used and the buffers are writable.

> 
> So a thought is to avoid using desc_state_packed[] if vring_use_dma_api() is
> false? Probably need another id allocator, or just use desc_state_packed for
> the free id list.

I think it's a good suggestion. Thanks!

I won't track the addr/len/flags for non-indirect descs
when vring_use_dma_api() returns false.

> 
> > +   }
> > +
> > +   vq->vq.num_free += vq->desc_state_packed[id].num;
> > +   *next = vq->free_head;
> 
> Using a pointer seems inelegant and unnecessary here. You can just track
> next as an integer, I think?

It's possible to use an integer, but it will need one more var
to track `prev` (desc_state_packed[prev].next == next in
this case).
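
For comparison, a hedged sketch of the integer-based variant being
discussed, including the extra `prev` variable mentioned above
(hypothetical helper, not the posted patch):

	static void detach_buf_packed_sketch(struct vring_virtqueue *vq,
					     unsigned int id)
	{
		unsigned int i, curr = id, prev = id;

		for (i = 0; i < vq->desc_state_packed[id].num; i++) {
			vring_unmap_state_packed(vq, &vq->desc_state_packed[curr]);
			prev = curr;
			curr = vq->desc_state_packed[curr].next;
		}

		vq->vq.num_free += vq->desc_state_packed[id].num;
		/* Link the tail of the just-freed chain to the old free list. */
		vq->desc_state_packed[prev].next = vq->free_head;
		vq->free_head = id;
	}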

> 
> > +   vq->free_head = id;
> > +
> > +   if (vq->indirect) {
> > +   u32 len;
> > +
> > +   /* Free the indirect table, if any, now that it's unmapped. */
> > +   desc = vq->desc_state_packed[id].indir_desc;
> > +   if (!desc)
> > +   return;
> > +
> > +   len = vq->desc_state_packed[id].len;
> > +   for (i = 0; i < len / sizeof(struct vring_packed_desc); i++)
> > +   vring_unmap_desc_packed(vq, &desc[i]);
> > +
> > +   kfree(desc);
> > +   vq->desc_state_packed[id].indir_desc = NULL;
> > +   } else if (ctx) {
> > +   *ctx = vq->desc_state_packed[id].indir_desc;
> > +   }
> >   }
> >   static inline bool more_used_packed(const struct vring_virtqueue *vq)
> >   {
> > -   return false;
> > +   u16 last_used, flags;
> > +   u8 avail, used;
> > +
> > +   last_used = vq->last_used_idx;
> > +   flags = virtio16_to_cpu(vq->vq.vdev,
> > +   vq->vring_packed.desc[last_used].flags);
> > +   avail = !!(flags & VRING_DESC_F_AVAIL(1));
> > +   used = !!(flags & VRING_DESC_F_USED(1));
> > +
> > +   return avail == used && used == vq->used_wrap_counter;
> 
> The spec does not check avail == used? So there's probably a bug in one of
> the two places.
> 
> In what condition can avail != used but used == vq->used_wrap_counter? A
> buggy device, or a device that sets the two bits in two writes?

Currently, the spec doesn't forbid devices from setting the status
bits as: avail != used but used == vq->used_wrap_counter.
So I think the driver cannot assume this won't happen.

> 
> >   }
[...]
> >   static unsigned virtqueue_enable_cb_prepare_packed(struct virtqueue *_vq)
> >   {
> > -   return 0;
> > +   struct vring_virtqueue *vq = to_vvq(_vq);
> > +
> > +   START_USE(vq);
> > +
> > +   /* We optimistically turn back on interrupts, then check if there was
> > +* more to do. */
> > +
> > +   if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
> > +   virtio_wmb(vq->weak_barriers);
> 
> Missing comments for the barrier.

Will add some comments.

> 
> > +   vq->event_flags_shadow = VRING_EVENT_F_ENABLE;
> > +   vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
> > +   vq->event_flags_shadow);
> > +   }
> > +
> > +   END_USE(vq);
> > +   return vq->last_used_idx;
> >   }
> >   static bool virtqueue_poll_packed(struct virtqueue *_vq, unsigned 
> > last_used_idx)
> >   {
> > -   return false;
> > +   struct vring_virtqueue *vq = to_vvq(_vq);
> > +   u8 avail, used;
> > +   u16 flags;
> > +
> > +   virtio_mb(vq->weak_barriers);
> > +   flags = virtio16_to_cpu(vq->vq.vdev,
> > +   vq->vring_packed.desc[last_used_idx].flags);
> > +   avail = !!(flags & VRING_DESC_F_AVAIL(1));
> > +   used = !!(flags & VRING_DESC_F_USED(1));
> > +
> > +   return avail == used && used == vq->used_wrap_counter;
> >   }
> >   static bool virtqueue_enable_cb_delayed_packed(struct virtqueue *_vq)
> >   {
> > -   return false;
> > +   struct vring_virtqueue *vq = to_vvq(_vq);
> > +
> > +   START_USE(vq);
> > +
> > +   /* We optimistically turn back on interrupts, then check if there was
> > +* more to do. */
> > +
> > +   if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
> > +   virtio_wmb(vq->weak_barriers);
> 
> Need comments for the barrier.

Will add some comments.

Best regards,
Tiwei Bie



Re: [RFC v5 0/5] virtio: support packed ring

2018-05-24 Thread Tiwei Bie
On Fri, May 25, 2018 at 10:31:26AM +0800, Jason Wang wrote:
> On 2018年05月22日 16:16, Tiwei Bie wrote:
> > Hello everyone,
> > 
> > This RFC implements packed ring support in virtio driver.
> > 
> > Some simple functional tests have been done with Jason's
> > packed ring implementation in vhost (RFC v4):
> > 
> > https://lkml.org/lkml/2018/5/16/501
> > 
> > Both of ping and netperf worked as expected w/ EVENT_IDX
> > disabled. Ping worked as expected w/ EVENT_IDX enabled,
> > but netperf didn't (A hack has been added in the driver
> > to make netperf test pass in this case. The hack can be
> > found by `grep -rw XXX` in the code).
> 
> Looks like this is because of a bug in vhost which wrongly tracks signalled_used
> and may miss an interrupt. After fixing that, both sides work like a charm.
> 
> I'm testing vIOMMU and zerocopy, and will post a new version shortly. (Hope
> it would be the last RFC version).

Great to know that. Thanks a lot! :)

Best regards,
Tiwei Bie



[RFC v5 1/5] virtio: add packed ring definitions

2018-05-22 Thread Tiwei Bie
Signed-off-by: Tiwei Bie <tiwei@intel.com>
---
 include/uapi/linux/virtio_config.h |  5 -
 include/uapi/linux/virtio_ring.h   | 36 ++
 2 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/virtio_config.h 
b/include/uapi/linux/virtio_config.h
index 308e2096291f..932a6ecc8e46 100644
--- a/include/uapi/linux/virtio_config.h
+++ b/include/uapi/linux/virtio_config.h
@@ -49,7 +49,7 @@
  * transport being used (eg. virtio_ring), the rest are per-device feature
  * bits. */
 #define VIRTIO_TRANSPORT_F_START   28
-#define VIRTIO_TRANSPORT_F_END 34
+#define VIRTIO_TRANSPORT_F_END 35
 
 #ifndef VIRTIO_CONFIG_NO_LEGACY
 /* Do we get callbacks when the ring is completely used, even if we've
@@ -71,4 +71,7 @@
  * this is for compatibility with legacy systems.
  */
 #define VIRTIO_F_IOMMU_PLATFORM33
+
+/* This feature indicates support for the packed virtqueue layout. */
+#define VIRTIO_F_RING_PACKED   34
 #endif /* _UAPI_LINUX_VIRTIO_CONFIG_H */
diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h
index 6d5d5faa989b..3932cb80c347 100644
--- a/include/uapi/linux/virtio_ring.h
+++ b/include/uapi/linux/virtio_ring.h
@@ -44,6 +44,9 @@
 /* This means the buffer contains a list of buffer descriptors. */
 #define VRING_DESC_F_INDIRECT  4
 
+#define VRING_DESC_F_AVAIL(b)  ((b) << 7)
+#define VRING_DESC_F_USED(b)   ((b) << 15)
+
 /* The Host uses this in used->flags to advise the Guest: don't kick me when
  * you add a buffer.  It's unreliable, so it's simply an optimization.  Guest
  * will still kick if it's out of buffers. */
@@ -53,6 +56,10 @@
  * optimization.  */
 #define VRING_AVAIL_F_NO_INTERRUPT 1
 
+#define VRING_EVENT_F_ENABLE   0x0
+#define VRING_EVENT_F_DISABLE  0x1
+#define VRING_EVENT_F_DESC 0x2
+
 /* We support indirect buffer descriptors */
 #define VIRTIO_RING_F_INDIRECT_DESC28
 
@@ -171,4 +178,33 @@ static inline int vring_need_event(__u16 event_idx, __u16 
new_idx, __u16 old)
return (__u16)(new_idx - event_idx - 1) < (__u16)(new_idx - old);
 }
 
+struct vring_packed_desc_event {
+   /* __virtio16 off  : 15; // Descriptor Event Offset
+* __virtio16 wrap : 1;  // Descriptor Event Wrap Counter */
+   __virtio16 off_wrap;
+   /* __virtio16 flags : 2; // Descriptor Event Flags */
+   __virtio16 flags;
+};
+
+struct vring_packed_desc {
+   /* Buffer Address. */
+   __virtio64 addr;
+   /* Buffer Length. */
+   __virtio32 len;
+   /* Buffer ID. */
+   __virtio16 id;
+   /* The flags depending on descriptor type. */
+   __virtio16 flags;
+};
+
+struct vring_packed {
+   unsigned int num;
+
+   struct vring_packed_desc *desc;
+
+   struct vring_packed_desc_event *driver;
+
+   struct vring_packed_desc_event *device;
+};
+
 #endif /* _UAPI_LINUX_VIRTIO_RING_H */
-- 
2.17.0
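
As a usage illustration of the new definitions (a hedged sketch, not part
of the series): packing and unpacking the off_wrap field of struct
vring_packed_desc_event, where bit 15 carries the wrap counter and the
low 15 bits carry the descriptor offset:

	static __virtio16 event_off_wrap_pack(struct virtio_device *vdev,
					      u16 off, bool wrap)
	{
		return cpu_to_virtio16(vdev, off | ((u16)wrap << 15));
	}

	static void event_off_wrap_unpack(struct virtio_device *vdev,
					  __virtio16 off_wrap,
					  u16 *off, bool *wrap)
	{
		u16 v = virtio16_to_cpu(vdev, off_wrap);

		*off = v & 0x7fff;
		*wrap = v >> 15;
	}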



[RFC v5 3/5] virtio_ring: add packed ring support

2018-05-22 Thread Tiwei Bie
This commit introduces the support (without EVENT_IDX) for
packed ring.

Signed-off-by: Tiwei Bie <tiwei@intel.com>
---
 drivers/virtio/virtio_ring.c | 474 ++-
 1 file changed, 468 insertions(+), 6 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index f5ef5f42a7cf..eb9fd5207a68 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -62,6 +62,12 @@ struct vring_desc_state {
 };
 
 struct vring_desc_state_packed {
+   void *data; /* Data for callback. */
+   struct vring_packed_desc *indir_desc; /* Indirect descriptor, if any. */
+   int num;/* Descriptor list length. */
+   dma_addr_t addr;/* Buffer DMA addr. */
+   u32 len;/* Buffer length. */
+   u16 flags;  /* Descriptor flags. */
int next;   /* The next desc state. */
 };
 
@@ -758,6 +764,72 @@ static inline unsigned vring_size_packed(unsigned int num, 
unsigned long align)
& ~(align - 1)) + sizeof(struct vring_packed_desc_event) * 2;
 }
 
+static void vring_unmap_state_packed(const struct vring_virtqueue *vq,
+struct vring_desc_state_packed *state)
+{
+   u16 flags;
+
+   if (!vring_use_dma_api(vq->vq.vdev))
+   return;
+
+   flags = state->flags;
+
+   if (flags & VRING_DESC_F_INDIRECT) {
+   dma_unmap_single(vring_dma_dev(vq),
+state->addr, state->len,
+(flags & VRING_DESC_F_WRITE) ?
+DMA_FROM_DEVICE : DMA_TO_DEVICE);
+   } else {
+   dma_unmap_page(vring_dma_dev(vq),
+  state->addr, state->len,
+  (flags & VRING_DESC_F_WRITE) ?
+  DMA_FROM_DEVICE : DMA_TO_DEVICE);
+   }
+}
+
+static void vring_unmap_desc_packed(const struct vring_virtqueue *vq,
+  struct vring_packed_desc *desc)
+{
+   u16 flags;
+
+   if (!vring_use_dma_api(vq->vq.vdev))
+   return;
+
+   flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
+
+   if (flags & VRING_DESC_F_INDIRECT) {
+   dma_unmap_single(vring_dma_dev(vq),
+virtio64_to_cpu(vq->vq.vdev, desc->addr),
+virtio32_to_cpu(vq->vq.vdev, desc->len),
+(flags & VRING_DESC_F_WRITE) ?
+DMA_FROM_DEVICE : DMA_TO_DEVICE);
+   } else {
+   dma_unmap_page(vring_dma_dev(vq),
+  virtio64_to_cpu(vq->vq.vdev, desc->addr),
+  virtio32_to_cpu(vq->vq.vdev, desc->len),
+  (flags & VRING_DESC_F_WRITE) ?
+  DMA_FROM_DEVICE : DMA_TO_DEVICE);
+   }
+}
+
+static struct vring_packed_desc *alloc_indirect_packed(struct virtqueue *_vq,
+  unsigned int total_sg,
+  gfp_t gfp)
+{
+   struct vring_packed_desc *desc;
+
+   /*
+* We require lowmem mappings for the descriptors because
+* otherwise virt_to_phys will give us bogus addresses in the
+* virtqueue.
+*/
+   gfp &= ~__GFP_HIGHMEM;
+
+   desc = kmalloc(total_sg * sizeof(struct vring_packed_desc), gfp);
+
+   return desc;
+}
+
 static inline int virtqueue_add_packed(struct virtqueue *_vq,
   struct scatterlist *sgs[],
   unsigned int total_sg,
@@ -767,47 +839,437 @@ static inline int virtqueue_add_packed(struct virtqueue 
*_vq,
   void *ctx,
   gfp_t gfp)
 {
+   struct vring_virtqueue *vq = to_vvq(_vq);
+   struct vring_packed_desc *desc;
+   struct scatterlist *sg;
+   unsigned int i, n, descs_used, uninitialized_var(prev), err_idx;
+   __virtio16 uninitialized_var(head_flags), flags;
+   u16 head, avail_wrap_counter, id, cur;
+   bool indirect;
+
+   START_USE(vq);
+
+   BUG_ON(data == NULL);
+   BUG_ON(ctx && vq->indirect);
+
+   if (unlikely(vq->broken)) {
+   END_USE(vq);
+   return -EIO;
+   }
+
+#ifdef DEBUG
+   {
+   ktime_t now = ktime_get();
+
+   /* No kick or get, with .1 second between?  Warn. */
+   if (vq->last_add_time_valid)
+   WARN_ON(ktime_to_ms(ktime_sub(now, vq->last_add_time))
+   > 100);
+   vq->last_add_time = now;
+  


[RFC v5 2/5] virtio_ring: support creating packed ring

2018-05-22 Thread Tiwei Bie
This commit introduces the support for creating packed ring.
All split ring specific functions are added _split suffix.
Some necessary stubs for packed ring are also added.

Signed-off-by: Tiwei Bie <tiwei@intel.com>
---
 drivers/virtio/virtio_ring.c | 801 +++
 include/linux/virtio_ring.h  |   8 +-
 2 files changed, 546 insertions(+), 263 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 71458f493cf8..f5ef5f42a7cf 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -61,11 +61,15 @@ struct vring_desc_state {
struct vring_desc *indir_desc;  /* Indirect descriptor, if any. */
 };
 
+struct vring_desc_state_packed {
+   int next;   /* The next desc state. */
+};
+
 struct vring_virtqueue {
struct virtqueue vq;
 
-   /* Actual memory layout for this queue */
-   struct vring vring;
+   /* Is this a packed ring? */
+   bool packed;
 
/* Can we use weak barriers? */
bool weak_barriers;
@@ -87,11 +91,39 @@ struct vring_virtqueue {
/* Last used index we've seen. */
u16 last_used_idx;
 
-   /* Last written value to avail->flags */
-   u16 avail_flags_shadow;
+   union {
+   /* Available for split ring */
+   struct {
+   /* Actual memory layout for this queue. */
+   struct vring vring;
 
-   /* Last written value to avail->idx in guest byte order */
-   u16 avail_idx_shadow;
+   /* Last written value to avail->flags */
+   u16 avail_flags_shadow;
+
+   /* Last written value to avail->idx in
+* guest byte order. */
+   u16 avail_idx_shadow;
+   };
+
+   /* Available for packed ring */
+   struct {
+   /* Actual memory layout for this queue. */
+   struct vring_packed vring_packed;
+
+   /* Driver ring wrap counter. */
+   u8 avail_wrap_counter;
+
+   /* Device ring wrap counter. */
+   u8 used_wrap_counter;
+
+   /* Index of the next avail descriptor. */
+   u16 next_avail_idx;
+
+   /* Last written value to driver->flags in
+* guest byte order. */
+   u16 event_flags_shadow;
+   };
+   };
 
/* How to notify other side. FIXME: commonalize hcalls! */
bool (*notify)(struct virtqueue *vq);
@@ -111,11 +143,24 @@ struct vring_virtqueue {
 #endif
 
/* Per-descriptor state. */
-   struct vring_desc_state desc_state[];
+   union {
+   struct vring_desc_state desc_state[1];
+   struct vring_desc_state_packed desc_state_packed[1];
+   };
 };
 
 #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
 
+static inline bool virtqueue_use_indirect(struct virtqueue *_vq,
+ unsigned int total_sg)
+{
+   struct vring_virtqueue *vq = to_vvq(_vq);
+
+   /* If the host supports indirect descriptor tables, and we have multiple
+* buffers, then go indirect. FIXME: tune this threshold */
+   return (vq->indirect && total_sg > 1 && vq->vq.num_free);
+}
+
 /*
  * Modern virtio devices have feature bits to specify whether they need a
  * quirk and bypass the IOMMU. If not there, just use the DMA API.
@@ -201,8 +246,17 @@ static dma_addr_t vring_map_single(const struct 
vring_virtqueue *vq,
  cpu_addr, size, direction);
 }
 
-static void vring_unmap_one(const struct vring_virtqueue *vq,
-   struct vring_desc *desc)
+static int vring_mapping_error(const struct vring_virtqueue *vq,
+  dma_addr_t addr)
+{
+   if (!vring_use_dma_api(vq->vq.vdev))
+   return 0;
+
+   return dma_mapping_error(vring_dma_dev(vq), addr);
+}
+
+static void vring_unmap_one_split(const struct vring_virtqueue *vq,
+ struct vring_desc *desc)
 {
u16 flags;
 
@@ -226,17 +280,9 @@ static void vring_unmap_one(const struct vring_virtqueue 
*vq,
}
 }
 
-static int vring_mapping_error(const struct vring_virtqueue *vq,
-  dma_addr_t addr)
-{
-   if (!vring_use_dma_api(vq->vq.vdev))
-   return 0;
-
-   return dma_mapping_error(vring_dma_dev(vq), addr);
-}
-
-static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
-unsigned int total_sg, gfp_t gfp)
+static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq,
+  unsigned int total_sg,
+  

[RFC v5 4/5] virtio_ring: add event idx support in packed ring

2018-05-22 Thread Tiwei Bie
This commit introduces the EVENT_IDX support in packed ring.

Signed-off-by: Tiwei Bie <tiwei@intel.com>
---
 drivers/virtio/virtio_ring.c | 67 ++--
 1 file changed, 64 insertions(+), 3 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index eb9fd5207a68..1ee52a89cb04 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -1043,7 +1043,7 @@ static inline int virtqueue_add_packed(struct virtqueue 
*_vq,
 static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
 {
struct vring_virtqueue *vq = to_vvq(_vq);
-   u16 flags;
+   u16 new, old, off_wrap, flags, wrap_counter, event_idx;
bool needs_kick;
u32 snapshot;
 
@@ -1052,9 +1052,19 @@ static bool virtqueue_kick_prepare_packed(struct 
virtqueue *_vq)
 * suppressions. */
virtio_mb(vq->weak_barriers);
 
+   old = vq->next_avail_idx - vq->num_added;
+   new = vq->next_avail_idx;
+   vq->num_added = 0;
+
snapshot = *(u32 *)vq->vring_packed.device;
+   off_wrap = virtio16_to_cpu(_vq->vdev, (__virtio16)(snapshot & 0xffff));
flags = virtio16_to_cpu(_vq->vdev, (__virtio16)(snapshot >> 16)) & 0x3;
 
+   wrap_counter = off_wrap >> 15;
+   event_idx = off_wrap & ~(1<<15);
+   if (wrap_counter != vq->avail_wrap_counter)
+   event_idx -= vq->vring_packed.num;
+
 #ifdef DEBUG
if (vq->last_add_time_valid) {
WARN_ON(ktime_to_ms(ktime_sub(ktime_get(),
@@ -1063,7 +1073,10 @@ static bool virtqueue_kick_prepare_packed(struct 
virtqueue *_vq)
vq->last_add_time_valid = false;
 #endif
 
-   needs_kick = (flags != VRING_EVENT_F_DISABLE);
+   if (flags == VRING_EVENT_F_DESC)
+   needs_kick = vring_need_event(event_idx, new, old);
+   else
+   needs_kick = (flags != VRING_EVENT_F_DISABLE);
END_USE(vq);
return needs_kick;
 }
@@ -1170,6 +1183,15 @@ static void *virtqueue_get_buf_ctx_packed(struct 
virtqueue *_vq,
ret = vq->desc_state_packed[id].data;
detach_buf_packed(vq, id, ctx);
 
+   /* If we expect an interrupt for the next entry, tell host
+* by writing event index and flush out the write before
+* the read in the next get_buf call. */
+   if (vq->event_flags_shadow == VRING_EVENT_F_DESC)
+   virtio_store_mb(vq->weak_barriers,
+   &vq->vring_packed.driver->off_wrap,
+   cpu_to_virtio16(_vq->vdev, vq->last_used_idx |
+   ((u16)vq->used_wrap_counter << 15)));
+
 #ifdef DEBUG
vq->last_add_time_valid = false;
 #endif
@@ -1197,10 +1219,26 @@ static unsigned 
virtqueue_enable_cb_prepare_packed(struct virtqueue *_vq)
 
/* We optimistically turn back on interrupts, then check if there was
 * more to do. */
+   /* Depending on the VIRTIO_RING_F_EVENT_IDX feature, we need to
+* either clear the flags bit or point the event index at the next
+* entry. Always update the event index to keep code simple. */
+
+   vq->vring_packed.driver->off_wrap = cpu_to_virtio16(_vq->vdev,
+   vq->last_used_idx |
+   ((u16)vq->used_wrap_counter << 15));
 
if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
virtio_wmb(vq->weak_barriers);
+   // XXX XXX XXX
+   // Setting VRING_EVENT_F_DESC in this function
+   // will break netperf test for now.
+   // Will need to look into this.
+#if 0
+   vq->event_flags_shadow = vq->event ? VRING_EVENT_F_DESC :
+VRING_EVENT_F_ENABLE;
+#else
vq->event_flags_shadow = VRING_EVENT_F_ENABLE;
+#endif
vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
vq->event_flags_shadow);
}
@@ -1227,15 +1265,38 @@ static bool virtqueue_poll_packed(struct virtqueue 
*_vq, unsigned last_used_idx)
 static bool virtqueue_enable_cb_delayed_packed(struct virtqueue *_vq)
 {
struct vring_virtqueue *vq = to_vvq(_vq);
+   u16 bufs, used_idx, wrap_counter;
 
START_USE(vq);
 
/* We optimistically turn back on interrupts, then check if there was
 * more to do. */
+   /* Depending on the VIRTIO_RING_F_EVENT_IDX feature, we need to
+* either clear the flags bit or point the event index at the next
+* entry. Always update the event index to keep code simple. */
+
+   /* TODO: tune this threshold */
+   if (vq->next_avail_idx < vq->last_used_idx)
+   bufs = (vq->vring_packed.num + vq->next_avail_idx -
+
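
To make the wrap adjustment in virtqueue_kick_prepare_packed() above
concrete, a small worked example (numbers are invented; all arithmetic
is modulo 2^16 inside vring_need_event()):

	/*
	 * num = 256. The driver just wrapped: avail_wrap_counter flipped
	 * and next_avail_idx restarted, so new = 5 and
	 * old = new - num_added = 5 - 11 = 0xfffa (11 buffers were added
	 * across the wrap).
	 *
	 * The device's off_wrap still carries the previous wrap bit and
	 * offset 252, i.e. "kick me once slot 252 of the previous pass
	 * has been made available". The wrap bits differ, so
	 * event_idx = 252 - 256 = 0xfffc.
	 *
	 * vring_need_event(0xfffc, 5, 0xfffa):
	 *	(u16)(5 - 0xfffc - 1) = 8  <  (u16)(5 - 0xfffa) = 11
	 * so a kick is needed -- slot 252 was indeed crossed between old
	 * and new.
	 */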


[RFC v5 5/5] virtio_ring: enable packed ring

2018-05-22 Thread Tiwei Bie
Signed-off-by: Tiwei Bie <tiwei@intel.com>
---
 drivers/virtio/virtio_ring.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 1ee52a89cb04..355dfef4f1eb 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -1956,6 +1956,8 @@ void vring_transport_features(struct virtio_device *vdev)
break;
case VIRTIO_F_IOMMU_PLATFORM:
break;
+   case VIRTIO_F_RING_PACKED:
+   break;
default:
/* We don't understand this bit. */
__virtio_clear_bit(vdev, i);
-- 
2.17.0



[RFC v5 0/5] virtio: support packed ring

2018-05-22 Thread Tiwei Bie
Hello everyone,

This RFC implements packed ring support in virtio driver.

Some simple functional tests have been done with Jason's
packed ring implementation in vhost (RFC v4):

https://lkml.org/lkml/2018/5/16/501

Both of ping and netperf worked as expected w/ EVENT_IDX
disabled. Ping worked as expected w/ EVENT_IDX enabled,
but netperf didn't (A hack has been added in the driver
to make netperf test pass in this case. The hack can be
found by `grep -rw XXX` in the code).

TODO:
- Refinements (for code and commit log);
- More tests;
- Bug fixes;

RFC v4 -> RFC v5:
- Save DMA addr, etc in desc state (Jason);
- Track used wrap counter;

RFC v3 -> RFC v4:
- Make ID allocation support out-of-order (Jason);
- Various fixes for EVENT_IDX support;

RFC v2 -> RFC v3:
- Split into small patches (Jason);
- Add helper virtqueue_use_indirect() (Jason);
- Just set id for the last descriptor of a list (Jason);
- Calculate the prev in virtqueue_add_packed() (Jason);
- Fix/improve desc suppression code (Jason/MST);
- Refine the code layout for XXX_split/packed and wrappers (MST);
- Fix the comments and API in uapi (MST);
- Remove the BUG_ON() for indirect (Jason);
- Some other refinements and bug fixes;

RFC v1 -> RFC v2:
- Add indirect descriptor support - compile test only;
- Add event suppression support - compile test only;
- Move vring_packed_init() out of uapi (Jason, MST);
- Merge two loops into one in virtqueue_add_packed() (Jason);
- Split vring_unmap_one() for packed ring and split ring (Jason);
- Avoid using '%' operator (Jason);
- Rename free_head -> next_avail_idx (Jason);
- Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
- Some other refinements and bug fixes;

Thanks!

Tiwei Bie (5):
  virtio: add packed ring definitions
  virtio_ring: support creating packed ring
  virtio_ring: add packed ring support
  virtio_ring: add event idx support in packed ring
  virtio_ring: enable packed ring

 drivers/virtio/virtio_ring.c   | 1354 ++--
 include/linux/virtio_ring.h|8 +-
 include/uapi/linux/virtio_config.h |5 +-
 include/uapi/linux/virtio_ring.h   |   36 +
 4 files changed, 1125 insertions(+), 278 deletions(-)

-- 
2.17.0



Re: [RFC v4 3/5] virtio_ring: add packed ring support

2018-05-20 Thread Tiwei Bie
On Mon, May 21, 2018 at 10:30:51AM +0800, Jason Wang wrote:
> On 2018年05月19日 10:29, Tiwei Bie wrote:
> > > I don't hope so.
> > > 
> > > > I agreed driver should track the DMA addrs or some
> > > > other necessary things from the very beginning. And
> > > > I also repeated the spec to emphasize that it does
> > > > make sense. And I'd like to do that.
> > > > 
> > > > What I was saying is that, to support OOO, we may
> > > > need to manage these context (which saves DMA addrs
> > > > etc) via a list which is similar to the desc list
> > > > maintained via `next` in split ring instead of an
> > > > array whose elements always can be indexed directly.
> > > My point is these context is a must (not only for OOO).
> > Yeah, and I have the exactly same point after you
> > pointed that I shouldn't get the addrs from descs.
> > I do think it makes sense. I'll do it in the next
> > version. I don't have any doubt about it. All my
> > questions are about the OOO, instead of whether we
> > should save context or not. It just seems that you
> > thought I don't want to do it, and were trying to
> > convince me that I should do it.
> 
> Right, but looks like I was wrong :)
> 
> > 
> > > > The desc ring in split ring is an array, but its
> > > > free entries are managed as list via next. I was
> > > > just wondering, do we want to manage such a list
> > > > because of OOO. It's just a very simple question
> > > > that I want to hear your opinion... (It doesn't
> > > > means anything, e.g. It doesn't mean I don't want
> > > > to support OOO. It's just a simple question...)
> > > So the question is yes. But I admit I don't have better idea other than 
> > > what
> > > you propose here (something like split ring which is a little bit sad).
> > > Maybe Michael had.
> > Yeah, that's why I asked this question. It will
> > make the packed ring a bit similar to split ring
> > at least in the driver part. So I want to draw
> > your attention on this to make sure that we're
> > on the same page.
> 
> Yes. I think we are.

Cool. Glad to hear that! Thanks! :)

Best regards,
Tiwei Bie

> 
> Thanks
> 
> > Best regards,
> > Tiwei Bie
> > 
> 


Re: [RFC v4 3/5] virtio_ring: add packed ring support

2018-05-18 Thread Tiwei Bie
On Sat, May 19, 2018 at 09:12:30AM +0800, Jason Wang wrote:
> On 2018年05月18日 22:33, Tiwei Bie wrote:
> > On Fri, May 18, 2018 at 09:17:05PM +0800, Jason Wang wrote:
> > > On 2018年05月18日 19:29, Tiwei Bie wrote:
> > > > On Thu, May 17, 2018 at 08:01:52PM +0800, Jason Wang wrote:
> > > > > On 2018年05月16日 22:33, Tiwei Bie wrote:
> > > > > > On Wed, May 16, 2018 at 10:05:44PM +0800, Jason Wang wrote:
> > > > > > > On 2018年05月16日 21:45, Tiwei Bie wrote:
> > > > > > > > On Wed, May 16, 2018 at 08:51:43PM +0800, Jason Wang wrote:
> > > > > > > > > On 2018年05月16日 20:39, Tiwei Bie wrote:
> > > > > > > > > > On Wed, May 16, 2018 at 07:50:16PM +0800, Jason Wang wrote:
> > > > > > > > > > > On 2018年05月16日 16:37, Tiwei Bie wrote:
> > > > > > [...]
> > > > > > > > > > > > +static void detach_buf_packed(struct vring_virtqueue 
> > > > > > > > > > > > *vq, unsigned int head,
> > > > > > > > > > > > + unsigned int id, void 
> > > > > > > > > > > > **ctx)
> > > > > > > > > > > > +{
> > > > > > > > > > > > +   struct vring_packed_desc *desc;
> > > > > > > > > > > > +   unsigned int i, j;
> > > > > > > > > > > > +
> > > > > > > > > > > > +   /* Clear data ptr. */
> > > > > > > > > > > > +   vq->desc_state[id].data = NULL;
> > > > > > > > > > > > +
> > > > > > > > > > > > +   i = head;
> > > > > > > > > > > > +
> > > > > > > > > > > > +   for (j = 0; j < vq->desc_state[id].num; j++) {
> > > > > > > > > > > > +   desc = &vq->vring_packed.desc[i];
> > > > > > > > > > > > +   vring_unmap_one_packed(vq, desc);
> > > > > > > > > > > As mentioned in previous discussion, this probably won't 
> > > > > > > > > > > work for the case
> > > > > > > > > > > of out of order completion since it depends on the 
> > > > > > > > > > > information in the
> > > > > > > > > > > descriptor ring. We probably need to extend ctx to record 
> > > > > > > > > > > such information.
> > > > > > > > > > Above code doesn't depend on the information in the 
> > > > > > > > > > descriptor
> > > > > > > > > > ring. The vq->desc_state[] is the extended ctx.
> > > > > > > > > > 
> > > > > > > > > > Best regards,
> > > > > > > > > > Tiwei Bie
> > > > > > > > > Yes, but desc is a pointer to descriptor ring I think so
> > > > > > > > > vring_unmap_one_packed() still depends on the content of 
> > > > > > > > > descriptor ring?
> > > > > > > > > 
> > > > > > > > I got your point now. I think it makes sense to reserve
> > > > > > > > the bits of the addr field. Driver shouldn't try to get
> > > > > > > > addrs from the descriptors when cleanup the descriptors
> > > > > > > > no matter whether we support out-of-order or not.
> > > > > > > Maybe I was wrong, but I remember spec mentioned something like 
> > > > > > > this.
> > > > > > You're right. Spec mentioned this. I was just repeating
> > > > > > the spec to emphasize that it does make sense. :)
> > > > > > 
> > > > > > > > But combining it with the out-of-order support, it will
> > > > > > > > mean that the driver still needs to maintain a desc/ctx
> > > > > > > > list that is very similar to the desc ring in the split
> > > > > > > > ring. I'm not quite sure whether it's something we want.
> > > > > > > > If it is true, I'll do it. So do you think we also want
> > > > > > > > to maintain such a desc/ctx list for packed ring?

Re: [RFC v4 3/5] virtio_ring: add packed ring support

2018-05-18 Thread Tiwei Bie
On Fri, May 18, 2018 at 09:17:05PM +0800, Jason Wang wrote:
> On 2018年05月18日 19:29, Tiwei Bie wrote:
> > On Thu, May 17, 2018 at 08:01:52PM +0800, Jason Wang wrote:
> > > On 2018年05月16日 22:33, Tiwei Bie wrote:
> > > > On Wed, May 16, 2018 at 10:05:44PM +0800, Jason Wang wrote:
> > > > > On 2018年05月16日 21:45, Tiwei Bie wrote:
> > > > > > On Wed, May 16, 2018 at 08:51:43PM +0800, Jason Wang wrote:
> > > > > > > On 2018年05月16日 20:39, Tiwei Bie wrote:
> > > > > > > > On Wed, May 16, 2018 at 07:50:16PM +0800, Jason Wang wrote:
> > > > > > > > > On 2018年05月16日 16:37, Tiwei Bie wrote:
> > > > [...]
> > > > > > > > > > +static void detach_buf_packed(struct vring_virtqueue *vq, 
> > > > > > > > > > unsigned int head,
> > > > > > > > > > + unsigned int id, void **ctx)
> > > > > > > > > > +{
> > > > > > > > > > +   struct vring_packed_desc *desc;
> > > > > > > > > > +   unsigned int i, j;
> > > > > > > > > > +
> > > > > > > > > > +   /* Clear data ptr. */
> > > > > > > > > > +   vq->desc_state[id].data = NULL;
> > > > > > > > > > +
> > > > > > > > > > +   i = head;
> > > > > > > > > > +
> > > > > > > > > > +   for (j = 0; j < vq->desc_state[id].num; j++) {
> > > > > > > > > > +   desc = &vq->vring_packed.desc[i];
> > > > > > > > > > +   vring_unmap_one_packed(vq, desc);
> > > > > > > > > As mentioned in previous discussion, this probably won't work 
> > > > > > > > > for the case
> > > > > > > > > of out of order completion since it depends on the 
> > > > > > > > > information in the
> > > > > > > > > descriptor ring. We probably need to extend ctx to record 
> > > > > > > > > such information.
> > > > > > > > Above code doesn't depend on the information in the descriptor
> > > > > > > > ring. The vq->desc_state[] is the extended ctx.
> > > > > > > > 
> > > > > > > > Best regards,
> > > > > > > > Tiwei Bie
> > > > > > > Yes, but desc is a pointer to descriptor ring I think so
> > > > > > > vring_unmap_one_packed() still depends on the content of 
> > > > > > > descriptor ring?
> > > > > > > 
> > > > > > I got your point now. I think it makes sense to reserve
> > > > > > the bits of the addr field. Driver shouldn't try to get
> > > > > > addrs from the descriptors when cleanup the descriptors
> > > > > > no matter whether we support out-of-order or not.
> > > > > Maybe I was wrong, but I remember spec mentioned something like this.
> > > > You're right. Spec mentioned this. I was just repeating
> > > > the spec to emphasize that it does make sense. :)
> > > > 
> > > > > > But combining it with the out-of-order support, it will
> > > > > > mean that the driver still needs to maintain a desc/ctx
> > > > > > list that is very similar to the desc ring in the split
> > > > > > ring. I'm not quite sure whether it's something we want.
> > > > > > If it is true, I'll do it. So do you think we also want
> > > > > > to maintain such a desc/ctx list for packed ring?
> > > > > To make it work for OOO backends I think we need something like this
> > > > > (hardware NIC drivers are usually have something like this).
> > > > Which hardware NIC drivers have this?
> > > It's quite common I think, e.g driver track e.g dma addr and page frag
> > > somewhere. e.g the ring->rx_info in mlx4 driver.
> > It seems that I had a misunderstanding on your
> > previous comments. I know it's quite common for
> > drivers to track e.g. DMA addrs somewhere (and
> > I think one reason behind this is that they want
> > to reuse the bits of addr field).
> 
> Yes, we may want this for virtio-net as well in the future.
> 
> >   But tracking
> > addrs somewhere doesn't means supporting OOO.
> > I thought you

Re: [RFC v4 3/5] virtio_ring: add packed ring support

2018-05-18 Thread Tiwei Bie
On Thu, May 17, 2018 at 08:01:52PM +0800, Jason Wang wrote:
> On 2018年05月16日 22:33, Tiwei Bie wrote:
> > On Wed, May 16, 2018 at 10:05:44PM +0800, Jason Wang wrote:
> > > On 2018年05月16日 21:45, Tiwei Bie wrote:
> > > > On Wed, May 16, 2018 at 08:51:43PM +0800, Jason Wang wrote:
> > > > > On 2018年05月16日 20:39, Tiwei Bie wrote:
> > > > > > On Wed, May 16, 2018 at 07:50:16PM +0800, Jason Wang wrote:
> > > > > > > On 2018年05月16日 16:37, Tiwei Bie wrote:
> > [...]
> > > > > > > > +static void detach_buf_packed(struct vring_virtqueue *vq, 
> > > > > > > > unsigned int head,
> > > > > > > > + unsigned int id, void **ctx)
> > > > > > > > +{
> > > > > > > > +   struct vring_packed_desc *desc;
> > > > > > > > +   unsigned int i, j;
> > > > > > > > +
> > > > > > > > +   /* Clear data ptr. */
> > > > > > > > +   vq->desc_state[id].data = NULL;
> > > > > > > > +
> > > > > > > > +   i = head;
> > > > > > > > +
> > > > > > > > +   for (j = 0; j < vq->desc_state[id].num; j++) {
> > > > > > > > +   desc = &vq->vring_packed.desc[i];
> > > > > > > > +   vring_unmap_one_packed(vq, desc);
> > > > > > > As mentioned in previous discussion, this probably won't work for 
> > > > > > > the case
> > > > > > > of out of order completion since it depends on the information in 
> > > > > > > the
> > > > > > > descriptor ring. We probably need to extend ctx to record such 
> > > > > > > information.
> > > > > > Above code doesn't depend on the information in the descriptor
> > > > > > ring. The vq->desc_state[] is the extended ctx.
> > > > > > 
> > > > > > Best regards,
> > > > > > Tiwei Bie
> > > > > Yes, but desc is a pointer to descriptor ring I think so
> > > > > vring_unmap_one_packed() still depends on the content of descriptor 
> > > > > ring?
> > > > > 
> > > > I got your point now. I think it makes sense to reserve
> > > > the bits of the addr field. Driver shouldn't try to get
> > > > addrs from the descriptors when cleanup the descriptors
> > > > no matter whether we support out-of-order or not.
> > > Maybe I was wrong, but I remember spec mentioned something like this.
> > You're right. Spec mentioned this. I was just repeating
> > the spec to emphasize that it does make sense. :)
> > 
> > > > But combining it with the out-of-order support, it will
> > > > mean that the driver still needs to maintain a desc/ctx
> > > > list that is very similar to the desc ring in the split
> > > > ring. I'm not quite sure whether it's something we want.
> > > > If it is true, I'll do it. So do you think we also want
> > > > to maintain such a desc/ctx list for packed ring?
> > > To make it work for OOO backends I think we need something like this
> > > (hardware NIC drivers are usually have something like this).
> > Which hardware NIC drivers have this?
> 
> It's quite common I think, e.g driver track e.g dma addr and page frag
> somewhere. e.g the ring->rx_info in mlx4 driver.

It seems that I had a misunderstanding of your
previous comments. I know it's quite common for
drivers to track e.g. DMA addrs somewhere (and
I think one reason behind this is that they want
to reuse the bits of the addr field). But tracking
addrs somewhere doesn't mean supporting OOO.
I thought you were saying it's quite common for
hardware NIC drivers to support OOO (i.e. NICs
will return the descriptors OOO):

I'm not familiar with mlx4, maybe I'm wrong.
I just had a quick glance. And I found below
comments in mlx4_en_process_rx_cq():

```
/* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx
 * descriptor offset can be deduced from the CQE index instead of
 * reading 'cqe->index' */
index = cq->mcq.cons_index & ring->size_mask;
cqe = mlx4_en_get_cqe(cq->buf, index, priv->cqe_size) + factor;
```

It seems that although they have a completion
queue, they are still using the ring in order.

I guess storage devices may want OOO.

Best regards,
Tiwei Bie

> 
> Thanks
> 
> > 
> > > Not for the patch, but it looks like having a OUT_OF_ORDER feature bit is
> > > much more simpler to be started with.
> > +1
> > 
> > Best regards,
> > Tiwei Bie
> 


Re: [RFC v4 3/5] virtio_ring: add packed ring support

2018-05-16 Thread Tiwei Bie
On Wed, May 16, 2018 at 10:05:44PM +0800, Jason Wang wrote:
> On 2018年05月16日 21:45, Tiwei Bie wrote:
> > On Wed, May 16, 2018 at 08:51:43PM +0800, Jason Wang wrote:
> > > On 2018年05月16日 20:39, Tiwei Bie wrote:
> > > > On Wed, May 16, 2018 at 07:50:16PM +0800, Jason Wang wrote:
> > > > > On 2018年05月16日 16:37, Tiwei Bie wrote:
[...]
> > > > > > +static void detach_buf_packed(struct vring_virtqueue *vq, unsigned 
> > > > > > int head,
> > > > > > + unsigned int id, void **ctx)
> > > > > > +{
> > > > > > +   struct vring_packed_desc *desc;
> > > > > > +   unsigned int i, j;
> > > > > > +
> > > > > > +   /* Clear data ptr. */
> > > > > > +   vq->desc_state[id].data = NULL;
> > > > > > +
> > > > > > +   i = head;
> > > > > > +
> > > > > > +   for (j = 0; j < vq->desc_state[id].num; j++) {
> > > > > > +   desc = &vq->vring_packed.desc[i];
> > > > > > +   vring_unmap_one_packed(vq, desc);
> > > > > As mentioned in previous discussion, this probably won't work for the 
> > > > > case
> > > > > of out of order completion since it depends on the information in the
> > > > > descriptor ring. We probably need to extend ctx to record such 
> > > > > information.
> > > > Above code doesn't depend on the information in the descriptor
> > > > ring. The vq->desc_state[] is the extended ctx.
> > > > 
> > > > Best regards,
> > > > Tiwei Bie
> > > Yes, but desc is a pointer to descriptor ring I think so
> > > vring_unmap_one_packed() still depends on the content of descriptor ring?
> > > 
> > I got your point now. I think it makes sense to reserve
> > the bits of the addr field. Driver shouldn't try to get
> > addrs from the descriptors when cleanup the descriptors
> > no matter whether we support out-of-order or not.
> 
> Maybe I was wrong, but I remember spec mentioned something like this.

You're right. Spec mentioned this. I was just repeating
the spec to emphasize that it does make sense. :)

> 
> > 
> > But combining it with the out-of-order support, it will
> > mean that the driver still needs to maintain a desc/ctx
> > list that is very similar to the desc ring in the split
> > ring. I'm not quite sure whether it's something we want.
> > If it is true, I'll do it. So do you think we also want
> > to maintain such a desc/ctx list for packed ring?
> 
> To make it work for OOO backends I think we need something like this
> (hardware NIC drivers are usually have something like this).

Which hardware NIC drivers have this?

> 
> Not for the patch, but it looks like having a OUT_OF_ORDER feature bit is
> much more simpler to be started with.

+1

Best regards,
Tiwei Bie
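
[Editorial note] As a side note for readers of this thread, below is a minimal sketch of what such a per-ID desc/ctx entry could look like. It is purely illustrative (the field names are hypothetical and it is not part of the RFC); the point is only that, once the DMA addr/len are recorded at add time, detach/unmap no longer needs to read anything back from the descriptor ring, which is what makes out-of-order completion possible.

```c
/* Illustrative only -- not from the RFC; field names are hypothetical. */
#include <linux/types.h>
#include <linux/dma-mapping.h>

struct vring_desc_state_packed_sketch {
	void *data;		/* caller cookie, returned by get_buf */
	dma_addr_t addr;	/* DMA address saved when the buffer was mapped */
	u32 len;		/* mapped length */
	u16 flags;		/* mapping direction (to/from device) */
	u16 num;		/* number of descriptors used by this buffer */
	u16 next;		/* link to the next entry, like split ring's desc->next */
};
```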


Re: [RFC v4 3/5] virtio_ring: add packed ring support

2018-05-16 Thread Tiwei Bie
On Wed, May 16, 2018 at 08:51:43PM +0800, Jason Wang wrote:
> On 2018年05月16日 20:39, Tiwei Bie wrote:
> > On Wed, May 16, 2018 at 07:50:16PM +0800, Jason Wang wrote:
> > > On 2018年05月16日 16:37, Tiwei Bie wrote:
> > [...]
> > > >struct vring_virtqueue {
> > > > @@ -116,6 +117,9 @@ struct vring_virtqueue {
> > > > /* Last written value to driver->flags in
> > > >  * guest byte order. */
> > > > u16 event_flags_shadow;
> > > > +
> > > > +   /* ID allocation. */
> > > > +   struct idr buffer_id;
> > > I'm not sure idr is fit for the performance critical case here. Need to
> > > measure its performance impact, especially if we have few unused slots.
> > I'm also not sure.. But fortunately, it should be quite easy
> > to replace it with something else without changing other code.
> > If it will really hurt the performance, I'll change it.
> 
> We may want to do some benchmarking/profiling to see.

Yeah!

> 
> > 
> > > > };
> > > > };
> > [...]
> > > > +static void detach_buf_packed(struct vring_virtqueue *vq, unsigned int 
> > > > head,
> > > > + unsigned int id, void **ctx)
> > > > +{
> > > > +   struct vring_packed_desc *desc;
> > > > +   unsigned int i, j;
> > > > +
> > > > +   /* Clear data ptr. */
> > > > +   vq->desc_state[id].data = NULL;
> > > > +
> > > > +   i = head;
> > > > +
> > > > +   for (j = 0; j < vq->desc_state[id].num; j++) {
> > > > +   desc = &vq->vring_packed.desc[i];
> > > > +   vring_unmap_one_packed(vq, desc);
> > > As mentioned in previous discussion, this probably won't work for the case
> > > of out of order completion since it depends on the information in the
> > > descriptor ring. We probably need to extend ctx to record such 
> > > information.
> > Above code doesn't depend on the information in the descriptor
> > ring. The vq->desc_state[] is the extended ctx.
> > 
> > Best regards,
> > Tiwei Bie
> 
> Yes, but desc is a pointer to descriptor ring I think so
> vring_unmap_one_packed() still depends on the content of descriptor ring?
> 

I got your point now. I think it makes sense to reserve
the bits of the addr field. Driver shouldn't try to get
addrs from the descriptors when cleaning up the descriptors,
no matter whether we support out-of-order or not.

But combining it with the out-of-order support, it will
mean that the driver still needs to maintain a desc/ctx
list that is very similar to the desc ring in the split
ring. I'm not quite sure whether it's something we want.
If it is true, I'll do it. So do you think we also want
to maintain such a desc/ctx list for packed ring?

Best regards,
Tiwei Bie
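
[Editorial note] For what it's worth, here is a minimal sketch of one possible idr replacement discussed above: a free list threaded through a per-ID array. It is only an illustration under the assumption that IDs are bounded by the ring size (all names are made up); alloc/free become O(1) and need no allocation on the hot path.

```c
/* Hypothetical free-list based ID allocator for the packed ring,
 * sketched as an alternative to idr.  IDs are chained through a
 * caller-provided per-ID array. */
#include <linux/types.h>

struct id_alloc_sketch {
	u16 *next;	/* next[id] = next free id; one slot per ring entry */
	u16 free_head;	/* first free id, or num if none are free */
	u16 num;	/* ring size */
};

static void id_alloc_init(struct id_alloc_sketch *ida, u16 *slots, u16 num)
{
	u16 i;

	ida->next = slots;
	ida->num = num;
	ida->free_head = 0;
	for (i = 0; i < num; i++)
		ida->next[i] = i + 1;	/* last entry points to num (== "none") */
}

static u16 id_alloc_get(struct id_alloc_sketch *ida)
{
	u16 id = ida->free_head;

	if (id < ida->num)
		ida->free_head = ida->next[id];
	return id;	/* returns num when no ID is free */
}

static void id_alloc_put(struct id_alloc_sketch *ida, u16 id)
{
	ida->next[id] = ida->free_head;
	ida->free_head = id;
}
```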


Re: [RFC v4 4/5] virtio_ring: add event idx support in packed ring

2018-05-16 Thread Tiwei Bie
On Wed, May 16, 2018 at 08:17:21PM +0800, Jason Wang wrote:
> On 2018年05月16日 16:37, Tiwei Bie wrote:
[...]
> > @@ -1160,15 +1186,27 @@ static void virtqueue_disable_cb_packed(struct 
> > virtqueue *_vq)
> >   static unsigned virtqueue_enable_cb_prepare_packed(struct virtqueue *_vq)
> >   {
> > struct vring_virtqueue *vq = to_vvq(_vq);
> > +   u16 wrap_counter;
> > START_USE(vq);
> > /* We optimistically turn back on interrupts, then check if there was
> >  * more to do. */
> > +   /* Depending on the VIRTIO_RING_F_USED_EVENT_IDX feature, we need to
> > +* either clear the flags bit or point the event index at the next
> > +* entry. Always update the event index to keep code simple. */
> > +
> > +   wrap_counter = vq->wrap_counter;
> > +   if (vq->last_used_idx > vq->next_avail_idx)
> 
> Should this be ">=" consider rx refill may try to completely fill the ring?

It seems that there are two cases in which last_used_idx
equals next_avail_idx: the first is that the ring is
empty, and the second is that the ring is full. Although
in the first case, most probably, the driver won't enable
the interrupt.

Maybe I really should track the used_wrap_counter
instead of calculating it each time I need it.. I'll
give it a try..

> 
> > +   wrap_counter ^= 1;
> > +
> > +   vq->vring_packed.driver->off_wrap = cpu_to_virtio16(_vq->vdev,
> > +   vq->last_used_idx | (wrap_counter << 15));
> > if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
> > virtio_wmb(vq->weak_barriers);
> > -   vq->event_flags_shadow = VRING_EVENT_F_ENABLE;
> > +   vq->event_flags_shadow = vq->event ? VRING_EVENT_F_DESC :
> > +VRING_EVENT_F_ENABLE;
> > vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
> > vq->event_flags_shadow);
> > }
> > @@ -1194,15 +1232,40 @@ static bool virtqueue_poll_packed(struct virtqueue 
> > *_vq, unsigned last_used_idx)
> >   static bool virtqueue_enable_cb_delayed_packed(struct virtqueue *_vq)
> >   {
> > struct vring_virtqueue *vq = to_vvq(_vq);
> > +   u16 bufs, used_idx, wrap_counter;
> > START_USE(vq);
> > /* We optimistically turn back on interrupts, then check if there was
> >  * more to do. */
> > +   /* Depending on the VIRTIO_RING_F_USED_EVENT_IDX feature, we need to
> > +* either clear the flags bit or point the event index at the next
> > +* entry. Always update the event index to keep code simple. */
> > +
> > +   /* TODO: tune this threshold */
> > +   if (vq->next_avail_idx < vq->last_used_idx)
> > +   bufs = (vq->vring_packed.num + vq->next_avail_idx -
> > +   vq->last_used_idx) * 3 / 4;
> > +   else
> > +   bufs = (vq->next_avail_idx - vq->last_used_idx) * 3 / 4;
> > +
> > +   wrap_counter = vq->wrap_counter;
> > +   if (vq->last_used_idx > vq->next_avail_idx)
> > +   wrap_counter ^= 1;
> > +
> > +   used_idx = vq->last_used_idx + bufs;
> > +   if (used_idx >= vq->vring_packed.num) {
> > +   used_idx -= vq->vring_packed.num;
> > +   wrap_counter ^= 1;
> > +   }
> 
> Looks correct but maybe it's better to add some comments for such logic.

Make sense.

> 
> > +
> > +   vq->vring_packed.driver->off_wrap = cpu_to_virtio16(_vq->vdev,
> > +   used_idx | (wrap_counter << 15));
> > if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
> > virtio_wmb(vq->weak_barriers);
> > -   vq->event_flags_shadow = VRING_EVENT_F_ENABLE;
> > +   vq->event_flags_shadow = vq->event ? VRING_EVENT_F_DESC :
> > +VRING_EVENT_F_ENABLE;
> >     vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
> > vq->event_flags_shadow);
> > }
> > @@ -1869,8 +1932,10 @@ void vring_transport_features(struct virtio_device 
> > *vdev)
> > switch (i) {
> > case VIRTIO_RING_F_INDIRECT_DESC:
> > break;
> > +#if 0
> > case VIRTIO_RING_F_EVENT_IDX:
> > break;
> > +#endif
> 
> Maybe it's time to remove this #if 0.

Will do it.

Thanks for the review!

Best regards,
Tiwei Bie
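
[Editorial note] For readers following the event index discussion above, here is a hedged sketch (not the patch code) of how the used-event value written to driver->off_wrap can be computed once the driver tracks used_wrap_counter explicitly, which also sidesteps the full-vs-empty ambiguity when last_used_idx equals next_avail_idx.

```c
/* Illustrative helper, assuming the caller tracks used_wrap_counter
 * explicitly.  The low 15 bits of the returned value carry the event
 * index; bit 15 carries the wrap counter the device will have when it
 * uses the entry at that index. */
#include <linux/types.h>

static u16 packed_used_event_sketch(u16 last_used_idx, u16 bufs,
				    u16 ring_num, u16 used_wrap_counter)
{
	u16 used_idx = last_used_idx + bufs;

	if (used_idx >= ring_num) {
		/* The event index wraps past the end of the ring, so the
		 * device's wrap counter will have flipped by then. */
		used_idx -= ring_num;
		used_wrap_counter ^= 1;
	}

	return used_idx | (used_wrap_counter << 15);
}
```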


Re: [RFC v4 3/5] virtio_ring: add packed ring support

2018-05-16 Thread Tiwei Bie
On Wed, May 16, 2018 at 07:50:16PM +0800, Jason Wang wrote:
> On 2018年05月16日 16:37, Tiwei Bie wrote:
[...]
> >   struct vring_virtqueue {
> > @@ -116,6 +117,9 @@ struct vring_virtqueue {
> > /* Last written value to driver->flags in
> >  * guest byte order. */
> > u16 event_flags_shadow;
> > +
> > +   /* ID allocation. */
> > +   struct idr buffer_id;
> 
> I'm not sure idr is fit for the performance critical case here. Need to
> measure its performance impact, especially if we have few unused slots.

I'm also not sure... But fortunately, it should be quite easy
to replace it with something else without changing other code.
If it really hurts the performance, I'll change it.

> 
> > };
> > };
[...]
> > +static void detach_buf_packed(struct vring_virtqueue *vq, unsigned int 
> > head,
> > + unsigned int id, void **ctx)
> > +{
> > +   struct vring_packed_desc *desc;
> > +   unsigned int i, j;
> > +
> > +   /* Clear data ptr. */
> > +   vq->desc_state[id].data = NULL;
> > +
> > +   i = head;
> > +
> > +   for (j = 0; j < vq->desc_state[id].num; j++) {
> > +   desc = &vq->vring_packed.desc[i];
> > +   vring_unmap_one_packed(vq, desc);
> 
> As mentioned in previous discussion, this probably won't work for the case
> of out of order completion since it depends on the information in the
> descriptor ring. We probably need to extend ctx to record such information.

Above code doesn't depend on the information in the descriptor
ring. The vq->desc_state[] is the extended ctx.

Best regards,
Tiwei Bie


Re: [RFC v4 5/5] virtio_ring: enable packed ring

2018-05-16 Thread Tiwei Bie
On Wed, May 16, 2018 at 02:42:53PM +0300, Sergei Shtylyov wrote:
> On 05/16/2018 01:21 PM, Tiwei Bie wrote:
> 
> >>> Signed-off-by: Tiwei Bie <tiwei@intel.com>
> >>> ---
> >>>   drivers/virtio/virtio_ring.c | 2 ++
> >>>   1 file changed, 2 insertions(+)
> >>>
> >>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> >>> index de3839f3621a..b158692263b0 100644
> >>> --- a/drivers/virtio/virtio_ring.c
> >>> +++ b/drivers/virtio/virtio_ring.c
> >>> @@ -1940,6 +1940,8 @@ void vring_transport_features(struct virtio_device 
> >>> *vdev)
> >>>   break;
> >>>   case VIRTIO_F_IOMMU_PLATFORM:
> >>>   break;
> >>> + case VIRTIO_F_RING_PACKED:
> >>> + break;
> >>
> >>Why not just add this *case* under the previous *case*?
> > 
> > Do you mean fallthrough? Something like:
> > 
> > case VIRTIO_F_IOMMU_PLATFORM:
> > case VIRTIO_F_RING_PACKED:
> > break;
> 
>Yes, exactly. :-)

Using fallthrough in this case will make the code more
compact. I like such a coding style, but unfortunately
it's not consistent with the existing code. :(

The whole function will become something like this:

void vring_transport_features(struct virtio_device *vdev)
{
	unsigned int i;

	for (i = VIRTIO_TRANSPORT_F_START; i < VIRTIO_TRANSPORT_F_END; i++) {
		switch (i) {
		case VIRTIO_RING_F_INDIRECT_DESC:
			break;
		case VIRTIO_RING_F_EVENT_IDX:
			break;
		case VIRTIO_F_VERSION_1:
			break;
		case VIRTIO_F_IOMMU_PLATFORM:
		case VIRTIO_F_RING_PACKED:
			break;
		default:
			/* We don't understand this bit. */
			__virtio_clear_bit(vdev, i);
		}
	}
}

Best regards,
Tiwei Bie

> 
> > Best regards,
> > Tiwei Bie
> 
> [...]
> 
> MBR, Sergei
> 


Re: [RFC v4 5/5] virtio_ring: enable packed ring

2018-05-16 Thread Tiwei Bie
On Wed, May 16, 2018 at 01:15:48PM +0300, Sergei Shtylyov wrote:
> On 5/16/2018 11:37 AM, Tiwei Bie wrote:
> 
> > Signed-off-by: Tiwei Bie <tiwei@intel.com>
> > ---
> >   drivers/virtio/virtio_ring.c | 2 ++
> >   1 file changed, 2 insertions(+)
> > 
> > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > index de3839f3621a..b158692263b0 100644
> > --- a/drivers/virtio/virtio_ring.c
> > +++ b/drivers/virtio/virtio_ring.c
> > @@ -1940,6 +1940,8 @@ void vring_transport_features(struct virtio_device 
> > *vdev)
> > break;
> > case VIRTIO_F_IOMMU_PLATFORM:
> > break;
> > +   case VIRTIO_F_RING_PACKED:
> > +   break;
> 
>Why not just add this *case* under the previous *case*?

Do you mean fallthrough? Something like:

case VIRTIO_F_IOMMU_PLATFORM:
case VIRTIO_F_RING_PACKED:
break;

Best regards,
Tiwei Bie

> 
> > default:
> > /* We don't understand this bit. */
> > __virtio_clear_bit(vdev, i);
> 
> MBR, Sergei


[RFC v4 1/5] virtio: add packed ring definitions

2018-05-16 Thread Tiwei Bie
Signed-off-by: Tiwei Bie <tiwei@intel.com>
---
 include/uapi/linux/virtio_config.h | 12 +-
 include/uapi/linux/virtio_ring.h   | 36 ++
 2 files changed, 47 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/virtio_config.h 
b/include/uapi/linux/virtio_config.h
index 308e2096291f..a6e392325e3a 100644
--- a/include/uapi/linux/virtio_config.h
+++ b/include/uapi/linux/virtio_config.h
@@ -49,7 +49,7 @@
  * transport being used (eg. virtio_ring), the rest are per-device feature
  * bits. */
 #define VIRTIO_TRANSPORT_F_START   28
-#define VIRTIO_TRANSPORT_F_END 34
+#define VIRTIO_TRANSPORT_F_END 36
 
 #ifndef VIRTIO_CONFIG_NO_LEGACY
 /* Do we get callbacks when the ring is completely used, even if we've
@@ -71,4 +71,14 @@
  * this is for compatibility with legacy systems.
  */
 #define VIRTIO_F_IOMMU_PLATFORM33
+
+/* This feature indicates support for the packed virtqueue layout. */
+#define VIRTIO_F_RING_PACKED   34
+
+/*
+ * This feature indicates that all buffers are used by the device
+ * in the same order in which they have been made available.
+ */
+#define VIRTIO_F_IN_ORDER  35
+
 #endif /* _UAPI_LINUX_VIRTIO_CONFIG_H */
diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h
index 6d5d5faa989b..3932cb80c347 100644
--- a/include/uapi/linux/virtio_ring.h
+++ b/include/uapi/linux/virtio_ring.h
@@ -44,6 +44,9 @@
 /* This means the buffer contains a list of buffer descriptors. */
 #define VRING_DESC_F_INDIRECT  4
 
+#define VRING_DESC_F_AVAIL(b)  ((b) << 7)
+#define VRING_DESC_F_USED(b)   ((b) << 15)
+
 /* The Host uses this in used->flags to advise the Guest: don't kick me when
  * you add a buffer.  It's unreliable, so it's simply an optimization.  Guest
  * will still kick if it's out of buffers. */
@@ -53,6 +56,10 @@
  * optimization.  */
 #define VRING_AVAIL_F_NO_INTERRUPT 1
 
+#define VRING_EVENT_F_ENABLE   0x0
+#define VRING_EVENT_F_DISABLE  0x1
+#define VRING_EVENT_F_DESC 0x2
+
 /* We support indirect buffer descriptors */
 #define VIRTIO_RING_F_INDIRECT_DESC28
 
@@ -171,4 +178,33 @@ static inline int vring_need_event(__u16 event_idx, __u16 
new_idx, __u16 old)
return (__u16)(new_idx - event_idx - 1) < (__u16)(new_idx - old);
 }
 
+struct vring_packed_desc_event {
+   /* __virtio16 off  : 15; // Descriptor Event Offset
+* __virtio16 wrap : 1;  // Descriptor Event Wrap Counter */
+   __virtio16 off_wrap;
+   /* __virtio16 flags : 2; // Descriptor Event Flags */
+   __virtio16 flags;
+};
+
+struct vring_packed_desc {
+   /* Buffer Address. */
+   __virtio64 addr;
+   /* Buffer Length. */
+   __virtio32 len;
+   /* Buffer ID. */
+   __virtio16 id;
+   /* The flags depending on descriptor type. */
+   __virtio16 flags;
+};
+
+struct vring_packed {
+   unsigned int num;
+
+   struct vring_packed_desc *desc;
+
+   struct vring_packed_desc_event *driver;
+
+   struct vring_packed_desc_event *device;
+};
+
 #endif /* _UAPI_LINUX_VIRTIO_RING_H */
-- 
2.17.0
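
[Editorial note] A small usage sketch of the definitions added above (illustrative only, not part of the patch; virtio16 endianness conversion is omitted for brevity): a descriptor has been used by the device when its AVAIL and USED bits are equal and match the wrap counter the driver expects for used entries.

```c
#include <linux/types.h>
#include <linux/virtio_ring.h>	/* for the VRING_DESC_F_AVAIL/USED macros above */

/* Illustrative only: 'flags' is the descriptor's flags field after
 * conversion to CPU endianness. */
static inline bool packed_desc_is_used_sketch(u16 flags, bool used_wrap_counter)
{
	bool avail = !!(flags & VRING_DESC_F_AVAIL(1));
	bool used  = !!(flags & VRING_DESC_F_USED(1));

	return avail == used && used == used_wrap_counter;
}
```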


