Re: [virtio-dev] Re: [PATCH net-next v2 0/5] virtio: support packed ring

2018-10-11 Thread Tiwei Bie
On Thu, Oct 11, 2018 at 10:17:15AM -0400, Michael S. Tsirkin wrote:
> On Thu, Oct 11, 2018 at 10:13:31PM +0800, Tiwei Bie wrote:
> > On Thu, Oct 11, 2018 at 09:48:48AM -0400, Michael S. Tsirkin wrote:
> > > On Thu, Oct 11, 2018 at 08:12:21PM +0800, Tiwei Bie wrote:
> > > > > > But if it's not too late, I second for a OUT_OF_ORDER feature.
> > > > > > Starting from in order can have much simpler code in driver.
> > > > > > 
> > > > > > Thanks
> > > > > 
> > > > > It's tricky to change the flag polarity because of compatibility
> > > > > with legacy interfaces. Why is this such a big deal?
> > > > > 
> > > > > Let's teach drivers about IN_ORDER, then if devices
> > > > > are in order it will get enabled by default.
> > > > 
> > > > Yeah, make sense.
> > > > 
> > > > Besides, I have done some further profiling and debugging
> > > > both in kernel driver and DPDK vhost. Previously I was mislead
> > > > by a bug in vhost code. I will send a patch to fix that bug.
> > > > With that bug fixed, the performance of packed ring in the
> > > > test between kernel driver and DPDK vhost is better now.
> > > 
> > > OK, if we get a performance gain on the virtio side, we can finally
> > > upstream it. If you see that please re-post ASAP so we can
> > > put it in the next kernel release.
> > 
> > Got it, I will re-post ASAP.
> > 
> > Thanks!
> 
> 
> Pls remember to include data on performance gain in the cover letter.

Sure. I'll try to include some performance analyses.



Re: [virtio-dev] Re: [PATCH net-next v2 0/5] virtio: support packed ring

2018-10-11 Thread Tiwei Bie
On Thu, Oct 11, 2018 at 09:48:48AM -0400, Michael S. Tsirkin wrote:
> On Thu, Oct 11, 2018 at 08:12:21PM +0800, Tiwei Bie wrote:
> > > > But if it's not too late, I second for a OUT_OF_ORDER feature.
> > > > Starting from in order can have much simpler code in driver.
> > > > 
> > > > Thanks
> > > 
> > > It's tricky to change the flag polarity because of compatibility
> > > with legacy interfaces. Why is this such a big deal?
> > > 
> > > Let's teach drivers about IN_ORDER, then if devices
> > > are in order it will get enabled by default.
> > 
> > Yeah, make sense.
> > 
> > Besides, I have done some further profiling and debugging
> > both in kernel driver and DPDK vhost. Previously I was mislead
> > by a bug in vhost code. I will send a patch to fix that bug.
> > With that bug fixed, the performance of packed ring in the
> > test between kernel driver and DPDK vhost is better now.
> 
> OK, if we get a performance gain on the virtio side, we can finally
> upstream it. If you see that please re-post ASAP so we can
> put it in the next kernel release.

Got it, I will re-post ASAP.

Thanks!


> 
> -- 
> MST


Re: [virtio-dev] Re: [PATCH net-next v2 0/5] virtio: support packed ring

2018-10-11 Thread Tiwei Bie
On Wed, Oct 10, 2018 at 10:36:26AM -0400, Michael S. Tsirkin wrote:
> On Thu, Sep 13, 2018 at 05:47:29PM +0800, Jason Wang wrote:
> > On 2018年09月13日 16:59, Tiwei Bie wrote:
> > > > If what you say is true then we should take a careful look
> > > > and not support these generic things with the packed layout.
> > > > Once we do support them it will be too late and we won't
> > > > be able to get the performance back.
> > > I think it's a good point that we don't need to support
> > > everything in the packed ring (especially those things which
> > > would hurt performance), as the packed ring aims at high
> > > performance. I'm also wondering about the features. Is
> > > there any possibility that we won't support out-of-order
> > > processing (at least not by default) in the packed ring?
> > > If I didn't miss anything, the need to support out-of-order
> > > processing in the packed ring will make the data structure
> > > inside the driver not cache friendly, which is similar to
> > > the case of the descriptor table in the split ring (the
> > > difference is that it now only happens in the driver).
> > 
> > Out of order is not the only user, DMA is another one. We don't have the used
> > ring (and its len field), so we need to maintain the buffer length somewhere
> > even for an in-order device.
> 
> On a bunch of systems DMA unmap is a nop, so we do not really
> need to maintain it. It's a question of an API to detect that
> and optimize for it. I posted a proposed patch for that -
> want to try using that?

Yeah, definitely!

> 
> > But if it's not too late, I second having an OUT_OF_ORDER feature.
> > Starting from in order allows much simpler code in the driver.
> > 
> > Thanks
> 
> It's tricky to change the flag polarity because of compatibility
> with legacy interfaces. Why is this such a big deal?
> 
> Let's teach drivers about IN_ORDER, then if devices
> are in order it will get enabled by default.

Yeah, makes sense.

Besides, I have done some further profiling and debugging
in both the kernel driver and DPDK vhost. Previously I was misled
by a bug in the vhost code. I will send a patch to fix that bug.
With that bug fixed, the performance of the packed ring in the
test between the kernel driver and DPDK vhost is better now.
I will send a new series soon. Thanks!

> 
> -- 
> MST


Re: [RFC v5 2/5] virtio_ring: support creating packed ring

2018-05-28 Thread Tiwei Bie
On Tue, May 29, 2018 at 10:49:11AM +0800, Jason Wang wrote:
> On 2018年05月22日 16:16, Tiwei Bie wrote:
> > This commit introduces the support for creating packed ring.
> > All split ring specific functions are added _split suffix.
> > Some necessary stubs for packed ring are also added.
> > 
> > Signed-off-by: Tiwei Bie 
> > ---
> >   drivers/virtio/virtio_ring.c | 801 +++
> >   include/linux/virtio_ring.h  |   8 +-
> >   2 files changed, 546 insertions(+), 263 deletions(-)
> > 
> > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > index 71458f493cf8..f5ef5f42a7cf 100644
> > --- a/drivers/virtio/virtio_ring.c
> > +++ b/drivers/virtio/virtio_ring.c
> > @@ -61,11 +61,15 @@ struct vring_desc_state {
> > struct vring_desc *indir_desc;  /* Indirect descriptor, if any. */
> >   };
> > +struct vring_desc_state_packed {
> > +   int next;   /* The next desc state. */
> > +};
> > +
> >   struct vring_virtqueue {
> > struct virtqueue vq;
> > -   /* Actual memory layout for this queue */
> > -   struct vring vring;
> > +   /* Is this a packed ring? */
> > +   bool packed;
> > /* Can we use weak barriers? */
> > bool weak_barriers;
> > @@ -87,11 +91,39 @@ struct vring_virtqueue {
> > /* Last used index we've seen. */
> > u16 last_used_idx;
> > -   /* Last written value to avail->flags */
> > -   u16 avail_flags_shadow;
> > +   union {
> > +   /* Available for split ring */
> > +   struct {
> > +   /* Actual memory layout for this queue. */
> > +   struct vring vring;
> > -   /* Last written value to avail->idx in guest byte order */
> > -   u16 avail_idx_shadow;
> > +   /* Last written value to avail->flags */
> > +   u16 avail_flags_shadow;
> > +
> > +   /* Last written value to avail->idx in
> > +* guest byte order. */
> > +   u16 avail_idx_shadow;
> > +   };
> > +
> > +   /* Available for packed ring */
> > +   struct {
> > +   /* Actual memory layout for this queue. */
> > +   struct vring_packed vring_packed;
> > +
> > +   /* Driver ring wrap counter. */
> > +   u8 avail_wrap_counter;
> > +
> > +   /* Device ring wrap counter. */
> > +   u8 used_wrap_counter;
> 
> How about just use boolean?

I can change it to bool if you want.

> 
[...]
> > -static int vring_mapping_error(const struct vring_virtqueue *vq,
> > -  dma_addr_t addr)
> > -{
> > -   if (!vring_use_dma_api(vq->vq.vdev))
> > -   return 0;
> > -
> > -   return dma_mapping_error(vring_dma_dev(vq), addr);
> > -}
> 
> It looks to me that if you kept vring_mapping_error() behind
> vring_unmap_one_split(), lots of changes would be unnecessary.
> 
[...]
> > +   }
> > +   /* That should have freed everything. */
> > +   BUG_ON(vq->vq.num_free != vq->vring.num);
> > +
> > +   END_USE(vq);
> > +   return NULL;
> > +}
> 
> I think those copy-and-paste hunks could be avoided and the diff should
> only contain the renaming of the functions. If so, it would be very welcome,
> since otherwise the changes have to be compared verbatim.

Michael suggested laying out the code as:

XXX_split

XXX_packed

XXX wrappers

https://lkml.org/lkml/2018/4/13/410

That's why I moved some code.
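
Just to illustrate the resulting layout (a rough sketch only, not the
actual diff): the "XXX wrappers" part simply dispatches to the _split
or _packed implementation, e.g.:

```
/* Sketch only: the wrapper layer dispatches on vq->packed. */
static inline int virtqueue_add(struct virtqueue *_vq,
				struct scatterlist *sgs[],
				unsigned int total_sg,
				unsigned int out_sgs,
				unsigned int in_sgs,
				void *data,
				void *ctx,
				gfp_t gfp)
{
	struct vring_virtqueue *vq = to_vvq(_vq);

	return vq->packed ? virtqueue_add_packed(_vq, sgs, total_sg, out_sgs,
						 in_sgs, data, ctx, gfp) :
			    virtqueue_add_split(_vq, sgs, total_sg, out_sgs,
						in_sgs, data, ctx, gfp);
}
```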

> 
> > +
> > +/*
> > + * The layout for the packed ring is a continuous chunk of memory
> > + * which looks like this.
> > + *
> > + * struct vring_packed {
> > + * // The actual descriptors (16 bytes each)
> > + * struct vring_packed_desc desc[num];
> > + *
> > + * // Padding to the next align boundary.
> > + * char pad[];
> > + *
> > + * // Driver Event Suppression
> > + * struct vring_packed_desc_event driver;
> > + *
> > + * // Device Event Suppression
> > + * struct vring_packed_desc_event device;
> > + * };
> > + */
> > +static inline void vring_init_packed(struct vring_packed *vr, unsigned int 
> > num,
> > +void *p, unsigned long align)
> > +{
> > +   vr->num = num;
> > +   vr->desc = p;
> > +   vr->driver = (void *)(((uintptr_t)p + sizeof(struct vring_packed_desc)
> > +   * num + align - 1) &

Re: [RFC v5 3/5] virtio_ring: add packed ring support

2018-05-28 Thread Tiwei Bie
On Tue, May 29, 2018 at 11:18:57AM +0800, Jason Wang wrote:
> On 2018年05月22日 16:16, Tiwei Bie wrote:
[...]
> > +static void detach_buf_packed(struct vring_virtqueue *vq,
> > + unsigned int id, void **ctx)
> > +{
> > +   struct vring_desc_state_packed *state;
> > +   struct vring_packed_desc *desc;
> > +   unsigned int i;
> > +   int *next;
> > +
> > +   /* Clear data ptr. */
> > +   vq->desc_state_packed[id].data = NULL;
> > +
> > > > +   next = &id;
> > +   for (i = 0; i < vq->desc_state_packed[id].num; i++) {
> > > > +   state = &vq->desc_state_packed[*next];
> > +   vring_unmap_state_packed(vq, state);
> > > > +   next = &vq->desc_state_packed[*next].next;
> 
> Have a short discussion with Michael. It looks like the only thing we care
> is DMA address and length here.

The length tracked by desc_state_packed[] is also necessary
when INDIRECT is used and the buffers are writable.

> 
> So a thought is to avoid using desc_state_packed[] if vring_use_dma_api() is
> false? Probably need another id allocator, or just use desc_state_packed for
> the free id list.

I think it's a good suggestion. Thanks!

I won't track the addr/len/flags for non-indirect descs
when vring_use_dma_api() returns false.
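
Roughly something like this in the add path (a sketch only to show the
idea; the variable names cur/addr/sg/flags are illustrative, not the
final code):

```
/* Sketch only: record the DMA info needed by a later unmap only when
 * the DMA API is actually in use; otherwise unmap is a nop and the
 * extra stores are pure overhead. */
if (vring_use_dma_api(_vq->vdev)) {
	vq->desc_state_packed[cur].addr = addr;
	vq->desc_state_packed[cur].len = sg->length;
	vq->desc_state_packed[cur].flags = virtio16_to_cpu(_vq->vdev, flags);
}
```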

> 
> > +   }
> > +
> > +   vq->vq.num_free += vq->desc_state_packed[id].num;
> > +   *next = vq->free_head;
> 
> Using a pointer seems inelegant and unnecessary here. You can just track
> next as an integer, I think?

It's possible to use an integer, but that will need one more
variable to track `prev` (desc_state_packed[prev].next == next
in this case).
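
For reference, the integer-based walk would look roughly like this
(a sketch only; the function name is made up):

```
/* Sketch only: the same free-list walk with integer indexes.  An extra
 * variable is needed to remember the previous entry so that the chain
 * can be linked back onto the free list at the end. */
static void detach_chain_by_index(struct vring_virtqueue *vq, unsigned int id)
{
	unsigned int i, prev = id, cur = id;

	for (i = 0; i < vq->desc_state_packed[id].num; i++) {
		vring_unmap_state_packed(vq, &vq->desc_state_packed[cur]);
		prev = cur;
		cur = vq->desc_state_packed[cur].next;
	}

	vq->vq.num_free += vq->desc_state_packed[id].num;
	vq->desc_state_packed[prev].next = vq->free_head;
	vq->free_head = id;
}
```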

> 
> > +   vq->free_head = id;
> > +
> > +   if (vq->indirect) {
> > +   u32 len;
> > +
> > +   /* Free the indirect table, if any, now that it's unmapped. */
> > +   desc = vq->desc_state_packed[id].indir_desc;
> > +   if (!desc)
> > +   return;
> > +
> > +   len = vq->desc_state_packed[id].len;
> > +   for (i = 0; i < len / sizeof(struct vring_packed_desc); i++)
> > > > +   vring_unmap_desc_packed(vq, &desc[i]);
> > +
> > +   kfree(desc);
> > +   vq->desc_state_packed[id].indir_desc = NULL;
> > +   } else if (ctx) {
> > +   *ctx = vq->desc_state_packed[id].indir_desc;
> > +   }
> >   }
> >   static inline bool more_used_packed(const struct vring_virtqueue *vq)
> >   {
> > -   return false;
> > +   u16 last_used, flags;
> > +   u8 avail, used;
> > +
> > +   last_used = vq->last_used_idx;
> > +   flags = virtio16_to_cpu(vq->vq.vdev,
> > +   vq->vring_packed.desc[last_used].flags);
> > +   avail = !!(flags & VRING_DESC_F_AVAIL(1));
> > +   used = !!(flags & VRING_DESC_F_USED(1));
> > +
> > +   return avail == used && used == vq->used_wrap_counter;
> 
> The spec does not check avail == used? So there's probably a bug in one of
> the two places.
> 
> In what condition can avail != used but used == vq->used_wrap_counter? A
> buggy device, or a device that sets the two bits in two writes?

Currently, the spec doesn't forbid devices from setting the status
bits such that avail != used but used == vq->used_wrap_counter.
So I think the driver cannot assume this won't happen.

> 
> >   }
[...]
> >   static unsigned virtqueue_enable_cb_prepare_packed(struct virtqueue *_vq)
> >   {
> > -   return 0;
> > +   struct vring_virtqueue *vq = to_vvq(_vq);
> > +
> > +   START_USE(vq);
> > +
> > +   /* We optimistically turn back on interrupts, then check if there was
> > +* more to do. */
> > +
> > +   if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
> > +   virtio_wmb(vq->weak_barriers);
> 
> Missing comments for the barrier.

Will add some comments.

> 
> > +   vq->event_flags_shadow = VRING_EVENT_F_ENABLE;
> > +   vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
> > +   vq->event_flags_shadow);
> > +   }
> > +
> > +   END_USE(vq);
> > +   return vq->last_used_idx;
> >   }
> >   static bool virtqueue_poll_packed(struct virtqueue *_vq, unsigned 
> > last_used_idx)
> >   {
> > -   return false;
> > +   struct vring_virtqueue *vq = to_vvq(_vq);
> > +   u8 avail, used;
> > +   u16 flags;
> > +
> > +   virtio_mb(vq->weak_barriers);
> > +   flags = virtio16_to_cpu(vq->vq.vdev,
> > +   vq->vring_packed.desc[last_used_idx].flags);
> > +   avail = !!(flags & VRING_DESC_F_AVAIL(1));
> > +   used = !!(flags & VRING_DESC_F_USED(1));
> > +
> > +   return avail == used && used == vq->used_wrap_counter;
> >   }
> >   static bool virtqueue_enable_cb_delayed_packed(struct virtqueue *_vq)
> >   {
> > -   return false;
> > +   struct vring_virtqueue *vq = to_vvq(_vq);
> > +
> > +   START_USE(vq);
> > +
> > +   /* We optimistically turn back on interrupts, then check if there was
> > +* more to do. */
> > +
> > +   if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
> > +   virtio_wmb(vq->weak_barriers);
> 
> Need comments for the barrier.

Will add some comments.

Best regards,
Tiwei Bie


Re: [RFC v5 0/5] virtio: support packed ring

2018-05-24 Thread Tiwei Bie
On Fri, May 25, 2018 at 10:31:26AM +0800, Jason Wang wrote:
> On 2018年05月22日 16:16, Tiwei Bie wrote:
> > Hello everyone,
> > 
> > This RFC implements packed ring support in virtio driver.
> > 
> > Some simple functional tests have been done with Jason's
> > packed ring implementation in vhost (RFC v4):
> > 
> > https://lkml.org/lkml/2018/5/16/501
> > 
> > Both of ping and netperf worked as expected w/ EVENT_IDX
> > disabled. Ping worked as expected w/ EVENT_IDX enabled,
> > but netperf didn't (A hack has been added in the driver
> > to make netperf test pass in this case. The hack can be
> > found by `grep -rw XXX` in the code).
> 
> Looks like this is because of a bug in vhost which wrongly tracks signalled_used
> and may miss an interrupt. After fixing that, both sides work like a charm.
> 
> I'm testing vIOMMU and zerocopy, and will post a new version shortly. (Hopefully
> it will be the last RFC version.)

Great to know that. Thanks a lot! :)

Best regards,
Tiwei Bie


[RFC v5 1/5] virtio: add packed ring definitions

2018-05-22 Thread Tiwei Bie
Signed-off-by: Tiwei Bie <tiwei@intel.com>
---
 include/uapi/linux/virtio_config.h |  5 -
 include/uapi/linux/virtio_ring.h   | 36 ++
 2 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/virtio_config.h 
b/include/uapi/linux/virtio_config.h
index 308e2096291f..932a6ecc8e46 100644
--- a/include/uapi/linux/virtio_config.h
+++ b/include/uapi/linux/virtio_config.h
@@ -49,7 +49,7 @@
  * transport being used (eg. virtio_ring), the rest are per-device feature
  * bits. */
 #define VIRTIO_TRANSPORT_F_START   28
-#define VIRTIO_TRANSPORT_F_END 34
+#define VIRTIO_TRANSPORT_F_END 35
 
 #ifndef VIRTIO_CONFIG_NO_LEGACY
 /* Do we get callbacks when the ring is completely used, even if we've
@@ -71,4 +71,7 @@
  * this is for compatibility with legacy systems.
  */
 #define VIRTIO_F_IOMMU_PLATFORM33
+
+/* This feature indicates support for the packed virtqueue layout. */
+#define VIRTIO_F_RING_PACKED   34
 #endif /* _UAPI_LINUX_VIRTIO_CONFIG_H */
diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h
index 6d5d5faa989b..3932cb80c347 100644
--- a/include/uapi/linux/virtio_ring.h
+++ b/include/uapi/linux/virtio_ring.h
@@ -44,6 +44,9 @@
 /* This means the buffer contains a list of buffer descriptors. */
 #define VRING_DESC_F_INDIRECT  4
 
+#define VRING_DESC_F_AVAIL(b)  ((b) << 7)
+#define VRING_DESC_F_USED(b)   ((b) << 15)
+
 /* The Host uses this in used->flags to advise the Guest: don't kick me when
  * you add a buffer.  It's unreliable, so it's simply an optimization.  Guest
  * will still kick if it's out of buffers. */
@@ -53,6 +56,10 @@
  * optimization.  */
 #define VRING_AVAIL_F_NO_INTERRUPT 1
 
+#define VRING_EVENT_F_ENABLE   0x0
+#define VRING_EVENT_F_DISABLE  0x1
+#define VRING_EVENT_F_DESC 0x2
+
 /* We support indirect buffer descriptors */
 #define VIRTIO_RING_F_INDIRECT_DESC28
 
@@ -171,4 +178,33 @@ static inline int vring_need_event(__u16 event_idx, __u16 
new_idx, __u16 old)
return (__u16)(new_idx - event_idx - 1) < (__u16)(new_idx - old);
 }
 
+struct vring_packed_desc_event {
+   /* __virtio16 off  : 15; // Descriptor Event Offset
+* __virtio16 wrap : 1;  // Descriptor Event Wrap Counter */
+   __virtio16 off_wrap;
+   /* __virtio16 flags : 2; // Descriptor Event Flags */
+   __virtio16 flags;
+};
+
+struct vring_packed_desc {
+   /* Buffer Address. */
+   __virtio64 addr;
+   /* Buffer Length. */
+   __virtio32 len;
+   /* Buffer ID. */
+   __virtio16 id;
+   /* The flags depending on descriptor type. */
+   __virtio16 flags;
+};
+
+struct vring_packed {
+   unsigned int num;
+
+   struct vring_packed_desc *desc;
+
+   struct vring_packed_desc_event *driver;
+
+   struct vring_packed_desc_event *device;
+};
+
 #endif /* _UAPI_LINUX_VIRTIO_RING_H */
-- 
2.17.0



[RFC v5 3/5] virtio_ring: add packed ring support

2018-05-22 Thread Tiwei Bie
This commit introduces support (without EVENT_IDX) for the
packed ring.

Signed-off-by: Tiwei Bie <tiwei@intel.com>
---
 drivers/virtio/virtio_ring.c | 474 ++-
 1 file changed, 468 insertions(+), 6 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index f5ef5f42a7cf..eb9fd5207a68 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -62,6 +62,12 @@ struct vring_desc_state {
 };
 
 struct vring_desc_state_packed {
+   void *data; /* Data for callback. */
+   struct vring_packed_desc *indir_desc; /* Indirect descriptor, if any. */
+   int num;/* Descriptor list length. */
+   dma_addr_t addr;/* Buffer DMA addr. */
+   u32 len;/* Buffer length. */
+   u16 flags;  /* Descriptor flags. */
int next;   /* The next desc state. */
 };
 
@@ -758,6 +764,72 @@ static inline unsigned vring_size_packed(unsigned int num, 
unsigned long align)
& ~(align - 1)) + sizeof(struct vring_packed_desc_event) * 2;
 }
 
+static void vring_unmap_state_packed(const struct vring_virtqueue *vq,
+struct vring_desc_state_packed *state)
+{
+   u16 flags;
+
+   if (!vring_use_dma_api(vq->vq.vdev))
+   return;
+
+   flags = state->flags;
+
+   if (flags & VRING_DESC_F_INDIRECT) {
+   dma_unmap_single(vring_dma_dev(vq),
+state->addr, state->len,
+(flags & VRING_DESC_F_WRITE) ?
+DMA_FROM_DEVICE : DMA_TO_DEVICE);
+   } else {
+   dma_unmap_page(vring_dma_dev(vq),
+  state->addr, state->len,
+  (flags & VRING_DESC_F_WRITE) ?
+  DMA_FROM_DEVICE : DMA_TO_DEVICE);
+   }
+}
+
+static void vring_unmap_desc_packed(const struct vring_virtqueue *vq,
+  struct vring_packed_desc *desc)
+{
+   u16 flags;
+
+   if (!vring_use_dma_api(vq->vq.vdev))
+   return;
+
+   flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
+
+   if (flags & VRING_DESC_F_INDIRECT) {
+   dma_unmap_single(vring_dma_dev(vq),
+virtio64_to_cpu(vq->vq.vdev, desc->addr),
+virtio32_to_cpu(vq->vq.vdev, desc->len),
+(flags & VRING_DESC_F_WRITE) ?
+DMA_FROM_DEVICE : DMA_TO_DEVICE);
+   } else {
+   dma_unmap_page(vring_dma_dev(vq),
+  virtio64_to_cpu(vq->vq.vdev, desc->addr),
+  virtio32_to_cpu(vq->vq.vdev, desc->len),
+  (flags & VRING_DESC_F_WRITE) ?
+  DMA_FROM_DEVICE : DMA_TO_DEVICE);
+   }
+}
+
+static struct vring_packed_desc *alloc_indirect_packed(struct virtqueue *_vq,
+  unsigned int total_sg,
+  gfp_t gfp)
+{
+   struct vring_packed_desc *desc;
+
+   /*
+* We require lowmem mappings for the descriptors because
+* otherwise virt_to_phys will give us bogus addresses in the
+* virtqueue.
+*/
+   gfp &= ~__GFP_HIGHMEM;
+
+   desc = kmalloc(total_sg * sizeof(struct vring_packed_desc), gfp);
+
+   return desc;
+}
+
 static inline int virtqueue_add_packed(struct virtqueue *_vq,
   struct scatterlist *sgs[],
   unsigned int total_sg,
@@ -767,47 +839,437 @@ static inline int virtqueue_add_packed(struct virtqueue 
*_vq,
   void *ctx,
   gfp_t gfp)
 {
+   struct vring_virtqueue *vq = to_vvq(_vq);
+   struct vring_packed_desc *desc;
+   struct scatterlist *sg;
+   unsigned int i, n, descs_used, uninitialized_var(prev), err_idx;
+   __virtio16 uninitialized_var(head_flags), flags;
+   u16 head, avail_wrap_counter, id, cur;
+   bool indirect;
+
+   START_USE(vq);
+
+   BUG_ON(data == NULL);
+   BUG_ON(ctx && vq->indirect);
+
+   if (unlikely(vq->broken)) {
+   END_USE(vq);
+   return -EIO;
+   }
+
+#ifdef DEBUG
+   {
+   ktime_t now = ktime_get();
+
+   /* No kick or get, with .1 second between?  Warn. */
+   if (vq->last_add_time_valid)
+   WARN_ON(ktime_to_ms(ktime_sub(now, vq->last_add_time))
+   > 100);
+   vq->last_add_time = now;
+  

[RFC v5 0/5] virtio: support packed ring

2018-05-22 Thread Tiwei Bie
Hello everyone,

This RFC implements packed ring support in virtio driver.

Some simple functional tests have been done with Jason's
packed ring implementation in vhost (RFC v4):

https://lkml.org/lkml/2018/5/16/501

Both ping and netperf worked as expected w/ EVENT_IDX
disabled. Ping worked as expected w/ EVENT_IDX enabled,
but netperf didn't (a hack has been added in the driver
to make the netperf test pass in this case; the hack can
be found by `grep -rw XXX` in the code).

TODO:
- Refinements (for code and commit log);
- More tests;
- Bug fixes;

RFC v4 -> RFC v5:
- Save DMA addr, etc in desc state (Jason);
- Track used wrap counter;

RFC v3 -> RFC v4:
- Make ID allocation support out-of-order (Jason);
- Various fixes for EVENT_IDX support;

RFC v2 -> RFC v3:
- Split into small patches (Jason);
- Add helper virtqueue_use_indirect() (Jason);
- Just set id for the last descriptor of a list (Jason);
- Calculate the prev in virtqueue_add_packed() (Jason);
- Fix/improve desc suppression code (Jason/MST);
- Refine the code layout for XXX_split/packed and wrappers (MST);
- Fix the comments and API in uapi (MST);
- Remove the BUG_ON() for indirect (Jason);
- Some other refinements and bug fixes;

RFC v1 -> RFC v2:
- Add indirect descriptor support - compile test only;
- Add event suppression support - compile test only;
- Move vring_packed_init() out of uapi (Jason, MST);
- Merge two loops into one in virtqueue_add_packed() (Jason);
- Split vring_unmap_one() for packed ring and split ring (Jason);
- Avoid using '%' operator (Jason);
- Rename free_head -> next_avail_idx (Jason);
- Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
- Some other refinements and bug fixes;

Thanks!

Tiwei Bie (5):
  virtio: add packed ring definitions
  virtio_ring: support creating packed ring
  virtio_ring: add packed ring support
  virtio_ring: add event idx support in packed ring
  virtio_ring: enable packed ring

 drivers/virtio/virtio_ring.c   | 1354 ++--
 include/linux/virtio_ring.h|8 +-
 include/uapi/linux/virtio_config.h |5 +-
 include/uapi/linux/virtio_ring.h   |   36 +
 4 files changed, 1125 insertions(+), 278 deletions(-)

-- 
2.17.0



[RFC v5 2/5] virtio_ring: support creating packed ring

2018-05-22 Thread Tiwei Bie
This commit introduces support for creating the packed ring.
All split-ring-specific functions get a _split suffix.
Some necessary stubs for the packed ring are also added.

Signed-off-by: Tiwei Bie <tiwei@intel.com>
---
 drivers/virtio/virtio_ring.c | 801 +++
 include/linux/virtio_ring.h  |   8 +-
 2 files changed, 546 insertions(+), 263 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 71458f493cf8..f5ef5f42a7cf 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -61,11 +61,15 @@ struct vring_desc_state {
struct vring_desc *indir_desc;  /* Indirect descriptor, if any. */
 };
 
+struct vring_desc_state_packed {
+   int next;   /* The next desc state. */
+};
+
 struct vring_virtqueue {
struct virtqueue vq;
 
-   /* Actual memory layout for this queue */
-   struct vring vring;
+   /* Is this a packed ring? */
+   bool packed;
 
/* Can we use weak barriers? */
bool weak_barriers;
@@ -87,11 +91,39 @@ struct vring_virtqueue {
/* Last used index we've seen. */
u16 last_used_idx;
 
-   /* Last written value to avail->flags */
-   u16 avail_flags_shadow;
+   union {
+   /* Available for split ring */
+   struct {
+   /* Actual memory layout for this queue. */
+   struct vring vring;
 
-   /* Last written value to avail->idx in guest byte order */
-   u16 avail_idx_shadow;
+   /* Last written value to avail->flags */
+   u16 avail_flags_shadow;
+
+   /* Last written value to avail->idx in
+* guest byte order. */
+   u16 avail_idx_shadow;
+   };
+
+   /* Available for packed ring */
+   struct {
+   /* Actual memory layout for this queue. */
+   struct vring_packed vring_packed;
+
+   /* Driver ring wrap counter. */
+   u8 avail_wrap_counter;
+
+   /* Device ring wrap counter. */
+   u8 used_wrap_counter;
+
+   /* Index of the next avail descriptor. */
+   u16 next_avail_idx;
+
+   /* Last written value to driver->flags in
+* guest byte order. */
+   u16 event_flags_shadow;
+   };
+   };
 
/* How to notify other side. FIXME: commonalize hcalls! */
bool (*notify)(struct virtqueue *vq);
@@ -111,11 +143,24 @@ struct vring_virtqueue {
 #endif
 
/* Per-descriptor state. */
-   struct vring_desc_state desc_state[];
+   union {
+   struct vring_desc_state desc_state[1];
+   struct vring_desc_state_packed desc_state_packed[1];
+   };
 };
 
 #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
 
+static inline bool virtqueue_use_indirect(struct virtqueue *_vq,
+ unsigned int total_sg)
+{
+   struct vring_virtqueue *vq = to_vvq(_vq);
+
+   /* If the host supports indirect descriptor tables, and we have multiple
+* buffers, then go indirect. FIXME: tune this threshold */
+   return (vq->indirect && total_sg > 1 && vq->vq.num_free);
+}
+
 /*
  * Modern virtio devices have feature bits to specify whether they need a
  * quirk and bypass the IOMMU. If not there, just use the DMA API.
@@ -201,8 +246,17 @@ static dma_addr_t vring_map_single(const struct 
vring_virtqueue *vq,
  cpu_addr, size, direction);
 }
 
-static void vring_unmap_one(const struct vring_virtqueue *vq,
-   struct vring_desc *desc)
+static int vring_mapping_error(const struct vring_virtqueue *vq,
+  dma_addr_t addr)
+{
+   if (!vring_use_dma_api(vq->vq.vdev))
+   return 0;
+
+   return dma_mapping_error(vring_dma_dev(vq), addr);
+}
+
+static void vring_unmap_one_split(const struct vring_virtqueue *vq,
+ struct vring_desc *desc)
 {
u16 flags;
 
@@ -226,17 +280,9 @@ static void vring_unmap_one(const struct vring_virtqueue 
*vq,
}
 }
 
-static int vring_mapping_error(const struct vring_virtqueue *vq,
-  dma_addr_t addr)
-{
-   if (!vring_use_dma_api(vq->vq.vdev))
-   return 0;
-
-   return dma_mapping_error(vring_dma_dev(vq), addr);
-}
-
-static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
-unsigned int total_sg, gfp_t gfp)
+static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq,
+  unsigned int total_sg,
+  

[RFC v5 5/5] virtio_ring: enable packed ring

2018-05-22 Thread Tiwei Bie
Signed-off-by: Tiwei Bie <tiwei@intel.com>
---
 drivers/virtio/virtio_ring.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 1ee52a89cb04..355dfef4f1eb 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -1956,6 +1956,8 @@ void vring_transport_features(struct virtio_device *vdev)
break;
case VIRTIO_F_IOMMU_PLATFORM:
break;
+   case VIRTIO_F_RING_PACKED:
+   break;
default:
/* We don't understand this bit. */
__virtio_clear_bit(vdev, i);
-- 
2.17.0



[RFC v5 4/5] virtio_ring: add event idx support in packed ring

2018-05-22 Thread Tiwei Bie
This commit introduces EVENT_IDX support in the packed ring.

Signed-off-by: Tiwei Bie <tiwei@intel.com>
---
 drivers/virtio/virtio_ring.c | 67 ++--
 1 file changed, 64 insertions(+), 3 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index eb9fd5207a68..1ee52a89cb04 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -1043,7 +1043,7 @@ static inline int virtqueue_add_packed(struct virtqueue 
*_vq,
 static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
 {
struct vring_virtqueue *vq = to_vvq(_vq);
-   u16 flags;
+   u16 new, old, off_wrap, flags, wrap_counter, event_idx;
bool needs_kick;
u32 snapshot;
 
@@ -1052,9 +1052,19 @@ static bool virtqueue_kick_prepare_packed(struct 
virtqueue *_vq)
 * suppressions. */
virtio_mb(vq->weak_barriers);
 
+   old = vq->next_avail_idx - vq->num_added;
+   new = vq->next_avail_idx;
+   vq->num_added = 0;
+
snapshot = *(u32 *)vq->vring_packed.device;
	off_wrap = virtio16_to_cpu(_vq->vdev, (__virtio16)(snapshot & 0xffff));
flags = virtio16_to_cpu(_vq->vdev, (__virtio16)(snapshot >> 16)) & 0x3;
 
+   wrap_counter = off_wrap >> 15;
+   event_idx = off_wrap & ~(1<<15);
+   if (wrap_counter != vq->avail_wrap_counter)
+   event_idx -= vq->vring_packed.num;
+
 #ifdef DEBUG
if (vq->last_add_time_valid) {
WARN_ON(ktime_to_ms(ktime_sub(ktime_get(),
@@ -1063,7 +1073,10 @@ static bool virtqueue_kick_prepare_packed(struct 
virtqueue *_vq)
vq->last_add_time_valid = false;
 #endif
 
-   needs_kick = (flags != VRING_EVENT_F_DISABLE);
+   if (flags == VRING_EVENT_F_DESC)
+   needs_kick = vring_need_event(event_idx, new, old);
+   else
+   needs_kick = (flags != VRING_EVENT_F_DISABLE);
END_USE(vq);
return needs_kick;
 }
@@ -1170,6 +1183,15 @@ static void *virtqueue_get_buf_ctx_packed(struct 
virtqueue *_vq,
ret = vq->desc_state_packed[id].data;
detach_buf_packed(vq, id, ctx);
 
+   /* If we expect an interrupt for the next entry, tell host
+* by writing event index and flush out the write before
+* the read in the next get_buf call. */
+   if (vq->event_flags_shadow == VRING_EVENT_F_DESC)
+   virtio_store_mb(vq->weak_barriers,
+   &vq->vring_packed.driver->off_wrap,
+   cpu_to_virtio16(_vq->vdev, vq->last_used_idx |
+   ((u16)vq->used_wrap_counter << 15)));
+
 #ifdef DEBUG
vq->last_add_time_valid = false;
 #endif
@@ -1197,10 +1219,26 @@ static unsigned 
virtqueue_enable_cb_prepare_packed(struct virtqueue *_vq)
 
/* We optimistically turn back on interrupts, then check if there was
 * more to do. */
+   /* Depending on the VIRTIO_RING_F_EVENT_IDX feature, we need to
+* either clear the flags bit or point the event index at the next
+* entry. Always update the event index to keep code simple. */
+
+   vq->vring_packed.driver->off_wrap = cpu_to_virtio16(_vq->vdev,
+   vq->last_used_idx |
+   ((u16)vq->used_wrap_counter << 15));
 
if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
virtio_wmb(vq->weak_barriers);
+   // XXX XXX XXX
+   // Setting VRING_EVENT_F_DESC in this function
+   // will break netperf test for now.
+   // Will need to look into this.
+#if 0
+   vq->event_flags_shadow = vq->event ? VRING_EVENT_F_DESC :
+VRING_EVENT_F_ENABLE;
+#else
vq->event_flags_shadow = VRING_EVENT_F_ENABLE;
+#endif
vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
vq->event_flags_shadow);
}
@@ -1227,15 +1265,38 @@ static bool virtqueue_poll_packed(struct virtqueue 
*_vq, unsigned last_used_idx)
 static bool virtqueue_enable_cb_delayed_packed(struct virtqueue *_vq)
 {
struct vring_virtqueue *vq = to_vvq(_vq);
+   u16 bufs, used_idx, wrap_counter;
 
START_USE(vq);
 
/* We optimistically turn back on interrupts, then check if there was
 * more to do. */
+   /* Depending on the VIRTIO_RING_F_EVENT_IDX feature, we need to
+* either clear the flags bit or point the event index at the next
+* entry. Always update the event index to keep code simple. */
+
+   /* TODO: tune this threshold */
+   if (vq->next_avail_idx < vq->last_used_idx)
+   bufs = (vq->vring_packed.num + vq->next_avail_idx -
+

Re: [RFC v4 3/5] virtio_ring: add packed ring support

2018-05-20 Thread Tiwei Bie
On Mon, May 21, 2018 at 10:30:51AM +0800, Jason Wang wrote:
> On 2018年05月19日 10:29, Tiwei Bie wrote:
> > > I don't hope so.
> > > 
> > > > I agreed driver should track the DMA addrs or some
> > > > other necessary things from the very beginning. And
> > > > I also repeated the spec to emphasize that it does
> > > > make sense. And I'd like to do that.
> > > > 
> > > > What I was saying is that, to support OOO, we may
> > > > need to manage these context (which saves DMA addrs
> > > > etc) via a list which is similar to the desc list
> > > > maintained via `next` in split ring instead of an
> > > > array whose elements always can be indexed directly.
> > > My point is these context is a must (not only for OOO).
> > Yeah, and I have exactly the same point after you
> > pointed out that I shouldn't get the addrs from the descs.
> > I do think it makes sense. I'll do it in the next
> > version. I don't have any doubt about it. All my
> > questions are about OOO, not about whether we
> > should save the context or not. It just seems that you
> > thought I didn't want to do it, and were trying to
> > convince me that I should do it.
> 
> Right, but looks like I was wrong :)
> 
> > 
> > > > The desc ring in the split ring is an array, but its
> > > > free entries are managed as a list via next. I was
> > > > just wondering, do we want to manage such a list
> > > > because of OOO? It's just a very simple question
> > > > on which I want to hear your opinion... (It doesn't
> > > > mean anything, e.g. it doesn't mean I don't want
> > > > to support OOO. It's just a simple question...)
> > > So the answer is yes. But I admit I don't have a better idea other than
> > > what you propose here (something like the split ring, which is a little
> > > bit sad). Maybe Michael has.
> > Yeah, that's why I asked this question. It will
> > make the packed ring a bit similar to the split ring,
> > at least in the driver part. So I want to draw
> > your attention to this to make sure that we're
> > on the same page.
> 
> Yes. I think we are.

Cool. Glad to hear that! Thanks! :)

Best regards,
Tiwei Bie

> 
> Thanks
> 
> > Best regards,
> > Tiwei Bie
> > 
> 


Re: [RFC v4 3/5] virtio_ring: add packed ring support

2018-05-18 Thread Tiwei Bie
On Sat, May 19, 2018 at 09:12:30AM +0800, Jason Wang wrote:
> On 2018年05月18日 22:33, Tiwei Bie wrote:
> > On Fri, May 18, 2018 at 09:17:05PM +0800, Jason Wang wrote:
> > > On 2018年05月18日 19:29, Tiwei Bie wrote:
> > > > On Thu, May 17, 2018 at 08:01:52PM +0800, Jason Wang wrote:
> > > > > On 2018年05月16日 22:33, Tiwei Bie wrote:
> > > > > > On Wed, May 16, 2018 at 10:05:44PM +0800, Jason Wang wrote:
> > > > > > > On 2018年05月16日 21:45, Tiwei Bie wrote:
> > > > > > > > On Wed, May 16, 2018 at 08:51:43PM +0800, Jason Wang wrote:
> > > > > > > > > On 2018年05月16日 20:39, Tiwei Bie wrote:
> > > > > > > > > > On Wed, May 16, 2018 at 07:50:16PM +0800, Jason Wang wrote:
> > > > > > > > > > > On 2018年05月16日 16:37, Tiwei Bie wrote:
> > > > > > [...]
> > > > > > > > > > > > +static void detach_buf_packed(struct vring_virtqueue 
> > > > > > > > > > > > *vq, unsigned int head,
> > > > > > > > > > > > + unsigned int id, void 
> > > > > > > > > > > > **ctx)
> > > > > > > > > > > > +{
> > > > > > > > > > > > +   struct vring_packed_desc *desc;
> > > > > > > > > > > > +   unsigned int i, j;
> > > > > > > > > > > > +
> > > > > > > > > > > > +   /* Clear data ptr. */
> > > > > > > > > > > > +   vq->desc_state[id].data = NULL;
> > > > > > > > > > > > +
> > > > > > > > > > > > +   i = head;
> > > > > > > > > > > > +
> > > > > > > > > > > > +   for (j = 0; j < vq->desc_state[id].num; j++) {
> > > > > > > > > > > > +   desc = &vq->vring_packed.desc[i];
> > > > > > > > > > > > +   vring_unmap_one_packed(vq, desc);
> > > > > > > > > > > As mentioned in previous discussion, this probably won't 
> > > > > > > > > > > work for the case
> > > > > > > > > > > of out of order completion since it depends on the 
> > > > > > > > > > > information in the
> > > > > > > > > > > descriptor ring. We probably need to extend ctx to record 
> > > > > > > > > > > such information.
> > > > > > > > > > Above code doesn't depend on the information in the 
> > > > > > > > > > descriptor
> > > > > > > > > > ring. The vq->desc_state[] is the extended ctx.
> > > > > > > > > > 
> > > > > > > > > > Best regards,
> > > > > > > > > > Tiwei Bie
> > > > > > > > > Yes, but desc is a pointer to descriptor ring I think so
> > > > > > > > > vring_unmap_one_packed() still depends on the content of 
> > > > > > > > > descriptor ring?
> > > > > > > > > 
> > > > > > > > I got your point now. I think it makes sense to reserve
> > > > > > > > the bits of the addr field. Driver shouldn't try to get
> > > > > > > > addrs from the descriptors when cleanup the descriptors
> > > > > > > > no matter whether we support out-of-order or not.
> > > > > > > Maybe I was wrong, but I remember spec mentioned something like 
> > > > > > > this.
> > > > > > You're right. Spec mentioned this. I was just repeating
> > > > > > the spec to emphasize that it does make sense. :)
> > > > > > 
> > > > > > > > But combining it with the out-of-order support, it will
> > > > > > > > mean that the driver still needs to maintain a desc/ctx
> > > > > > > > list that is very similar to the desc ring in the split
> > > > > > > > ring. I'm not quite sure whether it's something we want.
> > > > > > > > If it is true, I'll do it. So do you think we also want
> > > > > > > > to maintain such a desc/ctx list for packed ring?

Re: [RFC v4 3/5] virtio_ring: add packed ring support

2018-05-18 Thread Tiwei Bie
On Fri, May 18, 2018 at 09:17:05PM +0800, Jason Wang wrote:
> On 2018年05月18日 19:29, Tiwei Bie wrote:
> > On Thu, May 17, 2018 at 08:01:52PM +0800, Jason Wang wrote:
> > > On 2018年05月16日 22:33, Tiwei Bie wrote:
> > > > On Wed, May 16, 2018 at 10:05:44PM +0800, Jason Wang wrote:
> > > > > On 2018年05月16日 21:45, Tiwei Bie wrote:
> > > > > > On Wed, May 16, 2018 at 08:51:43PM +0800, Jason Wang wrote:
> > > > > > > On 2018年05月16日 20:39, Tiwei Bie wrote:
> > > > > > > > On Wed, May 16, 2018 at 07:50:16PM +0800, Jason Wang wrote:
> > > > > > > > > On 2018年05月16日 16:37, Tiwei Bie wrote:
> > > > [...]
> > > > > > > > > > +static void detach_buf_packed(struct vring_virtqueue *vq, 
> > > > > > > > > > unsigned int head,
> > > > > > > > > > + unsigned int id, void **ctx)
> > > > > > > > > > +{
> > > > > > > > > > +   struct vring_packed_desc *desc;
> > > > > > > > > > +   unsigned int i, j;
> > > > > > > > > > +
> > > > > > > > > > +   /* Clear data ptr. */
> > > > > > > > > > +   vq->desc_state[id].data = NULL;
> > > > > > > > > > +
> > > > > > > > > > +   i = head;
> > > > > > > > > > +
> > > > > > > > > > +   for (j = 0; j < vq->desc_state[id].num; j++) {
> > > > > > > > > > +   desc = &vq->vring_packed.desc[i];
> > > > > > > > > > +   vring_unmap_one_packed(vq, desc);
> > > > > > > > > As mentioned in previous discussion, this probably won't work 
> > > > > > > > > for the case
> > > > > > > > > of out of order completion since it depends on the 
> > > > > > > > > information in the
> > > > > > > > > descriptor ring. We probably need to extend ctx to record 
> > > > > > > > > such information.
> > > > > > > > Above code doesn't depend on the information in the descriptor
> > > > > > > > ring. The vq->desc_state[] is the extended ctx.
> > > > > > > > 
> > > > > > > > Best regards,
> > > > > > > > Tiwei Bie
> > > > > > > Yes, but desc is a pointer to descriptor ring I think so
> > > > > > > vring_unmap_one_packed() still depends on the content of 
> > > > > > > descriptor ring?
> > > > > > > 
> > > > > > I got your point now. I think it makes sense to reserve
> > > > > > the bits of the addr field. Driver shouldn't try to get
> > > > > > addrs from the descriptors when cleanup the descriptors
> > > > > > no matter whether we support out-of-order or not.
> > > > > Maybe I was wrong, but I remember spec mentioned something like this.
> > > > You're right. Spec mentioned this. I was just repeating
> > > > the spec to emphasize that it does make sense. :)
> > > > 
> > > > > > But combining it with the out-of-order support, it will
> > > > > > mean that the driver still needs to maintain a desc/ctx
> > > > > > list that is very similar to the desc ring in the split
> > > > > > ring. I'm not quite sure whether it's something we want.
> > > > > > If it is true, I'll do it. So do you think we also want
> > > > > > to maintain such a desc/ctx list for packed ring?
> > > > > To make it work for OOO backends I think we need something like this
> > > > > (hardware NIC drivers are usually have something like this).
> > > > Which hardware NIC drivers have this?
> > > It's quite common I think, e.g driver track e.g dma addr and page frag
> > > somewhere. e.g the ring->rx_info in mlx4 driver.
> > It seems that I had a misunderstanding on your
> > previous comments. I know it's quite common for
> > drivers to track e.g. DMA addrs somewhere (and
> > I think one reason behind this is that they want
> > to reuse the bits of addr field).
> 
> Yes, we may want this for virtio-net as well in the future.
> 
> >   But tracking
> > addrs somewhere doesn't means supporting OOO.
> > I thought you

Re: [RFC v4 3/5] virtio_ring: add packed ring support

2018-05-18 Thread Tiwei Bie
On Thu, May 17, 2018 at 08:01:52PM +0800, Jason Wang wrote:
> On 2018年05月16日 22:33, Tiwei Bie wrote:
> > On Wed, May 16, 2018 at 10:05:44PM +0800, Jason Wang wrote:
> > > On 2018年05月16日 21:45, Tiwei Bie wrote:
> > > > On Wed, May 16, 2018 at 08:51:43PM +0800, Jason Wang wrote:
> > > > > On 2018年05月16日 20:39, Tiwei Bie wrote:
> > > > > > On Wed, May 16, 2018 at 07:50:16PM +0800, Jason Wang wrote:
> > > > > > > On 2018年05月16日 16:37, Tiwei Bie wrote:
> > [...]
> > > > > > > > +static void detach_buf_packed(struct vring_virtqueue *vq, 
> > > > > > > > unsigned int head,
> > > > > > > > + unsigned int id, void **ctx)
> > > > > > > > +{
> > > > > > > > +   struct vring_packed_desc *desc;
> > > > > > > > +   unsigned int i, j;
> > > > > > > > +
> > > > > > > > +   /* Clear data ptr. */
> > > > > > > > +   vq->desc_state[id].data = NULL;
> > > > > > > > +
> > > > > > > > +   i = head;
> > > > > > > > +
> > > > > > > > +   for (j = 0; j < vq->desc_state[id].num; j++) {
> > > > > > > > +   desc = &vq->vring_packed.desc[i];
> > > > > > > > +   vring_unmap_one_packed(vq, desc);
> > > > > > > As mentioned in previous discussion, this probably won't work for 
> > > > > > > the case
> > > > > > > of out of order completion since it depends on the information in 
> > > > > > > the
> > > > > > > descriptor ring. We probably need to extend ctx to record such 
> > > > > > > information.
> > > > > > Above code doesn't depend on the information in the descriptor
> > > > > > ring. The vq->desc_state[] is the extended ctx.
> > > > > > 
> > > > > > Best regards,
> > > > > > Tiwei Bie
> > > > > Yes, but desc is a pointer to descriptor ring I think so
> > > > > vring_unmap_one_packed() still depends on the content of descriptor 
> > > > > ring?
> > > > > 
> > > > I got your point now. I think it makes sense to reserve
> > > > the bits of the addr field. Driver shouldn't try to get
> > > > addrs from the descriptors when cleanup the descriptors
> > > > no matter whether we support out-of-order or not.
> > > Maybe I was wrong, but I remember spec mentioned something like this.
> > You're right. Spec mentioned this. I was just repeating
> > the spec to emphasize that it does make sense. :)
> > 
> > > > But combining it with the out-of-order support, it will
> > > > mean that the driver still needs to maintain a desc/ctx
> > > > list that is very similar to the desc ring in the split
> > > > ring. I'm not quite sure whether it's something we want.
> > > > If it is true, I'll do it. So do you think we also want
> > > > to maintain such a desc/ctx list for packed ring?
> > > To make it work for OOO backends I think we need something like this
> > > (hardware NIC drivers are usually have something like this).
> > Which hardware NIC drivers have this?
> 
> It's quite common I think, e.g. drivers track dma addr and page frag
> somewhere, e.g. the ring->rx_info in the mlx4 driver.

It seems that I had a misunderstanding of your
previous comments. I know it's quite common for
drivers to track e.g. DMA addrs somewhere (and
I think one reason behind this is that they want
to reuse the bits of the addr field). But tracking
addrs somewhere doesn't mean supporting OOO.
I thought you were saying it's quite common for
hardware NIC drivers to support OOO (i.e. NICs
will return the descriptors OOO):

I'm not familiar with mlx4, so maybe I'm wrong.
I just had a quick glance, and I found the below
comment in mlx4_en_process_rx_cq():

```
/* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx
 * descriptor offset can be deduced from the CQE index instead of
 * reading 'cqe->index' */
index = cq->mcq.cons_index & ring->size_mask;
cqe = mlx4_en_get_cqe(cq->buf, index, priv->cqe_size) + factor;
```

It seems that although they have a completion
queue, they still use the ring in order.

I guess storage devices may want OOO.

Best regards,
Tiwei Bie

> 
> Thanks
> 
> > 
> > > Not for the patch, but it looks like having a OUT_OF_ORDER feature bit is
> > > much more simpler to be started with.
> > +1
> > 
> > Best regards,
> > Tiwei Bie
> 


Re: [RFC v4 3/5] virtio_ring: add packed ring support

2018-05-16 Thread Tiwei Bie
On Wed, May 16, 2018 at 10:05:44PM +0800, Jason Wang wrote:
> On 2018年05月16日 21:45, Tiwei Bie wrote:
> > On Wed, May 16, 2018 at 08:51:43PM +0800, Jason Wang wrote:
> > > On 2018年05月16日 20:39, Tiwei Bie wrote:
> > > > On Wed, May 16, 2018 at 07:50:16PM +0800, Jason Wang wrote:
> > > > > On 2018年05月16日 16:37, Tiwei Bie wrote:
[...]
> > > > > > +static void detach_buf_packed(struct vring_virtqueue *vq, unsigned 
> > > > > > int head,
> > > > > > + unsigned int id, void **ctx)
> > > > > > +{
> > > > > > +   struct vring_packed_desc *desc;
> > > > > > +   unsigned int i, j;
> > > > > > +
> > > > > > +   /* Clear data ptr. */
> > > > > > +   vq->desc_state[id].data = NULL;
> > > > > > +
> > > > > > +   i = head;
> > > > > > +
> > > > > > +   for (j = 0; j < vq->desc_state[id].num; j++) {
> > > > > > +   desc = &vq->vring_packed.desc[i];
> > > > > > +   vring_unmap_one_packed(vq, desc);
> > > > > As mentioned in previous discussion, this probably won't work for the 
> > > > > case
> > > > > of out of order completion since it depends on the information in the
> > > > > descriptor ring. We probably need to extend ctx to record such 
> > > > > information.
> > > > Above code doesn't depend on the information in the descriptor
> > > > ring. The vq->desc_state[] is the extended ctx.
> > > > 
> > > > Best regards,
> > > > Tiwei Bie
> > > Yes, but desc is a pointer to descriptor ring I think so
> > > vring_unmap_one_packed() still depends on the content of descriptor ring?
> > > 
> > I got your point now. I think it makes sense to reserve
> > the bits of the addr field. Driver shouldn't try to get
> > addrs from the descriptors when cleanup the descriptors
> > no matter whether we support out-of-order or not.
> 
> Maybe I was wrong, but I remember spec mentioned something like this.

You're right. Spec mentioned this. I was just repeating
the spec to emphasize that it does make sense. :)

> 
> > 
> > But combining it with the out-of-order support, it will
> > mean that the driver still needs to maintain a desc/ctx
> > list that is very similar to the desc ring in the split
> > ring. I'm not quite sure whether it's something we want.
> > If it is true, I'll do it. So do you think we also want
> > to maintain such a desc/ctx list for packed ring?
> 
> To make it work for OOO backends I think we need something like this
> (hardware NIC drivers usually have something like this).

Which hardware NIC drivers have this?

> 
> Not for the patch, but it looks like having an OUT_OF_ORDER feature bit is
> much simpler to start with.

+1

Best regards,
Tiwei Bie


Re: [RFC v4 3/5] virtio_ring: add packed ring support

2018-05-16 Thread Tiwei Bie
On Wed, May 16, 2018 at 08:51:43PM +0800, Jason Wang wrote:
> On 2018年05月16日 20:39, Tiwei Bie wrote:
> > On Wed, May 16, 2018 at 07:50:16PM +0800, Jason Wang wrote:
> > > On 2018年05月16日 16:37, Tiwei Bie wrote:
> > [...]
> > > >struct vring_virtqueue {
> > > > @@ -116,6 +117,9 @@ struct vring_virtqueue {
> > > > /* Last written value to driver->flags in
> > > >  * guest byte order. */
> > > > u16 event_flags_shadow;
> > > > +
> > > > +   /* ID allocation. */
> > > > +   struct idr buffer_id;
> > > I'm not sure idr is fit for the performance critical case here. Need to
> > > measure its performance impact, especially if we have few unused slots.
> > I'm also not sure.. But fortunately, it should be quite easy
> > to replace it with something else without changing other code.
> > If it will really hurt the performance, I'll change it.
> 
> We may want to do some benchmarking/profiling to see.

Yeah!

> 
> > 
> > > > };
> > > > };
> > [...]
> > > > +static void detach_buf_packed(struct vring_virtqueue *vq, unsigned int 
> > > > head,
> > > > + unsigned int id, void **ctx)
> > > > +{
> > > > +   struct vring_packed_desc *desc;
> > > > +   unsigned int i, j;
> > > > +
> > > > +   /* Clear data ptr. */
> > > > +   vq->desc_state[id].data = NULL;
> > > > +
> > > > +   i = head;
> > > > +
> > > > +   for (j = 0; j < vq->desc_state[id].num; j++) {
> > > > +   desc = &vq->vring_packed.desc[i];
> > > > +   vring_unmap_one_packed(vq, desc);
> > > As mentioned in previous discussion, this probably won't work for the case
> > > of out of order completion since it depends on the information in the
> > > descriptor ring. We probably need to extend ctx to record such 
> > > information.
> > Above code doesn't depend on the information in the descriptor
> > ring. The vq->desc_state[] is the extended ctx.
> > 
> > Best regards,
> > Tiwei Bie
> 
> Yes, but desc is a pointer to descriptor ring I think so
> vring_unmap_one_packed() still depends on the content of descriptor ring?
> 

I got your point now. I think it makes sense to reserve
the bits of the addr field. The driver shouldn't try to get
addrs from the descriptors when cleaning up the descriptors,
no matter whether we support out-of-order or not.

But combining it with the out-of-order support, it will
mean that the driver still needs to maintain a desc/ctx
list that is very similar to the desc ring in the split
ring. I'm not quite sure whether that's something we want.
If it is, I'll do it. So do you think we also want
to maintain such a desc/ctx list for the packed ring?

Best regards,
Tiwei Bie


Re: [RFC v4 4/5] virtio_ring: add event idx support in packed ring

2018-05-16 Thread Tiwei Bie
On Wed, May 16, 2018 at 08:17:21PM +0800, Jason Wang wrote:
> On 2018年05月16日 16:37, Tiwei Bie wrote:
[...]
> > @@ -1160,15 +1186,27 @@ static void virtqueue_disable_cb_packed(struct 
> > virtqueue *_vq)
> >   static unsigned virtqueue_enable_cb_prepare_packed(struct virtqueue *_vq)
> >   {
> > struct vring_virtqueue *vq = to_vvq(_vq);
> > +   u16 wrap_counter;
> > START_USE(vq);
> > /* We optimistically turn back on interrupts, then check if there was
> >  * more to do. */
> > +   /* Depending on the VIRTIO_RING_F_USED_EVENT_IDX feature, we need to
> > +* either clear the flags bit or point the event index at the next
> > +* entry. Always update the event index to keep code simple. */
> > +
> > +   wrap_counter = vq->wrap_counter;
> > +   if (vq->last_used_idx > vq->next_avail_idx)
> 
> Should this be ">=", considering rx refill may try to completely fill the ring?

It seems that there are two cases in which last_used_idx
equals next_avail_idx: the ring is empty, or the ring
is full. Although in the first case, most probably,
the driver won't enable the interrupt.

Maybe I really should track the used_wrap_counter
instead of calculating it each time I need it... I'll
give it a try.
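
Roughly, the tracking I have in mind would look like this when
advancing last_used_idx in get_buf (a sketch only, not tested;
used_wrap_counter would be a new field):

```
/* Sketch only: advance last_used_idx by the number of descriptors just
 * consumed and flip the used wrap counter whenever the index wraps
 * past the end of the ring. */
vq->last_used_idx += vq->desc_state[id].num;
if (vq->last_used_idx >= vq->vring_packed.num) {
	vq->last_used_idx -= vq->vring_packed.num;
	vq->used_wrap_counter ^= 1;
}
```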

> 
> > +   wrap_counter ^= 1;
> > +
> > +   vq->vring_packed.driver->off_wrap = cpu_to_virtio16(_vq->vdev,
> > +   vq->last_used_idx | (wrap_counter << 15));
> > if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
> > virtio_wmb(vq->weak_barriers);
> > -   vq->event_flags_shadow = VRING_EVENT_F_ENABLE;
> > +   vq->event_flags_shadow = vq->event ? VRING_EVENT_F_DESC :
> > +VRING_EVENT_F_ENABLE;
> > vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
> > vq->event_flags_shadow);
> > }
> > @@ -1194,15 +1232,40 @@ static bool virtqueue_poll_packed(struct virtqueue 
> > *_vq, unsigned last_used_idx)
> >   static bool virtqueue_enable_cb_delayed_packed(struct virtqueue *_vq)
> >   {
> > struct vring_virtqueue *vq = to_vvq(_vq);
> > +   u16 bufs, used_idx, wrap_counter;
> > START_USE(vq);
> > /* We optimistically turn back on interrupts, then check if there was
> >  * more to do. */
> > +   /* Depending on the VIRTIO_RING_F_USED_EVENT_IDX feature, we need to
> > +* either clear the flags bit or point the event index at the next
> > +* entry. Always update the event index to keep code simple. */
> > +
> > +   /* TODO: tune this threshold */
> > +   if (vq->next_avail_idx < vq->last_used_idx)
> > +   bufs = (vq->vring_packed.num + vq->next_avail_idx -
> > +   vq->last_used_idx) * 3 / 4;
> > +   else
> > +   bufs = (vq->next_avail_idx - vq->last_used_idx) * 3 / 4;
> > +
> > +   wrap_counter = vq->wrap_counter;
> > +   if (vq->last_used_idx > vq->next_avail_idx)
> > +   wrap_counter ^= 1;
> > +
> > +   used_idx = vq->last_used_idx + bufs;
> > +   if (used_idx >= vq->vring_packed.num) {
> > +   used_idx -= vq->vring_packed.num;
> > +   wrap_counter ^= 1;
> > +   }
> 
> Looks correct but maybe it's better to add some comments for such logic.

Makes sense.
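
For example, something along these lines (my wording for the comment,
not the final patch):

```
/* last_used_idx + bufs may run past the end of the ring.  In that
 * case the event index wraps back to the beginning, and the wrap
 * counter that will be current when the device reaches it is the
 * inverse of the current one, so flip it together with the index. */
used_idx = vq->last_used_idx + bufs;
if (used_idx >= vq->vring_packed.num) {
	used_idx -= vq->vring_packed.num;
	wrap_counter ^= 1;
}
```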

> 
> > +
> > +   vq->vring_packed.driver->off_wrap = cpu_to_virtio16(_vq->vdev,
> > +   used_idx | (wrap_counter << 15));
> > if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
> > virtio_wmb(vq->weak_barriers);
> > -   vq->event_flags_shadow = VRING_EVENT_F_ENABLE;
> > +   vq->event_flags_shadow = vq->event ? VRING_EVENT_F_DESC :
> > +VRING_EVENT_F_ENABLE;
> >     vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
> > vq->event_flags_shadow);
> > }
> > @@ -1869,8 +1932,10 @@ void vring_transport_features(struct virtio_device 
> > *vdev)
> > switch (i) {
> > case VIRTIO_RING_F_INDIRECT_DESC:
> > break;
> > +#if 0
> > case VIRTIO_RING_F_EVENT_IDX:
> > break;
> > +#endif
> 
> Maybe it's time to remove this #if 0.

Will do it.

Thanks for the review!

Best regards,
Tiwei Bie


Re: [RFC v4 3/5] virtio_ring: add packed ring support

2018-05-16 Thread Tiwei Bie
On Wed, May 16, 2018 at 07:50:16PM +0800, Jason Wang wrote:
> On 2018年05月16日 16:37, Tiwei Bie wrote:
[...]
> >   struct vring_virtqueue {
> > @@ -116,6 +117,9 @@ struct vring_virtqueue {
> > /* Last written value to driver->flags in
> >  * guest byte order. */
> > u16 event_flags_shadow;
> > +
> > +   /* ID allocation. */
> > +   struct idr buffer_id;
> 
> I'm not sure idr is fit for the performance critical case here. Need to
> measure its performance impact, especially if we have few unused slots.

I'm also not sure... But fortunately, it should be quite easy
to replace it with something else without changing other code.
If it really hurts performance, I'll change it.

> 
> > };
> > };
[...]
> > +static void detach_buf_packed(struct vring_virtqueue *vq, unsigned int 
> > head,
> > + unsigned int id, void **ctx)
> > +{
> > +   struct vring_packed_desc *desc;
> > +   unsigned int i, j;
> > +
> > +   /* Clear data ptr. */
> > +   vq->desc_state[id].data = NULL;
> > +
> > +   i = head;
> > +
> > +   for (j = 0; j < vq->desc_state[id].num; j++) {
> > +   desc = &vq->vring_packed.desc[i];
> > +   vring_unmap_one_packed(vq, desc);
> 
> As mentioned in previous discussion, this probably won't work for the case
> of out of order completion since it depends on the information in the
> descriptor ring. We probably need to extend ctx to record such information.

Above code doesn't depend on the information in the descriptor
ring. The vq->desc_state[] is the extended ctx.

Best regards,
Tiwei Bie


Re: [RFC v4 5/5] virtio_ring: enable packed ring

2018-05-16 Thread Tiwei Bie
On Wed, May 16, 2018 at 02:42:53PM +0300, Sergei Shtylyov wrote:
> On 05/16/2018 01:21 PM, Tiwei Bie wrote:
> 
> >>> Signed-off-by: Tiwei Bie <tiwei@intel.com>
> >>> ---
> >>>   drivers/virtio/virtio_ring.c | 2 ++
> >>>   1 file changed, 2 insertions(+)
> >>>
> >>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> >>> index de3839f3621a..b158692263b0 100644
> >>> --- a/drivers/virtio/virtio_ring.c
> >>> +++ b/drivers/virtio/virtio_ring.c
> >>> @@ -1940,6 +1940,8 @@ void vring_transport_features(struct virtio_device 
> >>> *vdev)
> >>>   break;
> >>>   case VIRTIO_F_IOMMU_PLATFORM:
> >>>   break;
> >>> + case VIRTIO_F_RING_PACKED:
> >>> + break;
> >>
> >>Why not just add this *case* under the previous *case*?
> > 
> > Do you mean fallthrough? Something like:
> > 
> > case VIRTIO_F_IOMMU_PLATFORM:
> > case VIRTIO_F_RING_PACKED:
> > break;
> 
>Yes, exactly. :-)

Using fallthrough in this case will make the code more
compact. I like this coding style. But unfortunately,
it's not consistent with the existing code. :(

The whole function will become something like this:

void vring_transport_features(struct virtio_device *vdev)
{
unsigned int i;

for (i = VIRTIO_TRANSPORT_F_START; i < VIRTIO_TRANSPORT_F_END; i++) {
switch (i) {
case VIRTIO_RING_F_INDIRECT_DESC:
break;
case VIRTIO_RING_F_EVENT_IDX:
break;
case VIRTIO_F_VERSION_1:
break;
case VIRTIO_F_IOMMU_PLATFORM:
case VIRTIO_F_RING_PACKED:
break;
default:
/* We don't understand this bit. */
__virtio_clear_bit(vdev, i);
}
}
}

Best regards,
Tiwei Bie

> 
> > Best regards,
> > Tiwei Bie
> 
> [...]
> 
> MBR, Sergei
> 


Re: [RFC v4 5/5] virtio_ring: enable packed ring

2018-05-16 Thread Tiwei Bie
On Wed, May 16, 2018 at 01:15:48PM +0300, Sergei Shtylyov wrote:
> On 5/16/2018 11:37 AM, Tiwei Bie wrote:
> 
> > Signed-off-by: Tiwei Bie <tiwei@intel.com>
> > ---
> >   drivers/virtio/virtio_ring.c | 2 ++
> >   1 file changed, 2 insertions(+)
> > 
> > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > index de3839f3621a..b158692263b0 100644
> > --- a/drivers/virtio/virtio_ring.c
> > +++ b/drivers/virtio/virtio_ring.c
> > @@ -1940,6 +1940,8 @@ void vring_transport_features(struct virtio_device 
> > *vdev)
> > break;
> > case VIRTIO_F_IOMMU_PLATFORM:
> > break;
> > +   case VIRTIO_F_RING_PACKED:
> > +   break;
> 
>Why not just add this *case* under the previous *case*?

Do you mean fallthrough? Something like:

case VIRTIO_F_IOMMU_PLATFORM:
case VIRTIO_F_RING_PACKED:
break;

Best regards,
Tiwei Bie

> 
> > default:
> > /* We don't understand this bit. */
> > __virtio_clear_bit(vdev, i);
> 
> MBR, Sergei


[RFC v4 0/5] virtio: support packed ring

2018-05-16 Thread Tiwei Bie
Hello everyone,

This RFC implements packed ring support in virtio driver.

Some simple functional tests have been done with Jason's
packed ring implementation in vhost:

https://lkml.org/lkml/2018/4/23/12

Both ping and netperf worked as expected (with EVENT_IDX
disabled).

TODO:
- Refinements (for code and commit log);
- More tests;
- Bug fixes;

RFC v3 -> RFC v4:
- Make ID allocation support out-of-order (Jason);
- Various fixes for EVENT_IDX support;

RFC v2 -> RFC v3:
- Split into small patches (Jason);
- Add helper virtqueue_use_indirect() (Jason);
- Just set id for the last descriptor of a list (Jason);
- Calculate the prev in virtqueue_add_packed() (Jason);
- Fix/improve desc suppression code (Jason/MST);
- Refine the code layout for XXX_split/packed and wrappers (MST);
- Fix the comments and API in uapi (MST);
- Remove the BUG_ON() for indirect (Jason);
- Some other refinements and bug fixes;

RFC v1 -> RFC v2:
- Add indirect descriptor support - compile test only;
- Add event suppression support - compile test only;
- Move vring_packed_init() out of uapi (Jason, MST);
- Merge two loops into one in virtqueue_add_packed() (Jason);
- Split vring_unmap_one() for packed ring and split ring (Jason);
- Avoid using '%' operator (Jason);
- Rename free_head -> next_avail_idx (Jason);
- Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
- Some other refinements and bug fixes;

Thanks!

Tiwei Bie (5):
  virtio: add packed ring definitions
  virtio_ring: support creating packed ring
  virtio_ring: add packed ring support
  virtio_ring: add event idx support in packed ring
  virtio_ring: enable packed ring

 drivers/virtio/virtio_ring.c   | 1338 ++--
 include/linux/virtio_ring.h|8 +-
 include/uapi/linux/virtio_config.h |   12 +-
 include/uapi/linux/virtio_ring.h   |   36 +
 4 files changed, 1116 insertions(+), 278 deletions(-)

-- 
2.17.0



[RFC v4 1/5] virtio: add packed ring definitions

2018-05-16 Thread Tiwei Bie
Signed-off-by: Tiwei Bie <tiwei@intel.com>
---
 include/uapi/linux/virtio_config.h | 12 +-
 include/uapi/linux/virtio_ring.h   | 36 ++
 2 files changed, 47 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/virtio_config.h 
b/include/uapi/linux/virtio_config.h
index 308e2096291f..a6e392325e3a 100644
--- a/include/uapi/linux/virtio_config.h
+++ b/include/uapi/linux/virtio_config.h
@@ -49,7 +49,7 @@
  * transport being used (eg. virtio_ring), the rest are per-device feature
  * bits. */
 #define VIRTIO_TRANSPORT_F_START   28
-#define VIRTIO_TRANSPORT_F_END 34
+#define VIRTIO_TRANSPORT_F_END 36
 
 #ifndef VIRTIO_CONFIG_NO_LEGACY
 /* Do we get callbacks when the ring is completely used, even if we've
@@ -71,4 +71,14 @@
  * this is for compatibility with legacy systems.
  */
 #define VIRTIO_F_IOMMU_PLATFORM33
+
+/* This feature indicates support for the packed virtqueue layout. */
+#define VIRTIO_F_RING_PACKED   34
+
+/*
+ * This feature indicates that all buffers are used by the device
+ * in the same order in which they have been made available.
+ */
+#define VIRTIO_F_IN_ORDER  35
+
 #endif /* _UAPI_LINUX_VIRTIO_CONFIG_H */
diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h
index 6d5d5faa989b..3932cb80c347 100644
--- a/include/uapi/linux/virtio_ring.h
+++ b/include/uapi/linux/virtio_ring.h
@@ -44,6 +44,9 @@
 /* This means the buffer contains a list of buffer descriptors. */
 #define VRING_DESC_F_INDIRECT  4
 
+#define VRING_DESC_F_AVAIL(b)  ((b) << 7)
+#define VRING_DESC_F_USED(b)   ((b) << 15)
+
 /* The Host uses this in used->flags to advise the Guest: don't kick me when
  * you add a buffer.  It's unreliable, so it's simply an optimization.  Guest
  * will still kick if it's out of buffers. */
@@ -53,6 +56,10 @@
  * optimization.  */
 #define VRING_AVAIL_F_NO_INTERRUPT 1
 
+#define VRING_EVENT_F_ENABLE   0x0
+#define VRING_EVENT_F_DISABLE  0x1
+#define VRING_EVENT_F_DESC 0x2
+
 /* We support indirect buffer descriptors */
 #define VIRTIO_RING_F_INDIRECT_DESC28
 
@@ -171,4 +178,33 @@ static inline int vring_need_event(__u16 event_idx, __u16 
new_idx, __u16 old)
return (__u16)(new_idx - event_idx - 1) < (__u16)(new_idx - old);
 }
 
+struct vring_packed_desc_event {
+   /* __virtio16 off  : 15; // Descriptor Event Offset
+* __virtio16 wrap : 1;  // Descriptor Event Wrap Counter */
+   __virtio16 off_wrap;
+   /* __virtio16 flags : 2; // Descriptor Event Flags */
+   __virtio16 flags;
+};
+
+struct vring_packed_desc {
+   /* Buffer Address. */
+   __virtio64 addr;
+   /* Buffer Length. */
+   __virtio32 len;
+   /* Buffer ID. */
+   __virtio16 id;
+   /* The flags depending on descriptor type. */
+   __virtio16 flags;
+};
+
+struct vring_packed {
+   unsigned int num;
+
+   struct vring_packed_desc *desc;
+
+   struct vring_packed_desc_event *driver;
+
+   struct vring_packed_desc_event *device;
+};
+
 #endif /* _UAPI_LINUX_VIRTIO_RING_H */
-- 
2.17.0



[RFC v4 4/5] virtio_ring: add event idx support in packed ring

2018-05-16 Thread Tiwei Bie
This commit introduces the event idx support in
packed ring.

Signed-off-by: Tiwei Bie <tiwei@intel.com>
---
 drivers/virtio/virtio_ring.c | 75 +---
 1 file changed, 70 insertions(+), 5 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index c6c5deb0e3ae..de3839f3621a 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -1006,7 +1006,7 @@ static inline int virtqueue_add_packed(struct virtqueue 
*_vq,
 static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
 {
struct vring_virtqueue *vq = to_vvq(_vq);
-   u16 flags;
+   u16 new, old, off_wrap, flags, wrap_counter, event_idx;
bool needs_kick;
u32 snapshot;
 
@@ -1015,9 +1015,19 @@ static bool virtqueue_kick_prepare_packed(struct 
virtqueue *_vq)
 * suppressions. */
virtio_mb(vq->weak_barriers);
 
+   old = vq->next_avail_idx - vq->num_added;
+   new = vq->next_avail_idx;
+   vq->num_added = 0;
+
snapshot = *(u32 *)vq->vring_packed.device;
off_wrap = virtio16_to_cpu(_vq->vdev, (__virtio16)(snapshot & 0xffff));
flags = virtio16_to_cpu(_vq->vdev, (__virtio16)(snapshot >> 16)) & 0x3;
 
+   wrap_counter = off_wrap >> 15;
+   event_idx = off_wrap & ~(1<<15);
+   if (wrap_counter != vq->wrap_counter)
+   event_idx -= vq->vring_packed.num;
+
 #ifdef DEBUG
if (vq->last_add_time_valid) {
WARN_ON(ktime_to_ms(ktime_sub(ktime_get(),
@@ -1026,7 +1036,10 @@ static bool virtqueue_kick_prepare_packed(struct 
virtqueue *_vq)
vq->last_add_time_valid = false;
 #endif
 
-   needs_kick = (flags != VRING_EVENT_F_DISABLE);
+   if (flags == VRING_EVENT_F_DESC)
+   needs_kick = vring_need_event(event_idx, new, old);
+   else
+   needs_kick = (flags != VRING_EVENT_F_DISABLE);
END_USE(vq);
return needs_kick;
 }
@@ -1098,7 +1111,7 @@ static void *virtqueue_get_buf_ctx_packed(struct 
virtqueue *_vq,
  void **ctx)
 {
struct vring_virtqueue *vq = to_vvq(_vq);
-   u16 last_used, id;
+   u16 wrap_counter, last_used, id;
void *ret;
 
START_USE(vq);
@@ -1138,6 +1151,19 @@ static void *virtqueue_get_buf_ctx_packed(struct 
virtqueue *_vq,
ret = vq->desc_state[id].data;
detach_buf_packed(vq, last_used, id, ctx);
 
+   wrap_counter = vq->wrap_counter;
+   if (vq->last_used_idx > vq->next_avail_idx)
+   wrap_counter ^= 1;
+
+   /* If we expect an interrupt for the next entry, tell host
+* by writing event index and flush out the write before
+* the read in the next get_buf call. */
+   if (vq->event_flags_shadow == VRING_EVENT_F_DESC)
+   virtio_store_mb(vq->weak_barriers,
+   >vring_packed.driver->off_wrap,
+   cpu_to_virtio16(_vq->vdev, vq->last_used_idx |
+   (wrap_counter << 15)));
+
 #ifdef DEBUG
vq->last_add_time_valid = false;
 #endif
@@ -1160,15 +1186,27 @@ static void virtqueue_disable_cb_packed(struct 
virtqueue *_vq)
 static unsigned virtqueue_enable_cb_prepare_packed(struct virtqueue *_vq)
 {
struct vring_virtqueue *vq = to_vvq(_vq);
+   u16 wrap_counter;
 
START_USE(vq);
 
/* We optimistically turn back on interrupts, then check if there was
 * more to do. */
+   /* Depending on the VIRTIO_RING_F_USED_EVENT_IDX feature, we need to
+* either clear the flags bit or point the event index at the next
+* entry. Always update the event index to keep code simple. */
+
+   wrap_counter = vq->wrap_counter;
+   if (vq->last_used_idx > vq->next_avail_idx)
+   wrap_counter ^= 1;
+
+   vq->vring_packed.driver->off_wrap = cpu_to_virtio16(_vq->vdev,
+   vq->last_used_idx | (wrap_counter << 15));
 
if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
virtio_wmb(vq->weak_barriers);
-   vq->event_flags_shadow = VRING_EVENT_F_ENABLE;
+   vq->event_flags_shadow = vq->event ? VRING_EVENT_F_DESC :
+VRING_EVENT_F_ENABLE;
vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
vq->event_flags_shadow);
}
@@ -1194,15 +1232,40 @@ static bool virtqueue_poll_packed(struct virtqueue 
*_vq, unsigned last_used_idx)
 static bool virtqueue_enable_cb_delayed_packed(struct virtqueue *_vq)
 {
struct vring_virtqueue *vq = to_vvq(_vq);
+   u16 bufs, used_idx, wrap_counter;
 
START_USE(vq);
 
  

[RFC v4 3/5] virtio_ring: add packed ring support

2018-05-16 Thread Tiwei Bie
This commit introduces the basic support (without EVENT_IDX)
for packed ring.

Signed-off-by: Tiwei Bie <tiwei@intel.com>
---
 drivers/virtio/virtio_ring.c | 491 ++-
 1 file changed, 481 insertions(+), 10 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 62d7c407841a..c6c5deb0e3ae 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -58,7 +58,8 @@
 
 struct vring_desc_state {
void *data; /* Data for callback. */
-   struct vring_desc *indir_desc;  /* Indirect descriptor, if any. */
+   void *indir_desc;   /* Indirect descriptor, if any. */
+   int num;/* Descriptor list length. */
 };
 
 struct vring_virtqueue {
@@ -116,6 +117,9 @@ struct vring_virtqueue {
/* Last written value to driver->flags in
 * guest byte order. */
u16 event_flags_shadow;
+
+   /* ID allocation. */
+   struct idr buffer_id;
};
};
 
@@ -142,6 +146,16 @@ struct vring_virtqueue {
 
 #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
 
+static inline bool virtqueue_use_indirect(struct virtqueue *_vq,
+ unsigned int total_sg)
+{
+   struct vring_virtqueue *vq = to_vvq(_vq);
+
+   /* If the host supports indirect descriptor tables, and we have multiple
+* buffers, then go indirect. FIXME: tune this threshold */
+   return (vq->indirect && total_sg > 1 && vq->vq.num_free);
+}
+
 /*
  * Modern virtio devices have feature bits to specify whether they need a
  * quirk and bypass the IOMMU. If not there, just use the DMA API.
@@ -327,9 +341,7 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
 
head = vq->free_head;
 
-   /* If the host supports indirect descriptor tables, and we have multiple
-* buffers, then go indirect. FIXME: tune this threshold */
-   if (vq->indirect && total_sg > 1 && vq->vq.num_free)
+   if (virtqueue_use_indirect(_vq, total_sg))
desc = alloc_indirect_split(_vq, total_sg, gfp);
else {
desc = NULL;
@@ -741,6 +753,63 @@ static inline unsigned vring_size_packed(unsigned int num, 
unsigned long align)
& ~(align - 1)) + sizeof(struct vring_packed_desc_event) * 2;
 }
 
+static void vring_unmap_one_packed(const struct vring_virtqueue *vq,
+  struct vring_packed_desc *desc)
+{
+   u16 flags;
+
+   if (!vring_use_dma_api(vq->vq.vdev))
+   return;
+
+   flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
+
+   if (flags & VRING_DESC_F_INDIRECT) {
+   dma_unmap_single(vring_dma_dev(vq),
+virtio64_to_cpu(vq->vq.vdev, desc->addr),
+virtio32_to_cpu(vq->vq.vdev, desc->len),
+(flags & VRING_DESC_F_WRITE) ?
+DMA_FROM_DEVICE : DMA_TO_DEVICE);
+   } else {
+   dma_unmap_page(vring_dma_dev(vq),
+  virtio64_to_cpu(vq->vq.vdev, desc->addr),
+  virtio32_to_cpu(vq->vq.vdev, desc->len),
+  (flags & VRING_DESC_F_WRITE) ?
+  DMA_FROM_DEVICE : DMA_TO_DEVICE);
+   }
+}
+
+static struct vring_packed_desc *alloc_indirect_packed(struct virtqueue *_vq,
+  unsigned int total_sg,
+  gfp_t gfp)
+{
+   struct vring_packed_desc *desc;
+
+   /*
+* We require lowmem mappings for the descriptors because
+* otherwise virt_to_phys will give us bogus addresses in the
+* virtqueue.
+*/
+   gfp &= ~__GFP_HIGHMEM;
+
+   desc = kmalloc(total_sg * sizeof(struct vring_packed_desc), gfp);
+
+   return desc;
+}
+
+static u16 alloc_id_packed(struct vring_virtqueue *vq)
+{
+   u16 id;
+
+   id = idr_alloc(>buffer_id, NULL, 0, vq->vring_packed.num,
+  GFP_KERNEL);
+   return id;
+}
+
+static void free_id_packed(struct vring_virtqueue *vq, u16 id)
+{
+   idr_remove(>buffer_id, id);
+}
+
 static inline int virtqueue_add_packed(struct virtqueue *_vq,
   struct scatterlist *sgs[],
   unsigned int total_sg,
@@ -750,47 +819,446 @@ static inline int virtqueue_add_packed(struct virtqueue 
*_vq,
   void *ctx,
   gfp_t gfp)
 {
+   struct vring_virtqueue *vq = to_vvq(_vq);
+   struct vring_packed_desc *desc;

[RFC v4 5/5] virtio_ring: enable packed ring

2018-05-16 Thread Tiwei Bie
Signed-off-by: Tiwei Bie <tiwei@intel.com>
---
 drivers/virtio/virtio_ring.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index de3839f3621a..b158692263b0 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -1940,6 +1940,8 @@ void vring_transport_features(struct virtio_device *vdev)
break;
case VIRTIO_F_IOMMU_PLATFORM:
break;
+   case VIRTIO_F_RING_PACKED:
+   break;
default:
/* We don't understand this bit. */
__virtio_clear_bit(vdev, i);
-- 
2.17.0



[RFC v4 2/5] virtio_ring: support creating packed ring

2018-05-16 Thread Tiwei Bie
This commit introduces the support for creating packed ring.
All split ring specific functions get a _split suffix.
Some necessary stubs for packed ring are also added.

Signed-off-by: Tiwei Bie <tiwei@intel.com>
---
 drivers/virtio/virtio_ring.c | 764 +++
 include/linux/virtio_ring.h  |   8 +-
 2 files changed, 513 insertions(+), 259 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 71458f493cf8..62d7c407841a 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -64,8 +64,8 @@ struct vring_desc_state {
 struct vring_virtqueue {
struct virtqueue vq;
 
-   /* Actual memory layout for this queue */
-   struct vring vring;
+   /* Is this a packed ring? */
+   bool packed;
 
/* Can we use weak barriers? */
bool weak_barriers;
@@ -79,19 +79,45 @@ struct vring_virtqueue {
/* Host publishes avail event idx */
bool event;
 
-   /* Head of free buffer list. */
-   unsigned int free_head;
/* Number we've added since last sync. */
unsigned int num_added;
 
/* Last used index we've seen. */
u16 last_used_idx;
 
-   /* Last written value to avail->flags */
-   u16 avail_flags_shadow;
+   union {
+   /* Available for split ring */
+   struct {
+   /* Actual memory layout for this queue. */
+   struct vring vring;
 
-   /* Last written value to avail->idx in guest byte order */
-   u16 avail_idx_shadow;
+   /* Head of free buffer list. */
+   unsigned int free_head;
+
+   /* Last written value to avail->flags */
+   u16 avail_flags_shadow;
+
+   /* Last written value to avail->idx in
+* guest byte order. */
+   u16 avail_idx_shadow;
+   };
+
+   /* Available for packed ring */
+   struct {
+   /* Actual memory layout for this queue. */
+   struct vring_packed vring_packed;
+
+   /* Driver ring wrap counter. */
+   u8 wrap_counter;
+
+   /* Index of the next avail descriptor. */
+   u16 next_avail_idx;
+
+   /* Last written value to driver->flags in
+* guest byte order. */
+   u16 event_flags_shadow;
+   };
+   };
 
/* How to notify other side. FIXME: commonalize hcalls! */
bool (*notify)(struct virtqueue *vq);
@@ -201,8 +227,17 @@ static dma_addr_t vring_map_single(const struct 
vring_virtqueue *vq,
  cpu_addr, size, direction);
 }
 
-static void vring_unmap_one(const struct vring_virtqueue *vq,
-   struct vring_desc *desc)
+static int vring_mapping_error(const struct vring_virtqueue *vq,
+  dma_addr_t addr)
+{
+   if (!vring_use_dma_api(vq->vq.vdev))
+   return 0;
+
+   return dma_mapping_error(vring_dma_dev(vq), addr);
+}
+
+static void vring_unmap_one_split(const struct vring_virtqueue *vq,
+ struct vring_desc *desc)
 {
u16 flags;
 
@@ -226,17 +261,9 @@ static void vring_unmap_one(const struct vring_virtqueue 
*vq,
}
 }
 
-static int vring_mapping_error(const struct vring_virtqueue *vq,
-  dma_addr_t addr)
-{
-   if (!vring_use_dma_api(vq->vq.vdev))
-   return 0;
-
-   return dma_mapping_error(vring_dma_dev(vq), addr);
-}
-
-static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
-unsigned int total_sg, gfp_t gfp)
+static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq,
+  unsigned int total_sg,
+  gfp_t gfp)
 {
struct vring_desc *desc;
unsigned int i;
@@ -257,14 +284,14 @@ static struct vring_desc *alloc_indirect(struct virtqueue 
*_vq,
return desc;
 }
 
-static inline int virtqueue_add(struct virtqueue *_vq,
-   struct scatterlist *sgs[],
-   unsigned int total_sg,
-   unsigned int out_sgs,
-   unsigned int in_sgs,
-   void *data,
-   void *ctx,
-   gfp_t gfp)
+static inline int virtqueue_add_split(struct virtqueue *_vq,
+ struct scatterlist *sgs[],
+ unsigned int total_sg,
+ unsigned int out_sgs,
+ 

Re: [RFC v3 4/5] virtio_ring: add event idx support in packed ring

2018-05-15 Thread Tiwei Bie
On Wed, May 16, 2018 at 01:01:04PM +0800, Jason Wang wrote:
> On 2018年04月25日 13:15, Tiwei Bie wrote:
[...]
> > @@ -1143,10 +1160,17 @@ static unsigned 
> > virtqueue_enable_cb_prepare_packed(struct virtqueue *_vq)
> > /* We optimistically turn back on interrupts, then check if there was
> >  * more to do. */
> > +   /* Depending on the VIRTIO_RING_F_USED_EVENT_IDX feature, we need to
> > +* either clear the flags bit or point the event index at the next
> > +* entry. Always update the event index to keep code simple. */
> > +
> > +   vq->vring_packed.driver->off_wrap = cpu_to_virtio16(_vq->vdev,
> > +   vq->last_used_idx | (vq->wrap_counter << 15));
> 
> 
> Using vq->wrap_counter doesn't seem correct; what we need is the wrap counter
> for last_used_idx, not next_avail_idx.

Yes, you're right. I have fixed it in my local repo,
but haven't sent out a new version yet.

I'll try to send out a new RFC today.

> 
> And I think there's even no need to bother with event idx here, how about
> just set VRING_EVENT_F_ENABLE?

We had a similar discussion before. Michael prefers
to use VRING_EVENT_F_DESC when possible to avoid
extra interrupts if host is fast:

https://lkml.org/lkml/2018/4/16/1085
"""
I suspect this will lead to extra interrupts if host is fast.
So I think for now we should always use VRING_EVENT_F_DESC
if EVENT_IDX is negotiated.
"""

> 
> > if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
> > virtio_wmb(vq->weak_barriers);
> > -   vq->event_flags_shadow = VRING_EVENT_F_ENABLE;
> > +   vq->event_flags_shadow = vq->event ? VRING_EVENT_F_DESC :
> > +VRING_EVENT_F_ENABLE;
> > vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
> > vq->event_flags_shadow);
> > }
> > @@ -1172,15 +1196,34 @@ static bool virtqueue_poll_packed(struct virtqueue 
> > *_vq, unsigned last_used_idx)
> >   static bool virtqueue_enable_cb_delayed_packed(struct virtqueue *_vq)
> >   {
> > struct vring_virtqueue *vq = to_vvq(_vq);
> > +   u16 bufs, used_idx, wrap_counter;
> > START_USE(vq);
> > /* We optimistically turn back on interrupts, then check if there was
> >  * more to do. */
> > +   /* Depending on the VIRTIO_RING_F_USED_EVENT_IDX feature, we need to
> > +* either clear the flags bit or point the event index at the next
> > +* entry. Always update the event index to keep code simple. */
> > +
> > +   /* TODO: tune this threshold */
> > +   bufs = (u16)(vq->next_avail_idx - vq->last_used_idx) * 3 / 4;
> 
> bufs could be more than vq->num here, is this intended?

Yes, you're right. Like the above one -- I have fixed
it in my local repo, but haven't sent out a new version
yet. Thanks for spotting this!

> 
> > +
> > +   used_idx = vq->last_used_idx + bufs;
> > +   wrap_counter = vq->wrap_counter;
> > +
> > +   if (used_idx >= vq->vring_packed.num) {
> > +   used_idx -= vq->vring_packed.num;
> > +   wrap_counter ^= 1;
> 
> When used_idx is greater than or equal to vq->num, there's no need to flip
> the wrap_counter bit since it should match next_avail_idx.
> 
> And we also need to care about the case when next_avail wraps but used_idx
> does not, so we probably need:
> 
> else if (vq->next_avail_idx < used_idx) {
>     wrap_counter ^= 1;
> }
> 
> I think maybe it's time to add some sample code in the spec to avoid
> duplicating the effort (and the bugs).

+1

Best regards,
Tiwei Bie


Re: [RFC v3 3/5] virtio_ring: add packed ring support

2018-05-10 Thread Tiwei Bie
On Thu, May 10, 2018 at 05:49:20PM +0800, Jason Wang wrote:
> On 2018年05月10日 16:56, Tiwei Bie wrote:
> > On Thu, May 10, 2018 at 03:34:50PM +0800, Jason Wang wrote:
> > > On 2018年05月10日 15:32, Jason Wang wrote:
> > > > On 2018年04月25日 13:15, Tiwei Bie wrote:
> > > > > +    /* We're using some buffers from the free list. */
> > > > > +    vq->vq.num_free -= descs_used;
> > > > > +
> > > > > +    /* Update free pointer */
> > > > > +    if (indirect) {
> > > > > +    n = head + 1;
> > > > > +    if (n >= vq->vring_packed.num) {
> > > > > +    n = 0;
> > > > > +    vq->wrap_counter ^= 1;
> > > > > +    }
> > > > > +    vq->next_avail_idx = n;
> > > > > +    } else
> > > > > +    vq->next_avail_idx = i;
> > > > During testing zerocopy (out of order completion), I found driver may
> > > > submit two identical buffer id to vhost. So the above code may not work
> > > > well.
> > > > 
> > > > Consider the case that driver adds 3 buffer and virtqueue size is 8.
> > > > 
> > > > a) id = 0,count = 2,next_avail = 2
> > > > 
> > > > b) id = 2,count = 4,next_avail = 2
> > > next_avail should be 6 here.
> > > 
> > > > c) id = 4,count = 2,next_avail = 0
> > > > 
> > > id should be 6 here.
> > > 
> > > Thanks
> > > 
> > > > if packet b is done before packet a, driver may think buffer id 0 is
> > > > > > available and try to use it even if the real buffer 0 was not done.
> > > > 
> > > > Thanks
> > Nice catch! Thanks a lot!
> > I'll implement an ID allocator.
> > 
> > Best regards,
> > Tiwei Bie
> 
> Sounds good.
> 
> Another similar issue is detach_buf_packed(). It did:
> 
>     for (j = 0; j < vq->desc_state[head].num; j++) {
>     desc = &vq->vring_packed.desc[i];
>     vring_unmap_one_packed(vq, desc);
>     i++;
>     if (i >= vq->vring_packed.num)
>     i = 0;
>     }
> 
> This probably won't work for out of order too and according to the spec:
> 
> """
> Driver needs to keep track of the size of the list corresponding to each
> buffer ID, to be able to skip to where the next used descriptor is written
> by the device.
> """
> 
> Looks like we should not depend on the descriptor ring.

Yeah, the previous ID allocation is too simple...
Let me fix it in the next version.

Thanks!

> 
> Thanks


Re: [RFC v3 3/5] virtio_ring: add packed ring support

2018-05-10 Thread Tiwei Bie
On Thu, May 10, 2018 at 03:34:50PM +0800, Jason Wang wrote:
> On 2018年05月10日 15:32, Jason Wang wrote:
> > On 2018年04月25日 13:15, Tiwei Bie wrote:
> > > +    /* We're using some buffers from the free list. */
> > > +    vq->vq.num_free -= descs_used;
> > > +
> > > +    /* Update free pointer */
> > > +    if (indirect) {
> > > +    n = head + 1;
> > > +    if (n >= vq->vring_packed.num) {
> > > +    n = 0;
> > > +    vq->wrap_counter ^= 1;
> > > +    }
> > > +    vq->next_avail_idx = n;
> > > +    } else
> > > +    vq->next_avail_idx = i;
> > 
> > During testing zerocopy (out of order completion), I found driver may
> > submit two identical buffer id to vhost. So the above code may not work
> > well.
> > 
> > Consider the case that driver adds 3 buffer and virtqueue size is 8.
> > 
> > a) id = 0,count = 2,next_avail = 2
> > 
> > b) id = 2,count = 4,next_avail = 2
> 
> next_avail should be 6 here.
> 
> > 
> > c) id = 4,count = 2,next_avail = 0
> > 
> 
> id should be 6 here.
> 
> Thanks
> 
> > if packet b is done before packet a, driver may think buffer id 0 is
> > available and try to use it even if the real buffer 0 was not done.
> > 
> > Thanks

Nice catch! Thanks a lot!
I'll implement an ID allocator.

Best regards,
Tiwei Bie


Re: [RFC v3 4/5] virtio_ring: add event idx support in packed ring

2018-05-08 Thread Tiwei Bie
On Tue, May 08, 2018 at 05:34:40PM +0800, Jason Wang wrote:
> On 2018年05月08日 17:16, Tiwei Bie wrote:
> > On Tue, May 08, 2018 at 03:16:53PM +0800, Jason Wang wrote:
> > > On 2018年05月08日 14:44, Tiwei Bie wrote:
> > > > On Tue, May 08, 2018 at 01:40:40PM +0800, Jason Wang wrote:
> > > > > On 2018年05月08日 11:05, Jason Wang wrote:
> > > > > > > Because in virtqueue_enable_cb_delayed(), we may set an
> > > > > > > event_off which is bigger than new and both of them have
> > > > > > > wrapped. And in this case, although new is smaller than
> > > > > > > event_off (i.e. the third param -- old), new shouldn't
> > > > > > > add vq->num, and actually we are expecting a very big
> > > > > > > idx diff.
> > > > > > Yes, so to calculate distance correctly between event and new, we 
> > > > > > just
> > > > > > need to compare the wrap counter and return false if it doesn't 
> > > > > > match
> > > > > > without the need to try to add vq.num here.
> > > > > > 
> > > > > > Thanks
> > > > > Sorry, looks like the following should work, we need add vq.num if
> > > > > used_wrap_counter does not match:
> > > > > 
> > > > > static bool vhost_vring_packed_need_event(struct vhost_virtqueue *vq,
> > > > >                         __u16 off_wrap, __u16 new,
> > > > >                         __u16 old)
> > > > > {
> > > > >       bool wrap = off_wrap >> 15;
> > > > >       int off = off_wrap & ~(1 << 15);
> > > > >       __u16 d1, d2;
> > > > > 
> > > > >       if (wrap != vq->used_wrap_counter)
> > > > >           d1 = new + vq->num - off - 1;
> > > > Just to draw your attention (maybe you have already
> > > > noticed this).
> > > I miss this, thanks!
> > > 
> > > > In this case (i.e. wrap != vq->used_wrap_counter),
> > > > it's also possible that (off < new) is true. Because,
> > > > 
> > > > when virtqueue_enable_cb_delayed_packed() is used,
> > > > `off` is calculated in driver in a way like this:
> > > > 
> > > > off = vq->last_used_idx + bufs;
> > > > if (off >= vq->vring_packed.num) {
> > > > off -= vq->vring_packed.num;
> > > > wrap_counter ^= 1;
> > > > }
> > > > 
> > > > And when `new` (in vhost) is close to vq->num. The
> > > > vq->last_used_idx + bufs (in driver) can be bigger
> > > > than vq->vring_packed.num, and:
> > > > 
> > > > 1. `off` will wrap;
> > > > 2. wrap counters won't match;
> > > > 3. off < new;
> > > > 
> > > > And d1 (i.e. new + vq->num - off - 1) will be a value
> > > > bigger than vq->num. I'm okay with this, although it's
> > > > a bit weird.
> > > 
> > > So I'm considering something more compact by reusing vring_need_event() by
> > > pretending a larger queue size and adding vq->num back when necessary:
> > > 
> > > static bool vhost_vring_packed_need_event(struct vhost_virtqueue *vq,
> > >                        __u16 off_wrap, __u16 new,
> > >                        __u16 old)
> > > {
> > >      bool wrap = vq->used_wrap_counter;
> > If the wrap counter is obtained from the vq,
> > I think `new` should also be obtained from
> > the vq. Or the wrap counter should be carried
> > in `new`.
> > 
> > >      int off = off_wrap & ~(1 << 15);
> > >      __u16 d1, d2;
> > > 
> > >      if (new < old) {
> > >          new += vq->num;
> > >          wrap ^= 1;
> > >      }
> > > 
> > >      if (wrap != off_wrap >> 15)
> > >          off += vq->num;
> > When `new` and `old` wraps, and `off` doesn't wrap,
> > wrap != (off_wrap >> 15) will be true. In this case,
> > `off` is bigger than `new`, and what we should do
> > is `off -= vq->num` instead of `off += vq->num`.
> 
> If I understand this correctly, if we track old correctly, it won't happen
> if guest driver behave correctly. That means it should only happen for a
> buggy driver (e.g trying to move off_wrap back).

If vhost is faster than the virtio driver, I guess the above
case may happen. `old` and `new` are updated each time we want
to notify the driver. If the driver is slower, `old` and `new`
in vhost may wrap before the `off` set by the driver wraps.

Best regards,
Tiwei Bie

> 
> Thanks
> 
> > 
> > Best regards,
> > Tiwei Bie
> > 
> > >      return vring_need_event(off, new, old);
> > > }
> > > 
> > > 
> > > > Best regards,
> > > > Tiwei Bie
> > > > 
> > > > >       else
> > > > >           d1 = new - off - 1;
> > > > > 
> > > > >       if (new > old)
> > > > >           d2 = new - old;
> > > > >       else
> > > > >           d2 = new + vq->num - old;
> > > > > 
> > > > >       return d1 < d2;
> > > > > }
> > > > > 
> > > > > Thanks
> > > > > 
> 


Re: [RFC v3 4/5] virtio_ring: add event idx support in packed ring

2018-05-08 Thread Tiwei Bie
On Tue, May 08, 2018 at 03:16:53PM +0800, Jason Wang wrote:
> On 2018年05月08日 14:44, Tiwei Bie wrote:
> > On Tue, May 08, 2018 at 01:40:40PM +0800, Jason Wang wrote:
> > > On 2018年05月08日 11:05, Jason Wang wrote:
> > > > > Because in virtqueue_enable_cb_delayed(), we may set an
> > > > > event_off which is bigger than new and both of them have
> > > > > wrapped. And in this case, although new is smaller than
> > > > > event_off (i.e. the third param -- old), new shouldn't
> > > > > add vq->num, and actually we are expecting a very big
> > > > > idx diff.
> > > > Yes, so to calculate distance correctly between event and new, we just
> > > > need to compare the wrap counter and return false if it doesn't match
> > > > without the need to try to add vq.num here.
> > > > 
> > > > Thanks
> > > Sorry, looks like the following should work, we need add vq.num if
> > > used_wrap_counter does not match:
> > > 
> > > static bool vhost_vring_packed_need_event(struct vhost_virtqueue *vq,
> > >                        __u16 off_wrap, __u16 new,
> > >                        __u16 old)
> > > {
> > >      bool wrap = off_wrap >> 15;
> > >      int off = off_wrap & ~(1 << 15);
> > >      __u16 d1, d2;
> > > 
> > >      if (wrap != vq->used_wrap_counter)
> > >          d1 = new + vq->num - off - 1;
> > Just to draw your attention (maybe you have already
> > noticed this).
> 
> I miss this, thanks!
> 
> > 
> > In this case (i.e. wrap != vq->used_wrap_counter),
> > it's also possible that (off < new) is true. Because,
> > 
> > when virtqueue_enable_cb_delayed_packed() is used,
> > `off` is calculated in driver in a way like this:
> > 
> > off = vq->last_used_idx + bufs;
> > if (off >= vq->vring_packed.num) {
> > off -= vq->vring_packed.num;
> > wrap_counter ^= 1;
> > }
> > 
> > And when `new` (in vhost) is close to vq->num. The
> > vq->last_used_idx + bufs (in driver) can be bigger
> > than vq->vring_packed.num, and:
> > 
> > 1. `off` will wrap;
> > 2. wrap counters won't match;
> > 3. off < new;
> > 
> > And d1 (i.e. new + vq->num - off - 1) will be a value
> > bigger than vq->num. I'm okay with this, although it's
> > a bit weird.
> 
> 
> So I'm considering something more compact by reusing vring_need_event() by
> pretending a larger queue size and adding vq->num back when necessary:
> 
> static bool vhost_vring_packed_need_event(struct vhost_virtqueue *vq,
>                       __u16 off_wrap, __u16 new,
>                       __u16 old)
> {
>     bool wrap = vq->used_wrap_counter;

If the wrap counter is obtained from the vq,
I think `new` should also be obtained from
the vq. Or the wrap counter should be carried
in `new`.
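
Just to illustrate the second option (a rough idea, not a concrete
proposal): the wrap counter could be encoded into bit 15 of `new`,
similar to how off_wrap is encoded, e.g. something like:

    /* used_idx stands for whatever used index vhost passes today. */
    __u16 new = used_idx | (vq->used_wrap_counter << 15);

and vhost_vring_packed_need_event() would mask/shift it out again.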

>     int off = off_wrap & ~(1 << 15);
>     __u16 d1, d2;
> 
>     if (new < old) {
>         new += vq->num;
>         wrap ^= 1;
>     }
> 
>     if (wrap != off_wrap >> 15)
>         off += vq->num;

When `new` and `old` wraps, and `off` doesn't wrap,
wrap != (off_wrap >> 15) will be true. In this case,
`off` is bigger than `new`, and what we should do
is `off -= vq->num` instead of `off += vq->num`.

Best regards,
Tiwei Bie

> 
>     return vring_need_event(off, new, old);
> }
> 
> 
> > 
> > Best regards,
> > Tiwei Bie
> > 
> > >      else
> > >          d1 = new - off - 1;
> > > 
> > >      if (new > old)
> > >          d2 = new - old;
> > >      else
> > >          d2 = new + vq->num - old;
> > > 
> > >      return d1 < d2;
> > > }
> > > 
> > > Thanks
> > > 
> 


Re: [RFC v3 4/5] virtio_ring: add event idx support in packed ring

2018-05-08 Thread Tiwei Bie
On Tue, May 08, 2018 at 01:40:40PM +0800, Jason Wang wrote:
> On 2018年05月08日 11:05, Jason Wang wrote:
> > > 
> > > Because in virtqueue_enable_cb_delayed(), we may set an
> > > event_off which is bigger than new and both of them have
> > > wrapped. And in this case, although new is smaller than
> > > event_off (i.e. the third param -- old), new shouldn't
> > > add vq->num, and actually we are expecting a very big
> > > idx diff.
> > 
> > Yes, so to calculate distance correctly between event and new, we just
> > need to compare the wrap counter and return false if it doesn't match
> > without the need to try to add vq.num here.
> > 
> > Thanks
> 
> Sorry, looks like the following should work, we need add vq.num if
> used_wrap_counter does not match:
> 
> static bool vhost_vring_packed_need_event(struct vhost_virtqueue *vq,
>                       __u16 off_wrap, __u16 new,
>                       __u16 old)
> {
>     bool wrap = off_wrap >> 15;
>     int off = off_wrap & ~(1 << 15);
>     __u16 d1, d2;
> 
>     if (wrap != vq->used_wrap_counter)
>         d1 = new + vq->num - off - 1;

Just to draw your attention (maybe you have already
noticed this).

In this case (i.e. wrap != vq->used_wrap_counter),
it's also possible that (off < new) is true. Because,

when virtqueue_enable_cb_delayed_packed() is used,
`off` is calculated in driver in a way like this:

off = vq->last_used_idx + bufs;
if (off >= vq->vring_packed.num) {
off -= vq->vring_packed.num;
wrap_counter ^= 1;
}

And when `new` (in vhost) is close to vq->num. The
vq->last_used_idx + bufs (in driver) can be bigger
than vq->vring_packed.num, and:

1. `off` will wrap;
2. wrap counters won't match;
3. off < new;

And d1 (i.e. new + vq->num - off - 1) will be a value
bigger than vq->num. I'm okay with this, although it's
a bit weird.
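
A concrete example (with made-up numbers, assuming vq->vring_packed.num
is 8): the driver has last_used_idx = 6 and computes bufs = 3, so
off = 6 + 3 = 9 >= 8, which becomes off = 1 with the wrap counter
flipped. If `new` in vhost is 7 and hasn't wrapped yet, then:

1. `off` has wrapped;
2. the wrap counters don't match;
3. off (1) < new (7);

and d1 = new + vq->num - off - 1 = 7 + 8 - 1 - 1 = 13, which is
bigger than vq->num (8), as described above.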

Best regards,
Tiwei Bie

>     else
>         d1 = new - off - 1;
> 
>     if (new > old)
>         d2 = new - old;
>     else
>         d2 = new + vq->num - old;
> 
>     return d1 < d2;
> }
> 
> Thanks
> 


Re: [RFC v3 4/5] virtio_ring: add event idx support in packed ring

2018-05-03 Thread Tiwei Bie
On Thu, May 03, 2018 at 03:25:29PM +0800, Jason Wang wrote:
> On 2018年05月03日 10:09, Tiwei Bie wrote:
> > > > > So how about we use the straightforward way then?
> > > > You mean we do new += vq->vring_packed.num instead
> > > > of event_idx -= vq->vring_packed.num before calling
> > > > vring_need_event()?
> > > > 
> > > > The problem is that, the second param (new_idx) of
> > > > vring_need_event() will be used for:
> > > > 
> > > > (__u16)(new_idx - event_idx - 1)
> > > > (__u16)(new_idx - old)
> > > > 
> > > > So if we change new, we will need to change old too.
> > > I think that since we have a branch there anyway,
> > > we are better off just special-casing if (wrap_counter != 
> > > vq->wrap_counter).
> > > Treat is differenty and avoid casts.
> > > 
> > > > And that would be an ugly hack..
> > > > 
> > > > Best regards,
> > > > Tiwei Bie
> > > I consider casts and huge numbers with two's complement
> > > games even uglier.
> > The dependency on two's complement game is introduced
> > since the split ring.
> > 
> > In packed ring, old is calculated via:
> > 
> > old = vq->next_avail_idx - vq->num_added;
> > 
> > In split ring, old is calculated via:
> > 
> > old = vq->avail_idx_shadow - vq->num_added;
> > 
> > In both cases, when vq->num_added is bigger, old will
> > be a big number.
> > 
> > Best regards,
> > Tiwei Bie
> > 
> 
> How about just do something like vhost:
> 
> static u16 vhost_idx_diff(struct vhost_virtqueue *vq, u16 old, u16 new)
> {
>     if (new > old)
>         return new - old;
>     return  (new + vq->num - old);
> }
> 
> static bool vhost_vring_packed_need_event(struct vhost_virtqueue *vq,
>                       __u16 event_off, __u16 new,
>                       __u16 old)
> {
>     return (__u16)(vhost_idx_diff(vq, new, event_off) - 1) <
>        (__u16)vhost_idx_diff(vq, new, old);
> }
> 
> ?

It seems that there is a typo in the above code. The second
param of vhost_idx_diff() is `old`, but when calling this
function in vhost_vring_packed_need_event(), `new` is
passed as the second param.

If we assume the second param of vhost_idx_diff() is new
and the third one is old, i.e.:

static u16 vhost_idx_diff(struct vhost_virtqueue *vq, u16 new, u16 old)
{
    if (new > old)
        return new - old;
    return  (new + vq->num - old);
}

I think it's still not right.

Because in virtqueue_enable_cb_delayed(), we may set an
event_off which is bigger than new while both of them have
wrapped. In this case, although new is smaller than
event_off (i.e. the third param -- old), we shouldn't add
vq->num to new, and actually we are expecting a very big
idx diff.

Best regards,
Tiwei Bie


Re: [RFC v3 4/5] virtio_ring: add event idx support in packed ring

2018-05-02 Thread Tiwei Bie
On Thu, May 03, 2018 at 04:44:39AM +0300, Michael S. Tsirkin wrote:
> On Thu, May 03, 2018 at 09:11:16AM +0800, Tiwei Bie wrote:
> > On Wed, May 02, 2018 at 06:42:57PM +0300, Michael S. Tsirkin wrote:
> > > On Wed, May 02, 2018 at 11:12:55PM +0800, Tiwei Bie wrote:
> > > > On Wed, May 02, 2018 at 04:51:01PM +0300, Michael S. Tsirkin wrote:
> > > > > On Wed, May 02, 2018 at 03:28:19PM +0800, Tiwei Bie wrote:
> > > > > > On Wed, May 02, 2018 at 10:51:06AM +0800, Jason Wang wrote:
> > > > > > > On 2018年04月25日 13:15, Tiwei Bie wrote:
> > > > > > > > This commit introduces the event idx support in packed
> > > > > > > > ring. This feature is temporarily disabled, because the
> > > > > > > > implementation in this patch may not work as expected,
> > > > > > > > and some further discussions on the implementation are
> > > > > > > > needed, e.g. do we have to check the wrap counter when
> > > > > > > > checking whether a kick is needed?
> > > > > > > > 
> > > > > > > > Signed-off-by: Tiwei Bie <tiwei@intel.com>
> > > > > > > > ---
> > > > > > > >   drivers/virtio/virtio_ring.c | 53 
> > > > > > > > 
> > > > > > > >   1 file changed, 49 insertions(+), 4 deletions(-)
> > > > > > > > 
> > > > > > > > diff --git a/drivers/virtio/virtio_ring.c 
> > > > > > > > b/drivers/virtio/virtio_ring.c
> > > > > > > > index 0181e93897be..b1039c2985b9 100644
> > > > > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > > > > @@ -986,7 +986,7 @@ static inline int 
> > > > > > > > virtqueue_add_packed(struct virtqueue *_vq,
> > > > > > > >   static bool virtqueue_kick_prepare_packed(struct virtqueue 
> > > > > > > > *_vq)
> > > > > > > >   {
> > > > > > > > struct vring_virtqueue *vq = to_vvq(_vq);
> > > > > > > > -   u16 flags;
> > > > > > > > +   u16 new, old, off_wrap, flags;
> > > > > > > > bool needs_kick;
> > > > > > > > u32 snapshot;
> > > > > > > > @@ -995,7 +995,12 @@ static bool 
> > > > > > > > virtqueue_kick_prepare_packed(struct virtqueue *_vq)
> > > > > > > >  * suppressions. */
> > > > > > > > virtio_mb(vq->weak_barriers);
> > > > > > > > +   old = vq->next_avail_idx - vq->num_added;
> > > > > > > > +   new = vq->next_avail_idx;
> > > > > > > > +   vq->num_added = 0;
> > > > > > > > +
> > > > > > > > snapshot = *(u32 *)vq->vring_packed.device;
> > > > > > > > +   off_wrap = virtio16_to_cpu(_vq->vdev, snapshot & 
> > > > > > > > 0xffff);
> > > > > > > > flags = cpu_to_virtio16(_vq->vdev, snapshot >> 16) & 
> > > > > > > > 0x3;
> > > > > > > >   #ifdef DEBUG
> > > > > > > > @@ -1006,7 +1011,10 @@ static bool 
> > > > > > > > virtqueue_kick_prepare_packed(struct virtqueue *_vq)
> > > > > > > > vq->last_add_time_valid = false;
> > > > > > > >   #endif
> > > > > > > > -   needs_kick = (flags != VRING_EVENT_F_DISABLE);
> > > > > > > > +   if (flags == VRING_EVENT_F_DESC)
> > > > > > > > +   needs_kick = vring_need_event(off_wrap & 
> > > > > > > > ~(1<<15), new, old);
> > > > > > > 
> > > > > > > I wonder whether or not the math is correct. Both new and event 
> > > > > > > are in the
> > > > > > > unit of descriptor ring size, but old looks not.
> > > > > > 
> > > > > > What vring_need_event() cares is the distance between
> > > > > > `new` and `old`, i.e. vq->num_added. So I think there
> > > > > > is nothing wrong with `old`. But th

Re: [RFC v3 4/5] virtio_ring: add event idx support in packed ring

2018-05-02 Thread Tiwei Bie
On Wed, May 02, 2018 at 06:42:57PM +0300, Michael S. Tsirkin wrote:
> On Wed, May 02, 2018 at 11:12:55PM +0800, Tiwei Bie wrote:
> > On Wed, May 02, 2018 at 04:51:01PM +0300, Michael S. Tsirkin wrote:
> > > On Wed, May 02, 2018 at 03:28:19PM +0800, Tiwei Bie wrote:
> > > > On Wed, May 02, 2018 at 10:51:06AM +0800, Jason Wang wrote:
> > > > > On 2018年04月25日 13:15, Tiwei Bie wrote:
> > > > > > This commit introduces the event idx support in packed
> > > > > > ring. This feature is temporarily disabled, because the
> > > > > > implementation in this patch may not work as expected,
> > > > > > and some further discussions on the implementation are
> > > > > > needed, e.g. do we have to check the wrap counter when
> > > > > > checking whether a kick is needed?
> > > > > > 
> > > > > > Signed-off-by: Tiwei Bie <tiwei@intel.com>
> > > > > > ---
> > > > > >   drivers/virtio/virtio_ring.c | 53 
> > > > > > 
> > > > > >   1 file changed, 49 insertions(+), 4 deletions(-)
> > > > > > 
> > > > > > diff --git a/drivers/virtio/virtio_ring.c 
> > > > > > b/drivers/virtio/virtio_ring.c
> > > > > > index 0181e93897be..b1039c2985b9 100644
> > > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > > @@ -986,7 +986,7 @@ static inline int virtqueue_add_packed(struct 
> > > > > > virtqueue *_vq,
> > > > > >   static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
> > > > > >   {
> > > > > > struct vring_virtqueue *vq = to_vvq(_vq);
> > > > > > -   u16 flags;
> > > > > > +   u16 new, old, off_wrap, flags;
> > > > > > bool needs_kick;
> > > > > > u32 snapshot;
> > > > > > @@ -995,7 +995,12 @@ static bool 
> > > > > > virtqueue_kick_prepare_packed(struct virtqueue *_vq)
> > > > > >  * suppressions. */
> > > > > > virtio_mb(vq->weak_barriers);
> > > > > > +   old = vq->next_avail_idx - vq->num_added;
> > > > > > +   new = vq->next_avail_idx;
> > > > > > +   vq->num_added = 0;
> > > > > > +
> > > > > > snapshot = *(u32 *)vq->vring_packed.device;
> > > > > > off_wrap = virtio16_to_cpu(_vq->vdev, snapshot & 0xffff);
> > > > > > flags = cpu_to_virtio16(_vq->vdev, snapshot >> 16) & 0x3;
> > > > > >   #ifdef DEBUG
> > > > > > @@ -1006,7 +1011,10 @@ static bool 
> > > > > > virtqueue_kick_prepare_packed(struct virtqueue *_vq)
> > > > > > vq->last_add_time_valid = false;
> > > > > >   #endif
> > > > > > -   needs_kick = (flags != VRING_EVENT_F_DISABLE);
> > > > > > +   if (flags == VRING_EVENT_F_DESC)
> > > > > > +   needs_kick = vring_need_event(off_wrap & ~(1<<15), new, 
> > > > > > old);
> > > > > 
> > > > > I wonder whether or not the math is correct. Both new and event are 
> > > > > in the
> > > > > unit of descriptor ring size, but old looks not.
> > > > 
> > > > What vring_need_event() cares is the distance between
> > > > `new` and `old`, i.e. vq->num_added. So I think there
> > > > is nothing wrong with `old`. But the calculation of the
> > > > distance between `new` and `event_idx` isn't right when
> > > > `new` wraps. How do you think about the below code:
> > > > 
> > > > wrap_counter = off_wrap >> 15;
> > > > event_idx = off_wrap & ~(1<<15);
> > > > if (wrap_counter != vq->wrap_counter)
> > > > event_idx -= vq->vring_packed.num;
> > > > 
> > > > needs_kick = vring_need_event(event_idx, new, old);
> > > 
> > > I suspect this hack won't work for non power of 2 ring.
> > 
> > Above code doesn't require the ring size to be a power of 2.
> > 
> > For (__u16)(new_idx - old), what we want to get is vq->num_added.
> > 
> > old = vq->next_avail_idx - vq->num_added;
> > ne

Re: [RFC v3 4/5] virtio_ring: add event idx support in packed ring

2018-05-02 Thread Tiwei Bie
On Wed, May 02, 2018 at 04:51:01PM +0300, Michael S. Tsirkin wrote:
> On Wed, May 02, 2018 at 03:28:19PM +0800, Tiwei Bie wrote:
> > On Wed, May 02, 2018 at 10:51:06AM +0800, Jason Wang wrote:
> > > On 2018年04月25日 13:15, Tiwei Bie wrote:
> > > > This commit introduces the event idx support in packed
> > > > ring. This feature is temporarily disabled, because the
> > > > implementation in this patch may not work as expected,
> > > > and some further discussions on the implementation are
> > > > needed, e.g. do we have to check the wrap counter when
> > > > checking whether a kick is needed?
> > > > 
> > > > Signed-off-by: Tiwei Bie <tiwei@intel.com>
> > > > ---
> > > >   drivers/virtio/virtio_ring.c | 53 
> > > > 
> > > >   1 file changed, 49 insertions(+), 4 deletions(-)
> > > > 
> > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > index 0181e93897be..b1039c2985b9 100644
> > > > --- a/drivers/virtio/virtio_ring.c
> > > > +++ b/drivers/virtio/virtio_ring.c
> > > > @@ -986,7 +986,7 @@ static inline int virtqueue_add_packed(struct 
> > > > virtqueue *_vq,
> > > >   static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
> > > >   {
> > > > struct vring_virtqueue *vq = to_vvq(_vq);
> > > > -   u16 flags;
> > > > +   u16 new, old, off_wrap, flags;
> > > > bool needs_kick;
> > > > u32 snapshot;
> > > > @@ -995,7 +995,12 @@ static bool virtqueue_kick_prepare_packed(struct 
> > > > virtqueue *_vq)
> > > >  * suppressions. */
> > > > virtio_mb(vq->weak_barriers);
> > > > +   old = vq->next_avail_idx - vq->num_added;
> > > > +   new = vq->next_avail_idx;
> > > > +   vq->num_added = 0;
> > > > +
> > > > snapshot = *(u32 *)vq->vring_packed.device;
> > > > off_wrap = virtio16_to_cpu(_vq->vdev, snapshot & 0xffff);
> > > > flags = cpu_to_virtio16(_vq->vdev, snapshot >> 16) & 0x3;
> > > >   #ifdef DEBUG
> > > > @@ -1006,7 +1011,10 @@ static bool virtqueue_kick_prepare_packed(struct 
> > > > virtqueue *_vq)
> > > > vq->last_add_time_valid = false;
> > > >   #endif
> > > > -   needs_kick = (flags != VRING_EVENT_F_DISABLE);
> > > > +   if (flags == VRING_EVENT_F_DESC)
> > > > +   needs_kick = vring_need_event(off_wrap & ~(1<<15), new, 
> > > > old);
> > > 
> > > I wonder whether or not the math is correct. Both new and event are in the
> > > unit of descriptor ring size, but old looks not.
> > 
> > What vring_need_event() cares is the distance between
> > `new` and `old`, i.e. vq->num_added. So I think there
> > is nothing wrong with `old`. But the calculation of the
> > distance between `new` and `event_idx` isn't right when
> > `new` wraps. How do you think about the below code:
> > 
> > wrap_counter = off_wrap >> 15;
> > event_idx = off_wrap & ~(1<<15);
> > if (wrap_counter != vq->wrap_counter)
> > event_idx -= vq->vring_packed.num;
> > 
> > needs_kick = vring_need_event(event_idx, new, old);
> 
> I suspect this hack won't work for non power of 2 ring.

The above code doesn't require the ring size to be a power of 2.

For (__u16)(new_idx - old), what we want to get is vq->num_added.

old = vq->next_avail_idx - vq->num_added;
new = vq->next_avail_idx;

When vq->next_avail_idx >= vq->num_added, it's obvious that,
(__u16)(new_idx - old) is vq->num_added.

And when vq->next_avail_idx < vq->num_added, new will be smaller
than old (old will be a big unsigned number), but (__u16)(new_idx
- old) is still vq->num_added.

For (__u16)(new_idx - event_idx - 1), when new wraps and event_idx
doesn't wrap, the most straightforward way to calculate it is:
(new + vq->vring_packed.num) - event_idx - 1.

But we can also calculate it in this way:

event_idx -= vq->vring_packed.num;
(event_idx will be a big unsigned number)

Then (__u16)(new_idx - event_idx - 1) will be the value we want.
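
A concrete example with made-up numbers (assuming a ring size of 256)
may help:

- For (__u16)(new_idx - old): suppose vq->next_avail_idx = 3 and
  vq->num_added = 10. Then old = (__u16)(3 - 10) = 65529, and
  (__u16)(3 - 65529) = 10, which is exactly vq->num_added.

- For (__u16)(new_idx - event_idx - 1): suppose new = 3 (wrapped) and
  event_idx = 250 (not wrapped, so the wrap counters differ). The
  straightforward calculation is (3 + 256) - 250 - 1 = 8. With the
  trick above, event_idx -= 256 gives (__u16)(250 - 256) = 65530,
  and (__u16)(3 - 65530 - 1) = 8 as well.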

Best regards,
Tiwei Bie

> 
> 
> > Best regards,
> > Tiwei Bie
> > 
> > 
> > > 
> > > Thanks
> > > 
> > > > +   else
>

Re: [RFC V3 PATCH 1/8] vhost: move get_rx_bufs to vhost.c

2018-05-02 Thread Tiwei Bie
On Mon, Apr 23, 2018 at 01:34:53PM +0800, Jason Wang wrote:
> Move get_rx_bufs() to vhost.c and rename it to
> vhost_get_rx_bufs(). This helps to hide vring internal layout from

A small typo. Based on the code change in this patch, it
seems that this function is renamed to vhost_get_bufs().

Thanks


> specific device implementation. Packed ring implementation will
> benefit from this.
> 
> Signed-off-by: Jason Wang 
> ---
>  drivers/vhost/net.c   | 83 
> ++-
>  drivers/vhost/vhost.c | 78 +++
>  drivers/vhost/vhost.h |  7 +
>  3 files changed, 88 insertions(+), 80 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 986058a..762aa81 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -664,83 +664,6 @@ static int vhost_net_rx_peek_head_len(struct vhost_net 
> *net, struct sock *sk)
>   return len;
>  }
>  
> -/* This is a multi-buffer version of vhost_get_desc, that works if
> - *   vq has read descriptors only.
> - * @vq   - the relevant virtqueue
> - * @datalen  - data length we'll be reading
> - * @iovcount - returned count of io vectors we fill
> - * @log  - vhost log
> - * @log_num  - log offset
> - * @quota   - headcount quota, 1 for big buffer
> - *   returns number of buffer heads allocated, negative on error
> - */
> -static int get_rx_bufs(struct vhost_virtqueue *vq,
> -struct vring_used_elem *heads,
> -int datalen,
> -unsigned *iovcount,
> -struct vhost_log *log,
> -unsigned *log_num,
> -unsigned int quota)
> -{
> - unsigned int out, in;
> - int seg = 0;
> - int headcount = 0;
> - unsigned d;
> - int r, nlogs = 0;
> - /* len is always initialized before use since we are always called with
> -  * datalen > 0.
> -  */
> - u32 uninitialized_var(len);
> -
> - while (datalen > 0 && headcount < quota) {
> - if (unlikely(seg >= UIO_MAXIOV)) {
> - r = -ENOBUFS;
> - goto err;
> - }
> - r = vhost_get_vq_desc(vq, vq->iov + seg,
> -   ARRAY_SIZE(vq->iov) - seg, &out,
> -   &in, log, log_num);
> - if (unlikely(r < 0))
> - goto err;
> -
> - d = r;
> - if (d == vq->num) {
> - r = 0;
> - goto err;
> - }
> - if (unlikely(out || in <= 0)) {
> - vq_err(vq, "unexpected descriptor format for RX: "
> - "out %d, in %d\n", out, in);
> - r = -EINVAL;
> - goto err;
> - }
> - if (unlikely(log)) {
> - nlogs += *log_num;
> - log += *log_num;
> - }
> - heads[headcount].id = cpu_to_vhost32(vq, d);
> - len = iov_length(vq->iov + seg, in);
> - heads[headcount].len = cpu_to_vhost32(vq, len);
> - datalen -= len;
> - ++headcount;
> - seg += in;
> - }
> - heads[headcount - 1].len = cpu_to_vhost32(vq, len + datalen);
> - *iovcount = seg;
> - if (unlikely(log))
> - *log_num = nlogs;
> -
> - /* Detect overrun */
> - if (unlikely(datalen > 0)) {
> - r = UIO_MAXIOV + 1;
> - goto err;
> - }
> - return headcount;
> -err:
> - vhost_discard_vq_desc(vq, headcount);
> - return r;
> -}
> -
>  /* Expects to be always run from workqueue - which acts as
>   * read-size critical section for our kind of RCU. */
>  static void handle_rx(struct vhost_net *net)
> @@ -790,9 +713,9 @@ static void handle_rx(struct vhost_net *net)
>   while ((sock_len = vhost_net_rx_peek_head_len(net, sock->sk))) {
>   sock_len += sock_hlen;
>   vhost_len = sock_len + vhost_hlen;
> - headcount = get_rx_bufs(vq, vq->heads + nheads, vhost_len,
> - &in, vq_log, &log,
> - likely(mergeable) ? UIO_MAXIOV : 1);
> + headcount = vhost_get_bufs(vq, vq->heads + nheads, vhost_len,
> + &in, vq_log, &log,
> +likely(mergeable) ? UIO_MAXIOV : 1);
>   /* On error, stop handling until the next kick. */
>   if (unlikely(headcount < 0))
>   goto out;
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index f3bd8e9..6b455f6 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -2097,6 +2097,84 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>  }
>  EXPORT_SYMBOL_GPL(vhost_get_vq_desc);
>  
> +/* This is a multi-buffer version of 

Re: [RFC v3 4/5] virtio_ring: add event idx support in packed ring

2018-05-02 Thread Tiwei Bie
On Wed, May 02, 2018 at 10:51:06AM +0800, Jason Wang wrote:
> On 2018年04月25日 13:15, Tiwei Bie wrote:
> > This commit introduces the event idx support in packed
> > ring. This feature is temporarily disabled, because the
> > implementation in this patch may not work as expected,
> > and some further discussions on the implementation are
> > needed, e.g. do we have to check the wrap counter when
> > checking whether a kick is needed?
> > 
> > Signed-off-by: Tiwei Bie <tiwei@intel.com>
> > ---
> >   drivers/virtio/virtio_ring.c | 53 
> > 
> >   1 file changed, 49 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > index 0181e93897be..b1039c2985b9 100644
> > --- a/drivers/virtio/virtio_ring.c
> > +++ b/drivers/virtio/virtio_ring.c
> > @@ -986,7 +986,7 @@ static inline int virtqueue_add_packed(struct virtqueue 
> > *_vq,
> >   static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
> >   {
> > struct vring_virtqueue *vq = to_vvq(_vq);
> > -   u16 flags;
> > +   u16 new, old, off_wrap, flags;
> > bool needs_kick;
> > u32 snapshot;
> > @@ -995,7 +995,12 @@ static bool virtqueue_kick_prepare_packed(struct 
> > virtqueue *_vq)
> >  * suppressions. */
> > virtio_mb(vq->weak_barriers);
> > +   old = vq->next_avail_idx - vq->num_added;
> > +   new = vq->next_avail_idx;
> > +   vq->num_added = 0;
> > +
> > snapshot = *(u32 *)vq->vring_packed.device;
> > off_wrap = virtio16_to_cpu(_vq->vdev, snapshot & 0xffff);
> > flags = cpu_to_virtio16(_vq->vdev, snapshot >> 16) & 0x3;
> >   #ifdef DEBUG
> > @@ -1006,7 +1011,10 @@ static bool virtqueue_kick_prepare_packed(struct 
> > virtqueue *_vq)
> > vq->last_add_time_valid = false;
> >   #endif
> > -   needs_kick = (flags != VRING_EVENT_F_DISABLE);
> > +   if (flags == VRING_EVENT_F_DESC)
> > +   needs_kick = vring_need_event(off_wrap & ~(1<<15), new, old);
> 
> I wonder whether or not the math is correct. Both new and event are in the
> unit of descriptor ring size, but old looks not.

What vring_need_event() cares about is the distance between
`new` and `old`, i.e. vq->num_added. So I think there is
nothing wrong with `old`. But the calculation of the distance
between `new` and `event_idx` isn't right when `new` wraps.
What do you think about the code below:

wrap_counter = off_wrap >> 15;
event_idx = off_wrap & ~(1<<15);
if (wrap_counter != vq->wrap_counter)
event_idx -= vq->vring_packed.num;

needs_kick = vring_need_event(event_idx, new, old);
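
For illustration only, here is a self-contained sketch (not part of
the patch) that puts this adjustment next to vring_need_event() as
defined in the uapi header. The helper name needs_kick() and the
worked numbers in main() are made up for the example:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Same check as in include/uapi/linux/virtio_ring.h. */
static bool vring_need_event(uint16_t event_idx, uint16_t new_idx,
                             uint16_t old)
{
        return (uint16_t)(new_idx - event_idx - 1) <
               (uint16_t)(new_idx - old);
}

/* Proposed check: pull event_idx back by the ring size when the
 * device's wrap bit (bit 15 of off_wrap) differs from the driver's
 * current wrap counter, so the distance math stays correct after
 * `new` has wrapped past the end of the descriptor ring. */
static bool needs_kick(uint16_t off_wrap, uint16_t new_idx, uint16_t old,
                       bool drv_wrap_counter, uint16_t ring_num)
{
        bool dev_wrap_counter = off_wrap >> 15;
        uint16_t event_idx = off_wrap & ~(1 << 15);

        if (dev_wrap_counter != drv_wrap_counter)
                event_idx -= ring_num;

        return vring_need_event(event_idx, new_idx, old);
}

int main(void)
{
        /* Made-up scenario: ring size 256; the driver has just wrapped
         * (wrap counter now 1) and added 4 descriptors ending at slot 2,
         * so new = 2 and old = (uint16_t)(2 - 4) = 65534. The device
         * asked for an event at slot 254 of the previous wrap (wrap bit
         * 0), i.e. off_wrap = 254. */
        uint16_t off_wrap = 254, new_idx = 2, old = (uint16_t)(2 - 4);

        /* The unadjusted check misses the kick; the adjusted one kicks. */
        printf("unadjusted: %d\n", vring_need_event(254, new_idx, old));
        printf("adjusted:   %d\n",
               needs_kick(off_wrap, new_idx, old, true, 256));
        return 0;
}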

Best regards,
Tiwei Bie


> 
> Thanks
> 
> > +   else
> > +   needs_kick = (flags != VRING_EVENT_F_DISABLE);
> > END_USE(vq);
> > return needs_kick;
> >   }
> > @@ -1116,6 +1124,15 @@ static void *virtqueue_get_buf_ctx_packed(struct 
> > virtqueue *_vq,
> > if (vq->last_used_idx >= vq->vring_packed.num)
> > vq->last_used_idx -= vq->vring_packed.num;
> > +   /* If we expect an interrupt for the next entry, tell host
> > +* by writing event index and flush out the write before
> > +* the read in the next get_buf call. */
> > +   if (vq->event_flags_shadow == VRING_EVENT_F_DESC)
> > +   virtio_store_mb(vq->weak_barriers,
> > +   >vring_packed.driver->off_wrap,
> > +   cpu_to_virtio16(_vq->vdev, vq->last_used_idx |
> > +   (vq->wrap_counter << 15)));
> > +
> >   #ifdef DEBUG
> > vq->last_add_time_valid = false;
> >   #endif
> > @@ -1143,10 +1160,17 @@ static unsigned 
> > virtqueue_enable_cb_prepare_packed(struct virtqueue *_vq)
> > /* We optimistically turn back on interrupts, then check if there was
> >  * more to do. */
> > +   /* Depending on the VIRTIO_RING_F_USED_EVENT_IDX feature, we need to
> > +* either clear the flags bit or point the event index at the next
> > +* entry. Always update the event index to keep code simple. */
> > +
> > +   vq->vring_packed.driver->off_wrap = cpu_to_virtio16(_vq->vdev,
> > +   vq->last_used_idx | (vq->wrap_counter << 15));
> > if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
> > virtio_wmb(vq->weak_barriers);
> > -   vq->event_flags_shado

Re: [RFC v3 0/5] virtio: support packed ring

2018-04-27 Thread Tiwei Bie
On Fri, Apr 27, 2018 at 02:17:51PM +0800, Jason Wang wrote:
> On 2018年04月27日 12:18, Michael S. Tsirkin wrote:
> > On Fri, Apr 27, 2018 at 11:56:05AM +0800, Jason Wang wrote:
> > > On 2018年04月25日 13:15, Tiwei Bie wrote:
> > > > Hello everyone,
> > > > 
> > > > This RFC implements packed ring support in virtio driver.
> > > > 
> > > > Some simple functional tests have been done with Jason's
> > > > packed ring implementation in vhost:
> > > > 
> > > > https://lkml.org/lkml/2018/4/23/12
> > > > 
> > > > Both of ping and netperf worked as expected (with EVENT_IDX
> > > > disabled). But there are below known issues:
> > > > 
> > > > 1. Reloading the guest driver will break the Tx/Rx;
> > > Will have a look at this issue.
> > > 
> > > > 2. Zeroing the flags when detaching a used desc will
> > > >  break the guest -> host path.
> > > I still think zeroing flags is unnecessary or even a bug. At host, I track
> > > last observed avail wrap counter and detect avail like (what is suggested 
> > > in
> > > the example code in the spec):
> > > 
> > > static bool desc_is_avail(struct vhost_virtqueue *vq, __virtio16 flags)
> > > {
> > >     bool avail = flags & cpu_to_vhost16(vq, DESC_AVAIL);
> > > 
> > >     return avail == vq->avail_wrap_counter;
> > > }
> > > 
> > > So zeroing wrap can not work with this obviously.
> > > 
> > > Thanks
> > I agree. I think what one should do is flip the available bit.
> > 
> 
> But is this flipping a must?
> 
> Thanks

Yeah, that's my question too. It seems to be a requirement
for the driver that the only change to the desc status it can
make while running is to mark the desc as avail; any other
changes to the desc status are not allowed. Similarly, the
device can only mark the desc as used, and any other changes
to the desc status are also not allowed. So the question is,
do such requirements exist?

Based on below contents in the spec:

"""
Thus VIRTQ_DESC_F_AVAIL and VIRTQ_DESC_F_USED bits are different
for an available descriptor and equal for a used descriptor.

Note that this observation is mostly useful for sanity-checking
as these are necessary but not sufficient conditions
"""

It seems that it's necessary for devices to check whether
the AVAIL bit and the USED bit are different.
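
As a sketch of what such a device-side check could look like (this is
just an illustration, not code from this series; the bit positions
follow the VRING_DESC_F_AVAIL/USED macros in the uapi patch, i.e.
bits 7 and 15):

#include <stdbool.h>
#include <stdint.h>

#define PACKED_DESC_F_AVAIL     (1 << 7)
#define PACKED_DESC_F_USED      (1 << 15)

/* A descriptor is available to the device when its AVAIL bit matches
 * the wrap counter the device expects AND the AVAIL and USED bits
 * differ; the second condition is the sanity check quoted above. */
static bool desc_is_avail_strict(uint16_t flags, bool avail_wrap_counter)
{
        bool avail = flags & PACKED_DESC_F_AVAIL;
        bool used = flags & PACKED_DESC_F_USED;

        return avail != used && avail == avail_wrap_counter;
}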

Best regards,
Tiwei Bie


[RFC v3 1/5] virtio: add packed ring definitions

2018-04-24 Thread Tiwei Bie
Signed-off-by: Tiwei Bie <tiwei@intel.com>
---
 include/uapi/linux/virtio_config.h | 12 +++-
 include/uapi/linux/virtio_ring.h   | 36 
 2 files changed, 47 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/virtio_config.h 
b/include/uapi/linux/virtio_config.h
index 308e2096291f..a6e392325e3a 100644
--- a/include/uapi/linux/virtio_config.h
+++ b/include/uapi/linux/virtio_config.h
@@ -49,7 +49,7 @@
  * transport being used (eg. virtio_ring), the rest are per-device feature
  * bits. */
 #define VIRTIO_TRANSPORT_F_START   28
-#define VIRTIO_TRANSPORT_F_END 34
+#define VIRTIO_TRANSPORT_F_END 36
 
 #ifndef VIRTIO_CONFIG_NO_LEGACY
 /* Do we get callbacks when the ring is completely used, even if we've
@@ -71,4 +71,14 @@
  * this is for compatibility with legacy systems.
  */
 #define VIRTIO_F_IOMMU_PLATFORM33
+
+/* This feature indicates support for the packed virtqueue layout. */
+#define VIRTIO_F_RING_PACKED   34
+
+/*
+ * This feature indicates that all buffers are used by the device
+ * in the same order in which they have been made available.
+ */
+#define VIRTIO_F_IN_ORDER  35
+
 #endif /* _UAPI_LINUX_VIRTIO_CONFIG_H */
diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h
index 6d5d5faa989b..3932cb80c347 100644
--- a/include/uapi/linux/virtio_ring.h
+++ b/include/uapi/linux/virtio_ring.h
@@ -44,6 +44,9 @@
 /* This means the buffer contains a list of buffer descriptors. */
 #define VRING_DESC_F_INDIRECT  4
 
+#define VRING_DESC_F_AVAIL(b)  ((b) << 7)
+#define VRING_DESC_F_USED(b)   ((b) << 15)
+
 /* The Host uses this in used->flags to advise the Guest: don't kick me when
  * you add a buffer.  It's unreliable, so it's simply an optimization.  Guest
  * will still kick if it's out of buffers. */
@@ -53,6 +56,10 @@
  * optimization.  */
 #define VRING_AVAIL_F_NO_INTERRUPT 1
 
+#define VRING_EVENT_F_ENABLE   0x0
+#define VRING_EVENT_F_DISABLE  0x1
+#define VRING_EVENT_F_DESC 0x2
+
 /* We support indirect buffer descriptors */
 #define VIRTIO_RING_F_INDIRECT_DESC28
 
@@ -171,4 +178,33 @@ static inline int vring_need_event(__u16 event_idx, __u16 
new_idx, __u16 old)
return (__u16)(new_idx - event_idx - 1) < (__u16)(new_idx - old);
 }
 
+struct vring_packed_desc_event {
+   /* __virtio16 off  : 15; // Descriptor Event Offset
+* __virtio16 wrap : 1;  // Descriptor Event Wrap Counter */
+   __virtio16 off_wrap;
+   /* __virtio16 flags : 2; // Descriptor Event Flags */
+   __virtio16 flags;
+};
+
+struct vring_packed_desc {
+   /* Buffer Address. */
+   __virtio64 addr;
+   /* Buffer Length. */
+   __virtio32 len;
+   /* Buffer ID. */
+   __virtio16 id;
+   /* The flags depending on descriptor type. */
+   __virtio16 flags;
+};
+
+struct vring_packed {
+   unsigned int num;
+
+   struct vring_packed_desc *desc;
+
+   struct vring_packed_desc_event *driver;
+
+   struct vring_packed_desc_event *device;
+};
+
 #endif /* _UAPI_LINUX_VIRTIO_RING_H */
-- 
2.11.0



[RFC v3 0/5] virtio: support packed ring

2018-04-24 Thread Tiwei Bie
Hello everyone,

This RFC implements packed ring support in virtio driver.

Some simple functional tests have been done with Jason's
packed ring implementation in vhost:

https://lkml.org/lkml/2018/4/23/12

Both ping and netperf worked as expected (with EVENT_IDX
disabled). But there are the following known issues:

1. Reloading the guest driver will break the Tx/Rx;
2. Zeroing the flags when detaching a used desc will
   break the guest -> host path.

Some simple functional tests have also been done with
Wei's packed ring implementation in QEMU:

http://lists.nongnu.org/archive/html/qemu-devel/2018-04/msg00342.html

Both ping and netperf worked as expected (with EVENT_IDX
disabled). Reloading the guest driver also worked as expected.

TODO:
- Refinements (for code and commit log) and bug fixes;
- Discuss/fix/test EVENT_IDX support;
- Test devices other than net;

RFC v2 -> RFC v3:
- Split into small patches (Jason);
- Add helper virtqueue_use_indirect() (Jason);
- Just set id for the last descriptor of a list (Jason);
- Calculate the prev in virtqueue_add_packed() (Jason);
- Fix/improve desc suppression code (Jason/MST);
- Refine the code layout for XXX_split/packed and wrappers (MST);
- Fix the comments and API in uapi (MST);
- Remove the BUG_ON() for indirect (Jason);
- Some other refinements and bug fixes;

RFC v1 -> RFC v2:
- Add indirect descriptor support - compile test only;
- Add event suppression support - compile test only;
- Move vring_packed_init() out of uapi (Jason, MST);
- Merge two loops into one in virtqueue_add_packed() (Jason);
- Split vring_unmap_one() for packed ring and split ring (Jason);
- Avoid using '%' operator (Jason);
- Rename free_head -> next_avail_idx (Jason);
- Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
- Some other refinements and bug fixes;

Thanks!

Tiwei Bie (5):
  virtio: add packed ring definitions
  virtio_ring: support creating packed ring
  virtio_ring: add packed ring support
  virtio_ring: add event idx support in packed ring
  virtio_ring: enable packed ring

 drivers/virtio/virtio_ring.c   | 1271 
 include/linux/virtio_ring.h|8 +-
 include/uapi/linux/virtio_config.h |   12 +-
 include/uapi/linux/virtio_ring.h   |   36 +
 4 files changed, 1049 insertions(+), 278 deletions(-)

-- 
2.11.0



[RFC v3 2/5] virtio_ring: support creating packed ring

2018-04-24 Thread Tiwei Bie
This commit introduces the support for creating the packed ring.
All split ring specific functions are given a _split suffix.
Some necessary stubs for the packed ring are also added.

Signed-off-by: Tiwei Bie <tiwei@intel.com>
---
 drivers/virtio/virtio_ring.c | 764 ---
 include/linux/virtio_ring.h  |   8 +-
 2 files changed, 513 insertions(+), 259 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 71458f493cf8..e164822ca66e 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -64,8 +64,8 @@ struct vring_desc_state {
 struct vring_virtqueue {
struct virtqueue vq;
 
-   /* Actual memory layout for this queue */
-   struct vring vring;
+   /* Is this a packed ring? */
+   bool packed;
 
/* Can we use weak barriers? */
bool weak_barriers;
@@ -79,19 +79,45 @@ struct vring_virtqueue {
/* Host publishes avail event idx */
bool event;
 
-   /* Head of free buffer list. */
-   unsigned int free_head;
/* Number we've added since last sync. */
unsigned int num_added;
 
/* Last used index we've seen. */
u16 last_used_idx;
 
-   /* Last written value to avail->flags */
-   u16 avail_flags_shadow;
+   union {
+   /* Available for split ring */
+   struct {
+   /* Actual memory layout for this queue. */
+   struct vring vring;
 
-   /* Last written value to avail->idx in guest byte order */
-   u16 avail_idx_shadow;
+   /* Head of free buffer list. */
+   unsigned int free_head;
+
+   /* Last written value to avail->flags */
+   u16 avail_flags_shadow;
+
+   /* Last written value to avail->idx in
+* guest byte order. */
+   u16 avail_idx_shadow;
+   };
+
+   /* Available for packed ring */
+   struct {
+   /* Actual memory layout for this queue. */
+   struct vring_packed vring_packed;
+
+   /* Driver ring wrap counter. */
+   u8 wrap_counter;
+
+   /* Index of the next avail descriptor. */
+   unsigned int next_avail_idx;
+
+   /* Last written value to driver->flags in
+* guest byte order. */
+   u16 event_flags_shadow;
+   };
+   };
 
/* How to notify other side. FIXME: commonalize hcalls! */
bool (*notify)(struct virtqueue *vq);
@@ -201,8 +227,17 @@ static dma_addr_t vring_map_single(const struct 
vring_virtqueue *vq,
  cpu_addr, size, direction);
 }
 
-static void vring_unmap_one(const struct vring_virtqueue *vq,
-   struct vring_desc *desc)
+static int vring_mapping_error(const struct vring_virtqueue *vq,
+  dma_addr_t addr)
+{
+   if (!vring_use_dma_api(vq->vq.vdev))
+   return 0;
+
+   return dma_mapping_error(vring_dma_dev(vq), addr);
+}
+
+static void vring_unmap_one_split(const struct vring_virtqueue *vq,
+ struct vring_desc *desc)
 {
u16 flags;
 
@@ -226,17 +261,9 @@ static void vring_unmap_one(const struct vring_virtqueue 
*vq,
}
 }
 
-static int vring_mapping_error(const struct vring_virtqueue *vq,
-  dma_addr_t addr)
-{
-   if (!vring_use_dma_api(vq->vq.vdev))
-   return 0;
-
-   return dma_mapping_error(vring_dma_dev(vq), addr);
-}
-
-static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
-unsigned int total_sg, gfp_t gfp)
+static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq,
+  unsigned int total_sg,
+  gfp_t gfp)
 {
struct vring_desc *desc;
unsigned int i;
@@ -257,14 +284,14 @@ static struct vring_desc *alloc_indirect(struct virtqueue 
*_vq,
return desc;
 }
 
-static inline int virtqueue_add(struct virtqueue *_vq,
-   struct scatterlist *sgs[],
-   unsigned int total_sg,
-   unsigned int out_sgs,
-   unsigned int in_sgs,
-   void *data,
-   void *ctx,
-   gfp_t gfp)
+static inline int virtqueue_add_split(struct virtqueue *_vq,
+ struct scatterlist *sgs[],
+ unsigned int total_sg,
+ unsigned int out_sgs,
+ 

[RFC v3 4/5] virtio_ring: add event idx support in packed ring

2018-04-24 Thread Tiwei Bie
This commit introduces the event idx support in packed
ring. This feature is temporarily disabled, because the
implementation in this patch may not work as expected,
and some further discussions on the implementation are
needed, e.g. do we have to check the wrap counter when
checking whether a kick is needed?

Signed-off-by: Tiwei Bie <tiwei@intel.com>
---
 drivers/virtio/virtio_ring.c | 53 
 1 file changed, 49 insertions(+), 4 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 0181e93897be..b1039c2985b9 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -986,7 +986,7 @@ static inline int virtqueue_add_packed(struct virtqueue 
*_vq,
 static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
 {
struct vring_virtqueue *vq = to_vvq(_vq);
-   u16 flags;
+   u16 new, old, off_wrap, flags;
bool needs_kick;
u32 snapshot;
 
@@ -995,7 +995,12 @@ static bool virtqueue_kick_prepare_packed(struct virtqueue 
*_vq)
 * suppressions. */
virtio_mb(vq->weak_barriers);
 
+   old = vq->next_avail_idx - vq->num_added;
+   new = vq->next_avail_idx;
+   vq->num_added = 0;
+
snapshot = *(u32 *)vq->vring_packed.device;
+   off_wrap = virtio16_to_cpu(_vq->vdev, snapshot & 0x);
flags = cpu_to_virtio16(_vq->vdev, snapshot >> 16) & 0x3;
 
 #ifdef DEBUG
@@ -1006,7 +1011,10 @@ static bool virtqueue_kick_prepare_packed(struct 
virtqueue *_vq)
vq->last_add_time_valid = false;
 #endif
 
-   needs_kick = (flags != VRING_EVENT_F_DISABLE);
+   if (flags == VRING_EVENT_F_DESC)
+   needs_kick = vring_need_event(off_wrap & ~(1<<15), new, old);
+   else
+   needs_kick = (flags != VRING_EVENT_F_DISABLE);
END_USE(vq);
return needs_kick;
 }
@@ -1116,6 +1124,15 @@ static void *virtqueue_get_buf_ctx_packed(struct 
virtqueue *_vq,
if (vq->last_used_idx >= vq->vring_packed.num)
vq->last_used_idx -= vq->vring_packed.num;
 
+   /* If we expect an interrupt for the next entry, tell host
+* by writing event index and flush out the write before
+* the read in the next get_buf call. */
+   if (vq->event_flags_shadow == VRING_EVENT_F_DESC)
+   virtio_store_mb(vq->weak_barriers,
+   >vring_packed.driver->off_wrap,
+   cpu_to_virtio16(_vq->vdev, vq->last_used_idx |
+   (vq->wrap_counter << 15)));
+
 #ifdef DEBUG
vq->last_add_time_valid = false;
 #endif
@@ -1143,10 +1160,17 @@ static unsigned 
virtqueue_enable_cb_prepare_packed(struct virtqueue *_vq)
 
/* We optimistically turn back on interrupts, then check if there was
 * more to do. */
+   /* Depending on the VIRTIO_RING_F_USED_EVENT_IDX feature, we need to
+* either clear the flags bit or point the event index at the next
+* entry. Always update the event index to keep code simple. */
+
+   vq->vring_packed.driver->off_wrap = cpu_to_virtio16(_vq->vdev,
+   vq->last_used_idx | (vq->wrap_counter << 15));
 
if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
virtio_wmb(vq->weak_barriers);
-   vq->event_flags_shadow = VRING_EVENT_F_ENABLE;
+   vq->event_flags_shadow = vq->event ? VRING_EVENT_F_DESC :
+VRING_EVENT_F_ENABLE;
vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
vq->event_flags_shadow);
}
@@ -1172,15 +1196,34 @@ static bool virtqueue_poll_packed(struct virtqueue 
*_vq, unsigned last_used_idx)
 static bool virtqueue_enable_cb_delayed_packed(struct virtqueue *_vq)
 {
struct vring_virtqueue *vq = to_vvq(_vq);
+   u16 bufs, used_idx, wrap_counter;
 
START_USE(vq);
 
/* We optimistically turn back on interrupts, then check if there was
 * more to do. */
+   /* Depending on the VIRTIO_RING_F_USED_EVENT_IDX feature, we need to
+* either clear the flags bit or point the event index at the next
+* entry. Always update the event index to keep code simple. */
+
+   /* TODO: tune this threshold */
+   bufs = (u16)(vq->next_avail_idx - vq->last_used_idx) * 3 / 4;
+
+   used_idx = vq->last_used_idx + bufs;
+   wrap_counter = vq->wrap_counter;
+
+   if (used_idx >= vq->vring_packed.num) {
+   used_idx -= vq->vring_packed.num;
+   wrap_counter ^= 1;
+   }
+
+   vq->vring_packed.driver->off_wrap = cpu_to_virtio16(_vq->vdev,
+   used

[RFC v3 5/5] virtio_ring: enable packed ring

2018-04-24 Thread Tiwei Bie
Signed-off-by: Tiwei Bie <tiwei@intel.com>
---
 drivers/virtio/virtio_ring.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index b1039c2985b9..9a3d13e1e2ba 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -1873,6 +1873,8 @@ void vring_transport_features(struct virtio_device *vdev)
break;
case VIRTIO_F_IOMMU_PLATFORM:
break;
+   case VIRTIO_F_RING_PACKED:
+   break;
default:
/* We don't understand this bit. */
__virtio_clear_bit(vdev, i);
-- 
2.11.0



[RFC v3 3/5] virtio_ring: add packed ring support

2018-04-24 Thread Tiwei Bie
This commit introduces the basic support (without EVENT_IDX)
for packed ring.

Signed-off-by: Tiwei Bie <tiwei@intel.com>
---
 drivers/virtio/virtio_ring.c | 444 ++-
 1 file changed, 434 insertions(+), 10 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index e164822ca66e..0181e93897be 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -58,7 +58,8 @@
 
 struct vring_desc_state {
void *data; /* Data for callback. */
-   struct vring_desc *indir_desc;  /* Indirect descriptor, if any. */
+   void *indir_desc;   /* Indirect descriptor, if any. */
+   int num;/* Descriptor list length. */
 };
 
 struct vring_virtqueue {
@@ -142,6 +143,16 @@ struct vring_virtqueue {
 
 #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
 
+static inline bool virtqueue_use_indirect(struct virtqueue *_vq,
+ unsigned int total_sg)
+{
+   struct vring_virtqueue *vq = to_vvq(_vq);
+
+   /* If the host supports indirect descriptor tables, and we have multiple
+* buffers, then go indirect. FIXME: tune this threshold */
+   return (vq->indirect && total_sg > 1 && vq->vq.num_free);
+}
+
 /*
  * Modern virtio devices have feature bits to specify whether they need a
  * quirk and bypass the IOMMU. If not there, just use the DMA API.
@@ -327,9 +338,7 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
 
head = vq->free_head;
 
-   /* If the host supports indirect descriptor tables, and we have multiple
-* buffers, then go indirect. FIXME: tune this threshold */
-   if (vq->indirect && total_sg > 1 && vq->vq.num_free)
+   if (virtqueue_use_indirect(_vq, total_sg))
desc = alloc_indirect_split(_vq, total_sg, gfp);
else {
desc = NULL;
@@ -741,6 +750,49 @@ static inline unsigned vring_size_packed(unsigned int num, 
unsigned long align)
& ~(align - 1)) + sizeof(struct vring_packed_desc_event) * 2;
 }
 
+static void vring_unmap_one_packed(const struct vring_virtqueue *vq,
+  struct vring_packed_desc *desc)
+{
+   u16 flags;
+
+   if (!vring_use_dma_api(vq->vq.vdev))
+   return;
+
+   flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
+
+   if (flags & VRING_DESC_F_INDIRECT) {
+   dma_unmap_single(vring_dma_dev(vq),
+virtio64_to_cpu(vq->vq.vdev, desc->addr),
+virtio32_to_cpu(vq->vq.vdev, desc->len),
+(flags & VRING_DESC_F_WRITE) ?
+DMA_FROM_DEVICE : DMA_TO_DEVICE);
+   } else {
+   dma_unmap_page(vring_dma_dev(vq),
+  virtio64_to_cpu(vq->vq.vdev, desc->addr),
+  virtio32_to_cpu(vq->vq.vdev, desc->len),
+  (flags & VRING_DESC_F_WRITE) ?
+  DMA_FROM_DEVICE : DMA_TO_DEVICE);
+   }
+}
+
+static struct vring_packed_desc *alloc_indirect_packed(struct virtqueue *_vq,
+  unsigned int total_sg,
+  gfp_t gfp)
+{
+   struct vring_packed_desc *desc;
+
+   /*
+* We require lowmem mappings for the descriptors because
+* otherwise virt_to_phys will give us bogus addresses in the
+* virtqueue.
+*/
+   gfp &= ~__GFP_HIGHMEM;
+
+   desc = kmalloc(total_sg * sizeof(struct vring_packed_desc), gfp);
+
+   return desc;
+}
+
 static inline int virtqueue_add_packed(struct virtqueue *_vq,
   struct scatterlist *sgs[],
   unsigned int total_sg,
@@ -750,47 +802,419 @@ static inline int virtqueue_add_packed(struct virtqueue 
*_vq,
   void *ctx,
   gfp_t gfp)
 {
+   struct vring_virtqueue *vq = to_vvq(_vq);
+   struct vring_packed_desc *desc;
+   struct scatterlist *sg;
+   unsigned int i, n, descs_used, uninitialized_var(prev), err_idx;
+   __virtio16 uninitialized_var(head_flags), flags;
+   int head, wrap_counter;
+   bool indirect;
+
+   START_USE(vq);
+
+   BUG_ON(data == NULL);
+   BUG_ON(ctx && vq->indirect);
+
+   if (unlikely(vq->broken)) {
+   END_USE(vq);
+   return -EIO;
+   }
+
+#ifdef DEBUG
+   {
+   ktime_t now = ktime_get();
+
+   /* No kick or get, with .1 second between?  Warn. */
+   if (vq->last_add_time_valid)
+   WARN_ON(kti

Re: [RFC v2] virtio: support packed ring

2018-04-23 Thread Tiwei Bie
On Tue, Apr 24, 2018 at 04:43:22AM +0300, Michael S. Tsirkin wrote:
> On Tue, Apr 24, 2018 at 09:37:47AM +0800, Tiwei Bie wrote:
> > On Tue, Apr 24, 2018 at 04:29:51AM +0300, Michael S. Tsirkin wrote:
> > > On Tue, Apr 24, 2018 at 09:16:38AM +0800, Tiwei Bie wrote:
> > > > On Tue, Apr 24, 2018 at 04:05:07AM +0300, Michael S. Tsirkin wrote:
> > > > > On Tue, Apr 24, 2018 at 08:54:52AM +0800, Jason Wang wrote:
> > > > > > 
> > > > > > 
> > > > > > On 2018年04月23日 17:29, Tiwei Bie wrote:
> > > > > > > On Mon, Apr 23, 2018 at 01:42:14PM +0800, Jason Wang wrote:
> > > > > > > > On 2018年04月01日 22:12, Tiwei Bie wrote:
> > > > > > > > > Hello everyone,
> > > > > > > > > 
> > > > > > > > > This RFC implements packed ring support for virtio driver.
> > > > > > > > > 
> > > > > > > > > The code was tested with DPDK vhost (testpmd/vhost-PMD) 
> > > > > > > > > implemented
> > > > > > > > > by Jens at 
> > > > > > > > > http://dpdk.org/ml/archives/dev/2018-January/089417.html
> > > > > > > > > Minor changes are needed for the vhost code, e.g. to kick the 
> > > > > > > > > guest.
> > > > > > > > > 
> > > > > > > > > TODO:
> > > > > > > > > - Refinements and bug fixes;
> > > > > > > > > - Split into small patches;
> > > > > > > > > - Test indirect descriptor support;
> > > > > > > > > - Test/fix event suppression support;
> > > > > > > > > - Test devices other than net;
> > > > > > > > > 
> > > > > > > > > RFC v1 -> RFC v2:
> > > > > > > > > - Add indirect descriptor support - compile test only;
> > > > > > > > > - Add event suppression supprt - compile test only;
> > > > > > > > > - Move vring_packed_init() out of uapi (Jason, MST);
> > > > > > > > > - Merge two loops into one in virtqueue_add_packed() (Jason);
> > > > > > > > > - Split vring_unmap_one() for packed ring and split ring 
> > > > > > > > > (Jason);
> > > > > > > > > - Avoid using '%' operator (Jason);
> > > > > > > > > - Rename free_head -> next_avail_idx (Jason);
> > > > > > > > > - Add comments for virtio_wmb() in virtqueue_add_packed() 
> > > > > > > > > (Jason);
> > > > > > > > > - Some other refinements and bug fixes;
> > > > > > > > > 
> > > > > > > > > Thanks!
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Tiwei Bie <tiwei@intel.com>
> > > > > > > > > ---
> > > > > > > > >drivers/virtio/virtio_ring.c   | 1094 
> > > > > > > > > +---
> > > > > > > > >include/linux/virtio_ring.h|8 +-
> > > > > > > > >include/uapi/linux/virtio_config.h |   12 +-
> > > > > > > > >include/uapi/linux/virtio_ring.h   |   61 ++
> > > > > > > > >4 files changed, 980 insertions(+), 195 deletions(-)
> > > > > > > > > 
> > > > > > > > > diff --git a/drivers/virtio/virtio_ring.c 
> > > > > > > > > b/drivers/virtio/virtio_ring.c
> > > > > > > > > index 71458f493cf8..0515dca34d77 100644
> > > > > > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > > > > > @@ -58,14 +58,15 @@
> > > > > > > > [...]
> > > > > > > > 
> > > > > > > > > +
> > > > > > > > > + if (vq->indirect) {
> > > > > > > > > + u32 len;
> > > > > > > > > +
> > > > > > > > > + desc = vq->desc_state[head].indir_desc;
> > > > > > > > > + /* Free the indirect table, if any, now that 
> > > > >

Re: [RFC v2] virtio: support packed ring

2018-04-23 Thread Tiwei Bie
On Tue, Apr 24, 2018 at 04:29:51AM +0300, Michael S. Tsirkin wrote:
> On Tue, Apr 24, 2018 at 09:16:38AM +0800, Tiwei Bie wrote:
> > On Tue, Apr 24, 2018 at 04:05:07AM +0300, Michael S. Tsirkin wrote:
> > > On Tue, Apr 24, 2018 at 08:54:52AM +0800, Jason Wang wrote:
> > > > 
> > > > 
> > > > On 2018年04月23日 17:29, Tiwei Bie wrote:
> > > > > On Mon, Apr 23, 2018 at 01:42:14PM +0800, Jason Wang wrote:
> > > > > > On 2018年04月01日 22:12, Tiwei Bie wrote:
> > > > > > > Hello everyone,
> > > > > > > 
> > > > > > > This RFC implements packed ring support for virtio driver.
> > > > > > > 
> > > > > > > The code was tested with DPDK vhost (testpmd/vhost-PMD) 
> > > > > > > implemented
> > > > > > > by Jens at 
> > > > > > > http://dpdk.org/ml/archives/dev/2018-January/089417.html
> > > > > > > Minor changes are needed for the vhost code, e.g. to kick the 
> > > > > > > guest.
> > > > > > > 
> > > > > > > TODO:
> > > > > > > - Refinements and bug fixes;
> > > > > > > - Split into small patches;
> > > > > > > - Test indirect descriptor support;
> > > > > > > - Test/fix event suppression support;
> > > > > > > - Test devices other than net;
> > > > > > > 
> > > > > > > RFC v1 -> RFC v2:
> > > > > > > - Add indirect descriptor support - compile test only;
> > > > > > > - Add event suppression supprt - compile test only;
> > > > > > > - Move vring_packed_init() out of uapi (Jason, MST);
> > > > > > > - Merge two loops into one in virtqueue_add_packed() (Jason);
> > > > > > > - Split vring_unmap_one() for packed ring and split ring (Jason);
> > > > > > > - Avoid using '%' operator (Jason);
> > > > > > > - Rename free_head -> next_avail_idx (Jason);
> > > > > > > - Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
> > > > > > > - Some other refinements and bug fixes;
> > > > > > > 
> > > > > > > Thanks!
> > > > > > > 
> > > > > > > Signed-off-by: Tiwei Bie <tiwei@intel.com>
> > > > > > > ---
> > > > > > >drivers/virtio/virtio_ring.c   | 1094 
> > > > > > > +---
> > > > > > >include/linux/virtio_ring.h|8 +-
> > > > > > >include/uapi/linux/virtio_config.h |   12 +-
> > > > > > >include/uapi/linux/virtio_ring.h   |   61 ++
> > > > > > >4 files changed, 980 insertions(+), 195 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/drivers/virtio/virtio_ring.c 
> > > > > > > b/drivers/virtio/virtio_ring.c
> > > > > > > index 71458f493cf8..0515dca34d77 100644
> > > > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > > > @@ -58,14 +58,15 @@
> > > > > > [...]
> > > > > > 
> > > > > > > +
> > > > > > > + if (vq->indirect) {
> > > > > > > + u32 len;
> > > > > > > +
> > > > > > > + desc = vq->desc_state[head].indir_desc;
> > > > > > > + /* Free the indirect table, if any, now that it's 
> > > > > > > unmapped. */
> > > > > > > + if (!desc)
> > > > > > > + goto out;
> > > > > > > +
> > > > > > > + len = virtio32_to_cpu(vq->vq.vdev,
> > > > > > > +   vq->vring_packed.desc[head].len);
> > > > > > > +
> > > > > > > + BUG_ON(!(vq->vring_packed.desc[head].flags &
> > > > > > > +  cpu_to_virtio16(vq->vq.vdev, 
> > > > > > > VRING_DESC_F_INDIRECT)));
> > > > > > It looks to me spec does not force to keep VRING_DESC_F_INDIRECT 
> > > > > > here. So we
> > > > > > can safely remove this BUG_ON() here.
> > > >

Re: [RFC v2] virtio: support packed ring

2018-04-23 Thread Tiwei Bie
On Tue, Apr 24, 2018 at 04:05:07AM +0300, Michael S. Tsirkin wrote:
> On Tue, Apr 24, 2018 at 08:54:52AM +0800, Jason Wang wrote:
> > 
> > 
> > On 2018年04月23日 17:29, Tiwei Bie wrote:
> > > On Mon, Apr 23, 2018 at 01:42:14PM +0800, Jason Wang wrote:
> > > > On 2018年04月01日 22:12, Tiwei Bie wrote:
> > > > > Hello everyone,
> > > > > 
> > > > > This RFC implements packed ring support for virtio driver.
> > > > > 
> > > > > The code was tested with DPDK vhost (testpmd/vhost-PMD) implemented
> > > > > by Jens at http://dpdk.org/ml/archives/dev/2018-January/089417.html
> > > > > Minor changes are needed for the vhost code, e.g. to kick the guest.
> > > > > 
> > > > > TODO:
> > > > > - Refinements and bug fixes;
> > > > > - Split into small patches;
> > > > > - Test indirect descriptor support;
> > > > > - Test/fix event suppression support;
> > > > > - Test devices other than net;
> > > > > 
> > > > > RFC v1 -> RFC v2:
> > > > > - Add indirect descriptor support - compile test only;
> > > > > - Add event suppression supprt - compile test only;
> > > > > - Move vring_packed_init() out of uapi (Jason, MST);
> > > > > - Merge two loops into one in virtqueue_add_packed() (Jason);
> > > > > - Split vring_unmap_one() for packed ring and split ring (Jason);
> > > > > - Avoid using '%' operator (Jason);
> > > > > - Rename free_head -> next_avail_idx (Jason);
> > > > > - Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
> > > > > - Some other refinements and bug fixes;
> > > > > 
> > > > > Thanks!
> > > > > 
> > > > > Signed-off-by: Tiwei Bie <tiwei@intel.com>
> > > > > ---
> > > > >drivers/virtio/virtio_ring.c   | 1094 
> > > > > +---
> > > > >include/linux/virtio_ring.h|8 +-
> > > > >include/uapi/linux/virtio_config.h |   12 +-
> > > > >include/uapi/linux/virtio_ring.h   |   61 ++
> > > > >4 files changed, 980 insertions(+), 195 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/virtio/virtio_ring.c 
> > > > > b/drivers/virtio/virtio_ring.c
> > > > > index 71458f493cf8..0515dca34d77 100644
> > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > @@ -58,14 +58,15 @@
> > > > [...]
> > > > 
> > > > > +
> > > > > + if (vq->indirect) {
> > > > > + u32 len;
> > > > > +
> > > > > + desc = vq->desc_state[head].indir_desc;
> > > > > + /* Free the indirect table, if any, now that it's 
> > > > > unmapped. */
> > > > > + if (!desc)
> > > > > + goto out;
> > > > > +
> > > > > + len = virtio32_to_cpu(vq->vq.vdev,
> > > > > +   vq->vring_packed.desc[head].len);
> > > > > +
> > > > > + BUG_ON(!(vq->vring_packed.desc[head].flags &
> > > > > +  cpu_to_virtio16(vq->vq.vdev, 
> > > > > VRING_DESC_F_INDIRECT)));
> > > > It looks to me spec does not force to keep VRING_DESC_F_INDIRECT here. 
> > > > So we
> > > > can safely remove this BUG_ON() here.
> > > > 
> > > > > + BUG_ON(len == 0 || len % sizeof(struct 
> > > > > vring_packed_desc));
> > > > Len could be ignored for used descriptor according to the spec, so we 
> > > > need
> > > > remove this BUG_ON() too.
> > > Yeah, you're right! The BUG_ON() isn't right. I'll remove it.
> > > And I think something related to this in the spec isn't very
> > > clear currently.
> > > 
> > > In the spec, there are below words:
> > > 
> > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L272
> > > """
> > > In descriptors with VIRTQ_DESC_F_INDIRECT set VIRTQ_DESC_F_WRITE
> > > is reserved and is ignored by the device.
> > > """
> > > 
> > > So when device write

Re: [RFC v2] virtio: support packed ring

2018-04-23 Thread Tiwei Bie
On Mon, Apr 23, 2018 at 01:42:14PM +0800, Jason Wang wrote:
> On 2018年04月01日 22:12, Tiwei Bie wrote:
> > Hello everyone,
> > 
> > This RFC implements packed ring support for virtio driver.
> > 
> > The code was tested with DPDK vhost (testpmd/vhost-PMD) implemented
> > by Jens at http://dpdk.org/ml/archives/dev/2018-January/089417.html
> > Minor changes are needed for the vhost code, e.g. to kick the guest.
> > 
> > TODO:
> > - Refinements and bug fixes;
> > - Split into small patches;
> > - Test indirect descriptor support;
> > - Test/fix event suppression support;
> > - Test devices other than net;
> > 
> > RFC v1 -> RFC v2:
> > - Add indirect descriptor support - compile test only;
> > - Add event suppression supprt - compile test only;
> > - Move vring_packed_init() out of uapi (Jason, MST);
> > - Merge two loops into one in virtqueue_add_packed() (Jason);
> > - Split vring_unmap_one() for packed ring and split ring (Jason);
> > - Avoid using '%' operator (Jason);
> > - Rename free_head -> next_avail_idx (Jason);
> > - Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
> > - Some other refinements and bug fixes;
> > 
> > Thanks!
> > 
> > Signed-off-by: Tiwei Bie <tiwei@intel.com>
> > ---
> >   drivers/virtio/virtio_ring.c   | 1094 
> > +---
> >   include/linux/virtio_ring.h|8 +-
> >   include/uapi/linux/virtio_config.h |   12 +-
> >   include/uapi/linux/virtio_ring.h   |   61 ++
> >   4 files changed, 980 insertions(+), 195 deletions(-)
> > 
> > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > index 71458f493cf8..0515dca34d77 100644
> > --- a/drivers/virtio/virtio_ring.c
> > +++ b/drivers/virtio/virtio_ring.c
> > @@ -58,14 +58,15 @@
> 
> [...]
> 
> > +
> > +   if (vq->indirect) {
> > +   u32 len;
> > +
> > +   desc = vq->desc_state[head].indir_desc;
> > +   /* Free the indirect table, if any, now that it's unmapped. */
> > +   if (!desc)
> > +   goto out;
> > +
> > +   len = virtio32_to_cpu(vq->vq.vdev,
> > + vq->vring_packed.desc[head].len);
> > +
> > +   BUG_ON(!(vq->vring_packed.desc[head].flags &
> > +cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_INDIRECT)));
> 
> It looks to me spec does not force to keep VRING_DESC_F_INDIRECT here. So we
> can safely remove this BUG_ON() here.
> 
> > +   BUG_ON(len == 0 || len % sizeof(struct vring_packed_desc));
> 
> Len could be ignored for used descriptor according to the spec, so we need
> remove this BUG_ON() too.

Yeah, you're right! The BUG_ON() isn't right. I'll remove it.
And I think something related to this in the spec isn't very
clear currently.

In the spec, there are the following words:

https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L272
"""
In descriptors with VIRTQ_DESC_F_INDIRECT set VIRTQ_DESC_F_WRITE
is reserved and is ignored by the device.
"""

So when the device writes back a used descriptor in this case,
the device may not set the VIRTQ_DESC_F_WRITE flag, as the flag
is reserved and should be ignored.

https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L170
"""
Element Length is reserved for used descriptors without the
VIRTQ_DESC_F_WRITE flag, and is ignored by drivers.
"""

And this is how the driver ignores the `len` in a used
descriptor.

https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L241
"""
To increase ring capacity the driver can store a (read-only
by the device) table of indirect descriptors anywhere in memory,
and insert a descriptor in the main virtqueue (with \field{Flags}
bit VIRTQ_DESC_F_INDIRECT on) that refers to a buffer element
containing this indirect descriptor table;
"""

So the indirect descriptors in the table are read-only by
the device. And the only descriptor which is writable by
the device is the descriptor in the main virtqueue (with
Flags bit VIRTQ_DESC_F_INDIRECT on). So if we ignore the
`len` in this descriptor, we won't be able to get the
length of the data written by the device.

So I think the `len` in this descriptor will carry the
length of the data written by the device (if the buffers
are writable to the device) even if the VIRTQ_DESC_F_WRITE
isn't set by the device. What do you think?
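
Under that reading, the driver side could look like this minimal
sketch (the struct mirrors the uapi definition with plain integer
types, byte-order conversion is omitted, and the helper name
used_len_indirect() is made up for illustration):

#include <stdint.h>

struct vring_packed_desc {
        uint64_t addr;
        uint32_t len;
        uint16_t id;
        uint16_t flags;
};

/* For an indirect buffer the table itself is read-only to the device,
 * so the length of the data written by the device is taken from the
 * used descriptor in the main ring -- without requiring the device to
 * have set VIRTQ_DESC_F_WRITE there. */
static uint32_t used_len_indirect(const struct vring_packed_desc *ring,
                                  unsigned int head)
{
        return ring[head].len;
}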


> 
> The reason is we don't touch descriptor ring in the case of split, so
> BUG_ON()s may help there.
> 
> > +
> > +

Re: [RFC] vhost: introduce mdev based hardware vhost backend

2018-04-19 Thread Tiwei Bie
On Thu, Apr 19, 2018 at 09:40:23PM +0300, Michael S. Tsirkin wrote:
> On Tue, Apr 10, 2018 at 03:25:45PM +0800, Jason Wang wrote:
> > > > > One problem is that, different virtio ring compatible devices
> > > > > may have different device interfaces. That is to say, we will
> > > > > need different drivers in QEMU. It could be troublesome. And
> > > > > that's what this patch trying to fix. The idea behind this
> > > > > patch is very simple: mdev is a standard way to emulate device
> > > > > in kernel.
> > > > So you just move the abstraction layer from qemu to kernel, and you 
> > > > still
> > > > need different drivers in kernel for different device interfaces of
> > > > accelerators. This looks even more complex than leaving it in qemu. As 
> > > > you
> > > > said, another idea is to implement userspace vhost backend for 
> > > > accelerators
> > > > which seems easier and could co-work with other parts of qemu without
> > > > inventing new type of messages.
> > > I'm not quite sure. Do you think it's acceptable to
> > > add various vendor specific hardware drivers in QEMU?
> > > 
> > 
> > I don't object but we need to figure out the advantages of doing it in qemu
> > too.
> > 
> > Thanks
> 
> To be frank kernel is exactly where device drivers belong.  DPDK did
> move them to userspace but that's merely a requirement for data path.
> *If* you can have them in kernel that is best:
> - update kernel and there's no need to rebuild userspace
> - apps can be written in any language no need to maintain multiple
>   libraries or add wrappers
> - security concerns are much smaller (ok people are trying to
>   raise the bar with IOMMUs and such, but it's already pretty
>   good even without)
> 
> The biggest issue is that you let userspace poke at the
> device which is also allowed by the IOMMU to poke at
> kernel memory (needed for kernel driver to work).

I think the device won't and shouldn't be allowed to
poke at kernel memory. Its kernel driver needs some
kernel memory to work, but the device doesn't have
access to it. Instead, the device only has access to:

(1) the entire memory of the VM (if no vIOMMU is used)
or
(2) the memory that belongs to the guest virtio device
(if a vIOMMU is being used).

Below is the reason:

For the first case, we should program the IOMMU for
the hardware device based on the info in the memory
table which is the entire memory of the VM.

For the second case, we should program the IOMMU for
the hardware device based on the info in the shadow
page table of the vIOMMU.

So the memory that can be accessed by the device is limited,
and it should be safe, especially in the second case.
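
As a rough sketch of case (1): walk the memory table passed via
VHOST_SET_MEM_TABLE and map each guest-physical range into the
device's IOMMU domain. hw_iommu_map() and program_device_iommu()
are hypothetical names standing in for whatever the accelerator
driver actually provides:

#include <linux/vhost.h>

/* Hypothetical accelerator-driver hook: map one guest-physical range
 * to the host memory backing it in the device's IOMMU domain. */
int hw_iommu_map(void *dev, __u64 gpa, __u64 hva, __u64 size);

/* Case (1), no vIOMMU: map every region of the vhost memory table,
 * so the device can reach guest memory and nothing else -- in
 * particular, none of the host kernel driver's own memory. */
static int program_device_iommu(void *dev, const struct vhost_memory *mem)
{
        __u32 i;

        for (i = 0; i < mem->nregions; i++) {
                const struct vhost_memory_region *r = &mem->regions[i];
                int err = hw_iommu_map(dev, r->guest_phys_addr,
                                       r->userspace_addr, r->memory_size);
                if (err)
                        return err;
        }
        return 0;
}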

My concern is that, in this RFC, we don't program the
IOMMU for the mdev device in userspace via the VFIO
API directly. Instead, we pass the memory table to the
kernel driver via the mdev device (BAR0) and ask the
driver to do the IOMMU programming. Some people may not
like it. The main reason why we don't program the IOMMU
via the VFIO API in userspace directly is that IOMMU
drivers currently don't support the mdev bus.

> 
> Yes, maybe if device is not buggy it's all fine, but
> it's better if we do not have to trust the device
> otherwise the security picture becomes more murky.
> 
> I suggested attaching a PASID to (some) queues - see my old post "using
> PASIDs to enable a safe variant of direct ring access".

It's pretty cool. We also have some similar ideas.
Cunming will talk more about this.

Best regards,
Tiwei Bie

> 
> Then using IOMMU with VFIO to limit access through queue to corrent
> ranges of memory.
> 
> 
> -- 
> MST


Re: [RFC v2] virtio: support packed ring

2018-04-17 Thread Tiwei Bie
On Tue, Apr 17, 2018 at 06:54:51PM +0300, Michael S. Tsirkin wrote:
> On Tue, Apr 17, 2018 at 10:56:26PM +0800, Tiwei Bie wrote:
> > On Tue, Apr 17, 2018 at 05:04:59PM +0300, Michael S. Tsirkin wrote:
> > > On Tue, Apr 17, 2018 at 08:47:16PM +0800, Tiwei Bie wrote:
> > > > On Tue, Apr 17, 2018 at 03:17:41PM +0300, Michael S. Tsirkin wrote:
> > > > > On Tue, Apr 17, 2018 at 10:51:33AM +0800, Tiwei Bie wrote:
> > > > > > On Tue, Apr 17, 2018 at 10:11:58AM +0800, Jason Wang wrote:
> > > > > > > On 2018年04月13日 15:15, Tiwei Bie wrote:
> > > > > > > > On Fri, Apr 13, 2018 at 12:30:24PM +0800, Jason Wang wrote:
> > > > > > > > > On 2018年04月01日 22:12, Tiwei Bie wrote:
> > > > > > [...]
> > > > > > > > > > +static int detach_buf_packed(struct vring_virtqueue *vq, 
> > > > > > > > > > unsigned int head,
> > > > > > > > > > + void **ctx)
> > > > > > > > > > +{
> > > > > > > > > > +   struct vring_packed_desc *desc;
> > > > > > > > > > +   unsigned int i, j;
> > > > > > > > > > +
> > > > > > > > > > +   /* Clear data ptr. */
> > > > > > > > > > +   vq->desc_state[head].data = NULL;
> > > > > > > > > > +
> > > > > > > > > > +   i = head;
> > > > > > > > > > +
> > > > > > > > > > +   for (j = 0; j < vq->desc_state[head].num; j++) {
> > > > > > > > > > +   desc = >vring_packed.desc[i];
> > > > > > > > > > +   vring_unmap_one_packed(vq, desc);
> > > > > > > > > > +   desc->flags = 0x0;
> > > > > > > > > Looks like this is unnecessary.
> > > > > > > > It's safer to zero it. If we don't zero it, after we
> > > > > > > > call virtqueue_detach_unused_buf_packed() which calls
> > > > > > > > this function, the desc is still available to the
> > > > > > > > device.
> > > > > > > 
> > > > > > > Well detach_unused_buf_packed() should be called after device is 
> > > > > > > stopped,
> > > > > > > otherwise even if you try to clear, there will still be a window 
> > > > > > > that device
> > > > > > > may use it.
> > > > > > 
> > > > > > This is not about whether the device has been stopped or
> > > > > > not. We don't have other places to re-initialize the ring
> > > > > > descriptors and wrap_counter. So they need to be set to
> > > > > > the correct values when doing detach_unused_buf.
> > > > > > 
> > > > > > Best regards,
> > > > > > Tiwei Bie
> > > > > 
> > > > > find vqs is the time to do it.
> > > > 
> > > > The .find_vqs() will call .setup_vq() which will eventually
> > > > call vring_create_virtqueue(). It's a different case. Here
> > > > we're talking about re-initializing the descs and updating
> > > > the wrap counter when detaching the unused descs (In this
> > > > case, split ring just needs to decrease vring.avail->idx).
> > > > 
> > > > Best regards,
> > > > Tiwei Bie
> > > 
> > > There's no requirement that  virtqueue_detach_unused_buf re-initializes
> > > the descs. It happens on cleanup path just before drivers delete the
> > > vqs.
> > 
> > Cool, I wasn't aware of it. I saw split ring decrease
> > vring.avail->idx after detaching an unused desc, so I
> > thought detaching unused desc also needs to make sure
> > that the ring state will be updated correspondingly.
> 
> 
> Hmm. You are right. Seems to be out console driver being out of spec.
> Will have to look at how to fix that :(
> 
> It was done here:
> 
> Commit b3258ff1d6086bd2b9eeb556844a868ad7d49bc8
> Author: Amit Shah <amit.s...@redhat.com>
> Date:   Wed Mar 16 19:12:10 2011 +0530
> 
> virtio: Decrement avail idx on buffer detach
> 
> When detaching a buffer from a vq, the avail.idx value should be
> decremented as well.
> 
> This was noticed by hot-unplugging a virtio console port and then
>

Re: [RFC v2] virtio: support packed ring

2018-04-17 Thread Tiwei Bie
On Tue, Apr 17, 2018 at 05:04:59PM +0300, Michael S. Tsirkin wrote:
> On Tue, Apr 17, 2018 at 08:47:16PM +0800, Tiwei Bie wrote:
> > On Tue, Apr 17, 2018 at 03:17:41PM +0300, Michael S. Tsirkin wrote:
> > > On Tue, Apr 17, 2018 at 10:51:33AM +0800, Tiwei Bie wrote:
> > > > On Tue, Apr 17, 2018 at 10:11:58AM +0800, Jason Wang wrote:
> > > > > On 2018年04月13日 15:15, Tiwei Bie wrote:
> > > > > > On Fri, Apr 13, 2018 at 12:30:24PM +0800, Jason Wang wrote:
> > > > > > > On 2018年04月01日 22:12, Tiwei Bie wrote:
> > > > [...]
> > > > > > > > +static int detach_buf_packed(struct vring_virtqueue *vq, 
> > > > > > > > unsigned int head,
> > > > > > > > + void **ctx)
> > > > > > > > +{
> > > > > > > > +   struct vring_packed_desc *desc;
> > > > > > > > +   unsigned int i, j;
> > > > > > > > +
> > > > > > > > +   /* Clear data ptr. */
> > > > > > > > +   vq->desc_state[head].data = NULL;
> > > > > > > > +
> > > > > > > > +   i = head;
> > > > > > > > +
> > > > > > > > +   for (j = 0; j < vq->desc_state[head].num; j++) {
> > > > > > > > +   desc = >vring_packed.desc[i];
> > > > > > > > +   vring_unmap_one_packed(vq, desc);
> > > > > > > > +   desc->flags = 0x0;
> > > > > > > Looks like this is unnecessary.
> > > > > > It's safer to zero it. If we don't zero it, after we
> > > > > > call virtqueue_detach_unused_buf_packed() which calls
> > > > > > this function, the desc is still available to the
> > > > > > device.
> > > > > 
> > > > > Well detach_unused_buf_packed() should be called after device is 
> > > > > stopped,
> > > > > otherwise even if you try to clear, there will still be a window that 
> > > > > device
> > > > > may use it.
> > > > 
> > > > This is not about whether the device has been stopped or
> > > > not. We don't have other places to re-initialize the ring
> > > > descriptors and wrap_counter. So they need to be set to
> > > > the correct values when doing detach_unused_buf.
> > > > 
> > > > Best regards,
> > > > Tiwei Bie
> > > 
> > > find vqs is the time to do it.
> > 
> > The .find_vqs() will call .setup_vq() which will eventually
> > call vring_create_virtqueue(). It's a different case. Here
> > we're talking about re-initializing the descs and updating
> > the wrap counter when detaching the unused descs (In this
> > case, split ring just needs to decrease vring.avail->idx).
> > 
> > Best regards,
> > Tiwei Bie
> 
> There's no requirement that  virtqueue_detach_unused_buf re-initializes
> the descs. It happens on cleanup path just before drivers delete the
> vqs.

Cool, I wasn't aware of it. I saw the split ring decrease
vring.avail->idx after detaching an unused desc, so I
thought detaching an unused desc also needs to make sure
that the ring state is updated correspondingly.

If there is no such requirement, do you think it's OK
to remove the following two lines:

vq->avail_idx_shadow--;
vq->vring.avail->idx = cpu_to_virtio16(_vq->vdev, vq->avail_idx_shadow);

from virtqueue_detach_unused_buf(), and we could have
one generic function to handle both rings:

void *virtqueue_detach_unused_buf(struct virtqueue *_vq)
{
struct vring_virtqueue *vq = to_vvq(_vq);
unsigned int num, i;
void *buf;

START_USE(vq);

num = vq->packed ? vq->vring_packed.num : vq->vring.num;

for (i = 0; i < num; i++) {
if (!vq->desc_state[i].data)
continue;
/* detach_buf clears data, so grab it now. */
buf = vq->desc_state[i].data;
detach_buf(vq, i, NULL);
END_USE(vq);
return buf;
}
/* That should have freed everything. */
BUG_ON(vq->vq.num_free != num);

END_USE(vq);
return NULL;
}

Best regards,
Tiwei Bie


Re: [RFC v2] virtio: support packed ring

2018-04-17 Thread Tiwei Bie
On Tue, Apr 17, 2018 at 03:17:41PM +0300, Michael S. Tsirkin wrote:
> On Tue, Apr 17, 2018 at 10:51:33AM +0800, Tiwei Bie wrote:
> > On Tue, Apr 17, 2018 at 10:11:58AM +0800, Jason Wang wrote:
> > > On 2018年04月13日 15:15, Tiwei Bie wrote:
> > > > On Fri, Apr 13, 2018 at 12:30:24PM +0800, Jason Wang wrote:
> > > > > On 2018年04月01日 22:12, Tiwei Bie wrote:
> > [...]
> > > > > > +static int detach_buf_packed(struct vring_virtqueue *vq, unsigned 
> > > > > > int head,
> > > > > > + void **ctx)
> > > > > > +{
> > > > > > +   struct vring_packed_desc *desc;
> > > > > > +   unsigned int i, j;
> > > > > > +
> > > > > > +   /* Clear data ptr. */
> > > > > > +   vq->desc_state[head].data = NULL;
> > > > > > +
> > > > > > +   i = head;
> > > > > > +
> > > > > > +   for (j = 0; j < vq->desc_state[head].num; j++) {
> > > > > > +   desc = >vring_packed.desc[i];
> > > > > > +   vring_unmap_one_packed(vq, desc);
> > > > > > +   desc->flags = 0x0;
> > > > > Looks like this is unnecessary.
> > > > It's safer to zero it. If we don't zero it, after we
> > > > call virtqueue_detach_unused_buf_packed() which calls
> > > > this function, the desc is still available to the
> > > > device.
> > > 
> > > Well detach_unused_buf_packed() should be called after device is stopped,
> > > otherwise even if you try to clear, there will still be a window that 
> > > device
> > > may use it.
> > 
> > This is not about whether the device has been stopped or
> > not. We don't have other places to re-initialize the ring
> > descriptors and wrap_counter. So they need to be set to
> > the correct values when doing detach_unused_buf.
> > 
> > Best regards,
> > Tiwei Bie
> 
> find vqs is the time to do it.

The .find_vqs() will call .setup_vq() which will eventually
call vring_create_virtqueue(). It's a different case. Here
we're talking about re-initializing the descs and updating
the wrap counter when detaching the unused descs (In this
case, split ring just needs to decrease vring.avail->idx).

Best regards,
Tiwei Bie


Re: [RFC v2] virtio: support packed ring

2018-04-16 Thread Tiwei Bie
On Tue, Apr 17, 2018 at 10:11:58AM +0800, Jason Wang wrote:
> On 2018年04月13日 15:15, Tiwei Bie wrote:
> > On Fri, Apr 13, 2018 at 12:30:24PM +0800, Jason Wang wrote:
> > > On 2018年04月01日 22:12, Tiwei Bie wrote:
[...]
> > > > +static int detach_buf_packed(struct vring_virtqueue *vq, unsigned int 
> > > > head,
> > > > + void **ctx)
> > > > +{
> > > > +   struct vring_packed_desc *desc;
> > > > +   unsigned int i, j;
> > > > +
> > > > +   /* Clear data ptr. */
> > > > +   vq->desc_state[head].data = NULL;
> > > > +
> > > > +   i = head;
> > > > +
> > > > +   for (j = 0; j < vq->desc_state[head].num; j++) {
> > > > +   desc = >vring_packed.desc[i];
> > > > +   vring_unmap_one_packed(vq, desc);
> > > > +   desc->flags = 0x0;
> > > Looks like this is unnecessary.
> > It's safer to zero it. If we don't zero it, after we
> > call virtqueue_detach_unused_buf_packed() which calls
> > this function, the desc is still available to the
> > device.
> 
> Well detach_unused_buf_packed() should be called after device is stopped,
> otherwise even if you try to clear, there will still be a window that device
> may use it.

This is not about whether the device has been stopped or
not. We don't have other places to re-initialize the ring
descriptors and wrap_counter. So they need to be set to
the correct values when doing detach_unused_buf.

Best regards,
Tiwei Bie


Re: [RFC v2] virtio: support packed ring

2018-04-14 Thread Tiwei Bie
On Fri, Apr 13, 2018 at 06:22:45PM +0300, Michael S. Tsirkin wrote:
> On Sun, Apr 01, 2018 at 10:12:16PM +0800, Tiwei Bie wrote:
> > +static inline bool more_used(const struct vring_virtqueue *vq)
> > +{
> > +   return vq->packed ? more_used_packed(vq) : more_used_split(vq);
> > +}
> > +
> > +void *virtqueue_get_buf_ctx_split(struct virtqueue *_vq, unsigned int *len,
> > + void **ctx)
> > +{
> > +   struct vring_virtqueue *vq = to_vvq(_vq);
> > +   void *ret;
> > +   unsigned int i;
> > +   u16 last_used;
> > +
> > +   START_USE(vq);
> > +
> > +   if (unlikely(vq->broken)) {
> > +   END_USE(vq);
> > +   return NULL;
> > +   }
> > +
> > +   if (!more_used(vq)) {
> > +   pr_debug("No more buffers in queue\n");
> > +   END_USE(vq);
> > +   return NULL;
> > +   }
> 
> So virtqueue_get_buf_ctx_split should only call more_used_split.

Yeah, you're right! Will fix this in the next version.

> 
> to avoid such issues I think we should lay out the code like this:
> 
> XXX_split
> 
> XXX_packed
> 
> XXX wrappers

I'll do it. Thanks for the suggestion!

> 
> > +/* The standard layout
> 
> I'd drop standard here.

Got it. I'll drop the word "standard".

> 
> > for the packed ring is a continuous chunk of memory
> > + * which looks like this.
> > + *
> > + * struct vring_packed
> > + * {
> 
> Can the opening bracket go on the prev line pls?

Sure.

> 
> > + * // The actual descriptors (16 bytes each)
> > + * struct vring_packed_desc desc[num];
> > + *
> > + * // Padding to the next align boundary.
> > + * char pad[];
> > + *
> > + * // Driver Event Suppression
> > + * struct vring_packed_desc_event driver;
> > + *
> > + * // Device Event Suppression
> > + * struct vring_packed_desc_event device;
> 
> Maybe that's how our driver does it but it's not based on spec
> so I don't think this belongs in the header.

I will move it to the place where vring_packed_init()
is defined.

> 
> > + * };
> > + */
> > +
> > +static inline unsigned vring_packed_size(unsigned int num, unsigned long 
> > align)
> > +{
> > +   return ((sizeof(struct vring_packed_desc) * num + align - 1)
> > +   & ~(align - 1)) + sizeof(struct vring_packed_desc_event) * 2;
> > +}
> > +
> 
> Cant say this API makes sense for me.

Hmm, do you have any suggestions? Should it also be moved out of this header?

Thanks for the review! :)

Best regards,
Tiwei Bie

> 
> 
> >  #endif /* _UAPI_LINUX_VIRTIO_RING_H */
> > -- 
> > 2.11.0


Re: [RFC v2] virtio: support packed ring

2018-04-13 Thread Tiwei Bie
On Fri, Apr 13, 2018 at 12:30:24PM +0800, Jason Wang wrote:
> On 2018年04月01日 22:12, Tiwei Bie wrote:
> > Hello everyone,
> > 
> > This RFC implements packed ring support for virtio driver.
> > 
> > The code was tested with DPDK vhost (testpmd/vhost-PMD) implemented
> > by Jens at http://dpdk.org/ml/archives/dev/2018-January/089417.html
> > Minor changes are needed for the vhost code, e.g. to kick the guest.
> > 
> > TODO:
> > - Refinements and bug fixes;
> > - Split into small patches;
> > - Test indirect descriptor support;
> > - Test/fix event suppression support;
> > - Test devices other than net;
> > 
> > RFC v1 -> RFC v2:
> > - Add indirect descriptor support - compile test only;
> > - Add event suppression supprt - compile test only;
> > - Move vring_packed_init() out of uapi (Jason, MST);
> > - Merge two loops into one in virtqueue_add_packed() (Jason);
> > - Split vring_unmap_one() for packed ring and split ring (Jason);
> > - Avoid using '%' operator (Jason);
> > - Rename free_head -> next_avail_idx (Jason);
> > - Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
> > - Some other refinements and bug fixes;
> > 
> > Thanks!
> > 
> > Signed-off-by: Tiwei Bie <tiwei@intel.com>
> > ---
> >   drivers/virtio/virtio_ring.c   | 1094 
> > +---
> >   include/linux/virtio_ring.h|8 +-
> >   include/uapi/linux/virtio_config.h |   12 +-
> >   include/uapi/linux/virtio_ring.h   |   61 ++
> >   4 files changed, 980 insertions(+), 195 deletions(-)
[...]
> > +static struct vring_packed_desc *alloc_indirect_packed(struct virtqueue 
> > *_vq,
> > +  unsigned int total_sg,
> > +  gfp_t gfp)
> > +{
> > +   struct vring_packed_desc *desc;
> > +
> > +   /*
> > +* We require lowmem mappings for the descriptors because
> > +* otherwise virt_to_phys will give us bogus addresses in the
> > +* virtqueue.
> > +*/
> > +   gfp &= ~__GFP_HIGHMEM;
> > +
> > +   desc = kmalloc(total_sg * sizeof(struct vring_packed_desc), gfp);
> 
> Can we simply check vq->packed here to avoid duplicating helpers?

Then it would be something like this:

static void *alloc_indirect(struct virtqueue *_vq, unsigned int total_sg,
gfp_t gfp)
{
struct vring_virtqueue *vq = to_vvq(_vq);
void *data;

/*
 * We require lowmem mappings for the descriptors because
 * otherwise virt_to_phys will give us bogus addresses in the
 * virtqueue.
 */
gfp &= ~__GFP_HIGHMEM;

if (vq->packed) {
data = kmalloc(total_sg * sizeof(struct vring_packed_desc),
gfp);
if (!data)
return NULL;
} else {
struct vring_desc *desc;
unsigned int i;

desc = kmalloc(total_sg * sizeof(struct vring_desc), gfp);
if (!desc)
return NULL;

for (i = 0; i < total_sg; i++)
desc[i].next = cpu_to_virtio16(_vq->vdev, i + 1);

data = desc;
}

return data;
}

I would prefer to have two simpler helpers (and for the callers,
it's already very clear which one they should call), i.e.
the current implementation:

static struct vring_packed_desc *alloc_indirect_packed(struct virtqueue *_vq,
   unsigned int total_sg,
   gfp_t gfp)
{
struct vring_packed_desc *desc;

/*
 * We require lowmem mappings for the descriptors because
 * otherwise virt_to_phys will give us bogus addresses in the
 * virtqueue.
 */
gfp &= ~__GFP_HIGHMEM;

desc = kmalloc(total_sg * sizeof(struct vring_packed_desc), gfp);

return desc;
}

static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq,
   unsigned int total_sg,
   gfp_t gfp)
{
struct vring_desc *desc;
unsigned int i;

/*
 * We require lowmem mappings for the descriptors because
 * otherwise virt_to_phys will give us bogus addresses in the
 * virtqueue.
 */
gfp &= ~__GFP_HIGHMEM;

desc = kmalloc(total_sg * sizeof(struct vring_desc), gfp);
if (!desc)
return NULL;

for (i = 0; i < total_sg; i++)
desc[i].next = cpu_to_virtio16(_vq->vdev, i + 1);

return desc;
}

Re: [RFC] vhost: introduce mdev based hardware vhost backend

2018-04-09 Thread Tiwei Bie
On Tue, Apr 10, 2018 at 10:52:52AM +0800, Jason Wang wrote:
> On 2018年04月02日 23:23, Tiwei Bie wrote:
> > This patch introduces a mdev (mediated device) based hardware
> > vhost backend. This backend is an abstraction of the various
> > hardware vhost accelerators (potentially any device that uses
> > virtio ring can be used as a vhost accelerator). Some generic
> > mdev parent ops are provided for accelerator drivers to support
> > generating mdev instances.
> > 
> > What's this
> > ===
> > 
> > The idea is that we can setup a virtio ring compatible device
> > with the messages available at the vhost-backend. Originally,
> > these messages are used to implement a software vhost backend,
> > but now we will use these messages to setup a virtio ring
> > compatible hardware device. Then the hardware device will be
> > able to work with the guest virtio driver in the VM just like
> > what the software backend does. That is to say, we can implement
> > a hardware based vhost backend in QEMU, and any virtio ring
> > compatible devices potentially can be used with this backend.
> > (We also call it vDPA -- vhost Data Path Acceleration).
> > 
> > One problem is that, different virtio ring compatible devices
> > may have different device interfaces. That is to say, we will
> > need different drivers in QEMU. It could be troublesome. And
> > that's what this patch is trying to fix. The idea behind this
> > patch is very simple: mdev is a standard way to emulate devices
> > in the kernel.
> 
> So you just move the abstraction layer from qemu to kernel, and you still
> need different drivers in kernel for different device interfaces of
> accelerators. This looks even more complex than leaving it in qemu. As you
> said, another idea is to implement userspace vhost backend for accelerators
> which seems easier and could co-work with other parts of qemu without
> inventing new type of messages.

I'm not quite sure. Do you think it's acceptable to
add various vendor specific hardware drivers in QEMU?

> 
> Need careful thought here to seek a best solution here.

Yeah, definitely! :)
And your opinions would be very helpful!

> 
> >   So we defined a standard device based on mdev, which
> > is able to accept vhost messages. When the mdev emulation code
> > (i.e. the generic mdev parent ops provided by this patch) gets
> > vhost messages, it will parse and deliver them to accelerator
> > drivers. Drivers can use these messages to setup accelerators.
> > 
> > That is to say, the generic mdev parent ops (e.g. read()/write()/
> > ioctl()/...) will be provided for accelerator drivers to register
> > accelerators as mdev parent devices. And each accelerator device
> > will support generating standard mdev instance(s).
> > 
> > With this standard device interface, we will be able to just
> > develop one userspace driver to implement the hardware based
> > vhost backend in QEMU.
> > 
> > Difference between vDPA and PCI passthru
> > 
> > 
> > The key difference between vDPA and PCI passthru is that, in
> > vDPA only the data path of the device (e.g. DMA ring, notify
> > region and queue interrupt) is pass-throughed to the VM, the
> > device control path (e.g. PCI configuration space and MMIO
> > regions) is still defined and emulated by QEMU.
> > 
> > The benefits of keeping virtio device emulation in QEMU compared
> > with virtio device PCI passthru include (but not limit to):
> > 
> > - consistent device interface for guest OS in the VM;
> > - max flexibility on the hardware design, especially the
> >accelerator for each vhost backend doesn't have to be a
> >full PCI device;
> > - leveraging the existing virtio live-migration framework;
> > 
> > The interface of this mdev based device
> > ===
> > 
> > 1. BAR0
> > 
> > The MMIO region described by BAR0 is the main control
> > interface. Messages will be written to or read from
> > this region.
> > 
> > The message type is determined by the `request` field
> > in message header. The message size is encoded in the
> > message header too. The message format looks like this:
> > 
> > struct vhost_vfio_op {
> > __u64 request;
> > __u32 flags;
> > /* Flag values: */
> > #define VHOST_VFIO_NEED_REPLY 0x1 /* Whether need reply */
> > __u32 size;
> > union {
> > __u64 u64;
> > struct vhost_vring_state state;
> > struct vhost_vr

Re: [RFC v2] virtio: support packed ring

2018-04-09 Thread Tiwei Bie
On Tue, Apr 10, 2018 at 10:55:25AM +0800, Jason Wang wrote:
> On 2018年04月01日 22:12, Tiwei Bie wrote:
> > Hello everyone,
> > 
> > This RFC implements packed ring support for virtio driver.
> > 
> > The code was tested with DPDK vhost (testpmd/vhost-PMD) implemented
> > by Jens at http://dpdk.org/ml/archives/dev/2018-January/089417.html
> > Minor changes are needed for the vhost code, e.g. to kick the guest.
> > 
> > TODO:
> > - Refinements and bug fixes;
> > - Split into small patches;
> > - Test indirect descriptor support;
> > - Test/fix event suppression support;
> > - Test devices other than net;
> > 
> > RFC v1 -> RFC v2:
> > - Add indirect descriptor support - compile test only;
> > - Add event suppression support - compile test only;
> > - Move vring_packed_init() out of uapi (Jason, MST);
> > - Merge two loops into one in virtqueue_add_packed() (Jason);
> > - Split vring_unmap_one() for packed ring and split ring (Jason);
> > - Avoid using '%' operator (Jason);
> > - Rename free_head -> next_avail_idx (Jason);
> > - Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
> > - Some other refinements and bug fixes;
> > 
> > Thanks!
> 
> Will try to review this later.
> 
> But it would be better if you can split it (more than 1000 lines is too big
> to be reviewed easily). E.g you can at least split it into three patches,
> new structures, datapath, and event suppression.
> 

No problem! It's on my TODO list. I'll get it done in the next version.

Thanks!


[RFC] vhost: introduce mdev based hardware vhost backend

2018-04-02 Thread Tiwei Bie
   struct vhost_vfio_op op;
int ret;

op.request = VHOST_GET_FEATURES;
op.flags = VHOST_VFIO_NEED_REPLY;
op.size = 0;

/* Just need to write the header */
ret = vhost_vfio_write(dev, &op);
if (ret != 0)
goto out;

/* `op` wasn't changed during write */
op.flags = 0;
op.size = sizeof(*features);

ret = vhost_vfio_read(dev, &op);
if (ret != 0)
goto out;

*features = op.payload.u64;
out:
return ret;
}

2. BAR1 (mmap-able)

The MMIO region described by BAR1 will be used to notify the
device.

Each queue will have a page for notification, and it can be
mapped into the VM (if the hardware supports this), so the virtio
driver in the VM will be able to notify the device directly.

The MMIO region described by BAR1 is also write-able. If the
accelerator's notification register(s) cannot be mapped to the
VM, write() can also be used to notify the device. Something
like this:

void notify_relay(void *opaque)
{
..
offset = 0x1000 * queue_idx; /* XXX assume page size is 4K here. */

ret = pwrite64(vfio->device_fd, &queue_idx, sizeof(queue_idx),
vfio->bar1_offset + offset);
..
}
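
If the hardware does support mapping, the same per-queue page could be
mapped for the guest instead of using write(). Just a rough sketch,
reusing the names from the snippet above:

offset = 0x1000 * queue_idx; /* XXX assume page size is 4K here. */

addr = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE, MAP_SHARED,
    vfio->device_fd, vfio->bar1_offset + offset);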

Other BARs are reserved.

3. VFIO interrupt ioctl API

VFIO interrupt ioctl API is used to setup device interrupts.
IRQ-bypass will also be supported.

Currently, only VFIO_PCI_MSIX_IRQ_INDEX is supported.

The API for drivers to provide mdev instances
=

The read()/write()/ioctl()/mmap()/open()/release() mdev
parent ops have been provided for accelerators' drivers
to provide mdev instances.

ssize_t vdpa_read(struct mdev_device *mdev, char __user *buf,
  size_t count, loff_t *ppos);
ssize_t vdpa_write(struct mdev_device *mdev, const char __user *buf,
   size_t count, loff_t *ppos);
long vdpa_ioctl(struct mdev_device *mdev, unsigned int cmd, unsigned long arg);
int vdpa_mmap(struct mdev_device *mdev, struct vm_area_struct *vma);
int vdpa_open(struct mdev_device *mdev);
void vdpa_close(struct mdev_device *mdev);

Each accelerator driver just needs to implement its own
create()/remove() ops, and provide a vdpa device ops
which will be called by the generic mdev emulation code.

Currently, the vdpa device ops are defined as:

typedef int (*vdpa_start_device_t)(struct vdpa_dev *vdpa);
typedef int (*vdpa_stop_device_t)(struct vdpa_dev *vdpa);
typedef int (*vdpa_dma_map_t)(struct vdpa_dev *vdpa);
typedef int (*vdpa_dma_unmap_t)(struct vdpa_dev *vdpa);
typedef int (*vdpa_set_eventfd_t)(struct vdpa_dev *vdpa, int vector, int fd);
typedef u64 (*vdpa_supported_features_t)(struct vdpa_dev *vdpa);
typedef void (*vdpa_notify_device_t)(struct vdpa_dev *vdpa, int qid);
typedef u64 (*vdpa_get_notify_addr_t)(struct vdpa_dev *vdpa, int qid);

struct vdpa_device_ops {
vdpa_start_device_t start;
vdpa_stop_device_t  stop;
vdpa_dma_map_t  dma_map;
vdpa_dma_unmap_tdma_unmap;
vdpa_set_eventfd_t  set_eventfd;
vdpa_supported_features_t   supported_features;
vdpa_notify_device_tnotify;
vdpa_get_notify_addr_t  get_notify_addr;
};

struct vdpa_dev {
struct mdev_device *mdev;
struct mutex ops_lock;
u8 vconfig[VDPA_CONFIG_SIZE];
int nr_vring;
u64 features;
u64 state;
struct vhost_memory *mem_table;
bool pending_reply;
struct vhost_vfio_op pending;
const struct vdpa_device_ops *ops;
void *private;
int max_vrings;
struct vdpa_vring_info vring_info[0];
};

struct vdpa_dev *vdpa_alloc(struct mdev_device *mdev, void *private,
int max_vrings);
void vdpa_free(struct vdpa_dev *vdpa);
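
For example, an accelerator driver could wire things up roughly like
this in its create() op (a sketch only; the foo_* names are made up,
and the exact way the ops get attached may differ):

static const struct vdpa_device_ops foo_vdpa_ops = {
.start  = foo_start_device,
.stop   = foo_stop_device,
.dma_map= foo_dma_map,
.dma_unmap  = foo_dma_unmap,
.set_eventfd= foo_set_eventfd,
.supported_features = foo_supported_features,
.notify = foo_notify_queue,
.get_notify_addr= foo_get_notify_addr,
};

static int foo_create_vdpa(struct mdev_device *mdev, void *adapter)
{
struct vdpa_dev *vdpa;

vdpa = vdpa_alloc(mdev, adapter, FOO_MAX_VRINGS);
if (!vdpa)
return -ENOMEM;

vdpa->ops = &foo_vdpa_ops;
return 0;
}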

A simple example


# Query the number of available mdev instances
$ cat 
/sys/class/mdev_bus/:06:00.2/mdev_supported_types/ifcvf_vdpa-vdpa_virtio/available_instances

# Create a mdev instance
$ echo $UUID > 
/sys/class/mdev_bus/:06:00.2/mdev_supported_types/ifcvf_vdpa-vdpa_virtio/create

# Launch QEMU with a virtio-net device
$ qemu \
.. \
-netdev type=vhost-vfio,sysfsdev=/sys/bus/mdev/devices/$UUID,id=$ID \
-device virtio-net-pci,netdev=$ID

 END 

Most of the above text will be refined and moved to a doc in
the formal patch. In this RFC, all introductions and code are
gathered in a single patch; the idea is to make it easier to
find all the relevant information. Anyone who wants to comment
can do so inline and keep just the relevant parts. Sorry for
the big RFC patch..

This patch is just a RFC for now, and something is still
missing or needs to be refined. But it's never too early
to hear the thoughts from the community. So any comments
would be appreciated! Thanks! :-)

Signed-off-by: Tiwei

[RFC v2] virtio: support packed ring

2018-04-01 Thread Tiwei Bie
Hello everyone,

This RFC implements packed ring support for virtio driver.

The code was tested with DPDK vhost (testpmd/vhost-PMD) implemented
by Jens at http://dpdk.org/ml/archives/dev/2018-January/089417.html
Minor changes are needed for the vhost code, e.g. to kick the guest.

TODO:
- Refinements and bug fixes;
- Split into small patches;
- Test indirect descriptor support;
- Test/fix event suppression support;
- Test devices other than net;

RFC v1 -> RFC v2:
- Add indirect descriptor support - compile test only;
- Add event suppression support - compile test only;
- Move vring_packed_init() out of uapi (Jason, MST);
- Merge two loops into one in virtqueue_add_packed() (Jason);
- Split vring_unmap_one() for packed ring and split ring (Jason);
- Avoid using '%' operator (Jason);
- Rename free_head -> next_avail_idx (Jason);
- Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
- Some other refinements and bug fixes;

Thanks!

Signed-off-by: Tiwei Bie <tiwei@intel.com>
---
 drivers/virtio/virtio_ring.c   | 1094 +---
 include/linux/virtio_ring.h|8 +-
 include/uapi/linux/virtio_config.h |   12 +-
 include/uapi/linux/virtio_ring.h   |   61 ++
 4 files changed, 980 insertions(+), 195 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 71458f493cf8..0515dca34d77 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -58,14 +58,15 @@
 
 struct vring_desc_state {
void *data; /* Data for callback. */
-   struct vring_desc *indir_desc;  /* Indirect descriptor, if any. */
+   void *indir_desc;   /* Indirect descriptor, if any. */
+   int num;/* Descriptor list length. */
 };
 
 struct vring_virtqueue {
struct virtqueue vq;
 
-   /* Actual memory layout for this queue */
-   struct vring vring;
+   /* Is this a packed ring? */
+   bool packed;
 
/* Can we use weak barriers? */
bool weak_barriers;
@@ -79,19 +80,45 @@ struct vring_virtqueue {
/* Host publishes avail event idx */
bool event;
 
-   /* Head of free buffer list. */
-   unsigned int free_head;
/* Number we've added since last sync. */
unsigned int num_added;
 
/* Last used index we've seen. */
u16 last_used_idx;
 
-   /* Last written value to avail->flags */
-   u16 avail_flags_shadow;
+   union {
+   /* Available for split ring */
+   struct {
+   /* Actual memory layout for this queue. */
+   struct vring vring;
 
-   /* Last written value to avail->idx in guest byte order */
-   u16 avail_idx_shadow;
+   /* Head of free buffer list. */
+   unsigned int free_head;
+
+   /* Last written value to avail->flags */
+   u16 avail_flags_shadow;
+
+   /* Last written value to avail->idx in
+* guest byte order. */
+   u16 avail_idx_shadow;
+   };
+
+   /* Available for packed ring */
+   struct {
+   /* Actual memory layout for this queue. */
+   struct vring_packed vring_packed;
+
+   /* Driver ring wrap counter. */
+   u8 wrap_counter;
+
+   /* Index of the next avail descriptor. */
+   unsigned int next_avail_idx;
+
+   /* Last written value to driver->flags in
+* guest byte order. */
+   u16 event_flags_shadow;
+   };
+   };
 
/* How to notify other side. FIXME: commonalize hcalls! */
bool (*notify)(struct virtqueue *vq);
@@ -201,8 +228,33 @@ static dma_addr_t vring_map_single(const struct 
vring_virtqueue *vq,
  cpu_addr, size, direction);
 }
 
-static void vring_unmap_one(const struct vring_virtqueue *vq,
-   struct vring_desc *desc)
+static void vring_unmap_one_split(const struct vring_virtqueue *vq,
+ struct vring_desc *desc)
+{
+   u16 flags;
+
+   if (!vring_use_dma_api(vq->vq.vdev))
+   return;
+
+   flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
+
+   if (flags & VRING_DESC_F_INDIRECT) {
+   dma_unmap_single(vring_dma_dev(vq),
+virtio64_to_cpu(vq->vq.vdev, desc->addr),
+virtio32_to_cpu(vq->vq.vdev, desc->len),
+(flags & VRING_DESC_F_WRITE) ?
+DMA_FROM_DEVICE : DMA_TO_DEVICE);
+   } else {
+   dma_unmap_page(vring_dma_dev(vq),
+ 

Re: [RFC PATCH V2 8/8] vhost: event suppression for packed ring

2018-03-29 Thread Tiwei Bie
On Mon, Mar 26, 2018 at 11:38:53AM +0800, Jason Wang wrote:
> This patch introduces basic support for event suppression aka driver
> and device area. Compile tested only.
> 
> Signed-off-by: Jason Wang 
> ---
[...]
> +
> +static bool vhost_notify_packed(struct vhost_dev *dev,
> + struct vhost_virtqueue *vq)
> +{
> + __virtio16 event_off_wrap, event_flags;
> + __u16 old, new;
> + bool v, wrap;
> + int off;
> +
> + /* Flush out used descriptors updates. This is paired
> +  * with the barrier that the Guest executes when enabling
> +  * interrupts.
> +  */
> + smp_mb();
> +
> + if (vhost_get_avail(vq, event_flags,
> +    &vq->driver_event->desc_event_flags) < 0) {
> + vq_err(vq, "Failed to get driver desc_event_flags");
> + return true;
> + }
> +
> + if (!(event_flags & cpu_to_vhost16(vq, RING_EVENT_FLAGS_DESC)))
> + return event_flags ==
> +cpu_to_vhost16(vq, RING_EVENT_FLAGS_ENABLE);

Maybe it would be better not to use '&' here, because these flags
are not defined as bits which can be ORed or ANDed. Instead, they
are defined as values:

0x0  enable
0x1  disable
0x2  desc
0x3  reserved
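
E.g. something like this (just a sketch based on the snippet above; it
also stops treating the reserved value 0x3 as desc mode):

if (event_flags != cpu_to_vhost16(vq, RING_EVENT_FLAGS_DESC))
return event_flags ==
       cpu_to_vhost16(vq, RING_EVENT_FLAGS_ENABLE);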

> +
> + /* Read desc event flags before event_off and event_wrap */
> + smp_rmb();
> +
> + if (vhost_get_avail(vq, event_off_wrap,
> +     &vq->driver_event->desc_event_off_warp) < 0) {
> + vq_err(vq, "Failed to get driver desc_event_off/wrap");
> + return true;
> + }
> +
> + off = vhost16_to_cpu(vq, event_off_wrap);
> +
> + wrap = off & 0x1;
> + off >>= 1;

Based on the definitions below from the spec, the wrap counter is
the most significant bit.

struct pvirtq_event_suppress {
le16 {
desc_event_off : 15; /* Descriptor Ring Change Event Offset */
desc_event_wrap : 1; /* Descriptor Ring Change Event Wrap 
Counter */
} desc; /* If desc_event_flags set to RING_EVENT_FLAGS_DESC */
le16 {
desc_event_flags : 2, /* Descriptor Ring Change Event Flags */
reserved : 14; /* Reserved, set to 0 */
} flags;
};
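
So the decode would look more like this (sketch only, using the same
variables as in the patch):

wrap = !!(off & 0x8000); /* bit 15: desc_event_wrap */
off &= 0x7fff;           /* bits 0..14: desc_event_off */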

> +
> +
> + old = vq->signalled_used;
> + v = vq->signalled_used_valid;
> + new = vq->signalled_used = vq->last_used_idx;
> + vq->signalled_used_valid = true;
> +
> + if (unlikely(!v))
> + return true;
> +
> + return vhost_vring_packed_need_event(vq, new, old, off) &&
> +wrap == vq->used_wrap_counter;
> +}
> +
> +static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
> +{
> + if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
> + return vhost_notify_packed(dev, vq);
> + else
> + return vhost_notify_split(dev, vq);
> +}
> +
>  /* This actually signals the guest, using eventfd. */
>  void vhost_signal(struct vhost_dev *dev, struct vhost_virtqueue *vq)
>  {
> @@ -2789,7 +2911,17 @@ static bool vhost_enable_notify_packed(struct 
> vhost_dev *dev,
>   __virtio16 flags;
>   int ret;
>  
> - /* FIXME: disable notification through device area */
> + if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY))
> + return false;
> + vq->used_flags &= ~VRING_USED_F_NO_NOTIFY;
> +
> + flags = cpu_to_vhost16(vq, RING_EVENT_FLAGS_ENABLE);
> + ret = vhost_update_device_flags(vq, flags);
> + if (ret) {
> + vq_err(vq, "Failed to enable notification at %p: %d\n",
> +     &vq->device_event->desc_event_flags, ret);
> + return false;
> + }
>  
>   /* They could have slipped one in as we were doing that: make
>* sure it's written, then check again. */
> @@ -2855,7 +2987,18 @@ EXPORT_SYMBOL_GPL(vhost_enable_notify);
>  static void vhost_disable_notify_packed(struct vhost_dev *dev,
>   struct vhost_virtqueue *vq)
>  {
> - /* FIXME: disable notification through device area */
> + __virtio16 flags;
> + int r;
> +
> + if (vq->used_flags & VRING_USED_F_NO_NOTIFY)
> + return;
> + vq->used_flags |= VRING_USED_F_NO_NOTIFY;
> +
> + flags = cpu_to_vhost16(vq, RING_EVENT_FLAGS_DISABLE);
> + r = vhost_update_device_flags(vq, flags);
> + if (r)
> + vq_err(vq, "Failed to enable notification at %p: %d\n",
> +     &vq->device_event->desc_event_flags, r);
>  }
>  
>  static void vhost_disable_notify_split(struct vhost_dev *dev,
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 8a9df4f..02d7a36 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -96,8 +96,14 @@ struct vhost_virtqueue {
>   struct vring_desc __user *desc;
>   struct vring_desc_packed __user *desc_packed;

Do you think it'd be better to name the desc type
struct vring_packed_desc? That would be consistent
with other names, like:

struct 

Re: [PATCH RFC 2/2] virtio_ring: support packed ring

2018-03-21 Thread Tiwei Bie
On Fri, Mar 16, 2018 at 04:30:02PM +0200, Michael S. Tsirkin wrote:
> On Fri, Mar 16, 2018 at 07:36:47PM +0800, Jason Wang wrote:
> > > > @@ -1096,17 +1599,21 @@ struct virtqueue *vring_create_virtqueue(
> > > > > > > > >   if (!queue) {
> > > > > > > > >   /* Try to get a single page. You are my only 
> > > > > > > > > hope! */
> > > > > > > > > - queue = vring_alloc_queue(vdev, vring_size(num, 
> > > > > > > > > vring_align),
> > > > > > > > > + queue = vring_alloc_queue(vdev, 
> > > > > > > > > __vring_size(num, vring_align,
> > > > > > > > > +  
> > > > > > > > > packed),
> > > > > > > > > &dma_addr, 
> > > > > > > > > GFP_KERNEL|__GFP_ZERO);
> > > > > > > > >   }
> > > > > > > > >   if (!queue)
> > > > > > > > >   return NULL;
> > > > > > > > > - queue_size_in_bytes = vring_size(num, vring_align);
> > > > > > > > > - vring_init(&vring, num, queue, vring_align);
> > > > > > > > > + queue_size_in_bytes = __vring_size(num, vring_align, 
> > > > > > > > > packed);
> > > > > > > > > + if (packed)
> > > > > > > > > + vring_packed_init(&vring_packed, num, 
> > > > > > > > > queue, vring_align);
> > > > > > > > > + else
> > > > > > > > > +   vring_init(&vring_split, num, queue, 
> > > > > > > > > vring_align);
> > > > > > > > Let's rename vring_init to vring_init_split() like other 
> > > > > > > > helpers?
> > > > > > > The vring_init() is a public API in 
> > > > > > > include/uapi/linux/virtio_ring.h.
> > > > > > > I don't think we can rename it.
> > > > > > I see, then this need more thoughts to unify the API.
> > > > > My thought is to keep the old API as is, and introduce
> > > > > new types and helpers for packed ring.
> > > > I admit it's not a fault of this patch. But we'd better think of this 
> > > > in the
> > > > future, consider we may have new kinds of ring.
> > > > 
> > > > > More details can be found in this patch:
> > > > > https://lkml.org/lkml/2018/2/23/243
> > > > > (PS. The type which has bit fields is just for reference,
> > > > >and will be changed in next version.)
> > > > > 
> > > > > Do you have any other suggestions?
> > > > No.
> > > Hmm.. Sorry, I didn't describe my question well.
> > > I mean do you have any suggestions about the API
> > > design for packed ring in uapi header? Currently
> > > I introduced below two new helpers:
> > > 
> > > static inline void vring_packed_init(struct vring_packed *vr, unsigned 
> > > int num,
> > >void *p, unsigned long align);
> > > static inline unsigned vring_packed_size(unsigned int num, unsigned long 
> > > align);
> > > 
> > > When new rings are introduced in the future, above
> > > helpers can't be reused. Maybe we should make the
> > > helpers be able to determine the ring type?
> > 
> > Let's wait for Michael's comment here. Generally, I fail to understand why
> > vring_init() become a part of uapi. Git grep shows the only use cases are
> > virtio_test/vringh_test.
> > 
> > Thanks
> 
> For init - I think it's a mistake that stems from lguest which sometimes
> made it less than obvious which code is where.  I don't see a reason to
> add to it.

Got it! I'll move vring_packed_init() out of uapi. Many thanks! :)

Best regards,
Tiwei Bie

> 
> -- 
> MST


Re: [PATCH RFC 2/2] virtio_ring: support packed ring

2018-03-21 Thread Tiwei Bie
On Fri, Mar 16, 2018 at 07:36:47PM +0800, Jason Wang wrote:
> On 2018年03月16日 18:04, Tiwei Bie wrote:
> > On Fri, Mar 16, 2018 at 04:34:28PM +0800, Jason Wang wrote:
> > > On 2018年03月16日 15:40, Tiwei Bie wrote:
> > > > On Fri, Mar 16, 2018 at 02:44:12PM +0800, Jason Wang wrote:
> > > > > On 2018年03月16日 14:10, Tiwei Bie wrote:
> > > > > > On Fri, Mar 16, 2018 at 12:03:25PM +0800, Jason Wang wrote:
> > > > > > > On 2018年02月23日 19:18, Tiwei Bie wrote:
> > > > > > > > Signed-off-by: Tiwei Bie <tiwei@intel.com>
> > > > > > > > ---
> > > > > > > >  drivers/virtio/virtio_ring.c | 699 
> > > > > > > > +--
> > > > > > > >  include/linux/virtio_ring.h  |   8 +-
> > > > > > > >  2 files changed, 618 insertions(+), 89 deletions(-)
[...]
> > > @@ -1096,17 +1599,21 @@ struct virtqueue *vring_create_virtqueue(
> > > > > > > > if (!queue) {
> > > > > > > > /* Try to get a single page. You are my only 
> > > > > > > > hope! */
> > > > > > > > -   queue = vring_alloc_queue(vdev, vring_size(num, 
> > > > > > > > vring_align),
> > > > > > > > +   queue = vring_alloc_queue(vdev, 
> > > > > > > > __vring_size(num, vring_align,
> > > > > > > > +
> > > > > > > > packed),
> > > > > > > >   &dma_addr, 
> > > > > > > > GFP_KERNEL|__GFP_ZERO);
> > > > > > > > }
> > > > > > > > if (!queue)
> > > > > > > > return NULL;
> > > > > > > > -   queue_size_in_bytes = vring_size(num, vring_align);
> > > > > > > > -   vring_init(&vring, num, queue, vring_align);
> > > > > > > > +   queue_size_in_bytes = __vring_size(num, vring_align, 
> > > > > > > > packed);
> > > > > > > > +   if (packed)
> > > > > > > > +   vring_packed_init(&vring_packed, num, 
> > > > > > > > queue, vring_align);
> > > > > > > > +   else
> > > > > > > > +   vring_init(&vring_split, num, queue, 
> > > > > > > > vring_align);
> > > > > > > Let's rename vring_init to vring_init_split() like other helpers?
> > > > > > The vring_init() is a public API in 
> > > > > > include/uapi/linux/virtio_ring.h.
> > > > > > I don't think we can rename it.
> > > > > I see, then this need more thoughts to unify the API.
> > > > My thought is to keep the old API as is, and introduce
> > > > new types and helpers for packed ring.
> > > I admit it's not a fault of this patch. But we'd better think of this in 
> > > the
> > > future, consider we may have new kinds of ring.
> > > 
> > > > More details can be found in this patch:
> > > > https://lkml.org/lkml/2018/2/23/243
> > > > (PS. The type which has bit fields is just for reference,
> > > >and will be changed in next version.)
> > > > 
> > > > Do you have any other suggestions?
> > > No.
> > Hmm.. Sorry, I didn't describe my question well.
> > I mean do you have any suggestions about the API
> > design for packed ring in uapi header? Currently
> > I introduced below two new helpers:
> > 
> > static inline void vring_packed_init(struct vring_packed *vr, unsigned int 
> > num,
> >  void *p, unsigned long align);
> > static inline unsigned vring_packed_size(unsigned int num, unsigned long 
> > align);
> > 
> > When new rings are introduced in the future, above
> > helpers can't be reused. Maybe we should make the
> > helpers be able to determine the ring type?
> 
> Let's wait for Michael's comment here. Generally, I fail to understand why
> vring_init() become a part of uapi. Git grep shows the only use cases are
> virtio_test/vringh_test.

Thank you very much for the review on this patch!
I'll send out a new version ASAP to address these
comments. :)

Best regards,
Tiwei Bie


Re: [PATCH RFC 2/2] virtio_ring: support packed ring

2018-03-16 Thread Tiwei Bie
On Fri, Mar 16, 2018 at 04:34:28PM +0800, Jason Wang wrote:
> On 2018年03月16日 15:40, Tiwei Bie wrote:
> > On Fri, Mar 16, 2018 at 02:44:12PM +0800, Jason Wang wrote:
> > > On 2018年03月16日 14:10, Tiwei Bie wrote:
> > > > On Fri, Mar 16, 2018 at 12:03:25PM +0800, Jason Wang wrote:
> > > > > On 2018年02月23日 19:18, Tiwei Bie wrote:
> > > > > > Signed-off-by: Tiwei Bie <tiwei@intel.com>
> > > > > > ---
> > > > > > drivers/virtio/virtio_ring.c | 699 
> > > > > > +--
> > > > > > include/linux/virtio_ring.h  |   8 +-
> > > > > > 2 files changed, 618 insertions(+), 89 deletions(-)
> > [...]
> > > > > >   cpu_addr, size, direction);
> > > > > > }
> > > > > > -static void vring_unmap_one(const struct vring_virtqueue *vq,
> > > > > > -   struct vring_desc *desc)
> > > > > > +static void vring_unmap_one(const struct vring_virtqueue *vq, void 
> > > > > > *_desc)
> > > > > > {
> > > > > Let's split the helpers to packed/split version like other helpers?
> > > > > (Consider the caller has already known the type of vq).
> > > > Okay.
> > > > 
> > > [...]
> > > 
> > > > > > +   desc[i].flags = flags;
> > > > > > +
> > > > > > +   desc[i].addr = cpu_to_virtio64(_vq->vdev, addr);
> > > > > > +   desc[i].len = cpu_to_virtio32(_vq->vdev, 
> > > > > > sg->length);
> > > > > > +   desc[i].id = cpu_to_virtio32(_vq->vdev, head);
> > > > > If it's a part of chain, we only need to do this for last buffer I 
> > > > > think.
> > > > I'm not sure I've got your point about the "last buffer".
> > > > But, yes, id just needs to be set for the last desc.
> > > Right, I think I meant "last descriptor" :)
> > > 
> > > > > > +   prev = i;
> > > > > > +   i++;
> > > > > It looks to me prev is always i - 1?
> > > > No. prev will be (vq->vring_packed.num - 1) when i becomes 0.
> > > Right, so prev = i ? i - 1 : vq->vring_packed.num - 1.
> > Yes, i wraps together with vq->wrap_counter in following code:
> > 
> > > > > > +   if (!indirect && i >= vq->vring_packed.num) {
> > > > > > +   i = 0;
> > > > > > +   vq->wrap_counter ^= 1;
> > > > > > +   }
> > 
> > > > > > +   }
> > > > > > +   }
> > > > > > +   for (; n < (out_sgs + in_sgs); n++) {
> > > > > > +   for (sg = sgs[n]; sg; sg = sg_next(sg)) {
> > > > > > +   dma_addr_t addr = vring_map_one_sg(vq, sg, 
> > > > > > DMA_FROM_DEVICE);
> > > > > > +   if (vring_mapping_error(vq, addr))
> > > > > > +   goto unmap_release;
> > > > > > +
> > > > > > +   flags = cpu_to_virtio16(_vq->vdev, 
> > > > > > VRING_DESC_F_NEXT |
> > > > > > +   VRING_DESC_F_WRITE |
> > > > > > +   
> > > > > > VRING_DESC_F_AVAIL(vq->wrap_counter) |
> > > > > > +   
> > > > > > VRING_DESC_F_USED(!vq->wrap_counter));
> > > > > > +   if (!indirect && i == head)
> > > > > > +   head_flags = flags;
> > > > > > +   else
> > > > > > +   desc[i].flags = flags;
> > > > > > +
> > > > > > +   desc[i].addr = cpu_to_virtio64(_vq->vdev, addr);
> > > > > > +   desc[i].len = cpu_to_virtio32(_vq->vdev, 
> > > > > > sg->length);
> > > > > > +   desc[i].id = cpu_to_virtio32(_vq->vdev, head);
> > > > > > +   prev = i;
> > > > > > +   i++;
> > > &g

Re: [PATCH RFC 2/2] virtio_ring: support packed ring

2018-03-16 Thread Tiwei Bie
On Fri, Mar 16, 2018 at 02:44:12PM +0800, Jason Wang wrote:
> On 2018年03月16日 14:10, Tiwei Bie wrote:
> > On Fri, Mar 16, 2018 at 12:03:25PM +0800, Jason Wang wrote:
> > > On 2018年02月23日 19:18, Tiwei Bie wrote:
> > > > Signed-off-by: Tiwei Bie <tiwei@intel.com>
> > > > ---
> > > >drivers/virtio/virtio_ring.c | 699 
> > > > +--
> > > >include/linux/virtio_ring.h  |   8 +-
> > > >2 files changed, 618 insertions(+), 89 deletions(-)
[...]
> > > >   cpu_addr, size, direction);
> > > >}
> > > > -static void vring_unmap_one(const struct vring_virtqueue *vq,
> > > > -   struct vring_desc *desc)
> > > > +static void vring_unmap_one(const struct vring_virtqueue *vq, void 
> > > > *_desc)
> > > >{
> > > Let's split the helpers to packed/split version like other helpers?
> > > (Consider the caller has already known the type of vq).
> > Okay.
> > 
> 
> [...]
> 
> > > > +   desc[i].flags = flags;
> > > > +
> > > > +   desc[i].addr = cpu_to_virtio64(_vq->vdev, addr);
> > > > +   desc[i].len = cpu_to_virtio32(_vq->vdev, 
> > > > sg->length);
> > > > +   desc[i].id = cpu_to_virtio32(_vq->vdev, head);
> > > If it's a part of chain, we only need to do this for last buffer I think.
> > I'm not sure I've got your point about the "last buffer".
> > But, yes, id just needs to be set for the last desc.
> 
> Right, I think I meant "last descriptor" :)
> 
> > 
> > > > +   prev = i;
> > > > +   i++;
> > > It looks to me prev is always i - 1?
> > No. prev will be (vq->vring_packed.num - 1) when i becomes 0.
> 
> Right, so prev = i ? i - 1 : vq->vring_packed.num - 1.

Yes, i wraps together with vq->wrap_counter in following code:

> > > > +   if (!indirect && i >= vq->vring_packed.num) {
> > > > +   i = 0;
> > > > +   vq->wrap_counter ^= 1;
> > > > +   }


> > > > +   }
> > > > +   }
> > > > +   for (; n < (out_sgs + in_sgs); n++) {
> > > > +   for (sg = sgs[n]; sg; sg = sg_next(sg)) {
> > > > +   dma_addr_t addr = vring_map_one_sg(vq, sg, 
> > > > DMA_FROM_DEVICE);
> > > > +   if (vring_mapping_error(vq, addr))
> > > > +   goto unmap_release;
> > > > +
> > > > +   flags = cpu_to_virtio16(_vq->vdev, 
> > > > VRING_DESC_F_NEXT |
> > > > +   VRING_DESC_F_WRITE |
> > > > +   
> > > > VRING_DESC_F_AVAIL(vq->wrap_counter) |
> > > > +   
> > > > VRING_DESC_F_USED(!vq->wrap_counter));
> > > > +   if (!indirect && i == head)
> > > > +   head_flags = flags;
> > > > +   else
> > > > +   desc[i].flags = flags;
> > > > +
> > > > +   desc[i].addr = cpu_to_virtio64(_vq->vdev, addr);
> > > > +   desc[i].len = cpu_to_virtio32(_vq->vdev, 
> > > > sg->length);
> > > > +   desc[i].id = cpu_to_virtio32(_vq->vdev, head);
> > > > +   prev = i;
> > > > +   i++;
> > > > +   if (!indirect && i >= vq->vring_packed.num) {
> > > > +   i = 0;
> > > > +   vq->wrap_counter ^= 1;
> > > > +   }
> > > > +   }
> > > > +   }
> > > > +   /* Last one doesn't continue. */
> > > > +   if (!indirect && (head + 1) % vq->vring_packed.num == i)
> > > > +   head_flags &= cpu_to_virtio16(_vq->vdev, 
> > > > ~VRING_DESC_F_NEXT);
> > > I can't get why we need this here.
> > If only one desc is used,

Re: [PATCH RFC 2/2] virtio_ring: support packed ring

2018-03-16 Thread Tiwei Bie
On Fri, Mar 16, 2018 at 12:03:25PM +0800, Jason Wang wrote:
> On 2018年02月23日 19:18, Tiwei Bie wrote:
> > Signed-off-by: Tiwei Bie <tiwei@intel.com>
> > ---
> >   drivers/virtio/virtio_ring.c | 699 
> > +--
> >   include/linux/virtio_ring.h  |   8 +-
> >   2 files changed, 618 insertions(+), 89 deletions(-)
> > 
> > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > index eb30f3e09a47..393778a2f809 100644
> > --- a/drivers/virtio/virtio_ring.c
> > +++ b/drivers/virtio/virtio_ring.c
> > @@ -58,14 +58,14 @@
> >   struct vring_desc_state {
> > void *data; /* Data for callback. */
> > -   struct vring_desc *indir_desc;  /* Indirect descriptor, if any. */
> > +   void *indir_desc;   /* Indirect descriptor, if any. */
> > +   int num;/* Descriptor list length. */
> >   };
> >   struct vring_virtqueue {
> > struct virtqueue vq;
> > -   /* Actual memory layout for this queue */
> > -   struct vring vring;
> > +   bool packed;
> > /* Can we use weak barriers? */
> > bool weak_barriers;
> > @@ -87,11 +87,28 @@ struct vring_virtqueue {
> > /* Last used index we've seen. */
> > u16 last_used_idx;
> > -   /* Last written value to avail->flags */
> > -   u16 avail_flags_shadow;
> > -
> > -   /* Last written value to avail->idx in guest byte order */
> > -   u16 avail_idx_shadow;
> > +   union {
> > +   /* Available for split ring */
> > +   struct {
> > +   /* Actual memory layout for this queue */
> > +   struct vring vring;
> > +
> > +   /* Last written value to avail->flags */
> > +   u16 avail_flags_shadow;
> > +
> > +   /* Last written value to avail->idx in
> > +* guest byte order */
> > +   u16 avail_idx_shadow;
> > +   };
> > +
> > +   /* Available for packed ring */
> > +   struct {
> > +   /* Actual memory layout for this queue */
> > +   struct vring_packed vring_packed;
> > +   u8 wrap_counter : 1;
> > +   bool chaining;
> > +   };
> > +   };
> > /* How to notify other side. FIXME: commonalize hcalls! */
> > bool (*notify)(struct virtqueue *vq);
> > @@ -201,26 +218,37 @@ static dma_addr_t vring_map_single(const struct 
> > vring_virtqueue *vq,
> >   cpu_addr, size, direction);
> >   }
> > -static void vring_unmap_one(const struct vring_virtqueue *vq,
> > -   struct vring_desc *desc)
> > +static void vring_unmap_one(const struct vring_virtqueue *vq, void *_desc)
> >   {
> 
> Let's split the helpers to packed/split version like other helpers?
> (Consider the caller has already known the type of vq).

Okay.

> 
> > +   u64 addr;
> > +   u32 len;
> > u16 flags;
> > if (!vring_use_dma_api(vq->vq.vdev))
> > return;
> > -   flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
> > +   if (vq->packed) {
> > +   struct vring_packed_desc *desc = _desc;
> > +
> > +   addr = virtio64_to_cpu(vq->vq.vdev, desc->addr);
> > +   len = virtio32_to_cpu(vq->vq.vdev, desc->len);
> > +   flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
> > +   } else {
> > +   struct vring_desc *desc = _desc;
> > +
> > +   addr = virtio64_to_cpu(vq->vq.vdev, desc->addr);
> > +   len = virtio32_to_cpu(vq->vq.vdev, desc->len);
> > +   flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
> > +   }
> > if (flags & VRING_DESC_F_INDIRECT) {
> > dma_unmap_single(vring_dma_dev(vq),
> > -virtio64_to_cpu(vq->vq.vdev, desc->addr),
> > -virtio32_to_cpu(vq->vq.vdev, desc->len),
> > +addr, len,
> >  (flags & VRING_DESC_F_WRITE) ?
> >  DMA_FROM_DEVICE : DMA_TO_DEVICE);
> > } else {
> > dma_unmap_page(vring_dma_dev(vq),
> > -  virtio64_to_cpu(vq->vq.vdev, desc->addr),
> > -  virtio32_to_cpu(vq->vq.vdev, desc->len),
> > +  addr, len,
> >

Re: [PATCH RFC 1/2] virtio: introduce packed ring defines

2018-02-27 Thread Tiwei Bie
On Tue, Feb 27, 2018 at 09:54:58AM +0100, Jens Freimann wrote:
> On Fri, Feb 23, 2018 at 07:18:00PM +0800, Tiwei Bie wrote:
[...]
> > 
> > +struct vring_packed_desc_event {
> > +   /* Descriptor Event Offset */
> > +   __virtio16 desc_event_off   : 15,
> > +   /* Descriptor Event Wrap Counter */
> > +  desc_event_wrap  : 1;
> > +   /* Descriptor Event Flags */
> > +   __virtio16 desc_event_flags : 2;
> > +};
> 
> Where would the virtqueue number go in driver notifications?

This structure is for event suppression instead of notification.

You could refer to the "Event Suppression Structure Format"
section of the spec for more details:

https://github.com/oasis-tcs/virtio-docs/blob/master/virtio-v1.1-packed-wd08.pdf

Best regards,
Tiwei Bie


Re: [PATCH RFC 1/2] virtio: introduce packed ring defines

2018-02-27 Thread Tiwei Bie
On Tue, Feb 27, 2018 at 09:26:27AM +, David Laight wrote:
> From: Tiwei Bie
> > Sent: 23 February 2018 11:18
> ...
> > +struct vring_packed_desc_event {
> > +   /* Descriptor Event Offset */
> > +   __virtio16 desc_event_off   : 15,
> > +   /* Descriptor Event Wrap Counter */
> > +  desc_event_wrap  : 1;
> > +   /* Descriptor Event Flags */
> > +   __virtio16 desc_event_flags : 2;
> > +};
> 
> This looks like you are assuming that a bit-field has a defined
> layout and can be used to map a 'hardware' structure.
> The don't, don't use them like that.
> 
>   David
> 

Thanks for the comments! Above definition isn't used in
this RFC, and the corresponding parts (event suppression)
haven't been implemented yet. It's more like some pseudo
code (I should add some comments about this in the code).

I planned to change it to something like this in the next
version:

struct vring_packed_desc_event {
__virtio16 off_wrap;
__virtio16 flags;  // XXX maybe not a good name for future
}; // extension. Only 2bits are used now.

But it seems that I had a misunderstanding about the spec
on this previously:

https://lists.oasis-open.org/archives/virtio-dev/201802/msg00173.html

Anyway, it will be addressed. Thank you very much! ;-)

Best regards,
Tiwei Bie


Re: [PATCH RFC 2/2] vhost: packed ring support

2018-02-27 Thread Tiwei Bie
On Wed, Feb 14, 2018 at 10:37:09AM +0800, Jason Wang wrote:
[...]
> +static void set_desc_used(struct vhost_virtqueue *vq,
> +   struct vring_desc_packed *desc, bool wrap_counter)
> +{
> + __virtio16 flags = desc->flags;
> +
> + if (wrap_counter) {
> + desc->flags |= cpu_to_vhost16(vq, DESC_AVAIL);
> + desc->flags |= cpu_to_vhost16(vq, DESC_USED);
> + } else {
> + desc->flags &= ~cpu_to_vhost16(vq, DESC_AVAIL);
> + desc->flags &= ~cpu_to_vhost16(vq, DESC_USED);
> + }
> +
> + desc->flags = flags;

The final assignment restores `desc->flags` to the old value saved
in `flags`, so the DESC_AVAIL/DESC_USED updates above are lost.
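Presumably something like this was intended (just a sketch, reusing the
names from the patch): update the local copy and write it back once at
the end:

__virtio16 flags = desc->flags;

if (wrap_counter) {
flags |= cpu_to_vhost16(vq, DESC_AVAIL);
flags |= cpu_to_vhost16(vq, DESC_USED);
} else {
flags &= ~cpu_to_vhost16(vq, DESC_AVAIL);
flags &= ~cpu_to_vhost16(vq, DESC_USED);
}

desc->flags = flags;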

> +}
> +
> +static int vhost_get_vq_desc_packed(struct vhost_virtqueue *vq,
> + struct iovec iov[], unsigned int iov_size,
> + unsigned int *out_num, unsigned int *in_num,
> + struct vhost_log *log,
> + unsigned int *log_num)
> +{
> + struct vring_desc_packed desc;
> + int ret, access, i;
> + u16 avail_idx = vq->last_avail_idx;
> +
> + /* When we start there are none of either input nor output. */
> + *out_num = *in_num = 0;
> + if (unlikely(log))
> + *log_num = 0;
> +
> + do {
> + unsigned int iov_count = *in_num + *out_num;
> +
> + i = vq->last_avail_idx & (vq->num - 1);

The queue size may not be a power of 2 for a packed ring, so masking
with (vq->num - 1) doesn't work here.
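
A wrap that doesn't assume a power-of-2 size could look roughly like
this (sketch only; last_avail_idx would probably be kept in [0, num)
together with a wrap counter anyway):

i = vq->last_avail_idx;
if (i >= vq->num)
i -= vq->num;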

Best regards,
Tiwei Bie


[PATCH RFC 0/2] Packed ring for virtio

2018-02-23 Thread Tiwei Bie
Hello everyone,

This RFC implements a subset of packed ring which is described at
https://github.com/oasis-tcs/virtio-docs/blob/master/virtio-v1.1-packed-wd08.pdf

The code was tested with DPDK vhost (testpmd/vhost-PMD) implemented
by Jens at http://dpdk.org/ml/archives/dev/2018-January/089417.html
Minor changes are needed for the vhost code, e.g. to kick the guest.

It's not a complete implementation; here is what's missing:

- Device area and driver area
- VIRTIO_RING_F_INDIRECT_DESC
- VIRTIO_F_NOTIFICATION_DATA
- Virtio devices except net are not tested
- See FIXME in the code for more details

Thanks!

Best regards,
Tiwei Bie

Tiwei Bie (2):
  virtio: introduce packed ring defines
  virtio_ring: support packed ring

 drivers/virtio/virtio_ring.c   | 699 -
 include/linux/virtio_ring.h|   8 +-
 include/uapi/linux/virtio_config.h |  18 +-
 include/uapi/linux/virtio_ring.h   |  68 
 4 files changed, 703 insertions(+), 90 deletions(-)

-- 
2.14.1



[PATCH RFC 2/2] virtio_ring: support packed ring

2018-02-23 Thread Tiwei Bie
Signed-off-by: Tiwei Bie <tiwei@intel.com>
---
 drivers/virtio/virtio_ring.c | 699 +--
 include/linux/virtio_ring.h  |   8 +-
 2 files changed, 618 insertions(+), 89 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index eb30f3e09a47..393778a2f809 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -58,14 +58,14 @@
 
 struct vring_desc_state {
void *data; /* Data for callback. */
-   struct vring_desc *indir_desc;  /* Indirect descriptor, if any. */
+   void *indir_desc;   /* Indirect descriptor, if any. */
+   int num;/* Descriptor list length. */
 };
 
 struct vring_virtqueue {
struct virtqueue vq;
 
-   /* Actual memory layout for this queue */
-   struct vring vring;
+   bool packed;
 
/* Can we use weak barriers? */
bool weak_barriers;
@@ -87,11 +87,28 @@ struct vring_virtqueue {
/* Last used index we've seen. */
u16 last_used_idx;
 
-   /* Last written value to avail->flags */
-   u16 avail_flags_shadow;
-
-   /* Last written value to avail->idx in guest byte order */
-   u16 avail_idx_shadow;
+   union {
+   /* Available for split ring */
+   struct {
+   /* Actual memory layout for this queue */
+   struct vring vring;
+
+   /* Last written value to avail->flags */
+   u16 avail_flags_shadow;
+
+   /* Last written value to avail->idx in
+* guest byte order */
+   u16 avail_idx_shadow;
+   };
+
+   /* Available for packed ring */
+   struct {
+   /* Actual memory layout for this queue */
+   struct vring_packed vring_packed;
+   u8 wrap_counter : 1;
+   bool chaining;
+   };
+   };
 
/* How to notify other side. FIXME: commonalize hcalls! */
bool (*notify)(struct virtqueue *vq);
@@ -201,26 +218,37 @@ static dma_addr_t vring_map_single(const struct 
vring_virtqueue *vq,
  cpu_addr, size, direction);
 }
 
-static void vring_unmap_one(const struct vring_virtqueue *vq,
-   struct vring_desc *desc)
+static void vring_unmap_one(const struct vring_virtqueue *vq, void *_desc)
 {
+   u64 addr;
+   u32 len;
u16 flags;
 
if (!vring_use_dma_api(vq->vq.vdev))
return;
 
-   flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
+   if (vq->packed) {
+   struct vring_packed_desc *desc = _desc;
+
+   addr = virtio64_to_cpu(vq->vq.vdev, desc->addr);
+   len = virtio32_to_cpu(vq->vq.vdev, desc->len);
+   flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
+   } else {
+   struct vring_desc *desc = _desc;
+
+   addr = virtio64_to_cpu(vq->vq.vdev, desc->addr);
+   len = virtio32_to_cpu(vq->vq.vdev, desc->len);
+   flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
+   }
 
if (flags & VRING_DESC_F_INDIRECT) {
dma_unmap_single(vring_dma_dev(vq),
-virtio64_to_cpu(vq->vq.vdev, desc->addr),
-virtio32_to_cpu(vq->vq.vdev, desc->len),
+addr, len,
 (flags & VRING_DESC_F_WRITE) ?
 DMA_FROM_DEVICE : DMA_TO_DEVICE);
} else {
dma_unmap_page(vring_dma_dev(vq),
-  virtio64_to_cpu(vq->vq.vdev, desc->addr),
-  virtio32_to_cpu(vq->vq.vdev, desc->len),
+  addr, len,
   (flags & VRING_DESC_F_WRITE) ?
   DMA_FROM_DEVICE : DMA_TO_DEVICE);
}
@@ -235,8 +263,9 @@ static int vring_mapping_error(const struct vring_virtqueue 
*vq,
return dma_mapping_error(vring_dma_dev(vq), addr);
 }
 
-static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
-unsigned int total_sg, gfp_t gfp)
+static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq,
+  unsigned int total_sg,
+  gfp_t gfp)
 {
struct vring_desc *desc;
unsigned int i;
@@ -257,14 +286,32 @@ static struct vring_desc *alloc_indirect(struct virtqueue 
*_vq,
return desc;
 }
 
-static inline int virtqueue_add(struct virtqueue *_vq,
-   struct scatterlist *sgs[],
-

[PATCH RFC 1/2] virtio: introduce packed ring defines

2018-02-23 Thread Tiwei Bie
Signed-off-by: Tiwei Bie <tiwei@intel.com>
---
 include/uapi/linux/virtio_config.h | 18 +-
 include/uapi/linux/virtio_ring.h   | 68 ++
 2 files changed, 85 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/virtio_config.h 
b/include/uapi/linux/virtio_config.h
index 308e2096291f..e3d077ef5207 100644
--- a/include/uapi/linux/virtio_config.h
+++ b/include/uapi/linux/virtio_config.h
@@ -49,7 +49,7 @@
  * transport being used (eg. virtio_ring), the rest are per-device feature
  * bits. */
 #define VIRTIO_TRANSPORT_F_START   28
-#define VIRTIO_TRANSPORT_F_END 34
+#define VIRTIO_TRANSPORT_F_END 37
 
 #ifndef VIRTIO_CONFIG_NO_LEGACY
 /* Do we get callbacks when the ring is completely used, even if we've
@@ -71,4 +71,20 @@
  * this is for compatibility with legacy systems.
  */
 #define VIRTIO_F_IOMMU_PLATFORM33
+
+/* This feature indicates support for the packed virtqueue layout. */
+#define VIRTIO_F_RING_PACKED   34
+
+/*
+ * This feature indicates that all buffers are used by the device
+ * in the same order in which they have been made available.
+ */
+#define VIRTIO_F_IN_ORDER  35
+
+/*
+ * This feature indicates that drivers pass extra data (besides
+ * identifying the Virtqueue) in their device notifications.
+ */
+#define VIRTIO_F_NOTIFICATION_DATA 36
+
 #endif /* _UAPI_LINUX_VIRTIO_CONFIG_H */
diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h
index 6d5d5faa989b..77b1d4aeef72 100644
--- a/include/uapi/linux/virtio_ring.h
+++ b/include/uapi/linux/virtio_ring.h
@@ -44,6 +44,9 @@
 /* This means the buffer contains a list of buffer descriptors. */
 #define VRING_DESC_F_INDIRECT  4
 
+#define VRING_DESC_F_AVAIL(b)  ((b) << 7)
+#define VRING_DESC_F_USED(b)   ((b) << 15)
+
 /* The Host uses this in used->flags to advise the Guest: don't kick me when
  * you add a buffer.  It's unreliable, so it's simply an optimization.  Guest
  * will still kick if it's out of buffers. */
@@ -104,6 +107,36 @@ struct vring {
struct vring_used *used;
 };
 
+struct vring_packed_desc_event {
+   /* Descriptor Event Offset */
+   __virtio16 desc_event_off   : 15,
+   /* Descriptor Event Wrap Counter */
+  desc_event_wrap  : 1;
+   /* Descriptor Event Flags */
+   __virtio16 desc_event_flags : 2;
+};
+
+struct vring_packed_desc {
+   /* Buffer Address. */
+   __virtio64 addr;
+   /* Buffer Length. */
+   __virtio32 len;
+   /* Buffer ID. */
+   __virtio16 id;
+   /* The flags depending on descriptor type. */
+   __virtio16 flags;
+};
+
+struct vring_packed {
+   unsigned int num;
+
+   struct vring_packed_desc *desc;
+
+   struct vring_packed_desc_event *driver;
+
+   struct vring_packed_desc_event *device;
+};
+
 /* Alignment requirements for vring elements.
  * When using pre-virtio 1.0 layout, these fall out naturally.
  */
@@ -171,4 +204,39 @@ static inline int vring_need_event(__u16 event_idx, __u16 
new_idx, __u16 old)
return (__u16)(new_idx - event_idx - 1) < (__u16)(new_idx - old);
 }
 
+/* The standard layout for the packed ring is a continuous chunk of memory
+ * which looks like this.
+ *
+ * struct vring_packed
+ * {
+ * // The actual descriptors (16 bytes each)
+ * struct vring_packed_desc desc[num];
+ *
+ * // Padding to the next align boundary.
+ * char pad[];
+ *
+ * // Driver Event Suppression
+ * struct vring_packed_desc_event driver;
+ *
+ * // Device Event Suppression
+ * struct vring_packed_desc_event device;
+ * };
+ */
+
+static inline void vring_packed_init(struct vring_packed *vr, unsigned int num,
+void *p, unsigned long align)
+{
+   vr->num = num;
+   vr->desc = p;
+   vr->driver = (void *)(((uintptr_t)p + sizeof(struct vring_packed_desc)
+   * num + align - 1) & ~(align - 1));
+   vr->device = vr->driver + 1;
+}
+
+static inline unsigned vring_packed_size(unsigned int num, unsigned long align)
+{
+   return ((sizeof(struct vring_packed_desc) * num + align - 1)
+   & ~(align - 1)) + sizeof(struct vring_packed_desc_event) * 2;
+}
+
 #endif /* _UAPI_LINUX_VIRTIO_RING_H */
-- 
2.14.1