Re: [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-11 Thread Yuanhan Liu
On Mon, Oct 10, 2016 at 07:39:59AM +0300, Michael S. Tsirkin wrote:
> > > > > > 1. why is this memset expensive?
> > > > >
> > > > > I don't have the exact answer, but just some rough thoughts:
> > > > >
> > > > > It's an external clib function: there is a call stack and the
> > > > > IP register will bounce back and forth.
> > > >
> > > > for memset 0?  gcc 5.3.1 on fedora happily inlines it.
> > > 
> > > Good to know!
> > > 
> > > > > overkill to use that for resetting a 14-byte structure.
> > > > >
> > > > > Some trick like
> > > > > *(struct virtio_net_hdr *)hdr = (struct virtio_net_hdr){ 0 };
> > > > >
> > > > > Or even
> > > > > hdr->xxx = 0;
> > > > > hdr->yyy = 0;
> > > > >
> > > > > should behave better.
> > > > >
> > > > > There was an example: the vhost enqueue optimization patchset from
> > > > > Zhihong [0] uses memset, and it introduces more than 15% drop (IIRC)
> > > > > on my Ivybridge server: it has no such issue on his server though.
> > > > >
> > > > > [0]: http://dpdk.org/ml/archives/dev/2016-August/045272.html
> > > > >
> > > > >   --yliu
> > > >
> > > > I'd say that's weird. what's your config? any chance you
> > > > are using an old compiler?
> > > 
> > > Not really, it's gcc 5.3.1. Maybe Zhihong could explain more. IIRC,
> > > he said the memset is not well optimized for Ivybridge server.
> > 
> > The dst is remote in that case. It's fine on Haswell but has a complication
> > on Ivy Bridge which (wasn't supposed to but) causes a serious frontend issue.
> > 
> > I don't think gcc inlined it there. I'm using fc24 gcc 6.1.1.
> 
> 
> So try something like this then:

Yes, I saw the memset get inlined when this diff is applied.

So, mind sending a formal patch? You might want to try building it
first, though: as posted, it doesn't build.

 
> Generally pointer chasing in vq->hw->vtnet_hdr_size can't be good
> for performance. Move fields used on the data path into vq
> and use them from there to avoid indirections?

Good suggestion!

--yliu



Re: [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-11 Thread Yuanhan Liu
On Tue, Oct 11, 2016 at 08:39:54AM +0200, Maxime Coquelin wrote:
> 
> 
> On 10/11/2016 08:04 AM, Yuanhan Liu wrote:
> >On Mon, Oct 10, 2016 at 04:54:39PM +0200, Maxime Coquelin wrote:
> >>
> >>
> >>On 10/10/2016 04:42 PM, Yuanhan Liu wrote:
> >>>On Mon, Oct 10, 2016 at 02:40:44PM +0200, Maxime Coquelin wrote:
> >>>At that time, a packet always used 2 descs. Since indirect desc is
> >>>enabled (by default) now, the assumption no longer holds. What's
> >>>worse, it might even slow things down a bit. That should also be
> >>>part of the reason why performance is slightly worse than before.
> >>>
> >>>   --yliu
> >>
> >>I'm not sure I get what you are saying
> >>
> >>>commit 1d41d77cf81c448c1b09e1e859bfd300e2054a98
> >>>Author: Yuanhan Liu 
> >>>Date:   Mon May 2 17:46:17 2016 -0700
> >>>
> >>>  vhost: optimize dequeue for small packets
> >>>
> >>>  A virtio driver normally uses at least 2 desc buffers for Tx: the
> >>>  first for storing the header, and the others for storing the data.
> >>>
> >>>  Therefore, we could fetch the first data desc buf before the main
> >>>  loop, and do the copy first before the check of "are we done yet?".
> >>>  This could save one check for small packets that just have one data
> >>>  desc buffer and need one mbuf to store it.
> >>>
> >>>  Signed-off-by: Yuanhan Liu 
> >>>  Acked-by: Huawei Xie 
> >>>  Tested-by: Rich Lane 
> >>
> >>This fast-paths the 2-descriptors format but it's not active
> >>for indirect descriptors. Is this what you mean?
> >
> >Yes. It's also not active when ANY_LAYOUT is actually turned on.
> >>Should be a simple matter to apply this optimization for indirect.
> >
> >Might be.
> 
> If I understand the code correctly, indirect descs also benefit from this
> optimization, or am I missing something?
> >>>
> >>>Aha..., you are right!
> >>
> >>The interesting thing is that the patch I sent on Thursday, which removes
> >>header access when no offload has been negotiated [0], seems to reduce
> >>almost to zero the performance penalty seen with indirect descriptors enabled.
> >
> >Didn't follow that.
> >
> >>I see this with 64-byte packets using testpmd on both ends.
> >>
> >>When I did the patch, I would have expected the same gain with both
> >>modes, whereas I measured +1% for direct and +4% for indirect.
> >
> >IIRC, I did a test before (removing that offload code), and the
> >performance was basically the same before and after. Well, there
> >might be some small difference, say 1% as you said. But the results
> >have never been steady.
> >
> >Anyway, I think your patch is good to have: I just didn't see a v2.
> 
> I waited to gather some comments/feedback before sending the v2.
> I'll send it today or tomorrow.

Interesting, we have a deadlock then: I haven't looked at the code
carefully since you said there would be a v2, so I have been waiting
for it. However, you are waiting for my review. :)

Anyway, I will take time to look at it shortly.

--yliu



Re: [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-11 Thread Maxime Coquelin



On 10/11/2016 08:04 AM, Yuanhan Liu wrote:

On Mon, Oct 10, 2016 at 04:54:39PM +0200, Maxime Coquelin wrote:



On 10/10/2016 04:42 PM, Yuanhan Liu wrote:

On Mon, Oct 10, 2016 at 02:40:44PM +0200, Maxime Coquelin wrote:

At that time, a packet always used 2 descs. Since indirect desc is
enabled (by default) now, the assumption no longer holds. What's
worse, it might even slow things down a bit. That should also be
part of the reason why performance is slightly worse than before.

--yliu


I'm not sure I get what you are saying


commit 1d41d77cf81c448c1b09e1e859bfd300e2054a98
Author: Yuanhan Liu 
Date:   Mon May 2 17:46:17 2016 -0700

  vhost: optimize dequeue for small packets

  A virtio driver normally uses at least 2 desc buffers for Tx: the
  first for storing the header, and the others for storing the data.

  Therefore, we could fetch the first data desc buf before the main
  loop, and do the copy first before the check of "are we done yet?".
  This could save one check for small packets that just have one data
  desc buffer and need one mbuf to store it.

  Signed-off-by: Yuanhan Liu 
  Acked-by: Huawei Xie 
  Tested-by: Rich Lane 


This fast-paths the 2-descriptors format but it's not active
for indirect descriptors. Is this what you mean?


Yes. It's also not active when ANY_LAYOUT is actually turned on.

Should be a simple matter to apply this optimization for indirect.


Might be.


If I understand the code correctly, indirect descs also benefit from this
optimization, or am I missing something?


Aha..., you are right!


The interesting thing is that the patch I sent on Thursday, which removes
header access when no offload has been negotiated [0], seems to reduce
almost to zero the performance penalty seen with indirect descriptors enabled.


Didn't follow that.


I see this with 64-byte packets using testpmd on both ends.

When I did the patch, I would have expected the same gain with both
modes, whereas I measured +1% for direct and +4% for indirect.


IIRC, I did a test before (removing that offload code), and the
performance was basically the same before and after. Well, there
might be some small difference, say 1% as you said. But the results
have never been steady.

Anyway, I think your patch is good to have: I just didn't see a v2.


I waited to gather some comments/feedback before sending the v2.
I'll send it today or tomorrow.

Thanks,
Maxime



Re: [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-11 Thread Yuanhan Liu
On Mon, Oct 10, 2016 at 04:54:39PM +0200, Maxime Coquelin wrote:
> 
> 
> On 10/10/2016 04:42 PM, Yuanhan Liu wrote:
> >On Mon, Oct 10, 2016 at 02:40:44PM +0200, Maxime Coquelin wrote:
> >At that time, a packet always used 2 descs. Since indirect desc is
> >enabled (by default) now, the assumption no longer holds. What's
> >worse, it might even slow things down a bit. That should also be
> >part of the reason why performance is slightly worse than before.
> >
> > --yliu
> 
> I'm not sure I get what you are saying
> 
> >commit 1d41d77cf81c448c1b09e1e859bfd300e2054a98
> >Author: Yuanhan Liu 
> >Date:   Mon May 2 17:46:17 2016 -0700
> >
> >   vhost: optimize dequeue for small packets
> >
> >   A virtio driver normally uses at least 2 desc buffers for Tx: the
> >   first for storing the header, and the others for storing the data.
> >
> >   Therefore, we could fetch the first data desc buf before the main
> >   loop, and do the copy first before the check of "are we done yet?".
> >   This could save one check for small packets that just have one data
> >   desc buffer and need one mbuf to store it.
> >
> >   Signed-off-by: Yuanhan Liu 
> >   Acked-by: Huawei Xie 
> >   Tested-by: Rich Lane 
> 
> This fast-paths the 2-descriptors format but it's not active
> for indirect descriptors. Is this what you mean?
> >>>
> >>>Yes. It's also not active when ANY_LAYOUT is actually turned on.
> Should be a simple matter to apply this optimization for indirect.
> >>>
> >>>Might be.
> >>
> >>If I understand the code correctly, indirect descs also benefit from this
> >>optimization, or am I missing something?
> >
> >Aha..., you are right!
> 
> The interesting thing is that the patch I sent on Thursday, which removes
> header access when no offload has been negotiated [0], seems to reduce
> almost to zero the performance penalty seen with indirect descriptors enabled.

Didn't follow that.

> I see this with 64-byte packets using testpmd on both ends.
> 
> When I did the patch, I would have expected the same gain with both
> modes, whereas I measured +1% for direct and +4% for indirect.

IIRC, I did a test before (removing that offload code), and the
performance was basically the same before and after. Well, there
might be some small difference, say 1% as you said. But the results
have never been steady.

Anyway, I think your patch is good to have: I just didn't see a v2.

--yliu



Re: [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-10 Thread Maxime Coquelin



On 10/10/2016 04:42 PM, Yuanhan Liu wrote:

On Mon, Oct 10, 2016 at 02:40:44PM +0200, Maxime Coquelin wrote:

At that time, a packet always used 2 descs. Since indirect desc is
enabled (by default) now, the assumption no longer holds. What's
worse, it might even slow things down a bit. That should also be
part of the reason why performance is slightly worse than before.

--yliu


I'm not sure I get what you are saying


commit 1d41d77cf81c448c1b09e1e859bfd300e2054a98
Author: Yuanhan Liu 
Date:   Mon May 2 17:46:17 2016 -0700

   vhost: optimize dequeue for small packets

   A virtio driver normally uses at least 2 desc buffers for Tx: the
   first for storing the header, and the others for storing the data.

   Therefore, we could fetch the first data desc buf before the main
   loop, and do the copy first before the check of "are we done yet?".
   This could save one check for small packets that just have one data
   desc buffer and need one mbuf to store it.

   Signed-off-by: Yuanhan Liu 
   Acked-by: Huawei Xie 
   Tested-by: Rich Lane 


This fast-paths the 2-descriptors format but it's not active
for indirect descriptors. Is this what you mean?


Yes. It's also not active when ANY_LAYOUT is actually turned on.

Should be a simple matter to apply this optimization for indirect.


Might be.


If I understand the code correctly, indirect descs also benefit from this
optimization, or am I missing something?


Aha..., you are right!


The interesting thing is that the patch I sent on Thursday, which removes
header access when no offload has been negotiated [0], seems to reduce
almost to zero the performance penalty seen with indirect descriptors enabled.
I see this with 64-byte packets using testpmd on both ends.

When I did the patch, I would have expected the same gain with both
modes, whereas I measured +1% for direct and +4% for indirect.

Maxime
[0]: http://dpdk.org/dev/patchwork/patch/16423/



Re: [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-10 Thread Yuanhan Liu
On Mon, Oct 10, 2016 at 02:40:44PM +0200, Maxime Coquelin wrote:
> >>>At that time, a packet always used 2 descs. Since indirect desc is
> >>>enabled (by default) now, the assumption no longer holds. What's
> >>>worse, it might even slow things down a bit. That should also be
> >>>part of the reason why performance is slightly worse than before.
> >>>
> >>>   --yliu
> >>
> >>I'm not sure I get what you are saying
> >>
> >>>commit 1d41d77cf81c448c1b09e1e859bfd300e2054a98
> >>>Author: Yuanhan Liu 
> >>>Date:   Mon May 2 17:46:17 2016 -0700
> >>>
> >>>vhost: optimize dequeue for small packets
> >>>
> >>>A virtio driver normally uses at least 2 desc buffers for Tx: the
> >>>first for storing the header, and the others for storing the data.
> >>>
> >>>Therefore, we could fetch the first data desc buf before the main
> >>>loop, and do the copy first before the check of "are we done yet?".
> >>>This could save one check for small packets that just have one data
> >>>desc buffer and need one mbuf to store it.
> >>>
> >>>Signed-off-by: Yuanhan Liu 
> >>>Acked-by: Huawei Xie 
> >>>Tested-by: Rich Lane 
> >>
> >>This fast-paths the 2-descriptors format but it's not active
> >>for indirect descriptors. Is this what you mean?
> >
> >Yes. It's also not active when ANY_LAYOUT is actually turned on.
> >>Should be a simple matter to apply this optimization for indirect.
> >
> >Might be.
> 
> If I understand the code correctly, indirect descs also benefit from this
> optimization, or am I missing something?

Aha..., you are right!

--yliu



Re: [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-10 Thread Wang, Zhihong


> -Original Message-
> From: Yuanhan Liu [mailto:yuanhan@linux.intel.com]
> Sent: Monday, October 10, 2016 11:59 AM
> To: Michael S. Tsirkin <m...@redhat.com>
> Cc: Maxime Coquelin <maxime.coque...@redhat.com>; Stephen Hemminger
> <step...@networkplumber.org>; d...@dpdk.org; qemu-
> de...@nongnu.org; Wang, Zhihong <zhihong.w...@intel.com>
> Subject: Re: [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature
> 
> On Mon, Oct 10, 2016 at 06:46:44AM +0300, Michael S. Tsirkin wrote:
> > On Mon, Oct 10, 2016 at 11:37:44AM +0800, Yuanhan Liu wrote:
> > > On Thu, Sep 29, 2016 at 11:21:48PM +0300, Michael S. Tsirkin wrote:
> > > > On Thu, Sep 29, 2016 at 10:05:22PM +0200, Maxime Coquelin wrote:
> > > > >
> > > > >
> > > > > On 09/29/2016 07:57 PM, Michael S. Tsirkin wrote:
> > > > Yes but two points.
> > > >
> > > > 1. why is this memset expensive?
> > >
> > > I don't have the exact answer, but just some rough thoughts:
> > >
> > > It's an external clib function: there is a call stack and the
> > > IP register will bounce back and forth.
> >
> > for memset 0?  gcc 5.3.1 on fedora happily inlines it.
> 
> Good to know!
> 
> > > overkill to use that for resetting a 14-byte structure.
> > >
> > > Some trick like
> > > *(struct virtio_net_hdr *)hdr = (struct virtio_net_hdr){ 0 };
> > >
> > > Or even
> > > hdr->xxx = 0;
> > > hdr->yyy = 0;
> > >
> > > should behave better.
> > >
> > > There was an example: the vhost enqueue optimization patchset from
> > > Zhihong [0] uses memset, and it introduces more than 15% drop (IIRC)
> > > on my Ivybridge server: it has no such issue on his server though.
> > >
> > > [0]: http://dpdk.org/ml/archives/dev/2016-August/045272.html
> > >
> > >   --yliu
> >
> > I'd say that's weird. what's your config? any chance you
> > are using an old compiler?
> 
> Not really, it's gcc 5.3.1. Maybe Zhihong could explain more. IIRC,
> he said the memset is not well optimized for Ivybridge server.

The dst is remote in that case. It's fine on Haswell but has a complication
on Ivy Bridge which (wasn't supposed to but) causes a serious frontend issue.

I don't think gcc inlined it there. I'm using fc24 gcc 6.1.1.

> 
>   --yliu



Re: [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-10 Thread Maxime Coquelin



On 10/10/2016 06:22 AM, Yuanhan Liu wrote:

On Mon, Oct 10, 2016 at 07:17:06AM +0300, Michael S. Tsirkin wrote:

On Mon, Oct 10, 2016 at 12:05:31PM +0800, Yuanhan Liu wrote:

On Fri, Sep 30, 2016 at 10:16:43PM +0300, Michael S. Tsirkin wrote:

And the same is done in DPDK:

static inline int __attribute__((always_inline))
copy_desc_to_mbuf(struct virtio_net *dev, struct vring_desc *descs,
  uint16_t max_desc, struct rte_mbuf *m, uint16_t desc_idx,
  struct rte_mempool *mbuf_pool)
{
...
/*
 * A virtio driver normally uses at least 2 desc buffers
 * for Tx: the first for storing the header, and others
 * for storing the data.
 */
if (likely((desc->len == dev->vhost_hlen) &&
   (desc->flags & VRING_DESC_F_NEXT) != 0)) {
desc = &descs[desc->next];
if (unlikely(desc->flags & VRING_DESC_F_INDIRECT))
return -1;

desc_addr = gpa_to_vva(dev, desc->addr);
if (unlikely(!desc_addr))
return -1;

rte_prefetch0((void *)(uintptr_t)desc_addr);

desc_offset = 0;
desc_avail  = desc->len;
nr_desc += 1;

PRINT_PACKET(dev, (uintptr_t)desc_addr, desc->len, 0);
} else {
desc_avail  = desc->len - dev->vhost_hlen;
desc_offset = dev->vhost_hlen;
}


Actually, the header is parsed in the DPDK vhost implementation.
But as the Virtio PMD provides a zeroed header, we could just parse
the header only if VIRTIO_NET_F_NO_TX_HEADER is not negotiated.


host can always skip the header parse if it wants to.
It didn't seem worth it to add branches there but
if I'm wrong, by all means code it up.


It was added by the following commit, which yields about a 10% performance
boost for the PVP case (with 64B packet size).

At that time, a packet always used 2 descs. Since indirect desc is
enabled (by default) now, the assumption no longer holds. What's
worse, it might even slow things down a bit. That should also be
part of the reason why performance is slightly worse than before.

--yliu


I'm not sure I get what you are saying


commit 1d41d77cf81c448c1b09e1e859bfd300e2054a98
Author: Yuanhan Liu 
Date:   Mon May 2 17:46:17 2016 -0700

vhost: optimize dequeue for small packets

A virtio driver normally uses at least 2 desc buffers for Tx: the
first for storing the header, and the others for storing the data.

Therefore, we could fetch the first data desc buf before the main
loop, and do the copy first before the check of "are we done yet?".
This could save one check for small packets that just have one data
desc buffer and need one mbuf to store it.

Signed-off-by: Yuanhan Liu 
Acked-by: Huawei Xie 
Tested-by: Rich Lane 


This fast-paths the 2-descriptors format but it's not active
for indirect descriptors. Is this what you mean?


Yes. It's also not active when ANY_LAYOUT is actually turned on.

Should be a simple matter to apply this optimization for indirect.


Might be.


If I understand the code correctly, indirect descs also benefit from 
this optimization, or am I missing something?


Maxime



Re: [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-09 Thread Michael S. Tsirkin
On Mon, Oct 10, 2016 at 04:16:19AM +0000, Wang, Zhihong wrote:
> 
> 
> > -Original Message-
> > From: Yuanhan Liu [mailto:yuanhan@linux.intel.com]
> > Sent: Monday, October 10, 2016 11:59 AM
> > To: Michael S. Tsirkin <m...@redhat.com>
> > Cc: Maxime Coquelin <maxime.coque...@redhat.com>; Stephen Hemminger
> > <step...@networkplumber.org>; d...@dpdk.org; qemu-
> > de...@nongnu.org; Wang, Zhihong <zhihong.w...@intel.com>
> > Subject: Re: [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature
> > 
> > On Mon, Oct 10, 2016 at 06:46:44AM +0300, Michael S. Tsirkin wrote:
> > > On Mon, Oct 10, 2016 at 11:37:44AM +0800, Yuanhan Liu wrote:
> > > > On Thu, Sep 29, 2016 at 11:21:48PM +0300, Michael S. Tsirkin wrote:
> > > > > On Thu, Sep 29, 2016 at 10:05:22PM +0200, Maxime Coquelin wrote:
> > > > > >
> > > > > >
> > > > > > On 09/29/2016 07:57 PM, Michael S. Tsirkin wrote:
> > > > > Yes but two points.
> > > > >
> > > > > 1. why is this memset expensive?
> > > >
> > > > I don't have the exact answer, but just some rough thoughts:
> > > >
> > > > It's an external clib function: there is a call stack and the
> > > > IP register will bounce back and forth.
> > >
> > > for memset 0?  gcc 5.3.1 on fedora happily inlines it.
> > 
> > Good to know!
> > 
> > > > overkill to use that for resetting a 14-byte structure.
> > > >
> > > > Some trick like
> > > > *(struct virtio_net_hdr *)hdr = (struct virtio_net_hdr){ 0 };
> > > >
> > > > Or even
> > > > hdr->xxx = 0;
> > > > hdr->yyy = 0;
> > > >
> > > > should behave better.
> > > >
> > > > There was an example: the vhost enqueue optimization patchset from
> > > > Zhihong [0] uses memset, and it introduces more than 15% drop (IIRC)
> > > > on my Ivybridge server: it has no such issue on his server though.
> > > >
> > > > [0]: http://dpdk.org/ml/archives/dev/2016-August/045272.html
> > > >
> > > > --yliu
> > >
> > > I'd say that's weird. what's your config? any chance you
> > > are using an old compiler?
> > 
> > Not really, it's gcc 5.3.1. Maybe Zhihong could explain more. IIRC,
> > he said the memset is not well optimized for Ivybridge server.
> 
> The dst is remote in that case. It's fine on Haswell but has a complication
> on Ivy Bridge which (wasn't supposed to but) causes a serious frontend issue.
> 
> I don't think gcc inlined it there. I'm using fc24 gcc 6.1.1.


So try something like this then:

Signed-off-by: Michael S. Tsirkin <m...@redhat.com>


diff --git a/drivers/net/virtio/virtio_pci.h b/drivers/net/virtio/virtio_pci.h
index dd7693f..7a3f88e 100644
--- a/drivers/net/virtio/virtio_pci.h
+++ b/drivers/net/virtio/virtio_pci.h
@@ -292,6 +292,16 @@ vtpci_with_feature(struct virtio_hw *hw, uint64_t bit)
return (hw->guest_features & (1ULL << bit)) != 0;
 }
 
+static inline int
+vtnet_hdr_size(struct virtio_hw *hw)
+{
+   if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF) ||
+   vtpci_with_feature(hw, VIRTIO_F_VERSION_1))
+   return sizeof(struct virtio_net_hdr_mrg_rxbuf);
+   else
+   return sizeof(struct virtio_net_hdr);
+}
+
 /*
  * Function declaration from virtio_pci.c
  */
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index a27208e..21a45e1 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -216,7 +216,7 @@ virtqueue_enqueue_xmit(struct virtnet_tx *txvq, struct rte_mbuf *cookie,
struct vring_desc *start_dp;
uint16_t seg_num = cookie->nb_segs;
uint16_t head_idx, idx;
-   uint16_t head_size = vq->hw->vtnet_hdr_size;
+   uint16_t head_size = vtnet_hdr_size(vq->hw);
unsigned long offs;
 
head_idx = vq->vq_desc_head_idx;

Generally pointer chasing in vq->hw->vtnet_hdr_size can't be good
for performance. Move fields used on the data path into vq
and use them from there to avoid indirections?
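
A minimal sketch of what that could look like, assuming a simplified
virtqueue layout (the cached field and setup helper are hypothetical,
not existing DPDK code):

#include <stdint.h>

struct virtio_hw {
        uint64_t guest_features;
        uint16_t vtnet_hdr_size;
        /* ... */
};

struct virtqueue {
        struct virtio_hw *hw;
        /* Hot-path copy of hw->vtnet_hdr_size (hypothetical field). */
        uint16_t vtnet_hdr_size;
        /* ... */
};

/* Done once at queue setup, so virtqueue_enqueue_xmit() reads the
 * header size with a single dereference instead of two. */
static void
virtqueue_cache_hot_fields(struct virtqueue *vq)
{
        vq->vtnet_hdr_size = vq->hw->vtnet_hdr_size;
}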




Re: [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-09 Thread Michael S. Tsirkin
On Mon, Oct 10, 2016 at 12:22:09PM +0800, Yuanhan Liu wrote:
> On Mon, Oct 10, 2016 at 07:17:06AM +0300, Michael S. Tsirkin wrote:
> > On Mon, Oct 10, 2016 at 12:05:31PM +0800, Yuanhan Liu wrote:
> > > On Fri, Sep 30, 2016 at 10:16:43PM +0300, Michael S. Tsirkin wrote:
> > > > > > And the same is done in DPDK:
> > > > > > 
> > > > > > static inline int __attribute__((always_inline))
> > > > > > copy_desc_to_mbuf(struct virtio_net *dev, struct vring_desc *descs,
> > > > > >   uint16_t max_desc, struct rte_mbuf *m, uint16_t desc_idx,
> > > > > >   struct rte_mempool *mbuf_pool)
> > > > > > {
> > > > > > ...
> > > > > > /*
> > > > > >  * A virtio driver normally uses at least 2 desc buffers
> > > > > >  * for Tx: the first for storing the header, and others
> > > > > >  * for storing the data.
> > > > > >  */
> > > > > > if (likely((desc->len == dev->vhost_hlen) &&
> > > > > >(desc->flags & VRING_DESC_F_NEXT) != 0)) {
> > > > > > desc = &descs[desc->next];
> > > > > > if (unlikely(desc->flags & VRING_DESC_F_INDIRECT))
> > > > > > return -1;
> > > > > > 
> > > > > > desc_addr = gpa_to_vva(dev, desc->addr);
> > > > > > if (unlikely(!desc_addr))
> > > > > > return -1;
> > > > > > 
> > > > > > rte_prefetch0((void *)(uintptr_t)desc_addr);
> > > > > > 
> > > > > > desc_offset = 0;
> > > > > > desc_avail  = desc->len;
> > > > > > nr_desc += 1;
> > > > > > 
> > > > > > PRINT_PACKET(dev, (uintptr_t)desc_addr, desc->len, 0);
> > > > > > } else {
> > > > > > desc_avail  = desc->len - dev->vhost_hlen;
> > > > > > desc_offset = dev->vhost_hlen;
> > > > > > }
> > > > > 
> > > > > Actually, the header is parsed in the DPDK vhost implementation.
> > > > > But as the Virtio PMD provides a zeroed header, we could just parse
> > > > > the header only if VIRTIO_NET_F_NO_TX_HEADER is not negotiated.
> > > > 
> > > > host can always skip the header parse if it wants to.
> > > > It didn't seem worth it to add branches there but
> > > > if I'm wrong, by all means code it up.
> > > 
> > > It was added by the following commit, which yields about a 10% performance
> > > boost for the PVP case (with 64B packet size).
> > > 
> > > At that time, a packet always used 2 descs. Since indirect desc is
> > > enabled (by default) now, the assumption no longer holds. What's
> > > worse, it might even slow things down a bit. That should also be
> > > part of the reason why performance is slightly worse than before.
> > > 
> > >   --yliu
> > 
> > I'm not sure I get what you are saying
> > 
> > > commit 1d41d77cf81c448c1b09e1e859bfd300e2054a98
> > > Author: Yuanhan Liu 
> > > Date:   Mon May 2 17:46:17 2016 -0700
> > > 
> > > vhost: optimize dequeue for small packets
> > > 
> > > A virtio driver normally uses at least 2 desc buffers for Tx: the
> > > first for storing the header, and the others for storing the data.
> > > 
> > > Therefore, we could fetch the first data desc buf before the main
> > > loop, and do the copy first before the check of "are we done yet?".
> > > This could save one check for small packets that just have one data
> > > desc buffer and need one mbuf to store it.
> > > 
> > > Signed-off-by: Yuanhan Liu 
> > > Acked-by: Huawei Xie 
> > > Tested-by: Rich Lane 
> > 
> > This fast-paths the 2-descriptors format but it's not active
> > for indirect descriptors. Is this what you mean?
> 
> Yes. It's also not active when ANY_LAYOUT is actually turned on.

It's not needed there though - you only use 1 desc, no point in
fetching the next one.


> > Should be a simple matter to apply this optimization for indirect.
> 
> Might be.
> 
>   --yliu



Re: [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-09 Thread Michael S. Tsirkin
On Mon, Oct 10, 2016 at 04:16:19AM +0000, Wang, Zhihong wrote:
> 
> 
> > -Original Message-
> > From: Yuanhan Liu [mailto:yuanhan@linux.intel.com]
> > Sent: Monday, October 10, 2016 11:59 AM
> > To: Michael S. Tsirkin <m...@redhat.com>
> > Cc: Maxime Coquelin <maxime.coque...@redhat.com>; Stephen Hemminger
> > <step...@networkplumber.org>; d...@dpdk.org; qemu-
> > de...@nongnu.org; Wang, Zhihong <zhihong.w...@intel.com>
> > Subject: Re: [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature
> > 
> > On Mon, Oct 10, 2016 at 06:46:44AM +0300, Michael S. Tsirkin wrote:
> > > On Mon, Oct 10, 2016 at 11:37:44AM +0800, Yuanhan Liu wrote:
> > > > On Thu, Sep 29, 2016 at 11:21:48PM +0300, Michael S. Tsirkin wrote:
> > > > > On Thu, Sep 29, 2016 at 10:05:22PM +0200, Maxime Coquelin wrote:
> > > > > >
> > > > > >
> > > > > > On 09/29/2016 07:57 PM, Michael S. Tsirkin wrote:
> > > > > Yes but two points.
> > > > >
> > > > > 1. why is this memset expensive?
> > > >
> > > > I don't have the exact answer, but just some rough thoughts:
> > > >
> > > > It's an external clib function: there is a call stack and the
> > > > IP register will bounce back and forth.
> > >
> > > for memset 0?  gcc 5.3.1 on fedora happily inlines it.
> > 
> > Good to know!
> > 
> > > > overkill to use that for resetting a 14-byte structure.
> > > >
> > > > Some trick like
> > > > *(struct virtio_net_hdr *)hdr = (struct virtio_net_hdr){ 0 };
> > > >
> > > > Or even
> > > > hdr->xxx = 0;
> > > > hdr->yyy = 0;
> > > >
> > > > should behave better.
> > > >
> > > > There was an example: the vhost enqueue optimization patchset from
> > > > Zhihong [0] uses memset, and it introduces more than 15% drop (IIRC)
> > > > on my Ivybridge server: it has no such issue on his server though.
> > > >
> > > > [0]: http://dpdk.org/ml/archives/dev/2016-August/045272.html
> > > >
> > > > --yliu
> > >
> > > I'd say that's weird. what's your config? any chance you
> > > are using an old compiler?
> > 
> > Not really, it's gcc 5.3.1. Maybe Zhihong could explain more. IIRC,
> > he said the memset is not well optimized for Ivybridge server.
> 
> The dst is remote in that case. It's fine on Haswell but has a complication
> on Ivy Bridge which (wasn't supposed to but) causes a serious frontend issue.
> 
> I don't think gcc inlined it there. I'm using fc24 gcc 6.1.1.

I just wrote some test code and compiled with gcc -O2,
it did get inlined.

It's probably this:
uint16_t head_size = vq->hw->vtnet_hdr_size;
...
memset(hdr, 0, head_size);
IOW, head_size is not known to the compiler.

Try sticking a bool there instead of vtnet_hdr_size, and
 memset(hdr, 0, bigheader ? 12 : 10);
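
The point being that when each arm of the branch passes a compile-time
constant, gcc can expand the memset inline. A sketch of the idea (the
bool name is hypothetical; 10 and 12 are the plain and mergeable virtio
net header sizes):

#include <stdbool.h>
#include <string.h>

static inline void
clear_tx_hdr(void *hdr, bool bigheader)
{
        if (bigheader)
                memset(hdr, 0, 12); /* sizeof(struct virtio_net_hdr_mrg_rxbuf) */
        else
                memset(hdr, 0, 10); /* sizeof(struct virtio_net_hdr) */
}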


> > 
> > --yliu



Re: [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-09 Thread Yuanhan Liu
On Mon, Oct 10, 2016 at 07:17:06AM +0300, Michael S. Tsirkin wrote:
> On Mon, Oct 10, 2016 at 12:05:31PM +0800, Yuanhan Liu wrote:
> > On Fri, Sep 30, 2016 at 10:16:43PM +0300, Michael S. Tsirkin wrote:
> > > > > And the same is done in DPDK:
> > > > > 
> > > > > static inline int __attribute__((always_inline))
> > > > > copy_desc_to_mbuf(struct virtio_net *dev, struct vring_desc *descs,
> > > > >   uint16_t max_desc, struct rte_mbuf *m, uint16_t desc_idx,
> > > > >   struct rte_mempool *mbuf_pool)
> > > > > {
> > > > > ...
> > > > > /*
> > > > >  * A virtio driver normally uses at least 2 desc buffers
> > > > >  * for Tx: the first for storing the header, and others
> > > > >  * for storing the data.
> > > > >  */
> > > > > if (likely((desc->len == dev->vhost_hlen) &&
> > > > >(desc->flags & VRING_DESC_F_NEXT) != 0)) {
> > > > > desc = &descs[desc->next];
> > > > > if (unlikely(desc->flags & VRING_DESC_F_INDIRECT))
> > > > > return -1;
> > > > > 
> > > > > desc_addr = gpa_to_vva(dev, desc->addr);
> > > > > if (unlikely(!desc_addr))
> > > > > return -1;
> > > > > 
> > > > > rte_prefetch0((void *)(uintptr_t)desc_addr);
> > > > > 
> > > > > desc_offset = 0;
> > > > > desc_avail  = desc->len;
> > > > > nr_desc += 1;
> > > > > 
> > > > > PRINT_PACKET(dev, (uintptr_t)desc_addr, desc->len, 0);
> > > > > } else {
> > > > > desc_avail  = desc->len - dev->vhost_hlen;
> > > > > desc_offset = dev->vhost_hlen;
> > > > > }
> > > > 
> > > > Actually, the header is parsed in the DPDK vhost implementation.
> > > > But as the Virtio PMD provides a zeroed header, we could just parse
> > > > the header only if VIRTIO_NET_F_NO_TX_HEADER is not negotiated.
> > > 
> > > host can always skip the header parse if it wants to.
> > > It didn't seem worth it to add branches there but
> > > if I'm wrong, by all means code it up.
> > 
> > It was added by the following commit, which yields about a 10% performance
> > boost for the PVP case (with 64B packet size).
> > 
> > At that time, a packet always used 2 descs. Since indirect desc is
> > enabled (by default) now, the assumption no longer holds. What's
> > worse, it might even slow things down a bit. That should also be
> > part of the reason why performance is slightly worse than before.
> > 
> > --yliu
> 
> I'm not sure I get what you are saying
> 
> > commit 1d41d77cf81c448c1b09e1e859bfd300e2054a98
> > Author: Yuanhan Liu 
> > Date:   Mon May 2 17:46:17 2016 -0700
> > 
> > vhost: optimize dequeue for small packets
> > 
> > A virtio driver normally uses at least 2 desc buffers for Tx: the
> > first for storing the header, and the others for storing the data.
> > 
> > Therefore, we could fetch the first data desc buf before the main
> > loop, and do the copy first before the check of "are we done yet?".
> > This could save one check for small packets that just have one data
> > desc buffer and need one mbuf to store it.
> > 
> > Signed-off-by: Yuanhan Liu 
> > Acked-by: Huawei Xie 
> > Tested-by: Rich Lane 
> 
> This fast-paths the 2-descriptors format but it's not active
> for indirect descriptors. Is this what you mean?

Yes. It's also not active when ANY_LAYOUT is actually turned on.

> Should be a simple matter to apply this optimization for indirect.

Might be.

--yliu



Re: [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-09 Thread Michael S. Tsirkin
On Mon, Oct 10, 2016 at 12:05:31PM +0800, Yuanhan Liu wrote:
> On Fri, Sep 30, 2016 at 10:16:43PM +0300, Michael S. Tsirkin wrote:
> > > > And the same is done in DPDK:
> > > > 
> > > > static inline int __attribute__((always_inline))
> > > > copy_desc_to_mbuf(struct virtio_net *dev, struct vring_desc *descs,
> > > >   uint16_t max_desc, struct rte_mbuf *m, uint16_t desc_idx,
> > > >   struct rte_mempool *mbuf_pool)
> > > > {
> > > > ...
> > > > /*
> > > >  * A virtio driver normally uses at least 2 desc buffers
> > > >  * for Tx: the first for storing the header, and others
> > > >  * for storing the data.
> > > >  */
> > > > if (likely((desc->len == dev->vhost_hlen) &&
> > > >(desc->flags & VRING_DESC_F_NEXT) != 0)) {
> > > > desc = &descs[desc->next];
> > > > if (unlikely(desc->flags & VRING_DESC_F_INDIRECT))
> > > > return -1;
> > > > 
> > > > desc_addr = gpa_to_vva(dev, desc->addr);
> > > > if (unlikely(!desc_addr))
> > > > return -1;
> > > > 
> > > > rte_prefetch0((void *)(uintptr_t)desc_addr);
> > > > 
> > > > desc_offset = 0;
> > > > desc_avail  = desc->len;
> > > > nr_desc += 1;
> > > > 
> > > > PRINT_PACKET(dev, (uintptr_t)desc_addr, desc->len, 0);
> > > > } else {
> > > > desc_avail  = desc->len - dev->vhost_hlen;
> > > > desc_offset = dev->vhost_hlen;
> > > > }
> > > 
> > > Actually, the header is parsed in the DPDK vhost implementation.
> > > But as the Virtio PMD provides a zeroed header, we could just parse
> > > the header only if VIRTIO_NET_F_NO_TX_HEADER is not negotiated.
> > 
> > host can always skip the header parse if it wants to.
> > It didn't seem worth it to add branches there but
> > if I'm wrong, by all means code it up.
> 
> It was added by the following commit, which yields about a 10% performance
> boost for the PVP case (with 64B packet size).
> 
> At that time, a packet always used 2 descs. Since indirect desc is
> enabled (by default) now, the assumption no longer holds. What's
> worse, it might even slow things down a bit. That should also be
> part of the reason why performance is slightly worse than before.
> 
>   --yliu

I'm not sure I get what you are saying

> commit 1d41d77cf81c448c1b09e1e859bfd300e2054a98
> Author: Yuanhan Liu 
> Date:   Mon May 2 17:46:17 2016 -0700
> 
> vhost: optimize dequeue for small packets
> 
> A virtio driver normally uses at least 2 desc buffers for Tx: the
> first for storing the header, and the others for storing the data.
> 
> Therefore, we could fetch the first data desc buf before the main
> loop, and do the copy first before the check of "are we done yet?".
> This could save one check for small packets that just have one data
> desc buffer and need one mbuf to store it.
> 
> Signed-off-by: Yuanhan Liu 
> Acked-by: Huawei Xie 
> Tested-by: Rich Lane 

This fast-paths the 2-descriptors format but it's not active
for indirect descriptors. Is this what you mean?
Should be a simple matter to apply this optimization for indirect.





Re: [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-09 Thread Yuanhan Liu
On Fri, Sep 30, 2016 at 10:16:43PM +0300, Michael S. Tsirkin wrote:
> > > And the same is done in DPDK:
> > > 
> > > static inline int __attribute__((always_inline))
> > > copy_desc_to_mbuf(struct virtio_net *dev, struct vring_desc *descs,
> > >   uint16_t max_desc, struct rte_mbuf *m, uint16_t desc_idx,
> > >   struct rte_mempool *mbuf_pool)
> > > {
> > > ...
> > > /*
> > >  * A virtio driver normally uses at least 2 desc buffers
> > >  * for Tx: the first for storing the header, and others
> > >  * for storing the data.
> > >  */
> > > if (likely((desc->len == dev->vhost_hlen) &&
> > >(desc->flags & VRING_DESC_F_NEXT) != 0)) {
> > > desc = &descs[desc->next];
> > > if (unlikely(desc->flags & VRING_DESC_F_INDIRECT))
> > > return -1;
> > > 
> > > desc_addr = gpa_to_vva(dev, desc->addr);
> > > if (unlikely(!desc_addr))
> > > return -1;
> > > 
> > > rte_prefetch0((void *)(uintptr_t)desc_addr);
> > > 
> > > desc_offset = 0;
> > > desc_avail  = desc->len;
> > > nr_desc += 1;
> > > 
> > > PRINT_PACKET(dev, (uintptr_t)desc_addr, desc->len, 0);
> > > } else {
> > > desc_avail  = desc->len - dev->vhost_hlen;
> > > desc_offset = dev->vhost_hlen;
> > > }
> > 
> > Actually, the header is parsed in the DPDK vhost implementation.
> > But as the Virtio PMD provides a zeroed header, we could just parse
> > the header only if VIRTIO_NET_F_NO_TX_HEADER is not negotiated.
> 
> host can always skip the header parse if it wants to.
> It didn't seem worth it to add branches there but
> if I'm wrong, by all means code it up.

It was added by the following commit, which yields about a 10% performance
boost for the PVP case (with 64B packet size).

At that time, a packet always used 2 descs. Since indirect desc is
enabled (by default) now, the assumption no longer holds. What's
worse, it might even slow things down a bit. That should also be
part of the reason why performance is slightly worse than before.

--yliu

commit 1d41d77cf81c448c1b09e1e859bfd300e2054a98
Author: Yuanhan Liu 
Date:   Mon May 2 17:46:17 2016 -0700

vhost: optimize dequeue for small packets

A virtio driver normally uses at least 2 desc buffers for Tx: the
first for storing the header, and the others for storing the data.

Therefore, we could fetch the first data desc buf before the main
loop, and do the copy first before the check of "are we done yet?".
This could save one check for small packets that just have one data
desc buffer and need one mbuf to store it.

Signed-off-by: Yuanhan Liu 
Acked-by: Huawei Xie 
Tested-by: Rich Lane 




Re: [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-09 Thread Yuanhan Liu
On Thu, Sep 29, 2016 at 10:05:22PM +0200, Maxime Coquelin wrote:
> >so doing this unconditionally would be a spec violation, but if you see
> >value in this, we can add a feature bit.
> Right it would be a spec violation, so it should be done conditionally.
> If a feature bit is to be added, what about VIRTIO_NET_F_NO_TX_HEADER?
> It would imply VIRTIO_NET_F_CSUM not set, and no GSO features set.
> If negotiated, we wouldn't need to prepend a header.

If we could skip the Tx header, I think we could also skip the Rx header,
in the case when mrg-rx is also turned off?

> From the micro-benchmark results, we can expect +10% compared to
> indirect descriptors, and +5% compared to using 2 descs in the
> virtqueue.
> Also, it should have the same benefits as indirect descriptors for 0%
> pkt loss (as we can fill 2x more packets in the virtqueue).
> 
> What do you think?

I would vote for this. It should yield maximum performance for the case
where it's guaranteed that the packet size will always fit in a typical
MTU (1500).

--yliu



Re: [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-09 Thread Michael S. Tsirkin
On Mon, Oct 10, 2016 at 11:37:44AM +0800, Yuanhan Liu wrote:
> On Thu, Sep 29, 2016 at 11:21:48PM +0300, Michael S. Tsirkin wrote:
> > On Thu, Sep 29, 2016 at 10:05:22PM +0200, Maxime Coquelin wrote:
> > > 
> > > 
> > > On 09/29/2016 07:57 PM, Michael S. Tsirkin wrote:
> > Yes but two points.
> > 
> > 1. why is this memset expensive?
> 
> I don't have the exact answer, but just some rough thoughts:
> 
> It's an external clib function: there is a call stack and the
> IP register will bounce back and forth.

for memset 0?  gcc 5.3.1 on fedora happily inlines it.

> BTW, it's kind of
> overkill to use that for resetting a 14-byte structure.
> 
> Some trick like
> *(struct virtio_net_hdr *)hdr = (struct virtio_net_hdr){ 0 };
> 
> Or even 
> hdr->xxx = 0;
> hdr->yyy = 0;
> 
> should behave better.
> 
> There was an example: the vhost enqueue optimization patchset from
> Zhihong [0] uses memset, and it introduces more than 15% drop (IIRC)
> on my Ivybridge server: it has no such issue on his server though.
> 
> [0]: http://dpdk.org/ml/archives/dev/2016-August/045272.html
> 
>   --yliu

I'd say that's weird. what's your config? any chance you
are using an old compiler?


> > Is the test completely skipping looking
> > at the packet otherwise?
> > 
> > 2. As long as we are doing this, see
> > Alignment vs. Networking
> > 
> > in Documentation/unaligned-memory-access.txt
> > 
> > 
> > > From the micro-benchmark results, we can expect +10% compared to
> > > indirect descriptors, and +5% compared to using 2 descs in the
> > > virtqueue.
> > > Also, it should have the same benefits as indirect descriptors for 0%
> > > pkt loss (as we can fill 2x more packets in the virtqueue).
> > > 
> > > What do you think?
> > > 
> > > Thanks,
> > > Maxime



Re: [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-09 Thread Yuanhan Liu
On Mon, Oct 10, 2016 at 06:46:44AM +0300, Michael S. Tsirkin wrote:
> On Mon, Oct 10, 2016 at 11:37:44AM +0800, Yuanhan Liu wrote:
> > On Thu, Sep 29, 2016 at 11:21:48PM +0300, Michael S. Tsirkin wrote:
> > > On Thu, Sep 29, 2016 at 10:05:22PM +0200, Maxime Coquelin wrote:
> > > > 
> > > > 
> > > > On 09/29/2016 07:57 PM, Michael S. Tsirkin wrote:
> > > Yes but two points.
> > > 
> > > 1. why is this memset expensive?
> > 
> > I don't have the exact answer, but just some rough thoughts:
> > 
> > It's an external clib function: there is a call stack and the
> > IP register will bounce back and forth.
> 
> for memset 0?  gcc 5.3.1 on fedora happily inlines it.

Good to know!

> > overkill to use that for resetting a 14-byte structure.
> > 
> > Some trick like
> > *(struct virtio_net_hdr *)hdr = (struct virtio_net_hdr){ 0 };
> > 
> > Or even 
> > hdr->xxx = 0;
> > hdr->yyy = 0;
> > 
> > should behave better.
> > 
> > There was an example: the vhost enqueue optimization patchset from
> > Zhihong [0] uses memset, and it introduces more than 15% drop (IIRC)
> > on my Ivybridge server: it has no such issue on his server though.
> > 
> > [0]: http://dpdk.org/ml/archives/dev/2016-August/045272.html
> > 
> > --yliu
> 
> I'd say that's weird. what's your config? any chance you
> are using an old compiler?

Not really, it's gcc 5.3.1. Maybe Zhihong could explain more. IIRC,
he said the memset is not well optimized for Ivybridge server.

--yliu



Re: [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-09 Thread Yuanhan Liu
On Mon, Oct 10, 2016 at 06:04:32AM +0300, Michael S. Tsirkin wrote:
> > > So I guess at this point, we can teach vhost-user in qemu
> > > that version 1 implies any_layout but only for machine types
> > > qemu 2.8 and up. It sets a bad precedent but oh well.
> > 
> > It should work.
> > 
> > --yliu
> 
> Cool. Want to post a patch?

I could have a try this week (Well, it's very unlikely though).
If not, it will be postponed for a while: I am traveling next week.

--yliu



Re: [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-09 Thread Michael S. Tsirkin
On Mon, Oct 10, 2016 at 11:03:33AM +0800, Yuanhan Liu wrote:
> On Mon, Oct 10, 2016 at 02:20:22AM +0300, Michael S. Tsirkin wrote:
> > On Wed, Sep 28, 2016 at 10:28:48AM +0800, Yuanhan Liu wrote:
> > > On Tue, Sep 27, 2016 at 10:56:40PM +0300, Michael S. Tsirkin wrote:
> > > > On Tue, Sep 27, 2016 at 11:11:58AM +0800, Yuanhan Liu wrote:
> > > > > On Mon, Sep 26, 2016 at 10:24:55PM +0300, Michael S. Tsirkin wrote:
> > > > > > On Mon, Sep 26, 2016 at 11:01:58AM -0700, Stephen Hemminger wrote:
> > > > > > > I assume that if using Version 1 the bit will be ignored
> > > > > 
> > > > > Yes, but I will just quote what you just said: what if the guest
> > > > > virtio device is a legacy device? I also gave my reasons in another
> > > > > email why I consistently set this flag:
> > > > > 
> > > > >   - we have to return all features we support to the guest.
> > > > >   
> > > > > We don't know whether the guest is a modern or legacy device. That means
> > > > > we should claim we support both: VERSION_1 and ANY_LAYOUT.
> > > > >   
> > > > > Assume guest is a legacy device and we just set VERSION_1 (the current
> > > > > case), ANY_LAYOUT will never be negotiated.
> > > > >   
> > > > >   - I'm following the way the Linux kernel takes: it also sets both
> > > > > features.
> > > > >   
> > > > >   Maybe, we could unset ANY_LAYOUT when VERSION_1 is _negotiated_?
> > > > > 
> > > > > The unset after negotiation I proposed turned out it won't work: the
> > > > > feature is already negotiated; unsetting it only in vhost side doesn't
> > > > > change anything. Besides, it may break the migration as Michael stated
> > > > > below.
> > > > 
> > > > I think the reverse. Teach vhost user that for future machine types
> > > > only VERSION_1 implies ANY_LAYOUT.
> > 
> > So I guess at this point, we can teach vhost-user in qemu
> > that version 1 implies any_layout but only for machine types
> > qemu 2.8 and up. It sets a bad precedent but oh well.
> 
> It should work.
> 
>   --yliu

Cool. Want to post a patch?

-- 
MST



Re: [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-09 Thread Yuanhan Liu
On Mon, Oct 10, 2016 at 02:20:22AM +0300, Michael S. Tsirkin wrote:
> On Wed, Sep 28, 2016 at 10:28:48AM +0800, Yuanhan Liu wrote:
> > On Tue, Sep 27, 2016 at 10:56:40PM +0300, Michael S. Tsirkin wrote:
> > > On Tue, Sep 27, 2016 at 11:11:58AM +0800, Yuanhan Liu wrote:
> > > > On Mon, Sep 26, 2016 at 10:24:55PM +0300, Michael S. Tsirkin wrote:
> > > > > On Mon, Sep 26, 2016 at 11:01:58AM -0700, Stephen Hemminger wrote:
> > > > > > I assume that if using Version 1 the bit will be ignored
> > > > 
> > > > Yes, but I will just quote what you just said: what if the guest
> > > > virtio device is a legacy device? I also gave my reasons in another
> > > > email why I consistently set this flag:
> > > > 
> > > >   - we have to return all features we support to the guest.
> > > >   
> > > > We don't know whether the guest is a modern or legacy device. That means
> > > > we should claim we support both: VERSION_1 and ANY_LAYOUT.
> > > >   
> > > > Assume guest is a legacy device and we just set VERSION_1 (the current
> > > > case), ANY_LAYOUT will never be negotiated.
> > > >   
> > > >   - I'm following the way the Linux kernel takes: it also sets both features.
> > > >   
> > > >   Maybe, we could unset ANY_LAYOUT when VERSION_1 is _negotiated_?
> > > > 
> > > > The unset after negotiation I proposed turned out it won't work: the
> > > > feature is already negotiated; unsetting it only in vhost side doesn't
> > > > change anything. Besides, it may break the migration as Michael stated
> > > > below.
> > > 
> > > I think the reverse. Teach vhost user that for future machine types
> > > only VERSION_1 implies ANY_LAYOUT.
> 
> So I guess at this point, we can teach vhost-user in qemu
> that version 1 implies any_layout but only for machine types
> qemu 2.8 and up. It sets a bad precedent but oh well.

It should work.

--yliu



Re: [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-09 Thread Yuanhan Liu
On Thu, Sep 29, 2016 at 11:21:48PM +0300, Michael S. Tsirkin wrote:
> On Thu, Sep 29, 2016 at 10:05:22PM +0200, Maxime Coquelin wrote:
> > 
> > 
> > On 09/29/2016 07:57 PM, Michael S. Tsirkin wrote:
> Yes but two points.
> 
> 1. why is this memset expensive?

I don't have the exact answer, but just some rough thoughts:

It's an external clib function: there is a call stack and the
IP register will bounce back and forth. BTW, it's kind of
overkill to use that for resetting a 14-byte structure.

Some trick like
*(struct virtio_net_hdr *)hdr = (struct virtio_net_hdr){ 0 };

Or even 
hdr->xxx = 0;
hdr->yyy = 0;

should behave better.
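
For illustration, a compilable form of those two tricks could look like
this (a sketch only; the struct layout is the non-mergeable header from
the virtio spec):

#include <stdint.h>

struct virtio_net_hdr {
        uint8_t  flags;
        uint8_t  gso_type;
        uint16_t hdr_len;
        uint16_t gso_size;
        uint16_t csum_start;
        uint16_t csum_offset;
};

/* Trick 1: whole-struct assignment from a compound literal; the
 * compiler emits a few plain stores, no libc call. */
static inline void
clear_hdr(struct virtio_net_hdr *hdr)
{
        *hdr = (struct virtio_net_hdr){ 0 };
}

/* Trick 2: zero the fields explicitly. */
static inline void
clear_hdr_fields(struct virtio_net_hdr *hdr)
{
        hdr->flags = 0;
        hdr->gso_type = 0;
        hdr->hdr_len = 0;
        hdr->gso_size = 0;
        hdr->csum_start = 0;
        hdr->csum_offset = 0;
}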

There was an example: the vhost enqueue optimization patchset from
Zhihong [0] uses memset, and it introduces more than 15% drop (IIRC)
on my Ivybridge server: it has no such issue on his server though.

[0]: http://dpdk.org/ml/archives/dev/2016-August/045272.html

--yliu

> Is the test completely skipping looking
> at the packet otherwise?
> 
> 2. As long as we are doing this, see
>   Alignment vs. Networking
>   
> in Documentation/unaligned-memory-access.txt
> 
> 
> > From the micro-benchmark results, we can expect +10% compared to
> > indirect descriptors, and +5% compared to using 2 descs in the
> > virtqueue.
> > Also, it should have the same benefits as indirect descriptors for 0%
> > pkt loss (as we can fill 2x more packets in the virtqueue).
> > 
> > What do you think?
> > 
> > Thanks,
> > Maxime



Re: [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-09 Thread Michael S. Tsirkin
On Wed, Sep 28, 2016 at 10:28:48AM +0800, Yuanhan Liu wrote:
> On Tue, Sep 27, 2016 at 10:56:40PM +0300, Michael S. Tsirkin wrote:
> > On Tue, Sep 27, 2016 at 11:11:58AM +0800, Yuanhan Liu wrote:
> > > On Mon, Sep 26, 2016 at 10:24:55PM +0300, Michael S. Tsirkin wrote:
> > > > On Mon, Sep 26, 2016 at 11:01:58AM -0700, Stephen Hemminger wrote:
> > > > > I assume that if using Version 1 the bit will be ignored
> > > 
> > > Yes, but I will just quote what you just said: what if the guest
> > > virtio device is a legacy device? I also gave my reasons in another
> > > email why I consistently set this flag:
> > > 
> > >   - we have to return all features we support to the guest.
> > >   
> > > We don't know whether the guest is a modern or legacy device. That means
> > > we should claim we support both: VERSION_1 and ANY_LAYOUT.
> > >   
> > > Assume guest is a legacy device and we just set VERSION_1 (the current
> > > case), ANY_LAYOUT will never be negotiated.
> > >   
> > >   - I'm following the way the Linux kernel takes: it also sets both features.
> > >   
> > >   Maybe, we could unset ANY_LAYOUT when VERSION_1 is _negotiated_?
> > > 
> > > The unset after negotiation I proposed turned out it won't work: the
> > > feature is already negotiated; unsetting it only in vhost side doesn't
> > > change anything. Besides, it may break the migration as Michael stated
> > > below.
> > 
> > I think the reverse. Teach vhost user that for future machine types
> > only VERSION_1 implies ANY_LAYOUT.

So I guess at this point, we can teach vhost-user in qemu
that version 1 implies any_layout but only for machine types
qemu 2.8 and up. It sets a bad precedent but oh well.
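
In code, that teaching could amount to something like the sketch below
(names hypothetical; a real patch would hook into QEMU's machine-type
compat-property machinery rather than take a bool):

#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_F_ANY_LAYOUT 27
#define VIRTIO_F_VERSION_1  32

/* For 2.8+ machine types only: let VERSION_1 imply ANY_LAYOUT in the
 * feature set offered to the vhost-user backend. */
static uint64_t
vhost_user_adjust_features(uint64_t features, bool modern_machine)
{
        if (modern_machine && (features & (1ULL << VIRTIO_F_VERSION_1)))
                features |= 1ULL << VIRTIO_F_ANY_LAYOUT;
        return features;
}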

-- 
MST



Re: [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-03 Thread Maxime Coquelin



On 09/29/2016 10:21 PM, Michael S. Tsirkin wrote:

On Thu, Sep 29, 2016 at 10:05:22PM +0200, Maxime Coquelin wrote:

>
>
> On 09/29/2016 07:57 PM, Michael S. Tsirkin wrote:

> > On Thu, Sep 29, 2016 at 05:30:53PM +0200, Maxime Coquelin wrote:

> ...

> > >
> > > Before enabling anything by default, we should first optimize the 1 slot
> > > case. Indeed, micro-benchmark using testpmd in txonly[0] shows ~17%
> > > perf regression for the 64-byte case:
> > >  - 2 descs per packet: 11.6Mpps
> > >  - 1 desc per packet: 9.6Mpps
> > >
> > > This is due to the virtio header clearing in virtqueue_enqueue_xmit().
> > > Removing it, we get better results than with 2 descs (12.0Mpps).
> > > Since the Virtio PMD doesn't support offloads, I wonder whether we can
> > > just drop the memset?

> >
> > What will happen? Will the header be uninitialized?

> Yes..
> I didn't look closely at the spec, but just looked at DPDK's and Linux
> vhost implementations. IIUC, the header is just skipped in the two
> implementations.

In a Linux guest, skbs are initialized AFAIK. See virtio_net_hdr_from_skb:
the first thing it does is
memset(hdr, 0, sizeof(*hdr));




> >
> > The spec says:
> >   The driver can send a completely checksummed packet. In this case, flags
> >   will be zero, and gso_type
> >   will be VIRTIO_NET_HDR_GSO_NONE.
> >
> > and
> >   The driver MUST set num_buffers to zero.
> >   If VIRTIO_NET_F_CSUM is not negotiated, the driver MUST set flags to
> >   zero and SHOULD supply a fully
> >   checksummed packet to the device.
> >
> > and
> >   If none of the VIRTIO_NET_F_HOST_TSO4, TSO6 or UFO options have been
> >   negotiated, the driver MUST
> >   set gso_type to VIRTIO_NET_HDR_GSO_NONE.
> >
> > so doing this unconditionally would be a spec violation, but if you see
> > value in this, we can add a feature bit.

> Right it would be a spec violation, so it should be done conditionally.
> If a feature bit is to be added, what about VIRTIO_NET_F_NO_TX_HEADER?
> It would imply VIRTIO_NET_F_CSUM not set, and no GSO features set.
> If negotiated, we wouldn't need to prepend a header.

Yes but two points.

1. why is this memset expensive? Is the test completely skipping looking
   at the packet otherwise?

2. As long as we are doing this, see
Alignment vs. Networking

in Documentation/unaligned-memory-access.txt


This change will not have an impact on the IP header alignment,
as its offset in the mbuf will not change.

Regards,
Maxime



Re: [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-09-30 Thread Michael S. Tsirkin
On Fri, Sep 30, 2016 at 02:05:10PM +0200, Maxime Coquelin wrote:
> 
> 
> On 09/29/2016 11:23 PM, Maxime Coquelin wrote:
> > 
> > 
> > On 09/29/2016 10:21 PM, Michael S. Tsirkin wrote:
> > > On Thu, Sep 29, 2016 at 10:05:22PM +0200, Maxime Coquelin wrote:
> > > > 
> > > > 
> > > > On 09/29/2016 07:57 PM, Michael S. Tsirkin wrote:
> > > > > On Thu, Sep 29, 2016 at 05:30:53PM +0200, Maxime Coquelin wrote:
> > > > ...
> > > > > > 
> > > > > > Before enabling anything by default, we should first optimize the 1
> > > > > > slot case. Indeed, micro-benchmark using testpmd in txonly[0] shows ~17%
> > > > > > perf regression for the 64-byte case:
> > > > > >  - 2 descs per packet: 11.6Mpps
> > > > > >  - 1 desc per packet: 9.6Mpps
> > > > > > 
> > > > > > This is due to the virtio header clearing in 
> > > > > > virtqueue_enqueue_xmit().
> > > > > > Removing it, we get better results than with 2 descs (12.0Mpps).
> > > > > > Since the Virtio PMD doesn't support offloads, I wonder whether we 
> > > > > > can
> > > > > > just drop the memset?
> > > > > 
> > > > > What will happen? Will the header be uninitialized?
> > > > Yes..
> > > > I didn't look closely at the spec, but just looked at DPDK's and Linux
> > > > vhost implementations. IIUC, the header is just skipped in the two
> > > > implementations.
> > > 
> > > In a Linux guest, skbs are initialized AFAIK. See virtio_net_hdr_from_skb:
> > > the first thing it does is
> > > memset(hdr, 0, sizeof(*hdr));
> > 
> > I meant that in the vhost-net Linux implementation, the header is just skipped:
> > 
> > static void handle_tx(struct vhost_net *net)
> > {
> > ...
> > /* Skip header. TODO: support TSO. */
> > len = iov_length(vq->iov, out);
> > iov_iter_init(&msg.msg_iter, WRITE, vq->iov, out, len);
> > iov_iter_advance(&msg.msg_iter, hdr_size);
> > 
> > And the same is done in DPDK:
> > 
> > static inline int __attribute__((always_inline))
> > copy_desc_to_mbuf(struct virtio_net *dev, struct vring_desc *descs,
> >   uint16_t max_desc, struct rte_mbuf *m, uint16_t desc_idx,
> >   struct rte_mempool *mbuf_pool)
> > {
> > ...
> > /*
> >  * A virtio driver normally uses at least 2 desc buffers
> >  * for Tx: the first for storing the header, and others
> >  * for storing the data.
> >  */
> > if (likely((desc->len == dev->vhost_hlen) &&
> >(desc->flags & VRING_DESC_F_NEXT) != 0)) {
> > desc = &descs[desc->next];
> > if (unlikely(desc->flags & VRING_DESC_F_INDIRECT))
> > return -1;
> > 
> > desc_addr = gpa_to_vva(dev, desc->addr);
> > if (unlikely(!desc_addr))
> > return -1;
> > 
> > rte_prefetch0((void *)(uintptr_t)desc_addr);
> > 
> > desc_offset = 0;
> > desc_avail  = desc->len;
> > nr_desc += 1;
> > 
> > PRINT_PACKET(dev, (uintptr_t)desc_addr, desc->len, 0);
> > } else {
> > desc_avail  = desc->len - dev->vhost_hlen;
> > desc_offset = dev->vhost_hlen;
> > }
> 
> Actually, the header is parsed in the DPDK vhost implementation.
> But as the Virtio PMD provides a zeroed header, we could just parse
> the header only if VIRTIO_NET_F_NO_TX_HEADER is not negotiated.

host can always skip the header parse if it wants to.
It didn't seem worth it to add branches there but
if I'm wrong, by all means code it up.
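
Coded up, the branch might look like the fragment below, in the style
of the copy_desc_to_mbuf() snippet quoted above (a sketch only:
VIRTIO_NET_F_NO_TX_HEADER is the feature bit proposed earlier in this
thread, not an existing one, and vhost_dequeue_offload() stands in for
the existing header-parsing code):

/* Skip the Tx header parse entirely when the guest has promised,
 * via the proposed NO_TX_HEADER feature, that flags and gso_type
 * are always zero. */
if (!(dev->features & (1ULL << VIRTIO_NET_F_NO_TX_HEADER)))
        vhost_dequeue_offload(hdr, m);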


> > > 
> > > 
> > > 
> > > > > 
> > > > > The spec says:
> > > > > The driver can send a completely checksummed packet. In this
> > > > > case, flags
> > > > > will be zero, and gso_type
> > > > > will be VIRTIO_NET_HDR_GSO_NONE.
> > > > > 
> > > > > and
> > > > > The driver MUST set num_buffers to zero.
> > > > > If VIRTIO_NET_F_CSUM is not negotiated, the driver MUST set
> > > > > flags to
> > > > > zero and SHOULD supply a fully
> > > > > checksummed packet to the device.
> > > > > 
> > > > > and
> > > > > If none of the VIRTIO_NET_F_HOST_TSO4, TSO6 or UFO options have
> > > > > been
> > > > > negotiated, the driver MUST
> > > > > set gso_type to VIRTIO_NET_HDR_GSO_NONE.
> > > > > 
> > > > > so doing this unconditionally would be a spec violation, but if you 
> > > > > see
> > > > > value in this, we can add a feature bit.
> > > > Right, it would be a spec violation, so it should be done conditionally.
> > > > If a feature bit is to be added, what about VIRTIO_NET_F_NO_TX_HEADER?
> > > > It would imply VIRTIO_NET_F_CSUM not set, and no GSO features set.
> > > > If negotiated, we wouldn't need to prepend a header.
> > > 
> > > Yes, but two points.
> > > 
> > > 1. why is this memset expensive? Is the test completely skipping looking
> > >at the packet otherwise?
> > Yes.
> > > 
> > > 2. As long as we are doing this, see
> > > Alignment vs. Networking
> > > 
> > > in Documentation/unaligned-memory-access.txt
> > Thanks, I'll have a look tomorrow.
> 
> I did a rough prototype which removes Tx headers unconditionally, 

Re: [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-09-30 Thread Maxime Coquelin



On 09/29/2016 11:23 PM, Maxime Coquelin wrote:



On 09/29/2016 10:21 PM, Michael S. Tsirkin wrote:

On Thu, Sep 29, 2016 at 10:05:22PM +0200, Maxime Coquelin wrote:



On 09/29/2016 07:57 PM, Michael S. Tsirkin wrote:

On Thu, Sep 29, 2016 at 05:30:53PM +0200, Maxime Coquelin wrote:

...


Before enabling anything by default, we should first optimize the 1 slot
case. Indeed, a micro-benchmark using testpmd in txonly[0] shows ~17%
perf regression for the 64-byte case:
 - 2 descs per packet: 11.6Mpps
 - 1 desc per packet: 9.6Mpps

This is due to the virtio header clearing in virtqueue_enqueue_xmit().
Removing it, we get better results than with 2 descs (12.0Mpps).
Since the Virtio PMD doesn't support offloads, I wonder whether we can
just drop the memset?


What will happen? Will the header be uninitialized?

Yes..
I didn't look closely at the spec, but just looked at DPDK's and Linux
vhost implementations. IIUC, the header is just skipped in the two
implementations.


In a Linux guest, skbs are initialized AFAIK. See virtio_net_hdr_from_skb:
the first thing it does is
memset(hdr, 0, sizeof(*hdr));


I meant in the Linux vhost-net implementation, the header is just skipped:

static void handle_tx(struct vhost_net *net)
{
...
/* Skip header. TODO: support TSO. */
len = iov_length(vq->iov, out);
iov_iter_init(&msg.msg_iter, WRITE, vq->iov, out, len);
iov_iter_advance(&msg.msg_iter, hdr_size);

And the same is done in DPDK:

static inline int __attribute__((always_inline))
copy_desc_to_mbuf(struct virtio_net *dev, struct vring_desc *descs,
  uint16_t max_desc, struct rte_mbuf *m, uint16_t desc_idx,
  struct rte_mempool *mbuf_pool)
{
...
/*
 * A virtio driver normally uses at least 2 desc buffers
 * for Tx: the first for storing the header, and others
 * for storing the data.
 */
if (likely((desc->len == dev->vhost_hlen) &&
           (desc->flags & VRING_DESC_F_NEXT) != 0)) {
        desc = &descs[desc->next];
        if (unlikely(desc->flags & VRING_DESC_F_INDIRECT))
                return -1;

        desc_addr = gpa_to_vva(dev, desc->addr);
        if (unlikely(!desc_addr))
                return -1;

        rte_prefetch0((void *)(uintptr_t)desc_addr);

        desc_offset = 0;
        desc_avail  = desc->len;
        nr_desc    += 1;

        PRINT_PACKET(dev, (uintptr_t)desc_addr, desc->len, 0);
} else {
        desc_avail  = desc->len - dev->vhost_hlen;
        desc_offset = dev->vhost_hlen;
}


Actually, the header is parsed in the DPDK vhost implementation.
But as the Virtio PMD provides a zeroed header, we could just parse
the header only if VIRTIO_NET_F_NO_TX_HEADER is not negotiated.







The spec says:
The driver can send a completely checksummed packet. In this
case, flags
will be zero, and gso_type
will be VIRTIO_NET_HDR_GSO_NONE.

and
The driver MUST set num_buffers to zero.
If VIRTIO_NET_F_CSUM is not negotiated, the driver MUST set
flags to
zero and SHOULD supply a fully
checksummed packet to the device.

and
If none of the VIRTIO_NET_F_HOST_TSO4, TSO6 or UFO options have
been
negotiated, the driver MUST
set gso_type to VIRTIO_NET_HDR_GSO_NONE.

so doing this unconditionally would be a spec violation, but if you see
value in this, we can add a feature bit.

Right, it would be a spec violation, so it should be done conditionally.
If a feature bit is to be added, what about VIRTIO_NET_F_NO_TX_HEADER?
It would imply VIRTIO_NET_F_CSUM not set, and no GSO features set.
If negotiated, we wouldn't need to prepend a header.


Yes, but two points.

1. why is this memset expensive? Is the test completely skipping looking
   at the packet otherwise?

Yes.


2. As long as we are doing this, see
Alignment vs. Networking

in Documentation/unaligned-memory-access.txt

Thanks, I'll have a look tomorrow.


I did a rough prototype which removes Tx headers unconditionally, to
see what gain we could expect. I expect the results to be a little lower
with no headers in a full implementation, as some more checks will have
to be done.

For the PVP use case with 0.05% acceptable packet loss:
 - Original (with headers): 9.43Mpps
 - Indirect descs: 9.36 Mpps
 - Prototype (no headers): 10.65Mpps

For the PVP use case with 0% acceptable packet loss:
 - Original (with headers): 5.23Mpps
 - Indirect descs: 7.13 Mpps
 - Prototype (no headers): 7.92Mpps
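
For reference, the relative gains these figures imply (plain ratios,
rounded): at 0.05% loss, 10.65/9.43 is +12.9% over the original and
10.65/9.36 is +13.8% over indirect descs; at 0% loss, 7.92/5.23 is
+51.4% over the original and 7.92/7.13 is +11.1% over indirect descs.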

Maxime



Re: [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-09-29 Thread Maxime Coquelin



On 09/29/2016 10:21 PM, Michael S. Tsirkin wrote:

On Thu, Sep 29, 2016 at 10:05:22PM +0200, Maxime Coquelin wrote:



On 09/29/2016 07:57 PM, Michael S. Tsirkin wrote:

On Thu, Sep 29, 2016 at 05:30:53PM +0200, Maxime Coquelin wrote:

...


Before enabling anything by default, we should first optimize the 1 slot
case. Indeed, a micro-benchmark using testpmd in txonly[0] shows ~17%
perf regression for the 64-byte case:
 - 2 descs per packet: 11.6Mpps
 - 1 desc per packet: 9.6Mpps

This is due to the virtio header clearing in virtqueue_enqueue_xmit().
Removing it, we get better results than with 2 descs (12.0Mpps).
Since the Virtio PMD doesn't support offloads, I wonder whether we can
just drop the memset?


What will happen? Will the header be uninitialized?

Yes..
I didn't look closely at the spec, but just looked at DPDK's and Linux
vhost implementations. IIUC, the header is just skipped in the two
implementations.


In a Linux guest, skbs are initialized AFAIK. See virtio_net_hdr_from_skb:
the first thing it does is
memset(hdr, 0, sizeof(*hdr));


I meant in the Linux vhost-net implementation, the header is just skipped:

static void handle_tx(struct vhost_net *net)
{
...
/* Skip header. TODO: support TSO. */
len = iov_length(vq->iov, out);
iov_iter_init(&msg.msg_iter, WRITE, vq->iov, out, len);
iov_iter_advance(&msg.msg_iter, hdr_size);

And the same is done in DPDK:

static inline int __attribute__((always_inline))
copy_desc_to_mbuf(struct virtio_net *dev, struct vring_desc *descs,
  uint16_t max_desc, struct rte_mbuf *m, uint16_t desc_idx,
  struct rte_mempool *mbuf_pool)
{
...
/*
 * A virtio driver normally uses at least 2 desc buffers
 * for Tx: the first for storing the header, and others
 * for storing the data.
 */
if (likely((desc->len == dev->vhost_hlen) &&
           (desc->flags & VRING_DESC_F_NEXT) != 0)) {
        desc = &descs[desc->next];
        if (unlikely(desc->flags & VRING_DESC_F_INDIRECT))
                return -1;

        desc_addr = gpa_to_vva(dev, desc->addr);
        if (unlikely(!desc_addr))
                return -1;

        rte_prefetch0((void *)(uintptr_t)desc_addr);

        desc_offset = 0;
        desc_avail  = desc->len;
        nr_desc    += 1;

        PRINT_PACKET(dev, (uintptr_t)desc_addr, desc->len, 0);
} else {
        desc_avail  = desc->len - dev->vhost_hlen;
        desc_offset = dev->vhost_hlen;
}






The spec says:
The driver can send a completely checksummed packet. In this case, flags
will be zero, and gso_type
will be VIRTIO_NET_HDR_GSO_NONE.

and
The driver MUST set num_buffers to zero.
If VIRTIO_NET_F_CSUM is not negotiated, the driver MUST set flags to
zero and SHOULD supply a fully
checksummed packet to the device.

and
If none of the VIRTIO_NET_F_HOST_TSO4, TSO6 or UFO options have been
negotiated, the driver MUST
set gso_type to VIRTIO_NET_HDR_GSO_NONE.

so doing this unconditionally would be a spec violation, but if you see
value in this, we can add a feature bit.

Right, it would be a spec violation, so it should be done conditionally.
If a feature bit is to be added, what about VIRTIO_NET_F_NO_TX_HEADER?
It would imply VIRTIO_NET_F_CSUM not set, and no GSO features set.
If negotiated, we wouldn't need to prepend a header.


Yes, but two points.

1. why is this memset expensive? Is the test completely skipping looking
   at the packet otherwise?

Yes.


2. As long as we are doing this, see
Alignment vs. Networking

in Documentation/unaligned-memory-access.txt

Thanks, I'll have a look tomorrow.

Maxime




Re: [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-09-29 Thread Michael S. Tsirkin
On Thu, Sep 29, 2016 at 10:05:22PM +0200, Maxime Coquelin wrote:
> 
> 
> On 09/29/2016 07:57 PM, Michael S. Tsirkin wrote:
> > On Thu, Sep 29, 2016 at 05:30:53PM +0200, Maxime Coquelin wrote:
> ...
> > > 
> > > Before enabling anything by default, we should first optimize the 1 slot
> > > case. Indeed, a micro-benchmark using testpmd in txonly[0] shows ~17%
> > > perf regression for the 64-byte case:
> > >  - 2 descs per packet: 11.6Mpps
> > >  - 1 desc per packet: 9.6Mpps
> > > 
> > > This is due to the virtio header clearing in virtqueue_enqueue_xmit().
> > > Removing it, we get better results than with 2 descs (12.0Mpps).
> > > Since the Virtio PMD doesn't support offloads, I wonder whether we can
> > > just drop the memset?
> > 
> > What will happen? Will the header be uninitialized?
> Yes..
> I didn't look closely at the spec, but just looked at DPDK's and Linux
> vhost implementations. IIUC, the header is just skipped in the two
> implementations.

In a Linux guest, skbs are initialized AFAIK. See virtio_net_hdr_from_skb:
the first thing it does is
memset(hdr, 0, sizeof(*hdr));



> > 
> > The spec says:
> > The driver can send a completely checksummed packet. In this case, flags
> > will be zero, and gso_type
> > will be VIRTIO_NET_HDR_GSO_NONE.
> > 
> > and
> > The driver MUST set num_buffers to zero.
> > If VIRTIO_NET_F_CSUM is not negotiated, the driver MUST set flags to
> > zero and SHOULD supply a fully
> > checksummed packet to the device.
> > 
> > and
> > If none of the VIRTIO_NET_F_HOST_TSO4, TSO6 or UFO options have been
> > negotiated, the driver MUST
> > set gso_type to VIRTIO_NET_HDR_GSO_NONE.
> > 
> > so doing this unconditionally would be a spec violation, but if you see
> > value in this, we can add a feature bit.
> Right, it would be a spec violation, so it should be done conditionally.
> If a feature bit is to be added, what about VIRTIO_NET_F_NO_TX_HEADER?
> It would imply VIRTIO_NET_F_CSUM not set, and no GSO features set.
> If negotiated, we wouldn't need to prepend a header.

Yes, but two points.

1. why is this memset expensive? Is the test completely skipping looking
   at the packet otherwise?

2. As long as we are doing this, see
Alignment vs. Networking

in Documentation/unaligned-memory-access.txt


> From the micro-benchmark results, we can expect +10% compared to
> indirect descriptors, and +5% compared to using 2 descs in the
> virtqueue.
> Also, it should have the same benefits as indirect descriptors for 0%
> pkt loss (as we can fill 2x more packets in the virtqueue).
> 
> What do you think?
> 
> Thanks,
> Maxime
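
On point 1: a cheap alternative to a library memset() for such a small
header is explicit field stores. A minimal sketch, assuming the standard
struct virtio_net_hdr layout (the helper name is made up):

static inline void
virtio_clear_net_hdr(struct virtio_net_hdr *hdr)
{
        /* Equivalent to memset(hdr, 0, sizeof(*hdr)) for the standard
         * 10-byte header, but compiles down to a few direct stores. */
        hdr->flags       = 0;
        hdr->gso_type    = VIRTIO_NET_HDR_GSO_NONE;  /* == 0 */
        hdr->hdr_len     = 0;
        hdr->gso_size    = 0;
        hdr->csum_start  = 0;
        hdr->csum_offset = 0;
}

A reasonably recent gcc should generate comparable code for an inlined
constant-size memset, so the difference only matters where the call is
not inlined.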



Re: [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-09-29 Thread Maxime Coquelin



On 09/29/2016 07:57 PM, Michael S. Tsirkin wrote:

On Thu, Sep 29, 2016 at 05:30:53PM +0200, Maxime Coquelin wrote:

...


Before enabling anything by default, we should first optimize the 1 slot
case. Indeed, a micro-benchmark using testpmd in txonly[0] shows ~17%
perf regression for the 64-byte case:
 - 2 descs per packet: 11.6Mpps
 - 1 desc per packet: 9.6Mpps

This is due to the virtio header clearing in virtqueue_enqueue_xmit().
Removing it, we get better results than with 2 descs (12.0Mpps).
Since the Virtio PMD doesn't support offloads, I wonder whether we can
just drop the memset?


What will happen? Will the header be uninitialized?

Yes..
I didn't look closely at the spec, but just looked at DPDK's and Linux
vhost implementations. IIUC, the header is just skipped in the two
implementations.


The spec says:
The driver can send a completely checksummed packet. In this case, flags
will be zero, and gso_type
will be VIRTIO_NET_HDR_GSO_NONE.

and
The driver MUST set num_buffers to zero.
If VIRTIO_NET_F_CSUM is not negotiated, the driver MUST set flags to
zero and SHOULD supply a fully
checksummed packet to the device.

and
If none of the VIRTIO_NET_F_HOST_TSO4, TSO6 or UFO options have been
negotiated, the driver MUST
set gso_type to VIRTIO_NET_HDR_GSO_NONE.

so doing this unconditionally would be a spec violation, but if you see
value in this, we can add a feature bit.

Right, it would be a spec violation, so it should be done conditionally.
If a feature bit is to be added, what about VIRTIO_NET_F_NO_TX_HEADER?
It would imply VIRTIO_NET_F_CSUM not set, and no GSO features set.
If negotiated, we wouldn't need to prepend a header.

From the micro-benchmark results, we can expect +10% compared to
indirect descriptors, and +5% compared to using 2 descs in the
virtqueue.
Also, it should have the same benefits as indirect descriptors for 0%
pkt loss (as we can fill 2x more packets in the virtqueue).

What do you think?

Thanks,
Maxime
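
To make the proposal concrete, the guest Tx path could branch roughly as
below. Hypothetical sketch only: VIRTIO_NET_F_NO_TX_HEADER does not exist
in the spec, and while vtpci_with_feature(), start_dp and cookie follow
the virtio PMD's naming, the snippet is illustrative rather than real
code.

if (vtpci_with_feature(vq->hw, VIRTIO_NET_F_NO_TX_HEADER)) {
        /* No header to prepend: a single desc points straight at
         * the packet data. */
        start_dp[idx].addr  = rte_mbuf_data_dma_addr(cookie);
        start_dp[idx].len   = cookie->data_len;
        start_dp[idx].flags = 0;
} else {
        /* Current path: the first desc (or the mbuf headroom) carries
         * a zeroed struct virtio_net_hdr, and the data follows. */
}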



Re: [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-09-29 Thread Michael S. Tsirkin
On Thu, Sep 29, 2016 at 05:30:53PM +0200, Maxime Coquelin wrote:
> 
> 
> On 09/28/2016 04:28 AM, Yuanhan Liu wrote:
> > On Tue, Sep 27, 2016 at 10:56:40PM +0300, Michael S. Tsirkin wrote:
> > > On Tue, Sep 27, 2016 at 11:11:58AM +0800, Yuanhan Liu wrote:
> > > > On Mon, Sep 26, 2016 at 10:24:55PM +0300, Michael S. Tsirkin wrote:
> > > > > On Mon, Sep 26, 2016 at 11:01:58AM -0700, Stephen Hemminger wrote:
> > > > > > > I assume that if using Version 1 the bit will be ignored
> > > > 
> > > > Yes, but I will just quote what you just said: what if the guest
> > > > virtio device is a legacy device? I also gave my reasons in another
> > > > email why I consistently set this flag:
> > > > 
> > > >   - we have to return all features we support to the guest.
> > > > 
> > > > We don't know whether the guest is a modern or legacy device. That
> > > > means we should claim we support both: VERSION_1 and ANY_LAYOUT.
> > > > 
> > > > Assume the guest is a legacy device and we just set VERSION_1 (the
> > > > current case): ANY_LAYOUT will never be negotiated.
> > > > 
> > > >   - I'm following the way the Linux kernel does it: it also sets both features.
> > > > 
> > > >   Maybe, we could unset ANY_LAYOUT when VERSION_1 is _negotiated_?
> > > > 
> > > > The unset after negotiation I proposed turned out not to work: the
> > > > feature is already negotiated; unsetting it only on the vhost side
> > > > doesn't change anything. Besides, it may break migration, as Michael
> > > > stated below.
> > > 
> > > I think the reverse. Teach vhost user that for future machine types
> > > only VERSION_1 implies ANY_LAYOUT.
> > > 
> > > 
> > > > > Therein lies a problem. If dpdk tweaks flags, updating it
> > > > > will break guest migration.
> > > > > 
> > > > > One way is to require that users specify all flags fully when
> > > > > creating the virtio net device.
> > > > 
> > Like how? By a new command line option? And the user has to type
> > > > all those features?
> > > 
> > > Make libvirt do this. Users use management normally; those that don't
> > > likely don't migrate VMs.
> > 
> > Fair enough.
> > 
> > > 
> > > > > QEMU could verify that all required
> > > > > flags are set, and fail init if not.
> > > > > 
> > > > > This has other advantages, e.g. it adds ability to
> > > > > init device without waiting for dpdk to connect.
> > 
> > Will the feature negotiation between DPDK and QEMU still exist
> > in your proposal?
> > 
> > > > > 
> > > > > However, enabling each new feature would now require
> > > > > management work. How about dpdk ships the list
> > > > > of supported features instead?
> > > > > Management tools could read them on source and destination
> > > > > and select features supported on both sides.
> > > > 
> > That means the management tool would somehow have a dependency on the
> > DPDK project, to which I have no objection at all. But is that
> > a good idea?
> > > 
> > > It already starts the bridge somehow, does it not?
> > 
> > Indeed. I was first thinking about reading the dpdk source file
> > to determine the DPDK supported feature list, but that binds the
> > two too tightly. I later realized you may ask DPDK to provide a
> > binary to dump the list, or something like that.
> > 
> > > 
> > BTW, I'm not quite sure I followed your idea. I mean, how is it supposed
> > to fix the ANY_LAYOUT issue here? How will this flag be set for a
> > legacy device?
> > > > 
> > > > --yliu
> > > 
> > > For ANY_LAYOUT, I think we should just set it in qemu,
> > > but only for new machine types.
> > 
> > What do you mean by "new machine types"? Virtio device with newer
> > virtio-spec version?
> > 
> > > This addresses migration
> > > concerns.
> > 
> > To make sure I followed you, do you mean the migration issue from
> > an older "dpdk + qemu" combo to a newer "dpdk + qemu" combo (where
> > more new features might be shipped)?
> > 
> > Besides that, your proposal looks like a lot of work to accomplish.
> > Are you okay with making it simple first: set it consistently, like
> > the Linux kernel does? This would at least make ANY_LAYOUT
> > actually be enabled for legacy devices (which are also the default,
> > widely used so far).
> 
> Before enabling anything by default, we should first optimize the 1 slot
> case. Indeed, a micro-benchmark using testpmd in txonly[0] shows ~17%
> perf regression for the 64-byte case:
>  - 2 descs per packet: 11.6Mpps
>  - 1 desc per packet: 9.6Mpps
> 
> This is due to the virtio header clearing in virtqueue_enqueue_xmit().
> Removing it, we get better results than with 2 descs (12.0Mpps).
> Since the Virtio PMD doesn't support offloads, I wonder whether we can
> just drop the memset?

What will happen? Will the header be uninitialized?

The spec says:
The driver can send a completely checksummed packet. In this case, flags
will be zero, and gso_type
will be VIRTIO_NET_HDR_GSO_NONE.

and
The driver MUST set num_buffers to zero.

Re: [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-09-29 Thread Maxime Coquelin



On 09/28/2016 04:28 AM, Yuanhan Liu wrote:

On Tue, Sep 27, 2016 at 10:56:40PM +0300, Michael S. Tsirkin wrote:

On Tue, Sep 27, 2016 at 11:11:58AM +0800, Yuanhan Liu wrote:

On Mon, Sep 26, 2016 at 10:24:55PM +0300, Michael S. Tsirkin wrote:

On Mon, Sep 26, 2016 at 11:01:58AM -0700, Stephen Hemminger wrote:

I assume that if using Version 1 the bit will be ignored


Yes, but I will just quote what you just said: what if the guest
virtio device is a legacy device? I also gave my reasons in another
email why I consistently set this flag:

  - we have to return all features we support to the guest.

We don't know whether the guest is a modern or legacy device. That means
we should claim we support both: VERSION_1 and ANY_LAYOUT.

Assume the guest is a legacy device and we just set VERSION_1 (the current
case): ANY_LAYOUT will never be negotiated.

  - I'm following the way the Linux kernel does it: it also sets both features.

  Maybe, we could unset ANY_LAYOUT when VERSION_1 is _negotiated_?

The unset after negotiation I proposed turned out not to work: the
feature is already negotiated; unsetting it only on the vhost side
doesn't change anything. Besides, it may break migration, as Michael
stated below.


I think the reverse. Teach vhost user that for future machine types
only VERSION_1 implies ANY_LAYOUT.



Therein lies a problem. If dpdk tweaks flags, updating it
will break guest migration.

One way is to require that users specify all flags fully when
creating the virtio net device.


Like how? By a new command line option? And the user has to type
all those features?


Make libvirt do this. Users use management normally; those that don't
likely don't migrate VMs.


Fair enough.




QEMU could verify that all required
flags are set, and fail init if not.

This has other advantages, e.g. it adds ability to
init device without waiting for dpdk to connect.


Will the feature negotiation between DPDK and QEMU still exist
in your proposal?



However, enabling each new feature would now require
management work. How about dpdk ships the list
of supported features instead?
Management tools could read them on source and destination
and select features supported on both sides.


That means the management tool would somehow have a dependency on the
DPDK project, to which I have no objection at all. But is that
a good idea?


It already starts the bridge somehow, does it not?


Indeed. I was first thinking about reading the dpdk source file
to determine the DPDK supported feature list, but that binds the
two too tightly. I later realized you may ask DPDK to provide a
binary to dump the list, or something like that.




BTW, I'm not quite sure I followed your idea. I mean, how is it supposed
to fix the ANY_LAYOUT issue here? How will this flag be set for a
legacy device?

--yliu


For ANY_LAYOUT, I think we should just set it in qemu,
but only for new machine types.


What do you mean by "new machine types"? Virtio device with newer
virtio-spec version?


This addresses migration
concerns.


To make sure I followed you, do you mean the migration issue from
an older "dpdk + qemu" combo to a newer "dpdk + qemu" combo (where
more new features might be shipped)?

Besides that, your proposal looks like a lot of work to accomplish.
Are you okay with making it simple first: set it consistently, like
the Linux kernel does? This would at least make ANY_LAYOUT
actually be enabled for legacy devices (which are also the default,
widely used so far).


Before enabling anything by default, we should first optimize the 1 slot
case. Indeed, a micro-benchmark using testpmd in txonly[0] shows ~17%
perf regression for the 64-byte case:
 - 2 descs per packet: 11.6Mpps
 - 1 desc per packet: 9.6Mpps

This is due to the virtio header clearing in virtqueue_enqueue_xmit().
Removing it, we get better results than with 2 descs (12.0Mpps).
Since the Virtio PMD doesn't support offloads, I wonder whether we can
just drop the memset?

 -- Maxime
[0]: For testing, you'll need these patches, otherwise only the first
packets will use a single slot:
 - http://dpdk.org/dev/patchwork/patch/16222/
 - http://dpdk.org/dev/patchwork/patch/16223/



Re: [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-09-27 Thread Yuanhan Liu
On Tue, Sep 27, 2016 at 10:56:40PM +0300, Michael S. Tsirkin wrote:
> On Tue, Sep 27, 2016 at 11:11:58AM +0800, Yuanhan Liu wrote:
> > On Mon, Sep 26, 2016 at 10:24:55PM +0300, Michael S. Tsirkin wrote:
> > > On Mon, Sep 26, 2016 at 11:01:58AM -0700, Stephen Hemminger wrote:
> > > > I assume that if using Version 1 the bit will be ignored
> > 
> > Yes, but I will just quote what you just said: what if the guest
> > virtio device is a legacy device? I also gave my reasons in another
> > email why I consistently set this flag:
> > 
> >   - we have to return all features we support to the guest.
> >   
> > We don't know whether the guest is a modern or legacy device. That means
> > we should claim we support both: VERSION_1 and ANY_LAYOUT.
> > 
> > Assume the guest is a legacy device and we just set VERSION_1 (the current
> > case): ANY_LAYOUT will never be negotiated.
> > 
> >   - I'm following the way the Linux kernel does it: it also sets both features.
> >   
> >   Maybe, we could unset ANY_LAYOUT when VERSION_1 is _negotiated_?
> > 
> > The unset after negotiation I proposed turned out not to work: the
> > feature is already negotiated; unsetting it only on the vhost side
> > doesn't change anything. Besides, it may break migration, as Michael
> > stated below.
> 
> I think the reverse. Teach vhost user that for future machine types
> only VERSION_1 implies ANY_LAYOUT.
> 
> 
> > > Therein lies a problem. If dpdk tweaks flags, updating it
> > > will break guest migration.
> > > 
> > > One way is to require that users specify all flags fully when
> > > creating the virtio net device. 
> > 
> > Like how? By a new command line option? And the user has to type
> > all those features?
> 
> Make libvirt do this. Users use management normally; those that don't
> likely don't migrate VMs.

Fair enough.

> 
> > > QEMU could verify that all required
> > > flags are set, and fail init if not.
> > > 
> > > This has other advantages, e.g. it adds ability to
> > > init device without waiting for dpdk to connect.

Will the feature negotiation between DPDK and QEMU still exist
in your proposal?

> > > 
> > > However, enabling each new feature would now require
> > > management work. How about dpdk ships the list
> > > of supported features instead?
> > > Management tools could read them on source and destination
> > > and select features supported on both sides.
> > 
> > That means the management tool would somehow have a dependency on the
> > DPDK project, to which I have no objection at all. But is that
> > a good idea?
> 
> It already starts the bridge somehow, does it not?

Indeed. I was first thinking about reading the dpdk source file
to determine the DPDK supported feature list, but that binds the
two too tightly. I later realized you may ask DPDK to provide a
binary to dump the list, or something like that.

> 
> > BTW, I'm not quite sure I followed your idea. I mean, how is it supposed
> > to fix the ANY_LAYOUT issue here? How will this flag be set for a
> > legacy device?
> > 
> > --yliu
> 
> For ANY_LAYOUT, I think we should just set it in qemu,
> but only for new machine types.

What do you mean by "new machine types"? Virtio device with newer
virtio-spec version?

> This addresses migration
> concerns.

To make sure I followed you, do you mean the migration issue from
an older "dpdk + qemu" combo to a newer "dpdk + qemu" combo (where
more new features might be shipped)?

Besides that, your proposal looks like a lot of work to accomplish.
Are you okay with making it simple first: set it consistently, like
the Linux kernel does? This would at least make ANY_LAYOUT
actually be enabled for legacy devices (which are also the default,
widely used so far).

--yliu

> 
> But there will be more new features in the future and
> it is necessary to think about how we will enable them without
> breaking migration.
> 
> -- 
> MST



Re: [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-09-27 Thread Stephen Hemminger
On Tue, 27 Sep 2016 11:11:58 +0800
Yuanhan Liu  wrote:

> On Mon, Sep 26, 2016 at 10:24:55PM +0300, Michael S. Tsirkin wrote:
> > On Mon, Sep 26, 2016 at 11:01:58AM -0700, Stephen Hemminger wrote:  
> > > I assume that if using Version 1 the bit will be ignored
> 
> Yes, but I will just quote what you just said: what if the guest
> virtio device is a legacy device? I also gave my reasons in another
> email why I consistently set this flag:
> 
>   - we have to return all features we support to the guest.
>   
> We don't know whether the guest is a modern or legacy device. That means
> we should claim we support both: VERSION_1 and ANY_LAYOUT.
> 
> Assume the guest is a legacy device and we just set VERSION_1 (the current
> case): ANY_LAYOUT will never be negotiated.
> 
>   - I'm following the way the Linux kernel does it: it also sets both features.

Agreed, just do what the Linux kernel does.
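
Concretely, "what the Linux kernel does" translates on the DPDK side to
advertising both bits in the supported-features mask; a sketch (macro
name per DPDK's convention, feature list abridged):

#define VHOST_SUPPORTED_FEATURES \
        ((1ULL << VIRTIO_NET_F_MRG_RXBUF) | \
         (1ULL << VIRTIO_F_ANY_LAYOUT)    | \
         (1ULL << VIRTIO_F_VERSION_1)     | \
         (1ULL << VHOST_F_LOG_ALL))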



Re: [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-09-27 Thread Michael S. Tsirkin
On Tue, Sep 27, 2016 at 11:11:58AM +0800, Yuanhan Liu wrote:
> On Mon, Sep 26, 2016 at 10:24:55PM +0300, Michael S. Tsirkin wrote:
> > On Mon, Sep 26, 2016 at 11:01:58AM -0700, Stephen Hemminger wrote:
> > > I assume that if using Version 1 the bit will be ignored
> 
> Yes, but I will just quote what you just said: what if the guest
> virtio device is a legacy device? I also gave my reasons in another
> email why I consistently set this flag:
> 
>   - we have to return all features we support to the guest.
>   
> We don't know whether the guest is a modern or legacy device. That means
> we should claim we support both: VERSION_1 and ANY_LAYOUT.
> 
> Assume the guest is a legacy device and we just set VERSION_1 (the current
> case): ANY_LAYOUT will never be negotiated.
> 
>   - I'm following the way the Linux kernel does it: it also sets both features.
>   
>   Maybe, we could unset ANY_LAYOUT when VERSION_1 is _negotiated_?
> 
> The unset after negotiation I proposed turned out not to work: the
> feature is already negotiated; unsetting it only on the vhost side
> doesn't change anything. Besides, it may break migration, as Michael
> stated below.

I think the reverse. Teach vhost user that for future machine types
only VERSION_1 implies ANY_LAYOUT.


> > Therein lies a problem. If dpdk tweaks flags, updating it
> > will break guest migration.
> > 
> > One way is to require that users specify all flags fully when
> > creating the virtio net device. 
> 
> Like how? By a new command line option? And the user has to type
> all those features?

Make libvirt do this. Users use management normally; those that don't
likely don't migrate VMs.

> > QEMU could verify that all required
> > flags are set, and fail init if not.
> > 
> > This has other advantages, e.g. it adds ability to
> > init device without waiting for dpdk to connect.
> > 
> > However, enabling each new feature would now require
> > management work. How about dpdk ships the list
> > of supported features instead?
> > Management tools could read them on source and destination
> > and select features supported on both sides.
> 
> That means the management tool would somehow have a dependency on the
> DPDK project, to which I have no objection at all. But is that
> a good idea?

It already starts the bridge somehow, does it not?

> BTW, I'm not quite sure I followed your idea. I mean, how is it supposed
> to fix the ANY_LAYOUT issue here? How will this flag be set for a
> legacy device?
> 
>   --yliu

For ANY_LAYOUT, I think we should just set it in qemu,
but only for new machine types. This addresses migration
concerns.

But there will be more new features in the future and
it is necessary to think about how we will enable them without
breaking migration.

-- 
MST
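
The machine-type gating described above would presumably use QEMU's usual
compat-property mechanism. A hypothetical sketch (the "any_layout"
property exists on virtio devices; the exact compat entry below is an
assumption, not actual QEMU code):

/* In the compat list of older machine types: keep ANY_LAYOUT off so
 * migration from existing setups is unaffected; new machine types get
 * the default (on). */
{
        .driver   = "virtio-net-pci",
        .property = "any_layout",
        .value    = "off",
},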



Re: [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-09-26 Thread Yuanhan Liu
On Mon, Sep 26, 2016 at 10:24:55PM +0300, Michael S. Tsirkin wrote:
> On Mon, Sep 26, 2016 at 11:01:58AM -0700, Stephen Hemminger wrote:
> > I assume that if using Version 1 the bit will be ignored

Yes, but I will just quote what you just said: what if the guest
virtio device is a legacy device? I also gave my reasons in another
email why I consistently set this flag:

  - we have to return all features we support to the guest.
  
We don't know whether the guest is a modern or legacy device. That means
we should claim we support both: VERSION_1 and ANY_LAYOUT.

Assume the guest is a legacy device and we just set VERSION_1 (the current
case): ANY_LAYOUT will never be negotiated.

  - I'm following the way the Linux kernel does it: it also sets both features.
  
  Maybe, we could unset ANY_LAYOUT when VERSION_1 is _negotiated_?

The unset after negotiation I proposed turned out not to work: the
feature is already negotiated; unsetting it only on the vhost side
doesn't change anything. Besides, it may break migration, as Michael
stated below.

> Therein lies a problem. If dpdk tweaks flags, updating it
> will break guest migration.
> 
> One way is to require that users specify all flags fully when
> creating the virtio net device. 

Like how? By a new command line option? And the user has to type
all those features?

> QEMU could verify that all required
> flags are set, and fail init if not.
> 
> This has other advantages, e.g. it adds ability to
> init device without waiting for dpdk to connect.
> 
> However, enabling each new feature would now require
> management work. How about dpdk ships the list
> of supported features instead?
> Management tools could read them on source and destination
> and select features supported on both sides.

That means the management tool would somehow have a dependency on the
DPDK project, to which I have no objection at all. But is that
a good idea?

BTW, I'm not quite sure I followed your idea. I mean, how is it supposed
to fix the ANY_LAYOUT issue here? How will this flag be set for a
legacy device?

--yliu



Re: [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-09-26 Thread Michael S. Tsirkin
On Mon, Sep 26, 2016 at 11:01:58AM -0700, Stephen Hemminger wrote:
> I assume that if using Version 1 the bit will be ignored
> 

Therein lies a problem. If dpdk tweaks flags, updating it
will break guest migration.

One way is to require that users specify all flags fully when
creating the virtio net device.  QEMU could verify that all required
flags are set, and fail init if not.

This has other advantages, e.g. it adds ability to
init device without waiting for dpdk to connect.

However, enabling each new feature would now require
management work. How about dpdk ships the list
of supported features instead?
Management tools could read them on source and destination
and select features supported on both sides.

-- 
MST
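
A trivial dumper along those lines could ship with dpdk, letting
management tools intersect the feature sets of source and destination.
Sketch only, assuming the rte_vhost_feature_get() API of DPDK at the
time:

#include <inttypes.h>
#include <stdio.h>

#include <rte_eal.h>
#include <rte_virtio_net.h>

int
main(int argc, char **argv)
{
        if (rte_eal_init(argc, argv) < 0)
                return 1;

        /* Print the feature bits this DPDK build can offer to QEMU. */
        printf("0x%" PRIx64 "\n", rte_vhost_feature_get());
        return 0;
}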