Re: [PATCH net-next RFC 2/5] vhost: introduce helper to prefetch desc index

2017-09-26 Thread Michael S. Tsirkin
On Fri, Sep 22, 2017 at 04:02:32PM +0800, Jason Wang wrote:
> This patch introduces vhost_prefetch_desc_indices() which could batch
> descriptor indices fetching and used ring updating. This intends to
> reduce the cache misses of indices fetching and updating and reduce
> cache line bounce when virtqueue is almost full. copy_to_user() was
> used in order to benefit from modern cpus that support fast string
> copy. Batched virtqueue processing will be the first user.
> 
> Signed-off-by: Jason Wang 
> ---
>  drivers/vhost/vhost.c | 55 +++
>  drivers/vhost/vhost.h |  3 +++
>  2 files changed, 58 insertions(+)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index f87ec75..8424166d 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -2437,6 +2437,61 @@ struct vhost_msg_node *vhost_dequeue_msg(struct vhost_dev *dev,
>  }
>  EXPORT_SYMBOL_GPL(vhost_dequeue_msg);
>  
> +int vhost_prefetch_desc_indices(struct vhost_virtqueue *vq,
> + struct vring_used_elem *heads,
> + u16 num, bool used_update)

why do you need to combine used update with prefetch?

> +{
> + int ret, ret2;
> + u16 last_avail_idx, last_used_idx, total, copied;
> + __virtio16 avail_idx;
> + struct vring_used_elem __user *used;
> + int i;
> +
> + if (unlikely(vhost_get_avail(vq, avail_idx, &vq->avail->idx))) {
> + vq_err(vq, "Failed to access avail idx at %p\n",
> +&vq->avail->idx);
> + return -EFAULT;
> + }
> + last_avail_idx = vq->last_avail_idx & (vq->num - 1);
> + vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
> + total = vq->avail_idx - vq->last_avail_idx;
> + ret = total = min(total, num);
> +
> + for (i = 0; i < ret; i++) {
> + ret2 = vhost_get_avail(vq, heads[i].id,
> +   &vq->avail->ring[last_avail_idx]);
> + if (unlikely(ret2)) {
> + vq_err(vq, "Failed to get descriptors\n");
> + return -EFAULT;
> + }
> + last_avail_idx = (last_avail_idx + 1) & (vq->num - 1);
> + }
> +
> + if (!used_update)
> + return ret;
> +
> + last_used_idx = vq->last_used_idx & (vq->num - 1);
> + while (total) {
> + copied = min((u16)(vq->num - last_used_idx), total);
> + ret2 = vhost_copy_to_user(vq,
> +   &vq->used->ring[last_used_idx],
> +   &heads[ret - total],
> +   copied * sizeof(*used));
> +
> + if (unlikely(ret2)) {
> + vq_err(vq, "Failed to update used ring!\n");
> + return -EFAULT;
> + }
> +
> + last_used_idx = 0;
> + total -= copied;
> + }
> +
> + /* Only get avail ring entries after they have been exposed by guest. */
> + smp_rmb();

Barrier before return is a very confusing API. I guess it's designed to
be used in a specific way to make it necessary - but what is it?


> + return ret;
> +}
> +EXPORT_SYMBOL(vhost_prefetch_desc_indices);
>  
>  static int __init vhost_init(void)
>  {
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 39ff897..16c2cb6 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -228,6 +228,9 @@ ssize_t vhost_chr_read_iter(struct vhost_dev *dev, struct iov_iter *to,
>  ssize_t vhost_chr_write_iter(struct vhost_dev *dev,
>struct iov_iter *from);
>  int vhost_init_device_iotlb(struct vhost_dev *d, bool enabled);
> +int vhost_prefetch_desc_indices(struct vhost_virtqueue *vq,
> + struct vring_used_elem *heads,
> + u16 num, bool used_update);
>  
>  #define vq_err(vq, fmt, ...) do {  \
>   pr_debug(pr_fmt(fmt), ##__VA_ARGS__);   \
> -- 
> 2.7.4


Re: [PATCH net-next RFC 0/5] batched tx processing in vhost_net

2017-09-26 Thread Michael S. Tsirkin
On Fri, Sep 22, 2017 at 04:02:30PM +0800, Jason Wang wrote:
> Hi:
> 
> This series tries to implement basic tx batched processing. This is
> done by prefetching descriptor indices and update used ring in a
> batch. This intends to speed up used ring updating and improve the
> cache utilization. Test shows about ~22% improvement in tx pps.




> Please review.
> 
> Jason Wang (5):
>   vhost: split out ring head fetching logic
>   vhost: introduce helper to prefetch desc index
>   vhost: introduce vhost_add_used_idx()

Please squash these new APIs into where they are used.
This split-up just makes review harder for me as
I can't figure out how the new APIs are used.


>   vhost_net: rename VHOST_RX_BATCH to VHOST_NET_BATCH

This is ok as a separate patch.

>   vhost_net: basic tx virtqueue batched processing
> 
>  drivers/vhost/net.c   | 221 --
>  drivers/vhost/vhost.c | 165 +++--
>  drivers/vhost/vhost.h |   9 ++
>  3 files changed, 270 insertions(+), 125 deletions(-)
> 
> -- 
> 2.7.4


Re: [PATCH net-next RFC 3/5] vhost: introduce vhost_add_used_idx()

2017-09-26 Thread Michael S. Tsirkin
On Fri, Sep 22, 2017 at 04:02:33PM +0800, Jason Wang wrote:
> This patch introduces a helper which just increases the used idx. This
> will be used together with vhost_prefetch_desc_indices() by the batching
> code.
> 
> Signed-off-by: Jason Wang 
> ---
>  drivers/vhost/vhost.c | 33 +
>  drivers/vhost/vhost.h |  1 +
>  2 files changed, 34 insertions(+)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 8424166d..6532cda 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -2178,6 +2178,39 @@ int vhost_add_used(struct vhost_virtqueue *vq, unsigned int head, int len)
>  }
>  EXPORT_SYMBOL_GPL(vhost_add_used);
>  
> +int vhost_add_used_idx(struct vhost_virtqueue *vq, int n)
> +{
> + u16 old, new;
> +
> + old = vq->last_used_idx;
> + new = (vq->last_used_idx += n);
> + /* If the driver never bothers to signal in a very long while,
> +  * used index might wrap around. If that happens, invalidate
> +  * signalled_used index we stored. TODO: make sure driver
> +  * signals at least once in 2^16 and remove this.
> +  */
> + if (unlikely((u16)(new - vq->signalled_used) < (u16)(new - old)))
> + vq->signalled_used_valid = false;
> +
> + /* Make sure buffer is written before we update index. */
> + smp_wmb();
> + if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx),
> +&vq->used->idx)) {
> + vq_err(vq, "Failed to increment used idx");
> + return -EFAULT;
> + }
> + if (unlikely(vq->log_used)) {
> + /* Log used index update. */
> + log_write(vq->log_base,
> +   vq->log_addr + offsetof(struct vring_used, idx),
> +   sizeof(vq->used->idx));
> + if (vq->log_ctx)
> + eventfd_signal(vq->log_ctx, 1);
> + }
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(vhost_add_used_idx);
> +
>  static int __vhost_add_used_n(struct vhost_virtqueue *vq,
>   struct vring_used_elem *heads,
>   unsigned count)
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 16c2cb6..5dd6c05 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -199,6 +199,7 @@ int __vhost_get_vq_desc(struct vhost_virtqueue *vq,
>  void vhost_discard_vq_desc(struct vhost_virtqueue *, int n);
>  
>  int vhost_vq_init_access(struct vhost_virtqueue *);
> +int vhost_add_used_idx(struct vhost_virtqueue *vq, int n);
>  int vhost_add_used(struct vhost_virtqueue *, unsigned int head, int len);
>  int vhost_add_used_n(struct vhost_virtqueue *, struct vring_used_elem *heads,
>unsigned count);

Please change the API to hide the fact that there's an index that needs
to be updated.

> -- 
> 2.7.4


Re: [PATCH net-next RFC 5/5] vhost_net: basic tx virtqueue batched processing

2017-09-26 Thread Michael S. Tsirkin
On Fri, Sep 22, 2017 at 04:02:35PM +0800, Jason Wang wrote:
> This patch implements basic batched processing of tx virtqueue by
> prefetching desc indices and updating used ring in a batch. For
> non-zerocopy case, vq->heads were used for storing the prefetched
> indices and updating used ring. It is also a requirement for doing
> more batching on top. For zerocopy case and for simplicity, batched
> processing was simply disabled by only fetching and processing one
> descriptor at a time; this could be optimized in the future.
> 
> XDP_DROP (without touching skb) on tun (with Moongen in guest) with
> zerocopy disabled:
> 
> Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz:
> Before: 3.20Mpps
> After:  3.90Mpps (+22%)
> 
> No differences were seen with zerocopy enabled.
> 
> Signed-off-by: Jason Wang 

So where is the speedup coming from? I'd guess the ring is
hot in cache, it's faster to access it in one go, then
pass many packets to net stack. Is that right?

Another possibility is better code cache locality.

So how about this patchset is refactored:

1. use existing APIs: just first get packets, then
   transmit them all, then use them all (a rough sketch follows below)
2. add new APIs and move the loop into vhost core
   for more speedups
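
As a rough sketch, option 1 could be built entirely from existing APIs
(vhost_get_vq_desc(), vhost_add_used_and_signal_n()); the snippet below only
illustrates the shape of the loop, with error handling, zerocopy and the
per-packet iov/len bookkeeping omitted:

	/* inside handle_tx(): first get a batch of packets ... */
	n = 0;
	while (n < VHOST_NET_BATCH) {
		head = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
					 &out, &in, NULL, NULL);
		if (head < 0 || head == vq->num)
			break;
		vq->heads[n].id = cpu_to_vhost32(vq, head);
		vq->heads[n].len = 0;
		/* remember the iov/length for this packet before reusing vq->iov */
		n++;
	}
	/* ... then transmit them all ... */
	for (i = 0; i < n; i++)
		sock->ops->sendmsg(sock, &msg, len);	/* msg/len rebuilt per packet */
	/* ... then use them all: publish and signal once per batch */
	vhost_add_used_and_signal_n(&net->dev, vq, vq->heads, n);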



> ---
>  drivers/vhost/net.c   | 215 --
>  drivers/vhost/vhost.c |   2 +-
>  2 files changed, 121 insertions(+), 96 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index c89640e..c439892 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -408,27 +408,25 @@ static int vhost_net_enable_vq(struct vhost_net *n,
>   return vhost_poll_start(poll, sock->file);
>  }
>  
> -static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
> - struct vhost_virtqueue *vq,
> - struct iovec iov[], unsigned int iov_size,
> - unsigned int *out_num, unsigned int *in_num)
> +static bool vhost_net_tx_avail(struct vhost_net *net,
> +struct vhost_virtqueue *vq)
>  {
>   unsigned long uninitialized_var(endtime);
> - int r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
> -   out_num, in_num, NULL, NULL);
>  
> - if (r == vq->num && vq->busyloop_timeout) {
> - preempt_disable();
> - endtime = busy_clock() + vq->busyloop_timeout;
> - while (vhost_can_busy_poll(vq->dev, endtime) &&
> -vhost_vq_avail_empty(vq->dev, vq))
> - cpu_relax();
> - preempt_enable();
> - r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
> -   out_num, in_num, NULL, NULL);
> - }
> + if (!vq->busyloop_timeout)
> + return false;
>  
> - return r;
> + if (!vhost_vq_avail_empty(vq->dev, vq))
> + return true;
> +
> + preempt_disable();
> + endtime = busy_clock() + vq->busyloop_timeout;
> + while (vhost_can_busy_poll(vq->dev, endtime) &&
> + vhost_vq_avail_empty(vq->dev, vq))
> + cpu_relax();
> + preempt_enable();
> +
> + return !vhost_vq_avail_empty(vq->dev, vq);
>  }
>  
>  static bool vhost_exceeds_maxpend(struct vhost_net *net)
> @@ -446,8 +444,9 @@ static void handle_tx(struct vhost_net *net)
>  {
>   struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
>   struct vhost_virtqueue *vq = &nvq->vq;
> + struct vring_used_elem used, *heads = vq->heads;
>   unsigned out, in;
> - int head;
> + int avails, head;
>   struct msghdr msg = {
>   .msg_name = NULL,
>   .msg_namelen = 0,
> @@ -461,6 +460,7 @@ static void handle_tx(struct vhost_net *net)
>   struct socket *sock;
>   struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
>   bool zcopy, zcopy_used;
> + int i, batched = VHOST_NET_BATCH;
>  
>   mutex_lock(&vq->mutex);
>   sock = vq->private_data;
> @@ -475,6 +475,12 @@ static void handle_tx(struct vhost_net *net)
>   hdr_size = nvq->vhost_hlen;
>   zcopy = nvq->ubufs;
>  
> + /* Disable zerocopy batched fetching for simplicity */
> + if (zcopy) {
> + heads = &used;
> + batched = 1;
> + }
> +
>   for (;;) {
>   /* Release DMAs done buffers first */
>   if (zcopy)
> @@ -486,95 +492,114 @@ static void handle_tx(struct vhost_net *net)
>   if (unlikely(vhost_exceeds_maxpend(net)))
>   break;
>  
> - head = vhost_net_tx_get_vq_desc(net, vq, vq->iov,
> - ARRAY_SIZE(vq->iov),
> - &out, &in);
> + avails = vhost_prefetch_desc_indices(vq, heads, batched, !zcopy);
>   /* On error, stop handling until the next kick. */
> - if (unlikely(head < 0))
> + if (unlikely(avails 

Re: [PATCH net-next RFC 3/5] vhost: introduce vhost_add_used_idx()

2017-09-26 Thread Jason Wang



On 2017-09-27 03:13, Michael S. Tsirkin wrote:

On Fri, Sep 22, 2017 at 04:02:33PM +0800, Jason Wang wrote:

This patch introduces a helper which just increases the used idx. This
will be used together with vhost_prefetch_desc_indices() by the batching
code.

Signed-off-by: Jason Wang 
---
  drivers/vhost/vhost.c | 33 +
  drivers/vhost/vhost.h |  1 +
  2 files changed, 34 insertions(+)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 8424166d..6532cda 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2178,6 +2178,39 @@ int vhost_add_used(struct vhost_virtqueue *vq, unsigned 
int head, int len)
  }
  EXPORT_SYMBOL_GPL(vhost_add_used);
  
+int vhost_add_used_idx(struct vhost_virtqueue *vq, int n)

+{
+   u16 old, new;
+
+   old = vq->last_used_idx;
+   new = (vq->last_used_idx += n);
+   /* If the driver never bothers to signal in a very long while,
+* used index might wrap around. If that happens, invalidate
+* signalled_used index we stored. TODO: make sure driver
+* signals at least once in 2^16 and remove this.
+*/
+   if (unlikely((u16)(new - vq->signalled_used) < (u16)(new - old)))
+   vq->signalled_used_valid = false;
+
+   /* Make sure buffer is written before we update index. */
+   smp_wmb();
+   if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx),
+  &vq->used->idx)) {
+   vq_err(vq, "Failed to increment used idx");
+   return -EFAULT;
+   }
+   if (unlikely(vq->log_used)) {
+   /* Log used index update. */
+   log_write(vq->log_base,
+ vq->log_addr + offsetof(struct vring_used, idx),
+ sizeof(vq->used->idx));
+   if (vq->log_ctx)
+   eventfd_signal(vq->log_ctx, 1);
+   }
+   return 0;
+}
+EXPORT_SYMBOL_GPL(vhost_add_used_idx);
+
  static int __vhost_add_used_n(struct vhost_virtqueue *vq,
struct vring_used_elem *heads,
unsigned count)
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 16c2cb6..5dd6c05 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -199,6 +199,7 @@ int __vhost_get_vq_desc(struct vhost_virtqueue *vq,
  void vhost_discard_vq_desc(struct vhost_virtqueue *, int n);
  
  int vhost_vq_init_access(struct vhost_virtqueue *);

+int vhost_add_used_idx(struct vhost_virtqueue *vq, int n);
  int vhost_add_used(struct vhost_virtqueue *, unsigned int head, int len);
  int vhost_add_used_n(struct vhost_virtqueue *, struct vring_used_elem *heads,
 unsigned count);

Please change the API to hide the fact that there's an index that needs
to be updated.


In fact, an interesting optimization on top is to just call
vhost_add_used_idx(vq, n) once instead of calling vhost_add_used_idx(vq, 1)
n times. That's the reason I leave n in the API.
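
A rough sketch of that follow-up (assuming the used entries were already
written by vhost_prefetch_desc_indices() with used_update set, non-zerocopy
only, error handling and the per-packet msg/len setup omitted):

	avails = vhost_prefetch_desc_indices(vq, heads, batched, true);
	if (avails <= 0)
		return;
	for (i = 0; i < avails; i++) {
		/* build the iov from heads[i].id and hand the packet to the socket */
		sock->ops->sendmsg(sock, &msg, len);
	}
	vhost_add_used_idx(vq, avails);	/* bump used->idx once for the whole batch */
	vhost_signal(&net->dev, vq);	/* at most one guest notification */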


Thanks




--
2.7.4



Re: [PATCH net-next RFC 0/5] batched tx processing in vhost_net

2017-09-26 Thread Jason Wang



On 2017-09-27 03:26, Michael S. Tsirkin wrote:

On Fri, Sep 22, 2017 at 04:02:30PM +0800, Jason Wang wrote:

Hi:

This series tries to implement basic tx batched processing. This is
done by prefetching descriptor indices and update used ring in a
batch. This intends to speed up used ring updating and improve the
cache utilization. Test shows about ~22% improvement in tx pps.





Please review.

Jason Wang (5):
   vhost: split out ring head fetching logic
   vhost: introduce helper to prefetch desc index
   vhost: introduce vhost_add_used_idx()

Please squash these new APIs into where they are used.
This split-up just makes review harder for me as
I can't figure out how the new APIs are used.


Ok.

Thanks





   vhost_net: rename VHOST_RX_BATCH to VHOST_NET_BATCH

This is ok as a separate patch.


   vhost_net: basic tx virtqueue batched processing

  drivers/vhost/net.c   | 221 --
  drivers/vhost/vhost.c | 165 +++--
  drivers/vhost/vhost.h |   9 ++
  3 files changed, 270 insertions(+), 125 deletions(-)

--
2.7.4



Re: [PATCH v1 4/4] KVM/vmx: enable lbr for the guest

2017-09-26 Thread Wei Wang

On 09/27/2017 12:41 AM, Andi Kleen wrote:

1) vCPU context switching and guest side task switching are not identical.
That is, when the vCPU is scheduled out, the guest task on the vCPU may not

guest task lifetime has nothing to do with this. It's completely independent
of what you do here on the VCPU level.


run out its time slice yet, so the task will continue to run when the vCPU is
scheduled in by the host (lbr wasn't saved by the guest task when the vCPU is
scheduled out in this case).

It is possible to have the vCPU which runs the guest task (in use of lbr)
scheduled out, followed by a new host task being scheduled in on the pCPU to run.
It is not guaranteed that the new host task does not use the LBR feature on
the pCPU.

Sure it may use the LBR, and the normal perf context switch
will switch it and everything works fine.

It's like any other per-task LBR user.


OK, I see the point, thanks.

Why couldn't we save the LBR_SELECT via task switching too?


Best,
Wei


Re: [PATCH net-next RFC 5/5] vhost_net: basic tx virtqueue batched processing

2017-09-26 Thread Jason Wang



On 2017-09-27 03:25, Michael S. Tsirkin wrote:

On Fri, Sep 22, 2017 at 04:02:35PM +0800, Jason Wang wrote:

This patch implements basic batched processing of tx virtqueue by
prefetching desc indices and updating used ring in a batch. For
non-zerocopy case, vq->heads were used for storing the prefetched
indices and updating used ring. It is also a requirement for doing
more batching on top. For zerocopy case and for simplicity, batched
processing was simply disabled by only fetching and processing one
descriptor at a time; this could be optimized in the future.

XDP_DROP (without touching skb) on tun (with Moongen in guest) with
zerocopy disabled:

Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz:
Before: 3.20Mpps
After:  3.90Mpps (+22%)

No differences were seen with zerocopy enabled.

Signed-off-by: Jason Wang 

So where is the speedup coming from? I'd guess the ring is
hot in cache, it's faster to access it in one go, then
pass many packets to net stack. Is that right?

Another possibility is better code cache locality.


Yes, I think the speed up comes from:

- less cache misses
- less cache line bounce when virtqueue is about to be full (guest is
  faster than host, which is the case of MoonGen)
- less memory barriers
- possible faster copy speed by using copy_to_user() on modern CPUs (see
  the sketch below)
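
To illustrate the copy_to_user() point: publishing used entries one element
at a time costs two put_user-style accesses per element, while the
non-wrapping part of a batch can go out in a single string copy, along the
lines of __vhost_add_used_n() (sketch only, error handling omitted, start is
the masked last_used_idx):

	struct vring_used_elem __user *used = vq->used->ring + start;

	/* one element at a time */
	for (i = 0; i < n; i++) {
		vhost_put_user(vq, heads[i].id, &used[i].id);
		vhost_put_user(vq, heads[i].len, &used[i].len);
	}

	/* versus one copy for the whole run */
	vhost_copy_to_user(vq, used, heads, n * sizeof(*used));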



So how about this patchset is refactored:

1. use existing APIs just first get packets then
transmit them all then use them all


Looks like the current API cannot get packets first; it only supports
getting packets one by one (if you mean vhost_get_vq_desc()). And used ring
updating may get more misses in this case.



2. add new APIs and move the loop into vhost core
for more speedups


I don't see any advantages; it looks like it would just need, e.g., some
callbacks in this case.
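
Purely as an illustration of what "moving the loop into vhost core" versus
"just some callbacks" could mean, one hypothetical shape (none of these names
exist today) would be:

	/* vhost core owns the descriptor loop and calls back per packet */
	struct vhost_tx_ops {
		int (*xmit)(void *opaque, struct iovec *iov, unsigned int out_num);
	};

	int vhost_tx_run(struct vhost_virtqueue *vq,
			 const struct vhost_tx_ops *ops, void *opaque);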


Thanks


Re: [PATCH net-next RFC 2/5] vhost: introduce helper to prefetch desc index

2017-09-26 Thread Jason Wang



On 2017-09-27 03:19, Michael S. Tsirkin wrote:

On Fri, Sep 22, 2017 at 04:02:32PM +0800, Jason Wang wrote:

This patch introduces vhost_prefetch_desc_indices() which could batch
descriptor indices fetching and used ring updating. This intends to
reduce the cache misses of indices fetching and updating and reduce
cache line bounce when virtqueue is almost full. copy_to_user() was
used in order to benefit from modern cpus that support fast string
copy. Batched virtqueue processing will be the first user.

Signed-off-by: Jason Wang 
---
  drivers/vhost/vhost.c | 55 +++
  drivers/vhost/vhost.h |  3 +++
  2 files changed, 58 insertions(+)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index f87ec75..8424166d 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2437,6 +2437,61 @@ struct vhost_msg_node *vhost_dequeue_msg(struct vhost_dev *dev,
  }
  EXPORT_SYMBOL_GPL(vhost_dequeue_msg);
  
+int vhost_prefetch_desc_indices(struct vhost_virtqueue *vq,

+   struct vring_used_elem *heads,
+   u16 num, bool used_update)

why do you need to combine used update with prefetch?


For better performance, and I believe we don't care about the overhead
when we meet errors in tx.





+{
+   int ret, ret2;
+   u16 last_avail_idx, last_used_idx, total, copied;
+   __virtio16 avail_idx;
+   struct vring_used_elem __user *used;
+   int i;
+
+   if (unlikely(vhost_get_avail(vq, avail_idx, &vq->avail->idx))) {
+   vq_err(vq, "Failed to access avail idx at %p\n",
+  &vq->avail->idx);
+   return -EFAULT;
+   }
+   last_avail_idx = vq->last_avail_idx & (vq->num - 1);
+   vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
+   total = vq->avail_idx - vq->last_avail_idx;
+   ret = total = min(total, num);
+
+   for (i = 0; i < ret; i++) {
+   ret2 = vhost_get_avail(vq, heads[i].id,
+ &vq->avail->ring[last_avail_idx]);
+   if (unlikely(ret2)) {
+   vq_err(vq, "Failed to get descriptors\n");
+   return -EFAULT;
+   }
+   last_avail_idx = (last_avail_idx + 1) & (vq->num - 1);
+   }
+
+   if (!used_update)
+   return ret;
+
+   last_used_idx = vq->last_used_idx & (vq->num - 1);
+   while (total) {
+   copied = min((u16)(vq->num - last_used_idx), total);
+   ret2 = vhost_copy_to_user(vq,
+ &vq->used->ring[last_used_idx],
+ &heads[ret - total],
+ copied * sizeof(*used));
+
+   if (unlikely(ret2)) {
+   vq_err(vq, "Failed to update used ring!\n");
+   return -EFAULT;
+   }
+
+   last_used_idx = 0;
+   total -= copied;
+   }
+
+   /* Only get avail ring entries after they have been exposed by guest. */
+   smp_rmb();

Barrier before return is a very confusing API. I guess it's designed to
be used in a specific way to make it necessary - but what is it?


Looks like a mistake; we need to do this after reading avail_idx.

Thanks





+   return ret;
+}
+EXPORT_SYMBOL(vhost_prefetch_desc_indices);
  
  static int __init vhost_init(void)

  {
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 39ff897..16c2cb6 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -228,6 +228,9 @@ ssize_t vhost_chr_read_iter(struct vhost_dev *dev, struct iov_iter *to,
  ssize_t vhost_chr_write_iter(struct vhost_dev *dev,
 struct iov_iter *from);
  int vhost_init_device_iotlb(struct vhost_dev *d, bool enabled);
+int vhost_prefetch_desc_indices(struct vhost_virtqueue *vq,
+   struct vring_used_elem *heads,
+   u16 num, bool used_update);
  
  #define vq_err(vq, fmt, ...) do {  \

pr_debug(pr_fmt(fmt), ##__VA_ARGS__);   \
--
2.7.4



Re: [PATCH v1 0/4] Enable LBR for the guest

2017-09-26 Thread Andi Kleen
> On the other side, it seems that the (guest) kernel driver also works
> without
> the above being supported, should we change it to report error and stop
> using the PMU features when the check of the above two fails (at
> intel_pmu_init())?

You could add the extra check for the LBR code yes, although it already checks
if the LBR MSRs work, so in practice it's likely already covered.

-Andi


Re: [PATCH v1 4/4] KVM/vmx: enable lbr for the guest

2017-09-26 Thread Andi Kleen
> 1) vCPU context switching and guest side task switching are not identical.
> That is, when the vCPU is scheduled out, the guest task on the vCPU may not

guest task lifetime has nothing to do with this. It's completely independent
of what you do here on the VCPU level.

> run out its time slice yet, so the task will continue to run when the vCPU is
> scheduled in by the host (lbr wasn't saved by the guest task when the vCPU is
> scheduled out in this case).
> 
> It is possible to have the vCPU which runs the guest task (in use of lbr)
> scheduled out, followed by a new host task being scheduled in on the pCPU to run.
> It is not guaranteed that the new host task does not use the LBR feature on
> the pCPU.

Sure it may use the LBR, and the normal perf context switch
will switch it and everything works fine.

It's like any other per-task LBR user.

> 
> 2) Sometimes, people may want this usage: "perf record -b
> ./qemu-system-x86_64 ...",
> which will need lbr to be used in KVM as well.

In this obscure case you can disable LBR support for the guest.
The common case is far more important.

It sounds like you didn't do any performance measurements.
I expect the performance of your current solution to be terrible.

e.g. a normal perf PMI does at least 1 MSR reads and 4+ MSR writes
for a single counter. With multiple counters it gets worse.

For each of those you'll need to exit. Adding something
to the entry/exit list is similar to the cost of doing 
explicit RD/WRMSRs.

On Skylake we have 32*3=96 MSRs for the LBRs.

So with the 5 exits and entries, you're essentially doing
5*2*96 = 960 extra MSR accesses for each PMI.

MSR access is 100+ cycles at least, for writes it is far more
expensive.

-Andi


Re: [PATCH net-next RFC 0/5] batched tx processing in vhost_net

2017-09-26 Thread Michael S. Tsirkin
On Fri, Sep 22, 2017 at 04:02:30PM +0800, Jason Wang wrote:
> Hi:
> 
> This series tries to implement basic tx batched processing. This is
> done by prefetching descriptor indices and update used ring in a
> batch. This intends to speed up used ring updating and improve the
> cache utilization.

Interesting, thanks for the patches. So IIUC most of the gain is really
overcoming some of the shortcomings of virtio 1.0 wrt cache utilization?

Which is fair enough (1.0 is already deployed) but I would like to avoid
making 1.1 support harder, and this patchset does this unfortunately,
see comments on individual patches. I'm sure it can be addressed though.

> Test shows about ~22% improvement in tx pps.

Is this with or without tx napi in guest?

> Please review.
> 
> Jason Wang (5):
>   vhost: split out ring head fetching logic
>   vhost: introduce helper to prefetch desc index
>   vhost: introduce vhost_add_used_idx()
>   vhost_net: rename VHOST_RX_BATCH to VHOST_NET_BATCH
>   vhost_net: basic tx virtqueue batched processing
> 
>  drivers/vhost/net.c   | 221 
> --
>  drivers/vhost/vhost.c | 165 +++--
>  drivers/vhost/vhost.h |   9 ++
>  3 files changed, 270 insertions(+), 125 deletions(-)
> 
> -- 
> 2.7.4


Re: [PATCH v1 0/4] Enable LBR for the guest

2017-09-26 Thread Wei Wang

On 09/25/2017 10:59 PM, Andi Kleen wrote:

On Mon, Sep 25, 2017 at 12:44:52PM +0800, Wei Wang wrote:

This patch series enables the Last Branch Recording feature for the
guest. Instead of trapping each LBR stack MSR access, the MSRs are
passthroughed to the guest. Those MSRs are switched (i.e. load and
saved) on VMExit and VMEntry.

Test:
Try "perf record -b ./test_program" on guest.

I don't see where you expose the PERF capabilities MSR?

That's normally needed for LBR too to report the version
number.



It was missed, thanks for pointing it out. I also found KVM/QEMU doesn't
expose CPUID.PDCM, will add that too.

Since for now we are enabling LBR, I plan to expose only "PERF_CAP & 0x3f"
to the guest, which reports the LBR format only.
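
For illustration, the masking could be as simple as the sketch below
(assuming bits 5:0 of IA32_PERF_CAPABILITIES hold the LBR format, and with
guest_perf_cap standing in for wherever KVM keeps the value reported to the
guest):

	u64 host_cap;

	rdmsrl(MSR_IA32_PERF_CAPABILITIES, host_cap);
	guest_perf_cap = host_cap & 0x3f;	/* expose only the LBR format bits */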

On the other side, it seems that the (guest) kernel driver also works
without the above being supported; should we change it to report an error
and stop using the PMU features when the check of the above two fails (at
intel_pmu_init())?



Best,
Wei


Re: [PATCH v1 4/4] KVM/vmx: enable lbr for the guest

2017-09-26 Thread Wei Wang

On 09/25/2017 10:57 PM, Andi Kleen wrote:

+static void auto_switch_lbr_msrs(struct vcpu_vmx *vmx)
+{
+   int i;
+   struct perf_lbr_stack lbr_stack;
+
+   perf_get_lbr_stack(&lbr_stack);
+
+   add_atomic_switch_msr(vmx, MSR_LBR_SELECT, 0, 0);
+   add_atomic_switch_msr(vmx, lbr_stack.lbr_tos, 0, 0);
+
+   for (i = 0; i < lbr_stack.lbr_nr; i++) {
+   add_atomic_switch_msr(vmx, lbr_stack.lbr_from + i, 0, 0);
+   add_atomic_switch_msr(vmx, lbr_stack.lbr_to + i, 0, 0);
+   if (lbr_stack.lbr_info)
+   add_atomic_switch_msr(vmx, lbr_stack.lbr_info + i, 0,
+ 0);
+   }

That will be really expensive and add a lot of overhead to every entry/exit.
perf can already context switch the LBRs on task context switch. With that
you can just switch LBR_SELECT, which is *much* cheaper because there
are far less context switches than exit/entries.

It implies that when KVM is running it needs to prevent perf from enabling
LBRs in the context of KVM, but that should be straight forward.
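
A minimal sketch of that approach, reusing add_atomic_switch_msr() from the
posted patch and leaving the LBR stack itself to perf's task context switch
(the 0/0 guest/host values are placeholders for the real saved values):

	static void auto_switch_lbr_select(struct vcpu_vmx *vmx)
	{
		/* only LBR_SELECT is switched on entry/exit; the from/to/info
		 * stack MSRs stay with the host perf context switch code */
		add_atomic_switch_msr(vmx, MSR_LBR_SELECT, 0, 0);
	}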


I kind of have a different thought here:

1) vCPU context switching and guest side task switching are not identical.
That is, when the vCPU is scheduled out, the guest task on the vCPU may not
run out its time slice yet, so the task will continue to run when the vCPU is
scheduled in by the host (lbr wasn't saved by the guest task when the vCPU is
scheduled out in this case).

It is possible to have the vCPU which runs the guest task (in use of lbr)
scheduled out, followed by a new host task being scheduled in on the pCPU to
run. It is not guaranteed that the new host task does not use the LBR feature
on the pCPU.

2) Sometimes, people may want this usage: "perf record -b
./qemu-system-x86_64 ...", which will need lbr to be used in KVM as well.


I think one possible optimization we could do would be to add the LBR
MSRs to auto switching when the guest requests to enable the feature, and
remove them when being disabled. This will need to trap guest access to
MSR_DEBUGCTL.
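
A hedged sketch of that idea: trap guest writes to IA32_DEBUGCTL and toggle
the auto-switch list on the LBR enable bit. clear_lbr_switch_msrs() does not
exist and would have to be written as the inverse of auto_switch_lbr_msrs()
from patch 4/4:

	/* hypothetical helper called from the MSR_IA32_DEBUGCTLMSR write handler */
	static void update_lbr_switch_msrs(struct vcpu_vmx *vmx, u64 debugctl)
	{
		if (debugctl & DEBUGCTLMSR_LBR)
			auto_switch_lbr_msrs(vmx);	/* add LBR MSRs to the switch list */
		else
			clear_lbr_switch_msrs(vmx);	/* hypothetical inverse helper */
	}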


Best,
Wei




Re: [PATCH v5 REPOST 1/6] hw_random: place mutex around read functions and buffers.

2017-09-26 Thread Pankaj Gupta

> 
> A bit late to a party, but:
> 
> On Mon, Dec 8, 2014 at 12:50 AM, Amos Kong  wrote:
> > From: Rusty Russell 
> >
> > There's currently a big lock around everything, and it means that we
> > can't query sysfs (eg /sys/devices/virtual/misc/hw_random/rng_current)
> > while the rng is reading.  This is a real problem when the rng is slow,
> > or blocked (eg. virtio_rng with qemu's default /dev/random backend)
> >
> > This doesn't help (it leaves the current lock untouched), just adds a
> > lock to protect the read function and the static buffers, in preparation
> > for transition.
> >
> > Signed-off-by: Rusty Russell 
> > ---
> ...
> >
> > @@ -160,13 +166,14 @@ static ssize_t rng_dev_read(struct file *filp, char __user *buf,
> > goto out_unlock;
> > }
> >
> > +   mutex_lock(&reading_mutex);
> 
> I think this breaks O_NONBLOCK: we have the hwrng core thread that
> constantly pumps the underlying rng for data; the thread takes the mutex
> and calls rng_get_data(), which blocks until the RNG responds. This means
> that even if the user specified O_NONBLOCK here, we'll be waiting until
> the [hwrng] thread releases reading_mutex before we can continue.

I think for 'virtio_rng' with 'O_NONBLOCK', 'rng_get_data' returns
without waiting for data, which can let the mutex be used by other
waiting threads, if any?

rng_dev_read
  rng_get_data
    virtio_read
static int virtio_read(struct hwrng *rng, void *buf, size_t size, bool wait)
{
int ret;
struct virtrng_info *vi = (struct virtrng_info *)rng->priv;

if (vi->hwrng_removed)
return -ENODEV;

if (!vi->busy) {
vi->busy = true;
init_completion(&vi->have_data);
register_buffer(vi, buf, size);
}

if (!wait)
return 0;

ret = wait_for_completion_killable(&vi->have_data);
if (ret < 0)
return ret;

vi->busy = false;

return vi->data_avail;
}

> 
> > if (!data_avail) {
> > bytes_read = rng_get_data(current_rng, rng_buffer,
> > rng_buffer_size(),
> > !(filp->f_flags & O_NONBLOCK));
> > if (bytes_read < 0) {
> > err = bytes_read;
> > -   goto out_unlock;
> > +   goto out_unlock_reading;
> > }
> > data_avail = bytes_read;
> > }
> 
> Thanks.
> 
> --
> Dmitry
> 


Re: [PATCH] KVM: s390: Disable CONFIG_S390_GUEST_OLD_TRANSPORT by default

2017-09-26 Thread Christian Borntraeger
On 09/25/2017 04:45 PM, Thomas Huth wrote:
> There is no recent user space application available anymore which still
> supports this old virtio transport, so let's disable this by default.
> 
> Signed-off-by: Thomas Huth 

Thanks, applied.
> ---
>  arch/s390/Kconfig | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
> index 48af970..923bf04 100644
> --- a/arch/s390/Kconfig
> +++ b/arch/s390/Kconfig
> @@ -930,7 +930,7 @@ config S390_GUEST
> the KVM hypervisor.
> 
>  config S390_GUEST_OLD_TRANSPORT
> - def_bool y
> + def_bool n
>   prompt "Guest support for old s390 virtio transport (DEPRECATED)"
>   depends on S390_GUEST
>   help
> 



Re: [PATCH] KVM: s390: Disable CONFIG_S390_GUEST_OLD_TRANSPORT by default

2017-09-26 Thread Heiko Carstens
On Tue, Sep 26, 2017 at 12:41:41PM +0200, Christian Borntraeger wrote:
> 
> 
> On 09/26/2017 12:40 PM, Heiko Carstens wrote:
> > On Mon, Sep 25, 2017 at 08:37:36PM +0200, Christian Borntraeger wrote:
> >>
> >> On 09/25/2017 07:54 PM, Halil Pasic wrote:
> >>>
> >>>
> >>> On 09/25/2017 04:45 PM, Thomas Huth wrote:
>  There is no recent user space application available anymore which still
>  supports this old virtio transport, so let's disable this by default.
> 
>  Signed-off-by: Thomas Huth 
> >>>
> >>> I don't have any objections, but there may be something I'm not aware of.
> >>> Let's see what Connie says. From my side it's ack.
> >>>
> >>> Via whom is this supposed to go in? Looking at the MAINTAINERS, I would
> >>> say Martin or Heiko but I don't see them among the recipients.
> >>
> >> FWIW as the original author of that transport
> >> Acked-by: Christian Borntraeger 
> >>
> >> I can pick this up for Martins/Heikos tree if you want.
> > 
> > When will this code be removed?
> > 
> > When the config option was initially added Conny said it should survive
> > "probably two kernel releases or so, depending on feedback".
> > It was merged for v4.8. Now we are five kernel releases later...
> 
> Lets wait for one release and then get rid of it?

So it's going to be removed with the next merge window.
Where is the patch? ;)



Re: [PATCH] KVM: s390: Disable CONFIG_S390_GUEST_OLD_TRANSPORT by default

2017-09-26 Thread Heiko Carstens
On Tue, Sep 26, 2017 at 12:57:26PM +0200, Thomas Huth wrote:
> On 26.09.2017 12:47, Heiko Carstens wrote:
> > So it's going to be removed with the next merge window.
> > Where is the patch? ;)
> 
> Hmm, so far the code was always enabled by default, so in the unlikely
> case that somebody is still using it, they might not have noticed that
> this is marked as deprecated. So maybe it's better to keep the code for
> two more released, but disable it by default, so that people have time
> to realize that it is going away? ... just my 0.02 € ... but if you
> prefer, I can also write a patch that removes it immediately instead.

Switching the default from yes to no won't prevent anybody from selecting
it again. So I don't see any value in waiting even longer.

Besides the original commit already printed a message to the console that
the transport is deprecated.
See commit 3b2fbb3f06ef ("virtio/s390: deprecate old transport").


Re: [PATCH] KVM: s390: Disable CONFIG_S390_GUEST_OLD_TRANSPORT by default

2017-09-26 Thread Heiko Carstens
On Mon, Sep 25, 2017 at 08:37:36PM +0200, Christian Borntraeger wrote:
> 
> On 09/25/2017 07:54 PM, Halil Pasic wrote:
> > 
> > 
> > On 09/25/2017 04:45 PM, Thomas Huth wrote:
> >> There is no recent user space application available anymore which still
> >> supports this old virtio transport, so let's disable this by default.
> >>
> >> Signed-off-by: Thomas Huth 
> > 
> > I don't have any objections, but there may be something I'm not aware of.
> > Let's see what Connie says. From my side it's ack.
> > 
> > Via whom is this supposed to go in? Looking at the MAINTAINERS, I would
> > say Martin or Heiko but I don't see them among the recipients.
> 
> FWIW as the original author of that transport
> Acked-by: Christian Borntraeger 
> 
> I can pick this up for Martins/Heikos tree if you want.

When will this code be removed?

When the config option was initially added Conny said it should survive
"probably two kernel releases or so, depending on feedback".
It was merged for v4.8. Now we are five kernel releases later...



Re: [PATCH] KVM: s390: Disable CONFIG_S390_GUEST_OLD_TRANSPORT by default

2017-09-26 Thread Christian Borntraeger


On 09/26/2017 12:40 PM, Heiko Carstens wrote:
> On Mon, Sep 25, 2017 at 08:37:36PM +0200, Christian Borntraeger wrote:
>>
>> On 09/25/2017 07:54 PM, Halil Pasic wrote:
>>>
>>>
>>> On 09/25/2017 04:45 PM, Thomas Huth wrote:
 There is no recent user space application available anymore which still
 supports this old virtio transport, so let's disable this by default.

 Signed-off-by: Thomas Huth 
>>>
>>> I don't have any objections, but there may be something I'm not aware of.
>>> Let's see what Connie says. From my side it's ack.
>>>
>>> Via whom is this supposed to go in? Looking at the MAINTAINERS, I would
>>> say Martin or Heiko but I don't see them among the recipients.
>>
>> FWIW as the original author of that transport
>> Acked-by: Christian Borntraeger 
>>
>> I can pick this up for Martins/Heikos tree if you want.
> 
> When will this code be removed?
> 
> When the config option was initially added Conny said it should survive
> "probably two kernel releases or so, depending on feedback".
> It was merged for v4.8. Now we are five kernel releases later...

Lets wait for one release and then get rid of it?
