Re: [PATCH RFC] vhost: basic device IOTLB support

2016-01-05 Thread Jason Wang


On 01/05/2016 11:18 AM, Yang Zhang wrote:
> On 2016/1/4 14:22, Jason Wang wrote:
>>
>>
>> On 01/04/2016 09:39 AM, Yang Zhang wrote:
>>> On 2015/12/31 15:13, Jason Wang wrote:
>>>> This patch tries to implement a device IOTLB for vhost. This could be
>>>> used in co-operation with a userspace (qemu) implementation of an
>>>> iommu for a secure DMA environment in the guest.
>>>>
>>>> The idea is simple. When vhost meets an IOTLB miss, it will request
>>>> the assistance of userspace to do the translation. This is done
>>>> through:
>>>>
>>>> - Fill the translation request at a preset userspace address (this
>>>> address is set through the VHOST_SET_IOTLB_REQUEST_ENTRY ioctl).
>>>> - Notify userspace through an eventfd (this eventfd is set through the
>>>> VHOST_SET_IOTLB_FD ioctl).
>>>>
>>>> When userspace finishes the translation, it will update the vhost
>>>> IOTLB through the VHOST_UPDATE_IOTLB ioctl. Userspace is also in
>>>> charge of snooping the IOMMU IOTLB invalidations and using
>>>> VHOST_UPDATE_IOTLB to invalidate the corresponding entries in vhost.
>>>
>>> Is there any performance data showing the difference with IOTLB
>>> support?
>>
>> Basic testing shows it is slower than without the IOTLB.
>>
>>> I suspect we may see a performance decrease since the flush code path
>>> is longer than before.
>>>
>>
>> Yes, it also depends on the TLB hit rate.
>>
>> If lots of dynamic mappings and unmappings are used in the guest (e.g. a
>> normal Linux driver), this method should be much slower since:
>>
>> - there are lots of invalidations, and the invalidation path is slow.
>> - the hit rate is low, and userspace-assisted address translation is
>> expensive.
>> - the userspace IOMMU/IOTLB implementation is limited (qemu's vtd
>> emulation simply empties all entries when it's full).
>>
>> Another method is to implement a kernel IOMMU (e.g. vtd). But I'm not
>> sure vhost is the best place to do this, since vhost should be
>> architecture independent. Maybe we'd better do it in kvm or have a pv
>> IOMMU implementation in vhost.
>
> Actually, I have on hand a kernel IOMMU (virtual vtd) patch which can
> pass a physical device through to an L2 guest.

A little bit confused; I believe the first step is to export an IOMMU
to the L1 guest, for it to use with an assigned device?

> But it is just a draft patch which was written several years ago. If
> there is a real requirement for it, I can rebase it and send it out
> for review.

Interesting, but I think the goal is different. This patch tries to make
vhost/virtio work with an emulated IOMMU.

>
>>
>> On the other hand, if fixed mappings are used in the guest (e.g. dpdk
>> in the guest), we could have a 100% hit rate with almost no
>> invalidation; the performance penalty should be negligible. This should
>> be the main use case for this patch.
>>
>> The patch is just a prototype for discussion. Any other ideas are
>> welcome.
>>
>> Thanks
>>
>
>

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] vhost: basic device IOTLB support

2016-01-03 Thread Jason Wang


On 01/04/2016 09:39 AM, Yang Zhang wrote:
> On 2015/12/31 15:13, Jason Wang wrote:
>> This patch tries to implement a device IOTLB for vhost. This could be
>> used in co-operation with a userspace (qemu) implementation of an
>> iommu for a secure DMA environment in the guest.
>>
>> The idea is simple. When vhost meets an IOTLB miss, it will request
>> the assistance of userspace to do the translation. This is done
>> through:
>>
>> - Fill the translation request at a preset userspace address (this
>>    address is set through the VHOST_SET_IOTLB_REQUEST_ENTRY ioctl).
>> - Notify userspace through an eventfd (this eventfd is set through the
>>    VHOST_SET_IOTLB_FD ioctl).
>>
>> When userspace finishes the translation, it will update the vhost
>> IOTLB through the VHOST_UPDATE_IOTLB ioctl. Userspace is also in charge
>> of snooping the IOMMU IOTLB invalidations and using
>> VHOST_UPDATE_IOTLB to invalidate the corresponding entries in vhost.
>
> Is there any performance data showing the difference with IOTLB support?

Basic testing shows it is slower than without the IOTLB.

> I suspect we may see a performance decrease since the flush code path is
> longer than before.
>

Yes, it also depends on the TLB hit rate.

If lots of dynamic mappings and unmappings are used in the guest (e.g. a
normal Linux driver), this method should be much slower since:

- there are lots of invalidations, and the invalidation path is slow.
- the hit rate is low, and userspace-assisted address translation is
expensive.
- the userspace IOMMU/IOTLB implementation is limited (qemu's vtd
emulation simply empties all entries when it's full).

Another method is to implement a kernel IOMMU (e.g. vtd). But I'm not sure
vhost is the best place to do this, since vhost should be architecture
independent. Maybe we'd better do it in kvm or have a pv IOMMU
implementation in vhost.

On the other hand, if fixed mappings are used in the guest (e.g. dpdk in
the guest), we could have a 100% hit rate with almost no invalidation;
the performance penalty should be negligible. This should be the main
use case for this patch.

The patch is just a prototype for discussion. Any other ideas are welcome.

Thanks



Re: [PATCH RFC] vhost: basic device IOTLB support

2016-01-03 Thread Jason Wang


On 12/31/2015 07:17 PM, Michael S. Tsirkin wrote:
> On Thu, Dec 31, 2015 at 03:13:45PM +0800, Jason Wang wrote:
>> This patch tries to implement a device IOTLB for vhost. This could be
>> used in co-operation with a userspace (qemu) implementation of an
>> iommu for a secure DMA environment in the guest.
>>
>> The idea is simple. When vhost meets an IOTLB miss, it will request
>> the assistance of userspace to do the translation. This is done
>> through:
>>
>> - Fill the translation request at a preset userspace address (this
>>   address is set through the VHOST_SET_IOTLB_REQUEST_ENTRY ioctl).
>> - Notify userspace through an eventfd (this eventfd is set through the
>>   VHOST_SET_IOTLB_FD ioctl).
>>
>> When userspace finishes the translation, it will update the vhost
>> IOTLB through the VHOST_UPDATE_IOTLB ioctl. Userspace is also in charge
>> of snooping the IOMMU IOTLB invalidations and using
>> VHOST_UPDATE_IOTLB to invalidate the corresponding entries in vhost.
>>
>> For simplicity, the IOTLB was implemented as a simple hash array. The
>> index is calculated from the IOVA page frame number, which only works
>> at PAGE_SIZE granularity.
>>
>> A qemu implementation (for reference) is available at:
>> g...@github.com:jasowang/qemu.git iommu
>>
>> TODO & Known issues:
>>
>> - read/write permission validation is not implemented.
>> - no feature negotiation.
>> - VHOST_SET_MEM_TABLE is not reused (maybe there's a chance).
>> - works at PAGE_SIZE level; large mappings are not supported.
>> - a better data structure for the IOTLB instead of a simple hash array.
>> - a better API, e.g. using mmap() instead of a preset userspace address.
>>
>> Signed-off-by: Jason Wang <jasow...@redhat.com>
> Interesting. I'm working on a slightly different approach
> which is direct vt-d support in vhost.

I've considered this approach. It may have advantages, but the issue here
is that vt-d emulation is still incomplete in qemu, and I believe we don't
want to duplicate the code in both vhost-kernel and vhost-user?

> This one has the advantage of being more portable.

Right, the patch tries to be architecture independent.

>
>> ---
>>  drivers/vhost/net.c|   2 +-
>>  drivers/vhost/vhost.c  | 190 -
>>  drivers/vhost/vhost.h  |  13 
>>  include/uapi/linux/vhost.h |  26 +++
>>  4 files changed, 229 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>> index 9eda69e..a172be9 100644
>> --- a/drivers/vhost/net.c
>> +++ b/drivers/vhost/net.c
>> @@ -1083,7 +1083,7 @@ static long vhost_net_ioctl(struct file *f, unsigned int ioctl,
>>  r = vhost_dev_ioctl(&n->dev, ioctl, argp);
>>  if (r == -ENOIOCTLCMD)
>>  r = vhost_vring_ioctl(&n->dev, ioctl, argp);
>> -else
>> +else if (ioctl != VHOST_UPDATE_IOTLB)
>>  vhost_net_flush(n);
>>  mutex_unlock(&n->dev.mutex);
>>  return r;
>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>> index eec2f11..729fe05 100644
>> --- a/drivers/vhost/vhost.c
>> +++ b/drivers/vhost/vhost.c
>> @@ -113,6 +113,11 @@ static void vhost_init_is_le(struct vhost_virtqueue *vq)
>>  }
>>  #endif /* CONFIG_VHOST_CROSS_ENDIAN_LEGACY */
>>  
>> +static inline int vhost_iotlb_hash(u64 iova)
>> +{
>> +return (iova >> PAGE_SHIFT) & (VHOST_IOTLB_SIZE - 1);
>> +}
>> +
>>  static void vhost_poll_func(struct file *file, wait_queue_head_t *wqh,
>>  poll_table *pt)
>>  {
>> @@ -384,8 +389,14 @@ void vhost_dev_init(struct vhost_dev *dev,
>>  dev->memory = NULL;
>>  dev->mm = NULL;
>> spin_lock_init(&dev->work_lock);
>> +spin_lock_init(&dev->iotlb_lock);
>> +mutex_init(&dev->iotlb_req_mutex);
>> INIT_LIST_HEAD(&dev->work_list);
>>  dev->worker = NULL;
>> +dev->iotlb_request = NULL;
>> +dev->iotlb_ctx = NULL;
>> +dev->iotlb_file = NULL;
>> +dev->pending_request.flags.type = VHOST_IOTLB_INVALIDATE;
>>  
>>  for (i = 0; i < dev->nvqs; ++i) {
>>  vq = dev->vqs[i];
>> @@ -393,12 +404,17 @@ void vhost_dev_init(struct vhost_dev *dev,
>>  vq->indirect = NULL;
>>  vq->heads = NULL;
>>  vq->dev = dev;
>> +vq->iotlb_request = NULL;
>> mutex_init(&vq->mutex);
>> 

Re: [RFC v4 0/5] Add virtio transport for AF_VSOCK

2016-01-03 Thread Jason Wang


On 12/22/2015 05:07 PM, Stefan Hajnoczi wrote:
> This series is based on v4.4-rc2 and the "virtio: make find_vqs()
> checkpatch.pl-friendly" patch I recently submitted.
>
> v4:
>  * Addressed code review comments from Alex Bennee
>  * MAINTAINERS file entries for new files
>  * Trace events instead of pr_debug()
>  * RST packet is sent when there is no listen socket
>  * Allow guest->host connections again (began discussing netfilter support
>    with Matt Benjamin instead of hard-coding security policy in
>    virtio-vsock code)
>  * Many checkpatch.pl cleanups (will be 100% clean in v5)
>
> v3:
>  * Remove unnecessary 3-way handshake, just do REQUEST/RESPONSE instead
>of REQUEST/RESPONSE/ACK
>  * Remove SOCK_DGRAM support and focus on SOCK_STREAM first
>(also drop v2 Patch 1, it's only needed for SOCK_DGRAM)
>  * Only allow host->guest connections (same security model as latest
>VMware)
>  * Don't put vhost vsock driver into staging
>  * Add missing Kconfig dependencies (Arnd Bergmann )
>  * Remove unneeded variable used to store return value
>    (Fengguang Wu and Julia Lawall)
>
> v2:
>  * Rebased onto Linux v4.4-rc2
>  * vhost: Refuse to assign reserved CIDs
>  * vhost: Refuse guest CID if already in use
>  * vhost: Only accept correctly addressed packets (no spoofing!)
>  * vhost: Support flexible rx/tx descriptor layout
>  * vhost: Add missing total_tx_buf decrement
>  * virtio_transport: Fix total_tx_buf accounting
>  * virtio_transport: Add virtio_transport global mutex to prevent races
>  * common: Notify other side of SOCK_STREAM disconnect (fixes shutdown
>semantics)
>  * common: Avoid recursive mutex_lock(tx_lock) for write_space (fixes
>    deadlock)
>  * common: Define VIRTIO_VSOCK_TYPE_STREAM/DGRAM hardware interface constants
>  * common: Define VIRTIO_VSOCK_SHUTDOWN_RCV/SEND hardware interface constants
>  * common: Fix peer_buf_alloc inheritance on child socket
>
> This patch series adds a virtio transport for AF_VSOCK (net/vmw_vsock/).
> AF_VSOCK is designed for communication between virtual machines and
> hypervisors.  It is currently only implemented for VMware's VMCI transport.
>
> This series implements the proposed virtio-vsock device specification from
> here:
> http://permalink.gmane.org/gmane.comp.emulators.virtio.devel/980
>
> Most of the work was done by Asias He and Gerd Hoffmann a while back.  I have
> picked up the series again.
>
> The QEMU userspace changes are here:
> https://github.com/stefanha/qemu/commits/vsock
>
> Why virtio-vsock?
> -
> Guest<->host communication is currently done over the virtio-serial device.
> This makes it hard to port sockets API-based applications and is limited to
> static ports.
>
> virtio-vsock uses the sockets API so that applications can rely on familiar
> SOCK_STREAM semantics.  Applications on the host can easily connect to guest
> agents because the sockets API allows multiple connections to a listen socket
> (unlike virtio-serial).  This simplifies the guest<->host communication and
> eliminates the need for extra processes on the host to arbitrate virtio-serial
> ports.
>
> Overview
> 
> This series adds 3 pieces:
>
> 1. virtio_transport_common.ko - core virtio vsock code that uses vsock.ko
>
> 2. virtio_transport.ko - guest driver
>
> 3. drivers/vhost/vsock.ko - host driver

Have a (maybe dumb) question after a quick glance at the code:

Is there any chance to reuse the existing virtio-net/vhost-net code? For
example, using virtio-net instead of a new device as the transport in the
guest, and using vhost-net (especially considering it uses a socket as
its backend) in the host. Maybe just a new virtio-net header type for
vsock. I'm asking since I don't see any blocker for doing this.

Thanks

> Howto
> -
> The following kernel options are needed:
>   CONFIG_VSOCKETS=y
>   CONFIG_VIRTIO_VSOCKETS=y
>   CONFIG_VIRTIO_VSOCKETS_COMMON=y
>   CONFIG_VHOST_VSOCK=m
>
> Launch QEMU as follows:
>   # qemu ... -device vhost-vsock-pci,id=vhost-vsock-pci0,guest-cid=3
>
> Guest and host can communicate via AF_VSOCK sockets.  The host's CID (address)
> is 2 and the guest must be assigned a CID (3 in the example above).
>
> Status
> --
> This patch series implements the latest draft specification.  Please review.
>
> Asias He (4):
>   VSOCK: Introduce virtio_vsock_common.ko
>   VSOCK: Introduce virtio_transport.ko
>   VSOCK: Introduce vhost_vsock.ko
>   VSOCK: Add Makefile and Kconfig
>
> Stefan Hajnoczi (1):
>   VSOCK: transport-specific vsock_transport functions
>
>  MAINTAINERS|  13 +
>  drivers/vhost/Kconfig  |  15 +
>  drivers/vhost/Makefile |   4 +
>  drivers/vhost/vsock.c  | 607 +++
>  drivers/vhost/vsock.h  |   4 +
>  include/linux/virtio_vsock.h   | 167 +
>  

[PATCH RFC] vhost: basic device IOTLB support

2015-12-30 Thread Jason Wang
This patch tries to implement a device IOTLB for vhost. This could be
used in co-operation with a userspace (qemu) implementation of an
iommu for a secure DMA environment in the guest.

The idea is simple. When vhost meets an IOTLB miss, it will request
the assistance of userspace to do the translation. This is done
through:

- Fill the translation request at a preset userspace address (this
  address is set through the VHOST_SET_IOTLB_REQUEST_ENTRY ioctl).
- Notify userspace through an eventfd (this eventfd is set through the
  VHOST_SET_IOTLB_FD ioctl).

When userspace finishes the translation, it will update the vhost
IOTLB through the VHOST_UPDATE_IOTLB ioctl. Userspace is also in charge
of snooping the IOMMU IOTLB invalidations and using
VHOST_UPDATE_IOTLB to invalidate the corresponding entries in vhost.

For simplicity, the IOTLB was implemented as a simple hash array. The
index is calculated from the IOVA page frame number, which only works
at PAGE_SIZE granularity.

A qemu implementation (for reference) is available at:
g...@github.com:jasowang/qemu.git iommu

TODO & Known issues:

- read/write permission validation is not implemented.
- no feature negotiation.
- VHOST_SET_MEM_TABLE is not reused (maybe there's a chance).
- works at PAGE_SIZE level; large mappings are not supported.
- a better data structure for the IOTLB instead of a simple hash array.
- a better API, e.g. using mmap() instead of a preset userspace address.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/vhost/net.c|   2 +-
 drivers/vhost/vhost.c  | 190 -
 drivers/vhost/vhost.h  |  13 
 include/uapi/linux/vhost.h |  26 +++
 4 files changed, 229 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9eda69e..a172be9 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -1083,7 +1083,7 @@ static long vhost_net_ioctl(struct file *f, unsigned int ioctl,
 r = vhost_dev_ioctl(&n->dev, ioctl, argp);
 if (r == -ENOIOCTLCMD)
 r = vhost_vring_ioctl(&n->dev, ioctl, argp);
-   else
+   else if (ioctl != VHOST_UPDATE_IOTLB)
 vhost_net_flush(n);
 mutex_unlock(&n->dev.mutex);
 return r;
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index eec2f11..729fe05 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -113,6 +113,11 @@ static void vhost_init_is_le(struct vhost_virtqueue *vq)
 }
 #endif /* CONFIG_VHOST_CROSS_ENDIAN_LEGACY */
 
+static inline int vhost_iotlb_hash(u64 iova)
+{
+   return (iova >> PAGE_SHIFT) & (VHOST_IOTLB_SIZE - 1);
+}
+
 static void vhost_poll_func(struct file *file, wait_queue_head_t *wqh,
poll_table *pt)
 {
@@ -384,8 +389,14 @@ void vhost_dev_init(struct vhost_dev *dev,
dev->memory = NULL;
dev->mm = NULL;
spin_lock_init(&dev->work_lock);
+   spin_lock_init(&dev->iotlb_lock);
+   mutex_init(&dev->iotlb_req_mutex);
INIT_LIST_HEAD(&dev->work_list);
dev->worker = NULL;
+   dev->iotlb_request = NULL;
+   dev->iotlb_ctx = NULL;
+   dev->iotlb_file = NULL;
+   dev->pending_request.flags.type = VHOST_IOTLB_INVALIDATE;
 
for (i = 0; i < dev->nvqs; ++i) {
vq = dev->vqs[i];
@@ -393,12 +404,17 @@ void vhost_dev_init(struct vhost_dev *dev,
vq->indirect = NULL;
vq->heads = NULL;
vq->dev = dev;
+   vq->iotlb_request = NULL;
mutex_init(&vq->mutex);
vhost_vq_reset(dev, vq);
if (vq->handle_kick)
vhost_poll_init(&vq->poll, vq->handle_kick,
POLLIN, dev);
}
+
+   init_completion(&dev->iotlb_completion);
+   for (i = 0; i < VHOST_IOTLB_SIZE; i++)
+   dev->iotlb[i].flags.valid = VHOST_IOTLB_INVALID;
 }
 EXPORT_SYMBOL_GPL(vhost_dev_init);
 
@@ -940,9 +956,10 @@ long vhost_dev_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *argp)
 {
struct file *eventfp, *filep = NULL;
struct eventfd_ctx *ctx = NULL;
+   struct vhost_iotlb_entry entry;
u64 p;
long r;
-   int i, fd;
+   int index, i, fd;
 
/* If you are not the owner, you can become one */
if (ioctl == VHOST_SET_OWNER) {
@@ -1008,6 +1025,80 @@ long vhost_dev_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *argp)
if (filep)
fput(filep);
break;
+   case VHOST_SET_IOTLB_FD:
+   r = get_user(fd, (int __user *)argp);
+   if (r < 0)
+   break;
+   eventfp = fd == -1 ? NULL : eventfd_fget(fd);
+   if (IS_ERR(eventfp)) {
+   r = PTR_

Re: [PATCH net-next 3/3] vhost_net: basic polling support

2015-12-03 Thread Jason Wang


On 12/02/2015 08:36 PM, Michael S. Tsirkin wrote:
> On Wed, Dec 02, 2015 at 01:04:03PM +0800, Jason Wang wrote:
>>
>> On 12/01/2015 10:43 PM, Michael S. Tsirkin wrote:
>>> On Tue, Dec 01, 2015 at 01:17:49PM +0800, Jason Wang wrote:
>>>> On 11/30/2015 06:44 PM, Michael S. Tsirkin wrote:
>>>>> On Wed, Nov 25, 2015 at 03:11:29PM +0800, Jason Wang wrote:
>>>>>>> This patch tries to poll for newly added tx buffers or the socket
>>>>>>> receive queue for a while at the end of tx/rx processing. The maximum
>>>>>>> time spent on polling is specified through a new kind of vring ioctl.
>>>>>>>
>>>>>>> Signed-off-by: Jason Wang <jasow...@redhat.com>
>>>>> One further enhancement would be to actually poll
>>>>> the underlying device. This should be reasonably
>>>>> straight-forward with macvtap (especially in the
>>>>> passthrough mode).
>>>>>
>>>>>
>>>> Yes, it is. I have some patches to do this by replacing
>>>> skb_queue_empty() with sk_busy_loop() but for tap.
>>> We probably don't want to do this unconditionally, though.
>>>
>>>> Tests do not show
>>>> any improvement, but some regression.
>>> Did you add code to call sk_mark_napi_id on tap then?
>>> sk_busy_loop won't do anything useful without.
>> Yes I did. Probably something wrong elsewhere.
> Is this for guest-to-guest?

Nope. Like you said below, since it requires NAPI, it was external
host to guest.

>  the patch to do napi
> for tap is still not upstream due to minor performance
> regression.  Want me to repost it?

Sure, I've played with this a little bit in the past too.

>
>>>>  Maybe it's better to test macvtap.
>>> Same thing ...
>>>



Re: [PATCH net-next 3/3] vhost_net: basic polling support

2015-12-01 Thread Jason Wang


On 12/01/2015 10:43 PM, Michael S. Tsirkin wrote:
> On Tue, Dec 01, 2015 at 01:17:49PM +0800, Jason Wang wrote:
>>
>> On 11/30/2015 06:44 PM, Michael S. Tsirkin wrote:
>>> On Wed, Nov 25, 2015 at 03:11:29PM +0800, Jason Wang wrote:
>>>>> This patch tries to poll for newly added tx buffers or the socket
>>>>> receive queue for a while at the end of tx/rx processing. The maximum
>>>>> time spent on polling is specified through a new kind of vring ioctl.
>>>>>
>>>>> Signed-off-by: Jason Wang <jasow...@redhat.com>
>>> One further enhancement would be to actually poll
>>> the underlying device. This should be reasonably
>>> straight-forward with macvtap (especially in the
>>> passthrough mode).
>>>
>>>
>> Yes, it is. I have some patches to do this by replacing
>> skb_queue_empty() with sk_busy_loop() but for tap.
> We probably don't want to do this unconditionally, though.
>
>> Tests do not show
>> any improvement, but some regression.
> Did you add code to call sk_mark_napi_id on tap then?
> sk_busy_loop won't do anything useful without.

Yes I did. Probably something wrong elsewhere.

>
>>  Maybe it's better to test macvtap.
> Same thing ...
>



Re: [PATCH net-next 3/3] vhost_net: basic polling support

2015-11-30 Thread Jason Wang


On 11/30/2015 06:44 PM, Michael S. Tsirkin wrote:
> On Wed, Nov 25, 2015 at 03:11:29PM +0800, Jason Wang wrote:
>> > This patch tries to poll for newly added tx buffers or the socket
>> > receive queue for a while at the end of tx/rx processing. The maximum
>> > time spent on polling is specified through a new kind of vring ioctl.
>> > 
>> > Signed-off-by: Jason Wang <jasow...@redhat.com>
> One further enhancement would be to actually poll
> the underlying device. This should be reasonably
> straight-forward with macvtap (especially in the
> passthrough mode).
>
>

Yes, it is. I have some patches to do this by replacing
skb_queue_empty() with sk_busy_loop(), but for tap. Tests do not show
any improvement, but some regression. Maybe it's better to test macvtap.


Re: [PATCH net-next 2/3] vhost: introduce vhost_vq_more_avail()

2015-11-30 Thread Jason Wang


On 11/30/2015 04:22 PM, Michael S. Tsirkin wrote:
> On Wed, Nov 25, 2015 at 03:11:28PM +0800, Jason Wang wrote:
>> Signed-off-by: Jason Wang <jasow...@redhat.com>
>> ---
>>  drivers/vhost/vhost.c | 26 +-
>>  drivers/vhost/vhost.h |  1 +
>>  2 files changed, 18 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>> index 163b365..b86c5aa 100644
>> --- a/drivers/vhost/vhost.c
>> +++ b/drivers/vhost/vhost.c
>> @@ -1633,10 +1633,25 @@ void vhost_add_used_and_signal_n(struct vhost_dev *dev,
>>  }
>>  EXPORT_SYMBOL_GPL(vhost_add_used_and_signal_n);
>>  
>> +bool vhost_vq_more_avail(struct vhost_dev *dev, struct vhost_virtqueue *vq)
>> +{
>> +__virtio16 avail_idx;
>> +int r;
>> +
>> +r = __get_user(avail_idx, &vq->avail->idx);
>> +if (r) {
>> +vq_err(vq, "Failed to check avail idx at %p: %d\n",
>> +   &vq->avail->idx, r);
>> +return false;
> In patch 3 you are calling this under preempt disable,
> so this actually can fail and it isn't a VQ error.
>

Yes.

>> +}
>> +
>> +return vhost16_to_cpu(vq, avail_idx) != vq->avail_idx;
>> +}
>> +EXPORT_SYMBOL_GPL(vhost_vq_more_avail);
>> +
>>  /* OK, now we need to know about added descriptors. */
>>  bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
>>  {
>> -__virtio16 avail_idx;
>>  int r;
>>  
>>  if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY))
>> @@ -1660,14 +1675,7 @@ bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
>>  /* They could have slipped one in as we were doing that: make
>>   * sure it's written, then check again. */
>>  smp_mb();
>> -r = __get_user(avail_idx, &vq->avail->idx);
>> -if (r) {
>> -vq_err(vq, "Failed to check avail idx at %p: %d\n",
>> -   &vq->avail->idx, r);
>> -return false;
>> -}
>> -
>> -return vhost16_to_cpu(vq, avail_idx) != vq->avail_idx;
>> +return vhost_vq_more_avail(dev, vq);
>>  }
>>  EXPORT_SYMBOL_GPL(vhost_enable_notify);
>>  
> This path does need an error though.
> It's probably easier to just leave this call site alone.

Ok, will leave this function as is and remove the vq_err() in
vhost_vq_more_avail().

Thanks

>
>> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
>> index 43284ad..2f3c57c 100644
>> --- a/drivers/vhost/vhost.h
>> +++ b/drivers/vhost/vhost.h
>> @@ -159,6 +159,7 @@ void vhost_add_used_and_signal_n(struct vhost_dev *, struct vhost_virtqueue *,
>> struct vring_used_elem *heads, unsigned count);
>>  void vhost_signal(struct vhost_dev *, struct vhost_virtqueue *);
>>  void vhost_disable_notify(struct vhost_dev *, struct vhost_virtqueue *);
>> +bool vhost_vq_more_avail(struct vhost_dev *, struct vhost_virtqueue *);
>>  bool vhost_enable_notify(struct vhost_dev *, struct vhost_virtqueue *);
>>  
>>  int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
>> -- 
>> 2.5.0



[PATCH V2 0/3] basic busy polling support for vhost_net

2015-11-30 Thread Jason Wang

Jason Wang (3):
  vhost: introduce vhost_has_work()
  vhost: introduce vhost_vq_more_avail()
  vhost_net: basic polling support

 drivers/vhost/net.c| 72 ++
 drivers/vhost/vhost.c  | 35 ++
 drivers/vhost/vhost.h  |  3 ++
 include/uapi/linux/vhost.h | 11 +++
 4 files changed, 116 insertions(+), 5 deletions(-)

-- 
2.5.0



[PATCH V2 3/3] vhost_net: basic polling support

2015-11-30 Thread Jason Wang
This patch tries to poll for newly added tx buffers or the socket receive
queue for a while at the end of tx/rx processing. The maximum time
spent on polling is specified through a new kind of vring ioctl.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/vhost/net.c| 72 ++
 drivers/vhost/vhost.c  | 15 ++
 drivers/vhost/vhost.h  |  1 +
 include/uapi/linux/vhost.h | 11 +++
 4 files changed, 94 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9eda69e..ce6da77 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -287,6 +287,41 @@ static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success)
rcu_read_unlock_bh();
 }
 
+static inline unsigned long busy_clock(void)
+{
+   return local_clock() >> 10;
+}
+
+static bool vhost_can_busy_poll(struct vhost_dev *dev,
+   unsigned long endtime)
+{
+   return likely(!need_resched()) &&
+  likely(!time_after(busy_clock(), endtime)) &&
+  likely(!signal_pending(current)) &&
+  !vhost_has_work(dev) &&
+  single_task_running();
+}
+
+static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
+   struct vhost_virtqueue *vq,
+   struct iovec iov[], unsigned int iov_size,
+   unsigned int *out_num, unsigned int *in_num)
+{
+   unsigned long uninitialized_var(endtime);
+
+   if (vq->busyloop_timeout) {
+   preempt_disable();
+   endtime = busy_clock() + vq->busyloop_timeout;
+   while (vhost_can_busy_poll(vq->dev, endtime) &&
+  !vhost_vq_more_avail(vq->dev, vq))
+   cpu_relax();
+   preempt_enable();
+   }
+
+   return vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
+out_num, in_num, NULL, NULL);
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_tx(struct vhost_net *net)
@@ -331,10 +366,9 @@ static void handle_tx(struct vhost_net *net)
  % UIO_MAXIOV == nvq->done_idx))
break;
 
-   head = vhost_get_vq_desc(vq, vq->iov,
-ARRAY_SIZE(vq->iov),
-&out, &in,
-NULL, NULL);
+   head = vhost_net_tx_get_vq_desc(net, vq, vq->iov,
+   ARRAY_SIZE(vq->iov),
+   &out, &in);
/* On error, stop handling until the next kick. */
if (unlikely(head < 0))
break;
@@ -435,6 +469,34 @@ static int peek_head_len(struct sock *sk)
return len;
 }
 
+static int vhost_net_peek_head_len(struct vhost_net *net, struct sock *sk)
+{
+   struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
+   struct vhost_virtqueue *vq = &nvq->vq;
+   unsigned long uninitialized_var(endtime);
+
+   if (vq->busyloop_timeout) {
+   mutex_lock(&vq->mutex);
+   vhost_disable_notify(&net->dev, vq);
+
+   preempt_disable();
+   endtime = busy_clock() + vq->busyloop_timeout;
+
+   while (vhost_can_busy_poll(&net->dev, endtime) &&
+  skb_queue_empty(&sk->sk_receive_queue) &&
+  !vhost_vq_more_avail(&net->dev, vq))
+   cpu_relax();
+
+   preempt_enable();
+
+   if (vhost_enable_notify(&net->dev, vq))
+   vhost_poll_queue(&vq->poll);
+   mutex_unlock(&vq->mutex);
+   }
+
+   return peek_head_len(sk);
+}
+
 /* This is a multi-buffer version of vhost_get_desc, that works if
  * vq has read descriptors only.
  * @vq - the relevant virtqueue
@@ -553,7 +615,7 @@ static void handle_rx(struct vhost_net *net)
vq->log : NULL;
mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF);
 
-   while ((sock_len = peek_head_len(sock->sk))) {
+   while ((sock_len = vhost_net_peek_head_len(net, sock->sk))) {
sock_len += sock_hlen;
vhost_len = sock_len + vhost_hlen;
headcount = get_rx_bufs(vq, vq->heads, vhost_len,
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 4f45a03..b8ca873 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -285,6 +285,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
vq->memory = NULL;
vq->is_le = virtio_legacy_is_little_endian();
vhost_vq_reset_user_be(vq);
+   vq->busyloop_tim

[PATCH V2 1/3] vhost: introduce vhost_has_work()

2015-11-30 Thread Jason Wang
This patch introduces a helper which can give a hint about whether or
not there's work queued in the work list.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/vhost/vhost.c | 7 +++
 drivers/vhost/vhost.h | 1 +
 2 files changed, 8 insertions(+)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index eec2f11..163b365 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -245,6 +245,13 @@ void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
 }
 EXPORT_SYMBOL_GPL(vhost_work_queue);
 
+/* A lockless hint for busy polling code to exit the loop */
+bool vhost_has_work(struct vhost_dev *dev)
+{
+   return !list_empty(&dev->work_list);
+}
+EXPORT_SYMBOL_GPL(vhost_has_work);
+
 void vhost_poll_queue(struct vhost_poll *poll)
 {
vhost_work_queue(poll->dev, &poll->work);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index d3f7674..43284ad 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -37,6 +37,7 @@ struct vhost_poll {
 
 void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
 void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
+bool vhost_has_work(struct vhost_dev *dev);
 
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
 unsigned long mask, struct vhost_dev *dev);
-- 
2.5.0

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH V2 2/3] vhost: introduce vhost_vq_more_avail()

2015-11-30 Thread Jason Wang
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/vhost/vhost.c | 13 +
 drivers/vhost/vhost.h |  1 +
 2 files changed, 14 insertions(+)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 163b365..4f45a03 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1633,6 +1633,19 @@ void vhost_add_used_and_signal_n(struct vhost_dev *dev,
 }
 EXPORT_SYMBOL_GPL(vhost_add_used_and_signal_n);
 
+bool vhost_vq_more_avail(struct vhost_dev *dev, struct vhost_virtqueue *vq)
+{
+   __virtio16 avail_idx;
+   int r;
+
+   r = __get_user(avail_idx, &vq->avail->idx);
+   if (r)
+   return false;
+
+   return vhost16_to_cpu(vq, avail_idx) != vq->avail_idx;
+}
+EXPORT_SYMBOL_GPL(vhost_vq_more_avail);
+
 /* OK, now we need to know about added descriptors. */
 bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 {
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 43284ad..2f3c57c 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -159,6 +159,7 @@ void vhost_add_used_and_signal_n(struct vhost_dev *, struct 
vhost_virtqueue *,
   struct vring_used_elem *heads, unsigned count);
 void vhost_signal(struct vhost_dev *, struct vhost_virtqueue *);
 void vhost_disable_notify(struct vhost_dev *, struct vhost_virtqueue *);
+bool vhost_vq_more_avail(struct vhost_dev *, struct vhost_virtqueue *);
 bool vhost_enable_notify(struct vhost_dev *, struct vhost_virtqueue *);
 
 int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
-- 
2.5.0



[PATCH net-next 0/3] basic busy polling support for vhost_net

2015-11-24 Thread Jason Wang
 multiple duplicate conditions in
  critical path when busy loop is not enabled.
- Add the test result of multiple VMs

Changes from RFC V2:
- poll also at the end of rx handling
- factor out the polling logic and optimize the code a little bit
- add two ioctls to get and set the busy poll timeout
- test on ixgbe (which can give more stable and reproducible numbers)
  instead of mlx4.

Changes from RFC V1:
- Add a comment for vhost_has_work() to explain why it could be
  lockless
- Add param description for busyloop_timeout
- Split out the busy polling logic into a new helper
- Check and exit the loop when there's a pending signal
- Disable preemption during busy looping to make sure local_clock() was
  correctly used.

Jason Wang (3):
  vhost: introduce vhost_has_work()
  vhost: introduce vhost_vq_more_avail()
  vhost_net: basic polling support

 drivers/vhost/net.c| 72 ++
 drivers/vhost/vhost.c  | 48 +--
 drivers/vhost/vhost.h  |  3 ++
 include/uapi/linux/vhost.h | 11 +++
 4 files changed, 120 insertions(+), 14 deletions(-)

-- 
2.5.0



[PATCH net-next 1/3] vhost: introduce vhost_has_work()

2015-11-24 Thread Jason Wang
This patch introduces a helper which gives a hint about whether or not
there is work queued in the work list.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/vhost/vhost.c | 7 +++
 drivers/vhost/vhost.h | 1 +
 2 files changed, 8 insertions(+)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index eec2f11..163b365 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -245,6 +245,13 @@ void vhost_work_queue(struct vhost_dev *dev, struct 
vhost_work *work)
 }
 EXPORT_SYMBOL_GPL(vhost_work_queue);
 
+/* A lockless hint for busy polling code to exit the loop */
+bool vhost_has_work(struct vhost_dev *dev)
+{
+   return !list_empty(&dev->work_list);
+}
+EXPORT_SYMBOL_GPL(vhost_has_work);
+
 void vhost_poll_queue(struct vhost_poll *poll)
 {
vhost_work_queue(poll->dev, &poll->work);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index d3f7674..43284ad 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -37,6 +37,7 @@ struct vhost_poll {
 
 void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
 void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
+bool vhost_has_work(struct vhost_dev *dev);
 
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
 unsigned long mask, struct vhost_dev *dev);
-- 
2.5.0



[PATCH net-next 3/3] vhost_net: basic polling support

2015-11-24 Thread Jason Wang
This patch tries to poll for newly added tx buffers or the socket receive
queue for a while at the end of tx/rx processing. The maximum time
spent on polling was specified through a new kind of vring ioctl.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/vhost/net.c| 72 ++
 drivers/vhost/vhost.c  | 15 ++
 drivers/vhost/vhost.h  |  1 +
 include/uapi/linux/vhost.h | 11 +++
 4 files changed, 94 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9eda69e..ce6da77 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -287,6 +287,41 @@ static void vhost_zerocopy_callback(struct ubuf_info 
*ubuf, bool success)
rcu_read_unlock_bh();
 }
 
+static inline unsigned long busy_clock(void)
+{
+   return local_clock() >> 10;
+}
+
+static bool vhost_can_busy_poll(struct vhost_dev *dev,
+   unsigned long endtime)
+{
+   return likely(!need_resched()) &&
+  likely(!time_after(busy_clock(), endtime)) &&
+  likely(!signal_pending(current)) &&
+  !vhost_has_work(dev) &&
+  single_task_running();
+}
+
+static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
+   struct vhost_virtqueue *vq,
+   struct iovec iov[], unsigned int iov_size,
+   unsigned int *out_num, unsigned int *in_num)
+{
+   unsigned long uninitialized_var(endtime);
+
+   if (vq->busyloop_timeout) {
+   preempt_disable();
+   endtime = busy_clock() + vq->busyloop_timeout;
+   while (vhost_can_busy_poll(vq->dev, endtime) &&
+  !vhost_vq_more_avail(vq->dev, vq))
+   cpu_relax();
+   preempt_enable();
+   }
+
+   return vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
+out_num, in_num, NULL, NULL);
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_tx(struct vhost_net *net)
@@ -331,10 +366,9 @@ static void handle_tx(struct vhost_net *net)
  % UIO_MAXIOV == nvq->done_idx))
break;
 
-   head = vhost_get_vq_desc(vq, vq->iov,
-ARRAY_SIZE(vq->iov),
-&out, &in,
-NULL, NULL);
+   head = vhost_net_tx_get_vq_desc(net, vq, vq->iov,
+   ARRAY_SIZE(vq->iov),
+   &out, &in);
/* On error, stop handling until the next kick. */
if (unlikely(head < 0))
break;
@@ -435,6 +469,34 @@ static int peek_head_len(struct sock *sk)
return len;
 }
 
+static int vhost_net_peek_head_len(struct vhost_net *net, struct sock *sk)
+{
+   struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
+   struct vhost_virtqueue *vq = &nvq->vq;
+   unsigned long uninitialized_var(endtime);
+
+   if (vq->busyloop_timeout) {
+   mutex_lock(&vq->mutex);
+   vhost_disable_notify(&net->dev, vq);
+
+   preempt_disable();
+   endtime = busy_clock() + vq->busyloop_timeout;
+
+   while (vhost_can_busy_poll(&net->dev, endtime) &&
+  skb_queue_empty(&sk->sk_receive_queue) &&
+  !vhost_vq_more_avail(&net->dev, vq))
+   cpu_relax();
+
+   preempt_enable();
+
+   if (vhost_enable_notify(&net->dev, vq))
+   vhost_poll_queue(&vq->poll);
+   mutex_unlock(&vq->mutex);
+   }
+
+   return peek_head_len(sk);
+}
+
 /* This is a multi-buffer version of vhost_get_desc, that works if
  * vq has read descriptors only.
  * @vq - the relevant virtqueue
@@ -553,7 +615,7 @@ static void handle_rx(struct vhost_net *net)
vq->log : NULL;
mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF);
 
-   while ((sock_len = peek_head_len(sock->sk))) {
+   while ((sock_len = vhost_net_peek_head_len(net, sock->sk))) {
sock_len += sock_hlen;
vhost_len = sock_len + vhost_hlen;
headcount = get_rx_bufs(vq, vq->heads, vhost_len,
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index b86c5aa..857af6c 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -285,6 +285,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
vq->memory = NULL;
vq->is_le = virtio_legacy_is_little_endian();
vhost_vq_reset_user_be(vq);
+   vq->busyloop_timeout = 0;

[PATCH net-next 2/3] vhost: introduce vhost_vq_more_avail()

2015-11-24 Thread Jason Wang
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/vhost/vhost.c | 26 +-
 drivers/vhost/vhost.h |  1 +
 2 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 163b365..b86c5aa 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1633,10 +1633,25 @@ void vhost_add_used_and_signal_n(struct vhost_dev *dev,
 }
 EXPORT_SYMBOL_GPL(vhost_add_used_and_signal_n);
 
+bool vhost_vq_more_avail(struct vhost_dev *dev, struct vhost_virtqueue *vq)
+{
+   __virtio16 avail_idx;
+   int r;
+
+   r = __get_user(avail_idx, &vq->avail->idx);
+   if (r) {
+   vq_err(vq, "Failed to check avail idx at %p: %d\n",
+  &vq->avail->idx, r);
+   return false;
+   }
+
+   return vhost16_to_cpu(vq, avail_idx) != vq->avail_idx;
+}
+EXPORT_SYMBOL_GPL(vhost_vq_more_avail);
+
 /* OK, now we need to know about added descriptors. */
 bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 {
-   __virtio16 avail_idx;
int r;
 
if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY))
@@ -1660,14 +1675,7 @@ bool vhost_enable_notify(struct vhost_dev *dev, struct 
vhost_virtqueue *vq)
/* They could have slipped one in as we were doing that: make
 * sure it's written, then check again. */
smp_mb();
-   r = __get_user(avail_idx, &vq->avail->idx);
-   if (r) {
-   vq_err(vq, "Failed to check avail idx at %p: %d\n",
-  &vq->avail->idx, r);
-   return false;
-   }
-
-   return vhost16_to_cpu(vq, avail_idx) != vq->avail_idx;
+   return vhost_vq_more_avail(dev, vq);
 }
 EXPORT_SYMBOL_GPL(vhost_enable_notify);
 
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 43284ad..2f3c57c 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -159,6 +159,7 @@ void vhost_add_used_and_signal_n(struct vhost_dev *, struct 
vhost_virtqueue *,
   struct vring_used_elem *heads, unsigned count);
 void vhost_signal(struct vhost_dev *, struct vhost_virtqueue *);
 void vhost_disable_notify(struct vhost_dev *, struct vhost_virtqueue *);
+bool vhost_vq_more_avail(struct vhost_dev *, struct vhost_virtqueue *);
 bool vhost_enable_notify(struct vhost_dev *, struct vhost_virtqueue *);
 
 int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
-- 
2.5.0

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] vhost: relax log address alignment

2015-11-16 Thread Jason Wang


On 11/16/2015 11:00 PM, Michael S. Tsirkin wrote:
> commit 5d9a07b0de512b77bf28d2401e5fe3351f00a240 ("vhost: relax used
> address alignment") fixed the alignment for the used virtual address,
> but not for the physical address used for logging.
>
> That's a mistake: alignment should clearly be the same for virtual and
> physical addresses,
>
> Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
> ---
>  drivers/vhost/vhost.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index eec2f11..080422f 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -819,7 +819,7 @@ long vhost_vring_ioctl(struct vhost_dev *d, int ioctl, 
> void __user *argp)
>   BUILD_BUG_ON(__alignof__ *vq->used > VRING_USED_ALIGN_SIZE);
>   if ((a.avail_user_addr & (VRING_AVAIL_ALIGN_SIZE - 1)) ||
>   (a.used_user_addr & (VRING_USED_ALIGN_SIZE - 1)) ||
> - (a.log_guest_addr & (sizeof(u64) - 1))) {
> + (a.log_guest_addr & (VRING_USED_ALIGN_SIZE - 1))) {
>   r = -EINVAL;
>   break;
>   }

Acked-by: Jason Wang <jasow...@redhat.com>


Re: [PATCH net-next RFC V3 0/3] basic busy polling support for vhost_net

2015-11-16 Thread Jason Wang


On 11/13/2015 05:20 PM, Jason Wang wrote:
>
> On 11/12/2015 08:02 PM, Felipe Franciosi wrote:
>> Hi Jason,
>>
>> I understand your busy loop timeout is quite conservative at 50us. Did you 
>> try any other values?
> I've also tried 20us. And the results show 50us was better in:
>
> - very small packet tx (e.g 64bytes at most 46% improvement)
> - TCP_RR (at most 11% improvement)
>
> But I will test bigger values. In fact, for net itself, we can be even
> more aggressive: make vhost poll forever, but I haven't tried this.
>
>> Also, did you measure how polling affects many VMs talking to each other 
>> (e.g. 20 VMs on each host, perhaps with several vNICs each, transmitting to 
>> a corresponding VM/vNIC pair on another host)?
> Not yet, in my todo list.
>
>>
>> On a complete separate experiment (busy waiting on storage I/O rings on 
>> Xen), I have observed that bigger timeouts gave bigger benefits. On the 
>> other hand, all cases that contended for CPU were badly hurt with any sort 
>> of polling.
>>
>> The cases that contended for CPU consisted of many VMs generating workload 
>> over very fast I/O devices (in that case, several NVMe devices on a single 
>> host). And the metric that got affected was aggregate throughput from all 
>> VMs.
>>
>> The solution was to determine whether to poll depending on the host's 
>> overall CPU utilisation at that moment. That gave me the best of both worlds 
>> as polling made everything faster without slowing down any other metric.
> You mean a threshold, and exit polling when it exceeds this? I use a
> simpler method: just exit the busy loop when there's more than one
> process in the running state. I tested this method in the past for socket
> busy read (http://www.gossamer-threads.com/lists/linux/kernel/1997531),
> which seems to solve the issue. But I haven't tested this for vhost
> polling. I will run some simple tests (e.g. pinning two vhost threads to
> one host cpu) and see how well it performs.
>
> Thanks

Ran a simple test like:

- start two VMs, each with one vcpu
- pin both VMs' vhost threads to cpu 0
- pin vcpu0 of VM1 to cpu1
- pin vcpu0 of VM2 to cpu2

Run two TCP_RR netperf tests in parallel:

/busy loop timeouts/trate1+trate2/-+%/
/no busy loop/13966.76+13987.31/+0%/
/20us/14097.89+14088.82/+0.08%/
/50us/15103.98+15103.73/+8.06%/


Busy loop can still give improvements even if two vhost threads are
contending for one cpu on the host.

>
>> Thanks,
>> Felipe
>>
>>
>>
>> On 12/11/2015 10:20, "kvm-ow...@vger.kernel.org on behalf of Jason Wang" 
>> <kvm-ow...@vger.kernel.org on behalf of jasow...@redhat.com> wrote:
>>
>>> On 11/12/2015 06:16 PM, Jason Wang wrote:
>>>> Hi all:
>>>>
>>>> This series tries to add basic busy polling for vhost net. The idea is
>>>> simple: at the end of tx/rx processing, busy poll for newly added tx
>>>> descriptors and the rx receive socket for a while. The maximum amount of
>>>> time (in us) that could be spent on busy polling was specified via ioctl.
>>>>
>>>> Tests were done with:
>>>>
>>>> - 50 us as busy loop timeout
>>>> - Netperf 2.6
>>>> - Two machines with back to back connected ixgbe
>>>> - Guest with 1 vcpu and 1 queue
>>>>
>>>> Results:
>>>> - For stream workload, ioexits were reduced dramatically in medium
>>>>   size (1024-2048) of tx (at most -39%) and almost all rx (at most
>>>>   -79%) as a result of polling. This compensates more or less for the
>>>>   possibly wasted cpu cycles. That is probably why we can still see
>>>>   some increase in the normalized throughput in some cases.
>>>> - Throughput of tx was increased (at most 105%) except for the huge
>>>>   write (16384). And we can send more packets in that case (+tpkts
>>>>   increased).
>>>> - Very minor rx regression in some cases.
>>>> - Improvement on TCP_RR (at most 16%).
>>> Forget to mention, the following test results by order are:
>>>
>>> 1) Guest TX
>>> 2) Guest RX
>>> 3) TCP_RR
>>>
>>>> size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
>>>>64/ 1/   +9%/  -17%/   +5%/  +10%/   -2%
>>>>64/ 2/   +8%/  -18%/   +6%/  +10%/   -1%
>>>>64/ 4/   +4%/  -21%/   +6%/  +10%/   -1%
>>>>64/ 8/   +9%/  -17%/   +6%/   +9%/   -2%
>>>>   256/ 1/  +20%/   -1%/  +15%/  +11%/   -9%
>>>>   256/ 2/  +15%/   -6%/  +15%/   +8%/   -8%
>>>>

Re: [PATCH net-next RFC V3 0/3] basic busy polling support for vhost_net

2015-11-13 Thread Jason Wang


On 11/12/2015 08:02 PM, Felipe Franciosi wrote:
> Hi Jason,
>
> I understand your busy loop timeout is quite conservative at 50us. Did you 
> try any other values?

I've also tried 20us. And the results show 50us was better in:

- very small packet tx (e.g 64bytes at most 46% improvement)
- TCP_RR (at most 11% improvement)

But I will test bigger values. In fact, for net itself, we can be even
more aggressive: make vhost poll forever, but I haven't tried this.

>
> Also, did you measure how polling affects many VMs talking to each other 
> (e.g. 20 VMs on each host, perhaps with several vNICs each, transmitting to a 
> corresponding VM/vNIC pair on another host)?

Not yet, in my todo list.

>
>
> On a complete separate experiment (busy waiting on storage I/O rings on Xen), 
> I have observed that bigger timeouts gave bigger benefits. On the other hand, 
> all cases that contended for CPU were badly hurt with any sort of polling.
>
> The cases that contended for CPU consisted of many VMs generating workload 
> over very fast I/O devices (in that case, several NVMe devices on a single 
> host). And the metric that got affected was aggregate throughput from all VMs.
>
> The solution was to determine whether to poll depending on the host's overall 
> CPU utilisation at that moment. That gave me the best of both worlds as 
> polling made everything faster without slowing down any other metric.

You mean a threshold, and exit polling when it exceeds this? I use a
simpler method: just exit the busy loop when there's more than one
process in the running state. I tested this method in the past for socket
busy read (http://www.gossamer-threads.com/lists/linux/kernel/1997531),
which seems to solve the issue. But I haven't tested this for vhost
polling. I will run some simple tests (e.g. pinning two vhost threads to
one host cpu) and see how well it performs.

Thanks

>
> Thanks,
> Felipe
>
>
>
> On 12/11/2015 10:20, "kvm-ow...@vger.kernel.org on behalf of Jason Wang" 
> <kvm-ow...@vger.kernel.org on behalf of jasow...@redhat.com> wrote:
>
>>
>> On 11/12/2015 06:16 PM, Jason Wang wrote:
>>> Hi all:
>>>
>>> This series tries to add basic busy polling for vhost net. The idea is
>>> simple: at the end of tx/rx processing, busy poll for newly added tx
>>> descriptors and the rx receive socket for a while. The maximum amount of
>>> time (in us) that could be spent on busy polling was specified via ioctl.
>>>
>>> Tests were done with:
>>>
>>> - 50 us as busy loop timeout
>>> - Netperf 2.6
>>> - Two machines with back to back connected ixgbe
>>> - Guest with 1 vcpu and 1 queue
>>>
>>> Results:
>>> - For stream workload, ioexits were reduced dramatically in medium
>>>   size (1024-2048) of tx (at most -39%) and almost all rx (at most
>>>   -79%) as a result of polling. This compensates more or less for the
>>>   possibly wasted cpu cycles. That is probably why we can still see
>>>   some increase in the normalized throughput in some cases.
>>> - Throughput of tx was increased (at most 105%) except for the huge
>>>   write (16384). And we can send more packets in that case (+tpkts
>>>   increased).
>>> - Very minor rx regression in some cases.
>>> - Improvement on TCP_RR (at most 16%).
>> Forget to mention, the following test results by order are:
>>
>> 1) Guest TX
>> 2) Guest RX
>> 3) TCP_RR
>>
>>> size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
>>>64/ 1/   +9%/  -17%/   +5%/  +10%/   -2%
>>>64/ 2/   +8%/  -18%/   +6%/  +10%/   -1%
>>>64/ 4/   +4%/  -21%/   +6%/  +10%/   -1%
>>>64/ 8/   +9%/  -17%/   +6%/   +9%/   -2%
>>>   256/ 1/  +20%/   -1%/  +15%/  +11%/   -9%
>>>   256/ 2/  +15%/   -6%/  +15%/   +8%/   -8%
>>>   256/ 4/  +17%/   -4%/  +16%/   +8%/   -8%
>>>   256/ 8/  -61%/  -69%/  +16%/  +10%/  -10%
>>>   512/ 1/  +15%/   -3%/  +19%/  +18%/  -11%
>>>   512/ 2/  +19%/0%/  +19%/  +13%/  -10%
>>>   512/ 4/  +18%/   -2%/  +18%/  +15%/  -10%
>>>   512/ 8/  +17%/   -1%/  +18%/  +15%/  -11%
>>>  1024/ 1/  +25%/   +4%/  +27%/  +16%/  -21%
>>>  1024/ 2/  +28%/   +8%/  +25%/  +15%/  -22%
>>>  1024/ 4/  +25%/   +5%/  +25%/  +14%/  -21%
>>>  1024/ 8/  +27%/   +7%/  +25%/  +16%/  -21%
>>>  2048/ 1/  +32%/  +12%/  +31%/  +22%/  -38%
>>>  2048/ 2/  +33%/  +12%/  +30%/  +23%/  -36%
>>>  2048/ 4/  +31%/  +10%/  +31%/  +24%/  -37%
>>>  2048/ 8/ +105%/  +75%/  +33%/  +23%/  -39%
>>>

[PATCH net-next RFC V3 3/3] vhost_net: basic polling support

2015-11-12 Thread Jason Wang
This patch tries to poll for newly added tx buffers or the socket receive
queue for a while at the end of tx/rx processing. The maximum time
spent on polling was specified through a new kind of vring ioctl.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/vhost/net.c| 77 +++---
 drivers/vhost/vhost.c  | 15 +
 drivers/vhost/vhost.h  |  1 +
 include/uapi/linux/vhost.h | 11 +++
 4 files changed, 99 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9eda69e..a38fa32 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -287,6 +287,45 @@ static void vhost_zerocopy_callback(struct ubuf_info 
*ubuf, bool success)
rcu_read_unlock_bh();
 }
 
+static inline unsigned long busy_clock(void)
+{
+   return local_clock() >> 10;
+}
+
+static bool vhost_can_busy_poll(struct vhost_dev *dev,
+   unsigned long endtime)
+{
+   return likely(!need_resched()) &&
+  likely(!time_after(busy_clock(), endtime)) &&
+  likely(!signal_pending(current)) &&
+  !vhost_has_work(dev) &&
+  single_task_running();
+}
+
+static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
+   struct vhost_virtqueue *vq,
+   struct iovec iov[], unsigned int iov_size,
+   unsigned int *out_num, unsigned int *in_num)
+{
+   unsigned long uninitialized_var(endtime);
+
+   if (vq->busyloop_timeout) {
+   preempt_disable();
+   endtime = busy_clock() + vq->busyloop_timeout;
+   }
+
+   while (vq->busyloop_timeout &&
+  vhost_can_busy_poll(vq->dev, endtime) &&
+  !vhost_vq_more_avail(vq->dev, vq))
+   cpu_relax();
+
+   if (vq->busyloop_timeout)
+   preempt_enable();
+
+   return vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
+out_num, in_num, NULL, NULL);
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_tx(struct vhost_net *net)
@@ -331,10 +370,9 @@ static void handle_tx(struct vhost_net *net)
  % UIO_MAXIOV == nvq->done_idx))
break;
 
-   head = vhost_get_vq_desc(vq, vq->iov,
-ARRAY_SIZE(vq->iov),
-&out, &in,
-NULL, NULL);
+   head = vhost_net_tx_get_vq_desc(net, vq, vq->iov,
+   ARRAY_SIZE(vq->iov),
+   &out, &in);
/* On error, stop handling until the next kick. */
if (unlikely(head < 0))
break;
@@ -435,6 +473,35 @@ static int peek_head_len(struct sock *sk)
return len;
 }
 
+static int vhost_net_peek_head_len(struct vhost_net *net, struct sock *sk)
+{
+   struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
+   struct vhost_virtqueue *vq = &nvq->vq;
+   unsigned long uninitialized_var(endtime);
+
+   if (vq->busyloop_timeout) {
+   mutex_lock(&vq->mutex);
+   vhost_disable_notify(&net->dev, vq);
+   preempt_disable();
+   endtime = busy_clock() + vq->busyloop_timeout;
+   }
+
+   while (vq->busyloop_timeout &&
+  vhost_can_busy_poll(&net->dev, endtime) &&
+  skb_queue_empty(&sk->sk_receive_queue) &&
+  !vhost_vq_more_avail(&net->dev, vq))
+   cpu_relax();
+
+   if (vq->busyloop_timeout) {
+   preempt_enable();
+   if (vhost_enable_notify(&net->dev, vq))
+   vhost_poll_queue(&vq->poll);
+   mutex_unlock(&vq->mutex);
+   }
+
+   return peek_head_len(sk);
+}
+
 /* This is a multi-buffer version of vhost_get_desc, that works if
  * vq has read descriptors only.
  * @vq - the relevant virtqueue
@@ -553,7 +620,7 @@ static void handle_rx(struct vhost_net *net)
vq->log : NULL;
mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF);
 
-   while ((sock_len = peek_head_len(sock->sk))) {
+   while ((sock_len = vhost_net_peek_head_len(net, sock->sk))) {
sock_len += sock_hlen;
vhost_len = sock_len + vhost_hlen;
headcount = get_rx_bufs(vq, vq->heads, vhost_len,
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index b86c5aa..8f9a64c 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -285,6 +285,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
vq->memory = NULL;

[PATCH net-next RFC V3 0/3] basic busy polling support for vhost_net

2015-11-12 Thread Jason Wang
Hi all:

This series tries to add basic busy polling for vhost net. The idea is
simple: at the end of tx/rx processing, busy poll for newly added tx
descriptors and the rx receive socket for a while. The maximum amount of
time (in us) that could be spent on busy polling was specified via ioctl.

Tests were done with:

- 50 us as busy loop timeout
- Netperf 2.6
- Two machines with back to back connected ixgbe
- Guest with 1 vcpu and 1 queue

Results:
- For stream workload, ioexits were reduced dramatically in medium
  size (1024-2048) of tx (at most -39%) and almost all rx (at most
  -79%) as a result of polling. This compensates more or less for the
  possibly wasted cpu cycles. That is probably why we can still see
  some increase in the normalized throughput in some cases.
- Throughput of tx was increased (at most 105%) except for the huge
  write (16384). And we can send more packets in that case (+tpkts
  increased).
- Very minor rx regression in some cases.
- Improvement on TCP_RR (at most 16%).

size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
   64/ 1/   +9%/  -17%/   +5%/  +10%/   -2%
   64/ 2/   +8%/  -18%/   +6%/  +10%/   -1%
   64/ 4/   +4%/  -21%/   +6%/  +10%/   -1%
   64/ 8/   +9%/  -17%/   +6%/   +9%/   -2%
  256/ 1/  +20%/   -1%/  +15%/  +11%/   -9%
  256/ 2/  +15%/   -6%/  +15%/   +8%/   -8%
  256/ 4/  +17%/   -4%/  +16%/   +8%/   -8%
  256/ 8/  -61%/  -69%/  +16%/  +10%/  -10%
  512/ 1/  +15%/   -3%/  +19%/  +18%/  -11%
  512/ 2/  +19%/0%/  +19%/  +13%/  -10%
  512/ 4/  +18%/   -2%/  +18%/  +15%/  -10%
  512/ 8/  +17%/   -1%/  +18%/  +15%/  -11%
 1024/ 1/  +25%/   +4%/  +27%/  +16%/  -21%
 1024/ 2/  +28%/   +8%/  +25%/  +15%/  -22%
 1024/ 4/  +25%/   +5%/  +25%/  +14%/  -21%
 1024/ 8/  +27%/   +7%/  +25%/  +16%/  -21%
 2048/ 1/  +32%/  +12%/  +31%/  +22%/  -38%
 2048/ 2/  +33%/  +12%/  +30%/  +23%/  -36%
 2048/ 4/  +31%/  +10%/  +31%/  +24%/  -37%
 2048/ 8/ +105%/  +75%/  +33%/  +23%/  -39%
16384/ 1/0%/  -14%/   +2%/0%/  +19%
16384/ 2/0%/  -13%/  +19%/  -13%/  +17%
16384/ 4/0%/  -12%/   +3%/0%/   +2%
16384/ 8/0%/  -11%/   -2%/   +1%/   +1%
size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
   64/ 1/   -7%/  -23%/   +4%/   +6%/  -74%
   64/ 2/   -2%/  -12%/   +2%/   +2%/  -55%
   64/ 4/   +2%/   -5%/  +10%/   -2%/  -43%
   64/ 8/   -5%/   -5%/  +11%/  -34%/  -59%
  256/ 1/   -6%/  -16%/   +9%/  +11%/  -60%
  256/ 2/   +3%/   -4%/   +6%/   -3%/  -28%
  256/ 4/0%/   -5%/   -9%/   -9%/  -10%
  256/ 8/   -3%/   -6%/  -12%/   -9%/  -40%
  512/ 1/   -4%/  -17%/  -10%/  +21%/  -34%
  512/ 2/0%/   -9%/  -14%/   -3%/  -30%
  512/ 4/0%/   -4%/  -18%/  -12%/   -4%
  512/ 8/   -1%/   -4%/   -1%/   -5%/   +4%
 1024/ 1/0%/  -16%/  +12%/  +11%/  -10%
 1024/ 2/0%/  -11%/0%/   +5%/  -31%
 1024/ 4/0%/   -4%/   -7%/   +1%/  -22%
 1024/ 8/   -5%/   -6%/  -17%/  -29%/  -79%
 2048/ 1/0%/  -16%/   +1%/   +9%/  -10%
 2048/ 2/0%/  -12%/   +7%/   +9%/  -26%
 2048/ 4/0%/   -7%/   -4%/   +3%/  -64%
 2048/ 8/   -1%/   -5%/   -6%/   +4%/  -20%
16384/ 1/0%/  -12%/  +11%/   +7%/  -20%
16384/ 2/0%/   -7%/   +1%/   +5%/  -26%
16384/ 4/0%/   -5%/  +12%/  +22%/  -23%
16384/ 8/0%/   -1%/   -8%/   +5%/   -3%
size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
1/ 1/   +9%/  -29%/   +9%/   +9%/   +9%
1/25/   +6%/  -18%/   +6%/   +6%/   -1%
1/50/   +6%/  -19%/   +5%/   +5%/   -2%
1/   100/   +5%/  -19%/   +4%/   +4%/   -3%
   64/ 1/  +10%/  -28%/  +10%/  +10%/  +10%
   64/25/   +8%/  -18%/   +7%/   +7%/   -2%
   64/50/   +8%/  -17%/   +8%/   +8%/   -1%
   64/   100/   +8%/  -17%/   +8%/   +8%/   -1%
  256/ 1/  +10%/  -28%/  +10%/  +10%/  +10%
  256/25/  +15%/  -13%/  +15%/  +15%/0%
  256/50/  +16%/  -14%/  +18%/  +18%/   +2%
  256/   100/  +15%/  -13%/  +12%/  +12%/   -2%

Changes from V2:
- poll also at the end of rx handling
- factor out the polling logic and optimize the code a little bit
- add two ioctls to get and set the busy poll timeout
- test on ixgbe (which can give more stable and reproducible numbers)
  instead of mlx4.

Changes from V1:
- Add a comment for vhost_has_work() to explain why it could be
  lockless
- Add param description for busyloop_timeout
- Split out the busy polling logic into a new helper
- Check and exit the loop when there's a pending signal
- Disable preemption during busy looping to make sure local_clock() was
  correctly used.

Jason Wang (3):
  vhost: introduce vhost_has_work()
  vhost: introduce vhost_vq_more_avail()
  vhost_net: basic polling support

 drivers/vhost/net.c| 77 +++---
 drivers/vhost/vhost.c  | 48 +++--
 drivers/vhost/vhost.h  |  3 ++
 include/uapi/linux/vhost.h | 11 +++

[PATCH net-next RFC V3 1/3] vhost: introduce vhost_has_work()

2015-11-12 Thread Jason Wang
This patch introduces a helper which gives a hint about whether or not
there is work queued in the work list.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/vhost/vhost.c | 7 +++
 drivers/vhost/vhost.h | 1 +
 2 files changed, 8 insertions(+)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index eec2f11..163b365 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -245,6 +245,13 @@ void vhost_work_queue(struct vhost_dev *dev, struct 
vhost_work *work)
 }
 EXPORT_SYMBOL_GPL(vhost_work_queue);
 
+/* A lockless hint for busy polling code to exit the loop */
+bool vhost_has_work(struct vhost_dev *dev)
+{
+   return !list_empty(&dev->work_list);
+}
+EXPORT_SYMBOL_GPL(vhost_has_work);
+
 void vhost_poll_queue(struct vhost_poll *poll)
 {
vhost_work_queue(poll->dev, &poll->work);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 4772862..ea0327d 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -37,6 +37,7 @@ struct vhost_poll {
 
 void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
 void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
+bool vhost_has_work(struct vhost_dev *dev);
 
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
 unsigned long mask, struct vhost_dev *dev);
-- 
2.1.4



[PATCH net-next RFC V3 2/3] vhost: introduce vhost_vq_more_avail()

2015-11-12 Thread Jason Wang
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/vhost/vhost.c | 26 +-
 drivers/vhost/vhost.h |  1 +
 2 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 163b365..b86c5aa 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1633,10 +1633,25 @@ void vhost_add_used_and_signal_n(struct vhost_dev *dev,
 }
 EXPORT_SYMBOL_GPL(vhost_add_used_and_signal_n);
 
+bool vhost_vq_more_avail(struct vhost_dev *dev, struct vhost_virtqueue *vq)
+{
+   __virtio16 avail_idx;
+   int r;
+
+   r = __get_user(avail_idx, &vq->avail->idx);
+   if (r) {
+   vq_err(vq, "Failed to check avail idx at %p: %d\n",
+              &vq->avail->idx, r);
+   return false;
+   }
+
+   return vhost16_to_cpu(vq, avail_idx) != vq->avail_idx;
+}
+EXPORT_SYMBOL_GPL(vhost_vq_more_avail);
+
 /* OK, now we need to know about added descriptors. */
 bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 {
-   __virtio16 avail_idx;
int r;
 
if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY))
@@ -1660,14 +1675,7 @@ bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
/* They could have slipped one in as we were doing that: make
 * sure it's written, then check again. */
smp_mb();
-   r = __get_user(avail_idx, &vq->avail->idx);
-   if (r) {
-   vq_err(vq, "Failed to check avail idx at %p: %d\n",
-              &vq->avail->idx, r);
-   return false;
-   }
-
-   return vhost16_to_cpu(vq, avail_idx) != vq->avail_idx;
+   return vhost_vq_more_avail(dev, vq);
 }
 EXPORT_SYMBOL_GPL(vhost_enable_notify);
 
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index ea0327d..5983a13 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -159,6 +159,7 @@ void vhost_add_used_and_signal_n(struct vhost_dev *, struct vhost_virtqueue *,
   struct vring_used_elem *heads, unsigned count);
 void vhost_signal(struct vhost_dev *, struct vhost_virtqueue *);
 void vhost_disable_notify(struct vhost_dev *, struct vhost_virtqueue *);
+bool vhost_vq_more_avail(struct vhost_dev *, struct vhost_virtqueue *);
 bool vhost_enable_notify(struct vhost_dev *, struct vhost_virtqueue *);
 
 int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
-- 
2.1.4



Re: [PATCH net-next RFC V3 0/3] basic busy polling support for vhost_net

2015-11-12 Thread Jason Wang


On 11/12/2015 06:16 PM, Jason Wang wrote:
> Hi all:
>
> This series tries to add basic busy polling for vhost net. The idea is
> simple: at the end of tx/rx processing, busy poll for newly added tx
> descriptors and the rx receive socket for a while. The maximum amount
> of time (in us) that can be spent on busy polling is specified through
> an ioctl.
>
> Tests were done with:
>
> - 50 us as busy loop timeout
> - Netperf 2.6
> - Two machines with back to back connected ixgbe
> - Guest with 1 vcpu and 1 queue
>
> Results:
> - For stream workloads, ioexits were reduced dramatically for medium
>   tx sizes (1024-2048, at most -39%) and almost all rx sizes (at most
>   -79%) as a result of polling. This more or less compensates for the
>   possibly wasted cpu cycles, which is probably why we can still see
>   some increase in normalized throughput in some cases.
> - Tx throughput was increased (at most 105%) except for the huge
>   write (16384), and we can send more packets in that case (+tpkts
>   increased).
> - Very minor rx regression in some cases.
> - Improvement on TCP_RR (at most 16%).

Forgot to mention, the following test results, in order, are:

1) Guest TX
2) Guest RX
3) TCP_RR

> size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
>64/ 1/   +9%/  -17%/   +5%/  +10%/   -2%
>64/ 2/   +8%/  -18%/   +6%/  +10%/   -1%
>64/ 4/   +4%/  -21%/   +6%/  +10%/   -1%
>64/ 8/   +9%/  -17%/   +6%/   +9%/   -2%
>   256/ 1/  +20%/   -1%/  +15%/  +11%/   -9%
>   256/ 2/  +15%/   -6%/  +15%/   +8%/   -8%
>   256/ 4/  +17%/   -4%/  +16%/   +8%/   -8%
>   256/ 8/  -61%/  -69%/  +16%/  +10%/  -10%
>   512/ 1/  +15%/   -3%/  +19%/  +18%/  -11%
>   512/ 2/  +19%/0%/  +19%/  +13%/  -10%
>   512/ 4/  +18%/   -2%/  +18%/  +15%/  -10%
>   512/ 8/  +17%/   -1%/  +18%/  +15%/  -11%
>  1024/ 1/  +25%/   +4%/  +27%/  +16%/  -21%
>  1024/ 2/  +28%/   +8%/  +25%/  +15%/  -22%
>  1024/ 4/  +25%/   +5%/  +25%/  +14%/  -21%
>  1024/ 8/  +27%/   +7%/  +25%/  +16%/  -21%
>  2048/ 1/  +32%/  +12%/  +31%/  +22%/  -38%
>  2048/ 2/  +33%/  +12%/  +30%/  +23%/  -36%
>  2048/ 4/  +31%/  +10%/  +31%/  +24%/  -37%
>  2048/ 8/ +105%/  +75%/  +33%/  +23%/  -39%
> 16384/ 1/0%/  -14%/   +2%/0%/  +19%
> 16384/ 2/0%/  -13%/  +19%/  -13%/  +17%
> 16384/ 4/0%/  -12%/   +3%/0%/   +2%
> 16384/ 8/0%/  -11%/   -2%/   +1%/   +1%
> size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
>64/ 1/   -7%/  -23%/   +4%/   +6%/  -74%
>64/ 2/   -2%/  -12%/   +2%/   +2%/  -55%
>64/ 4/   +2%/   -5%/  +10%/   -2%/  -43%
>64/ 8/   -5%/   -5%/  +11%/  -34%/  -59%
>   256/ 1/   -6%/  -16%/   +9%/  +11%/  -60%
>   256/ 2/   +3%/   -4%/   +6%/   -3%/  -28%
>   256/ 4/0%/   -5%/   -9%/   -9%/  -10%
>   256/ 8/   -3%/   -6%/  -12%/   -9%/  -40%
>   512/ 1/   -4%/  -17%/  -10%/  +21%/  -34%
>   512/ 2/0%/   -9%/  -14%/   -3%/  -30%
>   512/ 4/0%/   -4%/  -18%/  -12%/   -4%
>   512/ 8/   -1%/   -4%/   -1%/   -5%/   +4%
>  1024/ 1/0%/  -16%/  +12%/  +11%/  -10%
>  1024/ 2/0%/  -11%/0%/   +5%/  -31%
>  1024/ 4/0%/   -4%/   -7%/   +1%/  -22%
>  1024/ 8/   -5%/   -6%/  -17%/  -29%/  -79%
>  2048/ 1/0%/  -16%/   +1%/   +9%/  -10%
>  2048/ 2/0%/  -12%/   +7%/   +9%/  -26%
>  2048/ 4/0%/   -7%/   -4%/   +3%/  -64%
>  2048/ 8/   -1%/   -5%/   -6%/   +4%/  -20%
> 16384/ 1/0%/  -12%/  +11%/   +7%/  -20%
> 16384/ 2/0%/   -7%/   +1%/   +5%/  -26%
> 16384/ 4/0%/   -5%/  +12%/  +22%/  -23%
> 16384/ 8/0%/   -1%/   -8%/   +5%/   -3%
> size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
> 1/ 1/   +9%/  -29%/   +9%/   +9%/   +9%
> 1/25/   +6%/  -18%/   +6%/   +6%/   -1%
> 1/50/   +6%/  -19%/   +5%/   +5%/   -2%
> 1/   100/   +5%/  -19%/   +4%/   +4%/   -3%
>64/ 1/  +10%/  -28%/  +10%/  +10%/  +10%
>64/25/   +8%/  -18%/   +7%/   +7%/   -2%
>64/50/   +8%/  -17%/   +8%/   +8%/   -1%
>64/   100/   +8%/  -17%/   +8%/   +8%/   -1%
>   256/ 1/  +10%/  -28%/  +10%/  +10%/  +10%
>   256/25/  +15%/  -13%/  +15%/  +15%/0%
>   256/50/  +16%/  -14%/  +18%/  +18%/   +2%
>   256/   100/  +15%/  -13%/  +12%/  +12%/   -2%
>
> Changes from V2:
> - poll also at the end of rx handling
> - factor out the polling logic and optimize the code a little bit
> - add two ioctls to get and set the busy poll timeout
> - test on ixgbe (which can give more stable and reproducible numbers)
>   instead of mlx4.
>
> Changes from V1:
> - Add a comment for vhost_has

Re: [PATCH V6 0/6] Fast mmio eventfd fixes

2015-11-09 Thread Jason Wang


On 11/10/2015 04:19 AM, Michael S. Tsirkin wrote:
> On Mon, Nov 09, 2015 at 12:35:45PM +0800, Jason Wang wrote:
>> > 
>> > 
>> > On 11/09/2015 01:11 AM, Michael S. Tsirkin wrote:
>>> > > On Tue, Sep 15, 2015 at 02:41:53PM +0800, Jason Wang wrote:
>>>> > >> Hi:
>>>> > >>
>>>> > >> This series fixes two issues of fast mmio eventfd:
>>>> > >>
>>>> > >> 1) A single iodev instance was registered on two buses: KVM_MMIO_BUS
>>>> > >>and KVM_FAST_MMIO_BUS. This will cause a double free in
>>>> > >>ioeventfd_destructor().
>>>> > >> 2) A zero length iodev on KVM_MMIO_BUS will never be found by
>>>> > >>kvm_io_bus_cmp(). This leads to e.g. the eventfd being trapped by
>>>> > >>qemu instead of the host.
>>>> > >>
>>>> > >> 1 is fixed by allocating two instances of iodev and introducing a new
>>>> > >> capability for userspace. 2 is fixed by ignoring the actual length if
>>>> > >> the length of the iodev is zero in kvm_io_bus_cmp().
>>>> > >>
>>>> > >> Please review.
>>>> > >> Changes from V5:
>>>> > >> - move patch of explicitly checking for KVM_MMIO_BUS to patch 1 and
>>>> > >>   remove the unnecessary checks
>>>> > >> - even more grammar and typo fixes
>>>> > >> - rebase to kvm.git
>>>> > >> - document KVM_CAP_FAST_MMIO
>>> > > What's up with userspace using this capability?
>> > 
>> > It was renamed to KVM_CAP_IOEVENTFD_ANY_LENGTH.
>> > 
>>> > > Did patches ever get posted?
>> > 
>> > See https://lkml.org/lkml/2015/9/28/208
> Talking about userspace here.
> QEMU freeze is approaching, it really should
> use this to avoid regressions.
>

The patches were posted at
http://lists.gnu.org/archive/html/qemu-devel/2015-11/msg01276.html

(you were in cc list)


Re: [PATCH V6 0/6] Fast mmio eventfd fixes

2015-11-08 Thread Jason Wang


On 11/09/2015 01:11 AM, Michael S. Tsirkin wrote:
> On Tue, Sep 15, 2015 at 02:41:53PM +0800, Jason Wang wrote:
>> Hi:
>>
>> This series fixes two issues of fast mmio eventfd:
>>
>> 1) A single iodev instance was registered on two buses: KVM_MMIO_BUS
>>and KVM_FAST_MMIO_BUS. This will cause a double free in
>>ioeventfd_destructor().
>> 2) A zero length iodev on KVM_MMIO_BUS will never be found by
>>kvm_io_bus_cmp(). This leads to e.g. the eventfd being trapped by
>>qemu instead of the host.
>>
>> 1 is fixed by allocating two instances of iodev and introducing a new
>> capability for userspace. 2 is fixed by ignoring the actual length if
>> the length of the iodev is zero in kvm_io_bus_cmp().
>>
>> Please review.
>> Changes from V5:
>> - move patch of explicitly checking for KVM_MMIO_BUS to patch 1 and
>>   remove the unnecessary checks
>> - even more grammar and typo fixes
>> - rebase to kvm.git
>> - document KVM_CAP_FAST_MMIO
> What's up with userspace using this capability?

It was renamed to KVM_CAP_IOEVENTFD_ANY_LENGTH.

> Did patches ever get posted?

See https://lkml.org/lkml/2015/9/28/208

>
>> Changes from V4:
>> - move the location of kvm_assign_ioeventfd() in patch 1 which reduce
>>   the change set.
>> - commit log typo fixes
>> - switch to use kvm_deassign_ioeventfd_idx() when failing to register to
>>   fast mmio bus
>> - change kvm_io_bus_cmp() as Paolo's suggestions
>> - introduce a new capability to avoid new userspace crashing old kernels
>> - add a new patch that only try to register mmio eventfd on fast mmio
>>   bus
>>
>> Changes from V3:
>>
>> - Don't do search on two buses when trying to do write on
>>   KVM_MMIO_BUS. This fixes a small regression found by vmexit.flat.
>> - Since we don't do search on two buses, change kvm_io_bus_cmp() to
>>   let it can find zero length iodevs.
>> - Fix the unnecessary lines in tracepoint patch.
>>
>> Changes from V2:
>> - Tweak styles and comment suggested by Cornelia.
>>
>> Changes from v1:
>> - change ioeventfd_bus_from_flags() to return KVM_FAST_MMIO_BUS when
>>   needed to save lots of unnecessary changes.
>>
>> Jason Wang (6):
>>   kvm: don't try to register to KVM_FAST_MMIO_BUS for non mmio eventfd
>>   kvm: factor out core eventfd assign/deassign logic
>>   kvm: fix double free for fast mmio eventfd
>>   kvm: fix zero length mmio searching
>>   kvm: add tracepoint for fast mmio
>>   kvm: add fast mmio capabilitiy
>>
>>  Documentation/virtual/kvm/api.txt |   7 ++-
>>  arch/x86/kvm/trace.h  |  18 ++
>>  arch/x86/kvm/vmx.c|   1 +
>>  arch/x86/kvm/x86.c|   1 +
>>  include/uapi/linux/kvm.h  |   1 +
>>  virt/kvm/eventfd.c| 124 
>> ++
>>  virt/kvm/kvm_main.c   |  20 +-
>>  7 files changed, 118 insertions(+), 54 deletions(-)
>>
>> -- 
>> 2.1.4


Re: [PATCH net-next rfc V2 0/2] basic busy polling support for vhost_net

2015-11-02 Thread Jason Wang


On 10/30/2015 07:58 PM, Jason Wang wrote:
>
> On 10/29/2015 04:45 PM, Jason Wang wrote:
>> Hi all:
>>
>> This series tries to add basic busy polling for vhost net. The idea is
>> simple: at the end of tx processing, busy poll for newly added tx
>> descriptors and the rx receive socket for a while. The maximum amount
>> of time (in us) that can be spent on busy polling is specified through
>> a module parameter.
>>
>> Tests were done with:
>>
>> - 50 us as busy loop timeout
>> - Netperf 2.6
>> - Two machines with back to back connected mlx4
>> - Guest with 8 vcpus and 1 queue
>>
>> Results show a huge improvement on both tx (at most 158%) and rr
>> (at most 53%) while rx is about the same as before. In most cases the
>> cpu utilization is also improved:
>>
> Just noticed there's something wrong in the setup, so the numbers here
> are incorrect. Will re-run and post correct numbers.
>
> Sorry.

Here's the updated testing result:

1) 1 vcpu 1 queue:

TCP_RR
size/session/+thu%/+normalize%
1/ 1/0%/  -25%
1/50/  +12%/0%
1/   100/  +12%/   +1%
1/   200/   +9%/   -1%
   64/ 1/   +3%/  -21%
   64/50/   +8%/0%
   64/   100/   +7%/0%
   64/   200/   +9%/0%
  256/ 1/   +1%/  -25%
  256/50/   +7%/   -2%
  256/   100/   +6%/   -2%
  256/   200/   +4%/   -2%
  512/ 1/   +2%/  -19%
  512/50/   +5%/   -2%
  512/   100/   +3%/   -3%
  512/   200/   +6%/   -2%
 1024/ 1/   +2%/  -20%
 1024/50/   +3%/   -3%
 1024/   100/   +5%/   -3%
 1024/   200/   +4%/   -2%
Guest RX
size/session/+thu%/+normalize%
   64/ 1/   -4%/   -5%
   64/ 4/   -3%/  -10%
   64/ 8/   -3%/   -5%
  512/ 1/  +15%/   +1%
  512/ 4/   -5%/   -5%
  512/ 8/   -2%/   -4%
 1024/ 1/   -5%/  -16%
 1024/ 4/   -2%/   -5%
 1024/ 8/   -6%/   -6%
 2048/ 1/  +10%/   +5%
 2048/ 4/   -8%/   -4%
 2048/ 8/   -1%/   -4%
 4096/ 1/   -9%/  -11%
 4096/ 4/   +1%/   -1%
 4096/ 8/   +1%/0%
16384/ 1/  +20%/  +11%
16384/ 4/0%/   -3%
16384/ 8/   +1%/0%
65535/ 1/  +36%/  +13%
65535/ 4/  -10%/   -9%
65535/ 8/   -3%/   -2%
Guest TX
size/session/+thu%/+normalize%
   64/ 1/   -7%/  -16%
   64/ 4/  -14%/  -23%
   64/ 8/   -9%/  -20%
  512/ 1/  -62%/  -56%
  512/ 4/  -62%/  -56%
  512/ 8/  -61%/  -53%
 1024/ 1/  -66%/  -61%
 1024/ 4/  -77%/  -73%
 1024/ 8/  -73%/  -67%
 2048/ 1/  -74%/  -75%
 2048/ 4/  -77%/  -74%
 2048/ 8/  -72%/  -68%
 4096/ 1/  -65%/  -68%
 4096/ 4/  -66%/  -63%
 4096/ 8/  -62%/  -57%
16384/ 1/  -25%/  -28%
16384/ 4/  -28%/  -17%
16384/ 8/  -24%/  -10%
65535/ 1/  -17%/  -14%
65535/ 4/  -22%/   -5%
65535/ 8/  -25%/   -9%

- obvious improvement on TCP_RR (at most 12%)
- improvement on guest RX
- huge decrease on guest TX (at most -75%); this is probably because the
virtio-net driver suffers from bufferbloat by orphaning skbs before
transmission. The faster vhost is, the smaller the packets it can
produce. To reduce this impact, turning off gso in the guest gives the
following results:

size/session/+thu%/+normalize%
   64/ 1/   +3%/  -11%
   64/ 4/   +4%/  -10%
   64/ 8/   +4%/  -10%
  512/ 1/   +2%/   +5%
  512/ 4/0%/   -1%
  512/ 8/0%/0%
 1024/ 1/  +11%/0%
 1024/ 4/0%/   -1%
 1024/ 8/   +3%/   +1%
 2048/ 1/   +4%/   -1%
 2048/ 4/   +8%/   +3%
 2048/ 8/0%/   -1%
 4096/ 1/   +4%/   -1%
 4096/ 4/   +1%/0%
 4096/ 8/   +2%/0%
16384/ 1/   +2%/   -2%
16384/ 4/   +3%/   +1%
16384/ 8/0%/   -1%
65535/ 1/   +9%/   +7%
65535/ 4/0%/   -3%
65535/ 8/   -1%/   -1%

2) 8 vcpus 1 queue:

TCP_RR
size/session/+thu%/+normalize%
1/ 1/   +5%/  -14%
1/50/   +2%/   +1%
1/   100/0%/   -1%
1/   200/0%/0%
   64/ 1/0%/  -25%
   64/50/   +5%/   +5%
   64/   100/0%/0%
   64/   200/0%/   -1%
  256/ 1/0%/  -30%
  256/50/0%/0%
  256/   100/   -2%/   -2%
  256/   200/0%/0%
  512/ 1/   +1%/  -23%
  512/50/   +1%/   +1%
  512/   100/   +1%/0%
  512/   200/   +1%/   +1%
 1024/ 1/   +1%/  -23%
 1024/50/   +5%/   +5%
 1024/   100/0%/   -1%
 1024/   200/0%/0%
Guest RX
size/session/+thu%/+normalize%
   64/ 1/   +1%/   +1%
   64/ 4/   -2%/   +1%
   64/ 8/   +6%/  +19%
  512/ 1/   +5%/   -7%
  512/ 4/   -4%/   -4%
  512/ 8/0%/0%
 1024/ 1/   +1%/   +2%
 1024/ 4/   -2%/   -2%
 1024/ 8/   -1%/   +7%
 2048/ 1/   +8%/   -2%
 2048/ 4/0%/   +5%
 2048/ 8/   -1%/  +13%
 4096/ 1/   -1%/   +2%
 4096/ 4/0%/   +6%
 4096/ 8/   -2%/  +15%
16384/ 1/   -1%/0%
16384/ 4/   -2%/   -1%
16384/ 8/   -2%/   +2%
65535/ 1/ 

Re: [PATCH net-next rfc V2 0/2] basic busy polling support for vhost_net

2015-10-30 Thread Jason Wang


On 10/29/2015 04:45 PM, Jason Wang wrote:
> Hi all:
>
> This series tries to add basic busy polling for vhost net. The idea is
> simple: at the end of tx processing, busy poll for newly added tx
> descriptors and the rx receive socket for a while. The maximum amount
> of time (in us) that can be spent on busy polling is specified through
> a module parameter.
>
> Tests were done with:
>
> - 50 us as busy loop timeout
> - Netperf 2.6
> - Two machines with back to back connected mlx4
> - Guest with 8 vcpus and 1 queue
>
> Results show a huge improvement on both tx (at most 158%) and rr
> (at most 53%) while rx is about the same as before. In most cases the
> cpu utilization is also improved:
>

Just noticed there's something wrong in the setup, so the numbers here
are incorrect. Will re-run and post correct numbers.

Sorry.


[PATCH net-next rfc V2 2/2] vhost_net: basic polling support

2015-10-29 Thread Jason Wang
This patch tries to poll for newly added tx buffers for a while at the
end of tx processing. The maximum time spent on polling is limited
through a module parameter. To avoid blocking rx, the loop ends when
there are other works queued on vhost, so in fact the socket receive
queue is also polled.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/vhost/net.c | 54 +
 1 file changed, 50 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9eda69e..30e6d3d 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -31,9 +31,13 @@
 #include "vhost.h"
 
 static int experimental_zcopytx = 1;
+static int busyloop_timeout = 50;
 module_param(experimental_zcopytx, int, 0444);
 MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;"
   " 1 -Enable; 0 - Disable");
+module_param(busyloop_timeout, int, 0444);
+MODULE_PARM_DESC(busyloop_timeout, "Maximum number of time (in us) "
+  "could be spend on busy polling");
 
 /* Max number of bytes transferred before requeueing the job.
  * Using this limit prevents one virtqueue from starving others. */
@@ -287,6 +291,49 @@ static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success)
rcu_read_unlock_bh();
 }
 
+static inline unsigned long busy_clock(void)
+{
+   return local_clock() >> 10;
+}
+
+static bool tx_can_busy_poll(struct vhost_dev *dev,
+unsigned long endtime)
+{
+   return likely(!need_resched()) &&
+  likely(!time_after(busy_clock(), endtime)) &&
+  likely(!signal_pending(current)) &&
+  !vhost_has_work(dev) &&
+  single_task_running();
+}
+
+static int vhost_net_tx_get_vq_desc(struct vhost_virtqueue *vq,
+   struct iovec iov[], unsigned int iov_size,
+   unsigned int *out_num, unsigned int *in_num)
+{
+   unsigned long uninitialized_var(endtime);
+   int head;
+
+   if (busyloop_timeout) {
+   preempt_disable();
+   endtime = busy_clock() + busyloop_timeout;
+   }
+
+again:
+   head = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
+out_num, in_num, NULL, NULL);
+
+   if (head == vq->num && busyloop_timeout &&
+   tx_can_busy_poll(vq->dev, endtime)) {
+   cpu_relax();
+   goto again;
+   }
+
+   if (busyloop_timeout)
+   preempt_enable();
+
+   return head;
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_tx(struct vhost_net *net)
@@ -331,10 +378,9 @@ static void handle_tx(struct vhost_net *net)
  % UIO_MAXIOV == nvq->done_idx))
break;
 
-   head = vhost_get_vq_desc(vq, vq->iov,
-ARRAY_SIZE(vq->iov),
-&out, &in,
-NULL, NULL);
+   head = vhost_net_tx_get_vq_desc(vq, vq->iov,
+   ARRAY_SIZE(vq->iov),
+   &out, &in);
/* On error, stop handling until the next kick. */
if (unlikely(head < 0))
break;
-- 
1.8.3.1



[PATCH net-next rfc V2 0/2] basic busy polling support for vhost_net

2015-10-29 Thread Jason Wang
Hi all:

This series tries to add basic busy polling for vhost net. The idea is
simple: at the end of tx processing, busy poll for newly added tx
descriptors and the rx receive socket for a while. The maximum amount
of time (in us) that can be spent on busy polling is specified through
a module parameter.

Tests were done with:

- 50 us as busy loop timeout
- Netperf 2.6
- Two machines with back to back connected mlx4
- Guest with 8 vcpus and 1 queue

Results show a huge improvement on both tx (at most 158%) and rr
(at most 53%) while rx is about the same as before. In most cases the
cpu utilization is also improved:

Guest TX:
size/session/+thu%/+normalize%
   64/ 1/  +17%/   +6%
   64/ 4/   +9%/  +17%
   64/ 8/  +34%/  +21%
  512/ 1/  +48%/  +40%
  512/ 4/  +31%/  +20%
  512/ 8/  +39%/  +22%
 1024/ 1/ +158%/  +99%
 1024/ 4/  +20%/  +11%
 1024/ 8/  +40%/  +18%
 2048/ 1/ +108%/  +74%
 2048/ 4/  +21%/   +7%
 2048/ 8/  +32%/  +14%
 4096/ 1/  +94%/  +77%
 4096/ 4/   +7%/   -6%
 4096/ 8/   +9%/   -4%
16384/ 1/  +33%/   +9%
16384/ 4/  +10%/   -6%
16384/ 8/  +19%/   +2%
65535/ 1/  +15%/   -6%
65535/ 4/   +8%/   -9%
65535/ 8/  +14%/0%

Guest RX:
size/session/+thu%/+normalize%
   64/ 1/   -3%/   -3%
   64/ 4/   +4%/  +20%
   64/ 8/   -1%/   -1%
  512/ 1/  +20%/  +12%
  512/ 4/   +1%/   +3%
  512/ 8/0%/   -5%
 1024/ 1/   +9%/   -2%
 1024/ 4/0%/   +5%
 1024/ 8/   +1%/0%
 2048/ 1/0%/   +3%
 2048/ 4/   -2%/   +3%
 2048/ 8/   -1%/   -3%
 4096/ 1/   -8%/   +3%
 4096/ 4/0%/   +2%
 4096/ 8/0%/   +5%
16384/ 1/   +3%/0%
16384/ 4/   +2%/   +2%
16384/ 8/0%/  +13%
65535/ 1/0%/   +3%
65535/ 4/   +2%/   -1%
65535/ 8/   +1%/  +14%

TCP_RR:
size/session/+thu%/+normalize%
1/ 1/   +8%/   -6%
1/50/  +18%/  +15%
1/   100/  +22%/  +19%
1/   200/  +25%/  +23%
   64/ 1/   +2%/  -19%
   64/50/  +46%/  +39%
   64/   100/  +47%/  +39%
   64/   200/  +50%/  +44%
  512/ 1/0%/  -28%
  512/50/  +50%/  +44%
  512/   100/  +53%/  +47%
  512/   200/  +51%/  +58%
 1024/ 1/   +3%/  -14%
 1024/50/  +45%/  +37%
 1024/   100/  +53%/  +49%
 1024/   200/  +48%/  +55%

Changes from V1:
- Add a comment for vhost_has_work() to explain why it could be
  lockless
- Add param description for busyloop_timeout
- Split out the busy polling logic into a new helper
- Check and exit the loop when there's a pending signal
- Disable preemption during busy looping to make sure local_clock() is
  used correctly.

Todo:
- Make the busyloop timeout configurable per VM through an ioctl.

Please review.

Thanks

Jason Wang (2):
  vhost: introduce vhost_has_work()
  vhost_net: basic polling support

 drivers/vhost/net.c   | 54 +++
 drivers/vhost/vhost.c |  7 +++
 drivers/vhost/vhost.h |  1 +
 3 files changed, 58 insertions(+), 4 deletions(-)

-- 
1.8.3.1



[PATCH net-next rfc V2 1/2] vhost: introduce vhost_has_work()

2015-10-29 Thread Jason Wang
This patch introduces a helper which gives a hint about whether or not
there is work queued in the work list.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/vhost/vhost.c | 7 +++
 drivers/vhost/vhost.h | 1 +
 2 files changed, 8 insertions(+)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index eec2f11..163b365 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -245,6 +245,13 @@ void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
 }
 EXPORT_SYMBOL_GPL(vhost_work_queue);
 
+/* A lockless hint for busy polling code to exit the loop */
+bool vhost_has_work(struct vhost_dev *dev)
+{
+   return !list_empty(&dev->work_list);
+}
+EXPORT_SYMBOL_GPL(vhost_has_work);
+
 void vhost_poll_queue(struct vhost_poll *poll)
 {
	vhost_work_queue(poll->dev, &poll->work);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 4772862..ea0327d 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -37,6 +37,7 @@ struct vhost_poll {
 
 void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
 void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
+bool vhost_has_work(struct vhost_dev *dev);
 
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
 unsigned long mask, struct vhost_dev *dev);
-- 
1.8.3.1



Re: [PATCH net-next RFC 1/2] vhost: introduce vhost_has_work()

2015-10-23 Thread Jason Wang


On 10/22/2015 04:38 PM, Michael S. Tsirkin wrote:
> On Thu, Oct 22, 2015 at 01:27:28AM -0400, Jason Wang wrote:
>> > This patch introduces a helper which gives a hint about whether or not
>> > there is work queued in the work list.
>> > 
>> > Signed-off-by: Jason Wang <jasow...@redhat.com>
>> > ---
>> >  drivers/vhost/vhost.c | 6 ++
>> >  drivers/vhost/vhost.h | 1 +
>> >  2 files changed, 7 insertions(+)
>> > 
>> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>> > index eec2f11..d42d11e 100644
>> > --- a/drivers/vhost/vhost.c
>> > +++ b/drivers/vhost/vhost.c
>> > @@ -245,6 +245,12 @@ void vhost_work_queue(struct vhost_dev *dev, struct 
>> > vhost_work *work)
>> >  }
>> >  EXPORT_SYMBOL_GPL(vhost_work_queue);
>> >  
>> > +bool vhost_has_work(struct vhost_dev *dev)
>> > +{
>> > +  return !list_empty(&dev->work_list);
>> > +}
>> > +EXPORT_SYMBOL_GPL(vhost_has_work);
>> > +
>> >  void vhost_poll_queue(struct vhost_poll *poll)
>> >  {
>> >vhost_work_queue(poll->dev, &poll->work);
> This doesn't take a lock so it's unreliable.
> I think it's ok in this case since it's just
> an optimization - but pls document this.
>

Ok, will do.


Re: [PATCH net-next RFC 2/2] vhost_net: basic polling support

2015-10-23 Thread Jason Wang


On 10/22/2015 05:33 PM, Michael S. Tsirkin wrote:
> On Thu, Oct 22, 2015 at 01:27:29AM -0400, Jason Wang wrote:
>> This patch tries to poll for newly added tx buffers for a while at the
>> end of tx processing. The maximum time spent on polling is limited
>> through a module parameter. To avoid blocking rx, the loop ends when
>> there are other works queued on vhost, so in fact the socket receive
>> queue is also polled.
>>
>> busyloop_timeout = 50 gives us following improvement on TCP_RR test:
>>
>> size/session/+thu%/+normalize%
>> 1/ 1/   +5%/  -20%
>> 1/50/  +17%/   +3%
> Is there a measureable increase in cpu utilization
> with busyloop_timeout = 0?

Just ran TCP_RR; no increase. Will run a complete test for the next version.

>
>> Signed-off-by: Jason Wang <jasow...@redhat.com>
> We might be able to shave off the minor regression
> by careful use of likely/unlikely, or maybe
> deferring 

Yes, but what does "deferring" mean here?
 
>
>> ---
>>  drivers/vhost/net.c | 19 +++
>>  1 file changed, 19 insertions(+)
>>
>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>> index 9eda69e..bbb522a 100644
>> --- a/drivers/vhost/net.c
>> +++ b/drivers/vhost/net.c
>> @@ -31,7 +31,9 @@
>>  #include "vhost.h"
>>  
>>  static int experimental_zcopytx = 1;
>> +static int busyloop_timeout = 50;
>>  module_param(experimental_zcopytx, int, 0444);
>> +module_param(busyloop_timeout, int, 0444);
> Pls add a description, including the units and the special
> value 0.

Ok.

>
>>  MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;"
>> " 1 -Enable; 0 - Disable");
>>  
>> @@ -287,12 +289,23 @@ static void vhost_zerocopy_callback(struct ubuf_info 
>> *ubuf, bool success)
>>  rcu_read_unlock_bh();
>>  }
>>  
>> +static bool tx_can_busy_poll(struct vhost_dev *dev,
>> + unsigned long endtime)
>> +{
>> +unsigned long now = local_clock() >> 10;
> local_clock might go backwards if we jump between CPUs.
> One way to fix would be to record the CPU id and break
> out of loop if that changes.

Right, or maybe disable preemption in this case?

>
> Also - defer this until we actually know we need it?

Right.

>
>> +
>> +return busyloop_timeout && !need_resched() &&
>> +   !time_after(now, endtime) && !vhost_has_work(dev) &&
>> +   single_task_running();
> signal pending as well?

Yes.

>> +}
>> +
>>  /* Expects to be always run from workqueue - which acts as
>>   * read-size critical section for our kind of RCU. */
>>  static void handle_tx(struct vhost_net *net)
>>  {
>>  struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
>>  struct vhost_virtqueue *vq = &nvq->vq;
>> +unsigned long endtime;
>>  unsigned out, in;
>>  int head;
>>  struct msghdr msg = {
>> @@ -331,6 +344,8 @@ static void handle_tx(struct vhost_net *net)
>>% UIO_MAXIOV == nvq->done_idx))
>>  break;
>>  
>> +endtime  = (local_clock() >> 10) + busyloop_timeout;
>> +again:
>>  head = vhost_get_vq_desc(vq, vq->iov,
>>   ARRAY_SIZE(vq->iov),
>>   &out, &in,
>> @@ -340,6 +355,10 @@ static void handle_tx(struct vhost_net *net)
>>  break;
>>  /* Nothing new?  Wait for eventfd to tell us they refilled. */
>>  if (head == vq->num) {
>> +if (tx_can_busy_poll(vq->dev, endtime)) {
>> +cpu_relax();
>> +goto again;
>> +}
>>  if (unlikely(vhost_enable_notify(&net->dev, vq))) {
>>  vhost_disable_notify(&net->dev, vq);
>>  continue;
>> -- 
>> 1.8.3.1


Re: [PATCH net-next RFC 2/2] vhost_net: basic polling support

2015-10-23 Thread Jason Wang


On 10/23/2015 12:16 AM, Michael S. Tsirkin wrote:
> On Thu, Oct 22, 2015 at 08:46:33AM -0700, Rick Jones wrote:
>> On 10/22/2015 02:33 AM, Michael S. Tsirkin wrote:
>>> On Thu, Oct 22, 2015 at 01:27:29AM -0400, Jason Wang wrote:
>>>> This patch tries to poll for newly added tx buffers for a while at the
>>>> end of tx processing. The maximum time spent on polling is limited
>>>> through a module parameter. To avoid blocking rx, the loop ends when
>>>> there are other works queued on vhost, so in fact the socket receive
>>>> queue is also polled.
>>>>
>>>> busyloop_timeout = 50 gives us following improvement on TCP_RR test:
>>>>
>>>> size/session/+thu%/+normalize%
>>>> 1/ 1/   +5%/  -20%
>>>> 1/50/  +17%/   +3%
>>> Is there a measureable increase in cpu utilization
>>> with busyloop_timeout = 0?
>> And since a netperf TCP_RR test is involved, be careful about what netperf
>> reports for CPU util if that increase isn't in the context of the guest OS.

Right, the cpu utilization is measured on the host.

>>
>> For completeness, looking at the effect on TCP_STREAM and TCP_MAERTS,
>> aggregate _RR and even aggregate _RR/packets per second for many VMs on the
>> same system would be in order.
>>
>> happy benchmarking,
>>
>> rick jones
> Absolutely, merging a new kernel API just for a specific
> benchmark doesn't make sense.
> I'm guessing this is just an early RFC, a fuller submission
> will probably include more numbers.
>

Yes, will run more complete tests.

Thanks



[PATCH net-next RFC 2/2] vhost_net: basic polling support

2015-10-21 Thread Jason Wang
This patch tries to poll for newly added tx buffers for a while at the
end of tx processing. The maximum time spent on polling is limited
through a module parameter. To avoid blocking rx, the loop will end if
there are other works queued on vhost, so in fact the socket receive
queue is also polled.

busyloop_timeout = 50 gives us following improvement on TCP_RR test:

size/session/+thu%/+normalize%
1/ 1/   +5%/  -20%
1/50/  +17%/   +3%

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/vhost/net.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9eda69e..bbb522a 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -31,7 +31,9 @@
 #include "vhost.h"
 
 static int experimental_zcopytx = 1;
+static int busyloop_timeout = 50;
 module_param(experimental_zcopytx, int, 0444);
+module_param(busyloop_timeout, int, 0444);
 MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;"
   " 1 -Enable; 0 - Disable");
 
@@ -287,12 +289,23 @@ static void vhost_zerocopy_callback(struct ubuf_info 
*ubuf, bool success)
rcu_read_unlock_bh();
 }
 
+static bool tx_can_busy_poll(struct vhost_dev *dev,
+unsigned long endtime)
+{
+   unsigned long now = local_clock() >> 10;
+
+   return busyloop_timeout && !need_resched() &&
+  !time_after(now, endtime) && !vhost_has_work(dev) &&
+  single_task_running();
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_tx(struct vhost_net *net)
 {
	struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
	struct vhost_virtqueue *vq = &nvq->vq;
+   unsigned long endtime;
unsigned out, in;
int head;
struct msghdr msg = {
@@ -331,6 +344,8 @@ static void handle_tx(struct vhost_net *net)
  % UIO_MAXIOV == nvq->done_idx))
break;
 
+   endtime  = (local_clock() >> 10) + busyloop_timeout;
+again:
head = vhost_get_vq_desc(vq, vq->iov,
 ARRAY_SIZE(vq->iov),
 &out, &in,
@@ -340,6 +355,10 @@ static void handle_tx(struct vhost_net *net)
break;
/* Nothing new?  Wait for eventfd to tell us they refilled. */
if (head == vq->num) {
+   if (tx_can_busy_poll(vq->dev, endtime)) {
+   cpu_relax();
+   goto again;
+   }
			if (unlikely(vhost_enable_notify(&net->dev, vq))) {
				vhost_disable_notify(&net->dev, vq);
continue;
-- 
1.8.3.1



[PATCH net-next RFC 1/2] vhost: introduce vhost_has_work()

2015-10-21 Thread Jason Wang
This patch introduces a helper which can give a hint about whether or
not there is work queued in the work list.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/vhost/vhost.c | 6 ++
 drivers/vhost/vhost.h | 1 +
 2 files changed, 7 insertions(+)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index eec2f11..d42d11e 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -245,6 +245,12 @@ void vhost_work_queue(struct vhost_dev *dev, struct 
vhost_work *work)
 }
 EXPORT_SYMBOL_GPL(vhost_work_queue);
 
+bool vhost_has_work(struct vhost_dev *dev)
+{
	return !list_empty(&dev->work_list);
+}
+EXPORT_SYMBOL_GPL(vhost_has_work);
+
 void vhost_poll_queue(struct vhost_poll *poll)
 {
	vhost_work_queue(poll->dev, &poll->work);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 4772862..ea0327d 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -37,6 +37,7 @@ struct vhost_poll {
 
 void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
 void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
+bool vhost_has_work(struct vhost_dev *dev);
 
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
 unsigned long mask, struct vhost_dev *dev);
-- 
1.8.3.1



[PATCH V6 4/6] kvm: fix zero length mmio searching

2015-09-15 Thread Jason Wang
Currently, if a zero length mmio eventfd is assigned on KVM_MMIO_BUS,
it will never be found by kvm_io_bus_cmp() since that function always
compares the kvm_io_range() with the length that the guest wrote. This
will cause, e.g. for vhost, the kick to be trapped by qemu userspace
instead of vhost. Fix this by comparing with zero length if the
iodevice is zero length.

Cc: sta...@vger.kernel.org
Cc: Gleb Natapov <g...@kernel.org>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 virt/kvm/kvm_main.c | 19 +--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index eb4c9d2..9af68db 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3157,10 +3157,25 @@ static void kvm_io_bus_destroy(struct kvm_io_bus *bus)
 static inline int kvm_io_bus_cmp(const struct kvm_io_range *r1,
 const struct kvm_io_range *r2)
 {
-   if (r1->addr < r2->addr)
+   gpa_t addr1 = r1->addr;
+   gpa_t addr2 = r2->addr;
+
+   if (addr1 < addr2)
return -1;
-   if (r1->addr + r1->len > r2->addr + r2->len)
+
+   /* If r2->len == 0, match the exact address.  If r2->len != 0,
+* accept any overlapping write.  Any order is acceptable for
+* overlapping ranges, because kvm_io_bus_get_first_dev ensures
+* we process all of them.
+*/
+   if (r2->len) {
+   addr1 += r1->len;
+   addr2 += r2->len;
+   }
+
+   if (addr1 > addr2)
return 1;
+
return 0;
 }
 
-- 
2.1.4



[PATCH V6 2/6] kvm: factor out core eventfd assign/deassign logic

2015-09-15 Thread Jason Wang
This patch factors out core eventfd assign/deassign logic and leaves
the argument checking and bus index selection to callers.

Cc: sta...@vger.kernel.org
Cc: Gleb Natapov <g...@kernel.org>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 virt/kvm/eventfd.c | 85 --
 1 file changed, 50 insertions(+), 35 deletions(-)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index e404806..0829c7f 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -771,40 +771,14 @@ static enum kvm_bus ioeventfd_bus_from_flags(__u32 flags)
return KVM_MMIO_BUS;
 }
 
-static int
-kvm_assign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
+static int kvm_assign_ioeventfd_idx(struct kvm *kvm,
+   enum kvm_bus bus_idx,
+   struct kvm_ioeventfd *args)
 {
-   enum kvm_bus  bus_idx;
-   struct _ioeventfd*p;
-   struct eventfd_ctx   *eventfd;
-   int   ret;
-
-   bus_idx = ioeventfd_bus_from_flags(args->flags);
-   /* must be natural-word sized, or 0 to ignore length */
-   switch (args->len) {
-   case 0:
-   case 1:
-   case 2:
-   case 4:
-   case 8:
-   break;
-   default:
-   return -EINVAL;
-   }
-
-   /* check for range overflow */
-   if (args->addr + args->len < args->addr)
-   return -EINVAL;
 
-   /* check for extra flags that we don't understand */
-   if (args->flags & ~KVM_IOEVENTFD_VALID_FLAG_MASK)
-   return -EINVAL;
-
-   /* ioeventfd with no length can't be combined with DATAMATCH */
-   if (!args->len &&
-   args->flags & (KVM_IOEVENTFD_FLAG_PIO |
-  KVM_IOEVENTFD_FLAG_DATAMATCH))
-   return -EINVAL;
+   struct eventfd_ctx *eventfd;
+   struct _ioeventfd *p;
+   int ret;
 
eventfd = eventfd_ctx_fdget(args->fd);
if (IS_ERR(eventfd))
@@ -873,14 +847,13 @@ fail:
 }
 
 static int
-kvm_deassign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
+kvm_deassign_ioeventfd_idx(struct kvm *kvm, enum kvm_bus bus_idx,
+  struct kvm_ioeventfd *args)
 {
-   enum kvm_bus  bus_idx;
struct _ioeventfd*p, *tmp;
struct eventfd_ctx   *eventfd;
int   ret = -ENOENT;
 
-   bus_idx = ioeventfd_bus_from_flags(args->flags);
eventfd = eventfd_ctx_fdget(args->fd);
if (IS_ERR(eventfd))
return PTR_ERR(eventfd);
@@ -918,6 +891,48 @@ kvm_deassign_ioeventfd(struct kvm *kvm, struct 
kvm_ioeventfd *args)
return ret;
 }
 
+static int kvm_deassign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
+{
+   enum kvm_bus bus_idx = ioeventfd_bus_from_flags(args->flags);
+
+   return kvm_deassign_ioeventfd_idx(kvm, bus_idx, args);
+}
+
+static int
+kvm_assign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
+{
+   enum kvm_bus  bus_idx;
+
+   bus_idx = ioeventfd_bus_from_flags(args->flags);
+   /* must be natural-word sized, or 0 to ignore length */
+   switch (args->len) {
+   case 0:
+   case 1:
+   case 2:
+   case 4:
+   case 8:
+   break;
+   default:
+   return -EINVAL;
+   }
+
+   /* check for range overflow */
+   if (args->addr + args->len < args->addr)
+   return -EINVAL;
+
+   /* check for extra flags that we don't understand */
+   if (args->flags & ~KVM_IOEVENTFD_VALID_FLAG_MASK)
+   return -EINVAL;
+
+   /* ioeventfd with no length can't be combined with DATAMATCH */
+   if (!args->len &&
+   args->flags & (KVM_IOEVENTFD_FLAG_PIO |
+  KVM_IOEVENTFD_FLAG_DATAMATCH))
+   return -EINVAL;
+
+   return kvm_assign_ioeventfd_idx(kvm, bus_idx, args);
+}
+
 int
 kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
 {
-- 
2.1.4



[PATCH V6 3/6] kvm: fix double free for fast mmio eventfd

2015-09-15 Thread Jason Wang
We register a wildcard mmio eventfd on two buses, once for KVM_MMIO_BUS
and once on KVM_FAST_MMIO_BUS, but with a single iodev
instance. This will lead to an issue: kvm_io_bus_destroy() knows
nothing about the devices on the two buses pointing to a single dev,
which will lead to a double free[1] during exit. Fix this by allocating
two instances of iodevs, then registering one on KVM_MMIO_BUS and the
other on KVM_FAST_MMIO_BUS.

CPU: 1 PID: 2894 Comm: qemu-system-x86 Not tainted 3.19.0-26-generic #28-Ubuntu
Hardware name: LENOVO 2356BG6/2356BG6, BIOS G7ET96WW (2.56 ) 09/12/2013
task: 88009ae0c4b0 ti: 88020e7f task.ti: 88020e7f
RIP: 0010:[]  [] 
ioeventfd_release+0x28/0x60 [kvm]
RSP: 0018:88020e7f3bc8  EFLAGS: 00010292
RAX: dead00200200 RBX: 8801ec19c900 RCX: 00018200016d
RDX: 8801ec19cf80 RSI: ea0008bf1d40 RDI: 8801ec19c900
RBP: 88020e7f3bd8 R08: 2fc75a01 R09: 00018200016d
R10: c07df6ae R11: 88022fc75a98 R12: 88021e7cc000
R13: 88021e7cca48 R14: 88021e7cca50 R15: 8801ec19c880
FS:  7fc1ee3e6700() GS:88023e24() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7f8f389d8000 CR3: 00023dc13000 CR4: 001427e0
Stack:
88021e7cc000  88020e7f3be8 c07e2622
88020e7f3c38 c07df69a 880232524160 88020e792d80
  880219b78c00 0008 8802321686a8
Call Trace:
[] ioeventfd_destructor+0x12/0x20 [kvm]
[] kvm_put_kvm+0xca/0x210 [kvm]
[] kvm_vcpu_release+0x18/0x20 [kvm]
[] __fput+0xe7/0x250
[] fput+0xe/0x10
[] task_work_run+0xd4/0xf0
[] do_exit+0x368/0xa50
[] ? recalc_sigpending+0x1f/0x60
[] do_group_exit+0x45/0xb0
[] get_signal+0x291/0x750
[] do_signal+0x28/0xab0
[] ? do_futex+0xdb/0x5d0
[] ? __wake_up_locked_key+0x18/0x20
[] ? SyS_futex+0x76/0x170
[] do_notify_resume+0x69/0xb0
[] int_signal+0x12/0x17
Code: 5d c3 90 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 8b 7f 20 
e8 06 d6 a5 c0 48 8b 43 08 48 8b 13 48 89 df 48 89 42 08 <48> 89 10 48 b8 00 01 
10 00 00
 RIP  [] ioeventfd_release+0x28/0x60 [kvm]
 RSP 

Cc: sta...@vger.kernel.org
Cc: Gleb Natapov <g...@kernel.org>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 virt/kvm/eventfd.c | 43 +--
 1 file changed, 25 insertions(+), 18 deletions(-)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 0829c7f..79db453 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -817,16 +817,6 @@ static int kvm_assign_ioeventfd_idx(struct kvm *kvm,
if (ret < 0)
goto unlock_fail;
 
-   /* When length is ignored, MMIO is also put on a separate bus, for
-* faster lookups.
-*/
-   if (!args->len && bus_idx == KVM_MMIO_BUS) {
-   ret = kvm_io_bus_register_dev(kvm, KVM_FAST_MMIO_BUS,
- p->addr, 0, &p->dev);
-   if (ret < 0)
-   goto register_fail;
-   }
-
kvm->buses[bus_idx]->ioeventfd_count++;
	list_add_tail(&p->list, &kvm->ioeventfds);
 
@@ -834,8 +824,6 @@ static int kvm_assign_ioeventfd_idx(struct kvm *kvm,
 
return 0;
 
-register_fail:
-   kvm_io_bus_unregister_dev(kvm, bus_idx, &p->dev);
 unlock_fail:
	mutex_unlock(&kvm->slots_lock);
 
@@ -874,10 +862,6 @@ kvm_deassign_ioeventfd_idx(struct kvm *kvm, enum kvm_bus 
bus_idx,
continue;
 
	kvm_io_bus_unregister_dev(kvm, bus_idx, &p->dev);
-   if (!p->length && p->bus_idx == KVM_MMIO_BUS) {
-   kvm_io_bus_unregister_dev(kvm, KVM_FAST_MMIO_BUS,
- &p->dev);
-   }
kvm->buses[bus_idx]->ioeventfd_count--;
ioeventfd_release(p);
ret = 0;
@@ -894,14 +878,19 @@ kvm_deassign_ioeventfd_idx(struct kvm *kvm, enum kvm_bus 
bus_idx,
 static int kvm_deassign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
 {
enum kvm_bus bus_idx = ioeventfd_bus_from_flags(args->flags);
+   int ret = kvm_deassign_ioeventfd_idx(kvm, bus_idx, args);
+
+   if (!args->len && bus_idx == KVM_MMIO_BUS)
+   kvm_deassign_ioeventfd_idx(kvm, KVM_FAST_MMIO_BUS, args);
 
-   return kvm_deassign_ioeventfd_idx(kvm, bus_idx, args);
+   return ret;
 }
 
 static int
 kvm_assign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
 {
enum kvm_bus  bus_idx;
+   int ret;
 
bus_idx = ioeventfd_bus_from_flags(args->flags);
/* must be natural-word sized, or 0 to ignore length */
@@ -930,7 +919,25 @@ kvm_assign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd 
*args)
   KVM_IOEVENTFD_FLAG_DATAMATCH))
return -EINVAL;
 
-   

[PATCH V6 6/6] kvm: add fast mmio capability

2015-09-15 Thread Jason Wang
Cc: Gleb Natapov <g...@kernel.org>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 Documentation/virtual/kvm/api.txt | 7 ++-
 include/uapi/linux/kvm.h  | 1 +
 virt/kvm/kvm_main.c   | 1 +
 3 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index d9eccee..26661ef 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1598,7 +1598,7 @@ provided event instead of triggering an exit.
 struct kvm_ioeventfd {
__u64 datamatch;
__u64 addr;/* legal pio/mmio address */
-   __u32 len; /* 1, 2, 4, or 8 bytes*/
+   __u32 len; /* 0, 1, 2, 4, or 8 bytes*/
__s32 fd;
__u32 flags;
__u8  pad[36];
@@ -1621,6 +1621,11 @@ to the registered address is equal to datamatch in 
struct kvm_ioeventfd.
 For virtio-ccw devices, addr contains the subchannel id and datamatch the
 virtqueue index.
 
+With KVM_CAP_FAST_MMIO, a zero length mmio eventfd is allowed, which
+lets the kernel ignore the length of the guest write and get a
+possibly faster response. Note the speedup may only work on some
+specific architectures and setups. Otherwise, it's as fast as a
+wildcard mmio eventfd.
 
 4.60 KVM_DIRTY_TLB
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index a9256f0..ad72a61 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -824,6 +824,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_MULTI_ADDRESS_SPACE 118
 #define KVM_CAP_GUEST_DEBUG_HW_BPS 119
 #define KVM_CAP_GUEST_DEBUG_HW_WPS 120
+#define KVM_CAP_FAST_MMIO 121
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9af68db..645f55d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2717,6 +2717,7 @@ static long kvm_vm_ioctl_check_extension_generic(struct 
kvm *kvm, long arg)
case KVM_CAP_IRQFD:
case KVM_CAP_IRQFD_RESAMPLE:
 #endif
+   case KVM_CAP_FAST_MMIO:
case KVM_CAP_CHECK_EXTENSION_VM:
return 1;
 #ifdef CONFIG_HAVE_KVM_IRQ_ROUTING
-- 
2.1.4



[PATCH V6 5/6] kvm: add tracepoint for fast mmio

2015-09-15 Thread Jason Wang
Cc: Gleb Natapov <g...@kernel.org>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 arch/x86/kvm/trace.h | 18 ++
 arch/x86/kvm/vmx.c   |  1 +
 arch/x86/kvm/x86.c   |  1 +
 3 files changed, 20 insertions(+)

diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 4eae7c3..ce4abe3 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -129,6 +129,24 @@ TRACE_EVENT(kvm_pio,
 );
 
 /*
+ * Tracepoint for fast mmio.
+ */
+TRACE_EVENT(kvm_fast_mmio,
+   TP_PROTO(u64 gpa),
+   TP_ARGS(gpa),
+
+   TP_STRUCT__entry(
+   __field(u64,gpa)
+   ),
+
+   TP_fast_assign(
+   __entry->gpa= gpa;
+   ),
+
+   TP_printk("fast mmio at gpa 0x%llx", __entry->gpa)
+);
+
+/*
  * Tracepoint for cpuid.
  */
 TRACE_EVENT(kvm_cpuid,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index d019868..ff1234a 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5767,6 +5767,7 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
if (!kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) {
skip_emulated_instruction(vcpu);
+   trace_kvm_fast_mmio(gpa);
return 1;
}
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a60bdbc..1ec3965 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8015,6 +8015,7 @@ bool kvm_arch_has_noncoherent_dma(struct kvm *kvm)
 EXPORT_SYMBOL_GPL(kvm_arch_has_noncoherent_dma);
 
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_msr);
-- 
2.1.4



[PATCH V6 1/6] kvm: don't try to register to KVM_FAST_MMIO_BUS for non mmio eventfd

2015-09-15 Thread Jason Wang
We only want zero length mmio eventfds to be registered on
KVM_FAST_MMIO_BUS, so when args->len is zero, check the bus index
explicitly to make sure of this.

Cc: sta...@vger.kernel.org
Cc: Gleb Natapov <g...@kernel.org>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 virt/kvm/eventfd.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 9ff4193..e404806 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -846,7 +846,7 @@ kvm_assign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd 
*args)
/* When length is ignored, MMIO is also put on a separate bus, for
 * faster lookups.
 */
-   if (!args->len && !(args->flags & KVM_IOEVENTFD_FLAG_PIO)) {
+   if (!args->len && bus_idx == KVM_MMIO_BUS) {
ret = kvm_io_bus_register_dev(kvm, KVM_FAST_MMIO_BUS,
  p->addr, 0, &p->dev);
if (ret < 0)
@@ -901,7 +901,7 @@ kvm_deassign_ioeventfd(struct kvm *kvm, struct 
kvm_ioeventfd *args)
continue;
 
	kvm_io_bus_unregister_dev(kvm, bus_idx, &p->dev);
-   if (!p->length) {
+   if (!p->length && p->bus_idx == KVM_MMIO_BUS) {
kvm_io_bus_unregister_dev(kvm, KVM_FAST_MMIO_BUS,
  &p->dev);
}
-- 
2.1.4



[PATCH V6 0/6] Fast mmio eventfd fixes

2015-09-15 Thread Jason Wang
Hi:

This series fixes two issues of fast mmio eventfd:

1) A single iodev instance was registered on two buses: KVM_MMIO_BUS
   and KVM_FAST_MMIO_BUS. This will cause a double free in
   ioeventfd_destructor().
2) A zero length iodev on KVM_MMIO_BUS will never be found by
   kvm_io_bus_cmp(). This will lead to, e.g., the eventfd being
   trapped by qemu instead of the host.

1 is fixed by allocating two instances of iodev and introducing a new
capability for userspace. 2 is fixed by ignoring the actual length if
the length of the iodev is zero in kvm_io_bus_cmp().

Please review.

Changes from V5:
- move the patch that explicitly checks for KVM_MMIO_BUS to patch 1
  and remove the unnecessary checks
- even more grammar and typo fixes
- rebase to kvm.git
- document KVM_CAP_FAST_MMIO

Changes from V4:
- move the location of kvm_assign_ioeventfd() in patch 1, which
  reduces the change set
- commit log typo fixes
- switch to use kvm_deassign_ioeventfd_idx() when failing to register
  on the fast mmio bus
- change kvm_io_bus_cmp() per Paolo's suggestions
- introduce a new capability to avoid new userspace crashing old
  kernels
- add a new patch that only tries to register the mmio eventfd on the
  fast mmio bus

Changes from V3:

- Don't search on two buses when trying to do a write on
  KVM_MMIO_BUS. This fixes a small regression found by vmexit.flat.
- Since we don't search on two buses, change kvm_io_bus_cmp() so it
  can find zero length iodevs.
- Fix the unnecessary lines in the tracepoint patch.

Changes from V2:
- Tweak styles and comment suggested by Cornelia.

Changes from v1:
- change ioeventfd_bus_from_flags() to return KVM_FAST_MMIO_BUS when
  needed to save lots of unnecessary changes.

Jason Wang (6):
  kvm: don't try to register to KVM_FAST_MMIO_BUS for non mmio eventfd
  kvm: factor out core eventfd assign/deassign logic
  kvm: fix double free for fast mmio eventfd
  kvm: fix zero length mmio searching
  kvm: add tracepoint for fast mmio
  kvm: add fast mmio capability

 Documentation/virtual/kvm/api.txt |   7 ++-
 arch/x86/kvm/trace.h  |  18 ++
 arch/x86/kvm/vmx.c|   1 +
 arch/x86/kvm/x86.c|   1 +
 include/uapi/linux/kvm.h  |   1 +
 virt/kvm/eventfd.c| 124 ++
 virt/kvm/kvm_main.c   |  20 +-
 7 files changed, 118 insertions(+), 54 deletions(-)

-- 
2.1.4



Re: [PATCH V4 0/4] Fast MMIO eventfd fixes

2015-09-11 Thread Jason Wang


On 09/11/2015 04:33 PM, Paolo Bonzini wrote:
>
> On 11/09/2015 10:15, Michael S. Tsirkin wrote:
>> I think we should add a capability for fast mmio.
>> This way, userspace can avoid crashing buggy kernels.
> I agree.
>
> Paolo

Right, then qemu will use a datamatch eventfd if the kernel does not
have the capability.

Thanks



Re: [PATCH V4 1/4] kvm: factor out core eventfd assign/deassign logic

2015-09-11 Thread Jason Wang


On 09/11/2015 03:39 PM, Cornelia Huck wrote:
> On Fri, 11 Sep 2015 11:17:34 +0800
> Jason Wang <jasow...@redhat.com> wrote:
>
>> This patch factors out core eventfd assign/deassign logic and leaves
>> the argument checking and bus index selection to callers.
>>
>> Cc: Gleb Natapov <g...@kernel.org>
>> Cc: Paolo Bonzini <pbonz...@redhat.com>
>> Signed-off-by: Jason Wang <jasow...@redhat.com>
>> ---
>>  virt/kvm/eventfd.c | 83 
>> --
>>  1 file changed, 49 insertions(+), 34 deletions(-)
>>
>> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
>> index 9ff4193..163258d 100644
>> --- a/virt/kvm/eventfd.c
>> +++ b/virt/kvm/eventfd.c
>> @@ -771,40 +771,14 @@ static enum kvm_bus ioeventfd_bus_from_flags(__u32 
>> flags)
>>  return KVM_MMIO_BUS;
>>  }
>>
>> -static int
>> -kvm_assign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
>> +static int kvm_assign_ioeventfd_idx(struct kvm *kvm,
>> +enum kvm_bus bus_idx,
>> +struct kvm_ioeventfd *args)
>>  {
>> -enum kvm_bus  bus_idx;
>> -struct _ioeventfd*p;
>> -struct eventfd_ctx   *eventfd;
>> -int   ret;
>> -
>> -bus_idx = ioeventfd_bus_from_flags(args->flags);
>> -/* must be natural-word sized, or 0 to ignore length */
>> -switch (args->len) {
>> -case 0:
>> -case 1:
>> -case 2:
>> -case 4:
>> -case 8:
>> -break;
>> -default:
>> -return -EINVAL;
>> -}
>>
>> -/* check for range overflow */
>> -if (args->addr + args->len < args->addr)
>> -return -EINVAL;
>> -
>> -/* check for extra flags that we don't understand */
>> -if (args->flags & ~KVM_IOEVENTFD_VALID_FLAG_MASK)
>> -return -EINVAL;
>> -
>> -/* ioeventfd with no length can't be combined with DATAMATCH */
>> -if (!args->len &&
>> -args->flags & (KVM_IOEVENTFD_FLAG_PIO |
>> -   KVM_IOEVENTFD_FLAG_DATAMATCH))
>> -return -EINVAL;
>> +struct eventfd_ctx *eventfd;
>> +struct _ioeventfd *p;
>> +int ret;
>>
>>  eventfd = eventfd_ctx_fdget(args->fd);
>>  if (IS_ERR(eventfd))
>> @@ -873,14 +847,48 @@ fail:
>>  }
>>
>>  static int
>> -kvm_deassign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
>> +kvm_assign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
> You'll move this function to below the deassign function in patch 2.
> Maybe do it already here?
>

Yes, this can reduce the changes for patch2.


Re: [PATCH V4 2/4] kvm: fix double free for fast mmio eventfd

2015-09-11 Thread Jason Wang


On 09/11/2015 03:46 PM, Cornelia Huck wrote:
> On Fri, 11 Sep 2015 11:17:35 +0800
> Jason Wang <jasow...@redhat.com> wrote:
>
>> We register wildcard mmio eventfd on two buses, one for KVM_MMIO_BUS
>> and another is KVM_FAST_MMIO_BUS but with a single iodev
>> instance. This will lead an issue: kvm_io_bus_destroy() knows nothing
>> about the devices on two buses points to a single dev. Which will lead
> s/points/pointing/

Will fix this in V5.

>> double free[1] during exit. Fixing this by using allocate two
> s/using allocate/allocating/

Will fix this in V5.

>
>> instances of iodevs then register one on KVM_MMIO_BUS and another on
>> KVM_FAST_MMIO_BUS.
>>
> (...)
>
>> @@ -929,8 +878,66 @@ kvm_deassign_ioeventfd_idx(struct kvm *kvm, enum 
>> kvm_bus bus_idx,
>>  static int kvm_deassign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd 
>> *args)
>>  {
>>  enum kvm_bus bus_idx = ioeventfd_bus_from_flags(args->flags);
>> +int ret = kvm_deassign_ioeventfd_idx(kvm, bus_idx, args);
>> +
>> +if (!args->len)
>> +kvm_deassign_ioeventfd_idx(kvm, KVM_FAST_MMIO_BUS, args);
> I think it would be good to explicitly check for bus_idx ==
> KVM_MMIO_BUS here.

Ok.

>
>> +
>> +return ret;
>> +}
>>
>> -return kvm_deassign_ioeventfd_idx(kvm, bus_idx, args);
>> +static int
>> +kvm_assign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
>> +{
>> +enum kvm_bus  bus_idx;
>> +int ret;
>> +
>> +bus_idx = ioeventfd_bus_from_flags(args->flags);
>> +/* must be natural-word sized, or 0 to ignore length */
>> +switch (args->len) {
>> +case 0:
>> +case 1:
>> +case 2:
>> +case 4:
>> +case 8:
>> +break;
>> +default:
>> +return -EINVAL;
>> +}
>> +
>> +/* check for range overflow */
>> +if (args->addr + args->len < args->addr)
>> +return -EINVAL;
>> +
>> +/* check for extra flags that we don't understand */
>> +if (args->flags & ~KVM_IOEVENTFD_VALID_FLAG_MASK)
>> +return -EINVAL;
>> +
>> +/* ioeventfd with no length can't be combined with DATAMATCH */
>> +if (!args->len &&
>> +args->flags & (KVM_IOEVENTFD_FLAG_PIO |
>> +   KVM_IOEVENTFD_FLAG_DATAMATCH))
>> +return -EINVAL;
>> +
>> +ret = kvm_assign_ioeventfd_idx(kvm, bus_idx, args);
>> +if (ret)
>> +goto fail;
>> +
>> +/* When length is ignored, MMIO is also put on a separate bus, for
>> + * faster lookups.
>> + */
>> +if (!args->len && !(args->flags & KVM_IOEVENTFD_FLAG_PIO)) {
> Dito on a positive check for bus_idx == KVM_MMIO_BUS.

I was thinking maybe this should be done in a separate patch on top.
What's your opinion?

>> +ret = kvm_assign_ioeventfd_idx(kvm, KVM_FAST_MMIO_BUS, args);
>> +if (ret < 0)
>> +goto fast_fail;
>> +}
>> +
>> +return 0;
>> +
>> +fast_fail:
>> +kvm_deassign_ioeventfd(kvm, args);
> Shouldn't you use kvm_deassign_ioeventfd(kvm, bus_idx, args) here?

Actually, it's the same (the deassign of fast mmio will return -ENOENT
and will be ignored), but I admit doing what you suggested here is
better. Will do this.

Thanks



Re: [PATCH V4 3/4] kvm: fix zero length mmio searching

2015-09-11 Thread Jason Wang


On 09/11/2015 04:31 PM, Cornelia Huck wrote:
> On Fri, 11 Sep 2015 10:26:41 +0200
> Paolo Bonzini <pbonz...@redhat.com> wrote:
>
>> On 11/09/2015 05:17, Jason Wang wrote:
>>> +   int len = r2->len ? r1->len : 0;
>>> +
>>> if (r1->addr < r2->addr)
>>> return -1;
>>> -   if (r1->addr + r1->len > r2->addr + r2->len)
>>> +   if (r1->addr + len > r2->addr + r2->len)
>>> return 1;
>> Perhaps better:
>>
>>  gpa_t addr1 = r1->addr;
>>  gpa_t addr2 = r2->addr;
>>
>>  if (addr1 < addr2)
>>  return -1;
>>
>>  /* If r2->len == 0, match the exact address.  If r2->len != 0,
>>   * accept any overlapping write.  Any order is acceptable for
>>   * overlapping ranges, because kvm_io_bus_get_first_dev ensures
>>   * we process all of them.
>>   */
>>  if (r2->len) {
>>  addr1 += r1->len;
>>  addr2 += r2->len;
>>  }
>>
>>  if (addr1 > addr2)
>>  return 1;
>>
>>  return 0;
>>
> +1 to documenting what the semantics are :)
>

Right, better. Will fix this in V5.


[PATCH V4 3/4] kvm: fix zero length mmio searching

2015-09-10 Thread Jason Wang
Currently, if we have a zero length mmio eventfd assigned on
KVM_MMIO_BUS, it will never be found by kvm_io_bus_cmp() since it
always compares the kvm_io_range() with the length that the guest
wrote. This will lead, e.g. for vhost, to the kick being trapped by
qemu userspace instead of vhost. Fix this by using zero length if an
iodevice is zero length.

Cc: Gleb Natapov <g...@kernel.org>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 virt/kvm/kvm_main.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d8db2f8f..d4c3b66 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3071,9 +3071,11 @@ static void kvm_io_bus_destroy(struct kvm_io_bus *bus)
 static inline int kvm_io_bus_cmp(const struct kvm_io_range *r1,
 const struct kvm_io_range *r2)
 {
+   int len = r2->len ? r1->len : 0;
+
if (r1->addr < r2->addr)
return -1;
-   if (r1->addr + r1->len > r2->addr + r2->len)
+   if (r1->addr + len > r2->addr + r2->len)
return 1;
return 0;
 }
-- 
2.1.4



[PATCH V4 4/4] kvm: add tracepoint for fast mmio

2015-09-10 Thread Jason Wang
Cc: Gleb Natapov <g...@kernel.org>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 arch/x86/kvm/trace.h | 18 ++
 arch/x86/kvm/vmx.c   |  1 +
 arch/x86/kvm/x86.c   |  1 +
 3 files changed, 20 insertions(+)

diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 4eae7c3..ce4abe3 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -129,6 +129,24 @@ TRACE_EVENT(kvm_pio,
 );
 
 /*
+ * Tracepoint for fast mmio.
+ */
+TRACE_EVENT(kvm_fast_mmio,
+   TP_PROTO(u64 gpa),
+   TP_ARGS(gpa),
+
+   TP_STRUCT__entry(
+   __field(u64,gpa)
+   ),
+
+   TP_fast_assign(
+   __entry->gpa= gpa;
+   ),
+
+   TP_printk("fast mmio at gpa 0x%llx", __entry->gpa)
+);
+
+/*
  * Tracepoint for cpuid.
  */
 TRACE_EVENT(kvm_cpuid,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 4a4eec30..cb505b9 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5767,6 +5767,7 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
if (!kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) {
skip_emulated_instruction(vcpu);
+   trace_kvm_fast_mmio(gpa);
return 1;
}
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1e7e76e..f7c4042 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8013,6 +8013,7 @@ bool kvm_arch_has_noncoherent_dma(struct kvm *kvm)
 EXPORT_SYMBOL_GPL(kvm_arch_has_noncoherent_dma);
 
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_msr);
-- 
2.1.4



[PATCH V4 2/4] kvm: fix double free for fast mmio eventfd

2015-09-10 Thread Jason Wang
We register the wildcard mmio eventfd on two buses, one for KVM_MMIO_BUS
and another for KVM_FAST_MMIO_BUS, but with a single iodev
instance. This leads to an issue: kvm_io_bus_destroy() knows nothing
about the devices on the two buses pointing to a single dev, which leads
to a double free [1] during exit. Fix this by allocating two
instances of the iodev, then registering one on KVM_MMIO_BUS and the
other on KVM_FAST_MMIO_BUS.

CPU: 1 PID: 2894 Comm: qemu-system-x86 Not tainted 3.19.0-26-generic #28-Ubuntu
Hardware name: LENOVO 2356BG6/2356BG6, BIOS G7ET96WW (2.56 ) 09/12/2013
task: 88009ae0c4b0 ti: 88020e7f task.ti: 88020e7f
RIP: 0010:[]  [] 
ioeventfd_release+0x28/0x60 [kvm]
RSP: 0018:88020e7f3bc8  EFLAGS: 00010292
RAX: dead00200200 RBX: 8801ec19c900 RCX: 00018200016d
RDX: 8801ec19cf80 RSI: ea0008bf1d40 RDI: 8801ec19c900
RBP: 88020e7f3bd8 R08: 2fc75a01 R09: 00018200016d
R10: c07df6ae R11: 88022fc75a98 R12: 88021e7cc000
R13: 88021e7cca48 R14: 88021e7cca50 R15: 8801ec19c880
FS:  7fc1ee3e6700() GS:88023e24() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7f8f389d8000 CR3: 00023dc13000 CR4: 001427e0
Stack:
88021e7cc000  88020e7f3be8 c07e2622
88020e7f3c38 c07df69a 880232524160 88020e792d80
  880219b78c00 0008 8802321686a8
Call Trace:
[] ioeventfd_destructor+0x12/0x20 [kvm]
[] kvm_put_kvm+0xca/0x210 [kvm]
[] kvm_vcpu_release+0x18/0x20 [kvm]
[] __fput+0xe7/0x250
[] fput+0xe/0x10
[] task_work_run+0xd4/0xf0
[] do_exit+0x368/0xa50
[] ? recalc_sigpending+0x1f/0x60
[] do_group_exit+0x45/0xb0
[] get_signal+0x291/0x750
[] do_signal+0x28/0xab0
[] ? do_futex+0xdb/0x5d0
[] ? __wake_up_locked_key+0x18/0x20
[] ? SyS_futex+0x76/0x170
[] do_notify_resume+0x69/0xb0
[] int_signal+0x12/0x17
Code: 5d c3 90 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 8b 7f 20 
e8 06 d6 a5 c0 48 8b 43 08 48 8b 13 48 89 df 48 89 42 08 <48> 89 10 48 b8 00 01 
10 00 00
 RIP  [] ioeventfd_release+0x28/0x60 [kvm]
 RSP 

Cc: Gleb Natapov <g...@kernel.org>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Michael S. Tsirkin <m...@redhat.com>
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 virt/kvm/eventfd.c | 111 -
 1 file changed, 59 insertions(+), 52 deletions(-)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 163258d..1a023ac 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -817,16 +817,6 @@ static int kvm_assign_ioeventfd_idx(struct kvm *kvm,
if (ret < 0)
goto unlock_fail;
 
-   /* When length is ignored, MMIO is also put on a separate bus, for
-* faster lookups.
-*/
-   if (!args->len && !(args->flags & KVM_IOEVENTFD_FLAG_PIO)) {
-   ret = kvm_io_bus_register_dev(kvm, KVM_FAST_MMIO_BUS,
- p->addr, 0, >dev);
-   if (ret < 0)
-   goto register_fail;
-   }
-
kvm->buses[bus_idx]->ioeventfd_count++;
list_add_tail(>list, >ioeventfds);
 
@@ -834,8 +824,6 @@ static int kvm_assign_ioeventfd_idx(struct kvm *kvm,
 
return 0;
 
-register_fail:
-   kvm_io_bus_unregister_dev(kvm, bus_idx, >dev);
 unlock_fail:
mutex_unlock(>slots_lock);
 
@@ -847,41 +835,6 @@ fail:
 }
 
 static int
-kvm_assign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
-{
-   enum kvm_bus  bus_idx;
-
-   bus_idx = ioeventfd_bus_from_flags(args->flags);
-   /* must be natural-word sized, or 0 to ignore length */
-   switch (args->len) {
-   case 0:
-   case 1:
-   case 2:
-   case 4:
-   case 8:
-   break;
-   default:
-   return -EINVAL;
-   }
-
-   /* check for range overflow */
-   if (args->addr + args->len < args->addr)
-   return -EINVAL;
-
-   /* check for extra flags that we don't understand */
-   if (args->flags & ~KVM_IOEVENTFD_VALID_FLAG_MASK)
-   return -EINVAL;
-
-   /* ioeventfd with no length can't be combined with DATAMATCH */
-   if (!args->len &&
-   args->flags & (KVM_IOEVENTFD_FLAG_PIO |
-  KVM_IOEVENTFD_FLAG_DATAMATCH))
-   return -EINVAL;
-
-   return kvm_assign_ioeventfd_idx(kvm, bus_idx, args);
-}
-
-static int
 kvm_deassign_ioeventfd_idx(struct kvm *kvm, enum kvm_bus bus_idx,
   struct kvm_ioeventfd *args)
 {
@@ -909,10 +862,6 @@ kvm_deassign_ioeventfd_idx(struct kvm *kvm, enum kvm_bus 
bus_idx,
continue;
 
kvm_io_bus_unregister_dev(kvm, bus_idx, >dev);
-   if (!p->length) {
-

[PATCH V4 1/4] kvm: factor out core eventfd assign/deassign logic

2015-09-10 Thread Jason Wang
This patch factors out core eventfd assign/deassign logic and leave
the argument checking and bus index selection to callers.

Cc: Gleb Natapov <g...@kernel.org>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 virt/kvm/eventfd.c | 83 --
 1 file changed, 49 insertions(+), 34 deletions(-)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 9ff4193..163258d 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -771,40 +771,14 @@ static enum kvm_bus ioeventfd_bus_from_flags(__u32 flags)
return KVM_MMIO_BUS;
 }
 
-static int
-kvm_assign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
+static int kvm_assign_ioeventfd_idx(struct kvm *kvm,
+   enum kvm_bus bus_idx,
+   struct kvm_ioeventfd *args)
 {
-   enum kvm_bus  bus_idx;
-   struct _ioeventfd*p;
-   struct eventfd_ctx   *eventfd;
-   int   ret;
-
-   bus_idx = ioeventfd_bus_from_flags(args->flags);
-   /* must be natural-word sized, or 0 to ignore length */
-   switch (args->len) {
-   case 0:
-   case 1:
-   case 2:
-   case 4:
-   case 8:
-   break;
-   default:
-   return -EINVAL;
-   }
 
-   /* check for range overflow */
-   if (args->addr + args->len < args->addr)
-   return -EINVAL;
-
-   /* check for extra flags that we don't understand */
-   if (args->flags & ~KVM_IOEVENTFD_VALID_FLAG_MASK)
-   return -EINVAL;
-
-   /* ioeventfd with no length can't be combined with DATAMATCH */
-   if (!args->len &&
-   args->flags & (KVM_IOEVENTFD_FLAG_PIO |
-  KVM_IOEVENTFD_FLAG_DATAMATCH))
-   return -EINVAL;
+   struct eventfd_ctx *eventfd;
+   struct _ioeventfd *p;
+   int ret;
 
eventfd = eventfd_ctx_fdget(args->fd);
if (IS_ERR(eventfd))
@@ -873,14 +847,48 @@ fail:
 }
 
 static int
-kvm_deassign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
+kvm_assign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
 {
enum kvm_bus  bus_idx;
+
+   bus_idx = ioeventfd_bus_from_flags(args->flags);
+   /* must be natural-word sized, or 0 to ignore length */
+   switch (args->len) {
+   case 0:
+   case 1:
+   case 2:
+   case 4:
+   case 8:
+   break;
+   default:
+   return -EINVAL;
+   }
+
+   /* check for range overflow */
+   if (args->addr + args->len < args->addr)
+   return -EINVAL;
+
+   /* check for extra flags that we don't understand */
+   if (args->flags & ~KVM_IOEVENTFD_VALID_FLAG_MASK)
+   return -EINVAL;
+
+   /* ioeventfd with no length can't be combined with DATAMATCH */
+   if (!args->len &&
+   args->flags & (KVM_IOEVENTFD_FLAG_PIO |
+  KVM_IOEVENTFD_FLAG_DATAMATCH))
+   return -EINVAL;
+
+   return kvm_assign_ioeventfd_idx(kvm, bus_idx, args);
+}
+
+static int
+kvm_deassign_ioeventfd_idx(struct kvm *kvm, enum kvm_bus bus_idx,
+  struct kvm_ioeventfd *args)
+{
struct _ioeventfd*p, *tmp;
struct eventfd_ctx   *eventfd;
int   ret = -ENOENT;
 
-   bus_idx = ioeventfd_bus_from_flags(args->flags);
eventfd = eventfd_ctx_fdget(args->fd);
if (IS_ERR(eventfd))
return PTR_ERR(eventfd);
@@ -918,6 +926,13 @@ kvm_deassign_ioeventfd(struct kvm *kvm, struct 
kvm_ioeventfd *args)
return ret;
 }
 
+static int kvm_deassign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
+{
+   enum kvm_bus bus_idx = ioeventfd_bus_from_flags(args->flags);
+
+   return kvm_deassign_ioeventfd_idx(kvm, bus_idx, args);
+}
+
 int
 kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
 {
-- 
2.1.4



[PATCH V4 0/4] Fast MMIO eventfd fixes

2015-09-10 Thread Jason Wang
Hi:

This series fixes two issues of fast mmio eventfd:

1) A single iodev instance was registered on two buses: KVM_MMIO_BUS
   and KVM_FAST_MMIO_BUS. This causes a double free in
   ioeventfd_destructor().
2) A zero-length iodev on KVM_MMIO_BUS will never be found by
   kvm_io_bus_cmp(). This leads to, e.g., the eventfd being trapped by
   qemu instead of the host.

1 is fixed by allocating two instances of the iodev. 2 is fixed by
ignoring the actual length in kvm_io_bus_cmp() when the length of the
registered iodev is zero.

Please review.

Changes from V3:

- Don't search both buses when trying to do a write on
  KVM_MMIO_BUS. This fixes a small regression found by vmexit.flat.
- Since we don't search both buses, change kvm_io_bus_cmp() so
  that it can find zero-length iodevs.
- Drop the unnecessary blank lines in the tracepoint patch.

Changes from V2:
- Tweak style and comments as suggested by Cornelia.

Changes from v1:
- change ioeventfd_bus_from_flags() to return KVM_FAST_MMIO_BUS when
  needed to save lots of unnecessary changes.

Jason Wang (4):
  kvm: factor out core eventfd assign/deassign logic
  kvm: fix double free for fast mmio eventfd
  kvm: fix zero length mmio searching
  kvm: add tracepoint for fast mmio

 arch/x86/kvm/trace.h |  18 
 arch/x86/kvm/vmx.c   |   1 +
 arch/x86/kvm/x86.c   |   1 +
 virt/kvm/eventfd.c   | 124 ++-
 virt/kvm/kvm_main.c  |   4 +-
 5 files changed, 96 insertions(+), 52 deletions(-)

-- 
2.1.4



Re: [PATCH V3 2/3] kvm: don't register wildcard MMIO EVENTFD on two buses

2015-09-01 Thread Jason Wang


On 09/01/2015 02:54 PM, Michael S. Tsirkin wrote:
> On Tue, Sep 01, 2015 at 12:47:36PM +0800, Jason Wang wrote:
>>
>> On 09/01/2015 12:31 PM, Michael S. Tsirkin wrote:
>>> On Tue, Sep 01, 2015 at 11:33:43AM +0800, Jason Wang wrote:
>>>> On 08/31/2015 07:33 PM, Michael S. Tsirkin wrote:
>>>>> On Mon, Aug 31, 2015 at 04:03:59PM +0800, Jason Wang wrote:
>>>>>>> On 08/31/2015 03:29 PM, Michael S. Tsirkin wrote:
>>>>>>>>>>>>>>>>> Thinking more about this, invoking the 0-length write after
>>>>>>>>>>>>>>>>>>>>>>>>> the != 0 length one would be better: it would mean we 
>>>>>>>>>>>>>>>>>>>>>>>>> only
>>>>>>>>>>>>>>>>>>>>>>>>> handle the userspace MMIO like this.
>>>>>>>>>>>>>>>>> Right.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>> Using current unittest. This patch is about 2.9% slower than 
>>>>>>>>>>>>> before, and
>>>>>>>>>>>>> invoking 0-length write after is still 1.1% slower 
>>>>>>>>>>>>> (mmio-datamatch-eventfd).
>>>>>>>>>>>>>
>>>>>>>>>>>>> /patch/result/-+%/
>>>>>>>>>>>>> /base/2957/0/
>>>>>>>>>>>>> /V3/3043/+2.9%/
>>>>>>>>>>>>> /V3+invoking != 0 length first/2990/+1.1%/
>>>>>>>>>>>>>
>>>>>>>>>>>>> So looks like the best method is not searching KVM_FAST_MMIO_BUS 
>>>>>>>>>>>>> during
>>>>>>>>>>>>> KVM_MMIO_BUS. Instead, let userspace to register both datamatch 
>>>>>>>>>>>>> and
>>>>>>>>>>>>> wildcard in this case. Does this sound good to you?
>>>>>>>>> No - we can't change userspace.
>>>>>>> Actually, the change was as simple as following. So I don't get the
>>>>>>> reason why.
>>>>> Because it's too late - we committed to a specific userspace ABI
>>>>> when this was merged in kernel, we must maintain it.
>>>> Ok ( Though I don't think it has real users for this now because it was
>>>> actually broken).
>>> It actually worked most of the time - you only trigger a use after free
>>> on deregister.
>>>
>> It doesn't work for amd and intel machine without ept.
> I thought it does :(
>
>>>>> Even if I thought yours is a good API (and I don't BTW - it's exposing
>>>>> internal implementation details) it's too late to change it.
>>>> I believe we should document the special treatment in kernel of zero
>>>> length mmio eventfd in api.txt? If yes, is this an exposing? If not, how
>>>> can userspace know the advantages of this and use it? For better API,
>>>> probably we need another new flag just for fast mmio and obsolete
>>>> current one by failing the assigning for zero length mmio eventfd.
>>> I sent a patch to update api.txt already as part of
>>> kvm: add KVM_CAP_IOEVENTFD_PF capability.
>>> I should probably split it out.
>>>
>>> Sorry, I don't think the api change you propose makes sense - just fix the
>>> crash in the existing one.
>>>
>> Ok, so I believe the fix should go:
>>
>> - having two ioeventfds when we want to assign zero length mmio eventfd
> You mean the in-kernel data structures?

Yes.

>
>> - change the kvm_io_bus_sort_cmp() and can handle zero length correctly
> This one's for amd/non ept, right? I'd rather we implemented the
> fast mmio optimization for these.

Agreed, but shouldn't we fix this and backport it to stable first?



Re: [PATCH V3 2/3] kvm: don't register wildcard MMIO EVENTFD on two buses

2015-08-31 Thread Jason Wang


On 08/31/2015 07:33 PM, Michael S. Tsirkin wrote:
> On Mon, Aug 31, 2015 at 04:03:59PM +0800, Jason Wang wrote:
>> > 
>> > 
>> > On 08/31/2015 03:29 PM, Michael S. Tsirkin wrote:
>>>>>>> > >>>>> Thinking more about this, invoking the 0-length write after
>>>>>>>>>>> > >>>>> > >> > the != 0 length one would be better: it would mean 
>>>>>>>>>>> > >>>>> > >> > we only
>>>>>>>>>>> > >>>>> > >> > handle the userspace MMIO like this.
>>>>>>> > >>> > > Right.
>>>>>>> > >>> > >
>>>>> > >> > 
>>>>> > >> > Using current unittest. This patch is about 2.9% slower than 
>>>>> > >> > before, and
>>>>> > >> > invoking 0-length write after is still 1.1% slower 
>>>>> > >> > (mmio-datamatch-eventfd).
>>>>> > >> > 
>>>>> > >> > /patch/result/-+%/
>>>>> > >> > /base/2957/0/
>>>>> > >> > /V3/3043/+2.9%/
>>>>> > >> > /V3+invoking != 0 length first/2990/+1.1%/
>>>>> > >> > 
>>>>> > >> > So looks like the best method is not searching KVM_FAST_MMIO_BUS 
>>>>> > >> > during
>>>>> > >> > KVM_MMIO_BUS. Instead, let userspace to register both datamatch and
>>>>> > >> > wildcard in this case. Does this sound good to you?
>>> > > No - we can't change userspace.
>> > 
>> > Actually, the change was as simple as following. So I don't get the
>> > reason why.
> Because it's too late - we committed to a specific userspace ABI
> when this was merged in kernel, we must maintain it.

OK (though I don't think this has real users yet, because it was
actually broken).

> Even if I thought yours is a good API (and I don't BTW - it's exposing
> internal implementation details) it's too late to change it.

I think we should document the kernel's special treatment of zero-length
mmio eventfds in api.txt. But if we do, isn't that also an exposure? And
if we don't, how can userspace know the advantages of this and use it?
For a better API, we probably need a new flag just for fast mmio and
should obsolete the current one by failing the assignment of zero-length
mmio eventfds.



Re: [PATCH V3 2/3] kvm: don't register wildcard MMIO EVENTFD on two buses

2015-08-31 Thread Jason Wang


On 08/31/2015 03:29 PM, Michael S. Tsirkin wrote:
> Thinking more about this, invoking the 0-length write after
> > >> > the != 0 length one would be better: it would mean we only
> > >> > handle the userspace MMIO like this.
>>> > > Right.
>>> > >
>> > 
>> > Using current unittest. This patch is about 2.9% slower than before, and
>> > invoking 0-length write after is still 1.1% slower 
>> > (mmio-datamatch-eventfd).
>> > 
>> > /patch/result/-+%/
>> > /base/2957/0/
>> > /V3/3043/+2.9%/
>> > /V3+invoking != 0 length first/2990/+1.1%/
>> > 
>> > So looks like the best method is not searching KVM_FAST_MMIO_BUS during
>> > KVM_MMIO_BUS. Instead, let userspace to register both datamatch and
>> > wildcard in this case. Does this sound good to you?
> No - we can't change userspace.

Actually, the change is as simple as the following, so I don't see the
reason why not.

diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
index 9935029..42ee986 100644
--- a/hw/virtio/virtio-pci.c
+++ b/hw/virtio/virtio-pci.c
@@ -288,6 +288,8 @@ static int
virtio_pci_set_host_notifier_internal(VirtIOPCIProxy *proxy,
 if (modern) {
 memory_region_add_eventfd(modern_mr, modern_addr, 2,
   true, n, notifier);
+memory_region_add_eventfd(modern_mr, modern_addr, 0,
+  false, n, notifier);
 }
 if (legacy) {
 memory_region_add_eventfd(legacy_mr, legacy_addr, 2,
@@ -297,6 +299,8 @@ static int
virtio_pci_set_host_notifier_internal(VirtIOPCIProxy *proxy,
 if (modern) {
 memory_region_del_eventfd(modern_mr, modern_addr, 2,
   true, n, notifier);
+memory_region_del_eventfd(modern_mr, modern_addr, 0,
+  false, n, notifier);
 }
 if (legacy) {
 memory_region_del_eventfd(legacy_mr, legacy_addr, 2,



Re: [PATCH RFC 1/3] vmx: allow ioeventfd for EPT violations

2015-08-31 Thread Jason Wang


On 08/30/2015 05:12 PM, Michael S. Tsirkin wrote:
> Even when we skip data decoding, MMIO is slightly slower
> than port IO because it uses the page-tables, so the CPU
> must do a pagewalk on each access.
>
> This overhead is normally masked by using the TLB cache:
> but not so for KVM MMIO, where PTEs are marked as reserved
> and so are never cached.
>
> As ioeventfd memory is never read, make it possible to use
> RO pages on the host for ioeventfds, instead.
> The result is that TLBs are cached, which finally makes MMIO
> as fast as port IO.
>
> Signed-off-by: Michael S. Tsirkin 
> ---
>  arch/x86/kvm/vmx.c | 5 +
>  1 file changed, 5 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 9d1bfd3..ed44026 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -5745,6 +5745,11 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
>   vmcs_set_bits(GUEST_INTERRUPTIBILITY_INFO, 
> GUEST_INTR_STATE_NMI);
>  
>   gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
> + if (!kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) {
> + skip_emulated_instruction(vcpu);
> + return 1;
> + }
> +
>   trace_kvm_page_fault(gpa, exit_qualification);
>  
>   /* It is a write fault? */

Just noticed that vcpu_mmio_write() tries the lapic first. Should we do
the same here? Otherwise we may slow down apic accesses, considering we
may have hundreds of eventfds.


Re: [PATCH RFC 1/3] vmx: allow ioeventfd for EPT violations

2015-08-31 Thread Jason Wang


On 09/01/2015 12:36 PM, Michael S. Tsirkin wrote:
> On Tue, Sep 01, 2015 at 11:37:13AM +0800, Jason Wang wrote:
>> > 
>> > 
>> > On 08/30/2015 05:12 PM, Michael S. Tsirkin wrote:
>>> > > Even when we skip data decoding, MMIO is slightly slower
>>> > > than port IO because it uses the page-tables, so the CPU
>>> > > must do a pagewalk on each access.
>>> > >
>>> > > This overhead is normally masked by using the TLB cache:
>>> > > but not so for KVM MMIO, where PTEs are marked as reserved
>>> > > and so are never cached.
>>> > >
>>> > > As ioeventfd memory is never read, make it possible to use
>>> > > RO pages on the host for ioeventfds, instead.
>>> > > The result is that TLBs are cached, which finally makes MMIO
>>> > > as fast as port IO.
>>> > >
>>> > > Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
>>> > > ---
>>> > >  arch/x86/kvm/vmx.c | 5 +
>>> > >  1 file changed, 5 insertions(+)
>>> > >
>>> > > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>> > > index 9d1bfd3..ed44026 100644
>>> > > --- a/arch/x86/kvm/vmx.c
>>> > > +++ b/arch/x86/kvm/vmx.c
>>> > > @@ -5745,6 +5745,11 @@ static int handle_ept_violation(struct kvm_vcpu 
>>> > > *vcpu)
>>> > > vmcs_set_bits(GUEST_INTERRUPTIBILITY_INFO, 
>>> > > GUEST_INTR_STATE_NMI);
>>> > >  
>>> > > gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
>>> > > +   if (!kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) {
>>> > > +   skip_emulated_instruction(vcpu);
>>> > > +   return 1;
>>> > > +   }
>>> > > +
>>> > > trace_kvm_page_fault(gpa, exit_qualification);
>>> > >  
>>> > > /* It is a write fault? */
>> > 
>> > Just notice that vcpu_mmio_write() tries lapic first. Should we do the
>> > same here? Otherwise we may slow down apic access consider we may have
>> > hundreds of eventfds.
> IIUC this does not affect mmio at all: for mmio we set
> reserved page flag, so they trigger an EPT misconfiguration,
> not an EPT violation.

I see, so the same question could be asked of the current
misconfiguration handler instead?


Re: [PATCH V3 2/3] kvm: don't register wildcard MMIO EVENTFD on two buses

2015-08-31 Thread Jason Wang


On 09/01/2015 12:31 PM, Michael S. Tsirkin wrote:
> On Tue, Sep 01, 2015 at 11:33:43AM +0800, Jason Wang wrote:
>>
>> On 08/31/2015 07:33 PM, Michael S. Tsirkin wrote:
>>> On Mon, Aug 31, 2015 at 04:03:59PM +0800, Jason Wang wrote:
>>>>>
>>>>> On 08/31/2015 03:29 PM, Michael S. Tsirkin wrote:
>>>>>>>>>>>>>>> Thinking more about this, invoking the 0-length write after
>>>>>>>>>>>>>>>>>>>>>>> the != 0 length one would be better: it would mean we 
>>>>>>>>>>>>>>>>>>>>>>> only
>>>>>>>>>>>>>>>>>>>>>>> handle the userspace MMIO like this.
>>>>>>>>>>>>>>> Right.
>>>>>>>>>>>>>>>
>>>>>>>>>>> Using current unittest. This patch is about 2.9% slower than 
>>>>>>>>>>> before, and
>>>>>>>>>>> invoking 0-length write after is still 1.1% slower 
>>>>>>>>>>> (mmio-datamatch-eventfd).
>>>>>>>>>>>
>>>>>>>>>>> /patch/result/-+%/
>>>>>>>>>>> /base/2957/0/
>>>>>>>>>>> /V3/3043/+2.9%/
>>>>>>>>>>> /V3+invoking != 0 length first/2990/+1.1%/
>>>>>>>>>>>
>>>>>>>>>>> So looks like the best method is not searching KVM_FAST_MMIO_BUS 
>>>>>>>>>>> during
>>>>>>>>>>> KVM_MMIO_BUS. Instead, let userspace to register both datamatch and
>>>>>>>>>>> wildcard in this case. Does this sound good to you?
>>>>>>> No - we can't change userspace.
>>>>> Actually, the change was as simple as following. So I don't get the
>>>>> reason why.
>>> Because it's too late - we committed to a specific userspace ABI
>>> when this was merged in kernel, we must maintain it.
>> Ok ( Though I don't think it has real users for this now because it was
>> actually broken).
> It actually worked most of the time - you only trigger a use after free
> on deregister.
>

It doesn't work for amd and intel machine without ept.

>>> Even if I thought yours is a good API (and I don't BTW - it's exposing
>>> internal implementation details) it's too late to change it.
>> I believe we should document the special treatment in kernel of zero
>> length mmio eventfd in api.txt? If yes, is this an exposing? If not, how
>> can userspace know the advantages of this and use it? For better API,
>> probably we need another new flag just for fast mmio and obsolete
>> current one by failing the assigning for zero length mmio eventfd.
> I sent a patch to update api.txt already as part of
> kvm: add KVM_CAP_IOEVENTFD_PF capability.
> I should probably split it out.
>
> Sorry, I don't think the api change you propose makes sense - just fix the
> crash in the existing one.
>

OK, so I believe the fix should be:

- have two ioeventfds when we want to assign a zero-length mmio eventfd
- change kvm_io_bus_sort_cmp() so it can handle zero length correctly

What are your thoughts?


Re: [PATCH V3 2/3] kvm: don't register wildcard MMIO EVENTFD on two buses

2015-08-30 Thread Jason Wang


On 08/26/2015 01:10 PM, Jason Wang wrote:
 On 08/25/2015 07:51 PM, Michael S. Tsirkin wrote:
  On Tue, Aug 25, 2015 at 05:05:47PM +0800, Jason Wang wrote:
   We register wildcard mmio eventfd on two buses, one for KVM_MMIO_BUS
   and another is KVM_FAST_MMIO_BUS. This leads to issue:
   
   - kvm_io_bus_destroy() knows nothing about the devices on two buses
 points to a single dev. Which will lead double free [1] during exit.
   - wildcard eventfd ignores data len, so it was registered as a
 kvm_io_range with zero length. This will fail the binary search in
 kvm_io_bus_get_first_dev() when we try to emulate through
 KVM_MMIO_BUS. This will cause userspace io emulation request instead
 of a eventfd notification (virtqueue kick will be trapped by qemu
 instead of vhost in this case).
   
   Fixing this by don't register wildcard mmio eventfd on two
   buses. Instead, only register it in KVM_FAST_MMIO_BUS. This fixes the
   double free issue of kvm_io_bus_destroy(). For the arch/setups that
   does not utilize KVM_FAST_MMIO_BUS, before searching KVM_MMIO_BUS, try
   KVM_FAST_MMIO_BUS first to see it it has a match.
   
   [1] Panic caused by double free:
   
   CPU: 1 PID: 2894 Comm: qemu-system-x86 Not tainted 3.19.0-26-generic 
   #28-Ubuntu
   Hardware name: LENOVO 2356BG6/2356BG6, BIOS G7ET96WW (2.56 ) 
   09/12/2013
   task: 88009ae0c4b0 ti: 88020e7f task.ti: 88020e7f
   RIP: 0010:[c07e25d8]  [c07e25d8] 
   ioeventfd_release+0x28/0x60 [kvm]
   RSP: 0018:88020e7f3bc8  EFLAGS: 00010292
   RAX: dead00200200 RBX: 8801ec19c900 RCX: 00018200016d
   RDX: 8801ec19cf80 RSI: ea0008bf1d40 RDI: 8801ec19c900
   RBP: 88020e7f3bd8 R08: 2fc75a01 R09: 00018200016d
   R10: c07df6ae R11: 88022fc75a98 R12: 88021e7cc000
   R13: 88021e7cca48 R14: 88021e7cca50 R15: 8801ec19c880
   FS:  7fc1ee3e6700() GS:88023e24() 
   knlGS:
   CS:  0010 DS:  ES:  CR0: 80050033
   CR2: 7f8f389d8000 CR3: 00023dc13000 CR4: 001427e0
   Stack:
   88021e7cc000  88020e7f3be8 c07e2622
   88020e7f3c38 c07df69a 880232524160 88020e792d80
 880219b78c00 0008 8802321686a8
   Call Trace:
   [c07e2622] ioeventfd_destructor+0x12/0x20 [kvm]
   [c07df69a] kvm_put_kvm+0xca/0x210 [kvm]
   [c07df818] kvm_vcpu_release+0x18/0x20 [kvm]
   [811f69f7] __fput+0xe7/0x250
   [811f6bae] fput+0xe/0x10
   [81093f04] task_work_run+0xd4/0xf0
   [81079358] do_exit+0x368/0xa50
   [81082c8f] ? recalc_sigpending+0x1f/0x60
   [81079ad5] do_group_exit+0x45/0xb0
   [81085c71] get_signal+0x291/0x750
   [810144d8] do_signal+0x28/0xab0
   [810f3a3b] ? do_futex+0xdb/0x5d0
   [810b7028] ? __wake_up_locked_key+0x18/0x20
   [810f3fa6] ? SyS_futex+0x76/0x170
   [81014fc9] do_notify_resume+0x69/0xb0
   [817cb9af] int_signal+0x12/0x17
   Code: 5d c3 90 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 
   8b 7f 20 e8 06 d6 a5 c0 48 8b 43 08 48 8b 13 48 89 df 48 89 42 08 
   48 89 10 48 b8 00 01 10 00 00
   RIP  [c07e25d8] ioeventfd_release+0x28/0x60 [kvm]
   RSP 88020e7f3bc8
   
   Cc: Gleb Natapov g...@kernel.org
   Cc: Paolo Bonzini pbonz...@redhat.com
   Cc: Michael S. Tsirkin m...@redhat.com
   Signed-off-by: Jason Wang jasow...@redhat.com
   ---
   Changes from V2:
   - Tweak styles and comment suggested by Cornelia.
   Changes from v1:
   - change ioeventfd_bus_from_flags() to return KVM_FAST_MMIO_BUS when
 needed to save lots of unnecessary changes.
   ---
virt/kvm/eventfd.c  | 31 +--
virt/kvm/kvm_main.c | 16 ++--
2 files changed, 23 insertions(+), 24 deletions(-)
   
   diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
   index 9ff4193..c3ffdc3 100644
   --- a/virt/kvm/eventfd.c
   +++ b/virt/kvm/eventfd.c
   @@ -762,13 +762,16 @@ ioeventfd_check_collision(struct kvm *kvm, 
   struct _ioeventfd *p)
 return false;
}

   -static enum kvm_bus ioeventfd_bus_from_flags(__u32 flags)
   +static enum kvm_bus ioeventfd_bus_from_args(struct kvm_ioeventfd 
   *args)
{
   - if (flags  KVM_IOEVENTFD_FLAG_PIO)
   + if (args-flags  KVM_IOEVENTFD_FLAG_PIO)
 return KVM_PIO_BUS;
   - if (flags  KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY)
   + if (args-flags  KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY)
 return KVM_VIRTIO_CCW_NOTIFY_BUS;
   - return KVM_MMIO_BUS;
   + /* When length is ignored, MMIO is put on a separate bus, for
   +  * faster lookups.
   +  */
   + return args-len ? KVM_MMIO_BUS : KVM_FAST_MMIO_BUS;
}

static int
   @@ -779,7 +782,7 @@ kvm_assign_ioeventfd(struct kvm *kvm, struct 
   kvm_ioeventfd *args

Re: [PATCH V2 3/3] kvm: add tracepoint for fast mmio

2015-08-25 Thread Jason Wang


On 08/25/2015 07:34 PM, Michael S. Tsirkin wrote:
 On Tue, Aug 25, 2015 at 03:47:15PM +0800, Jason Wang wrote:
  Cc: Gleb Natapov g...@kernel.org
  Cc: Paolo Bonzini pbonz...@redhat.com
  Cc: Michael S. Tsirkin m...@redhat.com
  Signed-off-by: Jason Wang jasow...@redhat.com
  ---
   arch/x86/kvm/trace.h | 17 +
   arch/x86/kvm/vmx.c   |  1 +
   arch/x86/kvm/x86.c   |  1 +
   3 files changed, 19 insertions(+)
  
  diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
  index 4eae7c3..2d4e81a 100644
  --- a/arch/x86/kvm/trace.h
  +++ b/arch/x86/kvm/trace.h
  @@ -128,6 +128,23 @@ TRACE_EVENT(kvm_pio,
   __entry-count  1 ? (...) : )
   );
   
  +TRACE_EVENT(kvm_fast_mmio,
  +  TP_PROTO(u64 gpa),
  +  TP_ARGS(gpa),
  +
  +  TP_STRUCT__entry(
  +  __field(u64,gpa)
  +  ),
  +
  +  TP_fast_assign(
  +  __entry-gpa= gpa;
  +  ),
  +
  +  TP_printk(fast mmio at gpa 0x%llx, __entry-gpa)
  +);
  +
  +
  +
 don't add multiple empty lines please.


Ok


Re: [PATCH V2 2/3] kvm: don't register wildcard MMIO EVENTFD on two buses

2015-08-25 Thread Jason Wang


On 08/25/2015 07:33 PM, Michael S. Tsirkin wrote:
 On Tue, Aug 25, 2015 at 03:47:14PM +0800, Jason Wang wrote:
  We register wildcard mmio eventfd on two buses, one for KVM_MMIO_BUS
  and another is KVM_FAST_MMIO_BUS. This leads to issue:
  
  - kvm_io_bus_destroy() knows nothing about the devices on two buses
points to a single dev. Which will lead double free [1] during exit.
  - wildcard eventfd ignores data len, so it was registered as a
kvm_io_range with zero length. This will fail the binary search in
kvm_io_bus_get_first_dev() when we try to emulate through
KVM_MMIO_BUS. This will cause userspace io emulation request instead
of a eventfd notification (virtqueue kick will be trapped by qemu
instead of vhost in this case).
  
  Fixing this by don't register wildcard mmio eventfd on two
  buses. Instead, only register it in KVM_FAST_MMIO_BUS. This fixes the
  double free issue of kvm_io_bus_destroy(). For the arch/setups that
  does not utilize KVM_FAST_MMIO_BUS, before searching KVM_MMIO_BUS, try
  KVM_FAST_MMIO_BUS first to see it it has a match.
  
  [1] Panic caused by double free:
  
  CPU: 1 PID: 2894 Comm: qemu-system-x86 Not tainted 3.19.0-26-generic 
  #28-Ubuntu
  Hardware name: LENOVO 2356BG6/2356BG6, BIOS G7ET96WW (2.56 ) 09/12/2013
  task: 88009ae0c4b0 ti: 88020e7f task.ti: 88020e7f
  RIP: 0010:[c07e25d8]  [c07e25d8] 
  ioeventfd_release+0x28/0x60 [kvm]
  RSP: 0018:88020e7f3bc8  EFLAGS: 00010292
  RAX: dead00200200 RBX: 8801ec19c900 RCX: 00018200016d
  RDX: 8801ec19cf80 RSI: ea0008bf1d40 RDI: 8801ec19c900
  RBP: 88020e7f3bd8 R08: 2fc75a01 R09: 00018200016d
  R10: c07df6ae R11: 88022fc75a98 R12: 88021e7cc000
  R13: 88021e7cca48 R14: 88021e7cca50 R15: 8801ec19c880
  FS:  7fc1ee3e6700() GS:88023e24() 
  knlGS:
  CS:  0010 DS:  ES:  CR0: 80050033
  CR2: 7f8f389d8000 CR3: 00023dc13000 CR4: 001427e0
  Stack:
  88021e7cc000  88020e7f3be8 c07e2622
  88020e7f3c38 c07df69a 880232524160 88020e792d80
    880219b78c00 0008 8802321686a8
  Call Trace:
  [c07e2622] ioeventfd_destructor+0x12/0x20 [kvm]
  [c07df69a] kvm_put_kvm+0xca/0x210 [kvm]
  [c07df818] kvm_vcpu_release+0x18/0x20 [kvm]
  [811f69f7] __fput+0xe7/0x250
  [811f6bae] fput+0xe/0x10
  [81093f04] task_work_run+0xd4/0xf0
  [81079358] do_exit+0x368/0xa50
  [81082c8f] ? recalc_sigpending+0x1f/0x60
  [81079ad5] do_group_exit+0x45/0xb0
  [81085c71] get_signal+0x291/0x750
  [810144d8] do_signal+0x28/0xab0
  [810f3a3b] ? do_futex+0xdb/0x5d0
  [810b7028] ? __wake_up_locked_key+0x18/0x20
  [810f3fa6] ? SyS_futex+0x76/0x170
  [81014fc9] do_notify_resume+0x69/0xb0
  [817cb9af] int_signal+0x12/0x17
  Code: 5d c3 90 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 8b 7f 
  20 e8 06 d6 a5 c0 48 8b 43 08 48 8b 13 48 89 df 48 89 42 08 48 89 10 48 
  b8 00 01 10 00 00
  RIP  [c07e25d8] ioeventfd_release+0x28/0x60 [kvm]
  RSP 88020e7f3bc8
  
  Cc: Gleb Natapov g...@kernel.org
  Cc: Paolo Bonzini pbonz...@redhat.com
  Cc: Michael S. Tsirkin m...@redhat.com
  Signed-off-by: Jason Wang jasow...@redhat.com
 I'm worried that this slows down the regular MMIO.

I doubt it would even be measurable.

 Could you share performance #s please?
 You need a mix of len=0 and len=2 matches.

Ok.

 One solution for the first issue is to create two ioeventfd objects instead.

Sounds good.

 For the second issue, we could change bsearch compare function instead.

What do you mean by the second issue?

 Again, this affects all devices, so performance #s would be needed.




Re: [PATCH V2 1/3] kvm: use kmalloc() instead of kzalloc() during iodev register/unregister

2015-08-25 Thread Jason Wang


On 08/26/2015 01:45 PM, Joe Perches wrote:
 On Wed, 2015-08-26 at 13:39 +0800, Jason Wang wrote:
  
  On 08/25/2015 11:29 PM, Joe Perches wrote:
   On Tue, 2015-08-25 at 15:47 +0800, Jason Wang wrote:
All fields of kvm_io_range were initialized or copied explicitly
afterwards. So switch to use kmalloc().
   Is there any compiler added alignment padding
   in either structure?  If so, those padding
   areas would now be uninitialized and may leak
   kernel data if copied to user-space.
  
  I get your concern, but I don't see a way they could be copied to userspace, do you?
 I didn't look.

 I just wanted you to be aware there's a difference
 and a reason why kzalloc might be used even though
 all structure members are initialized.


I see, thanks for the reminder. Looks like we are safe, and I will add
something like "kvm_io_range is never accessed by userspace" to the
commit log if there's a new version.



Re: [PATCH V3 2/3] kvm: don't register wildcard MMIO EVENTFD on two buses

2015-08-25 Thread Jason Wang


On 08/25/2015 07:51 PM, Michael S. Tsirkin wrote:
 On Tue, Aug 25, 2015 at 05:05:47PM +0800, Jason Wang wrote:
  We register a wildcard mmio eventfd on two buses, one for KVM_MMIO_BUS
  and another for KVM_FAST_MMIO_BUS. This leads to two issues:
  
  - kvm_io_bus_destroy() does not know that the devices on the two buses
point to a single dev, which leads to a double free [1] during exit.
  - a wildcard eventfd ignores the data len, so it is registered as a
kvm_io_range with zero length. This fails the binary search in
kvm_io_bus_get_first_dev() when we try to emulate through
KVM_MMIO_BUS, causing a userspace io emulation request instead of
an eventfd notification (the virtqueue kick is trapped by qemu
instead of vhost in this case).
  
  Fix this by not registering the wildcard mmio eventfd on two
  buses; instead, register it only on KVM_FAST_MMIO_BUS. This fixes the
  double free issue in kvm_io_bus_destroy(). For archs/setups that do
  not utilize KVM_FAST_MMIO_BUS, try KVM_FAST_MMIO_BUS first to see if
  it has a match before searching KVM_MMIO_BUS.
  
  [1] Panic caused by double free:
  
  CPU: 1 PID: 2894 Comm: qemu-system-x86 Not tainted 3.19.0-26-generic 
  #28-Ubuntu
  Hardware name: LENOVO 2356BG6/2356BG6, BIOS G7ET96WW (2.56 ) 09/12/2013
  task: 88009ae0c4b0 ti: 88020e7f task.ti: 88020e7f
  RIP: 0010:[c07e25d8]  [c07e25d8] 
  ioeventfd_release+0x28/0x60 [kvm]
  RSP: 0018:88020e7f3bc8  EFLAGS: 00010292
  RAX: dead00200200 RBX: 8801ec19c900 RCX: 00018200016d
  RDX: 8801ec19cf80 RSI: ea0008bf1d40 RDI: 8801ec19c900
  RBP: 88020e7f3bd8 R08: 2fc75a01 R09: 00018200016d
  R10: c07df6ae R11: 88022fc75a98 R12: 88021e7cc000
  R13: 88021e7cca48 R14: 88021e7cca50 R15: 8801ec19c880
  FS:  7fc1ee3e6700() GS:88023e24() 
  knlGS:
  CS:  0010 DS:  ES:  CR0: 80050033
  CR2: 7f8f389d8000 CR3: 00023dc13000 CR4: 001427e0
  Stack:
  88021e7cc000  88020e7f3be8 c07e2622
  88020e7f3c38 c07df69a 880232524160 88020e792d80
    880219b78c00 0008 8802321686a8
  Call Trace:
  [c07e2622] ioeventfd_destructor+0x12/0x20 [kvm]
  [c07df69a] kvm_put_kvm+0xca/0x210 [kvm]
  [c07df818] kvm_vcpu_release+0x18/0x20 [kvm]
  [811f69f7] __fput+0xe7/0x250
  [811f6bae] fput+0xe/0x10
  [81093f04] task_work_run+0xd4/0xf0
  [81079358] do_exit+0x368/0xa50
  [81082c8f] ? recalc_sigpending+0x1f/0x60
  [81079ad5] do_group_exit+0x45/0xb0
  [81085c71] get_signal+0x291/0x750
  [810144d8] do_signal+0x28/0xab0
  [810f3a3b] ? do_futex+0xdb/0x5d0
  [810b7028] ? __wake_up_locked_key+0x18/0x20
  [810f3fa6] ? SyS_futex+0x76/0x170
  [81014fc9] do_notify_resume+0x69/0xb0
  [817cb9af] int_signal+0x12/0x17
  Code: 5d c3 90 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 8b 7f 
  20 e8 06 d6 a5 c0 48 8b 43 08 48 8b 13 48 89 df 48 89 42 08 48 89 10 48 
  b8 00 01 10 00 00
  RIP  [c07e25d8] ioeventfd_release+0x28/0x60 [kvm]
  RSP 88020e7f3bc8
  
  Cc: Gleb Natapov g...@kernel.org
  Cc: Paolo Bonzini pbonz...@redhat.com
  Cc: Michael S. Tsirkin m...@redhat.com
  Signed-off-by: Jason Wang jasow...@redhat.com
  ---
  Changes from V2:
  - Tweak styles and comment suggested by Cornelia.
  Changes from v1:
  - change ioeventfd_bus_from_flags() to return KVM_FAST_MMIO_BUS when
needed to save lots of unnecessary changes.
  ---
   virt/kvm/eventfd.c  | 31 +--
   virt/kvm/kvm_main.c | 16 ++--
   2 files changed, 23 insertions(+), 24 deletions(-)
  
  diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
  index 9ff4193..c3ffdc3 100644
  --- a/virt/kvm/eventfd.c
  +++ b/virt/kvm/eventfd.c
  @@ -762,13 +762,16 @@ ioeventfd_check_collision(struct kvm *kvm, struct 
  _ioeventfd *p)
 return false;
   }
   
  -static enum kvm_bus ioeventfd_bus_from_flags(__u32 flags)
  +static enum kvm_bus ioeventfd_bus_from_args(struct kvm_ioeventfd *args)
   {
  -  if (flags & KVM_IOEVENTFD_FLAG_PIO)
  +  if (args->flags & KVM_IOEVENTFD_FLAG_PIO)
  return KVM_PIO_BUS;
  -  if (flags & KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY)
  +  if (args->flags & KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY)
  return KVM_VIRTIO_CCW_NOTIFY_BUS;
  -  return KVM_MMIO_BUS;
  +  /* When length is ignored, MMIO is put on a separate bus, for
  +   * faster lookups.
  +   */
  +  return args->len ? KVM_MMIO_BUS : KVM_FAST_MMIO_BUS;
   }
   
   static int
  @@ -779,7 +782,7 @@ kvm_assign_ioeventfd(struct kvm *kvm, struct 
  kvm_ioeventfd *args)
 struct eventfd_ctx   *eventfd;
 int   ret;
   
  -  bus_idx = ioeventfd_bus_from_flags(args->flags);
  +  bus_idx = ioeventfd_bus_from_args(args);
 /* must

Re: [PATCH V2 1/3] kvm: use kmalloc() instead of kzalloc() during iodev register/unregister

2015-08-25 Thread Jason Wang


On 08/25/2015 11:29 PM, Joe Perches wrote:
 On Tue, 2015-08-25 at 15:47 +0800, Jason Wang wrote:
  All fields of kvm_io_range were initialized or copied explicitly
  afterwards. So switch to use kmalloc().
 Is there any compiler added alignment padding
 in either structure?  If so, those padding
 areas would now be uninitialized and may leak
 kernel data if copied to user-space.


I get your concern, but I don't see a way they could be copied to userspace, do you?


Re: [PATCH 2/3] kvm: don't register wildcard MMIO EVENTFD on two buses

2015-08-25 Thread Jason Wang


On 08/25/2015 11:04 AM, Jason Wang wrote:
[...]
 @@ -900,10 +899,11 @@ kvm_deassign_ioeventfd(struct kvm *kvm, struct 
 kvm_ioeventfd *args)
  if (!p->wildcard && p->datamatch != args->datamatch)
   continue;
   
  -kvm_io_bus_unregister_dev(kvm, bus_idx, &p->dev);
   if (!p->length) {
   kvm_io_bus_unregister_dev(kvm, KVM_FAST_MMIO_BUS,
 &p->dev);
  +} else {
  +kvm_io_bus_unregister_dev(kvm, bus_idx, &p->dev);
   }
  Similar comments here... do you want to check for bus_idx ==
  KVM_MMIO_BUS as well?
  Good catch. I think keeping the original code as is will also be OK to
  solve this (with changing the bus_idx to KVM_FAST_MMIO_BUS during
  registering if it was a wildcard mmio).
  Do you need to handle the ioeventfd_count changes on the fast mmio bus
  as well?
 Yes. So actually, it needs some changes: check the return value of
 kvm_io_bus_unregister_dev() and decide which bus the device belongs to.


Looks like it will be cleaner to just change
ioeventfd_bus_from_flags() to return KVM_FAST_MMIO_BUS accordingly. Will
post V2 soon.


[PATCH V2 1/3] kvm: use kmalloc() instead of kzalloc() during iodev register/unregister

2015-08-25 Thread Jason Wang
All fields of kvm_io_range were initialized or copied explicitly
afterwards. So switch to use kmalloc().

Cc: Gleb Natapov g...@kernel.org
Cc: Paolo Bonzini pbonz...@redhat.com
Cc: Michael S. Tsirkin m...@redhat.com
Signed-off-by: Jason Wang jasow...@redhat.com
---
 virt/kvm/kvm_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8b8a444..0d79fe8 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3248,7 +3248,7 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus 
bus_idx, gpa_t addr,
if (bus->dev_count - bus->ioeventfd_count > NR_IOBUS_DEVS - 1)
return -ENOSPC;
 
-   new_bus = kzalloc(sizeof(*bus) + ((bus->dev_count + 1) *
+   new_bus = kmalloc(sizeof(*bus) + ((bus->dev_count + 1) *
  sizeof(struct kvm_io_range)), GFP_KERNEL);
if (!new_bus)
return -ENOMEM;
@@ -3280,7 +3280,7 @@ int kvm_io_bus_unregister_dev(struct kvm *kvm, enum 
kvm_bus bus_idx,
if (r)
return r;
 
-   new_bus = kzalloc(sizeof(*bus) + ((bus->dev_count - 1) *
+   new_bus = kmalloc(sizeof(*bus) + ((bus->dev_count - 1) *
  sizeof(struct kvm_io_range)), GFP_KERNEL);
if (!new_bus)
return -ENOMEM;
-- 
2.1.4



[PATCH V2 3/3] kvm: add tracepoint for fast mmio

2015-08-25 Thread Jason Wang
Cc: Gleb Natapov g...@kernel.org
Cc: Paolo Bonzini pbonz...@redhat.com
Cc: Michael S. Tsirkin m...@redhat.com
Signed-off-by: Jason Wang jasow...@redhat.com
---
 arch/x86/kvm/trace.h | 17 +
 arch/x86/kvm/vmx.c   |  1 +
 arch/x86/kvm/x86.c   |  1 +
 3 files changed, 19 insertions(+)

diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 4eae7c3..2d4e81a 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -128,6 +128,23 @@ TRACE_EVENT(kvm_pio,
 __entry->count > 1 ? "(...)" : "")
 );
 
+TRACE_EVENT(kvm_fast_mmio,
+   TP_PROTO(u64 gpa),
+   TP_ARGS(gpa),
+
+   TP_STRUCT__entry(
+   __field(u64,gpa)
+   ),
+
+   TP_fast_assign(
+   __entry->gpa = gpa;
+   ),
+
+   TP_printk("fast mmio at gpa 0x%llx", __entry->gpa)
+);
+
+
+
 /*
  * Tracepoint for cpuid.
  */
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 83b7b5c..a55d279 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5831,6 +5831,7 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
if (!kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) {
skip_emulated_instruction(vcpu);
+   trace_kvm_fast_mmio(gpa);
return 1;
}
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8f0f6ec..36cf78e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8254,6 +8254,7 @@ bool kvm_arch_has_noncoherent_dma(struct kvm *kvm)
 EXPORT_SYMBOL_GPL(kvm_arch_has_noncoherent_dma);
 
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_msr);
-- 
2.1.4



[PATCH V2 2/3] kvm: don't register wildcard MMIO EVENTFD on two buses

2015-08-25 Thread Jason Wang
We register a wildcard mmio eventfd on two buses, one for KVM_MMIO_BUS
and another for KVM_FAST_MMIO_BUS. This leads to two issues:

- kvm_io_bus_destroy() does not know that the devices on the two buses
  point to a single dev, which leads to a double free [1] during exit.
- a wildcard eventfd ignores the data len, so it is registered as a
  kvm_io_range with zero length. This fails the binary search in
  kvm_io_bus_get_first_dev() when we try to emulate through
  KVM_MMIO_BUS, causing a userspace io emulation request instead of
  an eventfd notification (the virtqueue kick is trapped by qemu
  instead of vhost in this case).

Fix this by not registering the wildcard mmio eventfd on two
buses; instead, register it only on KVM_FAST_MMIO_BUS. This fixes the
double free issue in kvm_io_bus_destroy(). For archs/setups that do
not utilize KVM_FAST_MMIO_BUS, try KVM_FAST_MMIO_BUS first to see if
it has a match before searching KVM_MMIO_BUS.

[1] Panic caused by double free:

CPU: 1 PID: 2894 Comm: qemu-system-x86 Not tainted 3.19.0-26-generic #28-Ubuntu
Hardware name: LENOVO 2356BG6/2356BG6, BIOS G7ET96WW (2.56 ) 09/12/2013
task: 88009ae0c4b0 ti: 88020e7f task.ti: 88020e7f
RIP: 0010:[c07e25d8]  [c07e25d8] 
ioeventfd_release+0x28/0x60 [kvm]
RSP: 0018:88020e7f3bc8  EFLAGS: 00010292
RAX: dead00200200 RBX: 8801ec19c900 RCX: 00018200016d
RDX: 8801ec19cf80 RSI: ea0008bf1d40 RDI: 8801ec19c900
RBP: 88020e7f3bd8 R08: 2fc75a01 R09: 00018200016d
R10: c07df6ae R11: 88022fc75a98 R12: 88021e7cc000
R13: 88021e7cca48 R14: 88021e7cca50 R15: 8801ec19c880
FS:  7fc1ee3e6700() GS:88023e24() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7f8f389d8000 CR3: 00023dc13000 CR4: 001427e0
Stack:
88021e7cc000  88020e7f3be8 c07e2622
88020e7f3c38 c07df69a 880232524160 88020e792d80
  880219b78c00 0008 8802321686a8
Call Trace:
[c07e2622] ioeventfd_destructor+0x12/0x20 [kvm]
[c07df69a] kvm_put_kvm+0xca/0x210 [kvm]
[c07df818] kvm_vcpu_release+0x18/0x20 [kvm]
[811f69f7] __fput+0xe7/0x250
[811f6bae] fput+0xe/0x10
[81093f04] task_work_run+0xd4/0xf0
[81079358] do_exit+0x368/0xa50
[81082c8f] ? recalc_sigpending+0x1f/0x60
[81079ad5] do_group_exit+0x45/0xb0
[81085c71] get_signal+0x291/0x750
[810144d8] do_signal+0x28/0xab0
[810f3a3b] ? do_futex+0xdb/0x5d0
[810b7028] ? __wake_up_locked_key+0x18/0x20
[810f3fa6] ? SyS_futex+0x76/0x170
[81014fc9] do_notify_resume+0x69/0xb0
[817cb9af] int_signal+0x12/0x17
Code: 5d c3 90 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 8b 7f 20 
e8 06 d6 a5 c0 48 8b 43 08 48 8b 13 48 89 df 48 89 42 08 48 89 10 48 b8 00 01 
10 00 00
RIP  [c07e25d8] ioeventfd_release+0x28/0x60 [kvm]
RSP 88020e7f3bc8

Cc: Gleb Natapov g...@kernel.org
Cc: Paolo Bonzini pbonz...@redhat.com
Cc: Michael S. Tsirkin m...@redhat.com
Signed-off-by: Jason Wang jasow...@redhat.com
---
Changes from v1:
- change ioeventfd_bus_from_flags() to return KVM_FAST_MMIO_BUS when
  needed to save lots of unnecessary changes.
---
 virt/kvm/eventfd.c  | 30 --
 virt/kvm/kvm_main.c | 16 ++--
 2 files changed, 22 insertions(+), 24 deletions(-)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 9ff4193..95f2901 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -762,13 +762,15 @@ ioeventfd_check_collision(struct kvm *kvm, struct 
_ioeventfd *p)
return false;
 }
 
-static enum kvm_bus ioeventfd_bus_from_flags(__u32 flags)
+static enum kvm_bus ioeventfd_bus_from_flags(struct kvm_ioeventfd *args)
 {
-   if (flags & KVM_IOEVENTFD_FLAG_PIO)
+   if (args->flags & KVM_IOEVENTFD_FLAG_PIO)
return KVM_PIO_BUS;
-   if (flags & KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY)
+   if (args->flags & KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY)
return KVM_VIRTIO_CCW_NOTIFY_BUS;
-   return KVM_MMIO_BUS;
+   if (args->len)
+   return KVM_MMIO_BUS;
+   return KVM_FAST_MMIO_BUS;
 }
 
 static int
@@ -779,7 +781,7 @@ kvm_assign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd 
*args)
struct eventfd_ctx   *eventfd;
int   ret;
 
-   bus_idx = ioeventfd_bus_from_flags(args->flags);
+   bus_idx = ioeventfd_bus_from_flags(args);
/* must be natural-word sized, or 0 to ignore length */
switch (args-len) {
case 0:
@@ -843,16 +845,6 @@ kvm_assign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd 
*args)
if (ret < 0)
goto unlock_fail;
 
-   /* When length is ignored, MMIO is also put on a separate bus, for
-* faster lookups.
-*/
-   if (!args->len && !(args->flags

Re: [PATCH V2 2/3] kvm: don't register wildcard MMIO EVENTFD on two buses

2015-08-25 Thread Jason Wang


On 08/25/2015 04:20 PM, Cornelia Huck wrote:
 On Tue, 25 Aug 2015 15:47:14 +0800
 Jason Wang jasow...@redhat.com wrote:

 diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
 index 9ff4193..95f2901 100644
 --- a/virt/kvm/eventfd.c
 +++ b/virt/kvm/eventfd.c
 @@ -762,13 +762,15 @@ ioeventfd_check_collision(struct kvm *kvm, struct 
 _ioeventfd *p)
  return false;
  }

 -static enum kvm_bus ioeventfd_bus_from_flags(__u32 flags)
 +static enum kvm_bus ioeventfd_bus_from_flags(struct kvm_ioeventfd *args)
 ioeventfd_bus_from_args()? But _from_flags() is not wrong either :)

  {
 -if (flags & KVM_IOEVENTFD_FLAG_PIO)
 +if (args->flags & KVM_IOEVENTFD_FLAG_PIO)
  return KVM_PIO_BUS;
 -if (flags & KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY)
 +if (args->flags & KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY)
  return KVM_VIRTIO_CCW_NOTIFY_BUS;
 -return KVM_MMIO_BUS;
 +if (args->len)
 +return KVM_MMIO_BUS;
 +return KVM_FAST_MMIO_BUS;
 Hm...

 /* When length is ignored, MMIO is put on a separate bus, for
  * faster lookups.
  */
 return args->len ? KVM_MMIO_BUS : KVM_FAST_MMIO_BUS;

  }

  static int
 This version of the patch looks nice and compact. Regardless whether
 you want to follow my (minor) style suggestions, consider this patch

 Acked-by: Cornelia Huck cornelia.h...@de.ibm.com


Thanks for the review. V3 posted :)


[PATCH V3 2/3] kvm: don't register wildcard MMIO EVENTFD on two buses

2015-08-25 Thread Jason Wang
We register a wildcard mmio eventfd on two buses, one for KVM_MMIO_BUS
and another for KVM_FAST_MMIO_BUS. This leads to two issues:

- kvm_io_bus_destroy() does not know that the devices on the two buses
  point to a single dev, which leads to a double free [1] during exit.
- a wildcard eventfd ignores the data len, so it is registered as a
  kvm_io_range with zero length. This fails the binary search in
  kvm_io_bus_get_first_dev() when we try to emulate through
  KVM_MMIO_BUS, causing a userspace io emulation request instead of
  an eventfd notification (the virtqueue kick is trapped by qemu
  instead of vhost in this case).

Fix this by not registering the wildcard mmio eventfd on two
buses; instead, register it only on KVM_FAST_MMIO_BUS. This fixes the
double free issue in kvm_io_bus_destroy(). For archs/setups that do
not utilize KVM_FAST_MMIO_BUS, try KVM_FAST_MMIO_BUS first to see if
it has a match before searching KVM_MMIO_BUS.

[1] Panic caused by double free:

CPU: 1 PID: 2894 Comm: qemu-system-x86 Not tainted 3.19.0-26-generic #28-Ubuntu
Hardware name: LENOVO 2356BG6/2356BG6, BIOS G7ET96WW (2.56 ) 09/12/2013
task: 88009ae0c4b0 ti: 88020e7f task.ti: 88020e7f
RIP: 0010:[c07e25d8]  [c07e25d8] 
ioeventfd_release+0x28/0x60 [kvm]
RSP: 0018:88020e7f3bc8  EFLAGS: 00010292
RAX: dead00200200 RBX: 8801ec19c900 RCX: 00018200016d
RDX: 8801ec19cf80 RSI: ea0008bf1d40 RDI: 8801ec19c900
RBP: 88020e7f3bd8 R08: 2fc75a01 R09: 00018200016d
R10: c07df6ae R11: 88022fc75a98 R12: 88021e7cc000
R13: 88021e7cca48 R14: 88021e7cca50 R15: 8801ec19c880
FS:  7fc1ee3e6700() GS:88023e24() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7f8f389d8000 CR3: 00023dc13000 CR4: 001427e0
Stack:
88021e7cc000  88020e7f3be8 c07e2622
88020e7f3c38 c07df69a 880232524160 88020e792d80
  880219b78c00 0008 8802321686a8
Call Trace:
[c07e2622] ioeventfd_destructor+0x12/0x20 [kvm]
[c07df69a] kvm_put_kvm+0xca/0x210 [kvm]
[c07df818] kvm_vcpu_release+0x18/0x20 [kvm]
[811f69f7] __fput+0xe7/0x250
[811f6bae] fput+0xe/0x10
[81093f04] task_work_run+0xd4/0xf0
[81079358] do_exit+0x368/0xa50
[81082c8f] ? recalc_sigpending+0x1f/0x60
[81079ad5] do_group_exit+0x45/0xb0
[81085c71] get_signal+0x291/0x750
[810144d8] do_signal+0x28/0xab0
[810f3a3b] ? do_futex+0xdb/0x5d0
[810b7028] ? __wake_up_locked_key+0x18/0x20
[810f3fa6] ? SyS_futex+0x76/0x170
[81014fc9] do_notify_resume+0x69/0xb0
[817cb9af] int_signal+0x12/0x17
Code: 5d c3 90 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 8b 7f 20 
e8 06 d6 a5 c0 48 8b 43 08 48 8b 13 48 89 df 48 89 42 08 48 89 10 48 b8 00 01 
10 00 00
RIP  [c07e25d8] ioeventfd_release+0x28/0x60 [kvm]
RSP 88020e7f3bc8

Cc: Gleb Natapov g...@kernel.org
Cc: Paolo Bonzini pbonz...@redhat.com
Cc: Michael S. Tsirkin m...@redhat.com
Signed-off-by: Jason Wang jasow...@redhat.com
---
Changes from V2:
- Tweak styles and comment suggested by Cornelia.
Changes from v1:
- change ioeventfd_bus_from_flags() to return KVM_FAST_MMIO_BUS when
  needed to save lots of unnecessary changes.
---
 virt/kvm/eventfd.c  | 31 +--
 virt/kvm/kvm_main.c | 16 ++--
 2 files changed, 23 insertions(+), 24 deletions(-)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 9ff4193..c3ffdc3 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -762,13 +762,16 @@ ioeventfd_check_collision(struct kvm *kvm, struct 
_ioeventfd *p)
return false;
 }
 
-static enum kvm_bus ioeventfd_bus_from_flags(__u32 flags)
+static enum kvm_bus ioeventfd_bus_from_args(struct kvm_ioeventfd *args)
 {
-   if (flags & KVM_IOEVENTFD_FLAG_PIO)
+   if (args->flags & KVM_IOEVENTFD_FLAG_PIO)
return KVM_PIO_BUS;
-   if (flags & KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY)
+   if (args->flags & KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY)
return KVM_VIRTIO_CCW_NOTIFY_BUS;
-   return KVM_MMIO_BUS;
+   /* When length is ignored, MMIO is put on a separate bus, for
+* faster lookups.
+*/
+   return args->len ? KVM_MMIO_BUS : KVM_FAST_MMIO_BUS;
 }
 
 static int
@@ -779,7 +782,7 @@ kvm_assign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd 
*args)
struct eventfd_ctx   *eventfd;
int   ret;
 
-   bus_idx = ioeventfd_bus_from_flags(args->flags);
+   bus_idx = ioeventfd_bus_from_args(args);
/* must be natural-word sized, or 0 to ignore length */
switch (args-len) {
case 0:
@@ -843,16 +846,6 @@ kvm_assign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd 
*args)
if (ret < 0)
goto unlock_fail;
 
-   /* When length

[PATCH V3 1/3] kvm: use kmalloc() instead of kzalloc() during iodev register/unregister

2015-08-25 Thread Jason Wang
All fields of kvm_io_range were initialized or copied explicitly
afterwards. So switch to use kmalloc().

Cc: Gleb Natapov g...@kernel.org
Cc: Paolo Bonzini pbonz...@redhat.com
Cc: Michael S. Tsirkin m...@redhat.com
Signed-off-by: Jason Wang jasow...@redhat.com
---
 virt/kvm/kvm_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8b8a444..0d79fe8 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3248,7 +3248,7 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus 
bus_idx, gpa_t addr,
if (bus->dev_count - bus->ioeventfd_count > NR_IOBUS_DEVS - 1)
return -ENOSPC;
 
-   new_bus = kzalloc(sizeof(*bus) + ((bus->dev_count + 1) *
+   new_bus = kmalloc(sizeof(*bus) + ((bus->dev_count + 1) *
  sizeof(struct kvm_io_range)), GFP_KERNEL);
if (!new_bus)
return -ENOMEM;
@@ -3280,7 +3280,7 @@ int kvm_io_bus_unregister_dev(struct kvm *kvm, enum 
kvm_bus bus_idx,
if (r)
return r;
 
-   new_bus = kzalloc(sizeof(*bus) + ((bus->dev_count - 1) *
+   new_bus = kmalloc(sizeof(*bus) + ((bus->dev_count - 1) *
  sizeof(struct kvm_io_range)), GFP_KERNEL);
if (!new_bus)
return -ENOMEM;
-- 
2.1.4



[PATCH V3 3/3] kvm: add tracepoint for fast mmio

2015-08-25 Thread Jason Wang
Cc: Gleb Natapov g...@kernel.org
Cc: Paolo Bonzini pbonz...@redhat.com
Cc: Michael S. Tsirkin m...@redhat.com
Signed-off-by: Jason Wang jasow...@redhat.com
---
 arch/x86/kvm/trace.h | 17 +
 arch/x86/kvm/vmx.c   |  1 +
 arch/x86/kvm/x86.c   |  1 +
 3 files changed, 19 insertions(+)

diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 4eae7c3..2d4e81a 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -128,6 +128,23 @@ TRACE_EVENT(kvm_pio,
 __entry->count > 1 ? "(...)" : "")
 );
 
+TRACE_EVENT(kvm_fast_mmio,
+   TP_PROTO(u64 gpa),
+   TP_ARGS(gpa),
+
+   TP_STRUCT__entry(
+   __field(u64,gpa)
+   ),
+
+   TP_fast_assign(
+   __entry->gpa = gpa;
+   ),
+
+   TP_printk("fast mmio at gpa 0x%llx", __entry->gpa)
+);
+
+
+
 /*
  * Tracepoint for cpuid.
  */
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 83b7b5c..a55d279 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5831,6 +5831,7 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
if (!kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) {
skip_emulated_instruction(vcpu);
+   trace_kvm_fast_mmio(gpa);
return 1;
}
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8f0f6ec..36cf78e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8254,6 +8254,7 @@ bool kvm_arch_has_noncoherent_dma(struct kvm *kvm)
 EXPORT_SYMBOL_GPL(kvm_arch_has_noncoherent_dma);
 
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_msr);
-- 
2.1.4



Re: [PATCH 2/3] kvm: don't register wildcard MMIO EVENTFD on two buses

2015-08-24 Thread Jason Wang


On 08/24/2015 10:05 PM, Cornelia Huck wrote:
 On Mon, 24 Aug 2015 11:29:29 +0800
 Jason Wang jasow...@redhat.com wrote:

 On 08/21/2015 05:29 PM, Cornelia Huck wrote:
 On Fri, 21 Aug 2015 16:03:52 +0800
 Jason Wang jasow...@redhat.com wrote:
 @@ -850,9 +845,15 @@ kvm_assign_ioeventfd(struct kvm *kvm, struct 
 kvm_ioeventfd *args)
 Unfortunately snipped by diff, but the check here is on !len && !PIO,
 which only does the desired thing as VIRTIO_CCW always uses len == 8.
 Should the check be for !len && MMIO instead?
 I think the answer depends on whether len == 0 is valid for ccw. If not,
 we can fail the assign earlier. Since even without this patch, if
 userspace tries to register a dev with a len equal to zero, it will also
 be registered to KVM_FAST_MMIO_BUS. If yes, we need the check as you
 suggested here.
 I don't think len != 8 makes much sense for the way ioeventfd is
 defined for ccw (we handle hypercalls with a payload specifying the
 device), but we currently don't actively fence it.

 But regardless, I'd prefer to decide directly upon whether userspace
 actually tried to register for the mmio bus.

Ok.


ret = kvm_io_bus_register_dev(kvm, KVM_FAST_MMIO_BUS,
  p->addr, 0, &p->dev);
if (ret < 0)
 -  goto register_fail;
 +  goto unlock_fail;
 +  } else {
 +  ret = kvm_io_bus_register_dev(kvm, bus_idx, p->addr, p->length,
 +&p->dev);
 +  if (ret < 0)
 +  goto unlock_fail;
}
 Hm... maybe the following would be more obvious:

 my_bus = (p->length == 0) && (bus_idx == KVM_MMIO_BUS) ? KVM_FAST_MMIO_BUS
 : bus_idx;
 ret = kvm_io_bus_register_dev(kvm, my_bus, p->addr, p->length, &p->dev);

  
 +
kvm->buses[bus_idx]->ioeventfd_count++;
list_add_tail(&p->list, &kvm->ioeventfds);
 (...)

 @@ -900,10 +899,11 @@ kvm_deassign_ioeventfd(struct kvm *kvm, struct 
 kvm_ioeventfd *args)
if (!p->wildcard && p->datamatch != args->datamatch)
continue;
  
 -  kvm_io_bus_unregister_dev(kvm, bus_idx, &p->dev);
if (!p->length) {
kvm_io_bus_unregister_dev(kvm, KVM_FAST_MMIO_BUS,
  &p->dev);
 +  } else {
 +  kvm_io_bus_unregister_dev(kvm, bus_idx, &p->dev);
}
 Similar comments here... do you want to check for bus_idx ==
 KVM_MMIO_BUS as well?
 Good catch. I think keeping the original code as is will also be OK to
 solve this (with changing the bus_idx to KVM_FAST_MMIO_BUS during
 registering if it was a wildcard mmio).
 Do you need to handle the ioeventfd_count changes on the fast mmio bus
 as well?

Yes. So actually, it needs some changes: check the return value of
kvm_io_bus_unregister_dev() and decide which bus the device belongs to.


kvm->buses[bus_idx]->ioeventfd_count--;
ioeventfd_release(p);



Re: [PATCH 2/3] kvm: don't register wildcard MMIO EVENTFD on two buses

2015-08-23 Thread Jason Wang


On 08/21/2015 05:29 PM, Cornelia Huck wrote:
 On Fri, 21 Aug 2015 16:03:52 +0800
 Jason Wang jasow...@redhat.com wrote:


 diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
 index 9ff4193..834a409 100644
 --- a/virt/kvm/eventfd.c
 +++ b/virt/kvm/eventfd.c
 @@ -838,11 +838,6 @@ kvm_assign_ioeventfd(struct kvm *kvm, struct 
 kvm_ioeventfd *args)
  
  kvm_iodevice_init(p-dev, ioeventfd_ops);
  
 -ret = kvm_io_bus_register_dev(kvm, bus_idx, p->addr, p->length,
 -  &p->dev);
 -if (ret < 0)
 -goto unlock_fail;
 -
  /* When length is ignored, MMIO is also put on a separate bus, for
   * faster lookups.
 You probably want to change this comment as well?

Yes.


   */
 @@ -850,9 +845,15 @@ kvm_assign_ioeventfd(struct kvm *kvm, struct 
 kvm_ioeventfd *args)
 Unfortunately snipped by diff, but the check here is on !len && !PIO,
 which only does the desired thing as VIRTIO_CCW always uses len == 8.
 Should the check be for !len && MMIO instead?

I think the answer depends on whether len == 0 is valid for ccw. If not,
we can fail the assign earlier. Since even without this patch, if
userspace tries to register a dev with a len equal to zero, it will also
be registered to KVM_FAST_MMIO_BUS. If yes, we need the check as you
suggested here.


  ret = kvm_io_bus_register_dev(kvm, KVM_FAST_MMIO_BUS,
p->addr, 0, &p->dev);
  if (ret < 0)
 -goto register_fail;
 +goto unlock_fail;
 +} else {
 +ret = kvm_io_bus_register_dev(kvm, bus_idx, p->addr, p->length,
 +  &p->dev);
 +if (ret < 0)
 +goto unlock_fail;
  }
 Hm... maybe the following would be more obvious:

 my_bus = (p->length == 0) && (bus_idx == KVM_MMIO_BUS) ? KVM_FAST_MMIO_BUS :
 bus_idx;
 ret = kvm_io_bus_register_dev(kvm, my_bus, p->addr, p->length, &p->dev);

  
 +
 kvm->buses[bus_idx]->ioeventfd_count++;
 list_add_tail(&p->list, &kvm->ioeventfds);
 (...)

 @@ -900,10 +899,11 @@ kvm_deassign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
  if (!p->wildcard && p->datamatch != args->datamatch)
  continue;
  
 -kvm_io_bus_unregister_dev(kvm, bus_idx, &p->dev);
  if (!p->length) {
  kvm_io_bus_unregister_dev(kvm, KVM_FAST_MMIO_BUS,
    &p->dev);
 +} else {
 +kvm_io_bus_unregister_dev(kvm, bus_idx, &p->dev);
  }
 Similar comments here... do you want to check for bus_idx ==
 KVM_MMIO_BUS as well?

Good catch. I think keeping the original code as-is will also be enough
to solve this (with changing the bus_idx to KVM_FAST_MMIO_BUS during
registering if it was a wildcard MMIO).


  kvm->buses[bus_idx]->ioeventfd_count--;
  ioeventfd_release(p);



[PATCH 2/3] kvm: don't register wildcard MMIO EVENTFD on two buses

2015-08-21 Thread Jason Wang
We register a wildcard mmio eventfd on two buses, one for KVM_MMIO_BUS
and another for KVM_FAST_MMIO_BUS. This leads to issues:

- kvm_io_bus_destroy() knows nothing about devices on the two buses
  pointing to a single dev, which will lead to a double free [1] during
  exit.
- A wildcard eventfd ignores the data len, so it was registered as a
  kvm_io_range with zero length. This will fail the binary search in
  kvm_io_bus_get_first_dev() when we try to emulate through
  KVM_MMIO_BUS, causing a userspace io emulation request instead
  of an eventfd notification (the virtqueue kick will be trapped by qemu
  instead of vhost in this case).

Fix this by not registering the wildcard mmio eventfd on two buses;
instead, register it only on KVM_FAST_MMIO_BUS. This fixes the
double free issue of kvm_io_bus_destroy(). For the archs/setups that
do not utilize KVM_FAST_MMIO_BUS, before searching KVM_MMIO_BUS, try
KVM_FAST_MMIO_BUS first to see if it has a match.
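The registration rule above can be sketched as follows. This is a simplified stand-in, not the kernel's actual enum or helper: a wildcard MMIO eventfd (len == 0) goes only on KVM_FAST_MMIO_BUS, everything else stays on its original bus, and lookups try the fast bus first.

```c
#include <assert.h>

/* Simplified stand-ins for the kvm bus indices (not the kernel enum). */
enum kvm_bus { KVM_MMIO_BUS, KVM_PIO_BUS, KVM_FAST_MMIO_BUS };

/* A wildcard MMIO eventfd (len == 0) is registered only on
 * KVM_FAST_MMIO_BUS; everything else stays on its original bus.
 * Since each eventfd now lives on exactly one bus, destroying the
 * buses frees each device exactly once. */
static enum kvm_bus pick_bus(enum kvm_bus bus_idx, int len)
{
    if (bus_idx == KVM_MMIO_BUS && len == 0)
        return KVM_FAST_MMIO_BUS;
    return bus_idx;
}
```
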

[1] Panic caused by double free:

CPU: 1 PID: 2894 Comm: qemu-system-x86 Not tainted 3.19.0-26-generic #28-Ubuntu
Hardware name: LENOVO 2356BG6/2356BG6, BIOS G7ET96WW (2.56 ) 09/12/2013
task: 88009ae0c4b0 ti: 88020e7f task.ti: 88020e7f
RIP: 0010:[c07e25d8]  [c07e25d8] 
ioeventfd_release+0x28/0x60 [kvm]
RSP: 0018:88020e7f3bc8  EFLAGS: 00010292
RAX: dead00200200 RBX: 8801ec19c900 RCX: 00018200016d
RDX: 8801ec19cf80 RSI: ea0008bf1d40 RDI: 8801ec19c900
RBP: 88020e7f3bd8 R08: 2fc75a01 R09: 00018200016d
R10: c07df6ae R11: 88022fc75a98 R12: 88021e7cc000
R13: 88021e7cca48 R14: 88021e7cca50 R15: 8801ec19c880
FS:  7fc1ee3e6700() GS:88023e24() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7f8f389d8000 CR3: 00023dc13000 CR4: 001427e0
Stack:
88021e7cc000  88020e7f3be8 c07e2622
88020e7f3c38 c07df69a 880232524160 88020e792d80
  880219b78c00 0008 8802321686a8
Call Trace:
[c07e2622] ioeventfd_destructor+0x12/0x20 [kvm]
[c07df69a] kvm_put_kvm+0xca/0x210 [kvm]
[c07df818] kvm_vcpu_release+0x18/0x20 [kvm]
[811f69f7] __fput+0xe7/0x250
[811f6bae] fput+0xe/0x10
[81093f04] task_work_run+0xd4/0xf0
[81079358] do_exit+0x368/0xa50
[81082c8f] ? recalc_sigpending+0x1f/0x60
[81079ad5] do_group_exit+0x45/0xb0
[81085c71] get_signal+0x291/0x750
[810144d8] do_signal+0x28/0xab0
[810f3a3b] ? do_futex+0xdb/0x5d0
[810b7028] ? __wake_up_locked_key+0x18/0x20
[810f3fa6] ? SyS_futex+0x76/0x170
[81014fc9] do_notify_resume+0x69/0xb0
[817cb9af] int_signal+0x12/0x17
Code: 5d c3 90 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 8b 7f 20 
e8 06 d6 a5 c0 48 8b 43 08 48 8b 13 48 89 df 48 89 42 08 48 89 10 48 b8 00 01 
10 00 00
RIP  [c07e25d8] ioeventfd_release+0x28/0x60 [kvm]
RSP 88020e7f3bc8

Cc: Gleb Natapov g...@kernel.org
Cc: Paolo Bonzini pbonz...@redhat.com
Cc: Michael S. Tsirkin m...@redhat.com
Signed-off-by: Jason Wang jasow...@redhat.com
---
 virt/kvm/eventfd.c  | 18 +-
 virt/kvm/kvm_main.c | 16 ++--
 2 files changed, 23 insertions(+), 11 deletions(-)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 9ff4193..834a409 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -838,11 +838,6 @@ kvm_assign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
 
	kvm_iodevice_init(&p->dev, &ioeventfd_ops);
 
-	ret = kvm_io_bus_register_dev(kvm, bus_idx, p->addr, p->length,
-				      &p->dev);
-	if (ret < 0)
-		goto unlock_fail;
-
	/* When length is ignored, MMIO is also put on a separate bus, for
	 * faster lookups.
	 */
@@ -850,9 +845,15 @@ kvm_assign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
	ret = kvm_io_bus_register_dev(kvm, KVM_FAST_MMIO_BUS,
				      p->addr, 0, &p->dev);
	if (ret < 0)
-		goto register_fail;
+		goto unlock_fail;
+	} else {
+		ret = kvm_io_bus_register_dev(kvm, bus_idx, p->addr, p->length,
+					      &p->dev);
+		if (ret < 0)
+			goto unlock_fail;
	}
 
+
	kvm->buses[bus_idx]->ioeventfd_count++;
	list_add_tail(&p->list, &kvm->ioeventfds);
 
@@ -860,8 +861,6 @@ kvm_assign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
 
	return 0;
 
-register_fail:
-	kvm_io_bus_unregister_dev(kvm, bus_idx, &p->dev);
 unlock_fail:
	mutex_unlock(&kvm->slots_lock);
 
@@ -900,10 +899,11 @@ kvm_deassign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
	if (!p->wildcard && p->datamatch != args->datamatch)
		continue;

[PATCH 3/3] kvm: add tracepoint for fast mmio

2015-08-21 Thread Jason Wang
Cc: Gleb Natapov g...@kernel.org
Cc: Paolo Bonzini pbonz...@redhat.com
Cc: Michael S. Tsirkin m...@redhat.com
Signed-off-by: Jason Wang jasow...@redhat.com
---
 arch/x86/kvm/trace.h | 17 +
 arch/x86/kvm/vmx.c   |  1 +
 arch/x86/kvm/x86.c   |  1 +
 3 files changed, 19 insertions(+)

diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 4eae7c3..2d4e81a 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -128,6 +128,23 @@ TRACE_EVENT(kvm_pio,
		  __entry->count > 1 ? "(...)" : "")
 );
 
+TRACE_EVENT(kvm_fast_mmio,
+	TP_PROTO(u64 gpa),
+	TP_ARGS(gpa),
+
+	TP_STRUCT__entry(
+		__field(u64,	gpa)
+	),
+
+	TP_fast_assign(
+		__entry->gpa	= gpa;
+	),
+
+	TP_printk("fast mmio at gpa 0x%llx", __entry->gpa)
+);
+
+
+
 /*
  * Tracepoint for cpuid.
  */
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 83b7b5c..a55d279 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5831,6 +5831,7 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
if (!kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) {
skip_emulated_instruction(vcpu);
+   trace_kvm_fast_mmio(gpa);
return 1;
}
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5ef2560..271a0e6 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8249,6 +8249,7 @@ bool kvm_arch_has_noncoherent_dma(struct kvm *kvm)
 EXPORT_SYMBOL_GPL(kvm_arch_has_noncoherent_dma);
 
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_msr);
-- 
2.1.4



[PATCH 1/3] kvm: use kmalloc() instead of kzalloc() during iodev register/unregister

2015-08-21 Thread Jason Wang
All fields of kvm_io_range are initialized or copied explicitly
afterwards, so switch to kmalloc().
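As a minimal userspace illustration of why the zeroing is redundant here (names are hypothetical; the kernel code uses kmalloc/kzalloc rather than malloc): every byte of the new allocation is overwritten before it is ever read.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical miniature of struct kvm_io_range. */
struct io_range { unsigned long addr, len; };

/* Every byte of the new array is written by memcpy() afterwards, so a
 * plain malloc() (kmalloc in the kernel) suffices; kzalloc()'s zeroing
 * would be immediately overwritten. */
static struct io_range *dup_ranges(const struct io_range *src, int n)
{
    struct io_range *dst = malloc(n * sizeof(*dst));
    if (!dst)
        return NULL;
    memcpy(dst, src, n * sizeof(*dst));
    return dst;
}

/* Self-test helper so the behaviour can be checked in one call. */
static int dup_ranges_selftest(void)
{
    struct io_range src[2] = { { 0x100, 8 }, { 0x200, 4 } };
    struct io_range *dst = dup_ranges(src, 2);
    int ok = dst && dst[1].addr == 0x200 && dst[1].len == 4;
    free(dst);
    return ok;
}
```
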

Cc: Gleb Natapov g...@kernel.org
Cc: Paolo Bonzini pbonz...@redhat.com
Cc: Michael S. Tsirkin m...@redhat.com
Signed-off-by: Jason Wang jasow...@redhat.com
---
 virt/kvm/kvm_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8b8a444..0d79fe8 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3248,7 +3248,7 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
	if (bus->dev_count - bus->ioeventfd_count > NR_IOBUS_DEVS - 1)
		return -ENOSPC;
 
-	new_bus = kzalloc(sizeof(*bus) + ((bus->dev_count + 1) *
+	new_bus = kmalloc(sizeof(*bus) + ((bus->dev_count + 1) *
			  sizeof(struct kvm_io_range)), GFP_KERNEL);
	if (!new_bus)
		return -ENOMEM;
@@ -3280,7 +3280,7 @@ int kvm_io_bus_unregister_dev(struct kvm *kvm, enum kvm_bus bus_idx,
	if (r)
		return r;
 
-	new_bus = kzalloc(sizeof(*bus) + ((bus->dev_count - 1) *
+	new_bus = kmalloc(sizeof(*bus) + ((bus->dev_count - 1) *
			  sizeof(struct kvm_io_range)), GFP_KERNEL);
	if (!new_bus)
		return -ENOMEM;
-- 
2.1.4



[PATCH net] vhost_net: fix wrong iter offset when setting number of buffers

2015-02-15 Thread Jason Wang
In commit ba7438aed924 ("vhost: don't bother copying iovecs in
handle_rx(), kill memcpy_toiovecend()"), we advance the iov iter fixup
sizeof(struct virtio_net_hdr) bytes and fill the number of buffers
after doing the socket recvmsg(). This worked well but was broken by
commit 6e03f896b52c ("Merge
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net"), which tries
to advance sizeof(struct virtio_net_hdr_mrg_rxbuf) bytes instead and
so fills the number of buffers at the wrong place. This patch fixes
this.

Fixes: 6e03f896b52c
("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
Cc: David S. Miller da...@davemloft.net
Cc: Al Viro v...@zeniv.linux.org.uk
Cc: Michael S. Tsirkin m...@redhat.com
Signed-off-by: Jason Wang jasow...@redhat.com
---
 drivers/vhost/net.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 8dccca9..afa06d2 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -528,9 +528,9 @@ static void handle_rx(struct vhost_net *net)
.msg_controllen = 0,
.msg_flags = MSG_DONTWAIT,
};
-   struct virtio_net_hdr_mrg_rxbuf hdr = {
-   .hdr.flags = 0,
-   .hdr.gso_type = VIRTIO_NET_HDR_GSO_NONE
+   struct virtio_net_hdr hdr = {
+   .flags = 0,
+   .gso_type = VIRTIO_NET_HDR_GSO_NONE
};
size_t total_len = 0;
int err, mergeable;
@@ -539,6 +539,7 @@ static void handle_rx(struct vhost_net *net)
size_t vhost_len, sock_len;
struct socket *sock;
struct iov_iter fixup;
+   __virtio16 num_buffers;
 
	mutex_lock(&vq->mutex);
	sock = vq->private_data;
@@ -616,9 +617,9 @@ static void handle_rx(struct vhost_net *net)
}
/* TODO: Should check and handle checksum. */
 
-	hdr.num_buffers = cpu_to_vhost16(vq, headcount);
+	num_buffers = cpu_to_vhost16(vq, headcount);
	if (likely(mergeable) &&
-	    copy_to_iter(&hdr.num_buffers, 2, &fixup) != 2) {
+	    copy_to_iter(&num_buffers, 2, &fixup) != 2) {
		vq_err(vq, "Failed num_buffers write");
vhost_discard_vq_desc(vq, headcount);
break;
-- 
1.9.1
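The offset mix-up fixed by the patch above can be seen from the header layouts alone. The structs below are simplified userspace stand-ins for the virtio-net uapi headers (field widths match the spec): num_buffers sits at byte 10, so advancing the fixup iter by the 12-byte merged-header size writes it two bytes past its slot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the virtio-net uapi headers. */
struct virtio_net_hdr {
    uint8_t  flags;
    uint8_t  gso_type;
    uint16_t hdr_len;
    uint16_t gso_size;
    uint16_t csum_start;
    uint16_t csum_offset;
};

struct virtio_net_hdr_mrg_rxbuf {
    struct virtio_net_hdr hdr;
    uint16_t num_buffers;
};

/* num_buffers lives right after the basic header, so the fixup iter
 * must be advanced by sizeof(struct virtio_net_hdr) (10 bytes), not by
 * sizeof(struct virtio_net_hdr_mrg_rxbuf) (12 bytes). */
static size_t num_buffers_offset(void)
{
    return offsetof(struct virtio_net_hdr_mrg_rxbuf, num_buffers);
}
```
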



Re: how to optimize virtio-vhost in 10G net

2015-02-14 Thread Jason Wang



On Sun, Feb 15, 2015 at 3:30 PM, Ding Xiao ssdxiaod...@gmail.com 
wrote:

I am testing virtio-vhost in a 10G environment

host info
cpu E2680@2.7GHz
memory 16G
network intel 82599BE
os centos 7

VM info
cpu 4
memory 4G
network using virtio vhost
os centos 7

I am using pktgen to send UDP packets; the results are as follows:
64b 230Mb/s
1400b 5.9Gb/s

I tested the speed in VMware too; the results are as follows:
64b 700Mb/s
1400 9.3Gb/s

I am very surprised that the speed with virtio-vhost is so slow,
so I tried to analyze this with the perf tool and
found that tun_sendmsg has an occupancy rate of 35%.


It looks like you're using pktgen in the guest. Pktgen has a known issue 
with drivers that do not have tx completion. See the discussion here: 
https://patchwork.kernel.org/patch/1800711/


So you can't trust the pktgen results in this case.


tun_sendmsg uses copy_from_user to get the data from the VM.
Perhaps mapping could improve the performance?


If you enable vhost_net zerocopy, you will see obvious improvements.


Or is there another improvement method?


I suggest using other benchmark tools (or applying the patch in 
the above link to pktgen).




Re: [PATCH net-next] vhost: remove unnecessary forward declarations in vhost.h

2014-11-30 Thread Jason Wang



On Sun, Nov 30, 2014 at 1:04 PM, David Miller da...@davemloft.net 
wrote:

From: Jason Wang jasow...@redhat.com
Date: Thu, 27 Nov 2014 14:41:21 +0800


 Signed-off-by: Jason Wang jasow...@redhat.com


I don't think generic vhost patches should go via my tree.

If you disagree, let me know why, thanks :)


Agreed. Michael, could you please pick this up into the vhost tree?



Re: [PATCH v6 28/46] vhost: make features 64 bit

2014-11-28 Thread Jason Wang



On Fri, Nov 28, 2014 at 4:10 AM, Michael S. Tsirkin m...@redhat.com 
wrote:

We need to use bit 32 for virtio 1.0

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 drivers/vhost/vhost.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 3eda654..c624b09 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -106,7 +106,7 @@ struct vhost_virtqueue {
/* Protected by virtqueue mutex. */
struct vhost_memory *memory;
void *private_data;
-   unsigned acked_features;
+   u64 acked_features;
/* Log write descriptors */
void __user *log_base;
struct vhost_log *log;
@@ -174,6 +174,6 @@ enum {
 
 static inline int vhost_has_feature(struct vhost_virtqueue *vq, int bit)
 {
-	return vq->acked_features & (1 << bit);
+	return vq->acked_features & (1ULL << bit);
 }
 #endif
 #endif
--
MST


Reviewed-by: Jason Wang jasow...@redhat.com
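The widening matters because VIRTIO_F_VERSION_1 is bit 32: with a 32-bit `acked_features`, the bit could never be stored, and `1 << 32` is undefined behaviour for a 32-bit int. A minimal sketch of the fixed check (the bit value is from the virtio 1.0 spec; the helper name is illustrative, not the kernel's):

```c
#include <assert.h>
#include <stdint.h>

#define VIRTIO_F_VERSION_1 32  /* bit position, per the virtio 1.0 spec */

/* With u64 features, 1ULL << bit reaches bits 32..63; a plain
 * `1 << bit` would be undefined behaviour for bit >= 32. */
static int vhost_has_feature_sketch(uint64_t acked_features, int bit)
{
    return (acked_features & (1ULL << bit)) != 0;
}
```
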



Re: [PATCH v6 29/46] vhost: add memory access wrappers

2014-11-28 Thread Jason Wang



On Fri, Nov 28, 2014 at 4:10 AM, Michael S. Tsirkin m...@redhat.com 
wrote:

Add guest memory access wrappers to handle virtio endianness
conversions.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 drivers/vhost/vhost.h | 31 +++
 1 file changed, 31 insertions(+)

diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index c624b09..1f321fd 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -176,4 +176,35 @@ static inline int vhost_has_feature(struct vhost_virtqueue *vq, int bit)
 {
	return vq->acked_features & (1ULL << bit);
 }
+
+/* Memory accessors */
+static inline u16 vhost16_to_cpu(struct vhost_virtqueue *vq, __virtio16 val)
+{
+	return __virtio16_to_cpu(vhost_has_feature(vq, VIRTIO_F_VERSION_1), val);
+}
+
+static inline __virtio16 cpu_to_vhost16(struct vhost_virtqueue *vq, u16 val)
+{
+	return __cpu_to_virtio16(vhost_has_feature(vq, VIRTIO_F_VERSION_1), val);
+}
+
+static inline u32 vhost32_to_cpu(struct vhost_virtqueue *vq, __virtio32 val)
+{
+	return __virtio32_to_cpu(vhost_has_feature(vq, VIRTIO_F_VERSION_1), val);
+}
+
+static inline __virtio32 cpu_to_vhost32(struct vhost_virtqueue *vq, u32 val)
+{
+	return __cpu_to_virtio32(vhost_has_feature(vq, VIRTIO_F_VERSION_1), val);
+}
+
+static inline u64 vhost64_to_cpu(struct vhost_virtqueue *vq, __virtio64 val)
+{
+	return __virtio64_to_cpu(vhost_has_feature(vq, VIRTIO_F_VERSION_1), val);
+}
+
+static inline __virtio64 cpu_to_vhost64(struct vhost_virtqueue *vq, u64 val)
+{
+	return __cpu_to_virtio64(vhost_has_feature(vq, VIRTIO_F_VERSION_1), val);
+}
 #endif
--
MST


Reviewed-by: Jason Wang jasow...@redhat.com
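The wrappers above all follow one pattern: if VIRTIO_F_VERSION_1 was acked, virtio fields are little-endian; otherwise they use the legacy guest-native endianness. A host-independent sketch of the little-endian half of the 16-bit case (the helper name is illustrative, not the kernel's):

```c
#include <assert.h>
#include <stdint.h>

/* Reassemble a 16-bit value from explicit little-endian bytes, which is
 * what __virtio16_to_cpu does for a VERSION_1 device regardless of the
 * host's native byte order. */
static uint16_t le16_to_cpu_sketch(uint8_t lo, uint8_t hi)
{
    return (uint16_t)(lo | ((uint16_t)hi << 8));
}
```
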



Re: [PATCH v6 30/46] vhost/net: force len for TX to host endian

2014-11-28 Thread Jason Wang



On Fri, Nov 28, 2014 at 4:10 AM, Michael S. Tsirkin m...@redhat.com 
wrote:

vhost/net keeps a copy of some used ring but (ab)uses length
field for internal house-keeping. This works because
for tx used length is always 0.
Suppress sparse errors: we use native endian-ness internally but never
expose it to guest.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 drivers/vhost/net.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 8dae2f7..dce5c58 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -48,15 +48,15 @@ MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX");
  * status internally; used for zerocopy tx only.
  */
 /* Lower device DMA failed */
-#define VHOST_DMA_FAILED_LEN	3
+#define VHOST_DMA_FAILED_LEN	((__force __virtio32)3)
 /* Lower device DMA done */
-#define VHOST_DMA_DONE_LEN	2
+#define VHOST_DMA_DONE_LEN	((__force __virtio32)2)
 /* Lower device DMA in progress */
-#define VHOST_DMA_IN_PROGRESS	1
+#define VHOST_DMA_IN_PROGRESS	((__force __virtio32)1)
 /* Buffer unused */
-#define VHOST_DMA_CLEAR_LEN	0
+#define VHOST_DMA_CLEAR_LEN	((__force __virtio32)0)
 
-#define VHOST_DMA_IS_DONE(len) ((len) >= VHOST_DMA_DONE_LEN)
+#define VHOST_DMA_IS_DONE(len) ((__force u32)(len) >= (__force u32)VHOST_DMA_DONE_LEN)
 
 enum {
	VHOST_NET_FEATURES = VHOST_FEATURES |
--
MST


Reviewed-by: Jason Wang jasow...@redhat.com



Re: [PATCH v6 33/46] vhost/net: larger header for virtio 1.0

2014-11-28 Thread Jason Wang



On Fri, Nov 28, 2014 at 4:10 AM, Michael S. Tsirkin m...@redhat.com 
wrote:

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 drivers/vhost/net.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index cae22f9..1ac58d0 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -1027,7 +1027,8 @@ static int vhost_net_set_features(struct vhost_net *n, u64 features)
	size_t vhost_hlen, sock_hlen, hdr_len;
	int i;
 
-	hdr_len = (features & (1 << VIRTIO_NET_F_MRG_RXBUF)) ?
+	hdr_len = (features & ((1ULL << VIRTIO_NET_F_MRG_RXBUF) |
+			       (1ULL << VIRTIO_F_VERSION_1))) ?
			sizeof(struct virtio_net_hdr_mrg_rxbuf) :
			sizeof(struct virtio_net_hdr);
	if (features & (1 << VHOST_NET_F_VIRTIO_NET_HDR)) {
--
MST



Reviewed-by: Jason Wang jasow...@redhat.com



[PATCH net-next] vhost: remove unnecessary forward declarations in vhost.h

2014-11-26 Thread Jason Wang
Signed-off-by: Jason Wang jasow...@redhat.com
---
 drivers/vhost/vhost.h | 4 
 1 file changed, 4 deletions(-)

diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 3eda654..7d039ef 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -12,8 +12,6 @@
 #include <linux/virtio_ring.h>
 #include <linux/atomic.h>
 
-struct vhost_device;
-
 struct vhost_work;
 typedef void (*vhost_work_fn_t)(struct vhost_work *work);
 
@@ -54,8 +52,6 @@ struct vhost_log {
u64 len;
 };
 
-struct vhost_virtqueue;
-
 /* The virtqueue structure describes a queue attached to a device. */
 struct vhost_virtqueue {
struct vhost_dev *dev;
-- 
1.9.1



Re: vhost + multiqueue + RSS question.

2014-11-18 Thread Jason Wang
On 11/18/2014 07:05 PM, Michael S. Tsirkin wrote:
 On Tue, Nov 18, 2014 at 11:37:03AM +0800, Jason Wang wrote:
  On 11/17/2014 07:58 PM, Michael S. Tsirkin wrote:
   On Mon, Nov 17, 2014 at 01:22:07PM +0200, Gleb Natapov wrote:
On Mon, Nov 17, 2014 at 12:38:16PM +0200, Michael S. Tsirkin wrote:
 On Mon, Nov 17, 2014 at 09:44:23AM +0200, Gleb Natapov wrote:
  On Sun, Nov 16, 2014 at 08:56:04PM +0200, Michael S. 
  Tsirkin wrote:
   On Sun, Nov 16, 2014 at 06:18:18PM +0200, Gleb 
   Natapov wrote:
Hi Michael,

 I am playing with vhost multiqueue capability 
and have a question about
vhost multiqueue and RSS (receive side 
steering). My setup has Mellanox
ConnectX-3 NIC which supports multiqueue and 
RSS. Network related
parameters for qemu are:

   -netdev 
tap,id=hn0,script=qemu-ifup.sh,vhost=on,queues=4
   -device 
virtio-net-pci,netdev=hn0,id=nic1,mq=on,vectors=10

In a guest I ran ethtool -L eth0 combined 4 
to enable multiqueue.

I am running one tcp stream into the guest 
using iperf. Since there is
only one tcp stream I expect it to be handled 
by one queue only but
this seams to be not the case. ethtool -S on a 
host shows that the
stream is handled by one queue in the NIC, 
just like I would expect,
but in a guest all 4 virtio-input interrupt 
are incremented. Am I
missing any configuration?
   
   I don't see anything obviously wrong with what you 
   describe.
   Maybe, somehow, same irqfd got bound to multiple 
   MSI vectors?
  It does not look like this is what is happening judging 
  by the way
  interrupts are distributed between queues. They are not 
  distributed
  uniformly and often I see one queue gets most interrupt 
  and others get
  much less and then it changes.
 
 Weird. It would happen if you transmitted from multiple CPUs.
 You did pin iperf to a single CPU within guest, did you not?
 
No, I didn't because I didn't expect it to matter for input 
interrupts.
When I run iperf on a host rx queue that receives all packets 
depends
only on a connection itself, not on a cpu iperf is running on (I 
tested
that).
   This really depends on the type of networking card you have
   on the host, and how it's configured.
  
   I think you will get something more closely resembling this
   behaviour if you enable RFS in host.
  
When I pin iperf in a guest I do indeed see that all interrupts
are arriving to the same irq vector. Is a number after virtio-input
in /proc/interrupt any indication of a queue a packet arrived to 
(on
a host I can use ethtool -S to check what queue receives packets, 
but
unfortunately this does not work for virtio nic in a guest)?
   I think it is.
  
Because if
it is the way RSS works in virtio is not how it works on a host 
and not
what I would expect after reading about RSS. The queue a packets 
arrives
to should be calculated by hashing fields from a packet header 
only.
   Yes, what virtio has is not RSS - it's an accelerated RFS really.
  
  Strictly speaking, not aRFS. aRFS requires a programmable filter and
  needs the driver to fill the filter on demand. For virtio-net, this is
  done automatically on the host side (tun/tap). There's no guest involvement.
 Well guest affects the filter by sending tx packets.



Yes, it is.


Re: vhost + multiqueue + RSS question.

2014-11-17 Thread Jason Wang
On 11/17/2014 07:58 PM, Michael S. Tsirkin wrote:
 On Mon, Nov 17, 2014 at 01:22:07PM +0200, Gleb Natapov wrote:
  On Mon, Nov 17, 2014 at 12:38:16PM +0200, Michael S. Tsirkin wrote:
   On Mon, Nov 17, 2014 at 09:44:23AM +0200, Gleb Natapov wrote:
On Sun, Nov 16, 2014 at 08:56:04PM +0200, Michael S. Tsirkin wrote:
 On Sun, Nov 16, 2014 at 06:18:18PM +0200, Gleb Natapov wrote:
  Hi Michael,
  
   I am playing with vhost multiqueue capability and have a 
  question about
  vhost multiqueue and RSS (receive side steering). My setup has 
  Mellanox
  ConnectX-3 NIC which supports multiqueue and RSS. Network 
  related
  parameters for qemu are:
  
 -netdev tap,id=hn0,script=qemu-ifup.sh,vhost=on,queues=4
 -device virtio-net-pci,netdev=hn0,id=nic1,mq=on,vectors=10
  
  In a guest I ran ethtool -L eth0 combined 4 to enable 
  multiqueue.
  
  I am running one tcp stream into the guest using iperf. Since 
  there is
  only one tcp stream I expect it to be handled by one queue 
  only but
  this seams to be not the case. ethtool -S on a host shows that 
  the
  stream is handled by one queue in the NIC, just like I would 
  expect,
  but in a guest all 4 virtio-input interrupt are incremented. 
  Am I
  missing any configuration?
 
 I don't see anything obviously wrong with what you describe.
 Maybe, somehow, same irqfd got bound to multiple MSI vectors?
It does not look like this is what is happening judging by the way
interrupts are distributed between queues. They are not distributed
uniformly and often I see one queue gets most interrupt and others 
get
much less and then it changes.
   
   Weird. It would happen if you transmitted from multiple CPUs.
   You did pin iperf to a single CPU within guest, did you not?
   
  No, I didn't because I didn't expect it to matter for input interrupts.
  When I run iperf on a host rx queue that receives all packets depends
  only on a connection itself, not on a cpu iperf is running on (I tested
  that).
 This really depends on the type of networking card you have
 on the host, and how it's configured.

 I think you will get something more closely resembling this
 behaviour if you enable RFS in host.

  When I pin iperf in a guest I do indeed see that all interrupts
  are arriving to the same irq vector. Is a number after virtio-input
  in /proc/interrupt any indication of a queue a packet arrived to (on
  a host I can use ethtool -S to check what queue receives packets, but
  unfortunately this does not work for virtio nic in a guest)?
 I think it is.

  Because if
  it is the way RSS works in virtio is not how it works on a host and not
  what I would expect after reading about RSS. The queue a packets arrives
  to should be calculated by hashing fields from a packet header only.
 Yes, what virtio has is not RSS - it's an accelerated RFS really.

Strictly speaking, not aRFS. aRFS requires a programmable filter and
needs the driver to fill the filter on demand. For virtio-net, this is done
automatically on the host side (tun/tap). There's no guest involvement.


 The point is to try and take application locality into account.


Yes, the locality was done through (consider an N-vcpu guest with N queues):

- the virtio-net driver will provide a default 1:1 mapping between vcpu and
txq through XPS
- the virtio-net driver will suggest a default irq affinity hint, also for a
1:1 mapping between vcpu and txq/rxq

With all these, each vcpu gets its own private txq/rxq pair. And the host-side
implementation (tun/tap) will make sure that if the packets of a flow were
received from queue N, it will also use queue N to transmit the packets
of this flow to the guest.



Re: vhost + multiqueue + RSS question.

2014-11-17 Thread Jason Wang
On 11/18/2014 09:37 AM, Zhang Haoyu wrote:
 On Mon, Nov 17, 2014 at 01:58:20PM +0200, Michael S. Tsirkin wrote:
 On Mon, Nov 17, 2014 at 01:22:07PM +0200, Gleb Natapov wrote:
 On Mon, Nov 17, 2014 at 12:38:16PM +0200, Michael S. Tsirkin wrote:
 On Mon, Nov 17, 2014 at 09:44:23AM +0200, Gleb Natapov wrote:
 On Sun, Nov 16, 2014 at 08:56:04PM +0200, Michael S. Tsirkin wrote:
 On Sun, Nov 16, 2014 at 06:18:18PM +0200, Gleb Natapov wrote:
 Hi Michael,

  I am playing with vhost multiqueue capability and have a question 
 about
 vhost multiqueue and RSS (receive side steering). My setup has Mellanox
 ConnectX-3 NIC which supports multiqueue and RSS. Network related
 parameters for qemu are:

-netdev tap,id=hn0,script=qemu-ifup.sh,vhost=on,queues=4
-device virtio-net-pci,netdev=hn0,id=nic1,mq=on,vectors=10

 In a guest I ran ethtool -L eth0 combined 4 to enable multiqueue.

 I am running one tcp stream into the guest using iperf. Since there is
 only one tcp stream I expect it to be handled by one queue only but
 this seams to be not the case. ethtool -S on a host shows that the
 stream is handled by one queue in the NIC, just like I would expect,
 but in a guest all 4 virtio-input interrupt are incremented. Am I
 missing any configuration?
 I don't see anything obviously wrong with what you describe.
 Maybe, somehow, same irqfd got bound to multiple MSI vectors?
 It does not look like this is what is happening judging by the way
 interrupts are distributed between queues. They are not distributed
 uniformly and often I see one queue gets most interrupt and others get
 much less and then it changes.
 Weird. It would happen if you transmitted from multiple CPUs.
 You did pin iperf to a single CPU within guest, did you not?

 No, I didn't because I didn't expect it to matter for input interrupts.
 When I run iperf on a host rx queue that receives all packets depends
 only on a connection itself, not on a cpu iperf is running on (I tested
 that).
 This really depends on the type of networking card you have
 on the host, and how it's configured.

 I think you will get something more closely resembling this
 behaviour if you enable RFS in host.

 When I pin iperf in a guest I do indeed see that all interrupts
 are arriving to the same irq vector. Is a number after virtio-input
 in /proc/interrupt any indication of a queue a packet arrived to (on
 a host I can use ethtool -S to check what queue receives packets, but
 unfortunately this does not work for virtio nic in a guest)?
 I think it is.

 Because if
 it is the way RSS works in virtio is not how it works on a host and not
 what I would expect after reading about RSS. The queue a packets arrives
 to should be calculated by hashing fields from a packet header only.
 Yes, what virtio has is not RSS - it's an accelerated RFS really.

 OK, if what virtio has is RFS and not RSS my test results make sense.
 Thanks!
 I think the RSS emulation for virtio-mq NIC is implemented in 
 tun_select_queue(),
 am I missing something?

 Thanks,
 Zhang Haoyu


Yes, if RSS is short for Receive Side Steering, which is a generic
technology. But RSS is usually short for Receive Side Scaling, a
technology commonly used by Windows; it is implemented through an
indirection table in the card, which is obviously not supported in tun
currently.
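For contrast with the flow-cache approach, hardware RSS (Receive Side Scaling) picks the queue purely from a hash of the packet's header fields via an indirection table. A toy sketch (table size and spreading policy are illustrative assumptions, not any particular NIC's):

```c
#include <assert.h>
#include <stdint.h>

#define TABLE_SIZE 128

static int indirection[TABLE_SIZE];

/* Spread table entries across the available queues, as a driver would
 * when programming the NIC. */
static void rss_init(int nqueues)
{
    for (int i = 0; i < TABLE_SIZE; i++)
        indirection[i] = i % nqueues;
}

/* The NIC hashes the packet's header fields and indexes the table with
 * the hash, so the same flow always lands on the same queue, with no
 * knowledge of where the consumer is running. */
static int rss_queue(uint32_t flow_hash)
{
    return indirection[flow_hash % TABLE_SIZE];
}
```
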


Re: vhost + multiqueue + RSS question.

2014-11-16 Thread Jason Wang
On 11/17/2014 02:56 AM, Michael S. Tsirkin wrote:
 On Sun, Nov 16, 2014 at 06:18:18PM +0200, Gleb Natapov wrote:
 Hi Michael,

  I am playing with vhost multiqueue capability and have a question about
 vhost multiqueue and RSS (receive side steering). My setup has Mellanox
 ConnectX-3 NIC which supports multiqueue and RSS. Network related
 parameters for qemu are:

-netdev tap,id=hn0,script=qemu-ifup.sh,vhost=on,queues=4
-device virtio-net-pci,netdev=hn0,id=nic1,mq=on,vectors=10

 In a guest I ran ethtool -L eth0 combined 4 to enable multiqueue.

 I am running one tcp stream into the guest using iperf. Since there is
 only one tcp stream I expect it to be handled by one queue only but
 this seams to be not the case. ethtool -S on a host shows that the
 stream is handled by one queue in the NIC, just like I would expect,
 but in a guest all 4 virtio-input interrupt are incremented. Am I
 missing any configuration?
 I don't see anything obviously wrong with what you describe.
 Maybe, somehow, same irqfd got bound to multiple MSI vectors?
 To see, can you try dumping struct kvm_irqfd that's passed to kvm?


 --
  Gleb.

This sounds like a regression. Which kernel/qemu versions did you use?


Re: vhost + multiqueue + RSS question.

2014-11-16 Thread Jason Wang
On 11/17/2014 12:54 PM, Venkateswara Rao Nandigam wrote:
 I have a question related to this topic. How do you set the RSS key on the
 Mellanox NIC? I mean from your guest?

I believe it's possible but not implemented currently. The issue is that the
implementation should not be vendor-specific.

TUN/TAP has its own automatic flow steering implementation (flow caches).

 If it is being set as part of the host driver, is there a way to set it from
 the guest? I mean my guest will choose an RSS key and try to set it on the
 physical NIC.

Flow caches can co-operate with RFS/aRFS now, so I believe there is indeed
some kind of co-operation between the host card and the guest.

 Thanks,
 Venkatesh




Re: [PATCH net-next RFC 1/3] virtio: support for urgent descriptors

2014-10-16 Thread Jason Wang
On 10/15/2014 01:40 PM, Rusty Russell wrote:
 Jason Wang jasow...@redhat.com writes:
 Below should be useful for some experiments Jason is doing.
 I thought I'd send it out for early review/feedback.

 event idx feature allows us to defer interrupts until
 a specific # of descriptors were used.
 Sometimes it might be useful to get an interrupt after
 a specific descriptor, regardless.
 This adds a descriptor flag for this, and an API
 to create an urgent output descriptor.
 This is still an RFC:
 we'll need a feature bit for drivers to detect this,
 but we've run out of feature bits for virtio 0.X.
 For experimentation purposes, drivers can assume
 this is set, or add a driver-specific feature bit.

 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 Signed-off-by: Jason Wang jasow...@redhat.com
 The new VRING_DESC_F_URGENT bit is theoretically nicer, but for
 networking (which tends to take packets in order) couldn't we just set
 the event counter to give us a tx interrupt at the packet we want?

 Cheers,
 Rusty.

Yes, we could. The recent RFC that enables tx interrupts uses this.


Re: [PATCH net-next RFC 0/3] virtio-net: Conditionally enable tx interrupt

2014-10-15 Thread Jason Wang
On 10/15/2014 07:06 AM, Michael S. Tsirkin wrote:
 On Tue, Oct 14, 2014 at 02:53:27PM -0400, David Miller wrote:
  From: Jason Wang jasow...@redhat.com
  Date: Sat, 11 Oct 2014 15:16:43 +0800
  
   We currently free old transmitted packets in ndo_start_xmit(), so any
   packet must also be orphaned there. This was used to reduce the
   overhead of tx interrupts to achieve better performance. But this may
   not work for some protocols such as TCP streams. TCP depends on the
   value of sk_wmem_alloc to implement various optimizations for
   small-packet streams, such as TCP small queues and auto corking. But
   orphaning packets early in ndo_start_xmit() more or less disables
   such things, since sk_wmem_alloc is no longer accurate. This leads to
   extra low throughput for TCP streams of small writes.
   
   This series tries to solve this issue by enabling tx interrupts for
   all TCP packets other than the ones with the push bit set or pure
   ACKs. This is done through the support of urgent descriptors, which
   can force an interrupt for a specified packet. If the tx interrupt
   was enabled for a packet, there's no need to orphan it in
   ndo_start_xmit(); we can free it in tx napi, which is scheduled by
   the tx interrupt. Then sk_wmem_alloc is more accurate than before,
   and TCP can batch more for small writes. Larger skbs are produced by
   TCP in this case, improving both throughput and cpu utilization.
   
   Tests show great improvements on small-write tcp streams. For most of
   the other cases, the throughput and cpu utilization are the same as
   in the past. Only in a few cases was more cpu utilization noticed,
   which needs more investigation.
   
   Review and comments are welcome.
  
  I think proper accounting and queueing (at all levels, not just TCP
  sockets) is more important than trying to skim a bunch of cycles by
  avoiding TX interrupts.
  
  Having an event to free the SKB is absolutely essential for the stack
  to operate correctly.
  
  And with virtio-net you don't even have the excuse of the HW
  unfortunately doesn't have an appropriate TX event.
  
  So please don't play games, and instead use TX interrupts all the
  time.  You can mitigate them in various ways, but don't turn them on
  selectively based upon traffic type, that's terrible.
  
  You can even use ->xmit_more to defer the TX interrupt indication to
  the final TX packet in the chain.
 I guess we can just defer the kick, interrupt will naturally be
 deferred as well.
 This should solve the problem for old hosts as well.

Interrupts were delayed but not reduced. To support this, we need to publish
the avail idx as the used event. This should reduce tx interrupts in the
case of bulk dequeuing.

I will draft a new rfc series containing this.

 We'll also need to implement bql for this.
 Something like the below?
 Completely untested - posting here to see if I figured the
 API out correctly. Has to be applied on top of the previous patch.

Looks so. I believe it's better to have, but not a must.


Re: [PATCH net-next RFC 0/3] virtio-net: Conditionally enable tx interrupt

2014-10-14 Thread Jason Wang
On 10/15/2014 05:51 AM, Michael S. Tsirkin wrote:
 On Tue, Oct 14, 2014 at 02:53:27PM -0400, David Miller wrote:
 From: Jason Wang jasow...@redhat.com
 Date: Sat, 11 Oct 2014 15:16:43 +0800

 We currently free old transmitted packets in ndo_start_xmit(), so any
 packet must also be orphaned there. This was used to reduce the overhead of
 tx interrupts to achieve better performance. But this may not work for some
 protocols such as TCP streams. TCP depends on the value of sk_wmem_alloc to
 implement various optimizations for small-packet streams, such as TCP small
 queues and auto corking. But orphaning packets early in ndo_start_xmit()
 more or less disables such things, since sk_wmem_alloc is no longer
 accurate. This leads to extra low throughput for TCP streams of small
 writes.

 This series tries to solve this issue by enabling tx interrupts for all TCP
 packets other than the ones with the push bit set or pure ACKs. This is
 done through the support of urgent descriptors, which can force an
 interrupt for a specified packet. If the tx interrupt was enabled for a
 packet, there's no need to orphan it in ndo_start_xmit(); we can free it in
 tx napi, which is scheduled by the tx interrupt. Then sk_wmem_alloc is more
 accurate than before, and TCP can batch more for small writes. Larger skbs
 are produced by TCP in this case, improving both throughput and cpu
 utilization.

 Tests show great improvements on small-write tcp streams. For most of the
 other cases, the throughput and cpu utilization are the same as in the
 past. Only in a few cases was more cpu utilization noticed, which needs
 more investigation.

 Review and comments are welcome.
 I think proper accounting and queueing (at all levels, not just TCP
 sockets) is more important than trying to skim a bunch of cycles by
 avoiding TX interrupts.

 Having an event to free the SKB is absolutely essential for the stack
 to operate correctly.

 And with virtio-net you don't even have the excuse of the HW
 unfortunately doesn't have an appropriate TX event.

 So please don't play games, and instead use TX interrupts all the
 time.  You can mitigate them in various ways, but don't turn them on
 selectively based upon traffic type, that's terrible.
 This got me thinking: how about using virtqueue_enable_cb_delayed
 for this mitigation?

It should work. Another possible solution is interrupt coalescing, which can
also speed up the case without the event index.
 It's pretty easy to implement - I'll send a proof of concept patch
 separately.


Re: [PATCH net-next RFC 3/3] virtio-net: conditionally enable tx interrupt

2014-10-14 Thread Jason Wang
On 10/15/2014 05:51 AM, Michael S. Tsirkin wrote:
 On Sat, Oct 11, 2014 at 03:16:46PM +0800, Jason Wang wrote:
  We used to free transmitted packets in ndo_start_xmit() to get better
  performance. One side effect is that skb_orphan() needs to be called in
  ndo_start_xmit(), which makes sk_wmem_alloc inaccurate in fact. For the
  TCP protocol, this means several optimizations, such as TCP small queues
  and auto corking, could not work well. This can lead to extra low
  throughput for small-packet streams.
  
  Thanks to the urgent descriptor support, this patch tries to solve this
  issue by enabling the tx interrupt selectively for stream packets. This
  means we don't need to orphan TCP stream packets in ndo_start_xmit() but
  instead enable the tx interrupt for those packets. After we get a tx
  interrupt, a tx napi is scheduled to free those packets.
  
  With this method, sk_wmem_alloc of the TCP socket is more accurate than
  in the past, which lets TCP batch more through TSQ and auto corking.
  
  Signed-off-by: Jason Wang jasow...@redhat.com
  ---
   drivers/net/virtio_net.c | 164 
  ---
   1 file changed, 128 insertions(+), 36 deletions(-)
  
  diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
  index 5810841..b450fc4 100644
  --- a/drivers/net/virtio_net.c
  +++ b/drivers/net/virtio_net.c
  @@ -72,6 +72,8 @@ struct send_queue {
   
 /* Name of the send queue: output.$index */
 char name[40];
  +
  +  struct napi_struct napi;
   };
   
   /* Internal representation of a receive virtqueue */
  @@ -217,15 +219,40 @@ static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask)
  	return p;
   }
   
  +static int free_old_xmit_skbs(struct send_queue *sq, int budget)
  +{
  +	struct sk_buff *skb;
  +	unsigned int len;
  +	struct virtnet_info *vi = sq->vq->vdev->priv;
  +	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
  +	int sent = 0;
  +
  +	while (sent < budget &&
  +	       (skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
  +		pr_debug("Sent skb %p\n", skb);
  +
  +		u64_stats_update_begin(&stats->tx_syncp);
  +		stats->tx_bytes += skb->len;
  +		stats->tx_packets++;
  +		u64_stats_update_end(&stats->tx_syncp);
  +
  +		dev_kfree_skb_any(skb);
  +		sent++;
  +	}
  +
  +	return sent;
  +}
  +
   static void skb_xmit_done(struct virtqueue *vq)
   {
  	struct virtnet_info *vi = vq->vdev->priv;
  +	struct send_queue *sq = &vi->sq[vq2txq(vq)];
  
  -	/* Suppress further interrupts. */
  -	virtqueue_disable_cb(vq);
  -
  -	/* We were probably waiting for more output buffers. */
  -	netif_wake_subqueue(vi->dev, vq2txq(vq));
  +	if (napi_schedule_prep(&sq->napi)) {
  +		virtqueue_disable_cb(vq);
  +		virtqueue_disable_cb_urgent(vq);
 This disable_cb is no longer safe in xmit_done callback,
 since queue can be running at the same time.

 You must do it under tx lock. And yes, this likely will not work
 work well without event_idx. We'll probably need extra
 synchronization for such old hosts.




Yes, and the virtqueue_enable_cb_prepare() in virtnet_poll_tx() needs to
be synced with virtqueue_enable_cb_delayed(). Otherwise an old idx will
be published and we may still see a tx interrupt storm.


Re: [PATCH net-next RFC 3/3] virtio-net: conditionally enable tx interrupt

2014-10-13 Thread Jason Wang
On 10/11/2014 10:48 PM, Eric Dumazet wrote:
 On Sat, 2014-10-11 at 15:16 +0800, Jason Wang wrote:
 We used to free transmitted packets in ndo_start_xmit() to get better
 performance. One side effect is that skb_orphan() needs to be called in
 ndo_start_xmit(), which makes sk_wmem_alloc inaccurate in fact. For the
 TCP protocol, this means several optimizations, such as TCP small queues
 and auto corking, could not work well. This can lead to extra low
 throughput for small-packet streams.

 Thanks to the urgent descriptor support, this patch tries to solve this
 issue by enabling the tx interrupt selectively for stream packets. This
 means we don't need to orphan TCP stream packets in ndo_start_xmit() but
 instead enable the tx interrupt for those packets. After we get a tx
 interrupt, a tx napi is scheduled to free those packets.

 With this method, sk_wmem_alloc of the TCP socket is more accurate than
 in the past, which lets TCP batch more through TSQ and auto corking.

 Signed-off-by: Jason Wang jasow...@redhat.com
 ---
  drivers/net/virtio_net.c | 164 
 ---
  1 file changed, 128 insertions(+), 36 deletions(-)

 diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
 index 5810841..b450fc4 100644
 --- a/drivers/net/virtio_net.c
 +++ b/drivers/net/virtio_net.c
 @@ -72,6 +72,8 @@ struct send_queue {
  
  /* Name of the send queue: output.$index */
  char name[40];
 +
 +struct napi_struct napi;
  };
  
  /* Internal representation of a receive virtqueue */
 @@ -217,15 +219,40 @@ static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask)
 	return p;
  }
  
 +static int free_old_xmit_skbs(struct send_queue *sq, int budget)
 +{
 +	struct sk_buff *skb;
 +	unsigned int len;
 +	struct virtnet_info *vi = sq->vq->vdev->priv;
 +	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
 +	int sent = 0;
 +
 +	while (sent < budget &&
 +	       (skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
 +		pr_debug("Sent skb %p\n", skb);
 +
 +		u64_stats_update_begin(&stats->tx_syncp);
 +		stats->tx_bytes += skb->len;
 +		stats->tx_packets++;
 +		u64_stats_update_end(&stats->tx_syncp);
 +
 +		dev_kfree_skb_any(skb);
 +		sent++;
 +	}
 +
 You could accumulate skb->len in a totlen var, and perform a single

	u64_stats_update_begin(&stats->tx_syncp);
	stats->tx_bytes += totlen;
	stats->tx_packets += sent;
	u64_stats_update_end(&stats->tx_syncp);

 after the loop.


Yes, will do this in a separate patch.
 +	return sent;
 +}
 +
 ...

 +
 +static bool virtnet_skb_needs_intr(struct sk_buff *skb)
 +{
 +union {
 +unsigned char *network;
 +struct iphdr *ipv4;
 +struct ipv6hdr *ipv6;
 +} hdr;
 +struct tcphdr *th = tcp_hdr(skb);
 +u16 payload_len;
 +
 +hdr.network = skb_network_header(skb);
 +
 +/* Only IPv4/IPv6 with TCP is supported */
   Oh well, yet another packet flow dissector :)

   If most packets were caught by your implementation, you could use it
 for the fast path and fall back to skb_flow_dissect() for encapsulated
 traffic.

   struct flow_keys keys;

   if (!skb_flow_dissect(skb, &keys))
	   return false;

   if (keys.ip_proto != IPPROTO_TCP)
	   return false;

   then check __skb_get_poff() for how to get th, and check if there is some
 payload...



Yes, but we don't know whether most packets are TCP or encapsulated TCP;
that depends on the userspace application. If not, it looks like
skb_flow_dissect() can bring some overhead. Or could it be ignored?



Re: [PATCH net-next RFC 1/3] virtio: support for urgent descriptors

2014-10-13 Thread Jason Wang
On 10/12/2014 05:27 PM, Michael S. Tsirkin wrote:
 On Sat, Oct 11, 2014 at 03:16:44PM +0800, Jason Wang wrote:
 Below should be useful for some experiments Jason is doing.
 I thought I'd send it out for early review/feedback.

 event idx feature allows us to defer interrupts until
 a specific # of descriptors were used.
 Sometimes it might be useful to get an interrupt after
 a specific descriptor, regardless.
 This adds a descriptor flag for this, and an API
 to create an urgent output descriptor.
 This is still an RFC:
 we'll need a feature bit for drivers to detect this,
 but we've run out of feature bits for virtio 0.X.
 For experimentation purposes, drivers can assume
 this is set, or add a driver-specific feature bit.

 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 Signed-off-by: Jason Wang jasow...@redhat.com
 I see that as compared to my original patch, you have
 added a new flag: VRING_AVAIL_F_NO_URGENT_INTERRUPT
 I don't think it's necessary, see below.

 As such, I think this patch should be split:
 - original patch adding support for urgent descriptors
 - a patch adding virtqueue_enable/disable_cb_urgent(_prepare)?

Not sure this is a good idea, since the API of the first patch would be
incomplete.
 ---
  drivers/virtio/virtio_ring.c | 75 
 +---
  include/linux/virtio.h   | 14 
  include/uapi/linux/virtio_ring.h |  5 ++-
  3 files changed, 89 insertions(+), 5 deletions(-)

[...]
  
 +unsigned virtqueue_enable_cb_prepare_urgent(struct virtqueue *_vq)
 +{
 +	struct vring_virtqueue *vq = to_vvq(_vq);
 +	u16 last_used_idx;
 +
 +	START_USE(vq);
 +	vq->vring.avail->flags &= ~VRING_AVAIL_F_NO_URGENT_INTERRUPT;
 +	last_used_idx = vq->last_used_idx;
 +	END_USE(vq);
 +	return last_used_idx;
 +}
 +EXPORT_SYMBOL_GPL(virtqueue_enable_cb_prepare_urgent);
 +
 You can implement virtqueue_enable_cb_prepare_urgent
 simply by clearing VRING_AVAIL_F_NO_INTERRUPT.

 The effect is same: host sends interrupts only if there
 is an urgent descriptor.

Seems not; consider the case where the event index feature is disabled. This
would turn on all interrupts.


[PATCH net-next RFC 0/3] virtio-net: Conditionally enable tx interrupt

2014-10-11 Thread Jason Wang
Hello all:

We currently free old transmitted packets in ndo_start_xmit(), so any
packet must also be orphaned there. This was used to reduce the overhead of
tx interrupts to achieve better performance. But this may not work for some
protocols such as TCP streams. TCP depends on the value of sk_wmem_alloc to
implement various optimizations for small-packet streams, such as TCP small
queues and auto corking. But orphaning packets early in ndo_start_xmit()
more or less disables such things, since sk_wmem_alloc is no longer
accurate. This leads to extra low throughput for TCP streams of small
writes.

This series tries to solve this issue by enabling tx interrupts for all TCP
packets other than the ones with the push bit set or pure ACKs. This is
done through the support of urgent descriptors, which can force an
interrupt for a specified packet. If the tx interrupt was enabled for a
packet, there's no need to orphan it in ndo_start_xmit(); we can free it in
tx napi, which is scheduled by the tx interrupt. Then sk_wmem_alloc is more
accurate than before, and TCP can batch more for small writes. Larger skbs
are produced by TCP in this case, improving both throughput and cpu
utilization.

Tests show great improvements on small-write tcp streams. For most of the
other cases, the throughput and cpu utilization are the same as in the
past. Only in a few cases was more cpu utilization noticed, which needs
more investigation.

Review and comments are welcome.

Thanks

Test result:

- Two Intel Corporation Xeon 5600s (8 cores) with back to back connected
  82599ES:
- netperf test between guest and remote host
- 1 queue 2 vcpus with zerocopy-enabled vhost_net
- both host and guest are net-next.git with the patches.
- Values in '[]' indicate an obvious difference (significance greater
  than 95%).
- The significance of the differences between the two averages is calculated
  using an unpaired T-test that takes into account the SD of the averages.

Guest RX
size/sessions/throughput-+%/cpu-+%/per cpu throughput -+%/
64/1/+3.7872%/+3.2307%/+0.5390%/
64/2/-0.2325%/+2.9552%/-3.0962%/
64/4/[-2.0296%]/+2.2955%/[-4.2280%]/
64/8/+0.0944%/[+2.2654%]/-2.4662%/
256/1/+1.1947%/-2.5462%/+3.8386%/
256/2/-1.6477%/+3.4421%/-4.9301%/
256/4/[-5.9526%]/[+6.8861%]/[-11.9951%]/
256/8/-3.6470%/-1.5887%/-2.0916%/
1024/1/-4.2225%/-1.3238%/-2.9376%/
1024/2/+0.3568%/+1.8439%/-1.4601%/
1024/4/-0.7065%/-0.0099%/-2.3483%/
1024/8/-1.8620%/-2.4774%/+0.6310%/
4096/1/+0.0115%/-0.3693%/+0.3823%/
4096/2/-0.0209%/+0.8730%/-0.8862%/
4096/4/+0.0729%/-7.0303%/+7.6403%/
4096/8/-2.3720%/+0.0507%/-2.4214%/
16384/1/+0.0222%/-1.8672%/+1.9254%/
16384/2/+0.0986%/+3.2968%/-3.0961%/
16384/4/-1.2059%/+7.4291%/-8.0379%/
16384/8/-1.4893%/+0.3403%/-1.8234%/
65535/1/-0.0445%/-1.4060%/+1.3808%/
65535/2/-0.0311%/+0.9610%/-0.9827%/
65535/4/-0.7015%/+0.3660%/-1.0637%/
65535/8/-3.1585%/+11.1302%/[-12.8576%]/

Guest TX
size/sessions/throughput-+%/cpu-+%/per cpu throughput -+%/
64/1/[+75.2622%]/[-14.3928%]/[+104.7283%]/
64/2/[+68.9596%]/[-12.6655%]/[+93.4625%]/
64/4/[+68.0126%]/[-12.7982%]/[+92.6710%]/
64/8/[+67.9870%]/[-12.6297%]/[+92.2703%]/
256/1/[+160.4177%]/[-26.9643%]/[+256.5624%]/
256/2/[+48.4357%]/[-24.3380%]/[+96.1825%]/
256/4/[+48.3663%]/[-24.1127%]/[+95.5087%]/
256/8/[+47.9722%]/[-24.2516%]/[+95.3469%]/
1024/1/[+54.4474%]/[-52.9223%]/[+228.0694%]/
1024/2/+0.0742%/[-12.7444%]/[+14.6908%]/
1024/4/[+0.5524%]/-0.0327%/+0.5853%/
1024/8/[-1.2783%]/[+6.2902%]/[-7.1206%]/
4096/1/+0.0778%/-13.1121%/+15.1804%/
4096/2/+0.0189%/[-11.3176%]/[+12.7832%]/
4096/4/+0.0218%/-1.0389%/+1.0718%/
4096/8/-1.3774%/[+12.7396%]/[-12.5218%]/
16384/1/+0.0136%/-2.5043%/+2.5826%/
16384/2/+0.0509%/[-15.3846%]/[+18.2420%]/
16384/4/-0.0163%/[-4.8808%]/[+5.1141%]/
16384/8/[-1.7249%]/[+13.9174%]/[-13.7313%]/
65535/1/+0.0686%/-5.4942%/+5.8862%/
65535/2/+0.0043%/[-7.5816%]/[+8.2082%]/
65535/4/+0.0080%/[-7.2993%]/[+7.8827%]/
65535/8/[-1.3669%]/[+16.6536%]/[-15.4479%]/

Guest TCP_RR
size/sessions/throughput-+%/cpu-+%/per cpu throughput -+%/
256/1/-0.2914%/+12.6457%/-11.4848%/
256/25/-0.5968%/-5.0531%/+4.6935%/
256/50/+0.0262%/+0.2079%/-0.1813%/
4096/1/+2.6965%/[+16.1248%]/[-11.5636%]/
4096/25/-0.5002%/+0.5449%/-1.0395%/
4096/50/[-2.0987%]/-0.0330%/[-2.0664%]/

Tests on mlx4 are ongoing; I will post the results next week.

Jason Wang (3):
  virtio: support for urgent descriptors
  vhost: support urgent descriptors
  virtio-net: conditionally enable tx interrupt

 drivers/net/virtio_net.c | 164 ++-
 drivers/vhost/net.c  |  43 +++---
 drivers/vhost/scsi.c |  23 --
 drivers/vhost/test.c |   5 +-
 drivers/vhost/vhost.c|  44 +++
 drivers/vhost/vhost.h|  19 +++--
 drivers/virtio/virtio_ring.c |  75 +-
 include/linux/virtio.h   |  14 
 include/uapi/linux/virtio_ring.h |   5 +-
 9 files changed, 308 insertions(+), 84 deletions(-)

-- 
1.8.3.1


[PATCH net-next RFC 3/3] virtio-net: conditionally enable tx interrupt

2014-10-11 Thread Jason Wang
We used to free transmitted packets in ndo_start_xmit() to get better
performance. One side effect is that skb_orphan() needs to be called in
ndo_start_xmit(), which makes sk_wmem_alloc inaccurate in fact. For the
TCP protocol, this means several optimizations, such as TCP small queues
and auto corking, could not work well. This can lead to extra low
throughput for small-packet streams.

Thanks to the urgent descriptor support, this patch tries to solve this
issue by enabling the tx interrupt selectively for stream packets. This
means we don't need to orphan TCP stream packets in ndo_start_xmit() but
instead enable the tx interrupt for those packets. After we get a tx
interrupt, a tx napi is scheduled to free those packets.

With this method, sk_wmem_alloc of the TCP socket is more accurate than in
the past, which lets TCP batch more through TSQ and auto corking.

Signed-off-by: Jason Wang jasow...@redhat.com
---
 drivers/net/virtio_net.c | 164 ---
 1 file changed, 128 insertions(+), 36 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 5810841..b450fc4 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -72,6 +72,8 @@ struct send_queue {
 
/* Name of the send queue: output.$index */
char name[40];
+
+   struct napi_struct napi;
 };
 
 /* Internal representation of a receive virtqueue */
@@ -217,15 +219,40 @@ static struct page *get_a_page(struct receive_queue *rq, 
gfp_t gfp_mask)
return p;
 }
 
+static int free_old_xmit_skbs(struct send_queue *sq, int budget)
+{
+	struct sk_buff *skb;
+	unsigned int len;
+	struct virtnet_info *vi = sq->vq->vdev->priv;
+	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
+	int sent = 0;
+
+	while (sent < budget &&
+	       (skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
+		pr_debug("Sent skb %p\n", skb);
+
+		u64_stats_update_begin(&stats->tx_syncp);
+		stats->tx_bytes += skb->len;
+		stats->tx_packets++;
+		u64_stats_update_end(&stats->tx_syncp);
+
+		dev_kfree_skb_any(skb);
+		sent++;
+	}
+
+	return sent;
+}
+
 static void skb_xmit_done(struct virtqueue *vq)
 {
	struct virtnet_info *vi = vq->vdev->priv;
+	struct send_queue *sq = &vi->sq[vq2txq(vq)];
 
-	/* Suppress further interrupts. */
-	virtqueue_disable_cb(vq);
-
-	/* We were probably waiting for more output buffers. */
-	netif_wake_subqueue(vi->dev, vq2txq(vq));
+	if (napi_schedule_prep(&sq->napi)) {
+		virtqueue_disable_cb(vq);
+		virtqueue_disable_cb_urgent(vq);
+		__napi_schedule(&sq->napi);
+	}
 }
 
 static unsigned int mergeable_ctx_to_buf_truesize(unsigned long mrg_ctx)
@@ -772,7 +799,38 @@ again:
return received;
 }
 
+static int virtnet_poll_tx(struct napi_struct *napi, int budget)
+{
+	struct send_queue *sq =
+		container_of(napi, struct send_queue, napi);
+	struct virtnet_info *vi = sq->vq->vdev->priv;
+	struct netdev_queue *txq = netdev_get_tx_queue(vi->dev, vq2txq(sq->vq));
+	unsigned int r, sent = 0;
+
+again:
+	__netif_tx_lock(txq, smp_processor_id());
+	sent += free_old_xmit_skbs(sq, budget - sent);
+
+	if (sent < budget) {
+		r = virtqueue_enable_cb_prepare_urgent(sq->vq);
+		napi_complete(napi);
+		__netif_tx_unlock(txq);
+		if (unlikely(virtqueue_poll(sq->vq, r)) &&
+		    napi_schedule_prep(napi)) {
+			virtqueue_disable_cb_urgent(sq->vq);
+			__napi_schedule(napi);
+			goto again;
+		}
+	} else {
+		__netif_tx_unlock(txq);
+	}
+
+	netif_wake_subqueue(vi->dev, vq2txq(sq->vq));
+	return sent;
+}
+
 #ifdef CONFIG_NET_RX_BUSY_POLL
+
 /* must be called with local_bh_disable()d */
 static int virtnet_busy_poll(struct napi_struct *napi)
 {
@@ -820,31 +878,13 @@ static int virtnet_open(struct net_device *dev)
		if (!try_fill_recv(&vi->rq[i], GFP_KERNEL))
			schedule_delayed_work(&vi->refill, 0);
		virtnet_napi_enable(&vi->rq[i]);
+		napi_enable(&vi->sq[i].napi);
}
 
return 0;
 }
 
-static void free_old_xmit_skbs(struct send_queue *sq)
-{
-	struct sk_buff *skb;
-	unsigned int len;
-	struct virtnet_info *vi = sq->vq->vdev->priv;
-	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
-
-	while ((skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
-		pr_debug("Sent skb %p\n", skb);
-
-		u64_stats_update_begin(&stats->tx_syncp);
-		stats->tx_bytes += skb->len;
-		stats->tx_packets++;
-		u64_stats_update_end(&stats->tx_syncp);
-
-		dev_kfree_skb_any(skb);
-	}
-}
-
-static int

[PATCH net-next RFC 2/3] vhost: support urgent descriptors

2014-10-11 Thread Jason Wang
This patch lets vhost-net support urgent descriptors. For the zerocopy case,
two new types of length were introduced to make it work.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
Signed-off-by: Jason Wang jasow...@redhat.com
---
 drivers/vhost/net.c   | 43 +++
 drivers/vhost/scsi.c  | 23 +++
 drivers/vhost/test.c  |  5 +++--
 drivers/vhost/vhost.c | 44 +---
 drivers/vhost/vhost.h | 19 +--
 5 files changed, 91 insertions(+), 43 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 8dae2f7..37b0bb5 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -48,9 +48,13 @@ MODULE_PARM_DESC(experimental_zcopytx, Enable Zero Copy TX;
  * status internally; used for zerocopy tx only.
  */
 /* Lower device DMA failed */
-#define VHOST_DMA_FAILED_LEN	3
+#define VHOST_DMA_FAILED_LEN	5
+/* Lower device DMA done, urgent bit set */
+#define VHOST_DMA_DONE_LEN_URGENT	4
 /* Lower device DMA done */
-#define VHOST_DMA_DONE_LEN	2
+#define VHOST_DMA_DONE_LEN	3
+/* Lower device DMA in progress, urgent bit set */
+#define VHOST_DMA_URGENT	2
 /* Lower device DMA in progress */
 #define VHOST_DMA_IN_PROGRESS	1
 /* Buffer unused */
 /* Buffer unused */
@@ -284,11 +288,13 @@ static void vhost_zerocopy_signal_used(struct vhost_net *net,
		container_of(vq, struct vhost_net_virtqueue, vq);
	int i, add;
	int j = 0;
+	bool urgent = false;
 
	for (i = nvq->done_idx; i != nvq->upend_idx; i = (i + 1) % UIO_MAXIOV) {
		if (vq->heads[i].len == VHOST_DMA_FAILED_LEN)
			vhost_net_tx_err(net);
		if (VHOST_DMA_IS_DONE(vq->heads[i].len)) {
+			urgent = urgent || vq->heads[i].len == VHOST_DMA_DONE_LEN_URGENT;
			vq->heads[i].len = VHOST_DMA_CLEAR_LEN;
			++j;
		} else
@@ -296,7 +302,7 @@ static void vhost_zerocopy_signal_used(struct vhost_net *net,
	}
	while (j) {
		add = min(UIO_MAXIOV - nvq->done_idx, j);
-		vhost_add_used_and_signal_n(vq->dev, vq,
+		vhost_add_used_and_signal_n(vq->dev, vq, urgent,
					    &vq->heads[nvq->done_idx], add);
		nvq->done_idx = (nvq->done_idx + add) % UIO_MAXIOV;
		j -= add;
@@ -311,9 +317,14 @@ static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success)
 
	rcu_read_lock_bh();
 
-	/* set len to mark this desc buffers done DMA */
-	vq->heads[ubuf->desc].len = success ?
-		VHOST_DMA_DONE_LEN : VHOST_DMA_FAILED_LEN;
+	if (success) {
+		if (vq->heads[ubuf->desc].len == VHOST_DMA_IN_PROGRESS)
+			vq->heads[ubuf->desc].len = VHOST_DMA_DONE_LEN;
+		else
+			vq->heads[ubuf->desc].len = VHOST_DMA_DONE_LEN_URGENT;
+	} else {
+		vq->heads[ubuf->desc].len = VHOST_DMA_FAILED_LEN;
+	}
	cnt = vhost_net_ubuf_put(ubufs);
 
	/*
@@ -363,6 +374,7 @@ static void handle_tx(struct vhost_net *net)
	zcopy = nvq->ubufs;
 
	for (;;) {
+		bool urgent;
		/* Release DMAs done buffers first */
		if (zcopy)
			vhost_zerocopy_signal_used(net, vq);
@@ -374,7 +386,7 @@ static void handle_tx(struct vhost_net *net)
			      % UIO_MAXIOV == nvq->done_idx))
			break;
 
-		head = vhost_get_vq_desc(vq, vq->iov,
+		head = vhost_get_vq_desc(vq, &urgent, vq->iov,
					 ARRAY_SIZE(vq->iov),
					 &out, &in,
					 NULL, NULL);
@@ -417,7 +429,8 @@ static void handle_tx(struct vhost_net *net)
			ubuf = nvq->ubuf_info + nvq->upend_idx;
 
			vq->heads[nvq->upend_idx].id = head;
-			vq->heads[nvq->upend_idx].len = VHOST_DMA_IN_PROGRESS;
+			vq->heads[nvq->upend_idx].len = urgent ?
+				VHOST_DMA_URGENT : VHOST_DMA_IN_PROGRESS;
			ubuf->callback = vhost_zerocopy_callback;
			ubuf->ctx = nvq->ubufs;
			ubuf->desc = nvq->upend_idx;
@@ -445,7 +458,7 @@ static void handle_tx(struct vhost_net *net)
			pr_debug("Truncated TX packet: len %d != %zd\n", err, len);
		if (!zcopy_used)
-			vhost_add_used_and_signal(net->dev, vq, head, 0);
+			vhost_add_used_and_signal(net->dev, vq, urgent, head, 0);
		else
			vhost_zerocopy_signal_used(net, vq);
		total_len += len;
@@ -488,6 +501,7 @@ static int peek_head_len(struct sock *sk)
  * returns number of buffer heads allocated
