Re: [RFC] virtio-net: help live migrate SR-IOV devices

2017-12-01 Thread achiad shochat
On 30 November 2017 at 05:29, Jason Wang  wrote:
>
>
> On 2017年11月29日 03:27, Jesse Brandeburg wrote:
>>
>> Hi, I'd like to get some feedback on a proposal to enhance virtio-net
>> that would ease configuration of a VM and enable live migration of
>> passthrough network SR-IOV devices.
>>
>> Today we have SR-IOV network devices (VFs) that can be passed into a VM
>> in order to enable high-performance networking directly within the VM.
>> The problem I am trying to address is that this configuration is
>> generally difficult to live-migrate.  There is documentation [1]
>> indicating that some OS/Hypervisor vendors will support live migration
>> of a system with a direct assigned networking device.  The problem I
>> see with these implementations is that the network configuration
>> requirements that are passed on to the owner of the VM are quite
>> complicated.  You have to set up bonding, you have to configure it to
>> enslave two interfaces, those interfaces (one is virtio-net, the other
>> is SR-IOV device/driver like ixgbevf) must support MAC address changes
>> requested in the VM, and on and on...
>>
>> So, on to the proposal:
>> Modify virtio-net driver to be a single VM network device that
>> enslaves an SR-IOV network device (inside the VM) with the same MAC
>> address. This would cause the virtio-net driver to appear and work like
>> a simplified bonding/team driver.  The live migration problem would be
>> solved just like today's bonding solution, but the VM user's networking
>> config would be greatly simplified.
>>
>> At its simplest, it would appear something like this in the VM.
>>
>> ==
>> = vnet0  =
>>   =
>> (virtio- =   |
>>   net)=   |
>>   =  ==
>>   =  = ixgbevf =
>> ==  ==
>>
>> (forgive the ASCII art)
>>
>> The fast path traffic would prefer the ixgbevf or other SR-IOV device
>> path, and fall back to virtio's transmit/receive when migrating.
>>
>> Compared to today's options this proposal would
>> 1) make virtio-net more sticky, allow fast path traffic at SR-IOV
>> speeds
>> 2) simplify end user configuration in the VM (most if not all of the
>> set up to enable migration would be done in the hypervisor)
>> 3) allow live migration via a simple link down and maybe a PCI
>> hot-unplug of the SR-IOV device, with failover to the virtio-net
>> driver core
>> 4) allow vendor agnostic hardware acceleration, and live migration
>> between vendors if the VM os has driver support for all the required
>> SR-IOV devices.
>>
>> Runtime operation proposed:
>> -  virtio-net driver loads, SR-IOV driver loads
>> - virtio-net finds other NICs that match its MAC address, both by
>>    examining existing interfaces and by setting up a new device notifier
>> - virtio-net enslaves the first NIC with the same MAC address
>> - virtio-net brings up the slave, and makes it the "preferred" path
>> - virtio-net follows the behavior of an active backup bond/team
>> - virtio-net acts as the interface to the VM
>> - live migration initiates
>> - link goes down on SR-IOV, or SR-IOV device is removed
>> - failover to virtio-net as primary path
>> - migration continues to new host
>> - new host is started with virtio-net as primary
>> - if no SR-IOV, virtio-net stays primary
>> - hypervisor can hot-add SR-IOV NIC, with same MAC addr as virtio
>> - virtio-net notices new NIC and starts over at enslave step above
>>
>> Future ideas (brainstorming):
>> - Optimize fast east-west traffic by adding special rules to direct
>>    east-west traffic through the virtio-net path
>>
>> Thanks for reading!
>> Jesse
>
>
> Cc netdev.
>
> Interesting, and this method is actually used by netvsc now:
>
> commit 0c195567a8f6e82ea5535cd9f1d54a1626dd233e
> Author: stephen hemminger 
> Date:   Tue Aug 1 19:58:53 2017 -0700
>
> netvsc: transparent VF management
>
> This patch implements transparent fail over from synthetic NIC to
> SR-IOV virtual function NIC in Hyper-V environment. It is a better
> alternative to using bonding as is done now. Instead, the receive and
> transmit fail over is done internally inside the driver.
>
> Using bonding driver has lots of issues because it depends on the
> script being run early enough in the boot process and with sufficient
> information to make the association. This patch moves all that
> functionality into the kernel.
>
> Signed-off-by: Stephen Hemminger 
> Signed-off-by: David S. Miller 
>
> If my understanding is correct there's no need for any extension of
> the virtio spec. If this is true, maybe you can start to prepare the patch?
>
> Thanks
>
>>
>> [1]
>>
>> https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/virtual_machine_management_guide/sect-migrating_virtual_machines_between_hosts
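
For illustration, the active-backup path selection described in the proposal above might be sketched in plain userspace C as follows. All names and types here are hypothetical stand-ins for illustration, not code from any real driver:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of the proposal's datapath choice: prefer the
 * enslaved SR-IOV VF while its link is up, fall back to virtio-net
 * otherwise (e.g. during live migration). */

struct net_path {
    const char *name;
    unsigned char mac[6];
    bool link_up;
};

/* A candidate is only enslaved if it carries the same MAC as virtio-net. */
static bool mac_matches(const struct net_path *a, const struct net_path *b)
{
    return memcmp(a->mac, b->mac, 6) == 0;
}

/* Pick the preferred transmit path, active-backup style. */
static const struct net_path *select_tx_path(const struct net_path *virtio,
                                             const struct net_path *vf)
{
    if (vf && mac_matches(virtio, vf) && vf->link_up)
        return vf;          /* fast path via the VF */
    return virtio;          /* migration-safe fallback */
}
```

When the hypervisor takes the VF link down (or hot-unplugs the device) before migration, the selection naturally falls back to the virtio path, which is the failover behavior the proposal relies on.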

[PATCH net,stable v3] vhost: fix a few skb leaks

2017-12-01 Thread wexu
From: Wei Xu 

Matthew found a roughly 40% tcp throughput regression with commit
c67df11f(vhost_net: try batch dequing from skb array) as discussed
in the following thread:
https://www.mail-archive.com/netdev@vger.kernel.org/msg187936.html

This is v3.

v3:
- move freeing skb from vhost to tun/tap recvmsg() to not
  confuse the callers.

v2:
- add Matthew as the reporter, thanks Matthew.
- move the zero-headcount check ahead instead of deferring skb
  consumption, per Jason and MST's comments.
- free the skb when recvmsg() fails.

Wei Xu (3):
  vhost: fix skb leak in handle_rx()
  tun: free skb in early errors
  tap: free skb if flags error

 drivers/net/tap.c   |  6 +-
 drivers/net/tun.c   | 14 +++---
 drivers/vhost/net.c | 20 ++--
 3 files changed, 26 insertions(+), 14 deletions(-)

-- 
1.8.3.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH] virtio-mmio: Use PTR_ERR_OR_ZERO()

2017-12-01 Thread Vasyl Gomonovych
Fix ptr_ret.cocci warnings:
drivers/virtio/virtio_mmio.c:653:1-3: WARNING: PTR_ERR_OR_ZERO can be used

Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR

Generated by: scripts/coccinelle/api/ptr_ret.cocci

Signed-off-by: Vasyl Gomonovych 
---
 drivers/virtio/virtio_mmio.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
index 74dc7170fd35..8cffcb1706e1 100644
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -650,10 +650,8 @@ static int vm_cmdline_set(const char *device,
pdev = platform_device_register_resndata(&vm_cmdline_parent,
"virtio-mmio", vm_cmdline_id++,
resources, ARRAY_SIZE(resources), NULL, 0);
-   if (IS_ERR(pdev))
-   return PTR_ERR(pdev);
 
-   return 0;
+   return PTR_ERR_OR_ZERO(pdev);
 }
 
 static int vm_cmdline_get_device(struct device *dev, void *data)
-- 
1.9.1
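
For readers unfamiliar with the helpers involved, a small userspace model (mimicking the kernel's `<linux/err.h>` conventions; the helper names and constants below are illustrative, not the kernel's actual implementation) shows why the two forms in the patch behave identically:

```c
#include <stdint.h>

/* The kernel encodes small negative errnos as pointer values in the
 * top 4095 bytes of the address space; IS_ERR() detects them and
 * PTR_ERR() decodes them back to a negative errno. */
#define MAX_ERRNO 4095

static inline int is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

static inline long ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* The pattern the patch removes... */
static long register_old(void *pdev)
{
    if (is_err(pdev))
        return ptr_err(pdev);
    return 0;
}

/* ...and its PTR_ERR_OR_ZERO()-style equivalent. */
static long register_new(void *pdev)
{
    return is_err(pdev) ? ptr_err(pdev) : 0;
}
```

The two functions return the same value for every input, which is why Coccinelle's ptr_ret.cocci flags the longer form as simplifiable.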



Re: [PATCH net,stable] vhost: fix skb leak in handle_rx()

2017-12-01 Thread Wei Xu
On Tue, Nov 28, 2017 at 07:53:33PM +0200, Michael S. Tsirkin wrote:
> On Tue, Nov 28, 2017 at 12:17:16PM -0500, w...@redhat.com wrote:
> > From: Wei Xu 
> > 
> > Matthew found a roughly 40% tcp throughput regression with commit
> > c67df11f(vhost_net: try batch dequing from skb array) as discussed
> > in the following thread:
> > https://www.mail-archive.com/netdev@vger.kernel.org/msg187936.html
> > 
> > Eventually we figured out that it was an skb leak in handle_rx()
> > when sending packets to the VM. This usually happens when the guest
> > cannot drain the vq as fast as vhost fills it; the resulting backlog
> > leaks skbs, since there is no headcount left on the vq to send them
> > from the vhost side.
> > 
> > This can be avoided by making sure we have enough headcount before
> > actually consuming an skb from the batched rx array, which this
> > patch does by deferring the consumption until slightly later.
> > 
> > Signed-off-by: Wei Xu 
> 
> Given the amount of effort Matthew has put into this,
> you definitely want
> 
> Reported-by: Matthew Rosato 
> 
> here.
> 

Absolutely we want that; sorry for missing you here, Matthew. You
really figured out this issue independently, all by yourself, with a
wide assortment of extremely quick tests (tooling, throughput, slub
leak, etc.), and I am happy to have the opportunity to do the
paperwork on your behalf. :) Thanks a lot, Matthew.

Wei

> Let's give credit where credit is due.
> 
> Thanks a lot Matthew!
> 
> > ---
> >  drivers/vhost/net.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > index 8d626d7..e76535e 100644
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -778,8 +778,6 @@ static void handle_rx(struct vhost_net *net)
> > /* On error, stop handling until the next kick. */
> > if (unlikely(headcount < 0))
> > goto out;
> > -   if (nvq->rx_array)
> > -   msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
> > /* On overrun, truncate and discard */
> > if (unlikely(headcount > UIO_MAXIOV)) {
> > iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
> > @@ -809,6 +807,8 @@ static void handle_rx(struct vhost_net *net)
> >  */
> > iov_iter_advance(&msg.msg_iter, vhost_hlen);
> > }
> > +   if (nvq->rx_array)
> > +   msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
> > err = sock->ops->recvmsg(sock, &msg,
> >  sock_len, MSG_DONTWAIT | MSG_TRUNC);
> > /* Userspace might have consumed the packet meanwhile:
> > -- 
> > 1.8.3.1
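
The ordering issue under discussion can be modelled in userspace C. The sketch below (illustrative names only, not the vhost API) shows how consuming an skb from the batched array *before* the zero-headcount bail-out loses that skb, while consuming it only after a headcount is guaranteed does not:

```c
#include <stddef.h>

struct skb { int freed; };

/* Pop one entry off a simple LIFO array, or NULL if empty. */
static struct skb *queue_consume(struct skb **q, size_t *len)
{
    return *len ? q[--(*len)] : NULL;
}

/* Returns the number of skbs taken off the queue but never delivered. */
static int handle_rx_model(struct skb **q, size_t *len, int headcount,
                           int consume_before_check)
{
    struct skb *skb = NULL;

    if (consume_before_check)
        skb = queue_consume(q, len);      /* buggy order */
    if (headcount == 0)
        return skb ? 1 : 0;               /* bail out: skb, if any, leaks */
    if (!skb)
        skb = queue_consume(q, len);      /* fixed order */
    /* ...recvmsg() would deliver the skb to the guest here... */
    return 0;
}
```

With no headcount available, the buggy ordering strands one skb per iteration; the fixed ordering leaves the queue untouched until delivery is possible.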


Re: [PATCH net,stable] vhost: fix skb leak in handle_rx()

2017-12-01 Thread Wei Xu
On Tue, Nov 28, 2017 at 07:50:58PM +0200, Michael S. Tsirkin wrote:
> On Tue, Nov 28, 2017 at 12:17:16PM -0500, w...@redhat.com wrote:
> > From: Wei Xu 
> > 
> > Matthew found a roughly 40% tcp throughput regression with commit
> > c67df11f(vhost_net: try batch dequing from skb array) as discussed
> > in the following thread:
> > https://www.mail-archive.com/netdev@vger.kernel.org/msg187936.html
> > 
> > Eventually we figured out that it was an skb leak in handle_rx()
> > when sending packets to the VM. This usually happens when the guest
> > cannot drain the vq as fast as vhost fills it; the resulting backlog
> > leaks skbs, since there is no headcount left on the vq to send them
> > from the vhost side.
> > 
> > This can be avoided by making sure we have enough headcount before
> > actually consuming an skb from the batched rx array, which this
> > patch does by deferring the consumption until slightly later.
> > 
> > Signed-off-by: Wei Xu 
> 
> Looks like a good way to fix it, but it will still leak if recvmsg
> returns a wrong length (I'm not sure this can happen in practice, but
> best to keep it simple).

Right, it is better to defend against this case; I will include it in v2.

> 
> Also, we need to add this before each recvmsg, including overrun,
> and discard on error.

OK.

Wei

> 
> > ---
> >  drivers/vhost/net.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > index 8d626d7..e76535e 100644
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -778,8 +778,6 @@ static void handle_rx(struct vhost_net *net)
> > /* On error, stop handling until the next kick. */
> > if (unlikely(headcount < 0))
> > goto out;
> > -   if (nvq->rx_array)
> > -   msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
> > /* On overrun, truncate and discard */
> > if (unlikely(headcount > UIO_MAXIOV)) {
> > iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
> > @@ -809,6 +807,8 @@ static void handle_rx(struct vhost_net *net)
> >  */
> > iov_iter_advance(&msg.msg_iter, vhost_hlen);
> > }
> > +   if (nvq->rx_array)
> > +   msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
> > err = sock->ops->recvmsg(sock, &msg,
> >  sock_len, MSG_DONTWAIT | MSG_TRUNC);
> > /* Userspace might have consumed the packet meanwhile:
> > -- 
> > 1.8.3.1


Re: [RFC] virtio-net: help live migrate SR-IOV devices

2017-12-01 Thread Shannon Nelson

On 11/30/2017 6:11 AM, Michael S. Tsirkin wrote:
> On Thu, Nov 30, 2017 at 10:08:45AM +0200, achiad shochat wrote:
>> Re. problem #2:
>> Indeed the best way to address it seems to be to enslave the VF driver
>> netdev under a persistent anchor netdev.
>> And it's indeed desired to allow (but not enforce) PV netdev and VF
>> netdev to work in conjunction.
>> And it's indeed desired that this enslavement logic work out of the box.
>> But in case of PV+VF some configurable policies must be in place (and
>> they'd better be generic rather than differ per PV technology).
>> For example - based on which characteristics should the PV+VF coupling
>> be done? netvsc uses MAC address, but that might not always be the
>> desire.
>
> It's a policy but not guest userspace policy.
>
> The hypervisor certainly knows.
>
> Are you concerned that someone might want to create two devices with the
> same MAC for an unrelated reason?  If so, hypervisor could easily set a
> flag in the virtio device to say "this is a backup, use MAC to find
> another device".


This is something I was going to suggest: a flag or other configuration 
on the virtio device to help control how this new feature is used.  I 
can imagine this might be useful to control from either the hypervisor 
side or the VM side.


The hypervisor might want to (1) disable it (force it off), (2) enable 
it for VM choice, or (3) force it on for the VM.  In case (2), the VM 
might be able to choose whether it wants to make use of the feature, or 
stick with the bonding solution.


Either way, the kernel is making a feature available, and the user (VM 
or hypervisor) is able to control it by selecting the feature based on 
the policy desired.


sln
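
As a rough sketch of the control knob being discussed, in plain C with an entirely hypothetical feature-bit name and policy enum (no such virtio feature bit existed at the time of this thread; the names are made up for illustration):

```c
#include <stdbool.h>

/* Hypothetical host-side policy for a "backup device" virtio feature:
 * the hypervisor can withhold the bit, offer it for the guest to
 * accept or reject, or insist on it. */
enum host_policy { POLICY_OFF, POLICY_GUEST_CHOICE, POLICY_FORCED };

/* Outcome of feature negotiation: the guest only sees what the host
 * offers, and a forced policy overrides the guest's preference. */
static bool backup_enabled(enum host_policy p, bool guest_wants_it)
{
    switch (p) {
    case POLICY_OFF:          return false;            /* bit not offered */
    case POLICY_GUEST_CHOICE: return guest_wants_it;   /* guest decides */
    case POLICY_FORCED:       return true;             /* host insists */
    }
    return false;
}
```

This mirrors the three cases Shannon describes: disabled, VM's choice, and forced on; in the middle case a guest could still fall back to the classic bonding setup by declining the feature.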


Re: [RFC] virtio-net: help live migrate SR-IOV devices

2017-12-01 Thread Jakub Kicinski
On Thu, 30 Nov 2017 15:54:40 +0200, Michael S. Tsirkin wrote:
> On Wed, Nov 29, 2017 at 07:51:38PM -0800, Jakub Kicinski wrote:
> > On Thu, 30 Nov 2017 11:29:56 +0800, Jason Wang wrote:  
> > > On 2017年11月29日 03:27, Jesse Brandeburg wrote:  
> > > > [quoted text elided]

[PATCH] drm/virtio: Don't return invalid caps on timeout

2017-12-01 Thread Tomeu Vizoso
If the wait times out, the caps are probably invalid and we shouldn't be
passing them to userspace.

Signed-off-by: Tomeu Vizoso 
---
 drivers/gpu/drm/virtio/virtgpu_ioctl.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/virtio/virtgpu_ioctl.c b/drivers/gpu/drm/virtio/virtgpu_ioctl.c
index b94bd5440e57..902120ad4a6d 100644
--- a/drivers/gpu/drm/virtio/virtgpu_ioctl.c
+++ b/drivers/gpu/drm/virtio/virtgpu_ioctl.c
@@ -517,6 +517,8 @@ static int virtio_gpu_get_caps_ioctl(struct drm_device *dev,
 
ret = wait_event_timeout(vgdev->resp_wq,
 atomic_read(&cache_ent->is_valid), 5 * HZ);
+   if (!ret)
+   return -EBUSY;
 
ptr = cache_ent->caps_cache;
 
-- 
2.14.3
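
A minimal userspace model of the contract relied on here (wait_event_timeout() returns 0 on timeout and nonzero once the condition holds; the function and constant names below are illustrative, not the DRM API):

```c
/* Numeric stand-in for the kernel's EBUSY. */
#define EBUSY_MODEL 16

/* Model of the fixed get_caps path: only hand the caps buffer to the
 * caller when the wait actually completed; on timeout (wait_ret == 0)
 * the buffer may never have been filled, so fail instead. */
static int get_caps_model(int wait_ret, const char *caps, const char **out)
{
    if (!wait_ret)            /* timed out: caps may be garbage */
        return -EBUSY_MODEL;
    *out = caps;              /* condition met: caps are valid */
    return 0;
}
```

Before the patch, the timeout case fell through and returned the possibly-uninitialized caps to userspace; checking `ret == 0` first is the whole fix.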



[PATCH] virtio_balloon: fix increment of vb->num_pfns in fill_balloon()

2017-12-01 Thread Jan Stancek
commit c7cdff0e8647 ("virtio_balloon: fix deadlock on OOM")
changed code to increment vb->num_pfns before call to
set_page_pfns(), which used to happen only after.

This patch fixes boot hang for me on ppc64le KVM guests.

Fixes: c7cdff0e8647 ("virtio_balloon: fix deadlock on OOM")
Cc: Michael S. Tsirkin 
Cc: Tetsuo Handa 
Cc: Michal Hocko 
Cc: Wei Wang 
Signed-off-by: Jan Stancek 
---
 drivers/virtio/virtio_balloon.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 7960746f7597..a1fb52cb3f0a 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -174,13 +174,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
while ((page = balloon_page_pop())) {
balloon_page_enqueue(&vb->vb_dev_info, page);
 
-   vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE;
-
set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
if (!virtio_has_feature(vb->vdev,
VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
adjust_managed_page_count(page, -1);
+   vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE;
}
 
num_allocated_pages = vb->num_pfns;
-- 
1.8.3.1
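
The effect fixed here can be shown with a small userspace model: `vb->pfns + vb->num_pfns` is the destination for the current page's PFNs, so `num_pfns` must be advanced only after the write. The names below loosely mirror the driver but are illustrative, not the real balloon code:

```c
#include <stddef.h>

/* Stand-in for VIRTIO_BALLOON_PAGES_PER_PAGE. */
#define PAGES_PER_PAGE 1

/* Fill pfns[] with one entry per page; increment_first reproduces the
 * pre-fix ordering, which skips slot 0 and writes one slot too far. */
static size_t fill_model(unsigned int *pfns, size_t npages,
                         int increment_first)
{
    size_t num_pfns = 0;

    for (size_t page = 0; page < npages; page++) {
        if (increment_first)
            num_pfns += PAGES_PER_PAGE;          /* buggy: skips slot 0... */
        pfns[num_pfns] = (unsigned int)page + 1; /* ...and overshoots */
        if (!increment_first)
            num_pfns += PAGES_PER_PAGE;          /* fixed order */
    }
    return num_pfns;
}
```

With the buggy ordering every entry lands one slot late, leaving the first slot stale, which is consistent with the boot hang the patch reports.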



[PATCH net,stable v4 0/3] vhost: fix a few skb leaks

2017-12-01 Thread wexu
From: Wei Xu 

Matthew found a roughly 40% tcp throughput regression with commit
c67df11f(vhost_net: try batch dequing from skb array) as discussed
in the following thread:
https://www.mail-archive.com/netdev@vger.kernel.org/msg187936.html

v4:
- fix zero iov iterator count in tun/tap_do_read() (Jason)
- don't put tun in case of EBADFD (Jason)
- Replace msg->msg_control with new 'skb' when calling tun/tap_do_read()

v3:
- move freeing skb from vhost to tun/tap recvmsg() to not
  confuse the callers.

v2:
- add Matthew as the reporter, thanks Matthew.
- move the zero-headcount check ahead instead of deferring skb
  consumption, per Jason and MST's comments.
- free the skb when recvmsg() fails.

Wei Xu (3):
  vhost: fix skb leak in handle_rx()
  tun: free skb in early errors
  tap: free skb if flags error

 drivers/net/tap.c   | 14 ++
 drivers/net/tun.c   | 24 ++--
 drivers/vhost/net.c | 20 ++--
 3 files changed, 38 insertions(+), 20 deletions(-)

-- 
1.8.3.1



[PATCH 1/3] vhost: fix skb leak in handle_rx()

2017-12-01 Thread wexu
From: Wei Xu 

Matthew found a roughly 40% tcp throughput regression with commit
c67df11f(vhost_net: try batch dequing from skb array) as discussed
in the following thread:
https://www.mail-archive.com/netdev@vger.kernel.org/msg187936.html

Eventually we figured out that it was an skb leak in handle_rx()
when sending packets to the VM. This usually happens when the guest
cannot drain the vq as fast as vhost fills it; the resulting backlog
leaks skbs, since there is no headcount left on the vq to send them
from the vhost side.

This can be avoided by making sure we have enough headcount before
actually consuming an skb from the batched rx array, which is done
by moving the zero-headcount check a bit ahead.

Signed-off-by: Wei Xu 
Reported-by: Matthew Rosato 
---
 drivers/vhost/net.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 8d626d7..c7bdeb6 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -778,16 +778,6 @@ static void handle_rx(struct vhost_net *net)
/* On error, stop handling until the next kick. */
if (unlikely(headcount < 0))
goto out;
-   if (nvq->rx_array)
-   msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
-   /* On overrun, truncate and discard */
-   if (unlikely(headcount > UIO_MAXIOV)) {
-   iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
-   err = sock->ops->recvmsg(sock, &msg,
-1, MSG_DONTWAIT | MSG_TRUNC);
-   pr_debug("Discarded rx packet: len %zd\n", sock_len);
-   continue;
-   }
/* OK, now we need to know about added descriptors. */
if (!headcount) {
if (unlikely(vhost_enable_notify(&net->dev, vq))) {
@@ -800,6 +790,16 @@ static void handle_rx(struct vhost_net *net)
 * they refilled. */
goto out;
}
+   if (nvq->rx_array)
+   msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
+   /* On overrun, truncate and discard */
+   if (unlikely(headcount > UIO_MAXIOV)) {
+   iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
+   err = sock->ops->recvmsg(sock, &msg,
+1, MSG_DONTWAIT | MSG_TRUNC);
+   pr_debug("Discarded rx packet: len %zd\n", sock_len);
+   continue;
+   }
/* We don't need to be notified again. */
iov_iter_init(&msg.msg_iter, READ, vq->iov, in, vhost_len);
fixup = msg.msg_iter;
-- 
1.8.3.1



Re: [PATCH net,stable v4 0/3] vhost: fix a few skb leaks

2017-12-01 Thread Matthew Rosato
On 12/01/2017 09:47 AM, Michael S. Tsirkin wrote:
> On Fri, Dec 01, 2017 at 05:10:35AM -0500, w...@redhat.com wrote:
>> From: Wei Xu 
>>
>> Matthew found a roughly 40% tcp throughput regression with commit
>> c67df11f(vhost_net: try batch dequing from skb array) as discussed
>> in the following thread:
>> https://www.mail-archive.com/netdev@vger.kernel.org/msg187936.html
> 
> Series
> 
> Acked-by: Michael S. Tsirkin 

Tested this series on a z13 (s390x) on top of net-next using 4GB/4vcpu
guests.  Verified that both the reported TCP throughput regression and
memory leak are resolved.

net-next:  19.83 Gb/s
net-next+: 35.02 Gb/s

Thanks all!

Matt

> 
> 
>> v4:
>> - fix zero iov iterator count in tun/tap_do_read() (Jason)
>> - don't put tun in case of EBADFD (Jason)
>> - Replace msg->msg_control with new 'skb' when calling tun/tap_do_read()
>>
>> v3:
>> - move freeing skb from vhost to tun/tap recvmsg() to not
>>   confuse the callers.
>>
>> v2:
>> - add Matthew as the reporter, thanks Matthew.
>> - move the zero-headcount check ahead instead of deferring skb
>>   consumption, per Jason and MST's comments.
>> - free the skb when recvmsg() fails.
>>
>> Wei Xu (3):
>>   vhost: fix skb leak in handle_rx()
>>   tun: free skb in early errors
>>   tap: free skb if flags error
>>
>>  drivers/net/tap.c   | 14 ++
>>  drivers/net/tun.c   | 24 ++--
>>  drivers/vhost/net.c | 20 ++--
>>  3 files changed, 38 insertions(+), 20 deletions(-)
>>
>> -- 
>> 1.8.3.1
> 



[PATCH 2/3] tun: free skb in early errors

2017-12-01 Thread wexu
From: Wei Xu 

tun_recvmsg() supports accepting an skb via msg_control after
commit ac77cfd4258f ("tun: support receiving skb through msg_control");
if an skb is passed in, it must be freed on every path, no matter how
far it gets, otherwise it is leaked.

This patch fixes several missed cases.

Signed-off-by: Wei Xu 
Reported-by: Matthew Rosato 

---
 drivers/net/tun.c | 24 ++--
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 9574900..4f4a842 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1952,8 +1952,11 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
 
tun_debug(KERN_INFO, tun, "tun_do_read\n");
 
-   if (!iov_iter_count(to))
+   if (!iov_iter_count(to)) {
+   if (skb)
+   kfree_skb(skb);
return 0;
+   }
 
if (!skb) {
/* Read frames from ring */
@@ -2069,22 +2072,24 @@ static int tun_recvmsg(struct socket *sock, struct msghdr *m, size_t total_len,
 {
struct tun_file *tfile = container_of(sock, struct tun_file, socket);
struct tun_struct *tun = tun_get(tfile);
+   struct sk_buff *skb = m->msg_control;
int ret;
 
-   if (!tun)
-   return -EBADFD;
+   if (!tun) {
+   ret = -EBADFD;
+   goto out_free_skb;
+   }
 
if (flags & ~(MSG_DONTWAIT|MSG_TRUNC|MSG_ERRQUEUE)) {
ret = -EINVAL;
-   goto out;
+   goto out_put_tun;
}
if (flags & MSG_ERRQUEUE) {
ret = sock_recv_errqueue(sock->sk, m, total_len,
 SOL_PACKET, TUN_TX_TIMESTAMP);
goto out;
}
-   ret = tun_do_read(tun, tfile, &m->msg_iter, flags & MSG_DONTWAIT,
- m->msg_control);
+   ret = tun_do_read(tun, tfile, &m->msg_iter, flags & MSG_DONTWAIT, skb);
if (ret > (ssize_t)total_len) {
m->msg_flags |= MSG_TRUNC;
ret = flags & MSG_TRUNC ? ret : total_len;
@@ -2092,6 +2097,13 @@ static int tun_recvmsg(struct socket *sock, struct msghdr *m, size_t total_len,
 out:
tun_put(tun);
return ret;
+
+out_put_tun:
+   tun_put(tun);
+out_free_skb:
+   if (skb)
+   kfree_skb(skb);
+   return ret;
 }
 
 static int tun_peek_len(struct socket *sock)
-- 
1.8.3.1





[PATCH 2/3] tun: free skb in early errors

2017-12-01 Thread wexu
From: Wei Xu 

tun_recvmsg() supports accepting an skb via msg_control after
commit ac77cfd4258f ("tun: support receiving skb through msg_control");
if an skb is passed in, it must be freed within the function, otherwise
it is leaked.

Signed-off-by: Wei Xu 
Reported-by: Matthew Rosato 
---
 drivers/net/tun.c | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 6a7bde9..5563430 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2067,14 +2067,17 @@ static int tun_recvmsg(struct socket *sock, struct msghdr *m, size_t total_len,
 {
struct tun_file *tfile = container_of(sock, struct tun_file, socket);
struct tun_struct *tun = tun_get(tfile);
+   struct sk_buff *skb = m->msg_control;
int ret;
 
-   if (!tun)
-   return -EBADFD;
+   if (!tun) {
+   ret = -EBADFD;
+   goto out_free_skb;
+   }
 
if (flags & ~(MSG_DONTWAIT|MSG_TRUNC|MSG_ERRQUEUE)) {
ret = -EINVAL;
-   goto out;
+   goto out_free_skb;
}
if (flags & MSG_ERRQUEUE) {
ret = sock_recv_errqueue(sock->sk, m, total_len,
@@ -2087,6 +2090,11 @@ static int tun_recvmsg(struct socket *sock, struct msghdr *m, size_t total_len,
m->msg_flags |= MSG_TRUNC;
ret = flags & MSG_TRUNC ? ret : total_len;
}
+   goto out;
+
+out_free_skb:
+   if (skb)
+   kfree_skb(skb);
 out:
tun_put(tun);
return ret;
-- 
1.8.3.1
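The error-path pattern in this patch — a recvmsg()-style callee takes ownership of an optional caller-supplied buffer and must free it on every early exit — can be sketched in plain userspace C. This is an illustrative model, not the tun API; `fake_recvmsg` and its arguments are invented for the sketch:

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Model of a recvmsg()-style callee that takes ownership of an optional
 * caller-supplied buffer (standing in for msg_control). On every early
 * error exit it must free that buffer itself, otherwise it leaks.
 */
static int fake_recvmsg(void *ctrl_buf, int flags, int *freed)
{
	int ret;

	if (flags < 0) {		/* modeling the -EBADFD early exit */
		ret = -1;
		goto out_free;
	}
	if (flags > 7) {		/* modeling the bad-flags -EINVAL exit */
		ret = -2;
		goto out_free;
	}
	free(ctrl_buf);			/* success: buffer consumed here */
	return 0;

out_free:
	if (ctrl_buf) {
		free(ctrl_buf);
		*freed = 1;
	}
	return ret;
}
```

The same shape appears in the patch: both early returns are converted to `goto out_free_skb` so that no exit path can leak the skb.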



[PATCH 3/3] tap: free skb if flags error

2017-12-01 Thread wexu
From: Wei Xu 

Since commit 3b4ba04acca8 ("tap: support receiving skb from msg_control"),
tap_recvmsg() supports accepting an skb via msg_control. If an skb is
present, it should be freed within the function, otherwise it is leaked.

Signed-off-by: Wei Xu 
Reported-by: Matthew Rosato 
---
 drivers/net/tap.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index e9489b8..1c66b75 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -1154,9 +1154,13 @@ static int tap_recvmsg(struct socket *sock, struct msghdr *m,
   size_t total_len, int flags)
 {
struct tap_queue *q = container_of(sock, struct tap_queue, sock);
+   struct sk_buff *skb = m->msg_control;
int ret;
-   if (flags & ~(MSG_DONTWAIT|MSG_TRUNC))
+   if (flags & ~(MSG_DONTWAIT|MSG_TRUNC)) {
+   if (skb)
+   kfree_skb(skb);
return -EINVAL;
+   }
ret = tap_do_read(q, &m->msg_iter, flags & MSG_DONTWAIT,
  m->msg_control);
if (ret > total_len) {
-- 
1.8.3.1



Re: [PATCH net,stable v2] vhost: fix skb leak in handle_rx()

2017-12-01 Thread Wei Xu
On Thu, Nov 30, 2017 at 10:46:17AM +0800, Jason Wang wrote:
> 
> 
> On 2017年11月29日 23:31, Michael S. Tsirkin wrote:
> > On Wed, Nov 29, 2017 at 09:23:24AM -0500, w...@redhat.com wrote:
> > > From: Wei Xu
> > > 
> > > Matthew found a roughly 40% tcp throughput regression with commit
> > > c67df11f(vhost_net: try batch dequing from skb array) as discussed
> > > in the following thread:
> > > https://www.mail-archive.com/netdev@vger.kernel.org/msg187936.html
> > > 
> > > Eventually we figured out that it was an skb leak in handle_rx()
> > > when sending packets to the VM. This usually happens when a guest
> > > cannot drain the vq as fast as vhost fills it; the resulting
> > > traffic jam leaks skb(s) because there is no headcount left to
> > > send them on the vq from the vhost side.
> > > 
> > > This can be avoided by making sure we have got enough headcount
> > > before actually consuming an skb from the batched rx array while
> > > transmitting, which is simply done by moving the zero headcount
> > > check a bit ahead.
> > > 
> > > Also close the small remaining leak window in case recvmsg()
> > > fails, by freeing the skb.
> > > 
> > > Signed-off-by: Wei Xu
> > > Reported-by: Matthew Rosato
> > > ---
> > >   drivers/vhost/net.c | 23 +--
> > >   1 file changed, 13 insertions(+), 10 deletions(-)
> > > 
> > > v2:
> > > - add Matthew as the reporter, thanks Matthew.
> > > - move the zero headcount check ahead instead of deferring skb
> > >   consumption, per Jason's and MST's comments.
> > > - add freeing of the skb in case recvmsg() fails.
> > > 
> > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > > index 8d626d7..e302e08 100644
> > > --- a/drivers/vhost/net.c
> > > +++ b/drivers/vhost/net.c
> > > @@ -778,16 +778,6 @@ static void handle_rx(struct vhost_net *net)
> > >   /* On error, stop handling until the next kick. */
> > >   if (unlikely(headcount < 0))
> > >   goto out;
> > > - if (nvq->rx_array)
> > > - msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
> > > - /* On overrun, truncate and discard */
> > > - if (unlikely(headcount > UIO_MAXIOV)) {
> > > - iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
> > > - err = sock->ops->recvmsg(sock, &msg,
> > > -  1, MSG_DONTWAIT | MSG_TRUNC);
> > > - pr_debug("Discarded rx packet: len %zd\n", sock_len);
> > > - continue;
> > > - }
> > >   /* OK, now we need to know about added descriptors. */
> > >   if (!headcount) {
> > >   if (unlikely(vhost_enable_notify(&net->dev, 
> > > vq))) {
> > > @@ -800,6 +790,18 @@ static void handle_rx(struct vhost_net *net)
> > >* they refilled. */
> > >   goto out;
> > >   }
> > > + if (nvq->rx_array)
> > > + msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
> > > + /* On overrun, truncate and discard */
> > > + if (unlikely(headcount > UIO_MAXIOV)) {
> > > + iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
> > > + err = sock->ops->recvmsg(sock, &msg,
> > > +  1, MSG_DONTWAIT | MSG_TRUNC);
> > > + if (unlikely(err != 1))
> > Why 1? How is receiving 1 byte special or even possible?
> > Also, I wouldn't put an unlikely here. It's all error handling code anyway.

Vhost is dropping the skb by invoking a 1-byte recvmsg() here, and it
is kind of weird to free the skb ourselves since it would have been
freed inside recvmsg() in most cases; the return value doesn't mean
much either.

> > 
> > > + kfree_skb((struct sk_buff *)msg.msg_control);
> > You do not need a cast here.

Yes, exactly, I missed it.

> > Also, is it really safe to refer to msg_control here?
> > I'd rather keep a copy of the skb pointer and use it than assume
> > caller did not change it. But also see below.

It should be safe since msg is a local variable here and the callee has
no chance to modify it. The exception is when rx_array is not used by
vhost, and then it becomes uncertain, but I don't know what that case
is. Isn't vhost using rx_array for all kinds of devices? Does that ring
a bell?

> > 
> > > + pr_debug("Discarded rx packet: len %zd\n", sock_len);
> > > + continue;
> > > + }
> > >   /* We don't need to be notified again. */
> > >   iov_iter_init(&msg.msg_iter, READ, vq->iov, in, 
> > > vhost_len);
> > >   fixup = msg.msg_iter;
> > > @@ -818,6 +820,7 @@ static void handle_rx(struct vhost_net *net)
> > >   pr_debug("Discarded rx packet: "
> > >" len %d, expected %zd\n", err, 
> > > sock_len);
> > >   

Re: [PATCH net,stable v2] vhost: fix skb leak in handle_rx()

2017-12-01 Thread Wei Xu
On Wed, Nov 29, 2017 at 10:43:33PM +0800, Jason Wang wrote:
> 
> 
> On 2017年11月29日 22:23, w...@redhat.com wrote:
> > From: Wei Xu 
> > 
> > Matthew found a roughly 40% tcp throughput regression with commit
> > c67df11f(vhost_net: try batch dequing from skb array) as discussed
> > in the following thread:
> > https://www.mail-archive.com/netdev@vger.kernel.org/msg187936.html
> > 
> > Eventually we figured out that it was an skb leak in handle_rx()
> > when sending packets to the VM. This usually happens when a guest
> > cannot drain the vq as fast as vhost fills it; the resulting
> > traffic jam leaks skb(s) because there is no headcount left to
> > send them on the vq from the vhost side.
> > 
> > This can be avoided by making sure we have got enough headcount
> > before actually consuming an skb from the batched rx array while
> > transmitting, which is simply done by moving the zero headcount
> > check a bit ahead.
> > 
> > Also close the small remaining leak window in case recvmsg()
> > fails, by freeing the skb.
> > 
> > Signed-off-by: Wei Xu 
> > Reported-by: Matthew Rosato 
> > ---
> >   drivers/vhost/net.c | 23 +--
> >   1 file changed, 13 insertions(+), 10 deletions(-)
> > 
> > v2:
> > - add Matthew as the reporter, thanks Matthew.
> > - move the zero headcount check ahead instead of deferring skb
> >   consumption, per Jason's and MST's comments.
> > - add freeing of the skb in case recvmsg() fails.
> > 
> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > index 8d626d7..e302e08 100644
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -778,16 +778,6 @@ static void handle_rx(struct vhost_net *net)
> > /* On error, stop handling until the next kick. */
> > if (unlikely(headcount < 0))
> > goto out;
> > -   if (nvq->rx_array)
> > -   msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
> > -   /* On overrun, truncate and discard */
> > -   if (unlikely(headcount > UIO_MAXIOV)) {
> > -   iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
> > -   err = sock->ops->recvmsg(sock, &msg,
> > -1, MSG_DONTWAIT | MSG_TRUNC);
> > -   pr_debug("Discarded rx packet: len %zd\n", sock_len);
> > -   continue;
> > -   }
> > /* OK, now we need to know about added descriptors. */
> > if (!headcount) {
> > if (unlikely(vhost_enable_notify(&net->dev, vq))) {
> > @@ -800,6 +790,18 @@ static void handle_rx(struct vhost_net *net)
> >  * they refilled. */
> > goto out;
> > }
> > +   if (nvq->rx_array)
> > +   msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
> > +   /* On overrun, truncate and discard */
> > +   if (unlikely(headcount > UIO_MAXIOV)) {
> > +   iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
> > +   err = sock->ops->recvmsg(sock, &msg,
> > +1, MSG_DONTWAIT | MSG_TRUNC);
> > +   if (unlikely(err != 1))
> > +   kfree_skb((struct sk_buff *)msg.msg_control);
> 
> I think we'd better fix this in tun/tap (better in another patch); otherwise
> it leads to an odd API: in some cases the skb is freed in recvmsg(), but the
> caller still needs to deal with the rest.

Right, it is better to handle it in recvmsg().

Wei

Re: [RFC] virtio-net: help live migrate SR-IOV devices

2017-12-01 Thread Jakub Kicinski
On Wed, 29 Nov 2017 20:10:09 -0800, Stephen Hemminger wrote:
> On Wed, 29 Nov 2017 19:51:38 -0800 Jakub Kicinski wrote:
> > On Thu, 30 Nov 2017 11:29:56 +0800, Jason Wang wrote:  
> > > On 2017年11月29日 03:27, Jesse Brandeburg wrote:
> > > commit 0c195567a8f6e82ea5535cd9f1d54a1626dd233e
> > > Author: stephen hemminger 
> > > Date:   Tue Aug 1 19:58:53 2017 -0700
> > > 
> > >      netvsc: transparent VF management
> > > 
> > >      This patch implements transparent fail over from synthetic NIC
> > > to SR-IOV virtual function NIC in Hyper-V environment. It is a
> > > better alternative to using bonding as is done now. Instead, the
> > > receive and transmit fail over is done internally inside the driver.
> > > 
> > >      Using bonding driver has lots of issues because it depends on
> > > the script being run early enough in the boot process and with
> > > sufficient information to make the association. This patch moves
> > > all that functionality into the kernel.
> > > 
> > >      Signed-off-by: Stephen Hemminger 
> > >      Signed-off-by: David S. Miller 
> > > 
> > > If my understanding is correct there's no need to for any extension
> > > of virtio spec. If this is true, maybe you can start to prepare the
> > > patch?
> > 
> > IMHO this is as close to policy in the kernel as one can get.  User
> > land has all the information it needs to instantiate that bond/team
> > automatically.  In fact I'm trying to discuss this with NetworkManager
> > folks and Red Hat right now:
> > 
> > https://mail.gnome.org/archives/networkmanager-list/2017-November/msg00038.html
> > 
> > Can we flip the argument and ask why is the kernel supposed to be
> > responsible for this?  It's not like we run DHCP out of the kernel
> > on new interfaces...   
> 
> Although "policy should not be in the kernel" is a a great mantra,
> it is not practical in the real world.
> 
> If you think it can be solved in userspace, then you haven't had to
> deal with four different network initialization
> systems, multiple orchestration systems and customers on ancient
> Enterprise distributions.

I would accept that argument if anyone ever tried to get those
Enterprise distros to handle this use case.  From conversations I 
had it seemed like no one ever did, and SR-IOV+virtio bonding is 
what has been done to solve this since day 1 of SR-IOV networking.

For practical reasons it's easier to push this into the kernel, 
because vendors rarely employ developers of the user space
orchestrations systems.  Is that not the real problem here,
potentially? :)
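Stripped of the policy debate, the mechanism both netvsc and the virtio-net proposal implement is plain active-backup path selection inside the driver: prefer the VF while it is present and its link is up, and fall back to the paravirtual path otherwise. A minimal userspace sketch of that policy (`pick_tx_path` and the state struct are illustrative names, not any driver's API):

```c
#include <assert.h>
#include <stdbool.h>

enum tx_path { PATH_VIRTIO, PATH_VF };

struct vnet_state {
	bool vf_present;	/* SR-IOV VF currently enslaved */
	bool vf_link_up;	/* and its link is up */
};

/*
 * Active-backup policy: take the fast path through the VF when it is
 * usable; otherwise fall back to the always-available paravirtual path
 * (e.g. while the VF is unplugged for live migration).
 */
static enum tx_path pick_tx_path(const struct vnet_state *s)
{
	if (s->vf_present && s->vf_link_up)
		return PATH_VF;
	return PATH_VIRTIO;
}
```

The entire live-migration flow reduces to flipping these two booleans: link down or hot-unplug moves traffic to virtio, and hot-add on the destination moves it back.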

[PATCH net,stable] vhost: fix skb leak in handle_rx()

2017-12-01 Thread wexu
From: Wei Xu 

Matthew found a roughly 40% tcp throughput regression with commit
c67df11f(vhost_net: try batch dequing from skb array) as discussed
in the following thread:
https://www.mail-archive.com/netdev@vger.kernel.org/msg187936.html

Eventually we figured out that it was an skb leak in handle_rx()
when sending packets to the VM. This usually happens when a guest
cannot drain the vq as fast as vhost fills it; the resulting
traffic jam leaks skb(s) because there is no headcount left to
send them on the vq from the vhost side.

This can be avoided by making sure we have got enough headcount
before actually consuming an skb from the batched rx array while
transmitting, which is simply done by deferring the consumption
a moment later in this patch.

Signed-off-by: Wei Xu 
---
 drivers/vhost/net.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 8d626d7..e76535e 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -778,8 +778,6 @@ static void handle_rx(struct vhost_net *net)
/* On error, stop handling until the next kick. */
if (unlikely(headcount < 0))
goto out;
-   if (nvq->rx_array)
-   msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
/* On overrun, truncate and discard */
if (unlikely(headcount > UIO_MAXIOV)) {
iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
@@ -809,6 +807,8 @@ static void handle_rx(struct vhost_net *net)
 */
iov_iter_advance(_iter, vhost_hlen);
}
+   if (nvq->rx_array)
+   msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
err = sock->ops->recvmsg(sock, &msg,
 sock_len, MSG_DONTWAIT | MSG_TRUNC);
/* Userspace might have consumed the packet meanwhile:
-- 
1.8.3.1



[PATCH net,stable v2] vhost: fix skb leak in handle_rx()

2017-12-01 Thread wexu
From: Wei Xu 

Matthew found a roughly 40% tcp throughput regression with commit
c67df11f(vhost_net: try batch dequing from skb array) as discussed
in the following thread:
https://www.mail-archive.com/netdev@vger.kernel.org/msg187936.html

Eventually we figured out that it was an skb leak in handle_rx()
when sending packets to the VM. This usually happens when a guest
cannot drain the vq as fast as vhost fills it; the resulting
traffic jam leaks skb(s) because there is no headcount left to
send them on the vq from the vhost side.

This can be avoided by making sure we have got enough headcount
before actually consuming an skb from the batched rx array while
transmitting, which is simply done by moving the zero headcount
check a bit ahead.

Also close the small remaining leak window in case recvmsg()
fails, by freeing the skb.

Signed-off-by: Wei Xu 
Reported-by: Matthew Rosato 
---
 drivers/vhost/net.c | 23 +--
 1 file changed, 13 insertions(+), 10 deletions(-)

v2:
- add Matthew as the reporter, thanks Matthew.
- move the zero headcount check ahead instead of deferring skb
  consumption, per Jason's and MST's comments.
- add freeing of the skb in case recvmsg() fails.

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 8d626d7..e302e08 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -778,16 +778,6 @@ static void handle_rx(struct vhost_net *net)
/* On error, stop handling until the next kick. */
if (unlikely(headcount < 0))
goto out;
-   if (nvq->rx_array)
-   msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
-   /* On overrun, truncate and discard */
-   if (unlikely(headcount > UIO_MAXIOV)) {
-   iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
-   err = sock->ops->recvmsg(sock, &msg,
-1, MSG_DONTWAIT | MSG_TRUNC);
-   pr_debug("Discarded rx packet: len %zd\n", sock_len);
-   continue;
-   }
/* OK, now we need to know about added descriptors. */
if (!headcount) {
if (unlikely(vhost_enable_notify(&net->dev, vq))) {
@@ -800,6 +790,18 @@ static void handle_rx(struct vhost_net *net)
 * they refilled. */
goto out;
}
+   if (nvq->rx_array)
+   msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
+   /* On overrun, truncate and discard */
+   if (unlikely(headcount > UIO_MAXIOV)) {
+   iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
+   err = sock->ops->recvmsg(sock, &msg,
+1, MSG_DONTWAIT | MSG_TRUNC);
+   if (unlikely(err != 1))
+   kfree_skb((struct sk_buff *)msg.msg_control);
+   pr_debug("Discarded rx packet: len %zd\n", sock_len);
+   continue;
+   }
/* We don't need to be notified again. */
iov_iter_init(&msg.msg_iter, READ, vq->iov, in, vhost_len);
fixup = msg.msg_iter;
@@ -818,6 +820,7 @@ static void handle_rx(struct vhost_net *net)
pr_debug("Discarded rx packet: "
 " len %d, expected %zd\n", err, sock_len);
vhost_discard_vq_desc(vq, headcount);
+   kfree_skb((struct sk_buff *)msg.msg_control);
continue;
}
/* Supply virtio_net_hdr if VHOST_NET_F_VIRTIO_NET_HDR */
-- 
1.8.3.1
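The ordering bug this patch fixes can be reproduced with a toy batched queue in userspace: if the consumer dequeues an item before checking that it has descriptors to place it in, the zero-capacity case silently drops — leaks — the item. A sketch under invented names (`batchq`, `slots`; this is a model, not the vhost API):

```c
#include <assert.h>

#define QLEN 8

struct batchq {
	int items[QLEN];
	int head, tail;	/* tail: next slot to fill, head: next to consume */
};

static int batch_avail(const struct batchq *q)
{
	return q->tail - q->head;
}

/*
 * Buggy order: consume first, check capacity later. When slots == 0 the
 * item has already been popped with nowhere to go. Returns items leaked.
 */
static int consume_then_check(struct batchq *q, int slots)
{
	int item = q->items[q->head++ % QLEN];	/* consumed unconditionally */

	(void)item;
	if (slots == 0)
		return 1;	/* nowhere to put it: leaked */
	return 0;
}

/* Fixed order (what the patch does): check capacity before consuming. */
static int check_then_consume(struct batchq *q, int slots)
{
	if (slots == 0)
		return 0;	/* leave the item queued for the next pass */
	(void)q->items[q->head++ % QLEN];
	return 0;
}
```

The fix is exactly this reordering: verify the headcount first, and only then consume from the batched rx array.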



Re: [RFC] virtio-net: help live migrate SR-IOV devices

2017-12-01 Thread Jakub Kicinski
On Thu, 30 Nov 2017 11:29:56 +0800, Jason Wang wrote:
> On 2017年11月29日 03:27, Jesse Brandeburg wrote:
> > Hi, I'd like to get some feedback on a proposal to enhance virtio-net
> > to ease configuration of a VM and that would enable live migration of
> > passthrough network SR-IOV devices.
> >
> > Today we have SR-IOV network devices (VFs) that can be passed into a VM
> > in order to enable high performance networking direct within the VM.
> > The problem I am trying to address is that this configuration is
> > generally difficult to live-migrate.  There is documentation [1]
> > indicating that some OS/Hypervisor vendors will support live migration
> > of a system with a direct assigned networking device.  The problem I
> > see with these implementations is that the network configuration
> > requirements that are passed on to the owner of the VM are quite
> > complicated.  You have to set up bonding, you have to configure it to
> > enslave two interfaces, those interfaces (one is virtio-net, the other
> > is SR-IOV device/driver like ixgbevf) must support MAC address changes
> > requested in the VM, and on and on...
> >
> > So, on to the proposal:
> > Modify virtio-net driver to be a single VM network device that
> > enslaves an SR-IOV network device (inside the VM) with the same MAC
> > address. This would cause the virtio-net driver to appear and work like
> > a simplified bonding/team driver.  The live migration problem would be
> > solved just like today's bonding solution, but the VM user's networking
> > config would be greatly simplified.
> >
> > At its simplest, it would appear something like this in the VM.
> >
> > ==
> > = vnet0  =
> >   =
> > (virtio- =   |
> >   net)=   |
> >   =  ==
> >   =  = ixgbef =
> > ==  ==
> >
> > (forgive the ASCII art)
> >
> > The fast path traffic would prefer the ixgbevf or other SR-IOV device
> > path, and fall back to virtio's transmit/receive when migrating.
> >
> > Compared to today's options this proposal would
> > 1) make virtio-net more sticky, allow fast path traffic at SR-IOV
> > speeds
> > 2) simplify end user configuration in the VM (most if not all of the
> > set up to enable migration would be done in the hypervisor)
> > 3) allow live migration via a simple link down and maybe a PCI
> > hot-unplug of the SR-IOV device, with failover to the virtio-net
> > driver core
> > 4) allow vendor agnostic hardware acceleration, and live migration
> > between vendors if the VM os has driver support for all the required
> > SR-IOV devices.
> >
> > Runtime operation proposed:
> > -  virtio-net driver loads, SR-IOV driver loads
> > - virtio-net finds other NICs that match its MAC address, both by
> >    examining existing interfaces and by setting up a new device notifier
> > - virtio-net enslaves the first NIC with the same MAC address
> > - virtio-net brings up the slave, and makes it the "preferred" path
> > - virtio-net follows the behavior of an active backup bond/team
> > - virtio-net acts as the interface to the VM
> > - live migration initiates
> > - link goes down on SR-IOV, or SR-IOV device is removed
> > - failover to virtio-net as primary path
> > - migration continues to new host
> > - new host is started with virio-net as primary
> > - if no SR-IOV, virtio-net stays primary
> > - hypervisor can hot-add SR-IOV NIC, with same MAC addr as virtio
> > - virtio-net notices new NIC and starts over at enslave step above
> >
> > Future ideas (brainstorming):
> > - Optimize Fast east-west by having special rules to direct east-west
> >traffic through virtio-net traffic path
> >
> > Thanks for reading!
> > Jesse  
> 
> Cc netdev.
> 
> Interesting, and this method is actually used by netvsc now:
> 
> commit 0c195567a8f6e82ea5535cd9f1d54a1626dd233e
> Author: stephen hemminger 
> Date:   Tue Aug 1 19:58:53 2017 -0700
> 
>      netvsc: transparent VF management
> 
>      This patch implements transparent fail over from synthetic NIC to
>      SR-IOV virtual function NIC in Hyper-V environment. It is a better
>      alternative to using bonding as is done now. Instead, the receive and
>      transmit fail over is done internally inside the driver.
> 
>      Using bonding driver has lots of issues because it depends on the
>      script being run early enough in the boot process and with sufficient
>      information to make the association. This patch moves all that
>      functionality into the kernel.
> 
>      Signed-off-by: Stephen Hemminger 
>      Signed-off-by: David S. Miller 
> 
> If my understanding is correct there's no need to for any extension of 
> virtio spec. If this is true, maybe you can start to prepare the patch?

IMHO this is as close to policy in the kernel as one can get.  User
land has all the information it needs to instantiate that bond/team
automatically.  In fact I'm 

[PATCH] virtio: release virtio index when fail to device_register

2017-12-01 Thread weiping zhang
Release the index when device_register() fails, so that it can be reused
by another virtio device.

Signed-off-by: weiping zhang 
---
 drivers/virtio/virtio.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index 48230a5..bf7ff39 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -333,6 +333,8 @@ int register_virtio_device(struct virtio_device *dev)
/* device_register() causes the bus infrastructure to look for a
 * matching driver. */
err = device_register(&dev->dev);
+   if (err)
+   ida_simple_remove(&virtio_index_ida, dev->index);
 out:
if (err)
virtio_add_status(dev, VIRTIO_CONFIG_S_FAILED);
-- 
2.9.4
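The rule behind this fix generalizes: an identifier taken from an allocator must be handed back on every failure path after the allocation, or it is lost to future users. A toy userspace model of that contract (the `*_toy` names and bitmap allocator are invented for illustration; this is not the kernel IDA API):

```c
#include <assert.h>

#define MAX_IDS 32

static unsigned long ida_bits;	/* bit n set => id n in use */

/* Hand out the lowest free id, or -1 when exhausted. */
static int ida_alloc_toy(void)
{
	for (int i = 0; i < MAX_IDS; i++) {
		if (!(ida_bits & (1UL << i))) {
			ida_bits |= 1UL << i;
			return i;
		}
	}
	return -1;
}

static void ida_free_toy(int id)
{
	ida_bits &= ~(1UL << id);
}

/*
 * Modeling register_virtio_device(): take an index, then try to
 * register. If registration fails, the index must be handed back,
 * exactly as the patch does with ida_simple_remove().
 */
static int register_toy(int fail_register)
{
	int id = ida_alloc_toy();

	if (id < 0)
		return -1;
	if (fail_register) {
		ida_free_toy(id);	/* the fix: release index on failure */
		return -2;
	}
	return id;
}
```

Without the `ida_free_toy()` call on the failure path, every failed registration would permanently burn an index.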



Re: [PATCH 00/13] remove_conflicting_framebuffers() cleanup

2017-12-01 Thread Thierry Reding
On Mon, Nov 27, 2017 at 11:30:44AM +0100, Daniel Vetter wrote:
> On Fri, Nov 24, 2017 at 06:53:25PM +0100, Michał Mirosław wrote:
> > This series cleans up duplicated code for replacing firmware FB
> > driver with proper DRI driver and adds handover support to
> > Tegra driver.
> > 
> > The last patch is here because it uses new semantics of
> > remove_conflicting_framebuffers() from this series. This
> > can be considered independently, though.
> 
> Except for that patches I've commented on:
> 
> Acked-by: Daniel Vetter 
> 
> Since this is for tegra and Thierry has drm-misc commit rights, it's
> probably simplest when Thierry pushes this all to drm-misc once driver
> maintainers had a chance to look at it.

Will do.

Thierry



[PATCH tip/core/rcu 21/21] drivers/vhost: Remove now-redundant read_barrier_depends()

2017-12-01 Thread Paul E. McKenney
Because READ_ONCE() now implies read_barrier_depends(), the
read_barrier_depends() in next_desc() is now redundant.  This commit
therefore removes it and the related comments.

Signed-off-by: Paul E. McKenney 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: 
Cc: 
Cc: 
---
 drivers/vhost/vhost.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 33ac2b186b85..78b5940a415a 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1877,12 +1877,7 @@ static unsigned next_desc(struct vhost_virtqueue *vq, struct vring_desc *desc)
return -1U;
 
/* Check they're not leading us off end of descriptors. */
-   next = vhost16_to_cpu(vq, desc->next);
-   /* Make sure compiler knows to grab that: we don't want it changing! */
-   /* We will use the result as an index in an array, so most
-* architectures only need a compiler barrier here. */
-   read_barrier_depends();
-
+   next = vhost16_to_cpu(vq, READ_ONCE(desc->next));
return next;
 }
 
-- 
2.5.2
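For reference, the guarantee being relied on here — a single, untorn load that the compiler cannot cache or refetch — can be approximated in userspace with a volatile cast. This is a sketch of the idea for word-sized scalars only; the kernel's real READ_ONCE() additionally handles other access sizes and, historically, the Alpha dependency-ordering case this series removes:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace approximation of READ_ONCE()/WRITE_ONCE() for scalar types:
 * the volatile access forces exactly one load/store, so the result is
 * safe to use e.g. as an array index without the compiler re-reading it.
 */
#define READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

static uint16_t next_idx;	/* stands in for desc->next */

static uint16_t load_next(void)
{
	/* one load; the caller can use the value without it changing */
	return READ_ONCE(next_idx);
}
```

Because the patched `next_desc()` performs its one load through READ_ONCE(), the separate `read_barrier_depends()` and its "make sure compiler knows to grab that" comment become redundant.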



[PATCH] virtio_mmio: add cleanup for virtio_mmio_probe

2017-12-01 Thread weiping zhang
Clean up all resources allocated by virtio_mmio_probe() on its error paths.

Signed-off-by: weiping zhang 
---
 drivers/virtio/virtio_mmio.c | 34 ++
 1 file changed, 26 insertions(+), 8 deletions(-)

diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
index 74dc717..3fd0e66 100644
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -513,8 +513,10 @@ static int virtio_mmio_probe(struct platform_device *pdev)
return -EBUSY;
 
vm_dev = devm_kzalloc(&pdev->dev, sizeof(*vm_dev), GFP_KERNEL);
-   if (!vm_dev)
-   return  -ENOMEM;
+   if (!vm_dev) {
+   rc =  -ENOMEM;
+   goto free_mem;
+   }
 
vm_dev->vdev.dev.parent = &pdev->dev;
vm_dev->vdev.dev.release = virtio_mmio_release_dev_empty;
@@ -524,14 +526,17 @@ static int virtio_mmio_probe(struct platform_device *pdev)
spin_lock_init(&vm_dev->lock);
 
vm_dev->base = devm_ioremap(&pdev->dev, mem->start, resource_size(mem));
-   if (vm_dev->base == NULL)
-   return -EFAULT;
+   if (vm_dev->base == NULL) {
+   rc = -EFAULT;
+   goto free_vmdev;
+   }
 
/* Check magic value */
magic = readl(vm_dev->base + VIRTIO_MMIO_MAGIC_VALUE);
if (magic != ('v' | 'i' << 8 | 'r' << 16 | 't' << 24)) {
dev_warn(&pdev->dev, "Wrong magic value 0x%08lx!\n", magic);
-   return -ENODEV;
+   rc = -ENODEV;
+   goto unmap;
}
 
/* Check device version */
@@ -539,7 +544,8 @@ static int virtio_mmio_probe(struct platform_device *pdev)
if (vm_dev->version < 1 || vm_dev->version > 2) {
dev_err(&pdev->dev, "Version %ld not supported!\n",
vm_dev->version);
-   return -ENXIO;
+   rc = -ENXIO;
+   goto unmap;
}
 
vm_dev->vdev.id.device = readl(vm_dev->base + VIRTIO_MMIO_DEVICE_ID);
@@ -548,7 +554,8 @@ static int virtio_mmio_probe(struct platform_device *pdev)
 * virtio-mmio device with an ID 0 is a (dummy) placeholder
 * with no function. End probing now with no error reported.
 */
-   return -ENODEV;
+   rc = -ENODEV;
+   goto unmap;
}
vm_dev->vdev.id.vendor = readl(vm_dev->base + VIRTIO_MMIO_VENDOR_ID);
 
@@ -573,7 +580,18 @@ static int virtio_mmio_probe(struct platform_device *pdev)
 
platform_set_drvdata(pdev, vm_dev);
 
-   return register_virtio_device(&vm_dev->vdev);
+   rc = register_virtio_device(&vm_dev->vdev);
+   if (rc)
+   goto unmap;
+   return 0;
+unmap:
+   iounmap(vm_dev->base);
+free_mem:
+   devm_release_mem_region(&pdev->dev, mem->start,
+   resource_size(mem));
+free_vmdev:
+   devm_kfree(>dev, vm_dev);
+   return rc;
 }
 
 static int virtio_mmio_remove(struct platform_device *pdev)
-- 
2.9.4
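The cleanup structure added here is the standard probe-style unwind: acquire resources in order, and on failure jump to a label that releases, in reverse order, exactly what was already acquired. An abstract userspace sketch of the fall-through-label pattern (the steps and names are illustrative, not the virtio-mmio API):

```c
#include <assert.h>

/*
 * Three acquisition steps; rel[i] records that step i+1 was rolled
 * back. A failure at step N falls through the labels and releases
 * steps N-1 .. 1 in reverse order, and nothing more.
 */
static int probe_toy(int fail_at, int rel[3])
{
	rel[0] = rel[1] = rel[2] = 0;

	if (fail_at == 1)	/* e.g. allocation fails */
		goto err;
	if (fail_at == 2)	/* e.g. ioremap fails */
		goto err_undo1;
	if (fail_at == 3)	/* e.g. registration fails */
		goto err_undo2;
	return 0;		/* success: nothing to unwind */

err_undo2:
	rel[1] = 1;		/* undo step 2 */
err_undo1:
	rel[0] = 1;		/* undo step 1 */
err:
	return -1;
}
```

The label order is the point of the pattern: because each label falls through to the next, a later failure automatically releases every earlier resource.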



Re: [PATCH] virtio: release virtio index when fail to device_register

2017-12-01 Thread weiping zhang
2017-12-02 0:55 GMT+08:00 weiping zhang :
> On Wed, Nov 29, 2017 at 10:50:44AM +0100, Cornelia Huck wrote:
>> On Wed, 29 Nov 2017 09:23:01 +0800
>> weiping zhang  wrote:
>>
>> > index can be reused by other virtio device.
>> >
>> > Signed-off-by: weiping zhang 
>>
>> Reviewed-by: Cornelia Huck 
>>
>> > ---
>> >  drivers/virtio/virtio.c | 2 ++
>> >  1 file changed, 2 insertions(+)
>> >
>> > diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
>> > index 48230a5..bf7ff39 100644
>> > --- a/drivers/virtio/virtio.c
>> > +++ b/drivers/virtio/virtio.c
>> > @@ -333,6 +333,8 @@ int register_virtio_device(struct virtio_device *dev)
>> > /* device_register() causes the bus infrastructure to look for a
>> >  * matching driver. */
>> > err = device_register(&dev->dev);
>> > +   if (err)
>> > +   ida_simple_remove(&virtio_index_ida, dev->index);
>> >  out:
>> > if (err)
>> > virtio_add_status(dev, VIRTIO_CONFIG_S_FAILED);
>>
>> I think your patch is correct (giving up the index if we failed to
>> register), but this made me look at our error handling if a virtio
>> device failed to register.
>>
>> We hold an extra reference to the struct device, even after a failed
>> register, and it is the responsibility of the caller to give up that
>> reference once no longer needed. As callers to register_virtio_device()
>> embed the struct virtio_device, it needs to be their responsibility.
>> Looking at the existing callers,
>>
>> - ccw does a put_device
>> - pci, mmio and remoteproc do nothing, causing a leak
>> - vop does a free on the embedding structure, which is a big no-no
>>
>> Thoughts?
> Sorry to reply late and thanks for your review.
> Do you mean the "extra reference to the struct device" caused by the
> following code?
>
> err = device_register(&dev->dev);
> device_add(dev)
> get_device(dev)
> If I understand right, I think there is no extra reference if we fail
> register_virtio_device(), because when device_register() fails we don't
> get a reference.

I went through these callers; virtio_mmio_probe() should do more cleanup.
I'll send a patch to fix that.

> --
> Thanks
> weiping


Re: [PATCH v18 05/10] xbitmap: add more operations

2017-12-01 Thread Matthew Wilcox
On Fri, Dec 01, 2017 at 03:09:08PM +, Wang, Wei W wrote:
> On Friday, December 1, 2017 9:02 PM, Tetsuo Handa wrote:
> > If start == end is legal,
> > 
> >for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
> > 
> > makes this loop do nothing because 10 < 10 is false.
> 
> How about "start <= end "?

Don't ask Tetsuo for his opinion, write some userspace code that uses it.
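In that spirit, a userspace model of the loop makes the boundary behavior concrete. With an exclusive end (`start < end`), a call where `start == end` covers nothing; each iteration handles the rest of the current bitmap-sized chunk, then `start = (start | (IDA_BITMAP_BITS - 1)) + 1` jumps to the first bit of the next chunk. The constant 64 below is an illustrative stand-in for IDA_BITMAP_BITS:

```c
#include <assert.h>

#define BITMAP_BITS 64	/* stand-in for IDA_BITMAP_BITS */

/*
 * Count the per-chunk iterations the loop performs over [start, end).
 * Each pass covers the remainder of the chunk containing 'start' and
 * then advances to the first bit of the next chunk.
 */
static int chunks_visited(unsigned long start, unsigned long end)
{
	int n = 0;

	for (; start < end; start = (start | (BITMAP_BITS - 1)) + 1)
		n++;
	return n;
}
```

Whether `start == end` should mean "empty range" (as `start < end` gives) or "one bit" (which `start <= end` would give) is precisely the API decision being debated.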



Re: [PATCH v18 10/10] virtio-balloon: don't report free pages when page poisoning is enabled

2017-12-01 Thread Michael S. Tsirkin
On Wed, Nov 29, 2017 at 09:55:26PM +0800, Wei Wang wrote:
> The guest free pages should not be discarded by the live migration thread
> when page poisoning is enabled with PAGE_POISONING_NO_SANITY=n, because
> skipping the transfer of such poisoned free pages will trigger false
> positives when new pages are allocated and checked on the destination.
> This patch skips the reporting of free pages in the above case.
> 
> Reported-by: Michael S. Tsirkin 
> Signed-off-by: Wei Wang 
> Cc: Michal Hocko 
> ---
>  drivers/virtio/virtio_balloon.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 035bd3a..6ac4cff 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -652,7 +652,9 @@ static void report_free_page(struct work_struct *work)
>   /* Start by sending the obtained cmd id to the host with an outbuf */
>   send_one_desc(vb, vb->free_page_vq, virt_to_phys(&vb->start_cmd_id),
> sizeof(uint32_t), false, true, false);
> - walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
> + if (!(page_poisoning_enabled() &&
> + !IS_ENABLED(CONFIG_PAGE_POISONING_NO_SANITY)))
> + walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
>   /*
>* End by sending the stop id to the host with an outbuf. Use the
>* non-batching mode here to trigger a kick after adding the stop id.

PAGE_POISONING_ZERO is actually OK.

But I really would prefer it that we still send pages to host,
otherwise debugging becomes much harder.

And it does not have to be completely useless, even though
you can not discard them as they would be zero-filled then.

How about a config field telling host what should be there in the free
pages? This way even though host can not discard them, host can send
them out without reading them, still a win.



> -- 
> 2.7.4


Re: [PATCH v18 07/10] virtio-balloon: VIRTIO_BALLOON_F_SG

2017-12-01 Thread Michael S. Tsirkin
On Wed, Nov 29, 2017 at 09:55:23PM +0800, Wei Wang wrote:
> Add a new feature, VIRTIO_BALLOON_F_SG, which enables the transfer of
> balloon (i.e. inflated/deflated) pages using scatter-gather lists to the
> host. A scatter-gather list is described by a vring desc.
> 
> The implementation of the previous virtio-balloon is not very efficient,
> because the balloon pages are transferred to the host one array at a
> time. Here is the percentage breakdown of time spent on each step of
> the balloon inflating process (inflating 7GB of an 8GB idle guest).
> 
> 1) allocating pages (6.5%)
> 2) sending PFNs to host (68.3%)
> 3) address translation (6.1%)
> 4) madvise (19%)
> 
> It takes about 4126ms for the inflating process to complete. The above
> profiling shows that the bottlenecks are stage 2) and stage 4).
> 
> This patch optimizes step 2) by transferring pages to the host in sgs. An sg
> describes a chunk of physically contiguous guest pages. With this
> mechanism, step 4) can also be optimized by doing address translation and
> madvise() in chunks rather than page by page.
> 
> With this new feature, the above ballooning process takes ~440ms resulting
> in an improvement of ~89%.
> 
> TODO: optimize stage 1) by allocating/freeing a chunk of pages instead of
> a single page each time.
> 
> Signed-off-by: Wei Wang 
> Signed-off-by: Liang Li 
> Suggested-by: Michael S. Tsirkin 
> Cc: Tetsuo Handa 
> ---
>  drivers/virtio/virtio_balloon.c | 230 +---
>  include/uapi/linux/virtio_balloon.h |   1 +
>  2 files changed, 212 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 7960746..2c21c5a 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -32,6 +32,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  
>  /*
>   * Balloon device works in 4K page units.  So each page is pointed to by
> @@ -79,6 +81,9 @@ struct virtio_balloon {
>   /* Synchronize access/update to this struct virtio_balloon elements */
>   struct mutex balloon_lock;
>  
> + /* The xbitmap used to record balloon pages */
> + struct xb page_xb;
> +
>   /* The array of pfns we tell the Host about. */
>   unsigned int num_pfns;
>   __virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
> @@ -141,15 +146,121 @@ static void set_page_pfns(struct virtio_balloon *vb,
> page_to_balloon_pfn(page) + i);
>  }
>  
> +static void kick_and_wait(struct virtqueue *vq, wait_queue_head_t wq_head)
> +{
> + unsigned int len;
> +
> + virtqueue_kick(vq);
> + wait_event(wq_head, virtqueue_get_buf(vq, &len));
> +}
> +
> +static void send_one_desc(struct virtio_balloon *vb,
> +   struct virtqueue *vq,
> +   uint64_t addr,
> +   uint32_t len,
> +   bool inbuf,
> +   bool batch)
> +{
> + int err;
> + unsigned int size;
> +
> + /* Detach all the used buffers from the vq */
> + while (virtqueue_get_buf(vq, &size))
> + ;
> +
> + err = virtqueue_add_one_desc(vq, addr, len, inbuf, vq);
> + /*
> +  * This is expected to never fail: there is always at least 1 entry
> +  * available on the vq, because when the vq is full the worker thread
> +  * that adds the desc will be put into sleep until at least 1 entry is
> +  * available to use.
> +  */
> + BUG_ON(err);
> +
> + /* If batching is requested, we batch till the vq is full */
> + if (!batch || !vq->num_free)
> + kick_and_wait(vq, vb->acked);
> +}
> +

This internal kick complicates callers. I suggest that instead,
you move this to callers, just return a "kick required" boolean.
This way callers do not need to play with num_free at all.

> +/*
> + * Send balloon pages in sgs to host. The balloon pages are recorded in the
> + * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
> + * The page xbitmap is searched for contiguous "1" bits, which correspond
> + * to contiguous pages, to chunk into sgs.
> + *
> + * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that
> + * need to be searched.
> + */
> +static void tell_host_sgs(struct virtio_balloon *vb,
> +   struct virtqueue *vq,
> +   unsigned long page_xb_start,
> +   unsigned long page_xb_end)
> +{
> + unsigned long pfn_start, pfn_end;
> + uint64_t addr;
> + uint32_t len, max_len = round_down(UINT_MAX, PAGE_SIZE);
> +
> + pfn_start = page_xb_start;
> + while (pfn_start < page_xb_end) {
> + pfn_start = xb_find_next_set_bit(&vb->page_xb, pfn_start,
> +  page_xb_end);
> + if (pfn_start == 

RE: [PATCH v18 05/10] xbitmap: add more operations

2017-12-01 Thread Wang, Wei W
On Friday, December 1, 2017 9:02 PM, Tetsuo Handa wrote:
> Wei Wang wrote:
> > On 11/30/2017 06:34 PM, Tetsuo Handa wrote:
> > > Wei Wang wrote:
> > >> + * @start: the start of the bit range, inclusive
> > >> + * @end: the end of the bit range, inclusive
> > >> + *
> > >> + * This function is used to clear a bit in the xbitmap. If all the
> > >> +bits of the
> > >> + * bitmap are 0, the bitmap will be freed.
> > >> + */
> > >> +void xb_clear_bit_range(struct xb *xb, unsigned long start,
> > >> +unsigned long end) {
> > >> +struct radix_tree_root *root = &xb->xbrt;
> > >> +struct radix_tree_node *node;
> > >> +void **slot;
> > >> +struct ida_bitmap *bitmap;
> > >> +unsigned int nbits;
> > >> +
> > >> +for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
> > >> +unsigned long index = start / IDA_BITMAP_BITS;
> > >> +unsigned long bit = start % IDA_BITMAP_BITS;
> > >> +
> > >> +bitmap = __radix_tree_lookup(root, index, &node, &slot);
> > >> +if (radix_tree_exception(bitmap)) {
> > >> +unsigned long ebit = bit + 2;
> > >> +unsigned long tmp = (unsigned long)bitmap;
> > >> +
> > >> +nbits = min(end - start + 1, BITS_PER_LONG - ebit);
> > > "nbits = min(end - start + 1," seems to expect that start == end is
> > > legal for clearing only 1 bit. But this function is no-op if start == end.
> > > Please clarify what "inclusive" intended.
> >
> > If xb_clear_bit_range(xb,10,10), then it is effectively the same as
> > xb_clear_bit(10). Why would it be illegal?
> >
> > "@start inclusive" means that the @start will also be included to be
> > cleared.
> 
> If start == end is legal,
> 
>for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
> 
> makes this loop do nothing because 10 < 10 is false.


How about "start <= end "?

Best,
Wei





Re: [PATCH net,stable v4 0/3] vhost: fix a few skb leaks

2017-12-01 Thread Michael S. Tsirkin
On Fri, Dec 01, 2017 at 09:54:02AM -0500, Matthew Rosato wrote:
> On 12/01/2017 09:47 AM, Michael S. Tsirkin wrote:
> > On Fri, Dec 01, 2017 at 05:10:35AM -0500, w...@redhat.com wrote:
> >> From: Wei Xu 
> >>
> >> Matthew found a roughly 40% tcp throughput regression with commit
> >> c67df11f(vhost_net: try batch dequing from skb array) as discussed
> >> in the following thread:
> >> https://www.mail-archive.com/netdev@vger.kernel.org/msg187936.html
> > 
> > Series
> > 
> > Acked-by: Michael S. Tsirkin 
> 
> Tested this series on a z13 (s390x) on top of net-next using 4GB/4vcpu
> guests.  Verified that both the reported TCP throughput regression and
> memory leak are resolved.
> 
> net-next:  19.83 Gb/s
> net-next+: 35.02 Gb/s
> 
> Thanks all!
> 
> Matt

So we can also add
Tested-by: Matthew Rosato 

Dave, pls take this through the net tree as most changes are
in tun/tap.

Thanks!

> > 
> > 
> >> v4:
> >> - fix zero iov iterator count in tap/tap_do_read()(Jason)
> >> - don't put tun in case of EBADFD(Jason)
> >> - Replace msg->msg_control with new 'skb' when calling tun/tap_do_read()
> >>
> >> v3:
> >> - move freeing skb from vhost to tun/tap recvmsg() to not
> >>   confuse the callers.
> >>
> >> v2:
> >> - add Matthew as the reporter, thanks matthew.
> >> - moving zero headcount check ahead instead of defer consuming skb
> >>   due to jason and mst's comment.
> >> - add freeing skb in favor of recvmsg() fails.
> >>
> >> Wei Xu (3):
> >>   vhost: fix skb leak in handle_rx()
> >>   tun: free skb in early errors
> >>   tap: free skb if flags error
> >>
> >>  drivers/net/tap.c   | 14 ++
> >>  drivers/net/tun.c   | 24 ++--
> >>  drivers/vhost/net.c | 20 ++--
> >>  3 files changed, 38 insertions(+), 20 deletions(-)
> >>
> >> -- 
> >> 1.8.3.1
> > 


Re: [PATCH 3/3] tap: free skb if flags error

2017-12-01 Thread Michael S. Tsirkin
On Fri, Dec 01, 2017 at 05:10:38AM -0500, w...@redhat.com wrote:
> From: Wei Xu 
> 
> tap_recvmsg() supports accepting skb by msg_control after
> commit 3b4ba04acca8 ("tap: support receiving skb from msg_control"),
> the skb, if present, should be freed within the function; otherwise
> it would be leaked.
> 
> Signed-off-by: Wei Xu 
> Reported-by: Matthew Rosato 


Acked-by: Michael S. Tsirkin 

> ---
>  drivers/net/tap.c | 14 ++
>  1 file changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
> index e9489b8..0a886fda 100644
> --- a/drivers/net/tap.c
> +++ b/drivers/net/tap.c
> @@ -829,8 +829,11 @@ static ssize_t tap_do_read(struct tap_queue *q,
>   DEFINE_WAIT(wait);
>   ssize_t ret = 0;
>  
> - if (!iov_iter_count(to))
> + if (!iov_iter_count(to)) {
> + if (skb)
> + kfree_skb(skb);
>   return 0;
> + }
>  
>   if (skb)
>   goto put;
> @@ -1154,11 +1157,14 @@ static int tap_recvmsg(struct socket *sock, struct msghdr *m,
>  size_t total_len, int flags)
>  {
>   struct tap_queue *q = container_of(sock, struct tap_queue, sock);
> + struct sk_buff *skb = m->msg_control;
>   int ret;
> - if (flags & ~(MSG_DONTWAIT|MSG_TRUNC))
> + if (flags & ~(MSG_DONTWAIT|MSG_TRUNC)) {
> + if (skb)
> + kfree_skb(skb);
>   return -EINVAL;
> - ret = tap_do_read(q, &m->msg_iter, flags & MSG_DONTWAIT,
> -   m->msg_control);
> + }
> + ret = tap_do_read(q, &m->msg_iter, flags & MSG_DONTWAIT, skb);
>   if (ret > total_len) {
>   m->msg_flags |= MSG_TRUNC;
>   ret = flags & MSG_TRUNC ? ret : total_len;
> -- 
> 1.8.3.1


Re: [PATCH net,stable v4 0/3] vhost: fix a few skb leaks

2017-12-01 Thread Michael S. Tsirkin
On Fri, Dec 01, 2017 at 05:10:35AM -0500, w...@redhat.com wrote:
> From: Wei Xu 
> 
> Matthew found a roughly 40% tcp throughput regression with commit
> c67df11f(vhost_net: try batch dequing from skb array) as discussed
> in the following thread:
> https://www.mail-archive.com/netdev@vger.kernel.org/msg187936.html

Series

Acked-by: Michael S. Tsirkin 


> v4:
> - fix zero iov iterator count in tap/tap_do_read()(Jason)
> - don't put tun in case of EBADFD(Jason)
> - Replace msg->msg_control with new 'skb' when calling tun/tap_do_read()
> 
> v3:
> - move freeing skb from vhost to tun/tap recvmsg() to not
>   confuse the callers.
> 
> v2:
> - add Matthew as the reporter, thanks matthew.
> - moving zero headcount check ahead instead of defer consuming skb
>   due to jason and mst's comment.
> - add freeing skb in favor of recvmsg() fails.
> 
> Wei Xu (3):
>   vhost: fix skb leak in handle_rx()
>   tun: free skb in early errors
>   tap: free skb if flags error
> 
>  drivers/net/tap.c   | 14 ++
>  drivers/net/tun.c   | 24 ++--
>  drivers/vhost/net.c | 20 ++--
>  3 files changed, 38 insertions(+), 20 deletions(-)
> 
> -- 
> 1.8.3.1


Re: [PATCH 2/3] tun: free skb in early errors

2017-12-01 Thread Michael S. Tsirkin
On Fri, Dec 01, 2017 at 05:10:37AM -0500, w...@redhat.com wrote:
> From: Wei Xu 
> 
> tun_recvmsg() supports accepting skb by msg_control after
> commit ac77cfd4258f ("tun: support receiving skb through msg_control"),
> the skb, if present, should be freed no matter how far it gets;
> otherwise it would be leaked.
> 
> This patch fixes several missed cases.
> 
> Signed-off-by: Wei Xu 
> Reported-by: Matthew Rosato 


Acked-by: Michael S. Tsirkin 

> ---
>  drivers/net/tun.c | 24 ++--
>  1 file changed, 18 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 9574900..4f4a842 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -1952,8 +1952,11 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
>  
>   tun_debug(KERN_INFO, tun, "tun_do_read\n");
>  
> - if (!iov_iter_count(to))
> + if (!iov_iter_count(to)) {
> + if (skb)
> + kfree_skb(skb);
>   return 0;
> + }
>  
>   if (!skb) {
>   /* Read frames from ring */
> @@ -2069,22 +2072,24 @@ static int tun_recvmsg(struct socket *sock, struct msghdr *m, size_t total_len,
>  {
>   struct tun_file *tfile = container_of(sock, struct tun_file, socket);
>   struct tun_struct *tun = tun_get(tfile);
> + struct sk_buff *skb = m->msg_control;
>   int ret;
>  
> - if (!tun)
> - return -EBADFD;
> + if (!tun) {
> + ret = -EBADFD;
> + goto out_free_skb;
> + }
>  
>   if (flags & ~(MSG_DONTWAIT|MSG_TRUNC|MSG_ERRQUEUE)) {
>   ret = -EINVAL;
> - goto out;
> + goto out_put_tun;
>   }
>   if (flags & MSG_ERRQUEUE) {
>   ret = sock_recv_errqueue(sock->sk, m, total_len,
>SOL_PACKET, TUN_TX_TIMESTAMP);
>   goto out;
>   }
> - ret = tun_do_read(tun, tfile, &m->msg_iter, flags & MSG_DONTWAIT,
> -   m->msg_control);
> + ret = tun_do_read(tun, tfile, &m->msg_iter, flags & MSG_DONTWAIT, skb);
>   if (ret > (ssize_t)total_len) {
>   m->msg_flags |= MSG_TRUNC;
>   ret = flags & MSG_TRUNC ? ret : total_len;
> @@ -2092,6 +2097,13 @@ static int tun_recvmsg(struct socket *sock, struct msghdr *m, size_t total_len,
>  out:
>   tun_put(tun);
>   return ret;
> +
> +out_put_tun:
> + tun_put(tun);
> +out_free_skb:
> + if (skb)
> + kfree_skb(skb);
> + return ret;
>  }
>  
>  static int tun_peek_len(struct socket *sock)
> -- 
> 1.8.3.1


Re: [PATCH 1/3] vhost: fix skb leak in handle_rx()

2017-12-01 Thread Michael S. Tsirkin
On Fri, Dec 01, 2017 at 05:10:36AM -0500, w...@redhat.com wrote:
> From: Wei Xu 
> 
> Matthew found a roughly 40% tcp throughput regression with commit
> c67df11f(vhost_net: try batch dequing from skb array) as discussed
> in the following thread:
> https://www.mail-archive.com/netdev@vger.kernel.org/msg187936.html
> 
> Eventually we figured out that it was an skb leak in handle_rx()
> when sending packets to the VM. This usually happens when the guest
> cannot drain the vq as fast as vhost fills it; the resulting traffic
> jam leaks skb(s) because there is no headcount left to send on the
> vq from the vhost side.
> 
> This can be avoided by making sure we have enough headcount before
> actually consuming an skb from the batched rx array while
> transmitting, which is simply done by moving the zero-headcount
> check a bit earlier.
> 
> Signed-off-by: Wei Xu 
> Reported-by: Matthew Rosato 

Acked-by: Michael S. Tsirkin 

> ---
>  drivers/vhost/net.c | 20 ++--
>  1 file changed, 10 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 8d626d7..c7bdeb6 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -778,16 +778,6 @@ static void handle_rx(struct vhost_net *net)
>   /* On error, stop handling until the next kick. */
>   if (unlikely(headcount < 0))
>   goto out;
> - if (nvq->rx_array)
> - msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
> - /* On overrun, truncate and discard */
> - if (unlikely(headcount > UIO_MAXIOV)) {
> - iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
> - err = sock->ops->recvmsg(sock, &msg,
> -  1, MSG_DONTWAIT | MSG_TRUNC);
> - pr_debug("Discarded rx packet: len %zd\n", sock_len);
> - continue;
> - }
>   /* OK, now we need to know about added descriptors. */
>   if (!headcount) {
> - if (unlikely(vhost_enable_notify(&net->dev, vq))) {
> @@ -800,6 +790,16 @@ static void handle_rx(struct vhost_net *net)
>* they refilled. */
>   goto out;
>   }
> + if (nvq->rx_array)
> + msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
> + /* On overrun, truncate and discard */
> + if (unlikely(headcount > UIO_MAXIOV)) {
> + iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
> + err = sock->ops->recvmsg(sock, &msg,
> +  1, MSG_DONTWAIT | MSG_TRUNC);
> + pr_debug("Discarded rx packet: len %zd\n", sock_len);
> + continue;
> + }
>   /* We don't need to be notified again. */
>   iov_iter_init(&msg.msg_iter, READ, vq->iov, in, vhost_len);
>   fixup = msg.msg_iter;
> -- 
> 1.8.3.1


Re: Memory corruption in powerpc guests with virtio_balloon (was Re: [PATCH v3] virtio_balloon: fix deadlock on OOM)

2017-12-01 Thread Michael S. Tsirkin
On Fri, Dec 01, 2017 at 11:31:08PM +1100, Michael Ellerman wrote:
> "Michael S. Tsirkin"  writes:
> 
> > fill_balloon doing memory allocations under balloon_lock
> > can cause a deadlock when leak_balloon is called from
> > virtballoon_oom_notify and tries to take same lock.
> >
> > To fix, split page allocation and enqueue and do allocations outside the 
> > lock.
> >
> > Here's a detailed analysis of the deadlock by Tetsuo Handa:
> >
> > In leak_balloon(), mutex_lock(&vb->balloon_lock) is called in order to
> > serialize against fill_balloon(). But in fill_balloon(),
> > alloc_page(GFP_HIGHUSER[_MOVABLE] | __GFP_NOMEMALLOC | __GFP_NORETRY) is
> > called with vb->balloon_lock mutex held. Since GFP_HIGHUSER[_MOVABLE]
> > implies __GFP_DIRECT_RECLAIM | __GFP_IO | __GFP_FS, despite __GFP_NORETRY
> > is specified, this allocation attempt might indirectly depend on somebody
> > else's __GFP_DIRECT_RECLAIM memory allocation. And such indirect
> > __GFP_DIRECT_RECLAIM memory allocation might call leak_balloon() via
> > virtballoon_oom_notify() via blocking_notifier_call_chain() callback via
> > out_of_memory() when it reached __alloc_pages_may_oom() and held oom_lock
> > mutex. Since vb->balloon_lock mutex is already held by fill_balloon(), it
> > will cause OOM lockup.
> >
> >   Thread1                                  Thread2
> >   fill_balloon()
> >     takes a balloon_lock
> >     balloon_page_enqueue()
> >       alloc_page(GFP_HIGHUSER_MOVABLE)
> >         direct reclaim (__GFP_FS context)  takes a fs lock
> >           waits for that fs lock           alloc_page(GFP_NOFS)
> >                                              __alloc_pages_may_oom()
> >                                                takes the oom_lock
> >                                                out_of_memory()
> >                                                  blocking_notifier_call_chain()
> >                                                    leak_balloon()
> >                                                      tries to take that
> >                                                        balloon_lock and deadlocks
> >
> > Reported-by: Tetsuo Handa 
> > Cc: Michal Hocko 
> > Cc: Wei Wang 
> > Signed-off-by: Michael S. Tsirkin 
> >
> >  drivers/virtio/virtio_balloon.c| 24 +++-
> >  include/linux/balloon_compaction.h | 35 ++-
> >  mm/balloon_compaction.c| 28 +---
> >  3 files changed, 74 insertions(+), 13 deletions(-)
> 
> 
> Somehow this commit seems to be killing powerpc guests.
> 
> The symptom is that the first page (64K) of the guests memory gets over
> written with zeroes, which is where our interrupt handlers are, so the
> system rapidly locks up due to illegal instructions in the illegal
> instruction handler.
> 
> There seems to be some element of a race, because it doesn't always
> crash. Sometimes I can boot to a shell, but not often. When it does
> happen it's fairly late in boot, but before I get to a shell.
> 
> I had a few bisects go off into the weeds due to the intermittent
> nature. But once I realised that I changed my script to boot 5 times
> before declaring a kernel good, and that bisected straight here.
> 
> I can also revert this commit on v4.15-rc1 and everything's fine again -
> I got through ~250 boots with that kernel.
> 
> So I'm pretty sure this commit is triggering/exposing/causing the bug.
> 
> The other data point is that the page that's overwritten is mapped read
> only in the guest kernel. So either the guest kernel is writing to it in
> real mode (MMU off), or the hypervisor/DMA is doing it.
> 
> I haven't isolated if it's host kernel/qemu version dependent, at the
> moment I'm just using distro packaged versions of both.
> 
> Anyway I'll try and dig further into it on Monday, but I thought I'd let
> you know in case this is a known bug with a fix in the pipeline, or
> rings any bells or whatever.
> 
> cheers

Thanks for the report!
A fix was just posted:
virtio_balloon: fix increment of vb->num_pfns in fill_balloon()

Would appreciate testing.

> 
> > diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> > index f0b3a0b..7960746 100644
> > --- a/drivers/virtio/virtio_balloon.c
> > +++ b/drivers/virtio/virtio_balloon.c
> > @@ -143,16 +143,17 @@ static void set_page_pfns(struct virtio_balloon *vb,
> >  
> >  static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
> >  {
> > -   struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
> > unsigned num_allocated_pages;
> > +   unsigned num_pfns;
> > +   struct page *page;
> > +   LIST_HEAD(pages);
> >  
> > /* We can only do one array worth at a time. */
> > num = min(num, ARRAY_SIZE(vb->pfns));
> >  
> > -   mutex_lock(&vb->balloon_lock);
> > -   for (vb->num_pfns = 0; 

Re: [PATCH 02/13] fbdev: add remove_conflicting_pci_framebuffers()

2017-12-01 Thread Emil Velikov
On 30 November 2017 at 23:49, Sudip Mukherjee wrote:
> Hi Daniel,
>
> On Wed, Nov 29, 2017 at 10:56:34AM +0100, Daniel Vetter wrote:
>> On Tue, Nov 28, 2017 at 12:30:30PM +, Sudip Mukherjee wrote:
>> > On Tue, Nov 28, 2017 at 12:32:38PM +0100, Greg KH wrote:
>> > > On Tue, Nov 28, 2017 at 11:22:17AM +0100, Daniel Vetter wrote:
>> > > > On Mon, Nov 27, 2017 at 08:52:19PM +, Sudip Mukherjee wrote:
>> > > > > On Mon, Nov 27, 2017 at 11:27:59AM +0100, Daniel Vetter wrote:
>> > > > > > On Fri, Nov 24, 2017 at 06:53:31PM +0100, Michał Mirosław wrote:
>> > > > > > > Almost all drivers using remove_conflicting_framebuffers() wrap it with
>> > > > > > > the same code. Extract common part from PCI drivers into separate
>> > > > > > > remove_conflicting_pci_framebuffers().
>> > > > > > >
>
> 
>
>> > > >
>> > > > Greg?
>> > >
>> > > Yes, if no one is working to get it out of staging, that means no one
>> > > cares about it, and it needs to be removed from the tree.
>> >
>> > Agreed. I was not working on getting it out of staging as there is no
>> > place for it to go.
>> > But, Teddy Wang told me that they have a working drm driver for it, but
>> > it is not atomic like Daniel was asking for. If it is ok, then I can send
>> > in a patch to remove the existing sm750 from staging and add the new sm750
>> > drm driver to staging. And after it is ready, we can go ahead with moving
>> > it out of staging to drm.
>>
>> Please keep the todo item that it needs to be converted to atomic. And
>> tbh, it's probably faster if you just submit it to dri-devel, assuming you
>> have time to work on it. For small drivers we tend to be fairly quick in
>> getting them into good enough shape.
>
> I have received the driver from Teddy and pushed it to
> https://github.com/sudipm-mukherjee/parport/tree/drm_smi for your first
> look into it. It is not even building with next-20171130 and has lots of
> build warnings. I will have to do a lot of work on it before I can even
> submit it to dri-devel.
>
A crazy idea, mostly towards Teddy and Sudip:

Start small and build gradually. An example split for separate patch series:

 - one HW, basic setup + atomic KMS
 - add second HW
 - more KMS features
 - fancy memory management
 - 2D/3D/other acceleration

The driver as seen above tries to do all of the above (almost, it's not atomic)
at once - 40k loc.

Someone familiar with the code can quickly split it up and while doing
so, feed it through checkpatch.
Current code is _very_ far from kernel coding style, plus the
copyright blurb is very disturbing:

* All rights are reserved. Reproduction or in part is prohibited
* without the written consent of the copyright owner.

HTH
Emil

Re: [PATCH 1/3] vhost: fix skb leak in handle_rx()

2017-12-01 Thread Michael S. Tsirkin
On Fri, Dec 01, 2017 at 03:11:05PM +0800, Jason Wang wrote:
> 
> 
On December 1, 2017 13:54, w...@redhat.com wrote:
> > From: Wei Xu 
> > 
> > Matthew found a roughly 40% tcp throughput regression with commit
> > c67df11f(vhost_net: try batch dequing from skb array) as discussed
> > in the following thread:
> > https://www.mail-archive.com/netdev@vger.kernel.org/msg187936.html
> > 
> > Eventually we figured out that it was an skb leak in handle_rx()
> > when sending packets to the VM. This usually happens when the guest
> > cannot drain the vq as fast as vhost fills it; the resulting traffic
> > jam leaks skb(s) because there is no headcount left to send on the
> > vq from the vhost side.
> > 
> > This can be avoided by making sure we have enough headcount before
> > actually consuming an skb from the batched rx array while
> > transmitting, which is simply done by moving the zero-headcount
> > check a bit earlier.
> > 
> > Signed-off-by: Wei Xu 
> > Reported-by: Matthew Rosato 
> > ---
> >   drivers/vhost/net.c | 20 ++--
> >   1 file changed, 10 insertions(+), 10 deletions(-)
> > 
> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > index 8d626d7..c7bdeb6 100644
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -778,16 +778,6 @@ static void handle_rx(struct vhost_net *net)
> > /* On error, stop handling until the next kick. */
> > if (unlikely(headcount < 0))
> > goto out;
> > -   if (nvq->rx_array)
> > -   msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
> > -   /* On overrun, truncate and discard */
> > -   if (unlikely(headcount > UIO_MAXIOV)) {
> > -   iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
> > -   err = sock->ops->recvmsg(sock, &msg,
> > -1, MSG_DONTWAIT | MSG_TRUNC);
> > -   pr_debug("Discarded rx packet: len %zd\n", sock_len);
> > -   continue;
> > -   }
> > /* OK, now we need to know about added descriptors. */
> > if (!headcount) {
> > if (unlikely(vhost_enable_notify(&net->dev, vq))) {
> > @@ -800,6 +790,16 @@ static void handle_rx(struct vhost_net *net)
> >  * they refilled. */
> > goto out;
> > }
> > +   if (nvq->rx_array)
> > +   msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
> > +   /* On overrun, truncate and discard */
> > +   if (unlikely(headcount > UIO_MAXIOV)) {
> > +   iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
> > +   err = sock->ops->recvmsg(sock, &msg,
> > +1, MSG_DONTWAIT | MSG_TRUNC);
> > +   pr_debug("Discarded rx packet: len %zd\n", sock_len);
> > +   continue;
> > +   }
> > /* We don't need to be notified again. */
> > iov_iter_init(&msg.msg_iter, READ, vq->iov, in, vhost_len);
> > fixup = msg.msg_iter;
> 
> I suggest reordering this patch to 3/3.
> 
> Thanks

Why? This doesn't cause any new leaks, does it?

-- 
MST

Re: [PATCH 2/3] tun: free skb in early errors

2017-12-01 Thread Michael S. Tsirkin
On Fri, Dec 01, 2017 at 03:07:44PM +0800, Jason Wang wrote:
> 
> 
On December 1, 2017 13:54, w...@redhat.com wrote:
> > From: Wei Xu 
> > 
> > tun_recvmsg() supports accepting skb by msg_control after
> > commit ac77cfd4258f ("tun: support receiving skb through msg_control"),
> > the skb, if present, should be freed within the function; otherwise it
> > would be leaked.
> > 
> > Signed-off-by: Wei Xu 
> > Reported-by: Matthew Rosato 
> > ---
> >   drivers/net/tun.c | 14 +++---
> >   1 file changed, 11 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> > index 6a7bde9..5563430 100644
> > --- a/drivers/net/tun.c
> > +++ b/drivers/net/tun.c
> > @@ -2067,14 +2067,17 @@ static int tun_recvmsg(struct socket *sock, struct msghdr *m, size_t total_len,
> >   {
> > struct tun_file *tfile = container_of(sock, struct tun_file, socket);
> > struct tun_struct *tun = tun_get(tfile);
> > +   struct sk_buff *skb = m->msg_control;
> > int ret;
> > -   if (!tun)
> > -   return -EBADFD;
> > +   if (!tun) {
> > +   ret = -EBADFD;
> > +   goto out_free_skb;
> 
> Unfortunately, you can't go there since tun is NULL.

Right, this should just be kfree_skb(skb); return -EBADFD;

> 
> > +   }
> > if (flags & ~(MSG_DONTWAIT|MSG_TRUNC|MSG_ERRQUEUE)) {
> > ret = -EINVAL;
> > -   goto out;
> > +   goto out_free_skb;
> > }
> > if (flags & MSG_ERRQUEUE) {
> > ret = sock_recv_errqueue(sock->sk, m, total_len,
> > @@ -2087,6 +2090,11 @@ static int tun_recvmsg(struct socket *sock, struct msghdr *m, size_t total_len,
> > m->msg_flags |= MSG_TRUNC;
> > ret = flags & MSG_TRUNC ? ret : total_len;
> > }
> > +   goto out;
> 
> We usually don't use goto in the case of success, and you need to deal with
> the case skb != NULL but iov_iter_count(to) == 0 in tun_do_read().
> 
> Thanks

I agree, the way to lay this out is:


	tun_put(tun);
	return ret;

err:
	tun_put(tun);
err_tun:
	if (skb)
		kfree_skb(skb);
	return ret;




> > +
> > +out_free_skb:
> > +   if (skb)
> > +   kfree_skb(skb);
> >   out:
> > tun_put(tun);
> > return ret;

Re: [PATCH v18 05/10] xbitmap: add more operations

2017-12-01 Thread Matthew Wilcox
On Fri, Dec 01, 2017 at 10:02:01PM +0900, Tetsuo Handa wrote:
> If start == end is legal,
> 
>for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
> 
> makes this loop do nothing because 10 < 10 is false.

... and this is why we add tests to the test-suite!


Re: [PATCH 02/13] fbdev: add remove_conflicting_pci_framebuffers()

2017-12-01 Thread Sudip Mukherjee
On Fri, Dec 01, 2017 at 08:19:16AM +0100, Daniel Vetter wrote:
> On Thu, Nov 30, 2017 at 11:49:53PM +, Sudip Mukherjee wrote:
> > Hi Daniel,
> > 
> > 
> > 
> > > > > > 
> > > > > > Greg?
> > > > > 
> > > > > Yes, if no one is working to get it out of staging, that means no one
> > > > > cares about it, and it needs to be removed from the tree.
> > > > 
> > > > Agreed. I was not working on getting it out of staging as there is no
> > > > place for it to go.
> > > > But, Teddy Wang told me that they have a working drm driver for it, but
> > > > it is not atomic like Daniel was asking for. If it is ok, then I can 
> > > > send
> > > > in a patch to remove the existing sm750 from staging and add the new 
> > > > sm750
> > > > drm driver to staging. And after it is ready, we can go ahead with 
> > > > moving
> > > > it out of staging to drm.
> > > 
> > > Please keep the todo item that it needs to be converted to atomic. And
> > > tbh, it's probably faster if you just submit it to dri-devel, assuming you
> > > have time to work on it. For small drivers we tend to be fairly quick in
> > > getting them into good enough shape.
> > 
> > I have received the driver from Teddy and pushed it to
> > https://github.com/sudipm-mukherjee/parport/tree/drm_smi for your first
> > look into it. It is not even building with next-20171130 and has lots of
> > build warnings. I will have to do a lot of work on it before I can even
> > submit it to dri-devel.
> > 
> > Time will be the problem, as this is not part of my day job.
> > 
> > > 
> > > Staging is also a major pain for drm subsystem refactorings, I really,
> > > really, really prefer we don't add more than the vbox pain we have
> > > already.
> > 
> > I am hoping that after seeing it, you will agree to have it in staging. :)
> 
> So I know Greg is willing to take anything into staging, but I'm no fan.
> We refactor and improve drm a lot, with a lot of cross-driver changes
> necessary to move things forward. We can do that since we have a generally
> rather active development community, and we try hard to keep most drivers
> in reasonable good shape and so easy to maintain.
> 
> If you now throw a pile of unmaintainable stuff into staging, but without
> working on it, then that's just cost, no benefit to the dri-devel
> community. On top, staging tree is separate from drm trees, so more pain
> to synchronize trees (and we have to do that a few times per release cycle
> or drivers simply stop compiling). Where's the upside of taking this
> driver into staging?
> 
> One is users, but in my experience for SoC display drivers usually everything
> else is in worse shape (e.g. even great drivers like the one for Qualcomm
> must be tested on some vendor tree because critical core bits are missing in
> upstream), so "more users" is not the good reason. And it's clearly not
> "more developers", because no time to clean things up. So what exactly is
> the good reason to put this into staging instead of just waiting until
> someone has the time to clean it up quickly?

Ok, I will not try to give any more reasons now. :)

I will cleanup the basic things I can and then submit it to dri-devel.

Greg - Please keep the SM750 driver in staging for now. I will send
a patch later to add in TODO the git location where the drm driver is
being developed. And when that drm driver is ready, I will send a patch
to remove the sm750fb driver from staging.

--
Regards
Sudip


Memory corruption in powerpc guests with virtio_balloon (was Re: [PATCH v3] virtio_balloon: fix deadlock on OOM)

2017-12-01 Thread Michael Ellerman
"Michael S. Tsirkin"  writes:

> fill_balloon doing memory allocations under balloon_lock
> can cause a deadlock when leak_balloon is called from
> virtballoon_oom_notify and tries to take same lock.
>
> To fix, split page allocation and enqueue and do allocations outside the lock.
>
> Here's a detailed analysis of the deadlock by Tetsuo Handa:
>
> In leak_balloon(), mutex_lock(&vb->balloon_lock) is called in order to
> serialize against fill_balloon(). But in fill_balloon(),
> alloc_page(GFP_HIGHUSER[_MOVABLE] | __GFP_NOMEMALLOC | __GFP_NORETRY) is
> called with vb->balloon_lock mutex held. Since GFP_HIGHUSER[_MOVABLE]
> implies __GFP_DIRECT_RECLAIM | __GFP_IO | __GFP_FS, despite __GFP_NORETRY
> is specified, this allocation attempt might indirectly depend on somebody
> else's __GFP_DIRECT_RECLAIM memory allocation. And such indirect
> __GFP_DIRECT_RECLAIM memory allocation might call leak_balloon() via
> virtballoon_oom_notify() via blocking_notifier_call_chain() callback via
> out_of_memory() when it reached __alloc_pages_may_oom() and held oom_lock
> mutex. Since vb->balloon_lock mutex is already held by fill_balloon(), it
> will cause OOM lockup.
>
>   Thread1                          Thread2
>   fill_balloon()
>     takes a balloon_lock
>     balloon_page_enqueue()
>       alloc_page(GFP_HIGHUSER_MOVABLE)
>         direct reclaim (__GFP_FS context)  takes a fs lock
>           waits for that fs lock           alloc_page(GFP_NOFS)
>                                              __alloc_pages_may_oom()
>                                                takes the oom_lock
>                                                out_of_memory()
>                                                  blocking_notifier_call_chain()
>                                                    leak_balloon()
>                                                      tries to take that
>                                                      balloon_lock and deadlocks
>
> Reported-by: Tetsuo Handa 
> Cc: Michal Hocko 
> Cc: Wei Wang 
> Signed-off-by: Michael S. Tsirkin 
>
>  drivers/virtio/virtio_balloon.c| 24 +++-
>  include/linux/balloon_compaction.h | 35 ++-
>  mm/balloon_compaction.c| 28 +---
>  3 files changed, 74 insertions(+), 13 deletions(-)


Somehow this commit seems to be killing powerpc guests.

The symptom is that the first page (64K) of the guest's memory gets
overwritten with zeroes, which is where our interrupt handlers are, so the
system rapidly locks up due to illegal instructions in the illegal
instruction handler.

There seems to be some element of a race, because it doesn't always
crash. Sometimes I can boot to a shell, but not often. When it does
happen it's fairly late in boot, but before I get to a shell.

I had a few bisects go off into the weeds due to the intermittent
nature. But once I realised that I changed my script to boot 5 times
before declaring a kernel good, and that bisected straight here.

I can also revert this commit on v4.15-rc1 and everything's fine again -
I got through ~250 boots with that kernel.

So I'm pretty sure this commit is triggering/exposing/causing the bug.

The other data point is that the page that's overwritten is mapped read
only in the guest kernel. So either the guest kernel is writing to it in
real mode (MMU off), or the hypervisor/DMA is doing it.

I haven't isolated if it's host kernel/qemu version dependent, at the
moment I'm just using distro packaged versions of both.

Anyway I'll try and dig further into it on Monday, but I thought I'd let
you know in case this is a known bug with a fix in the pipeline, or
rings any bells or whatever.

cheers


> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index f0b3a0b..7960746 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -143,16 +143,17 @@ static void set_page_pfns(struct virtio_balloon *vb,
>  
>  static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  {
> - struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
>   unsigned num_allocated_pages;
> + unsigned num_pfns;
> + struct page *page;
> + LIST_HEAD(pages);
>  
>   /* We can only do one array worth at a time. */
>   num = min(num, ARRAY_SIZE(vb->pfns));
>  
> - mutex_lock(&vb->balloon_lock);
> - for (vb->num_pfns = 0; vb->num_pfns < num;
> -  vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
> - struct page *page = balloon_page_enqueue(vb_dev_info);
> + for (num_pfns = 0; num_pfns < num;
> +  num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
> + struct page *page = balloon_page_alloc();
>  
>   if (!page) {
> dev_info_ratelimited(&vb->vdev->dev,
> @@ 

Re: [PATCH v18 06/10] virtio_ring: add a new API, virtqueue_add_one_desc

2017-12-01 Thread Wei Wang

On 12/01/2017 03:38 AM, Michael S. Tsirkin wrote:

On Wed, Nov 29, 2017 at 09:55:22PM +0800, Wei Wang wrote:

Current virtqueue_add API implementation is based on the scatterlist
struct, which uses kaddr. This does not cover all the use cases of
vring. For example:
- Some usages don't use an IOMMU; in this case the user can directly pass
  in a physical address at hand, instead of going through the sg
  implementation (e.g. the VIRTIO_BALLOON_F_SG feature)
- Sometimes, a guest physical page may not have a kaddr (e.g. high
  memory) but needs to use vring (e.g. the VIRTIO_BALLOON_F_FREE_PAGE_VQ
  feature)

The new API virtqueue_add_one_desc enables the caller to assign a vring
desc with a physical address and len. Also, factor out the common code
with virtqueue_add in vring_set_avail.

Signed-off-by: Wei Wang 
Cc: Michael S. Tsirkin 

You previously managed without this patch, and it's preferable
IMHO since this patchset is already too big.

I don't really understand what is wrong with virtio_add_sgs + sg_set_page.
I don't think it assumes a kaddr.



OK, I will use the previous method to send sgs.
Please check whether there are other things that need to be improved.

Best,
Wei


Re: [PATCH v18 05/10] xbitmap: add more operations

2017-12-01 Thread Wei Wang

On 11/30/2017 06:34 PM, Tetsuo Handa wrote:

Wei Wang wrote:

+ * @start: the start of the bit range, inclusive
+ * @end: the end of the bit range, inclusive
+ *
+ * This function is used to clear a bit in the xbitmap. If all the bits of the
+ * bitmap are 0, the bitmap will be freed.
+ */
+void xb_clear_bit_range(struct xb *xb, unsigned long start, unsigned long end)
+{
+   struct radix_tree_root *root = &xb->xbrt;
+   struct radix_tree_node *node;
+   void **slot;
+   struct ida_bitmap *bitmap;
+   unsigned int nbits;
+
+   for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
+   unsigned long index = start / IDA_BITMAP_BITS;
+   unsigned long bit = start % IDA_BITMAP_BITS;
+
+   bitmap = __radix_tree_lookup(root, index, &node, &slot);
+   if (radix_tree_exception(bitmap)) {
+   unsigned long ebit = bit + 2;
+   unsigned long tmp = (unsigned long)bitmap;
+
+   nbits = min(end - start + 1, BITS_PER_LONG - ebit);

"nbits = min(end - start + 1," seems to expect that start == end is legal
for clearing only 1 bit. But this function is no-op if start == end.
Please clarify what "inclusive" intended.


If xb_clear_bit_range(xb, 10, 10) is called, then it is effectively the same
as xb_clear_bit(10). Why would it be illegal?


"@start inclusive" means that the @start will also be included to be 
cleared.





+static __always_inline void bitmap_clear(unsigned long *map,
+   unsigned int start,
+   unsigned int nbits)
+{
+   if (__builtin_constant_p(nbits) && nbits == 1)
+   __clear_bit(start, map);
+   else if (__builtin_constant_p(start & 7) && IS_ALIGNED(start, 8) &&
+__builtin_constant_p(nbits & 7) && IS_ALIGNED(nbits, 8))

It looks strange to apply the __builtin_constant_p test to variables after "& 7".



I think this is normal - if the variables are known at compile time, the 
calculation will be done at compile time (termed constant folding).



Best,
Wei