Re: [RFC] virtio-net: help live migrate SR-IOV devices
On 30 November 2017 at 05:29, Jason Wang wrote:
>
> On 2017年11月29日 03:27, Jesse Brandeburg wrote:
>> Hi, I'd like to get some feedback on a proposal to enhance virtio-net
>> to ease configuration of a VM and that would enable live migration of
>> passthrough network SR-IOV devices.
>>
>> Today we have SR-IOV network devices (VFs) that can be passed into a VM
>> in order to enable high-performance networking directly within the VM.
>> The problem I am trying to address is that this configuration is
>> generally difficult to live-migrate. There is documentation [1]
>> indicating that some OS/Hypervisor vendors will support live migration
>> of a system with a direct-assigned networking device. The problem I
>> see with these implementations is that the network configuration
>> requirements that are passed on to the owner of the VM are quite
>> complicated. You have to set up bonding, you have to configure it to
>> enslave two interfaces, those interfaces (one is virtio-net, the other
>> is an SR-IOV device/driver like ixgbevf) must support MAC address
>> changes requested in the VM, and on and on...
>>
>> So, on to the proposal:
>> Modify the virtio-net driver to be a single VM network device that
>> enslaves an SR-IOV network device (inside the VM) with the same MAC
>> address. This would cause the virtio-net driver to appear and work like
>> a simplified bonding/team driver. The live migration problem would be
>> solved just like today's bonding solution, but the VM user's networking
>> config would be greatly simplified.
>>
>> At its simplest, it would appear something like this in the VM.
>>
>>   ===============
>>   =   vnet0     =
>>   =  (virtio-   =
>>   =   net)      =
>>   =  ===========
>>   =  = ixgbevf  =
>>   ===============
>>
>> (forgive the ASCII art)
>>
>> The fast path traffic would prefer the ixgbevf or other SR-IOV device
>> path, and fall back to virtio's transmit/receive when migrating.
>>
>> Compared to today's options this proposal would
>> 1) make virtio-net more sticky, allow fast path traffic at SR-IOV
>>    speeds
>> 2) simplify end user configuration in the VM (most if not all of the
>>    setup to enable migration would be done in the hypervisor)
>> 3) allow live migration via a simple link down and maybe a PCI
>>    hot-unplug of the SR-IOV device, with failover to the virtio-net
>>    driver core
>> 4) allow vendor-agnostic hardware acceleration, and live migration
>>    between vendors if the VM OS has driver support for all the required
>>    SR-IOV devices.
>>
>> Runtime operation proposed:
>> - virtio-net driver loads, SR-IOV driver loads
>> - virtio-net finds other NICs that match its MAC address, both by
>>   examining existing interfaces and by setting up a new device notifier
>> - virtio-net enslaves the first NIC with the same MAC address
>> - virtio-net brings up the slave, and makes it the "preferred" path
>> - virtio-net follows the behavior of an active-backup bond/team
>> - virtio-net acts as the interface to the VM
>> - live migration initiates
>> - link goes down on SR-IOV, or SR-IOV device is removed
>> - failover to virtio-net as primary path
>> - migration continues to new host
>> - new host is started with virtio-net as primary
>> - if no SR-IOV, virtio-net stays primary
>> - hypervisor can hot-add an SR-IOV NIC with the same MAC addr as virtio
>> - virtio-net notices the new NIC and starts over at the enslave step above
>>
>> Future ideas (brainstorming):
>> - Optimize fast east-west traffic by having special rules to direct
>>   east-west traffic through the virtio-net path
>>
>> Thanks for reading!
>> Jesse
>
> Cc netdev.
>
> Interesting, and this method is actually used by netvsc now:
>
> commit 0c195567a8f6e82ea5535cd9f1d54a1626dd233e
> Author: stephen hemminger
> Date:   Tue Aug 1 19:58:53 2017 -0700
>
>     netvsc: transparent VF management
>
>     This patch implements transparent fail over from synthetic NIC to
>     SR-IOV virtual function NIC in Hyper-V environment. It is a better
>     alternative to using bonding as is done now. Instead, the receive and
>     transmit fail over is done internally inside the driver.
>
>     Using bonding driver has lots of issues because it depends on the
>     script being run early enough in the boot process and with sufficient
>     information to make the association. This patch moves all that
>     functionality into the kernel.
>
>     Signed-off-by: Stephen Hemminger
>     Signed-off-by: David S. Miller
>
> If my understanding is correct there's no need for any extension of
> the virtio spec. If this is true, maybe you can start to prepare the patch?
>
> Thanks
>
>> [1]
>> https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/virtual_machine_management_guide/sect-migrating_virtual_machines_between_hosts
[PATCH net,stable v3] vhost: fix a few skb leaks
From: Wei Xu

Matthew found a roughly 40% tcp throughput regression with commit
c67df11f ("vhost_net: try batch dequing from skb array") as discussed
in the following thread:
https://www.mail-archive.com/netdev@vger.kernel.org/msg187936.html

This is v3.

v3:
- move freeing the skb from vhost to tun/tap recvmsg() to not
  confuse the callers.

v2:
- add Matthew as the reporter, thanks Matthew.
- move the zero-headcount check ahead instead of deferring skb
  consumption, per Jason's and MST's comments.
- free the skb in case recvmsg() fails.

Wei Xu (3):
  vhost: fix skb leak in handle_rx()
  tun: free skb in early errors
  tap: free skb if flags error

 drivers/net/tap.c   |  6 +-
 drivers/net/tun.c   | 14 +++---
 drivers/vhost/net.c | 20 ++--
 3 files changed, 26 insertions(+), 14 deletions(-)

--
1.8.3.1
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH] virtio-mmio: Use PTR_ERR_OR_ZERO()
Fix ptr_ret.cocci warnings:
drivers/virtio/virtio_mmio.c:653:1-3: WARNING: PTR_ERR_OR_ZERO can be used

Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR

Generated by: scripts/coccinelle/api/ptr_ret.cocci

Signed-off-by: Vasyl Gomonovych
---
 drivers/virtio/virtio_mmio.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
index 74dc7170fd35..8cffcb1706e1 100644
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -650,10 +650,8 @@ static int vm_cmdline_set(const char *device,
 	pdev = platform_device_register_resndata(&vm_cmdline_parent,
 			"virtio-mmio", vm_cmdline_id++,
 			resources, ARRAY_SIZE(resources), NULL, 0);
-	if (IS_ERR(pdev))
-		return PTR_ERR(pdev);

-	return 0;
+	return PTR_ERR_OR_ZERO(pdev);
 }

 static int vm_cmdline_get_device(struct device *dev, void *data)
--
1.9.1
Re: [PATCH net,stable] vhost: fix skb leak in handle_rx()
On Tue, Nov 28, 2017 at 07:53:33PM +0200, Michael S. Tsirkin wrote:
> On Tue, Nov 28, 2017 at 12:17:16PM -0500, w...@redhat.com wrote:
> > From: Wei Xu
> >
> > Matthew found a roughly 40% tcp throughput regression with commit
> > c67df11f ("vhost_net: try batch dequing from skb array") as discussed
> > in the following thread:
> > https://www.mail-archive.com/netdev@vger.kernel.org/msg187936.html
> >
> > Eventually we figured out that it was a skb leak in handle_rx()
> > when sending packets to the VM. This usually happens when a guest
> > cannot drain the vq as fast as vhost fills it; that sets off the
> > traffic jam and leaks skb(s), since there is no headcount to send
> > on the vq from the vhost side.
> >
> > This can be avoided by making sure we have got enough headcount
> > before actually consuming a skb from the batched rx array while
> > transmitting, which is simply done by deferring it a moment later
> > in this patch.
> >
> > Signed-off-by: Wei Xu
>
> Given the amount of effort Matthew has put into this,
> you definitely want
>
> Reported-by: Matthew Rosato
>
> here.

Absolutely we want that, sorry I missed you here, Matthew. It is more
like you figured out this issue independently all by yourself, with a
wide assortment of extremely quick tests (tools, throughput, slub leak,
etc.), and I am happy that I have the opportunity to do the paperwork
on your behalf. :) Thanks a lot, Matthew.

Wei

> Let's give credit where credit is due.
>
> Thanks a lot Matthew!
>
> > ---
> >  drivers/vhost/net.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > index 8d626d7..e76535e 100644
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -778,8 +778,6 @@ static void handle_rx(struct vhost_net *net)
> >  		/* On error, stop handling until the next kick. */
> >  		if (unlikely(headcount < 0))
> >  			goto out;
> > -		if (nvq->rx_array)
> > -			msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
> >  		/* On overrun, truncate and discard */
> >  		if (unlikely(headcount > UIO_MAXIOV)) {
> >  			iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
> >  			err = sock->ops->recvmsg(sock, &msg,
> >  						 1, MSG_DONTWAIT | MSG_TRUNC);
> >  			pr_debug("Discarded rx packet: len %zd\n", sock_len);
> >  			continue;
> >  		}
> > @@ -809,6 +807,8 @@ static void handle_rx(struct vhost_net *net)
> >  			 */
> >  			iov_iter_advance(&msg.msg_iter, vhost_hlen);
> >  		}
> > +		if (nvq->rx_array)
> > +			msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
> >  		err = sock->ops->recvmsg(sock, &msg,
> >  					 sock_len, MSG_DONTWAIT | MSG_TRUNC);
> >  		/* Userspace might have consumed the packet meanwhile:
> > --
> > 1.8.3.1
Re: [PATCH net,stable] vhost: fix skb leak in handle_rx()
On Tue, Nov 28, 2017 at 07:50:58PM +0200, Michael S. Tsirkin wrote:
> On Tue, Nov 28, 2017 at 12:17:16PM -0500, w...@redhat.com wrote:
> > From: Wei Xu
> >
> > Matthew found a roughly 40% tcp throughput regression with commit
> > c67df11f ("vhost_net: try batch dequing from skb array") as discussed
> > in the following thread:
> > https://www.mail-archive.com/netdev@vger.kernel.org/msg187936.html
> >
> > Eventually we figured out that it was a skb leak in handle_rx()
> > when sending packets to the VM. This usually happens when a guest
> > cannot drain the vq as fast as vhost fills it; that sets off the
> > traffic jam and leaks skb(s), since there is no headcount to send
> > on the vq from the vhost side.
> >
> > This can be avoided by making sure we have got enough headcount
> > before actually consuming a skb from the batched rx array while
> > transmitting, which is simply done by deferring it a moment later
> > in this patch.
> >
> > Signed-off-by: Wei Xu
>
> Looks like a good way to fix it, but it will still leak if recvmsg
> returns a wrong length (I'm not sure this can happen in practice, but
> best to keep it simple).

Right, it is better to defend against this case; will include it in v2.

> Also, we need to add this before each recvmsg, including overrun,
> and discard on error.

OK.

Wei

> > ---
> >  drivers/vhost/net.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > index 8d626d7..e76535e 100644
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -778,8 +778,6 @@ static void handle_rx(struct vhost_net *net)
> >  		/* On error, stop handling until the next kick. */
> >  		if (unlikely(headcount < 0))
> >  			goto out;
> > -		if (nvq->rx_array)
> > -			msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
> >  		/* On overrun, truncate and discard */
> >  		if (unlikely(headcount > UIO_MAXIOV)) {
> >  			iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
> >  			err = sock->ops->recvmsg(sock, &msg,
> >  						 1, MSG_DONTWAIT | MSG_TRUNC);
> >  			pr_debug("Discarded rx packet: len %zd\n", sock_len);
> >  			continue;
> >  		}
> > @@ -809,6 +807,8 @@ static void handle_rx(struct vhost_net *net)
> >  			 */
> >  			iov_iter_advance(&msg.msg_iter, vhost_hlen);
> >  		}
> > +		if (nvq->rx_array)
> > +			msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
> >  		err = sock->ops->recvmsg(sock, &msg,
> >  					 sock_len, MSG_DONTWAIT | MSG_TRUNC);
> >  		/* Userspace might have consumed the packet meanwhile:
> > --
> > 1.8.3.1
Re: [RFC] virtio-net: help live migrate SR-IOV devices
On 11/30/2017 6:11 AM, Michael S. Tsirkin wrote:
> On Thu, Nov 30, 2017 at 10:08:45AM +0200, achiad shochat wrote:
>> Re. problem #2:
>> Indeed the best way to address it seems to be to enslave the VF driver
>> netdev under a persistent anchor netdev.
>> And it's indeed desired to allow (but not enforce) PV netdev and VF
>> netdev to work in conjunction.
>> And it's indeed desired that this enslavement logic work out-of-the-box.
>> But in case of PV+VF some configurable policies must be in place (and
>> they'd better be generic rather than differ per PV technology).
>> For example - based on which characteristics should the PV+VF coupling
>> be done? netvsc uses MAC address, but that might not always be the
>> desire.
>
> It's a policy but not guest userspace policy. The hypervisor certainly
> knows. Are you concerned that someone might want to create two devices
> with the same MAC for an unrelated reason? If so, the hypervisor could
> easily set a flag in the virtio device to say "this is a backup, use
> MAC to find another device".

This is something I was going to suggest: a flag or other configuration
on the virtio device to help control how this new feature is used. I can
imagine this might be useful to control from either the hypervisor side
or the VM side.

The hypervisor might want to (1) disable it (force it off), (2) enable
it for VM choice, or (3) force it on for the VM. In case (2), the VM
might be able to choose whether it wants to make use of the feature, or
stick with the bonding solution.

Either way, the kernel is making a feature available, and the user (VM
or hypervisor) is able to control it by selecting the feature based on
the policy desired.

sln
Re: [RFC] virtio-net: help live migrate SR-IOV devices
On Thu, 30 Nov 2017 15:54:40 +0200, Michael S. Tsirkin wrote: > On Wed, Nov 29, 2017 at 07:51:38PM -0800, Jakub Kicinski wrote: > > On Thu, 30 Nov 2017 11:29:56 +0800, Jason Wang wrote: > > > On 2017年11月29日 03:27, Jesse Brandeburg wrote: > > > > Hi, I'd like to get some feedback on a proposal to enhance virtio-net > > > > to ease configuration of a VM and that would enable live migration of > > > > passthrough network SR-IOV devices. > > > > > > > > Today we have SR-IOV network devices (VFs) that can be passed into a VM > > > > in order to enable high performance networking direct within the VM. > > > > The problem I am trying to address is that this configuration is > > > > generally difficult to live-migrate. There is documentation [1] > > > > indicating that some OS/Hypervisor vendors will support live migration > > > > of a system with a direct assigned networking device. The problem I > > > > see with these implementations is that the network configuration > > > > requirements that are passed on to the owner of the VM are quite > > > > complicated. You have to set up bonding, you have to configure it to > > > > enslave two interfaces, those interfaces (one is virtio-net, the other > > > > is SR-IOV device/driver like ixgbevf) must support MAC address changes > > > > requested in the VM, and on and on... > > > > > > > > So, on to the proposal: > > > > Modify virtio-net driver to be a single VM network device that > > > > enslaves an SR-IOV network device (inside the VM) with the same MAC > > > > address. This would cause the virtio-net driver to appear and work like > > > > a simplified bonding/team driver. The live migration problem would be > > > > solved just like today's bonding solution, but the VM user's networking > > > > config would be greatly simplified. > > > > > > > > At it's simplest, it would appear something like this in the VM. 
> > > > > > > > == > > > > = vnet0 = > > > > = > > > > (virtio- = | > > > > net)= | > > > > = == > > > > = = ixgbef = > > > > == == > > > > > > > > (forgive the ASCII art) > > > > > > > > The fast path traffic would prefer the ixgbevf or other SR-IOV device > > > > path, and fall back to virtio's transmit/receive when migrating. > > > > > > > > Compared to today's options this proposal would > > > > 1) make virtio-net more sticky, allow fast path traffic at SR-IOV > > > > speeds > > > > 2) simplify end user configuration in the VM (most if not all of the > > > > set up to enable migration would be done in the hypervisor) > > > > 3) allow live migration via a simple link down and maybe a PCI > > > > hot-unplug of the SR-IOV device, with failover to the virtio-net > > > > driver core > > > > 4) allow vendor agnostic hardware acceleration, and live migration > > > > between vendors if the VM os has driver support for all the required > > > > SR-IOV devices. > > > > > > > > Runtime operation proposed: > > > > - virtio-net driver loads, SR-IOV driver loads > > > > - virtio-net finds other NICs that match it's MAC address by > > > >both examining existing interfaces, and sets up a new device notifier > > > > - virtio-net enslaves the first NIC with the same MAC address > > > > - virtio-net brings up the slave, and makes it the "preferred" path > > > > - virtio-net follows the behavior of an active backup bond/team > > > > - virtio-net acts as the interface to the VM > > > > - live migration initiates > > > > - link goes down on SR-IOV, or SR-IOV device is removed > > > > - failover to virtio-net as primary path > > > > - migration continues to new host > > > > - new host is started with virio-net as primary > > > > - if no SR-IOV, virtio-net stays primary > > > > - hypervisor can hot-add SR-IOV NIC, with same MAC addr as virtio > > > > - virtio-net notices new NIC and starts over at enslave step above > > > > > > > > Future ideas (brainstorming): > > > > - Optimize Fast 
east-west by having special rules to direct east-west > > > >traffic through virtio-net traffic path > > > > > > > > Thanks for reading! > > > > Jesse > > > > > > Cc netdev. > > > > > > Interesting, and this method is actually used by netvsc now: > > > > > > commit 0c195567a8f6e82ea5535cd9f1d54a1626dd233e > > > Author: stephen hemminger> > > Date: Tue Aug 1 19:58:53 2017 -0700 > > > > > > netvsc: transparent VF management > > > > > > This patch implements transparent fail over from synthetic NIC to > > > SR-IOV virtual function NIC in Hyper-V environment. It is a better > > > alternative to using bonding as is done now. Instead, the receive and > > > transmit fail over is done internally inside the driver. > > > > > > Using bonding driver has lots of issues because it depends on the > > > script being run early enough in the boot process and with sufficient > > > information to make the
[PATCH] drm/virtio: Don't return invalid caps on timeout
If the wait times out, the caps are probably invalid and we shouldn't
be passing them to userspace.

Signed-off-by: Tomeu Vizoso
---
 drivers/gpu/drm/virtio/virtgpu_ioctl.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/virtio/virtgpu_ioctl.c b/drivers/gpu/drm/virtio/virtgpu_ioctl.c
index b94bd5440e57..902120ad4a6d 100644
--- a/drivers/gpu/drm/virtio/virtgpu_ioctl.c
+++ b/drivers/gpu/drm/virtio/virtgpu_ioctl.c
@@ -517,6 +517,8 @@ static int virtio_gpu_get_caps_ioctl(struct drm_device *dev,
 	ret = wait_event_timeout(vgdev->resp_wq,
 				 atomic_read(&cache_ent->is_valid), 5 * HZ);
+	if (!ret)
+		return -EBUSY;

 	ptr = cache_ent->caps_cache;
--
2.14.3
[PATCH] virtio_balloon: fix increment of vb->num_pfns in fill_balloon()
commit c7cdff0e8647 ("virtio_balloon: fix deadlock on OOM") changed
the code to increment vb->num_pfns before the call to set_page_pfns(),
which used to happen only after.

This patch fixes a boot hang for me on ppc64le KVM guests.

Fixes: c7cdff0e8647 ("virtio_balloon: fix deadlock on OOM")
Cc: Michael S. Tsirkin
Cc: Tetsuo Handa
Cc: Michal Hocko
Cc: Wei Wang
Signed-off-by: Jan Stancek
---
 drivers/virtio/virtio_balloon.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 7960746f7597..a1fb52cb3f0a 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -174,13 +174,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 	while ((page = balloon_page_pop(&pages))) {
 		balloon_page_enqueue(&vb->vb_dev_info, page);

-		vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE;
-
 		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
 		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
 		if (!virtio_has_feature(vb->vdev,
 					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
 			adjust_managed_page_count(page, -1);
+		vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE;
 	}

 	num_allocated_pages = vb->num_pfns;
--
1.8.3.1
[PATCH net,stable v4 0/3] vhost: fix a few skb leaks
From: Wei Xu

Matthew found a roughly 40% tcp throughput regression with commit
c67df11f ("vhost_net: try batch dequing from skb array") as discussed
in the following thread:
https://www.mail-archive.com/netdev@vger.kernel.org/msg187936.html

v4:
- fix zero iov iterator count in tun/tap_do_read() (Jason)
- don't put tun in case of EBADFD (Jason)
- replace msg->msg_control with the new 'skb' when calling tun/tap_do_read()

v3:
- move freeing the skb from vhost to tun/tap recvmsg() to not
  confuse the callers.

v2:
- add Matthew as the reporter, thanks Matthew.
- move the zero-headcount check ahead instead of deferring skb
  consumption, per Jason's and MST's comments.
- free the skb in case recvmsg() fails.

Wei Xu (3):
  vhost: fix skb leak in handle_rx()
  tun: free skb in early errors
  tap: free skb if flags error

 drivers/net/tap.c   | 14 ++
 drivers/net/tun.c   | 24 ++--
 drivers/vhost/net.c | 20 ++--
 3 files changed, 38 insertions(+), 20 deletions(-)

--
1.8.3.1
[PATCH 1/3] vhost: fix skb leak in handle_rx()
From: Wei Xu

Matthew found a roughly 40% tcp throughput regression with commit
c67df11f ("vhost_net: try batch dequing from skb array") as discussed
in the following thread:
https://www.mail-archive.com/netdev@vger.kernel.org/msg187936.html

Eventually we figured out that it was a skb leak in handle_rx()
when sending packets to the VM. This usually happens when a guest
cannot drain the vq as fast as vhost fills it; that sets off the
traffic jam and leaks skb(s), since there is no headcount to send
on the vq from the vhost side.

This can be avoided by making sure we have got enough headcount
before actually consuming a skb from the batched rx array while
transmitting, which is simply done by moving the zero-headcount
check a bit ahead.

Signed-off-by: Wei Xu
Reported-by: Matthew Rosato
---
 drivers/vhost/net.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 8d626d7..c7bdeb6 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -778,16 +778,6 @@ static void handle_rx(struct vhost_net *net)
 		/* On error, stop handling until the next kick. */
 		if (unlikely(headcount < 0))
 			goto out;
-		if (nvq->rx_array)
-			msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
-		/* On overrun, truncate and discard */
-		if (unlikely(headcount > UIO_MAXIOV)) {
-			iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
-			err = sock->ops->recvmsg(sock, &msg,
-						 1, MSG_DONTWAIT | MSG_TRUNC);
-			pr_debug("Discarded rx packet: len %zd\n", sock_len);
-			continue;
-		}
 		/* OK, now we need to know about added descriptors. */
 		if (!headcount) {
 			if (unlikely(vhost_enable_notify(&net->dev, vq))) {
@@ -800,6 +790,16 @@ static void handle_rx(struct vhost_net *net)
 			 * they refilled. */
 			goto out;
 		}
+		if (nvq->rx_array)
+			msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
+		/* On overrun, truncate and discard */
+		if (unlikely(headcount > UIO_MAXIOV)) {
+			iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
+			err = sock->ops->recvmsg(sock, &msg,
+						 1, MSG_DONTWAIT | MSG_TRUNC);
+			pr_debug("Discarded rx packet: len %zd\n", sock_len);
+			continue;
+		}
 		/* We don't need to be notified again. */
 		iov_iter_init(&msg.msg_iter, READ, vq->iov, in, vhost_len);
 		fixup = msg.msg_iter;
--
1.8.3.1
Re: [PATCH net,stable v4 0/3] vhost: fix a few skb leaks
On 12/01/2017 09:47 AM, Michael S. Tsirkin wrote:
> On Fri, Dec 01, 2017 at 05:10:35AM -0500, w...@redhat.com wrote:
>> From: Wei Xu
>>
>> Matthew found a roughly 40% tcp throughput regression with commit
>> c67df11f ("vhost_net: try batch dequing from skb array") as discussed
>> in the following thread:
>> https://www.mail-archive.com/netdev@vger.kernel.org/msg187936.html
>
> Series
>
> Acked-by: Michael S. Tsirkin

Tested this series on a z13 (s390x) on top of net-next using 4GB/4vcpu
guests. Verified that both the reported TCP throughput regression and
memory leak are resolved.

net-next:  19.83 Gb/s
net-next+: 35.02 Gb/s

Thanks all!

Matt

>> v4:
>> - fix zero iov iterator count in tun/tap_do_read() (Jason)
>> - don't put tun in case of EBADFD (Jason)
>> - replace msg->msg_control with the new 'skb' when calling tun/tap_do_read()
>>
>> v3:
>> - move freeing the skb from vhost to tun/tap recvmsg() to not
>>   confuse the callers.
>>
>> v2:
>> - add Matthew as the reporter, thanks Matthew.
>> - move the zero-headcount check ahead instead of deferring skb
>>   consumption, per Jason's and MST's comments.
>> - free the skb in case recvmsg() fails.
>>
>> Wei Xu (3):
>>   vhost: fix skb leak in handle_rx()
>>   tun: free skb in early errors
>>   tap: free skb if flags error
>>
>>  drivers/net/tap.c   | 14 ++
>>  drivers/net/tun.c   | 24 ++--
>>  drivers/vhost/net.c | 20 ++--
>>  3 files changed, 38 insertions(+), 20 deletions(-)
>>
>> --
>> 1.8.3.1
[PATCH 2/3] tun: free skb in early errors
From: Wei Xu

tun_recvmsg() supports accepting an skb via msg_control after commit
ac77cfd4258f ("tun: support receiving skb through msg_control"); the
skb, if present, should be freed no matter how far it gets, otherwise
it will be leaked. This patch fixes several missed cases.

Signed-off-by: Wei Xu
Reported-by: Matthew Rosato
---
 drivers/net/tun.c | 24 ++--
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 9574900..4f4a842 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1952,8 +1952,11 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
 	tun_debug(KERN_INFO, tun, "tun_do_read\n");

-	if (!iov_iter_count(to))
+	if (!iov_iter_count(to)) {
+		if (skb)
+			kfree_skb(skb);
 		return 0;
+	}

 	if (!skb) {
 		/* Read frames from ring */
@@ -2069,22 +2072,24 @@ static int tun_recvmsg(struct socket *sock, struct msghdr *m, size_t total_len,
 {
 	struct tun_file *tfile = container_of(sock, struct tun_file, socket);
 	struct tun_struct *tun = tun_get(tfile);
+	struct sk_buff *skb = m->msg_control;
 	int ret;

-	if (!tun)
-		return -EBADFD;
+	if (!tun) {
+		ret = -EBADFD;
+		goto out_free_skb;
+	}

 	if (flags & ~(MSG_DONTWAIT|MSG_TRUNC|MSG_ERRQUEUE)) {
 		ret = -EINVAL;
-		goto out;
+		goto out_put_tun;
 	}
 	if (flags & MSG_ERRQUEUE) {
 		ret = sock_recv_errqueue(sock->sk, m, total_len,
 					 SOL_PACKET, TUN_TX_TIMESTAMP);
 		goto out;
 	}
-	ret = tun_do_read(tun, tfile, &m->msg_iter, flags & MSG_DONTWAIT,
-			  m->msg_control);
+	ret = tun_do_read(tun, tfile, &m->msg_iter, flags & MSG_DONTWAIT, skb);
 	if (ret > (ssize_t)total_len) {
 		m->msg_flags |= MSG_TRUNC;
 		ret = flags & MSG_TRUNC ? ret : total_len;
 	}
@@ -2092,6 +2097,13 @@ static int tun_recvmsg(struct socket *sock, struct msghdr *m, size_t total_len,
 out:
 	tun_put(tun);
 	return ret;
+
+out_put_tun:
+	tun_put(tun);
+out_free_skb:
+	if (skb)
+		kfree_skb(skb);
+	return ret;
 }

 static int tun_peek_len(struct socket *sock)
--
1.8.3.1
[PATCH 2/3] tun: free skb in early errors
From: Wei Xu

tun_recvmsg() supports accepting an skb via msg_control after commit
ac77cfd4258f ("tun: support receiving skb through msg_control"); the
skb, if present, should be freed within the function, otherwise it
will be leaked.

Signed-off-by: Wei Xu
Reported-by: Matthew Rosato
---
 drivers/net/tun.c | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 6a7bde9..5563430 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2067,14 +2067,17 @@ static int tun_recvmsg(struct socket *sock, struct msghdr *m, size_t total_len,
 {
 	struct tun_file *tfile = container_of(sock, struct tun_file, socket);
 	struct tun_struct *tun = tun_get(tfile);
+	struct sk_buff *skb = m->msg_control;
 	int ret;

-	if (!tun)
-		return -EBADFD;
+	if (!tun) {
+		ret = -EBADFD;
+		goto out_free_skb;
+	}

 	if (flags & ~(MSG_DONTWAIT|MSG_TRUNC|MSG_ERRQUEUE)) {
 		ret = -EINVAL;
-		goto out;
+		goto out_free_skb;
 	}
 	if (flags & MSG_ERRQUEUE) {
 		ret = sock_recv_errqueue(sock->sk, m, total_len,
@@ -2087,6 +2090,11 @@ static int tun_recvmsg(struct socket *sock, struct msghdr *m, size_t total_len,
 		m->msg_flags |= MSG_TRUNC;
 		ret = flags & MSG_TRUNC ? ret : total_len;
 	}
+	goto out;
+
+out_free_skb:
+	if (skb)
+		kfree_skb(skb);
 out:
 	tun_put(tun);
 	return ret;
--
1.8.3.1
[PATCH 3/3] tap: free skb if flags error
From: Wei Xu. tap_recvmsg() supports accepting an skb via msg_control since commit 3b4ba04acca8 ("tap: support receiving skb from msg_control"); if an skb is passed in, it should be freed within the function, otherwise it is leaked. Signed-off-by: Wei Xu Reported-by: Matthew Rosato --- drivers/net/tap.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/drivers/net/tap.c b/drivers/net/tap.c index e9489b8..1c66b75 100644 --- a/drivers/net/tap.c +++ b/drivers/net/tap.c @@ -1154,9 +1154,13 @@ static int tap_recvmsg(struct socket *sock, struct msghdr *m, size_t total_len, int flags) { struct tap_queue *q = container_of(sock, struct tap_queue, sock); + struct sk_buff *skb = m->msg_control; int ret; - if (flags & ~(MSG_DONTWAIT|MSG_TRUNC)) + if (flags & ~(MSG_DONTWAIT|MSG_TRUNC)) { + if (skb) + kfree_skb(skb); return -EINVAL; + } ret = tap_do_read(q, &m->msg_iter, flags & MSG_DONTWAIT, m->msg_control); if (ret > total_len) { -- 1.8.3.1
Re: [PATCH net,stable v2] vhost: fix skb leak in handle_rx()
On Thu, Nov 30, 2017 at 10:46:17AM +0800, Jason Wang wrote: > > > On 2017年11月29日 23:31, Michael S. Tsirkin wrote: > > On Wed, Nov 29, 2017 at 09:23:24AM -0500,w...@redhat.com wrote: > > > From: Wei Xu> > > > > > Matthew found a roughly 40% tcp throughput regression with commit > > > c67df11f(vhost_net: try batch dequing from skb array) as discussed > > > in the following thread: > > > https://www.mail-archive.com/netdev@vger.kernel.org/msg187936.html > > > > > > Eventually we figured out that it was a skb leak in handle_rx() > > > when sending packets to the VM. This usually happens when a guest > > > can not drain out vq as fast as vhost fills in, afterwards it sets > > > off the traffic jam and leaks skb(s) which occurs as no headcount > > > to send on the vq from vhost side. > > > > > > This can be avoided by making sure we have got enough headcount > > > before actually consuming a skb from the batched rx array while > > > transmitting, which is simply done by moving checking the zero > > > headcount a bit ahead. > > > > > > Also strengthen the small possibility of leak in case of recvmsg() > > > fails by freeing the skb. > > > > > > Signed-off-by: Wei Xu > > > Reported-by: Matthew Rosato > > > --- > > > drivers/vhost/net.c | 23 +-- > > > 1 file changed, 13 insertions(+), 10 deletions(-) > > > > > > v2: > > > - add Matthew as the reporter, thanks matthew. > > > - moving zero headcount check ahead instead of defer consuming skb > > >due to jason and mst's comment. > > > - add freeing skb in favor of recvmsg() fails. > > > > > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c > > > index 8d626d7..e302e08 100644 > > > --- a/drivers/vhost/net.c > > > +++ b/drivers/vhost/net.c > > > @@ -778,16 +778,6 @@ static void handle_rx(struct vhost_net *net) > > > /* On error, stop handling until the next kick. 
*/ > > > if (unlikely(headcount < 0)) > > > goto out; > > > - if (nvq->rx_array) > > > - msg.msg_control = vhost_net_buf_consume(&nvq->rxq); > > > - /* On overrun, truncate and discard */ > > > - if (unlikely(headcount > UIO_MAXIOV)) { > > > - iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1); > > > - err = sock->ops->recvmsg(sock, &msg, > > > - 1, MSG_DONTWAIT | MSG_TRUNC); > > > - pr_debug("Discarded rx packet: len %zd\n", sock_len); > > > - continue; > > > - } > > > /* OK, now we need to know about added descriptors. */ > > > if (!headcount) { > > > if (unlikely(vhost_enable_notify(&net->dev, > > > vq))) { > > > @@ -800,6 +790,18 @@ static void handle_rx(struct vhost_net *net) > > >* they refilled. */ > > > goto out; > > > } > > > + if (nvq->rx_array) > > > + msg.msg_control = vhost_net_buf_consume(&nvq->rxq); > > > + /* On overrun, truncate and discard */ > > > + if (unlikely(headcount > UIO_MAXIOV)) { > > > + iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1); > > > + err = sock->ops->recvmsg(sock, &msg, > > > + 1, MSG_DONTWAIT | MSG_TRUNC); > > > + if (unlikely(err != 1)) > > Why 1? How is receiving 1 byte special or even possible? > > Also, I wouldn't put an unlikely here. It's all error handling code anyway. Vhost is dropping the skb by invoking a 1 byte recvmsg() here, while it is kind of weird to free skb since it would have been freed in recvmsg() for most cases, and the return value doesn't make sense too much. > > > > > + kfree_skb((struct sk_buff *)msg.msg_control); > > You do not need a cast here. Yes, exactly, I missed it. > > Also, is it really safe to refer to msg_control here? > > I'd rather keep a copy of the skb pointer and use it than assume > > caller did not change it. But also see below. It should be safe since msg is a local variable here, the callee has no chance to modify it, except rx_array is not used by vhost and then it becomes uncertain, but I don't know what the case is. Isn't vhost using rx_array for all kinds of devices? Any clue rings the bell? 
> > > > > + pr_debug("Discarded rx packet: len %zd\n", sock_len); > > > + continue; > > > + } > > > /* We don't need to be notified again. */ > > > iov_iter_init(&msg.msg_iter, READ, vq->iov, in, > > > vhost_len); > > > fixup = msg.msg_iter; > > > @@ -818,6 +820,7 @@ static void handle_rx(struct vhost_net *net) > > > pr_debug("Discarded rx packet: " > > >" len %d, expected %zd\n", err, > > > sock_len); > > >
Re: [PATCH net,stable v2] vhost: fix skb leak in handle_rx()
On Wed, Nov 29, 2017 at 10:43:33PM +0800, Jason Wang wrote: > > > On 2017年11月29日 22:23, w...@redhat.com wrote: > > From: Wei Xu> > > > Matthew found a roughly 40% tcp throughput regression with commit > > c67df11f(vhost_net: try batch dequing from skb array) as discussed > > in the following thread: > > https://www.mail-archive.com/netdev@vger.kernel.org/msg187936.html > > > > Eventually we figured out that it was a skb leak in handle_rx() > > when sending packets to the VM. This usually happens when a guest > > can not drain out vq as fast as vhost fills in, afterwards it sets > > off the traffic jam and leaks skb(s) which occurs as no headcount > > to send on the vq from vhost side. > > > > This can be avoided by making sure we have got enough headcount > > before actually consuming a skb from the batched rx array while > > transmitting, which is simply done by moving checking the zero > > headcount a bit ahead. > > > > Also strengthen the small possibility of leak in case of recvmsg() > > fails by freeing the skb. > > > > Signed-off-by: Wei Xu > > Reported-by: Matthew Rosato > > --- > > drivers/vhost/net.c | 23 +-- > > 1 file changed, 13 insertions(+), 10 deletions(-) > > > > v2: > > - add Matthew as the reporter, thanks matthew. > > - moving zero headcount check ahead instead of defer consuming skb > >due to jason and mst's comment. > > - add freeing skb in favor of recvmsg() fails. > > > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c > > index 8d626d7..e302e08 100644 > > --- a/drivers/vhost/net.c > > +++ b/drivers/vhost/net.c > > @@ -778,16 +778,6 @@ static void handle_rx(struct vhost_net *net) > > /* On error, stop handling until the next kick. 
*/ > > if (unlikely(headcount < 0)) > > goto out; > > - if (nvq->rx_array) > > - msg.msg_control = vhost_net_buf_consume(&nvq->rxq); > > - /* On overrun, truncate and discard */ > > - if (unlikely(headcount > UIO_MAXIOV)) { > > - iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1); > > - err = sock->ops->recvmsg(sock, &msg, > > - 1, MSG_DONTWAIT | MSG_TRUNC); > > - pr_debug("Discarded rx packet: len %zd\n", sock_len); > > - continue; > > - } > > /* OK, now we need to know about added descriptors. */ > > if (!headcount) { > > if (unlikely(vhost_enable_notify(&net->dev, vq))) { > > @@ -800,6 +790,18 @@ static void handle_rx(struct vhost_net *net) > > * they refilled. */ > > goto out; > > } > > + if (nvq->rx_array) > > + msg.msg_control = vhost_net_buf_consume(&nvq->rxq); > > + /* On overrun, truncate and discard */ > > + if (unlikely(headcount > UIO_MAXIOV)) { > > + iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1); > > + err = sock->ops->recvmsg(sock, &msg, > > + 1, MSG_DONTWAIT | MSG_TRUNC); > > + if (unlikely(err != 1)) > > + kfree_skb((struct sk_buff *)msg.msg_control); > > I think we'd better fix this in tun/tap (better in another patch) otherwise > it lead to an odd API: some case skb were freed in recvmsg() but caller > still need to deal with the rest case. Right, it is better to handle it in recvmsg(). Wei
Re: [RFC] virtio-net: help live migrate SR-IOV devices
On Wed, 29 Nov 2017 20:10:09 -0800, Stephen Hemminger wrote: > On Wed, 29 Nov 2017 19:51:38 -0800 Jakub Kicinski wrote: > > On Thu, 30 Nov 2017 11:29:56 +0800, Jason Wang wrote: > > > On 2017年11月29日 03:27, Jesse Brandeburg wrote: > > > commit 0c195567a8f6e82ea5535cd9f1d54a1626dd233e > > > Author: stephen hemminger> > > Date: Tue Aug 1 19:58:53 2017 -0700 > > > > > > netvsc: transparent VF management > > > > > > This patch implements transparent fail over from synthetic NIC > > > to SR-IOV virtual function NIC in Hyper-V environment. It is a > > > better alternative to using bonding as is done now. Instead, the > > > receive and transmit fail over is done internally inside the driver. > > > > > > Using bonding driver has lots of issues because it depends on > > > the script being run early enough in the boot process and with > > > sufficient information to make the association. This patch moves > > > all that functionality into the kernel. > > > > > > Signed-off-by: Stephen Hemminger > > > Signed-off-by: David S. Miller > > > > > > If my understanding is correct there's no need to for any extension > > > of virtio spec. If this is true, maybe you can start to prepare the > > > patch? > > > > IMHO this is as close to policy in the kernel as one can get. User > > land has all the information it needs to instantiate that bond/team > > automatically. In fact I'm trying to discuss this with NetworkManager > > folks and Red Hat right now: > > > > https://mail.gnome.org/archives/networkmanager-list/2017-November/msg00038.html > > > > Can we flip the argument and ask why is the kernel supposed to be > > responsible for this? It's not like we run DHCP out of the kernel > > on new interfaces... > > Although "policy should not be in the kernel" is a a great mantra, > it is not practical in the real world. 
> > If you think it can be solved in userspace, then you haven't had to > deal with four different network initialization > systems, multiple orchestration systems and customers on ancient > Enterprise distributions. I would accept that argument if anyone ever tried to get those Enterprise distros to handle this use case. From conversations I had it seemed like no one ever did, and SR-IOV+virtio bonding is what has been done to solve this since day 1 of SR-IOV networking. For practical reasons it's easier to push this into the kernel, because vendors rarely employ developers of the user-space orchestration systems. Is that not the real problem here, potentially? :)
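The active-backup behavior debated in this thread reduces to a small path-selection rule: prefer the SR-IOV VF while it is present and its link is up, otherwise fall back to the virtio-net queues. A minimal userspace sketch of that rule; the names (`failover_state`, `select_datapath`) are illustrative, not the proposed driver API:

```c
#include <stdbool.h>

/* Hypothetical slave state; in a real driver this would come from the
 * enslaved net_device (carrier state, presence after hot-plug). */
struct failover_state {
    bool vf_present;   /* SR-IOV VF currently enslaved */
    bool vf_link_up;   /* VF carrier is up */
};

/* Active-backup selection: use the SR-IOV fast path when available,
 * fall back to virtio-net during migration (VF unplugged or link down). */
static const char *select_datapath(const struct failover_state *s)
{
    if (s->vf_present && s->vf_link_up)
        return "vf";
    return "virtio";
}
```

During live migration the hypervisor only has to drop the VF link or hot-unplug it; the same rule then routes traffic through virtio-net with no guest-side reconfiguration.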
[PATCH net,stable] vhost: fix skb leak in handle_rx()
From: Wei Xu. Matthew found a roughly 40% tcp throughput regression with commit c67df11f (vhost_net: try batch dequing from skb array) as discussed in the following thread: https://www.mail-archive.com/netdev@vger.kernel.org/msg187936.html Eventually we figured out that it was a skb leak in handle_rx() when sending packets to the VM. This usually happens when a guest can not drain out vq as fast as vhost fills in, afterwards it sets off the traffic jam and leaks skb(s) which occurs as no headcount to send on the vq from vhost side. This can be avoided by making sure we have got enough headcount before actually consuming a skb from the batched rx array while transmitting, which is simply done by deferring it a moment later in this patch. Signed-off-by: Wei Xu --- drivers/vhost/net.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 8d626d7..e76535e 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -778,8 +778,6 @@ static void handle_rx(struct vhost_net *net) /* On error, stop handling until the next kick. */ if (unlikely(headcount < 0)) goto out; - if (nvq->rx_array) - msg.msg_control = vhost_net_buf_consume(&nvq->rxq); /* On overrun, truncate and discard */ if (unlikely(headcount > UIO_MAXIOV)) { iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1); @@ -809,6 +807,8 @@ static void handle_rx(struct vhost_net *net) */ iov_iter_advance(&msg.msg_iter, vhost_hlen); } + if (nvq->rx_array) + msg.msg_control = vhost_net_buf_consume(&nvq->rxq); err = sock->ops->recvmsg(sock, &msg, sock_len, MSG_DONTWAIT | MSG_TRUNC); /* Userspace might have consumed the packet meanwhile: -- 1.8.3.1
[PATCH net,stable v2] vhost: fix skb leak in handle_rx()
From: Wei Xu. Matthew found a roughly 40% tcp throughput regression with commit c67df11f (vhost_net: try batch dequing from skb array) as discussed in the following thread: https://www.mail-archive.com/netdev@vger.kernel.org/msg187936.html Eventually we figured out that it was a skb leak in handle_rx() when sending packets to the VM. This usually happens when a guest can not drain out the vq as fast as vhost fills it in; that sets off the traffic jam and leaks skb(s), which occurs when there is no headcount to send on the vq from the vhost side. This can be avoided by making sure we have got enough headcount before actually consuming a skb from the batched rx array while transmitting, which is simply done by moving the zero-headcount check a bit ahead. Also close the small leak window in case recvmsg() fails by freeing the skb. Signed-off-by: Wei Xu Reported-by: Matthew Rosato --- drivers/vhost/net.c | 23 +-- 1 file changed, 13 insertions(+), 10 deletions(-) v2: - add Matthew as the reporter, thanks matthew. - moving zero headcount check ahead instead of defer consuming skb due to jason and mst's comment. - add freeing skb in favor of recvmsg() fails. diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 8d626d7..e302e08 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -778,16 +778,6 @@ static void handle_rx(struct vhost_net *net) /* On error, stop handling until the next kick. */ if (unlikely(headcount < 0)) goto out; - if (nvq->rx_array) - msg.msg_control = vhost_net_buf_consume(&nvq->rxq); - /* On overrun, truncate and discard */ - if (unlikely(headcount > UIO_MAXIOV)) { - iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1); - err = sock->ops->recvmsg(sock, &msg, - 1, MSG_DONTWAIT | MSG_TRUNC); - pr_debug("Discarded rx packet: len %zd\n", sock_len); - continue; - } /* OK, now we need to know about added descriptors. 
*/ if (!headcount) { if (unlikely(vhost_enable_notify(&net->dev, vq))) { @@ -800,6 +790,18 @@ static void handle_rx(struct vhost_net *net) * they refilled. */ goto out; } + if (nvq->rx_array) + msg.msg_control = vhost_net_buf_consume(&nvq->rxq); + /* On overrun, truncate and discard */ + if (unlikely(headcount > UIO_MAXIOV)) { + iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1); + err = sock->ops->recvmsg(sock, &msg, + 1, MSG_DONTWAIT | MSG_TRUNC); + if (unlikely(err != 1)) + kfree_skb((struct sk_buff *)msg.msg_control); + pr_debug("Discarded rx packet: len %zd\n", sock_len); + continue; + } /* We don't need to be notified again. */ iov_iter_init(&msg.msg_iter, READ, vq->iov, in, vhost_len); fixup = msg.msg_iter; @@ -818,6 +820,7 @@ static void handle_rx(struct vhost_net *net) pr_debug("Discarded rx packet: " " len %d, expected %zd\n", err, sock_len); vhost_discard_vq_desc(vq, headcount); + kfree_skb((struct sk_buff *)msg.msg_control); continue; } /* Supply virtio_net_hdr if VHOST_NET_F_VIRTIO_NET_HDR */ -- 1.8.3.1
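The ordering discipline behind this fix generalizes: pass every capacity check before taking ownership of a buffer, and free the buffer on any error path reached after ownership. A simplified userspace sketch of that pattern; `queue_pop` and `consume_one` are hypothetical names standing in for the vhost batch-consume logic, not the real API:

```c
#include <stdlib.h>

struct pkt { int len; };

/* A toy batched queue, in the spirit of vhost's rx_array. */
struct queue { struct pkt *slots[16]; int head, tail; };

static struct pkt *queue_pop(struct queue *q)
{
    if (q->head == q->tail)
        return NULL;
    return q->slots[q->head++ % 16];
}

/* headcount == 0 models "no descriptors available": bail out BEFORE
 * popping, otherwise the popped packet would have no owner and leak. */
static int consume_one(struct queue *q, int headcount)
{
    struct pkt *p;

    if (headcount == 0)        /* capacity check first ... */
        return -1;
    p = queue_pop(q);          /* ... then take ownership */
    if (!p)
        return -1;
    if (p->len < 0) {          /* error after ownership: must free */
        free(p);
        return -1;
    }
    free(p);                   /* normal completion */
    return 0;
}
```

The v1 patch moved the pop after the checks; v2 instead moved the zero-headcount check ahead of the pop, which is the same invariant seen from the other side.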
Re: [RFC] virtio-net: help live migrate SR-IOV devices
On Thu, 30 Nov 2017 11:29:56 +0800, Jason Wang wrote: > On 2017年11月29日 03:27, Jesse Brandeburg wrote: > > Hi, I'd like to get some feedback on a proposal to enhance virtio-net > > to ease configuration of a VM and that would enable live migration of > > passthrough network SR-IOV devices. > > > > Today we have SR-IOV network devices (VFs) that can be passed into a VM > > in order to enable high performance networking direct within the VM. > > The problem I am trying to address is that this configuration is > > generally difficult to live-migrate. There is documentation [1] > > indicating that some OS/Hypervisor vendors will support live migration > > of a system with a direct assigned networking device. The problem I > > see with these implementations is that the network configuration > > requirements that are passed on to the owner of the VM are quite > > complicated. You have to set up bonding, you have to configure it to > > enslave two interfaces, those interfaces (one is virtio-net, the other > > is SR-IOV device/driver like ixgbevf) must support MAC address changes > > requested in the VM, and on and on... > > > > So, on to the proposal: > > Modify virtio-net driver to be a single VM network device that > > enslaves an SR-IOV network device (inside the VM) with the same MAC > > address. This would cause the virtio-net driver to appear and work like > > a simplified bonding/team driver. The live migration problem would be > > solved just like today's bonding solution, but the VM user's networking > > config would be greatly simplified. > > > > At it's simplest, it would appear something like this in the VM. > > > > == > > = vnet0 = > > = > > (virtio- = | > > net)= | > > = == > > = = ixgbef = > > == == > > > > (forgive the ASCII art) > > > > The fast path traffic would prefer the ixgbevf or other SR-IOV device > > path, and fall back to virtio's transmit/receive when migrating. 
> > > > Compared to today's options this proposal would > > 1) make virtio-net more sticky, allow fast path traffic at SR-IOV > > speeds > > 2) simplify end user configuration in the VM (most if not all of the > > set up to enable migration would be done in the hypervisor) > > 3) allow live migration via a simple link down and maybe a PCI > > hot-unplug of the SR-IOV device, with failover to the virtio-net > > driver core > > 4) allow vendor agnostic hardware acceleration, and live migration > > between vendors if the VM os has driver support for all the required > > SR-IOV devices. > > > > Runtime operation proposed: > > - virtio-net driver loads, SR-IOV driver loads > > - virtio-net finds other NICs that match it's MAC address by > >both examining existing interfaces, and sets up a new device notifier > > - virtio-net enslaves the first NIC with the same MAC address > > - virtio-net brings up the slave, and makes it the "preferred" path > > - virtio-net follows the behavior of an active backup bond/team > > - virtio-net acts as the interface to the VM > > - live migration initiates > > - link goes down on SR-IOV, or SR-IOV device is removed > > - failover to virtio-net as primary path > > - migration continues to new host > > - new host is started with virio-net as primary > > - if no SR-IOV, virtio-net stays primary > > - hypervisor can hot-add SR-IOV NIC, with same MAC addr as virtio > > - virtio-net notices new NIC and starts over at enslave step above > > > > Future ideas (brainstorming): > > - Optimize Fast east-west by having special rules to direct east-west > >traffic through virtio-net traffic path > > > > Thanks for reading! > > Jesse > > Cc netdev. 
> > Interesting, and this method is actually used by netvsc now: > > commit 0c195567a8f6e82ea5535cd9f1d54a1626dd233e > Author: stephen hemminger> Date: Tue Aug 1 19:58:53 2017 -0700 > > netvsc: transparent VF management > > This patch implements transparent fail over from synthetic NIC to > SR-IOV virtual function NIC in Hyper-V environment. It is a better > alternative to using bonding as is done now. Instead, the receive and > transmit fail over is done internally inside the driver. > > Using bonding driver has lots of issues because it depends on the > script being run early enough in the boot process and with sufficient > information to make the association. This patch moves all that > functionality into the kernel. > > Signed-off-by: Stephen Hemminger > Signed-off-by: David S. Miller > > If my understanding is correct there's no need to for any extension of > virtio spec. If this is true, maybe you can start to prepare the patch? IMHO this is as close to policy in the kernel as one can get. User land has all the information it needs to instantiate that bond/team automatically. In fact I'm
[PATCH] virtio: release virtio index when fail to device_register
Release the index on failure so it can be reused by another virtio device. Signed-off-by: weiping zhang --- drivers/virtio/virtio.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c index 48230a5..bf7ff39 100644 --- a/drivers/virtio/virtio.c +++ b/drivers/virtio/virtio.c @@ -333,6 +333,8 @@ int register_virtio_device(struct virtio_device *dev) /* device_register() causes the bus infrastructure to look for a * matching driver. */ err = device_register(&dev->dev); + if (err) + ida_simple_remove(&virtio_index_ida, dev->index); out: if (err) virtio_add_status(dev, VIRTIO_CONFIG_S_FAILED); -- 2.9.4
Re: [PATCH 00/13] remove_conflicting_framebuffers() cleanup
On Mon, Nov 27, 2017 at 11:30:44AM +0100, Daniel Vetter wrote: > On Fri, Nov 24, 2017 at 06:53:25PM +0100, Michał Mirosław wrote: > > This series cleans up duplicated code for replacing firmware FB > > driver with proper DRI driver and adds handover support to > > Tegra driver. > > > > The last patch is here because it uses new semantics of > > remove_conflicting_framebuffers() from this series. This > > can be considered independently, though. > > Except for that patches I've commented on: > > Acked-by: Daniel Vetter > > Since this is for tegra and Thierry has drm-misc commit rights, it's > probably simplest when Thierry pushes this all to drm-misc once driver > maintainers had a chance to look at it. Will do. Thierry
[PATCH tip/core/rcu 21/21] drivers/vhost: Remove now-redundant read_barrier_depends()
Because READ_ONCE() now implies read_barrier_depends(), the read_barrier_depends() in next_desc() is now redundant. This commit therefore removes it and the related comments. Signed-off-by: Paul E. McKenney Cc: "Michael S. Tsirkin" Cc: Jason Wang Cc: Cc: Cc: --- drivers/vhost/vhost.c | 7 +-- 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 33ac2b186b85..78b5940a415a 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -1877,12 +1877,7 @@ static unsigned next_desc(struct vhost_virtqueue *vq, struct vring_desc *desc) return -1U; /* Check they're not leading us off end of descriptors. */ - next = vhost16_to_cpu(vq, desc->next); - /* Make sure compiler knows to grab that: we don't want it changing! */ - /* We will use the result as an index in an array, so most -* architectures only need a compiler barrier here. */ - read_barrier_depends(); - + next = vhost16_to_cpu(vq, READ_ONCE(desc->next)); return next; } -- 2.5.2
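For readers outside the kernel tree, READ_ONCE() is essentially a volatile access that stops the compiler from refetching or tearing the load; after this series it also carries the Alpha-only dependency barrier. A rough userspace approximation (illustrative only; the real kernel macro additionally handles non-scalar sizes, and the struct below is a stand-in, not the real vring_desc):

```c
/* Userspace approximation of the kernel's READ_ONCE() for scalar
 * types: force a single, untorn load through a volatile pointer. */
#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

struct vring_desc_demo { unsigned short flags; unsigned short next; };

/* One stable load of desc->next, mirroring the patched next_desc(). */
static unsigned short next_desc_demo(const struct vring_desc_demo *desc)
{
    return READ_ONCE(desc->next);
}
```

The point of the patch is that this single annotated load already orders the dependent array index on every architecture, so the separate read_barrier_depends() call adds nothing.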
[PATCH] virtio_mmio: add cleanup for virtio_mmio_probe
Clean up all resources allocated by virtio_mmio_probe. Signed-off-by: weiping zhang --- drivers/virtio/virtio_mmio.c | 34 ++ 1 file changed, 26 insertions(+), 8 deletions(-) diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c index 74dc717..3fd0e66 100644 --- a/drivers/virtio/virtio_mmio.c +++ b/drivers/virtio/virtio_mmio.c @@ -513,8 +513,10 @@ static int virtio_mmio_probe(struct platform_device *pdev) return -EBUSY; vm_dev = devm_kzalloc(&pdev->dev, sizeof(*vm_dev), GFP_KERNEL); - if (!vm_dev) - return -ENOMEM; + if (!vm_dev) { + rc = -ENOMEM; + goto free_mem; + } vm_dev->vdev.dev.parent = &pdev->dev; vm_dev->vdev.dev.release = virtio_mmio_release_dev_empty; @@ -524,14 +526,17 @@ static int virtio_mmio_probe(struct platform_device *pdev) spin_lock_init(&vm_dev->lock); vm_dev->base = devm_ioremap(&pdev->dev, mem->start, resource_size(mem)); - if (vm_dev->base == NULL) - return -EFAULT; + if (vm_dev->base == NULL) { + rc = -EFAULT; + goto free_vmdev; + } /* Check magic value */ magic = readl(vm_dev->base + VIRTIO_MMIO_MAGIC_VALUE); if (magic != ('v' | 'i' << 8 | 'r' << 16 | 't' << 24)) { dev_warn(&pdev->dev, "Wrong magic value 0x%08lx!\n", magic); - return -ENODEV; + rc = -ENODEV; + goto unmap; } /* Check device version */ @@ -539,7 +544,8 @@ static int virtio_mmio_probe(struct platform_device *pdev) if (vm_dev->version < 1 || vm_dev->version > 2) { dev_err(&pdev->dev, "Version %ld not supported!\n", vm_dev->version); - return -ENXIO; + rc = -ENXIO; + goto unmap; } vm_dev->vdev.id.device = readl(vm_dev->base + VIRTIO_MMIO_DEVICE_ID); @@ -548,7 +554,8 @@ static int virtio_mmio_probe(struct platform_device *pdev) * virtio-mmio device with an ID 0 is a (dummy) placeholder * with no function. End probing now with no error reported. 
*/ - return -ENODEV; + rc = -ENODEV; + goto unmap; } vm_dev->vdev.id.vendor = readl(vm_dev->base + VIRTIO_MMIO_VENDOR_ID); @@ -573,7 +580,18 @@ static int virtio_mmio_probe(struct platform_device *pdev) platform_set_drvdata(pdev, vm_dev); - return register_virtio_device(&vm_dev->vdev); + rc = register_virtio_device(&vm_dev->vdev); + if (rc) + goto unmap; + return 0; +unmap: + iounmap(vm_dev->base); +free_mem: + devm_release_mem_region(&pdev->dev, mem->start, + resource_size(mem)); +free_vmdev: + devm_kfree(&pdev->dev, vm_dev); + return rc; } static int virtio_mmio_remove(struct platform_device *pdev) -- 2.9.4
Re: [PATCH] virtio: release virtio index when fail to device_register
2017-12-02 0:55 GMT+08:00 weiping zhang: > On Wed, Nov 29, 2017 at 10:50:44AM +0100, Cornelia Huck wrote: >> On Wed, 29 Nov 2017 09:23:01 +0800 >> weiping zhang wrote: >> >> > index can be reused by other virtio device. >> > >> > Signed-off-by: weiping zhang >> >> Reviewed-by: Cornelia Huck >> >> > --- >> > drivers/virtio/virtio.c | 2 ++ >> > 1 file changed, 2 insertions(+) >> > >> > diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c >> > index 48230a5..bf7ff39 100644 >> > --- a/drivers/virtio/virtio.c >> > +++ b/drivers/virtio/virtio.c >> > @@ -333,6 +333,8 @@ int register_virtio_device(struct virtio_device *dev) >> > /* device_register() causes the bus infrastructure to look for a >> > * matching driver. */ >> > err = device_register(&dev->dev); >> > + if (err) >> > + ida_simple_remove(&virtio_index_ida, dev->index); >> > out: >> > if (err) >> > virtio_add_status(dev, VIRTIO_CONFIG_S_FAILED); >> >> I think your patch is correct (giving up the index if we failed to >> register), but this made me look at our error handling if a virtio >> device failed to register. >> >> We hold an extra reference to the struct device, even after a failed >> register, and it is the responsibility of the caller to give up that >> reference once no longer needed. As callers to register_virtio_device() >> embed the struct virtio_device, it needs to be their responsibility. >> Looking at the existing callers, >> >> - ccw does a put_device >> - pci, mmio and remoteproc do nothing, causing a leak >> - vop does a free on the embedding structure, which is a big no-no >> >> Thoughts? > Sorry to reply late and thanks for your review. > Do you mean the "extra reference to the struct device" caused by the > following code? > > err = device_register(&dev->dev); > device_add(dev) > get_device(dev) > If I understand right, I think there is no extra reference if we fail > virtio_register_device, because if device_register() fails we don't take a > reference. 
I went through these callers; virtio_mmio_probe should do more cleanup. I'll send a patch to fix that. > -- > Thanks > weiping
Re: [PATCH v18 05/10] xbitmap: add more operations
On Fri, Dec 01, 2017 at 03:09:08PM +, Wang, Wei W wrote: > On Friday, December 1, 2017 9:02 PM, Tetsuo Handa wrote: > > If start == end is legal, > > > >for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) { > > > > makes this loop do nothing because 10 < 10 is false. > > How about "start <= end "? Don't ask Tetsuo for his opinion, write some userspace code that uses it.
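Matthew's suggestion is easy to act on: a few lines of userspace code show the boundary behavior of the loop under discussion. A sketch with an illustrative IDA_BITMAP_BITS value (the real constant depends on the kernel configuration):

```c
/* Illustrative power-of-two chunk size; not the kernel's actual value. */
#define IDA_BITMAP_BITS 1024UL

/* Count how many IDA-bitmap-sized chunks the loop in question visits
 * for the half-open-looking range [start, end). */
static int chunks_visited(unsigned long start, unsigned long end)
{
    int n = 0;

    for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1)
        n++;
    return n;
}
```

With `start < end`, the degenerate call `chunks_visited(10, 10)` never enters the body, which is exactly Tetsuo's point: if `start == end` is meant to denote a one-bit range, the condition must be `start <= end`; if the range is half-open, the existing condition is already right.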
Re: [PATCH v18 10/10] virtio-balloon: don't report free pages when page poisoning is enabled
On Wed, Nov 29, 2017 at 09:55:26PM +0800, Wei Wang wrote: > The guest free pages should not be discarded by the live migration thread > when page poisoning is enabled with PAGE_POISONING_NO_SANITY=n, because > skipping the transfer of such poisoned free pages will trigger false > positive when new pages are allocated and checked on the destination. > This patch skips the reporting of free pages in the above case. > > Reported-by: Michael S. Tsirkin > Signed-off-by: Wei Wang > Cc: Michal Hocko > --- > drivers/virtio/virtio_balloon.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c > index 035bd3a..6ac4cff 100644 > --- a/drivers/virtio/virtio_balloon.c > +++ b/drivers/virtio/virtio_balloon.c > @@ -652,7 +652,9 @@ static void report_free_page(struct work_struct *work) > /* Start by sending the obtained cmd id to the host with an outbuf */ > send_one_desc(vb, vb->free_page_vq, virt_to_phys(&vb->start_cmd_id), > sizeof(uint32_t), false, true, false); > - walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages); > + if (!(page_poisoning_enabled() && > + !IS_ENABLED(CONFIG_PAGE_POISONING_NO_SANITY))) > + walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages); > /* >* End by sending the stop id to the host with an outbuf. Use the >* non-batching mode here to trigger a kick after adding the stop id. PAGE_POISONING_ZERO is actually OK. But I really would prefer it that we still send pages to host, otherwise debugging becomes much harder. And it does not have to be completely useless, even though you can not discard them as they would be zero-filled then. How about a config field telling host what should be there in the free pages? This way even though host can not discard them, host can send them out without reading them, still a win. > -- > 2.7.4
Re: [PATCH v18 07/10] virtio-balloon: VIRTIO_BALLOON_F_SG
On Wed, Nov 29, 2017 at 09:55:23PM +0800, Wei Wang wrote: > Add a new feature, VIRTIO_BALLOON_F_SG, which enables the transfer of > balloon (i.e. inflated/deflated) pages using scatter-gather lists to the > host. A scatter-gather list is described by a vring desc. > > The implementation of the previous virtio-balloon is not very efficient, > because the balloon pages are transferred to the host by one array each > time. Here is the breakdown of the time in percentage spent on each step > of the balloon inflating process (inflating 7GB of an 8GB idle guest). > > 1) allocating pages (6.5%) > 2) sending PFNs to host (68.3%) > 3) address translation (6.1%) > 4) madvise (19%) > > It takes about 4126ms for the inflating process to complete. The above > profiling shows that the bottlenecks are stage 2) and stage 4). > > This patch optimizes step 2) by transferring pages to host in sgs. An sg > describes a chunk of guest physically continuous pages. With this > mechanism, step 4) can also be optimized by doing address translation and > madvise() in chunks rather than page by page. > > With this new feature, the above ballooning process takes ~440ms resulting > in an improvement of ~89%. > > TODO: optimize stage 1) by allocating/freeing a chunk of pages instead of > a single page each time. > > Signed-off-by: Wei Wang> Signed-off-by: Liang Li > Suggested-by: Michael S. Tsirkin > Cc: Tetsuo Handa > --- > drivers/virtio/virtio_balloon.c | 230 > +--- > include/uapi/linux/virtio_balloon.h | 1 + > 2 files changed, 212 insertions(+), 19 deletions(-) > > diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c > index 7960746..2c21c5a 100644 > --- a/drivers/virtio/virtio_balloon.c > +++ b/drivers/virtio/virtio_balloon.c > @@ -32,6 +32,8 @@ > #include > #include > #include > +#include > +#include > > /* > * Balloon device works in 4K page units. 
So each page is pointed to by > @@ -79,6 +81,9 @@ struct virtio_balloon { > /* Synchronize access/update to this struct virtio_balloon elements */ > struct mutex balloon_lock; > > + /* The xbitmap used to record balloon pages */ > + struct xb page_xb; > + > /* The array of pfns we tell the Host about. */ > unsigned int num_pfns; > __virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX]; > @@ -141,15 +146,121 @@ static void set_page_pfns(struct virtio_balloon *vb, > page_to_balloon_pfn(page) + i); > } > > +static void kick_and_wait(struct virtqueue *vq, wait_queue_head_t wq_head) > +{ > + unsigned int len; > + > + virtqueue_kick(vq); > + wait_event(wq_head, virtqueue_get_buf(vq, )); > +} > + > +static void send_one_desc(struct virtio_balloon *vb, > + struct virtqueue *vq, > + uint64_t addr, > + uint32_t len, > + bool inbuf, > + bool batch) > +{ > + int err; > + unsigned int size; > + > + /* Detach all the used buffers from the vq */ > + while (virtqueue_get_buf(vq, )) > + ; > + > + err = virtqueue_add_one_desc(vq, addr, len, inbuf, vq); > + /* > + * This is expected to never fail: there is always at least 1 entry > + * available on the vq, because when the vq is full the worker thread > + * that adds the desc will be put into sleep until at least 1 entry is > + * available to use. > + */ > + BUG_ON(err); > + > + /* If batching is requested, we batch till the vq is full */ > + if (!batch || !vq->num_free) > + kick_and_wait(vq, vb->acked); > +} > + This internal kick complicates callers. I suggest that instead, you move this to callers, just return a "kick required" boolean. This way callers do not need to play with num_free at all. > +/* > + * Send balloon pages in sgs to host. The balloon pages are recorded in the > + * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE. > + * The page xbitmap is searched for continuous "1" bits, which correspond > + * to continuous pages, to chunk into sgs. 
> + * > + * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that > + * need to be searched. > + */ > +static void tell_host_sgs(struct virtio_balloon *vb, > + struct virtqueue *vq, > + unsigned long page_xb_start, > + unsigned long page_xb_end) > +{ > + unsigned long pfn_start, pfn_end; > + uint64_t addr; > + uint32_t len, max_len = round_down(UINT_MAX, PAGE_SIZE); > + > + pfn_start = page_xb_start; > + while (pfn_start < page_xb_end) { > + pfn_start = xb_find_next_set_bit(>page_xb, pfn_start, > + page_xb_end); > + if (pfn_start ==
RE: [PATCH v18 05/10] xbitmap: add more operations
On Friday, December 1, 2017 9:02 PM, Tetsuo Handa wrote: > Wei Wang wrote: > > On 11/30/2017 06:34 PM, Tetsuo Handa wrote: > > > Wei Wang wrote: > > >> + * @start: the start of the bit range, inclusive > > >> + * @end: the end of the bit range, inclusive > > >> + * > > >> + * This function is used to clear a bit in the xbitmap. If all the > > >> +bits of the > > >> + * bitmap are 0, the bitmap will be freed. > > >> + */ > > >> +void xb_clear_bit_range(struct xb *xb, unsigned long start, > > >> +unsigned long end) { > > >> +struct radix_tree_root *root = &xb->xbrt; > > >> +struct radix_tree_node *node; > > >> +void **slot; > > >> +struct ida_bitmap *bitmap; > > >> +unsigned int nbits; > > >> + > > >> +for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + > > >> 1) { > > >> +unsigned long index = start / IDA_BITMAP_BITS; > > >> +unsigned long bit = start % IDA_BITMAP_BITS; > > >> + > > >> +bitmap = __radix_tree_lookup(root, index, &node, &slot); > > >> +if (radix_tree_exception(bitmap)) { > > >> +unsigned long ebit = bit + 2; > > >> +unsigned long tmp = (unsigned long)bitmap; > > >> + > > >> +nbits = min(end - start + 1, BITS_PER_LONG - > > >> ebit); > > > "nbits = min(end - start + 1," seems to expect that start == end is > > > legal for clearing only 1 bit. But this function is a no-op if start == end. > > > Please clarify what "inclusive" intended. > > > > If xb_clear_bit_range(xb,10,10), then it is effectively the same as > > xb_clear_bit(10). Why would it be illegal? > > > > "@start inclusive" means that the @start will also be included to be > > cleared. > > If start == end is legal, > >for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) { > > makes this loop do nothing because 10 < 10 is false. How about "start <= end"? Best, Wei
Re: [PATCH net,stable v4 0/3] vhost: fix a few skb leaks
On Fri, Dec 01, 2017 at 09:54:02AM -0500, Matthew Rosato wrote: > On 12/01/2017 09:47 AM, Michael S. Tsirkin wrote: > > On Fri, Dec 01, 2017 at 05:10:35AM -0500, w...@redhat.com wrote: > >> From: Wei Xu> >> > >> Matthew found a roughly 40% tcp throughput regression with commit > >> c67df11f(vhost_net: try batch dequing from skb array) as discussed > >> in the following thread: > >> https://www.mail-archive.com/netdev@vger.kernel.org/msg187936.html > > > > Series > > > > Acked-by: Michael S. Tsirkin > > Tested this series on a z13 (s390x) on top of net-next using 4GB/4vcpu > guests. Verified that both the reported TCP throughput regression and > memory leak are resolved. > > net-next: 19.83 Gb/s > net-next+: 35.02 Gb/s > > Thanks all! > > Matt So we can also add Tested-by: Matthew Rosato Dave, pls take this through the net tree as most changes are in tun/tap. Thanks! > > > > > >> v4: > >> - fix zero iov iterator count in tap/tap_do_read()(Jason) > >> - don't put tun in case of EBADFD(Jason) > >> - Replace msg->msg_control with new 'skb' when calling tun/tap_do_read() > >> > >> v3: > >> - move freeing skb from vhost to tun/tap recvmsg() to not > >> confuse the callers. > >> > >> v2: > >> - add Matthew as the reporter, thanks matthew. > >> - moving zero headcount check ahead instead of defer consuming skb > >> due to jason and mst's comment. > >> - add freeing skb in favor of recvmsg() fails. > >> > >> Wei Xu (3): > >> vhost: fix skb leak in handle_rx() > >> tun: free skb in early errors > >> tap: free skb if flags error > >> > >> drivers/net/tap.c | 14 ++ > >> drivers/net/tun.c | 24 ++-- > >> drivers/vhost/net.c | 20 ++-- > >> 3 files changed, 38 insertions(+), 20 deletions(-) > >> > >> -- > >> 1.8.3.1 > >
Re: [PATCH 3/3] tap: free skb if flags error
On Fri, Dec 01, 2017 at 05:10:38AM -0500, w...@redhat.com wrote: > From: Wei Xu> > tap_recvmsg() supports accepting skb by msg_control after > commit 3b4ba04acca8 ("tap: support receiving skb from msg_control"), > the skb if presented should be freed within the function, otherwise > it would be leaked. > > Signed-off-by: Wei Xu > Reported-by: Matthew Rosato Acked-by: Michael S. Tsirkin > --- > drivers/net/tap.c | 14 ++ > 1 file changed, 10 insertions(+), 4 deletions(-) > > diff --git a/drivers/net/tap.c b/drivers/net/tap.c > index e9489b8..0a886fda 100644 > --- a/drivers/net/tap.c > +++ b/drivers/net/tap.c > @@ -829,8 +829,11 @@ static ssize_t tap_do_read(struct tap_queue *q, > DEFINE_WAIT(wait); > ssize_t ret = 0; > > - if (!iov_iter_count(to)) > + if (!iov_iter_count(to)) { > + if (skb) > + kfree_skb(skb); > return 0; > + } > > if (skb) > goto put; > @@ -1154,11 +1157,14 @@ static int tap_recvmsg(struct socket *sock, struct > msghdr *m, > size_t total_len, int flags) > { > struct tap_queue *q = container_of(sock, struct tap_queue, sock); > + struct sk_buff *skb = m->msg_control; > int ret; > - if (flags & ~(MSG_DONTWAIT|MSG_TRUNC)) > + if (flags & ~(MSG_DONTWAIT|MSG_TRUNC)) { > + if (skb) > + kfree_skb(skb); > return -EINVAL; > - ret = tap_do_read(q, &m->msg_iter, flags & MSG_DONTWAIT, > - m->msg_control); > + } > + ret = tap_do_read(q, &m->msg_iter, flags & MSG_DONTWAIT, skb); > if (ret > total_len) { > m->msg_flags |= MSG_TRUNC; > ret = flags & MSG_TRUNC ? ret : total_len; > -- > 1.8.3.1
Re: [PATCH net,stable v4 0/3] vhost: fix a few skb leaks
On Fri, Dec 01, 2017 at 05:10:35AM -0500, w...@redhat.com wrote: > From: Wei Xu> > Matthew found a roughly 40% tcp throughput regression with commit > c67df11f(vhost_net: try batch dequing from skb array) as discussed > in the following thread: > https://www.mail-archive.com/netdev@vger.kernel.org/msg187936.html Series Acked-by: Michael S. Tsirkin > v4: > - fix zero iov iterator count in tap/tap_do_read()(Jason) > - don't put tun in case of EBADFD(Jason) > - Replace msg->msg_control with new 'skb' when calling tun/tap_do_read() > > v3: > - move freeing skb from vhost to tun/tap recvmsg() to not > confuse the callers. > > v2: > - add Matthew as the reporter, thanks matthew. > - moving zero headcount check ahead instead of defer consuming skb > due to jason and mst's comment. > - add freeing skb in favor of recvmsg() fails. > > Wei Xu (3): > vhost: fix skb leak in handle_rx() > tun: free skb in early errors > tap: free skb if flags error > > drivers/net/tap.c | 14 ++ > drivers/net/tun.c | 24 ++-- > drivers/vhost/net.c | 20 ++-- > 3 files changed, 38 insertions(+), 20 deletions(-) > > -- > 1.8.3.1
Re: [PATCH 2/3] tun: free skb in early errors
On Fri, Dec 01, 2017 at 05:10:37AM -0500, w...@redhat.com wrote: > From: Wei Xu> > tun_recvmsg() supports accepting skb by msg_control after > commit ac77cfd4258f ("tun: support receiving skb through msg_control"), > the skb if presented should be freed no matter how far it can go > along, otherwise it would be leaked. > > This patch fixes several missed cases. > > Signed-off-by: Wei Xu > Reported-by: Matthew Rosato Acked-by: Michael S. Tsirkin > --- > drivers/net/tun.c | 24 ++-- > 1 file changed, 18 insertions(+), 6 deletions(-) > > diff --git a/drivers/net/tun.c b/drivers/net/tun.c > index 9574900..4f4a842 100644 > --- a/drivers/net/tun.c > +++ b/drivers/net/tun.c > @@ -1952,8 +1952,11 @@ static ssize_t tun_do_read(struct tun_struct *tun, > struct tun_file *tfile, > > tun_debug(KERN_INFO, tun, "tun_do_read\n"); > > - if (!iov_iter_count(to)) > + if (!iov_iter_count(to)) { > + if (skb) > + kfree_skb(skb); > return 0; > + } > > if (!skb) { > /* Read frames from ring */ > @@ -2069,22 +2072,24 @@ static int tun_recvmsg(struct socket *sock, struct > msghdr *m, size_t total_len, > { > struct tun_file *tfile = container_of(sock, struct tun_file, socket); > struct tun_struct *tun = tun_get(tfile); > + struct sk_buff *skb = m->msg_control; > int ret; > > - if (!tun) > - return -EBADFD; > + if (!tun) { > + ret = -EBADFD; > + goto out_free_skb; > + } > > if (flags & ~(MSG_DONTWAIT|MSG_TRUNC|MSG_ERRQUEUE)) { > ret = -EINVAL; > - goto out; > + goto out_put_tun; > } > if (flags & MSG_ERRQUEUE) { > ret = sock_recv_errqueue(sock->sk, m, total_len, >SOL_PACKET, TUN_TX_TIMESTAMP); > goto out; > } > - ret = tun_do_read(tun, tfile, >msg_iter, flags & MSG_DONTWAIT, > - m->msg_control); > + ret = tun_do_read(tun, tfile, >msg_iter, flags & MSG_DONTWAIT, skb); > if (ret > (ssize_t)total_len) { > m->msg_flags |= MSG_TRUNC; > ret = flags & MSG_TRUNC ? 
ret : total_len; > @@ -2092,6 +2097,13 @@ static int tun_recvmsg(struct socket *sock, struct > msghdr *m, size_t total_len, > out: > tun_put(tun); > return ret; > + > +out_put_tun: > + tun_put(tun); > +out_free_skb: > + if (skb) > + kfree_skb(skb); > + return ret; > } > > static int tun_peek_len(struct socket *sock) > -- > 1.8.3.1
Re: [PATCH 1/3] vhost: fix skb leak in handle_rx()
On Fri, Dec 01, 2017 at 05:10:36AM -0500, w...@redhat.com wrote: > From: Wei Xu> > Matthew found a roughly 40% tcp throughput regression with commit > c67df11f(vhost_net: try batch dequing from skb array) as discussed > in the following thread: > https://www.mail-archive.com/netdev@vger.kernel.org/msg187936.html > > Eventually we figured out that it was a skb leak in handle_rx() > when sending packets to the VM. This usually happens when a guest > can not drain out vq as fast as vhost fills in, afterwards it sets > off the traffic jam and leaks skb(s) which occurs as no headcount > to send on the vq from vhost side. > > This can be avoided by making sure we have got enough headcount > before actually consuming a skb from the batched rx array while > transmitting, which is simply done by moving checking the zero > headcount a bit ahead. > > Signed-off-by: Wei Xu > Reported-by: Matthew Rosato Acked-by: Michael S. Tsirkin > --- > drivers/vhost/net.c | 20 ++-- > 1 file changed, 10 insertions(+), 10 deletions(-) > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c > index 8d626d7..c7bdeb6 100644 > --- a/drivers/vhost/net.c > +++ b/drivers/vhost/net.c > @@ -778,16 +778,6 @@ static void handle_rx(struct vhost_net *net) > /* On error, stop handling until the next kick. */ > if (unlikely(headcount < 0)) > goto out; > - if (nvq->rx_array) > - msg.msg_control = vhost_net_buf_consume(>rxq); > - /* On overrun, truncate and discard */ > - if (unlikely(headcount > UIO_MAXIOV)) { > - iov_iter_init(_iter, READ, vq->iov, 1, 1); > - err = sock->ops->recvmsg(sock, , > - 1, MSG_DONTWAIT | MSG_TRUNC); > - pr_debug("Discarded rx packet: len %zd\n", sock_len); > - continue; > - } > /* OK, now we need to know about added descriptors. */ > if (!headcount) { > if (unlikely(vhost_enable_notify(>dev, vq))) { > @@ -800,6 +790,16 @@ static void handle_rx(struct vhost_net *net) >* they refilled. 
*/ > goto out; > } > + if (nvq->rx_array) > + msg.msg_control = vhost_net_buf_consume(&nvq->rxq); > + /* On overrun, truncate and discard */ > + if (unlikely(headcount > UIO_MAXIOV)) { > + iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1); > + err = sock->ops->recvmsg(sock, &msg, > + 1, MSG_DONTWAIT | MSG_TRUNC); > + pr_debug("Discarded rx packet: len %zd\n", sock_len); > + continue; > + } > /* We don't need to be notified again. */ > iov_iter_init(&msg.msg_iter, READ, vq->iov, in, vhost_len); > fixup = msg.msg_iter; > -- > 1.8.3.1
Re: Memory corruption in powerpc guests with virtio_balloon (was Re: [PATCH v3] virtio_balloon: fix deadlock on OOM)
On Fri, Dec 01, 2017 at 11:31:08PM +1100, Michael Ellerman wrote: > "Michael S. Tsirkin"writes: > > > fill_balloon doing memory allocations under balloon_lock > > can cause a deadlock when leak_balloon is called from > > virtballoon_oom_notify and tries to take same lock. > > > > To fix, split page allocation and enqueue and do allocations outside the > > lock. > > > > Here's a detailed analysis of the deadlock by Tetsuo Handa: > > > > In leak_balloon(), mutex_lock(>balloon_lock) is called in order to > > serialize against fill_balloon(). But in fill_balloon(), > > alloc_page(GFP_HIGHUSER[_MOVABLE] | __GFP_NOMEMALLOC | __GFP_NORETRY) is > > called with vb->balloon_lock mutex held. Since GFP_HIGHUSER[_MOVABLE] > > implies __GFP_DIRECT_RECLAIM | __GFP_IO | __GFP_FS, despite __GFP_NORETRY > > is specified, this allocation attempt might indirectly depend on somebody > > else's __GFP_DIRECT_RECLAIM memory allocation. And such indirect > > __GFP_DIRECT_RECLAIM memory allocation might call leak_balloon() via > > virtballoon_oom_notify() via blocking_notifier_call_chain() callback via > > out_of_memory() when it reached __alloc_pages_may_oom() and held oom_lock > > mutex. Since vb->balloon_lock mutex is already held by fill_balloon(), it > > will cause OOM lockup. > > > > Thread1 Thread2 > > fill_balloon() > > takes a balloon_lock > > balloon_page_enqueue() > > alloc_page(GFP_HIGHUSER_MOVABLE) > > direct reclaim (__GFP_FS context) takes a fs lock > > waits for that fs lock alloc_page(GFP_NOFS) > > > > __alloc_pages_may_oom() > > takes the oom_lock > > out_of_memory() > > > > blocking_notifier_call_chain() > > leak_balloon() > > tries to take > > that balloon_lock and deadlocks > > > > Reported-by: Tetsuo Handa > > Cc: Michal Hocko > > Cc: Wei Wang > > Signed-off-by: Michael S. 
Tsirkin > > > > drivers/virtio/virtio_balloon.c| 24 +++- > > include/linux/balloon_compaction.h | 35 ++- > > mm/balloon_compaction.c| 28 +--- > > 3 files changed, 74 insertions(+), 13 deletions(-) > > > Somehow this commit seems to be killing powerpc guests. > > The symptom is that the first page (64K) of the guests memory gets over > written with zeroes, which is where our interrupt handlers are, so the > system rapidly locks up due to illegal instructions in the illegal > instruction handler. > > There seems to be some element of a race, because it doesn't always > crash. Sometimes I can boot to a shell, but not often. When it does > happen it's fairly late in boot, but before I get to a shell. > > I had a few bisects go off into the weeds due to the intermittent > nature. But once I realised that I changed my script to boot 5 times > before declaring a kernel good, and that bisected straight here. > > I can also revert this commit on v4.15-rc1 and everything's fine again - > I got through ~250 boots with that kernel. > > So I'm pretty sure this commit is triggering/exposing/causing the bug. > > The other data point is that the page that's overwritten is mapped read > only in the guest kernel. So either the guest kernel is writing to it in > real mode (MMU off), or the hypervisor/DMA is doing it. > > I haven't isolated if it's host kernel/qemu version dependent, at the > moment I'm just using distro packaged versions of both. > > Anyway I'll try and dig further into it on Monday, but I thought I'd let > you know in case this is a known bug with a fix in the pipeline, or > rings any bells or whatever. > > cheers Thanks for the report! A fix was just posted: virtio_balloon: fix increment of vb->num_pfns in fill_balloon() Would appreciate testing. 
> > > diff --git a/drivers/virtio/virtio_balloon.c > > b/drivers/virtio/virtio_balloon.c > > index f0b3a0b..7960746 100644 > > --- a/drivers/virtio/virtio_balloon.c > > +++ b/drivers/virtio/virtio_balloon.c > > @@ -143,16 +143,17 @@ static void set_page_pfns(struct virtio_balloon *vb, > > > > static unsigned fill_balloon(struct virtio_balloon *vb, size_t num) > > { > > - struct balloon_dev_info *vb_dev_info = >vb_dev_info; > > unsigned num_allocated_pages; > > + unsigned num_pfns; > > + struct page *page; > > + LIST_HEAD(pages); > > > > /* We can only do one array worth at a time. */ > > num = min(num, ARRAY_SIZE(vb->pfns)); > > > > - mutex_lock(>balloon_lock); > > - for (vb->num_pfns = 0;
Re: [PATCH 02/13] fbdev: add remove_conflicting_pci_framebuffers()
On 30 November 2017 at 23:49, Sudip Mukherjeewrote: > Hi Daniel, > > On Wed, Nov 29, 2017 at 10:56:34AM +0100, Daniel Vetter wrote: >> On Tue, Nov 28, 2017 at 12:30:30PM +, Sudip Mukherjee wrote: >> > On Tue, Nov 28, 2017 at 12:32:38PM +0100, Greg KH wrote: >> > > On Tue, Nov 28, 2017 at 11:22:17AM +0100, Daniel Vetter wrote: >> > > > On Mon, Nov 27, 2017 at 08:52:19PM +, Sudip Mukherjee wrote: >> > > > > On Mon, Nov 27, 2017 at 11:27:59AM +0100, Daniel Vetter wrote: >> > > > > > On Fri, Nov 24, 2017 at 06:53:31PM +0100, Michał Mirosław wrote: >> > > > > > > Almost all drivers using remove_conflicting_framebuffers() wrap >> > > > > > > it with >> > > > > > > the same code. Extract common part from PCI drivers into separate >> > > > > > > remove_conflicting_pci_framebuffers(). >> > > > > > > > > > >> > > > >> > > > Greg? >> > > >> > > Yes, if no one is working to get it out of staging, that means no one >> > > cares about it, and it needs to be removed from the tree. >> > >> > Agreed. I was not working on getting it out of staging as there is no >> > place for it to go. >> > But, Teddy Wang told me that they have a working drm driver for it, but >> > it is not atomic like Daniel was asking for. If it is ok, then I can send >> > in a patch to remove the existing sm750 from staging and add the new sm750 >> > drm driver to staging. And after it is ready, we can go ahead with moving >> > it out of staging to drm. >> >> Please keep the todo item that it needs to be converted to atomic. And >> tbh, it's probably faster if you just submit it to dri-devel, assuming you >> have time to work on it. For small drivers we tend to be fairly quick in >> getting them into good enough shape. > > I have received the driver from Teddy and pushed it to > https://github.com/sudipm-mukherjee/parport/tree/drm_smi for your first > look into it. It is not even building with next-20171130 and has lots of > build warnings. 
I will have to do a lot of work on it before I can even > submit it to dri-devel. > A crazy idea, mostly towards Teddy and Sudip: Start small and build gradually. An example split for separate patch series: - one HW, basic setup + atomic KMS - add second HW - more KMS features - fancy memory management - 2D/3D/other acceleration The driver as seen above tries to do all of the above (almost, it's not atomic) at once - 40k loc. Someone familiar with the code can quickly split it up and while doing so, feed it through checkpatch. Current code is _very_ far from kernel coding style, plus the copyright blurb is very disturbing: * All rights are reserved. Reproduction or in part is prohibited * without the written consent of the copyright owner. HTH Emil
Re: [PATCH 1/3] vhost: fix skb leak in handle_rx()
On Fri, Dec 01, 2017 at 03:11:05PM +0800, Jason Wang wrote: > > > On 2017年12月01日 13:54, w...@redhat.com wrote: > > From: Wei Xu> > > > Matthew found a roughly 40% tcp throughput regression with commit > > c67df11f(vhost_net: try batch dequing from skb array) as discussed > > in the following thread: > > https://www.mail-archive.com/netdev@vger.kernel.org/msg187936.html > > > > Eventually we figured out that it was a skb leak in handle_rx() > > when sending packets to the VM. This usually happens when a guest > > can not drain out vq as fast as vhost fills in, afterwards it sets > > off the traffic jam and leaks skb(s) which occurs as no headcount > > to send on the vq from vhost side. > > > > This can be avoided by making sure we have got enough headcount > > before actually consuming a skb from the batched rx array while > > transmitting, which is simply done by moving checking the zero > > headcount a bit ahead. > > > > Signed-off-by: Wei Xu > > Reported-by: Matthew Rosato > > --- > > drivers/vhost/net.c | 20 ++-- > > 1 file changed, 10 insertions(+), 10 deletions(-) > > > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c > > index 8d626d7..c7bdeb6 100644 > > --- a/drivers/vhost/net.c > > +++ b/drivers/vhost/net.c > > @@ -778,16 +778,6 @@ static void handle_rx(struct vhost_net *net) > > /* On error, stop handling until the next kick. */ > > if (unlikely(headcount < 0)) > > goto out; > > - if (nvq->rx_array) > > - msg.msg_control = vhost_net_buf_consume(>rxq); > > - /* On overrun, truncate and discard */ > > - if (unlikely(headcount > UIO_MAXIOV)) { > > - iov_iter_init(_iter, READ, vq->iov, 1, 1); > > - err = sock->ops->recvmsg(sock, , > > -1, MSG_DONTWAIT | MSG_TRUNC); > > - pr_debug("Discarded rx packet: len %zd\n", sock_len); > > - continue; > > - } > > /* OK, now we need to know about added descriptors. 
*/ > > if (!headcount) { > > if (unlikely(vhost_enable_notify(&net->dev, vq))) { > > @@ -800,6 +790,16 @@ static void handle_rx(struct vhost_net *net) > > * they refilled. */ > > goto out; > > } > > + if (nvq->rx_array) > > + msg.msg_control = vhost_net_buf_consume(&nvq->rxq); > > + /* On overrun, truncate and discard */ > > + if (unlikely(headcount > UIO_MAXIOV)) { > > + iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1); > > + err = sock->ops->recvmsg(sock, &msg, > > +1, MSG_DONTWAIT | MSG_TRUNC); > > + pr_debug("Discarded rx packet: len %zd\n", sock_len); > > + continue; > > + } > > /* We don't need to be notified again. */ > > iov_iter_init(&msg.msg_iter, READ, vq->iov, in, vhost_len); > > fixup = msg.msg_iter; > > I suggest to reorder this patch to 3/3. > > Thanks Why? This doesn't cause any new leaks, does it? -- MST
Re: [PATCH 2/3] tun: free skb in early errors
On Fri, Dec 01, 2017 at 03:07:44PM +0800, Jason Wang wrote: > > > On 2017年12月01日 13:54, w...@redhat.com wrote: > > From: Wei Xu> > > > tun_recvmsg() supports accepting skb by msg_control after > > commit ac77cfd4258f ("tun: support receiving skb through msg_control"), > > the skb if presented should be freed within the function, otherwise it > > would be leaked. > > > > Signed-off-by: Wei Xu > > Reported-by: Matthew Rosato > > --- > > drivers/net/tun.c | 14 +++--- > > 1 file changed, 11 insertions(+), 3 deletions(-) > > > > diff --git a/drivers/net/tun.c b/drivers/net/tun.c > > index 6a7bde9..5563430 100644 > > --- a/drivers/net/tun.c > > +++ b/drivers/net/tun.c > > @@ -2067,14 +2067,17 @@ static int tun_recvmsg(struct socket *sock, struct > > msghdr *m, size_t total_len, > > { > > struct tun_file *tfile = container_of(sock, struct tun_file, socket); > > struct tun_struct *tun = tun_get(tfile); > > + struct sk_buff *skb = m->msg_control; > > int ret; > > - if (!tun) > > - return -EBADFD; > > + if (!tun) { > > + ret = -EBADFD; > > + goto out_free_skb; > > Unfortunately, you can't to there since tun is NULL. Right, this should just be kfree_skb(skb); return -EBADFD; > > > + } > > if (flags & ~(MSG_DONTWAIT|MSG_TRUNC|MSG_ERRQUEUE)) { > > ret = -EINVAL; > > - goto out; > > + goto out_free_skb; > > } > > if (flags & MSG_ERRQUEUE) { > > ret = sock_recv_errqueue(sock->sk, m, total_len, > > @@ -2087,6 +2090,11 @@ static int tun_recvmsg(struct socket *sock, struct > > msghdr *m, size_t total_len, > > m->msg_flags |= MSG_TRUNC; > > ret = flags & MSG_TRUNC ? ret : total_len; > > } > > + goto out; > > We usually don't use goto in the case of success, and you need deal with the > case skb != NULL but iov_iter_count(to) == 0 in tun_do_read(). 
> > Thanks I agree, the way to lay this out is: tun_put(tun); return ret; err: tun_put(tun); err_tun: if (skb) kfree_skb(skb); return ret; > > + > > +out_free_skb: > > + if (skb) > > + kfree_skb(skb); > > out: > > tun_put(tun); > > return ret;
Re: [PATCH v18 05/10] xbitmap: add more operations
On Fri, Dec 01, 2017 at 10:02:01PM +0900, Tetsuo Handa wrote: > If start == end is legal, > >for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) { > > makes this loop do nothing because 10 < 10 is false. ... and this is why we add tests to the test-suite!
Re: [PATCH 02/13] fbdev: add remove_conflicting_pci_framebuffers()
On Fri, Dec 01, 2017 at 08:19:16AM +0100, Daniel Vetter wrote: > On Thu, Nov 30, 2017 at 11:49:53PM +, Sudip Mukherjee wrote: > > Hi Daniel, > > > > > > > > > > > > > > > > > > Greg? > > > > > > > > > > Yes, if no one is working to get it out of staging, that means no one > > > > > cares about it, and it needs to be removed from the tree. > > > > > > > > Agreed. I was not working on getting it out of staging as there is no > > > > place for it to go. > > > > But, Teddy Wang told me that they have a working drm driver for it, but > > > > it is not atomic like Daniel was asking for. If it is ok, then I can > > > > send > > > > in a patch to remove the existing sm750 from staging and add the new > > > > sm750 > > > > drm driver to staging. And after it is ready, we can go ahead with > > > > moving > > > > it out of staging to drm. > > > > > > Please keep the todo item that it needs to be converted to atomic. And > > > tbh, it's probably faster if you just submit it to dri-devel, assuming you > > > have time to work on it. For small drivers we tend to be fairly quick in > > > getting them into good enough shape. > > > > I have received the driver from Teddy and pushed it to > > https://github.com/sudipm-mukherjee/parport/tree/drm_smi for your first > > look into it. It is not even building with next-20171130 and has lots of > > build warnings. I will have to do a lot of work on it before I can even > > submit it to dri-devel. > > > > Time will be the problem, as this is not part of my day job. > > > > > > > > Staging is also a major pain for drm subsystem refactorings, I really, > > > really, really prefer we don't add more than the vbox pain we have > > > already. > > > > I am hoping that after seeing it, you will agree to have it in staging. :) > > So I know Greg is willing to take anything into staging, but I'm no fan. > We refactor and improve drm a lot, with a lot of cross-driver changes > necessary to move things forward. 
We can do that since we have a generally > rather active development community, and we try hard to keep most drivers > in reasonably good shape and so easy to maintain. > > If you now throw a pile of unmaintainable stuff into staging, but without > working on it, then that's just cost, no benefit to the dri-devel > community. On top, staging tree is separate from drm trees, so more pain > to synchronize trees (and we have to do that a few times per release cycle > or drivers simply stop compiling). Where's the upside of taking this > driver into staging? > > One is users, but ime for soc display drivers usually everything else is > in worse shape (e.g. even great drivers like the one for qualcom must be > tested on some vendor tree because critical core bits are missing in > upstream), so "more users" is not the good reason. And it's clearly not > "more developers", because no time to clean things up. So what exactly is > the good reason to put this into staging instead of just waiting until > someone has the time to clean it up quickly? Ok, I will not try to give any more reasons now. :) I will cleanup the basic things I can and then submit it to dri-devel. Greg - Please keep the SM750 driver in staging for now. I will send a patch later to add in TODO the git location where the drm driver is being developed. And when that drm driver is ready, I will send a patch to remove the sm750fb driver from staging. -- Regards Sudip
Memory corruption in powerpc guests with virtio_balloon (was Re: [PATCH v3] virtio_balloon: fix deadlock on OOM)
"Michael S. Tsirkin"writes: > fill_balloon doing memory allocations under balloon_lock > can cause a deadlock when leak_balloon is called from > virtballoon_oom_notify and tries to take same lock. > > To fix, split page allocation and enqueue and do allocations outside the lock. > > Here's a detailed analysis of the deadlock by Tetsuo Handa: > > In leak_balloon(), mutex_lock(>balloon_lock) is called in order to > serialize against fill_balloon(). But in fill_balloon(), > alloc_page(GFP_HIGHUSER[_MOVABLE] | __GFP_NOMEMALLOC | __GFP_NORETRY) is > called with vb->balloon_lock mutex held. Since GFP_HIGHUSER[_MOVABLE] > implies __GFP_DIRECT_RECLAIM | __GFP_IO | __GFP_FS, despite __GFP_NORETRY > is specified, this allocation attempt might indirectly depend on somebody > else's __GFP_DIRECT_RECLAIM memory allocation. And such indirect > __GFP_DIRECT_RECLAIM memory allocation might call leak_balloon() via > virtballoon_oom_notify() via blocking_notifier_call_chain() callback via > out_of_memory() when it reached __alloc_pages_may_oom() and held oom_lock > mutex. Since vb->balloon_lock mutex is already held by fill_balloon(), it > will cause OOM lockup. > > Thread1 Thread2 > fill_balloon() > takes a balloon_lock > balloon_page_enqueue() > alloc_page(GFP_HIGHUSER_MOVABLE) > direct reclaim (__GFP_FS context) takes a fs lock > waits for that fs lock alloc_page(GFP_NOFS) > __alloc_pages_may_oom() > takes the oom_lock > out_of_memory() > > blocking_notifier_call_chain() > leak_balloon() > tries to take > that balloon_lock and deadlocks > > Reported-by: Tetsuo Handa > Cc: Michal Hocko > Cc: Wei Wang > Signed-off-by: Michael S. Tsirkin > > drivers/virtio/virtio_balloon.c| 24 +++- > include/linux/balloon_compaction.h | 35 ++- > mm/balloon_compaction.c| 28 +--- > 3 files changed, 74 insertions(+), 13 deletions(-) Somehow this commit seems to be killing powerpc guests. 
The symptom is that the first page (64K) of the guest's memory gets
overwritten with zeroes, which is where our interrupt handlers are, so
the system rapidly locks up due to illegal instructions in the illegal
instruction handler.

There seems to be some element of a race, because it doesn't always
crash. Sometimes I can boot to a shell, but not often. When it does
happen it's fairly late in boot, but before I get to a shell.

I had a few bisects go off into the weeds due to the intermittent
nature. But once I realised that, I changed my script to boot 5 times
before declaring a kernel good, and that bisected straight here. I can
also revert this commit on v4.15-rc1 and everything's fine again - I
got through ~250 boots with that kernel. So I'm pretty sure this commit
is triggering/exposing/causing the bug.

The other data point is that the page that's overwritten is mapped
read-only in the guest kernel. So either the guest kernel is writing to
it in real mode (MMU off), or the hypervisor/DMA is doing it.

I haven't isolated whether it's host kernel/qemu version dependent; at
the moment I'm just using distro-packaged versions of both.

Anyway I'll try and dig further into it on Monday, but I thought I'd
let you know in case this is a known bug with a fix in the pipeline, or
rings any bells or whatever.

cheers

> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index f0b3a0b..7960746 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -143,16 +143,17 @@ static void set_page_pfns(struct virtio_balloon *vb,
>
>  static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  {
> -	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
>  	unsigned num_allocated_pages;
> +	unsigned num_pfns;
> +	struct page *page;
> +	LIST_HEAD(pages);
>
>  	/* We can only do one array worth at a time. */
>  	num = min(num, ARRAY_SIZE(vb->pfns));
>
> -	mutex_lock(&vb->balloon_lock);
> -	for (vb->num_pfns = 0; vb->num_pfns < num;
> -	     vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
> -		struct page *page = balloon_page_enqueue(vb_dev_info);
> +	for (num_pfns = 0; num_pfns < num;
> +	     num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
> +		struct page *page = balloon_page_alloc();
>
>  		if (!page) {
>  			dev_info_ratelimited(&vb->vdev->dev,
> @@
Re: [PATCH v18 06/10] virtio_ring: add a new API, virtqueue_add_one_desc
On 12/01/2017 03:38 AM, Michael S. Tsirkin wrote:
> On Wed, Nov 29, 2017 at 09:55:22PM +0800, Wei Wang wrote:
>> The current virtqueue_add API implementation is based on the
>> scatterlist struct, which uses kaddr. This is inadequate for some
>> use cases of vring. For example:
>> - Some usages don't use an IOMMU; in this case the user can directly
>>   pass in a physical address in hand, instead of going through the
>>   sg implementation (e.g. the VIRTIO_BALLOON_F_SG feature)
>> - Sometimes, a guest physical page may not have a kaddr (e.g. high
>>   memory) but needs to use vring (e.g. the
>>   VIRTIO_BALLOON_F_FREE_PAGE_VQ feature)
>>
>> The new API virtqueue_add_one_desc enables the caller to assign a
>> vring desc with a physical address and len. Also, factor out the
>> common code with virtqueue_add into vring_set_avail.
>>
>> Signed-off-by: Wei Wang
>> Cc: Michael S. Tsirkin
>
> You previously managed without this patch, and it's preferable IMHO
> since this patchset is already too big. I don't really understand
> what is wrong with virtqueue_add_sgs + sg_set_page. I don't think it
> assumes a kaddr.

OK, I will use the previous method to send sgs. Please check whether
there are other things that need to be improved.

Best,
Wei
Re: [PATCH v18 05/10] xbitmap: add more operations
On 11/30/2017 06:34 PM, Tetsuo Handa wrote:
> Wei Wang wrote:
>> + * @start: the start of the bit range, inclusive
>> + * @end: the end of the bit range, inclusive
>> + *
>> + * This function is used to clear a bit in the xbitmap. If all the
>> + * bits of the bitmap are 0, the bitmap will be freed.
>> + */
>> +void xb_clear_bit_range(struct xb *xb, unsigned long start,
>> +			unsigned long end)
>> +{
>> +	struct radix_tree_root *root = &xb->xbrt;
>> +	struct radix_tree_node *node;
>> +	void **slot;
>> +	struct ida_bitmap *bitmap;
>> +	unsigned int nbits;
>> +
>> +	for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
>> +		unsigned long index = start / IDA_BITMAP_BITS;
>> +		unsigned long bit = start % IDA_BITMAP_BITS;
>> +
>> +		bitmap = __radix_tree_lookup(root, index, &node, &slot);
>> +		if (radix_tree_exception(bitmap)) {
>> +			unsigned long ebit = bit + 2;
>> +			unsigned long tmp = (unsigned long)bitmap;
>> +
>> +			nbits = min(end - start + 1, BITS_PER_LONG - ebit);
>
> "nbits = min(end - start + 1," seems to expect that start == end is
> legal for clearing only 1 bit. But this function is a no-op if
> start == end. Please clarify what "inclusive" is intended to mean.

If xb_clear_bit_range(xb, 10, 10) is called, then it is effectively
the same as xb_clear_bit(10). Why would it be illegal? "@start
inclusive" means that @start itself will also be cleared.

>> +static __always_inline void bitmap_clear(unsigned long *map,
>> +					 unsigned int start,
>> +					 unsigned int nbits)
>> +{
>> +	if (__builtin_constant_p(nbits) && nbits == 1)
>> +		__clear_bit(start, map);
>> +	else if (__builtin_constant_p(start & 7) && IS_ALIGNED(start, 8) &&
>> +		 __builtin_constant_p(nbits & 7) && IS_ALIGNED(nbits, 8))
>
> It looks strange to apply the __builtin_constant_p test to variables
> after "& 7".

I think this is normal - if the variables are known at compile time,
the calculation will be done at compile time (termed constant folding).

Best,
Wei