Re: [PATCH v3 2/2] virtio_balloon: replace oom notifier with shrinker
On 08/04/2018 03:15 AM, Michael S. Tsirkin wrote:
> On Fri, Aug 03, 2018 at 04:32:26PM +0800, Wei Wang wrote:
>> The OOM notifier is being deprecated for the following reasons:
>> - As a callout from the oom context, it is too subtle and easy to
>>   generate bugs and corner cases which are hard to track;
>> - It is called too late (after the reclaiming has been performed).
>>   Drivers with a large amount of reclaimable memory are expected to
>>   release it at an early stage of memory pressure;
>> - The notifier callback isn't aware of oom constraints;
>> Link: https://lkml.org/lkml/2018/7/12/314
>>
>> This patch replaces the virtio-balloon oom notifier with a shrinker
>> to release balloon pages on memory pressure. The balloon pages are
>> given back to mm adaptively by returning the number of pages that the
>> reclaimer is asking for (i.e. sc->nr_to_scan).
>>
>> Currently the max possible value of sc->nr_to_scan passed to the
>> balloon shrinker is SHRINK_BATCH, which is 128. This is smaller than
>> the limitation that only VIRTIO_BALLOON_ARRAY_PFNS_MAX (256) pages
>> can be returned via one invocation of leak_balloon. But this patch
>> still considers the case that SHRINK_BATCH or shrinker->batch could
>> be changed to a value larger than VIRTIO_BALLOON_ARRAY_PFNS_MAX,
>> which will need multiple invocations of leak_balloon.
>>
>> Historically, the feature VIRTIO_BALLOON_F_DEFLATE_ON_OOM has been
>> used to release balloon pages on OOM. We continue to use this feature
>> bit for the shrinker, so the shrinker is only registered when this
>> feature bit has been negotiated with the host.
>>
>> Signed-off-by: Wei Wang
>> Cc: Michael S. Tsirkin
>> Cc: Michal Hocko
>> Cc: Andrew Morton
>
> Could you add data on how this was tested and how guest behaviour
> changed? Which configurations see an improvement?

Yes. Please see the differences between the "*1" and "*2" cases below.
Taking this chance, I use "*2" and "*3" to show Michal etc. the
differences of applying and not applying the shrinker fix patch here:
https://lkml.org/lkml/2018/8/3/384

*1. V3 patches

1) After inflating some amount of memory, actual=101536 Bytes

free -m
              total   used   free  shared  buff/cache  available
Mem:           7975   7289    514      10         171        447
Swap:         10236      0  10236

2) dd if=478MB_file of=/dev/null, actual=1058721792 Bytes

free -m
              total   used   free  shared  buff/cache  available
Mem:           7975   7233    102      10         639        475
Swap:         10236      0  10236

The advantage is that the inflated pages are given back to mm based on
the number the reclaimer is asking for, i.e. ~56MB (diff of "actual"
above). This is more adaptive.

*2. V2 patches, balloon_pages_to_shrink=100 pages (around 4GB), with
the shrinker fix patches applied.

1) After inflating some amount of memory, actual=101536 Bytes

free -m
              total   used   free  shared  buff/cache  available
Mem:           7975   7288    530      10         157        455
Swap:         10236      0  10236

2) dd if=478MB_file of=/dev/null, actual=5096001536 Bytes

free -m
              total   used   free  shared  buff/cache  available
Mem:           7975   3381   3953      10         640       4327
Swap:         10236      0  10236

In the above example, we set 4GB to shrink to make the difference
obvious. Though the reclaimer only needs to reclaim ~56MB of memory,
4GB of inflated pages are given back to mm, which is unnecessary. From
the user's perspective, they have no idea how many pages to give back
at the time of setting the module parameter
(balloon_pages_to_shrink). So I think the above "*1" is better.

*3. V2 patches, balloon_pages_to_shrink=100 pages (around 4GB),
without the shrinker fix patches applied.

1) After inflating some amount of memory, actual=101536 Bytes

free -m
              total   used   free  shared  buff/cache  available
Mem:           7975   7292    524      10         158        450
Swap:         10236      0  10236

2) dd if=478MB_file of=/dev/null, actual=8589934592 Bytes

free -m
              total   used   free  shared  buff/cache  available
Mem:           7975     53   7281      10         640       7656
Swap:         10236      0  10236

Compared to *2, all the balloon pages are shrunk, while users expected
only 4GB to shrink.
The reason is that do_shrink_slab has a mistake in calculating
shrinkctl->nr_scanned, which should be the actual number of pages that
the shrinker has freed, but do_shrink_slab still treats that value as
128 (though 4GB has actually been freed).

Best,
Wei

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
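The batching constraint described in the patch log above — sc->nr_to_scan may one day exceed VIRTIO_BALLOON_ARRAY_PFNS_MAX, requiring multiple leak_balloon invocations — can be sketched in userspace C. This is a model, not the driver code: leak_balloon_stub stands in for leak_balloon, and the calls counter is purely illustrative.

```c
#include <assert.h>

#define VIRTIO_BALLOON_ARRAY_PFNS_MAX 256  /* max pages one leak_balloon call returns */

/* Stand-in for leak_balloon(): frees at most PFNS_MAX pages per call. */
static unsigned long leak_balloon_stub(unsigned long num_pages)
{
    return num_pages > VIRTIO_BALLOON_ARRAY_PFNS_MAX ?
           VIRTIO_BALLOON_ARRAY_PFNS_MAX : num_pages;
}

/* Model of the shrinker scan callback: loop until nr_to_scan is satisfied,
 * counting how many leak_balloon invocations that takes. */
static unsigned long shrink_balloon_pages(unsigned long nr_to_scan,
                                          unsigned long *calls)
{
    unsigned long freed = 0;

    *calls = 0;
    while (freed < nr_to_scan) {
        freed += leak_balloon_stub(nr_to_scan - freed);
        (*calls)++;
    }
    return freed;
}
```

With today's SHRINK_BATCH of 128 a single invocation suffices; a 600-page request would need three.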
[PATCH net-next V2] vhost: switch to use new message format
We used to have a message like:

struct vhost_msg {
	int type;
	union {
		struct vhost_iotlb_msg iotlb;
		__u8 padding[64];
	};
};

Unfortunately, there will be a 32bit hole on 64bit machines because of
alignment. This leads to different formats for the 32bit and 64bit
APIs. What's more, it will break 32bit programs running on a 64bit
machine. So fix this by introducing a new message type with an
explicit 32bit reserved field after type, like:

struct vhost_msg_v2 {
	__u32 type;
	__u32 reserved;
	union {
		struct vhost_iotlb_msg iotlb;
		__u8 padding[64];
	};
};

We will have a consistent ABI after switching to use this. To enable
this capability, introduce a new ioctl (VHOST_SET_BACKEND_FEATURES)
for userspace to enable this feature (VHOST_BACKEND_F_IOTLB_MSG_V2).

Fixes: 6b1e6cc7855b ("vhost: new device IOTLB API")
Signed-off-by: Jason Wang
---
Changes from V1:
- use __u32 instead of int for type
---
 drivers/vhost/net.c        | 30
 drivers/vhost/vhost.c      | 71
 drivers/vhost/vhost.h      | 11
 include/uapi/linux/vhost.h | 18
 4 files changed, 111 insertions(+), 19 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 367d802..4e656f8 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -78,6 +78,10 @@ enum {
 };
 
 enum {
+	VHOST_NET_BACKEND_FEATURES = (1ULL << VHOST_BACKEND_F_IOTLB_MSG_V2)
+};
+
+enum {
 	VHOST_NET_VQ_RX = 0,
 	VHOST_NET_VQ_TX = 1,
 	VHOST_NET_VQ_MAX = 2,
@@ -1399,6 +1403,21 @@ static long vhost_net_reset_owner(struct vhost_net *n)
 	return err;
 }
 
+static int vhost_net_set_backend_features(struct vhost_net *n, u64 features)
+{
+	int i;
+
+	mutex_lock(&n->dev.mutex);
+	for (i = 0; i < VHOST_NET_VQ_MAX; ++i) {
+		mutex_lock(&n->vqs[i].vq.mutex);
+		n->vqs[i].vq.acked_backend_features = features;
+		mutex_unlock(&n->vqs[i].vq.mutex);
+	}
+	mutex_unlock(&n->dev.mutex);
+
+	return 0;
+}
+
 static int vhost_net_set_features(struct vhost_net *n, u64 features)
 {
 	size_t vhost_hlen, sock_hlen, hdr_len;
@@ -1489,6 +1508,17 @@ static long vhost_net_ioctl(struct file *f, unsigned int ioctl,
 		if (features & ~VHOST_NET_FEATURES)
 			return -EOPNOTSUPP;
 		return vhost_net_set_features(n, features);
+	case VHOST_GET_BACKEND_FEATURES:
+		features = VHOST_NET_BACKEND_FEATURES;
+		if (copy_to_user(featurep, &features, sizeof(features)))
+			return -EFAULT;
+		return 0;
+	case VHOST_SET_BACKEND_FEATURES:
+		if (copy_from_user(&features, featurep, sizeof(features)))
+			return -EFAULT;
+		if (features & ~VHOST_NET_BACKEND_FEATURES)
+			return -EOPNOTSUPP;
+		return vhost_net_set_backend_features(n, features);
 	case VHOST_RESET_OWNER:
 		return vhost_net_reset_owner(n);
 	case VHOST_SET_OWNER:
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index a502f1a..6f6c42d 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -315,6 +315,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 	vq->log_addr = -1ull;
 	vq->private_data = NULL;
 	vq->acked_features = 0;
+	vq->acked_backend_features = 0;
 	vq->log_base = NULL;
 	vq->error_ctx = NULL;
 	vq->kick = NULL;
@@ -1027,28 +1028,40 @@ static int vhost_process_iotlb_msg(struct vhost_dev *dev,
 ssize_t vhost_chr_write_iter(struct vhost_dev *dev,
 			     struct iov_iter *from)
 {
-	struct vhost_msg_node node;
-	unsigned size = sizeof(struct vhost_msg);
-	size_t ret;
-	int err;
+	struct vhost_iotlb_msg msg;
+	size_t offset;
+	int type, ret;
 
-	if (iov_iter_count(from) < size)
-		return 0;
-	ret = copy_from_iter(&node.msg, size, from);
-	if (ret != size)
+	ret = copy_from_iter(&type, sizeof(type), from);
+	if (ret != sizeof(type))
 		goto done;
 
-	switch (node.msg.type) {
+	switch (type) {
 	case VHOST_IOTLB_MSG:
-		err = vhost_process_iotlb_msg(dev, &node.msg.iotlb);
-		if (err)
-			ret = err;
+		/* There maybe a hole after type for V1 message type,
+		 * so skip it here.
+		 */
+		offset = offsetof(struct vhost_msg, iotlb) - sizeof(int);
+		break;
+	case VHOST_IOTLB_MSG_V2:
+		offset = sizeof(__u32);
 		break;
 	default:
 		ret = -EINVAL;
-		break;
+		goto done;
+	}
+
+	iov_iter_advance(from, offset);
+	ret = copy_from_iter(&msg, sizeof(msg), from);
+	if (ret
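The hole Jason describes is easy to see with offsetof. On a 64bit build, the __u64 fields in the iotlb message align the union to 8 bytes, so the v1 layout has a 4-byte hole after the int type, while a 32bit build places the union at offset 4 — two different ABIs. The v2 layout pins the union at offset 8 everywhere. A userspace sketch (struct names are illustrative mirrors of the uapi ones; the asserted offsets assume an LP64 target such as x86-64):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors struct vhost_iotlb_msg: three 64bit fields force 8-byte alignment. */
struct iotlb_msg {
    uint64_t iova;
    uint64_t size;
    uint64_t uaddr;
    uint8_t  perm;
    uint8_t  type;
};

/* V1: the compiler inserts a 4-byte hole after 'type' on LP64, none on 32bit. */
struct msg_v1 {
    int type;
    union {
        struct iotlb_msg iotlb;
        uint8_t padding[64];
    };
};

/* V2: the explicit reserved field makes the layout identical on both ABIs. */
struct msg_v2 {
    uint32_t type;
    uint32_t reserved;
    union {
        struct iotlb_msg iotlb;
        uint8_t padding[64];
    };
};
```

On LP64 both structs come out at 72 bytes with the union at offset 8 — but for msg_v1 that offset would be 4 on a 32bit build, which is exactly the ABI break the patch fixes.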
Re: [PATCH net-next] vhost: switch to use new message format
On 08/03/2018 15:59, Michael S. Tsirkin wrote:
>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>> index a502f1a..6f6c42d 100644
>> --- a/drivers/vhost/vhost.c
>> +++ b/drivers/vhost/vhost.c
>> @@ -315,6 +315,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
>>  	vq->log_addr = -1ull;
>>  	vq->private_data = NULL;
>>  	vq->acked_features = 0;
>> +	vq->acked_backend_features = 0;
>>  	vq->log_base = NULL;
>>  	vq->error_ctx = NULL;
>>  	vq->kick = NULL;
>> @@ -1027,28 +1028,40 @@ static int vhost_process_iotlb_msg(struct vhost_dev *dev,
>>  ssize_t vhost_chr_write_iter(struct vhost_dev *dev,
>>  			     struct iov_iter *from)
>>  {
>> -	struct vhost_msg_node node;
>> -	unsigned size = sizeof(struct vhost_msg);
>> -	size_t ret;
>> -	int err;
>> +	struct vhost_iotlb_msg msg;
>> +	size_t offset;
>> +	int type, ret;
>> 
>> -	if (iov_iter_count(from) < size)
>> -		return 0;
>> -	ret = copy_from_iter(&node.msg, size, from);
>> -	if (ret != size)
>> +	ret = copy_from_iter(&type, sizeof(type), from);
>> +	if (ret != sizeof(type))
>>  		goto done;
>> 
>> -	switch (node.msg.type) {
>> +	switch (type) {
>>  	case VHOST_IOTLB_MSG:
>> -		err = vhost_process_iotlb_msg(dev, &node.msg.iotlb);
>> -		if (err)
>> -			ret = err;
>> +		/* There maybe a hole after type for V1 message type,
>> +		 * so skip it here.
>> +		 */
>> +		offset = offsetof(struct vhost_msg, iotlb) - sizeof(int);
>> +		break;
>> +	case VHOST_IOTLB_MSG_V2:
>> +		offset = sizeof(__u32);
>>  		break;
>>  	default:
>>  		ret = -EINVAL;
>> -		break;
>> +		goto done;
>> +	}
>> +
>> +	iov_iter_advance(from, offset);
>> +	ret = copy_from_iter(&msg, sizeof(msg), from);
>> +	if (ret != sizeof(msg))
>> +		goto done;
>> +	if (vhost_process_iotlb_msg(dev, &msg)) {
>> +		ret = -EFAULT;
>> +		goto done;
>>  	}
>> +
>> +	ret = (type == VHOST_IOTLB_MSG) ? sizeof(struct vhost_msg) :
>> +	      sizeof(struct vhost_msg_v2);
>>  done:
>>  	return ret;
>>  }
>
> We can actually fix 32 bit apps too, checking the mode for v1. But
> that can wait for another patch.

Yes, let me do it on top.

Thanks
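A detail worth noting in the rewritten vhost_chr_write_iter above: on a 64bit kernel, the number of bytes to skip after reading the 4-byte type happens to be 4 for both formats — the alignment hole for V1, the explicit reserved field for V2 — which is why a single copy path can handle both. A userspace sketch of that arithmetic (struct and function names are illustrative; the asserted values assume an LP64 target):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* LP64-layout mirror of the V1 message (names illustrative). */
struct iotlb_msg { uint64_t iova, size, uaddr; uint8_t perm, type; };
struct msg_v1 {
    int type;
    union {
        struct iotlb_msg iotlb;
        uint8_t pad[64];
    };
};

/* Bytes to skip after reading the 4-byte type, as in vhost_chr_write_iter:
 * V1 skips the implicit alignment hole, V2 skips the reserved field. */
static size_t skip_after_type(int is_v2)
{
    return is_v2 ? sizeof(uint32_t)
                 : offsetof(struct msg_v1, iotlb) - sizeof(int);
}
```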
Re: [PATCH net-next] vhost: switch to use new message format
On 08/05/2018 04:21, David Miller wrote:
> From: Jason Wang
> Date: Fri, 3 Aug 2018 15:04:51 +0800
>
>> So fixing this by introducing a new message type with an explicit
>> 32bit reserved field after type like:
>>
>> struct vhost_msg_v2 {
>> 	int type;
>> 	__u32 reserved;
>
> Please use fixed sized types consistently. Use 's32' instead of 'int'
> here. Thanks!

Ok, V2 will be posted soon. And it looks to me u32 is sufficient.

Thanks
Re: [PATCH net-next] virtio-net: mark expected switch fall-throughs
From: "Gustavo A. R. Silva"
Date: Sat, 4 Aug 2018 21:42:05 -0500

> In preparation to enabling -Wimplicit-fallthrough, mark switch cases
> where we are expecting to fall through.
>
> Addresses-Coverity-ID: 1402059 ("Missing break in switch")
> Addresses-Coverity-ID: 1402060 ("Missing break in switch")
> Addresses-Coverity-ID: 1402061 ("Missing break in switch")
> Signed-off-by: Gustavo A. R. Silva

Applied, thank you.
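For readers unfamiliar with the change being applied above: -Wimplicit-fallthrough warns whenever one case body runs into the next without a break, unless the fall-through is marked (a "fall through" comment, or the fallthrough attribute on newer compilers). A minimal illustration, not taken from the patch itself:

```c
#include <assert.h>

/* Counts how many case bodies execute starting at 'level'; the fall-through
 * from case 0 into case 1 is intentional and marked as such. */
static int run_from(int level)
{
    int steps = 0;

    switch (level) {
    case 0:
        steps++;
        /* fall through */  /* silences -Wimplicit-fallthrough */
    case 1:
        steps++;
        break;
    default:
        steps = -1;
    }
    return steps;
}
```

Entering at case 0 executes two bodies; entering at case 1 executes one — exactly the behavior the marker documents as deliberate rather than a missing break.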
Re: [RFC 0/4] Virtio uses DMA API for all devices
On Mon, 2018-08-06 at 07:16 +1000, Benjamin Herrenschmidt wrote:
> I'm trying to understand because the limitation is not a device side
> limitation, it's not a qemu limitation, it's actually more of a VM
> limitation. It has most of its memory pages made inaccessible for
> security reasons. The platform from a qemu/KVM perspective is almost
> entirely normal.

In fact this is probably the best image of what's going on: it's a
normal VM from a KVM/qemu perspective (and thus virtio). It boots
normally, can run firmware, linux, etc. normally, and it's not created
with any different XML or qemu command line definition. It's just that
once it reaches the kernel with the secure stuff enabled (could be via
kexec from a normal kernel), that kernel will "stash away" most of the
VM's memory into some secure space that nothing else (not even the
hypervisor) can access. It can keep around a pool or two of normal
memory for bounce buffering IOs, but that's about it.

I think that's the clearest way I could find to explain what's going
on, and why I'm so resistant to adding things on the qemu side.

That said, we *can* (and will) notify KVM and qemu of the transition,
and we can/will do so after virtio has been instantiated and used by
the bootloader, but before it will be used (or even probed) by the
secure VM itself. So there's an opportunity to poke at things, either
from the VM itself (a quirk poking at virtio config space for example)
or from qemu (though I find the idea of iterating all virtio devices
from qemu to change a setting rather gross).

Cheers,
Ben.
Re: [RFC 0/4] Virtio uses DMA API for all devices
On Sun, 2018-08-05 at 00:29 -0700, Christoph Hellwig wrote:
> On Sun, Aug 05, 2018 at 11:10:15AM +1000, Benjamin Herrenschmidt wrote:
> > - One you have rejected, which is to have a way for "no-iommu" virtio
> > (which still doesn't use an iommu on the qemu side and doesn't need
> > to), to be forced to use some custom DMA ops on the VM side.
> >
> > - One, which sadly has more overhead and will require modifying more
> > pieces of the puzzle, which is to make qemu use an emulated iommu.
> > Once we make qemu do that, we can then layer swiotlb on top of the
> > emulated iommu on the guest side, and pass that as dma_ops to virtio.
>
> Or number three: have a virtio feature bit that tells the VM
> to use whatever dma ops the platform thinks are appropriate for
> the bus it pretends to be on. Then set a dma-range that is limited
> to your secure memory range (if you really need it to be runtime
> enabled only after a device reset that rescans) and use the normal
> dma mapping code to bounce buffer.

Who would set this bit? qemu? Under what circumstances? What would be
the effect of this bit while VIRTIO_F_IOMMU_PLATFORM is NOT set, ie,
what would qemu do and what would Linux do? I'm not sure I fully
understand your idea.

I'm trying to understand because the limitation is not a device side
limitation, it's not a qemu limitation, it's actually more of a VM
limitation. It has most of its memory pages made inaccessible for
security reasons. The platform from a qemu/KVM perspective is almost
entirely normal.

So I don't understand when qemu would set this bit, or should it be
set by the VM at runtime?

Cheers,
Ben.
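The bounce-buffering idea that keeps surfacing in this thread — reserve a small pool of normal (non-secure) memory and copy DMA data through it, since the device can't touch secure pages — can be shown in miniature. This is a toy sketch, nothing like the real swiotlb implementation; all names (bounce_map, bounce_unmap, the fixed pool) are invented for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The only memory the hypervisor/device may touch; everything else
 * is modeled as "secure" and inaccessible to it. */
static unsigned char bounce_pool[4096];
static size_t bounce_used;

/* "Map" a secure buffer for device access by copying it into the pool.
 * A trivial linear allocator stands in for real slot management. */
static void *bounce_map(const void *buf, size_t len)
{
    void *slot;

    if (bounce_used + len > sizeof(bounce_pool))
        return NULL;            /* pool exhausted */
    slot = bounce_pool + bounce_used;
    memcpy(slot, buf, len);     /* secure -> shared copy */
    bounce_used += len;
    return slot;
}

/* On unmap, copy device-written data back into the secure buffer. */
static void bounce_unmap(const void *slot, void *buf, size_t len)
{
    memcpy(buf, slot, len);     /* shared -> secure copy */
}
```

The cost Ben alludes to is visible even here: every IO pays two extra copies plus management of a scarce pool, which is why doing this transparently under the DMA API (rather than teaching each driver about it) is attractive.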
Re: [PATCH net-next 0/6] virtio_net: Add ethtool stat items
On 05/08/2018 4:45 PM, Tariq Toukan wrote:
> On 05/08/2018 4:15 PM, Tariq Toukan wrote:
>> On 25/07/2018 10:59 PM, David Miller wrote:
>>> From: "Michael S. Tsirkin"
>>> Date: Wed, 25 Jul 2018 12:40:12 +0300
>>>
>>>> On Mon, Jul 23, 2018 at 11:36:03PM +0900, Toshiaki Makita wrote:
>>>>> From: Toshiaki Makita
>>>>>
>>>>> Add some ethtool stat items useful for performance analysis.
>>>>>
>>>>> Signed-off-by: Toshiaki Makita
>>>>
>>>> Series:
>>>> Acked-by: Michael S. Tsirkin
>>>
>>> Series applied.
>>>
>>>> Patch 1 seems appropriate for stable, even though it's minor.
>>>
>>> Ok, I'll put patch #1 also into 'net' and queue it up for -stable.
>>> Thanks.
>>
>> Hi,
>>
>> Our verification team encountered the following degradation,
>> introduced by this series. Please check the KASAN bug report below.
>> We tested before and after the series, but didn't bisect within it.
>> We can do so if necessary.
>>
>> Thanks,
>> Tariq
>
> I see this might already be fixed, here:
> https://marc.info/?l=linux-netdev=153335713407532=2
>
> Verifying...

No repro when applying this fix.

Thanks,
Tariq
Re: [PATCH net-next 0/6] virtio_net: Add ethtool stat items
On 05/08/2018 4:15 PM, Tariq Toukan wrote: On 25/07/2018 10:59 PM, David Miller wrote: From: "Michael S. Tsirkin" Date: Wed, 25 Jul 2018 12:40:12 +0300 On Mon, Jul 23, 2018 at 11:36:03PM +0900, Toshiaki Makita wrote: From: Toshiaki Makita Add some ethtool stat items useful for performance analysis. Signed-off-by: Toshiaki Makita Series: Acked-by: Michael S. Tsirkin Series applied. Patch 1 seems appropriate for stable, even though it's minor. Ok, I'll put patch #1 also into 'net' and queue it up for -stable. Thanks. Hi, Our verification team encountered the following degradation, introduced by this series. Please check KASAN bug report below. We tested before and after the series, but didn't bisect within it. We can do if necessary. Thanks, Tariq I see this might already be fixed, here: https://marc.info/?l=linux-netdev=153335713407532=2 Verifying... [ 46.166544] BUG: KASAN: use-after-free in virtnet_poll+0xd9c/0xfd1 [virtio_net] [ 46.166593] Read of size 8 at addr 8803400da608 by task ip/1013 [ 46.166603] CPU: 3 PID: 1013 Comm: ip Not tainted 4.18.0-rc6-for-upstream-dbg-2018-07-26_19-45-52-64 #1 [ 46.166606] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 [ 46.166609] Call Trace: [ 46.166613] [ 46.166624] dump_stack+0xf0/0x17b [ 46.166647] ? show_regs_print_info+0x5/0x5 [ 46.166654] ? printk+0x9c/0xc3 [ 46.11] ? kmsg_dump_rewind_nolock+0xf5/0xf5 [ 46.166677] ? virtnet_poll+0xd9c/0xfd1 [virtio_net] [ 46.166685] print_address_description+0xea/0x380 [ 46.166701] ? virtnet_poll+0xd9c/0xfd1 [virtio_net] [ 46.166711] kasan_report+0x1dd/0x460 [ 46.166727] ? virtnet_poll+0xd9c/0xfd1 [virtio_net] [ 46.166743] virtnet_poll+0xd9c/0xfd1 [virtio_net] [ 46.166767] ? receive_buf+0x2e30/0x2e30 [virtio_net] [ 46.166796] ? sched_clock_cpu+0x18/0x2b0 [ 46.166809] ? print_irqtrace_events+0x280/0x280 [ 46.166817] ? print_irqtrace_events+0x280/0x280 [ 46.166830] ? rcu_process_callbacks+0xc5e/0x12d0 [ 46.166838] ? 
kvm_clock_read+0x1f/0x30 [ 46.166857] ? mark_held_locks+0xd5/0x170 [ 46.166867] ? net_rx_action+0x2aa/0x10e0 [ 46.166882] net_rx_action+0x4bc/0x10e0 [ 46.166906] ? napi_complete_done+0x480/0x480 [ 46.166925] ? print_irqtrace_events+0x280/0x280 [ 46.166935] ? sched_clock+0x5/0x10 [ 46.166952] ? __lock_is_held+0xcb/0x1a0 [ 46.166982] __do_softirq+0x2c4/0xdf4 [ 46.167010] do_softirq_own_stack+0x2a/0x40 [ 46.167031] [ 46.167038] ? virtnet_napi_enable+0x37/0xa0 [virtio_net] [ 46.167044] do_softirq.part.11+0x69/0x70 [ 46.167065] __local_bh_enable_ip+0x1d9/0x250 [ 46.167076] virtnet_open+0x13c/0x440 [virtio_net] [ 46.167099] __dev_open+0x1cf/0x390 [ 46.167108] ? dev_set_rx_mode+0x30/0x30 [ 46.167113] ? __local_bh_enable_ip+0xf7/0x250 [ 46.167124] ? trace_hardirqs_on_caller+0x38d/0x6c0 [ 46.167130] ? __dev_change_flags+0x18d/0x630 [ 46.167142] __dev_change_flags+0x469/0x630 [ 46.167152] ? dev_set_allmulti+0x10/0x10 [ 46.167157] ? __lock_acquire+0x9c1/0x4b50 [ 46.167166] ? print_irqtrace_events+0x280/0x280 [ 46.167174] ? kvm_clock_read+0x1f/0x30 [ 46.167191] ? rtnetlink_put_metrics+0x530/0x530 [ 46.167205] dev_change_flags+0x8b/0x160 [ 46.167236] do_setlink+0xa17/0x39f0 [ 46.167248] ? sched_clock_cpu+0x18/0x2b0 [ 46.167267] ? rtnl_dump_ifinfo+0xd20/0xd20 [ 46.167277] ? print_irqtrace_events+0x280/0x280 [ 46.167285] ? kvm_clock_read+0x1f/0x30 [ 46.167316] ? debug_show_all_locks+0x3b0/0x3b0 [ 46.167321] ? kvm_sched_clock_read+0x5/0x10 [ 46.167327] ? sched_clock+0x5/0x10 [ 46.167333] ? sched_clock_cpu+0x18/0x2b0 [ 46.167382] ? memset+0x1f/0x40 [ 46.167408] ? nla_parse+0x8c/0x3f0 [ 46.167419] ? rtnetlink_put_metrics+0x530/0x530 [ 46.167427] ? kvm_sched_clock_read+0x5/0x10 [ 46.167433] ? sched_clock+0x5/0x10 [ 46.167439] ? sched_clock_cpu+0x18/0x2b0 [ 46.167457] rtnl_newlink+0x9dc/0x13a0 [ 46.167484] ? rtnl_link_unregister+0x2b0/0x2b0 [ 46.167513] ? kvm_clock_read+0x1f/0x30 [ 46.167538] ? print_irqtrace_events+0x280/0x280 [ 46.167546] ? kvm_clock_read+0x1f/0x30 [ 46.167551] ? 
kvm_sched_clock_read+0x5/0x10 [ 46.167557] ? sched_clock+0x5/0x10 [ 46.167562] ? sched_clock_cpu+0x18/0x2b0 [ 46.167567] ? kvm_clock_read+0x1f/0x30 [ 46.167598] ? __lock_acquire+0x9c1/0x4b50 [ 46.167640] ? debug_show_all_locks+0x3b0/0x3b0 [ 46.167646] ? kvm_clock_read+0x1f/0x30 [ 46.167651] ? kvm_sched_clock_read+0x5/0x10 [ 46.167673] ? debug_show_all_locks+0x3b0/0x3b0 [ 46.167698] ? is_bpf_text_address+0x87/0x130 [ 46.167707] ? print_irqtrace_events+0x280/0x280 [ 46.167714] ? kvm_clock_read+0x1f/0x30 [ 46.167718] ? kvm_sched_clock_read+0x5/0x10 [ 46.167723] ? sched_clock+0x5/0x10 [ 46.167728] ? sched_clock_cpu+0x18/0x2b0 [ 46.167753] ? get_lock_stats+0x18/0x160 [ 46.167877] ? __lock_is_held+0xcb/0x1a0 [ 46.167908]
Re: [PATCH net-next 0/6] virtio_net: Add ethtool stat items
On 25/07/2018 10:59 PM, David Miller wrote: From: "Michael S. Tsirkin" Date: Wed, 25 Jul 2018 12:40:12 +0300 On Mon, Jul 23, 2018 at 11:36:03PM +0900, Toshiaki Makita wrote: From: Toshiaki Makita Add some ethtool stat items useful for performance analysis. Signed-off-by: Toshiaki Makita Series: Acked-by: Michael S. Tsirkin Series applied. Patch 1 seems appropriate for stable, even though it's minor. Ok, I'll put patch #1 also into 'net' and queue it up for -stable. Thanks. Hi, Our verification team encountered the following degradation, introduced by this series. Please check KASAN bug report below. We tested before and after the series, but didn't bisect within it. We can do if necessary. Thanks, Tariq [ 46.166544] BUG: KASAN: use-after-free in virtnet_poll+0xd9c/0xfd1 [virtio_net] [ 46.166593] Read of size 8 at addr 8803400da608 by task ip/1013 [ 46.166603] CPU: 3 PID: 1013 Comm: ip Not tainted 4.18.0-rc6-for-upstream-dbg-2018-07-26_19-45-52-64 #1 [ 46.166606] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 [ 46.166609] Call Trace: [ 46.166613] [ 46.166624] dump_stack+0xf0/0x17b [ 46.166647] ? show_regs_print_info+0x5/0x5 [ 46.166654] ? printk+0x9c/0xc3 [ 46.11] ? kmsg_dump_rewind_nolock+0xf5/0xf5 [ 46.166677] ? virtnet_poll+0xd9c/0xfd1 [virtio_net] [ 46.166685] print_address_description+0xea/0x380 [ 46.166701] ? virtnet_poll+0xd9c/0xfd1 [virtio_net] [ 46.166711] kasan_report+0x1dd/0x460 [ 46.166727] ? virtnet_poll+0xd9c/0xfd1 [virtio_net] [ 46.166743] virtnet_poll+0xd9c/0xfd1 [virtio_net] [ 46.166767] ? receive_buf+0x2e30/0x2e30 [virtio_net] [ 46.166796] ? sched_clock_cpu+0x18/0x2b0 [ 46.166809] ? print_irqtrace_events+0x280/0x280 [ 46.166817] ? print_irqtrace_events+0x280/0x280 [ 46.166830] ? rcu_process_callbacks+0xc5e/0x12d0 [ 46.166838] ? kvm_clock_read+0x1f/0x30 [ 46.166857] ? mark_held_locks+0xd5/0x170 [ 46.166867] ? net_rx_action+0x2aa/0x10e0 [ 46.166882] net_rx_action+0x4bc/0x10e0 [ 46.166906] ? 
napi_complete_done+0x480/0x480 [ 46.166925] ? print_irqtrace_events+0x280/0x280 [ 46.166935] ? sched_clock+0x5/0x10 [ 46.166952] ? __lock_is_held+0xcb/0x1a0 [ 46.166982] __do_softirq+0x2c4/0xdf4 [ 46.167010] do_softirq_own_stack+0x2a/0x40 [ 46.167031] [ 46.167038] ? virtnet_napi_enable+0x37/0xa0 [virtio_net] [ 46.167044] do_softirq.part.11+0x69/0x70 [ 46.167065] __local_bh_enable_ip+0x1d9/0x250 [ 46.167076] virtnet_open+0x13c/0x440 [virtio_net] [ 46.167099] __dev_open+0x1cf/0x390 [ 46.167108] ? dev_set_rx_mode+0x30/0x30 [ 46.167113] ? __local_bh_enable_ip+0xf7/0x250 [ 46.167124] ? trace_hardirqs_on_caller+0x38d/0x6c0 [ 46.167130] ? __dev_change_flags+0x18d/0x630 [ 46.167142] __dev_change_flags+0x469/0x630 [ 46.167152] ? dev_set_allmulti+0x10/0x10 [ 46.167157] ? __lock_acquire+0x9c1/0x4b50 [ 46.167166] ? print_irqtrace_events+0x280/0x280 [ 46.167174] ? kvm_clock_read+0x1f/0x30 [ 46.167191] ? rtnetlink_put_metrics+0x530/0x530 [ 46.167205] dev_change_flags+0x8b/0x160 [ 46.167236] do_setlink+0xa17/0x39f0 [ 46.167248] ? sched_clock_cpu+0x18/0x2b0 [ 46.167267] ? rtnl_dump_ifinfo+0xd20/0xd20 [ 46.167277] ? print_irqtrace_events+0x280/0x280 [ 46.167285] ? kvm_clock_read+0x1f/0x30 [ 46.167316] ? debug_show_all_locks+0x3b0/0x3b0 [ 46.167321] ? kvm_sched_clock_read+0x5/0x10 [ 46.167327] ? sched_clock+0x5/0x10 [ 46.167333] ? sched_clock_cpu+0x18/0x2b0 [ 46.167382] ? memset+0x1f/0x40 [ 46.167408] ? nla_parse+0x8c/0x3f0 [ 46.167419] ? rtnetlink_put_metrics+0x530/0x530 [ 46.167427] ? kvm_sched_clock_read+0x5/0x10 [ 46.167433] ? sched_clock+0x5/0x10 [ 46.167439] ? sched_clock_cpu+0x18/0x2b0 [ 46.167457] rtnl_newlink+0x9dc/0x13a0 [ 46.167484] ? rtnl_link_unregister+0x2b0/0x2b0 [ 46.167513] ? kvm_clock_read+0x1f/0x30 [ 46.167538] ? print_irqtrace_events+0x280/0x280 [ 46.167546] ? kvm_clock_read+0x1f/0x30 [ 46.167551] ? kvm_sched_clock_read+0x5/0x10 [ 46.167557] ? sched_clock+0x5/0x10 [ 46.167562] ? sched_clock_cpu+0x18/0x2b0 [ 46.167567] ? kvm_clock_read+0x1f/0x30 [ 46.167598] ? 
__lock_acquire+0x9c1/0x4b50 [ 46.167640] ? debug_show_all_locks+0x3b0/0x3b0 [ 46.167646] ? kvm_clock_read+0x1f/0x30 [ 46.167651] ? kvm_sched_clock_read+0x5/0x10 [ 46.167673] ? debug_show_all_locks+0x3b0/0x3b0 [ 46.167698] ? is_bpf_text_address+0x87/0x130 [ 46.167707] ? print_irqtrace_events+0x280/0x280 [ 46.167714] ? kvm_clock_read+0x1f/0x30 [ 46.167718] ? kvm_sched_clock_read+0x5/0x10 [ 46.167723] ? sched_clock+0x5/0x10 [ 46.167728] ? sched_clock_cpu+0x18/0x2b0 [ 46.167753] ? get_lock_stats+0x18/0x160 [ 46.167877] ? __lock_is_held+0xcb/0x1a0 [ 46.167908] rtnetlink_rcv_msg+0x575/0xaa0 [ 46.167913] ? kvm_clock_read+0x1f/0x30 [ 46.167925] ? validate_linkmsg+0x900/0x900 [ 46.167945] ?
Re: [RFC 0/4] Virtio uses DMA API for all devices
On Sun, Aug 05, 2018 at 11:10:15AM +1000, Benjamin Herrenschmidt wrote:
> - One you have rejected, which is to have a way for "no-iommu" virtio
> (which still doesn't use an iommu on the qemu side and doesn't need
> to), to be forced to use some custom DMA ops on the VM side.
>
> - One, which sadly has more overhead and will require modifying more
> pieces of the puzzle, which is to make qemu use an emulated iommu.
> Once we make qemu do that, we can then layer swiotlb on top of the
> emulated iommu on the guest side, and pass that as dma_ops to virtio.

Or number three: have a virtio feature bit that tells the VM to use
whatever dma ops the platform thinks are appropriate for the bus it
pretends to be on. Then set a dma-range that is limited to your secure
memory range (if you really need it to be runtime enabled only after a
device reset that rescans) and use the normal dma mapping code to
bounce buffer.
Re: [RFC 0/4] Virtio uses DMA API for all devices
On Sun, Aug 05, 2018 at 03:09:55AM +0300, Michael S. Tsirkin wrote:
> So in this case however I'm not sure what exactly do we want to add.
> It seems that from the point of view of the device, there is nothing
> special - it just gets a PA and writes there. It also seems that the
> guest does not need to get any info from the device either. Instead
> the guest itself needs the device to DMA into specific addresses, for
> its own reasons.
>
> It seems that the fact that within the guest it's implemented using a
> bounce buffer, and that it's easiest to do by switching virtio to use
> the DMA API, isn't something the virtio spec concerns itself with.

And that is exactly what we added bus_dma_mask for - the case where
the device itself has no limitation (or a bigger one), but the
platform limits the accessible dma ranges. One typical case is a PCIe
root port that is only connected to the CPU through an interconnect
that is limited to 32 address bits, for example.
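The kind of check a bus-level DMA mask enables can be sketched in a few lines: a transfer is acceptable only if its highest byte address fits under the mask, regardless of what the device itself could address. This is an illustrative userspace model (the dma_capable name and shape are borrowed loosely from the kernel's helper, not its actual implementation); DMA_BIT_MASK mirrors the kernel macro of the same name.

```c
#include <assert.h>
#include <stdint.h>

/* All-ones mask covering the low n address bits, as in the kernel macro. */
#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/* Is the whole range [addr, addr + len) addressable under the bus mask? */
static int dma_capable(uint64_t addr, uint64_t len, uint64_t bus_dma_mask)
{
    return len > 0 && addr + len - 1 <= bus_dma_mask;
}
```

A 32bit-limited interconnect rejects any buffer crossing the 4GB boundary even when the device advertises a 64bit mask — which is exactly when the core falls back to bounce buffering through memory that does satisfy the mask.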