Re: virtio_balloon regression in 5.19-rc3
On Tue, 2022-06-21 at 17:34 +0800, Jason Wang wrote: > On Tue, Jun 21, 2022 at 5:24 PM David Hildenbrand wrote: > > > > On 20.06.22 20:49, Ben Hutchings wrote: > > > I've tested a 5.19-rc3 kernel on top of QEMU/KVM with machine type > > > pc-q35-5.2. It has a virtio balloon device defined in libvirt as: > > > > > > > > >> > function="0x0"/> > > > > > > > > > but the virtio_balloon driver fails to bind to it: > > > > > > virtio_balloon virtio4: init_vqs: add stat_vq failed > > > virtio_balloon: probe of virtio4 failed with error -5 > > > > > > > Hmm, I don't see any recent changes to drivers/virtio/virtio_balloon.c > > > > virtqueue_add_outbuf() fails with -EIO if I'm not wrong. That's the > > first call of virtqueue_add_outbuf() when virtio_balloon initializes. > > > > > > Maybe something in generic virtio code changed? > > Yes, we introduced the IRQ hardening. That could be the root cause and > we've received lots of reports so we decide to disable it by default. > > Ben, could you please try this patch: (and make sure > CONFIG_VIRTIO_HARDEN_NOTIFICATION is not set) > > https://lore.kernel.org/lkml/20220620024158.2505-1-jasow...@redhat.com/T/ Yes, that patch fixes the regression for me. Ben. -- Ben Hutchings Any smoothly functioning technology is indistinguishable from a rigged demo. signature.asc Description: This is a digitally signed message part ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH 3.16 093/245] virtio-balloon: fix managed page counts when migrating pages between zones
[ 186.026321] Offlined Pages 32768 [ 187.684861] Offlined Pages 32768 [ 189.227013] Offlined Pages 32768 [ 190.830303] Offlined Pages 32768 [ 190.833071] Built 1 zonelists, mobility grouping on. Total pages: -36920272750453009 In another instance (older kernel), I was no longer able to start any process: [root@vm ~]# [ 214.348068] Offlined Pages 32768 [ 215.973009] Offlined Pages 32768 cat /proc/meminfo -bash: fork: Cannot allocate memory [root@vm ~]# cat /proc/meminfo -bash: fork: Cannot allocate memory Fix it by properly adjusting the managed page count when migrating if the zone changed. The managed page count of the zones now looks after unplug of the DIMM (and after deflating the balloon) just like before inflating the balloon (and plugging+onlining the DIMM). We'll temporarily modify the totalram page count. If this ever becomes a problem, we can fine tune by providing helpers that don't touch the totalram pages (e.g., adjust_zone_managed_page_count()). Please note that fixing up the managed page count is only necessary when we adjusted the managed page count when inflating - only if we don't have VIRTIO_BALLOON_F_DEFLATE_ON_OOM. With that feature, the managed page count is not touched when inflating/deflating. Reported-by: Yumei Huang Fixes: 3dcc0571cd64 ("mm: correctly update zone->managed_pages") Cc: "Michael S. Tsirkin" Cc: Jason Wang Cc: Jiang Liu Cc: Andrew Morton Cc: Igor Mammedov Cc: virtualization@lists.linux-foundation.org Signed-off-by: David Hildenbrand Signed-off-by: Michael S. Tsirkin [bwh: Backported to 3.16: Deflate-on-OOM is not supported at all so don't check that flag] Signed-off-by: Ben Hutchings --- --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -403,6 +403,16 @@ static int virtballoon_migratepage(struc get_page(newpage); /* balloon reference */ + /* + * When we migrate a page to a different zone and adjusted the + * managed page count when inflating, we have to fixup the count of + * both involved zones. + */ + if (page_zone(page) != page_zone(newpage)) { + adjust_managed_page_count(page, 1); + adjust_managed_page_count(newpage, -1); + } + /* balloon's page migration 1st step -- inflate "newpage" */ spin_lock_irqsave(_dev_info->pages_lock, flags); balloon_page_insert(newpage, mapping, _dev_info->pages); ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH 3.16 140/305] x86/hyper-v: Enable PIT shutdown quirk
3.16.63-rc1 review patch. If anyone has any objections, please let me know. -- From: Michael Kelley commit 1de72c706488b7be664a601cf3843bd01e327e58 upstream. Hyper-V emulation of the PIT has a quirk such that the normal PIT shutdown path doesn't work, because clearing the counter register restarts the timer. Disable the counter clearing on PIT shutdown. Signed-off-by: Michael Kelley Signed-off-by: Thomas Gleixner Cc: "gre...@linuxfoundation.org" Cc: "de...@linuxdriverproject.org" Cc: "daniel.lezc...@linaro.org" Cc: "virtualization@lists.linux-foundation.org" Cc: "jgr...@suse.com" Cc: "akata...@vmware.com" Cc: "o...@aepfle.de" Cc: "a...@canonical.com" Cc: vkuznets Cc: "jasow...@redhat.com" Cc: "marcelo.ce...@canonical.com" Cc: KY Srinivasan Link: https://lkml.kernel.org/r/1541303219-11142-3-git-send-email-mikel...@microsoft.com [bwh: Backported to 3.16: adjust context] Signed-off-by: Ben Hutchings --- arch/x86/kernel/cpu/mshyperv.c | 11 +++ 1 file changed, 11 insertions(+) --- a/arch/x86/kernel/cpu/mshyperv.c +++ b/arch/x86/kernel/cpu/mshyperv.c @@ -18,6 +18,7 @@ #include #include #include +#include #include #include #include @@ -143,6 +144,16 @@ static void __init ms_hyperv_init_platfo no_timer_check = 1; #endif + /* +* Hyper-V VMs have a PIT emulation quirk such that zeroing the +* counter register during PIT shutdown restarts the PIT. So it +* continues to interrupt @18.2 HZ. Setting i8253_clear_counter +* to false tells pit_shutdown() not to zero the counter so that +* the PIT really is shutdown. Generation 2 VMs don't have a PIT, +* and setting this value has no effect. +*/ + i8253_clear_counter_on_shutdown = false; + } const __refconst struct hypervisor_x86 x86_hyper_ms_hyperv = { ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH 3.16 139/305] clockevents/drivers/i8253: Add support for PIT shutdown quirk
3.16.63-rc1 review patch. If anyone has any objections, please let me know. -- From: Michael Kelley commit 35b69a420bfb56b7b74cb635ea903db05e357bec upstream. Add support for platforms where pit_shutdown() doesn't work because of a quirk in the PIT emulation. On these platforms setting the counter register to zero causes the PIT to start running again, negating the shutdown. Provide a global variable that controls whether the counter register is zero'ed, which platform specific code can override. Signed-off-by: Michael Kelley Signed-off-by: Thomas Gleixner Cc: "gre...@linuxfoundation.org" Cc: "de...@linuxdriverproject.org" Cc: "daniel.lezc...@linaro.org" Cc: "virtualization@lists.linux-foundation.org" Cc: "jgr...@suse.com" Cc: "akata...@vmware.com" Cc: "o...@aepfle.de" Cc: "a...@canonical.com" Cc: vkuznets Cc: "jasow...@redhat.com" Cc: "marcelo.ce...@canonical.com" Cc: KY Srinivasan Link: https://lkml.kernel.org/r/1541303219-11142-2-git-send-email-mikel...@microsoft.com [bwh: Backported to 3.16: - Don't use __ro_after_init - Adjust context, indentation] Signed-off-by: Ben Hutchings --- drivers/clocksource/i8253.c | 14 -- include/linux/i8253.h | 1 + 2 files changed, 13 insertions(+), 2 deletions(-) --- a/drivers/clocksource/i8253.c +++ b/drivers/clocksource/i8253.c @@ -19,6 +19,13 @@ DEFINE_RAW_SPINLOCK(i8253_lock); EXPORT_SYMBOL(i8253_lock); +/* + * Handle PIT quirk in pit_shutdown() where zeroing the counter register + * restarts the PIT, negating the shutdown. On platforms with the quirk, + * platform specific code can set this to false. + */ +bool i8253_clear_counter_on_shutdown = true; + #ifdef CONFIG_CLKSRC_I8253 /* * Since the PIT overflows every tick, its not very useful @@ -123,8 +130,11 @@ static void init_pit_timer(enum clock_ev if (evt->mode == CLOCK_EVT_MODE_PERIODIC || evt->mode == CLOCK_EVT_MODE_ONESHOT) { outb_p(0x30, PIT_MODE); - outb_p(0, PIT_CH0); - outb_p(0, PIT_CH0); + + if (i8253_clear_counter_on_shutdown) { + outb_p(0, PIT_CH0); + outb_p(0, PIT_CH0); + } } break; --- a/include/linux/i8253.h +++ b/include/linux/i8253.h @@ -21,6 +21,7 @@ #define PIT_LATCH ((PIT_TICK_RATE + HZ/2) / HZ) extern raw_spinlock_t i8253_lock; +extern bool i8253_clear_counter_on_shutdown; extern struct clock_event_device i8253_clockevent; extern void clockevent_i8253_init(bool oneshot); ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH 3.16 249/328] x86/paravirt: Fix some warning messages
3.16.62-rc1 review patch. If anyone has any objections, please let me know. -- From: Dan Carpenter commit 571d0563c8881595f4ab027aef9ed1c55e3e7b7c upstream. The first argument to WARN_ONCE() is a condition. Fixes: 5800dc5c19f3 ("x86/paravirt: Fix spectre-v2 mitigations for paravirt guests") Signed-off-by: Dan Carpenter Signed-off-by: Thomas Gleixner Reviewed-by: Juergen Gross Cc: Peter Zijlstra Cc: Alok Kataria Cc: "H. Peter Anvin" Cc: virtualization@lists.linux-foundation.org Cc: kernel-janit...@vger.kernel.org Link: https://lkml.kernel.org/r/20180919103553.GD9238@mwanda Signed-off-by: Ben Hutchings --- arch/x86/kernel/paravirt.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) --- a/arch/x86/kernel/paravirt.c +++ b/arch/x86/kernel/paravirt.c @@ -99,7 +99,7 @@ unsigned paravirt_patch_call(void *insnb if (len < 5) { #ifdef CONFIG_RETPOLINE - WARN_ONCE("Failing to patch indirect CALL in %ps\n", (void *)addr); + WARN_ONCE(1, "Failing to patch indirect CALL in %ps\n", (void *)addr); #endif return len; /* call too long for patch site */ } @@ -119,7 +119,7 @@ unsigned paravirt_patch_jmp(void *insnbu if (len < 5) { #ifdef CONFIG_RETPOLINE - WARN_ONCE("Failing to patch indirect JMP in %ps\n", (void *)addr); + WARN_ONCE(1, "Failing to patch indirect JMP in %ps\n", (void *)addr); #endif return len; /* call too long for patch site */ } ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/3] ipv6: Select fragment id during UFO/GSO segmentation if not set.
On Wed, 2015-01-28 at 11:46 +0200, Michael S. Tsirkin wrote: On Wed, Jan 28, 2015 at 09:25:08AM +0100, Hannes Frederic Sowa wrote: [...] I see fragmentation id generation still as security critical: When Eric patched the frag id generator in 04ca6973f7c1a0d (ip: make IP identifiers less predictable) I could patch my kernels and use the patch regardless of the machine being virtualized or not. It was not dependent on the hypervisor. And now it's even easier - just patch the hypervisor, and all VMs automatically benefit. [...] You are advocating that the hypervisor should continue to act as a middle-box that quietly modifies packets. This may be useful to protect guests that have poor fragment ID generation, but then that should be an optional netfilter module and *not* the default. The default should be that UFO has no effect on the packet headers on the wire, and therefore that the fragment ID is chosen by the IPv6 stack in the guest. Ben. -- Ben Hutchings Teamwork is essential - it allows you to blame someone else. signature.asc Description: This is a digitally signed message part ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/3] ipv6: Select fragment id during UFO/GSO segmentation if not set.
On Mon, 2015-01-26 at 09:37 -0500, Vladislav Yasevich wrote: If the IPv6 fragment id has not been set and we perform fragmentation due to UFO, select a new fragment id. When we store the fragment id into skb_shinfo, set the bit in the skb so we can re-use the selected id. This preserves the behavior of UFO packets generated on the host and solves the issue of id generation for packet sockets and tap/macvtap devices. This patch moves ipv6_select_ident() back in to the header file. It also provides the helper function that sets skb_shinfo() frag id and sets the bit. It also makes sure that we select the fragment id when doing just gso validation, since it's possible for the packet to come from an untrusted source (VM) and be forwarded through a UFO enabled device which will expect the fragment id. CC: Eric Dumazet eduma...@google.com Signed-off-by: Vladislav Yasevich vyase...@redhat.com --- include/linux/skbuff.h | 3 ++- include/net/ipv6.h | 2 ++ net/ipv6/ip6_output.c | 4 ++-- net/ipv6/output_core.c | 9 - net/ipv6/udp_offload.c | 10 +- 5 files changed, 23 insertions(+), 5 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 85ab7d7..3ad5203 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -605,7 +605,8 @@ struct sk_buff { __u8ipvs_property:1; __u8inner_protocol_type:1; __u8remcsum_offload:1; - /* 3 or 5 bit hole */ + __u8ufo_fragid_set:1; [...] Doesn't the flag belong in struct skb_shared_info, rather than struct sk_buff? Otherwise this looks fine. Ben. -- Ben Hutchings When in doubt, use brute force. - Ken Thompson signature.asc Description: This is a digitally signed message part ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 3/3] Revert drivers/net: Disable UFO through virtio
On Mon, 2015-01-26 at 09:37 -0500, Vladislav Yasevich wrote: This reverts commit 3d0ad09412ffe00c9afa201d01effdb6023d09b4. Now that UFO functionality can correctly track the fragment id has been selected and select a fragment id if necessary, we can re-enable UFO on tap/macvap and virtio devices. Signed-off-by: Vladislav Yasevich vyase...@redhat.com [...] Acked-by: Ben Hutchings b...@decadent.org.uk I think this is sensible even before your other patches. Ben. -- Ben Hutchings When in doubt, use brute force. - Ken Thompson signature.asc Description: This is a digitally signed message part ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 00/10] Split UFO into v4 and v6 versions.
On Thu, 2014-12-18 at 00:28 -0500, Jason Wang wrote: - Original Message - UFO support in the kernel applies to both IPv4 and IPv6 protocols with the same device feature. However some devices may not be able to support one of the offloads. For this we split the UFO offload feature into 2 pieces. NETIF_F_UFO now controlls the IPv4 part and this series introduces NETIF_F_UFO6. As a result of this work, we can now re-enable NETIF_F_UFO on virtio_net devices and restore UDP over IPv4 performance for guests. We also continue to support legacy guests that assume that UFO6 support included into UFO(4). Without this work, migrating a guest to a 3.18 kernel fails. This series eliminate the ambiguous NETIF_F_UFO. But VIRTIO_NET_F_{HOST|GUEST}_UFO and VIRTIO_NET_F_HDR_GSO_UDP is still ambigious. I know it was used to keep compatibility for legacy guest. But what's the future plan? Differentiate UFOv4 and UFOv6 in virtio features and gso type in vnet header looks sufficient? [...] The IPv6 fragmentation ID needs to be added to the vnet header, to do UFOv6 properly. If it wasn't for that lack, we wouldn't have to split the feature flag. Ben. -- Ben Hutchings If more than one person is responsible for a bug, no one is at fault. signature.asc Description: This is a digitally signed message part ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 01/10] core: Split out UFO6 support
On Wed, 2014-12-17 at 13:20 -0500, Vladislav Yasevich wrote: Split IPv6 support for UFO into its own feature similiar to TSO. This will later allow us to re-enable UFO support for virtio-net devices. [...] diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 6c8b6f6..8538b67 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -372,6 +372,7 @@ enum { SKB_GSO_MPLS = 1 12, + SKB_GSO_UDP6 = 1 13 It seems like it would be cleaner to use the names SKB_GSO_UDPV{4,6}, similarly to SKB_GSO_TCPV{4,6}. }; #if BITS_PER_LONG 32 diff --git a/net/core/dev.c b/net/core/dev.c index 945bbd0..fa4d2ee 100644 --- a/net/core/dev.c +++ b/net/core/dev.c [...] @@ -5952,24 +5958,21 @@ static netdev_features_t netdev_fix_features(struct net_device *dev, [...] + /* UFO also needs checksumming */ + if ((features NETIF_F_UFO) !(features NETIF_F_GEN_CSUM) + !(features NETIF_F_IP_CSUM)) { You can use !(features NETIF_F_V4_CSUM) instead of the last two terms. + netdev_dbg(dev, +Dropping NETIF_F_UFO since no checksum offload features.\n); + features = ~NETIF_F_UFO; + } + if ((features NETIF_F_UFO6) !(features NETIF_F_GEN_CSUM) + !(features NETIF_F_IPV6_CSUM)) { [...] Similarly you can use !(features NETIF_F_V6_CSUM) instead of the last two terms. Aside from those minor points, this looks fine. Ben. -- Ben Hutchings Absolutum obsoletum. (If it works, it's out of date.) - Stafford Beer signature.asc Description: This is a digitally signed message part ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v7 28/46] vhost: make features 64 bit
On Sun, 2014-11-30 at 18:44 +0300, Sergei Shtylyov wrote: Hello. On 11/30/2014 6:11 PM, Michael S. Tsirkin wrote: We need to use bit 32 for virtio 1.0 Signed-off-by: Michael S. Tsirkin m...@redhat.com Reviewed-by: Jason Wang jasow...@redhat.com --- drivers/vhost/vhost.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h index 3eda654..c624b09 100644 --- a/drivers/vhost/vhost.h +++ b/drivers/vhost/vhost.h @@ -106,7 +106,7 @@ struct vhost_virtqueue { /* Protected by virtqueue mutex. */ struct vhost_memory *memory; void *private_data; - unsigned acked_features; + u64 acked_features; /* Log write descriptors */ void __user *log_base; struct vhost_log *log; @@ -174,6 +174,6 @@ enum { static inline int vhost_has_feature(struct vhost_virtqueue *vq, int bit) { - return vq-acked_features (1 bit); + return vq-acked_features (1ULL bit); Erm, wouldn't the high word be just dropped when returning *int*? I think you need !!(vq-acked_features (1ULL bit)). Or change the return type to bool. Ben. -- Ben Hutchings The first rule of tautology club is the first rule of tautology club. signature.asc Description: This is a digitally signed message part ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v2 net 1/2] drivers/net: Disable UFO through virtio
On Wed, 2014-11-19 at 11:14 +0200, Michael S. Tsirkin wrote: On Thu, Oct 30, 2014 at 06:27:12PM +, Ben Hutchings wrote: IPv6 does not allow fragmentation by routers, so there is no fragmentation ID in the fixed header. UFO for IPv6 requires the ID to be passed separately, but there is no provision for this in the virtio net protocol. Until recently our software implementation of UFO/IPv6 generated a new ID, but this was a bug. Now we will use ID=0 for any UFO/IPv6 packet passed through a tap, which is even worse. Unfortunately there is no distinction between UFO/IPv4 and v6 features, so disable UFO on taps and virtio_net completely until we have a proper solution. We cannot depend on VM managers respecting the tap feature flags, so keep accepting UFO packets but log a warning the first time we do this. Signed-off-by: Ben Hutchings b...@decadent.org.uk Fixes: 916e4cf46d02 (ipv6: reuse ip6_frag_id from ip6_ufo_append_data) There's something I don't understand here. I see: NETIF_F_UFO_BIT,/* ... UDPv4 fragmentation */ this comment is wrong then? Yes. The patches drastically regress performance for UDPv4 for VMs only, but isn't it likely many other devices based their code on this comment? There's only one hardware driver that implements UFO (s2io), and it does handle IPv6. How about we disable UFO for IPv6 globally, and put the flag back in? We can then gradually add NETIF_F_UFO6_BIT for devices that actually support UFO for IPv6. Since the corresponding virtio feature bit is understood to include UFO/IPv6, and existing VMs rely on that, I don't see what this solves. Ben. -- Ben Hutchings Beware of bugs in the above code; I have only proved it correct, not tried it. - Donald Knuth signature.asc Description: This is a digitally signed message part ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: TUN_F_UFO change breaks live migration
On Tue, 2014-11-11 at 10:58 +, Stefan Hajnoczi wrote: Commit 3d0ad09412ffe00c9afa201d01effdb6023d09b4 (drivers/net: Disable UFO through virtio) breaks live migration of KVM guests from older to newer host kernels: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=3d0ad09412ffe00c9afa201d01effdb6023d09b4 The problem occurs when a guest running on a host kernel without commit 3d0ad0941 in tun.ko attempts to live migration to a host with commit 3d0ad0941. Live migration fails in QEMU with the following error message: virtio-net: saved image requires TUN_F_UFO support The old host tun.ko advertised support for TUN_F_UFO. The new host tun.ko does not and that's why QEMU aborts live migration. QEMU cannot change the features of a running virtio-net device. Yes, this is known and was mentioned in the DSA. tuxcrafter provided logs from two Debian hosts migrating from 3.2.60-1+deb7u3 to 3.2.63-2+deb7u1: http://paste.debian.net/131264/ I haven't investigated enough to suggest a fix, just wanted to bring it to your attention. Soon a lot of people will be hitting this problem as they upgrade their infrastructure and migrate guests - seems like a critical issue. You can work around this by making macvtap and tun still claim to support UFO. They continue to support it even if it's not advertised because the tap features don't reliably get propagated to virtio devices. Ben. -- Ben Hutchings Experience is directly proportional to the value of equipment destroyed. - Carolyn Scheppner signature.asc Description: This is a digitally signed message part ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: TUN_F_UFO change breaks live migration
On Tue, 2014-11-11 at 17:57 +0200, Michael S. Tsirkin wrote: On Tue, Nov 11, 2014 at 12:17:26PM +, Ben Hutchings wrote: On Tue, 2014-11-11 at 10:58 +, Stefan Hajnoczi wrote: Commit 3d0ad09412ffe00c9afa201d01effdb6023d09b4 (drivers/net: Disable UFO through virtio) breaks live migration of KVM guests from older to newer host kernels: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=3d0ad09412ffe00c9afa201d01effdb6023d09b4 The problem occurs when a guest running on a host kernel without commit 3d0ad0941 in tun.ko attempts to live migration to a host with commit 3d0ad0941. Live migration fails in QEMU with the following error message: virtio-net: saved image requires TUN_F_UFO support The old host tun.ko advertised support for TUN_F_UFO. The new host tun.ko does not and that's why QEMU aborts live migration. QEMU cannot change the features of a running virtio-net device. Yes, this is known and was mentioned in the DSA. tuxcrafter provided logs from two Debian hosts migrating from 3.2.60-1+deb7u3 to 3.2.63-2+deb7u1: http://paste.debian.net/131264/ I haven't investigated enough to suggest a fix, just wanted to bring it to your attention. Soon a lot of people will be hitting this problem as they upgrade their infrastructure and migrate guests - seems like a critical issue. You can work around this by making macvtap and tun still claim to support UFO. If this is what we want userspace to do, let's just put the feature flag back? Let's not *just* put the feature flag back. I accept this is probably needed as a workaround, but UFO/IPv6 still won't work correctly over virtio. Basically userspace assumed that features will only ever be added, never removed, so this change is breaking it. OK. They continue to support it even if it's not advertised because the tap features don't reliably get propagated to virtio devices. Ben. Hmm I don't understand this last sentence. features are actually reliably propagated to virtio devices. They might be when using current QEMU and libvirt on the host. They weren't when I tested on Debian stable. The warnings about 'using disabled UFO feature' are reliably triggered if the host's tap driver is patched and the guest's virtio_net driver is not. Ben. -- Ben Hutchings Experience is directly proportional to the value of equipment destroyed. - Carolyn Scheppner signature.asc Description: This is a digitally signed message part ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH net] Revert drivers/net: Disable UFO through virtio in macvtap and tun
This reverts commit 88e0e0e5aa722b193c8758c8b45d041de5316924 for the tap drivers, but leaves UFO disabled in virtio_net. libvirt at least assumes that tap features will never be dropped in new kernel versions, and doing so prevents migration of VMs to the never kernel version while they are running with virtio net devices. Fixes: 88e0e0e5aa7a (drivers/net: Disable UFO through virtio) Signed-off-by: Ben Hutchings b...@decadent.org.uk --- Compile-tested only. Ben. drivers/net/macvtap.c | 13 - drivers/net/tun.c | 19 --- 2 files changed, 16 insertions(+), 16 deletions(-) diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c index 6f226de..aeaeb6d 100644 --- a/drivers/net/macvtap.c +++ b/drivers/net/macvtap.c @@ -66,7 +66,7 @@ static struct cdev macvtap_cdev; static const struct proto_ops macvtap_socket_ops; #define TUN_OFFLOADS (NETIF_F_HW_CSUM | NETIF_F_TSO_ECN | NETIF_F_TSO | \ - NETIF_F_TSO6) + NETIF_F_TSO6 | NETIF_F_UFO) #define RX_OFFLOADS (NETIF_F_GRO | NETIF_F_LRO) #define TAP_FEATURES (NETIF_F_GSO | NETIF_F_SG) @@ -570,8 +570,6 @@ static int macvtap_skb_from_vnet_hdr(struct sk_buff *skb, gso_type = SKB_GSO_TCPV6; break; case VIRTIO_NET_HDR_GSO_UDP: - pr_warn_once(macvtap: %s: using disabled UFO feature; please fix this program\n, -current-comm); gso_type = SKB_GSO_UDP; if (skb-protocol == htons(ETH_P_IPV6)) ipv6_proxy_select_ident(skb); @@ -619,6 +617,8 @@ static void macvtap_skb_to_vnet_hdr(const struct sk_buff *skb, vnet_hdr-gso_type = VIRTIO_NET_HDR_GSO_TCPV4; else if (sinfo-gso_type SKB_GSO_TCPV6) vnet_hdr-gso_type = VIRTIO_NET_HDR_GSO_TCPV6; + else if (sinfo-gso_type SKB_GSO_UDP) + vnet_hdr-gso_type = VIRTIO_NET_HDR_GSO_UDP; else BUG(); if (sinfo-gso_type SKB_GSO_TCP_ECN) @@ -953,6 +953,9 @@ static int set_offload(struct macvtap_queue *q, unsigned long arg) if (arg TUN_F_TSO6) feature_mask |= NETIF_F_TSO6; } + + if (arg TUN_F_UFO) + feature_mask |= NETIF_F_UFO; } /* tun/tap driver inverts the usage for TSO offloads, where @@ -963,7 +966,7 @@ static int set_offload(struct macvtap_queue *q, unsigned long arg) * When user space turns off TSO, we turn off GSO/LRO so that * user-space will not receive TSO frames. */ - if (feature_mask (NETIF_F_TSO | NETIF_F_TSO6)) + if (feature_mask (NETIF_F_TSO | NETIF_F_TSO6 | NETIF_F_UFO)) features |= RX_OFFLOADS; else features = ~RX_OFFLOADS; @@ -1064,7 +1067,7 @@ static long macvtap_ioctl(struct file *file, unsigned int cmd, case TUNSETOFFLOAD: /* let the user check for future flags */ if (arg ~(TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6 | - TUN_F_TSO_ECN)) + TUN_F_TSO_ECN | TUN_F_UFO)) return -EINVAL; rtnl_lock(); diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 7302398..a0987d1 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -175,7 +175,7 @@ struct tun_struct { struct net_device *dev; netdev_features_t set_features; #define TUN_USER_FEATURES (NETIF_F_HW_CSUM|NETIF_F_TSO_ECN|NETIF_F_TSO| \ - NETIF_F_TSO6) + NETIF_F_TSO6|NETIF_F_UFO) int vnet_hdr_sz; int sndbuf; @@ -1152,20 +1152,10 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile, skb_shinfo(skb)-gso_type = SKB_GSO_TCPV6; break; case VIRTIO_NET_HDR_GSO_UDP: - { - static bool warned; - - if (!warned) { - warned = true; - netdev_warn(tun-dev, - %s: using disabled UFO feature; please fix this program\n, - current-comm); - } skb_shinfo(skb)-gso_type = SKB_GSO_UDP; if (skb-protocol == htons(ETH_P_IPV6)) ipv6_proxy_select_ident(skb); break; - } default: tun-dev-stats.rx_frame_errors++; kfree_skb(skb); @@ -1265,6 +1255,8 @@ static ssize_t tun_put_user(struct tun_struct *tun
[PATCH v2 net 0/2] drivers/net,ipv6: Fix IPv6 fragment ID selection for virtio
The virtio net protocol supports UFO but does not provide for passing a fragment ID for fragmentation of IPv6 packets. We used to generate a fragment ID wherever such a packet was fragmented, but currently we always use ID=0! v2: Add blank lines after declarations Ben. Ben Hutchings (2): drivers/net: Disable UFO through virtio drivers/net,ipv6: Select IPv6 fragment idents for virtio UFO packets drivers/net/macvtap.c| 16 drivers/net/tun.c| 25 - drivers/net/virtio_net.c | 24 ++-- include/net/ipv6.h | 2 ++ net/ipv6/output_core.c | 34 ++ 5 files changed, 74 insertions(+), 27 deletions(-) -- Ben Hutchings The program is absolutely right; therefore, the computer must be wrong. signature.asc Description: This is a digitally signed message part ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v2 net 1/2] drivers/net: Disable UFO through virtio
IPv6 does not allow fragmentation by routers, so there is no fragmentation ID in the fixed header. UFO for IPv6 requires the ID to be passed separately, but there is no provision for this in the virtio net protocol. Until recently our software implementation of UFO/IPv6 generated a new ID, but this was a bug. Now we will use ID=0 for any UFO/IPv6 packet passed through a tap, which is even worse. Unfortunately there is no distinction between UFO/IPv4 and v6 features, so disable UFO on taps and virtio_net completely until we have a proper solution. We cannot depend on VM managers respecting the tap feature flags, so keep accepting UFO packets but log a warning the first time we do this. Signed-off-by: Ben Hutchings b...@decadent.org.uk Fixes: 916e4cf46d02 (ipv6: reuse ip6_frag_id from ip6_ufo_append_data) --- drivers/net/macvtap.c| 13 + drivers/net/tun.c| 19 +++ drivers/net/virtio_net.c | 24 ++-- 3 files changed, 30 insertions(+), 26 deletions(-) diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c index 65e2892..2aeaa61 100644 --- a/drivers/net/macvtap.c +++ b/drivers/net/macvtap.c @@ -65,7 +65,7 @@ static struct cdev macvtap_cdev; static const struct proto_ops macvtap_socket_ops; #define TUN_OFFLOADS (NETIF_F_HW_CSUM | NETIF_F_TSO_ECN | NETIF_F_TSO | \ - NETIF_F_TSO6 | NETIF_F_UFO) + NETIF_F_TSO6) #define RX_OFFLOADS (NETIF_F_GRO | NETIF_F_LRO) #define TAP_FEATURES (NETIF_F_GSO | NETIF_F_SG) @@ -569,6 +569,8 @@ static int macvtap_skb_from_vnet_hdr(struct sk_buff *skb, gso_type = SKB_GSO_TCPV6; break; case VIRTIO_NET_HDR_GSO_UDP: + pr_warn_once(macvtap: %s: using disabled UFO feature; please fix this program\n, +current-comm); gso_type = SKB_GSO_UDP; break; default: @@ -614,8 +616,6 @@ static void macvtap_skb_to_vnet_hdr(const struct sk_buff *skb, vnet_hdr-gso_type = VIRTIO_NET_HDR_GSO_TCPV4; else if (sinfo-gso_type SKB_GSO_TCPV6) vnet_hdr-gso_type = VIRTIO_NET_HDR_GSO_TCPV6; - else if (sinfo-gso_type SKB_GSO_UDP) - vnet_hdr-gso_type = VIRTIO_NET_HDR_GSO_UDP; else BUG(); if (sinfo-gso_type SKB_GSO_TCP_ECN) @@ -950,9 +950,6 @@ static int set_offload(struct macvtap_queue *q, unsigned long arg) if (arg TUN_F_TSO6) feature_mask |= NETIF_F_TSO6; } - - if (arg TUN_F_UFO) - feature_mask |= NETIF_F_UFO; } /* tun/tap driver inverts the usage for TSO offloads, where @@ -963,7 +960,7 @@ static int set_offload(struct macvtap_queue *q, unsigned long arg) * When user space turns off TSO, we turn off GSO/LRO so that * user-space will not receive TSO frames. */ - if (feature_mask (NETIF_F_TSO | NETIF_F_TSO6 | NETIF_F_UFO)) + if (feature_mask (NETIF_F_TSO | NETIF_F_TSO6)) features |= RX_OFFLOADS; else features = ~RX_OFFLOADS; @@ -1064,7 +1061,7 @@ static long macvtap_ioctl(struct file *file, unsigned int cmd, case TUNSETOFFLOAD: /* let the user check for future flags */ if (arg ~(TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6 | - TUN_F_TSO_ECN | TUN_F_UFO)) + TUN_F_TSO_ECN)) return -EINVAL; rtnl_lock(); diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 186ce54..280d3d2 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -174,7 +174,7 @@ struct tun_struct { struct net_device *dev; netdev_features_t set_features; #define TUN_USER_FEATURES (NETIF_F_HW_CSUM|NETIF_F_TSO_ECN|NETIF_F_TSO| \ - NETIF_F_TSO6|NETIF_F_UFO) + NETIF_F_TSO6) int vnet_hdr_sz; int sndbuf; @@ -1149,8 +1149,18 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile, skb_shinfo(skb)-gso_type = SKB_GSO_TCPV6; break; case VIRTIO_NET_HDR_GSO_UDP: + { + static bool warned; + + if (!warned) { + warned = true; + netdev_warn(tun-dev, + %s: using disabled UFO feature; please fix this program\n, + current-comm); + } skb_shinfo(skb)-gso_type = SKB_GSO_UDP; break
[PATCH v2 net 2/2] drivers/net,ipv6: Select IPv6 fragment idents for virtio UFO packets
UFO is now disabled on all drivers that work with virtio net headers, but userland may try to send UFO/IPv6 packets anyway. Instead of sending with ID=0, we should select identifiers on their behalf (as we used to). Signed-off-by: Ben Hutchings b...@decadent.org.uk Fixes: 916e4cf46d02 (ipv6: reuse ip6_frag_id from ip6_ufo_append_data) --- drivers/net/macvtap.c | 3 +++ drivers/net/tun.c | 6 +- include/net/ipv6.h | 2 ++ net/ipv6/output_core.c | 34 ++ 4 files changed, 44 insertions(+), 1 deletion(-) diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c index 2aeaa61..6f226de 100644 --- a/drivers/net/macvtap.c +++ b/drivers/net/macvtap.c @@ -16,6 +16,7 @@ #include linux/idr.h #include linux/fs.h +#include net/ipv6.h #include net/net_namespace.h #include net/rtnetlink.h #include net/sock.h @@ -572,6 +573,8 @@ static int macvtap_skb_from_vnet_hdr(struct sk_buff *skb, pr_warn_once(macvtap: %s: using disabled UFO feature; please fix this program\n, current-comm); gso_type = SKB_GSO_UDP; + if (skb-protocol == htons(ETH_P_IPV6)) + ipv6_proxy_select_ident(skb); break; default: return -EINVAL; diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 280d3d2..7302398 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -65,6 +65,7 @@ #include linux/nsproxy.h #include linux/virtio_net.h #include linux/rcupdate.h +#include net/ipv6.h #include net/net_namespace.h #include net/netns/generic.h #include net/rtnetlink.h @@ -1139,6 +1140,8 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile, break; } + skb_reset_network_header(skb); + if (gso.gso_type != VIRTIO_NET_HDR_GSO_NONE) { pr_debug(GSO!\n); switch (gso.gso_type ~VIRTIO_NET_HDR_GSO_ECN) { @@ -1159,6 +1162,8 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile, current-comm); } skb_shinfo(skb)-gso_type = SKB_GSO_UDP; + if (skb-protocol == htons(ETH_P_IPV6)) + ipv6_proxy_select_ident(skb); break; } default: @@ -1189,7 +1194,6 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile, skb_shinfo(skb)-tx_flags |= SKBTX_SHARED_FRAG; } - skb_reset_network_header(skb); skb_probe_transport_header(skb, 0); rxhash = skb_get_hash(skb); diff --git a/include/net/ipv6.h b/include/net/ipv6.h index 97f4720..4292929 100644 --- a/include/net/ipv6.h +++ b/include/net/ipv6.h @@ -671,6 +671,8 @@ static inline int ipv6_addr_diff(const struct in6_addr *a1, const struct in6_add return __ipv6_addr_diff(a1, a2, sizeof(struct in6_addr)); } +void ipv6_proxy_select_ident(struct sk_buff *skb); + int ip6_dst_hoplimit(struct dst_entry *dst); static inline int ip6_sk_dst_hoplimit(struct ipv6_pinfo *np, struct flowi6 *fl6, diff --git a/net/ipv6/output_core.c b/net/ipv6/output_core.c index fc24c39..97f41a3 100644 --- a/net/ipv6/output_core.c +++ b/net/ipv6/output_core.c @@ -3,11 +3,45 @@ * not configured or static. These functions are needed by GSO/GRO implementation. */ #include linux/export.h +#include net/ip.h #include net/ipv6.h #include net/ip6_fib.h #include net/addrconf.h #include net/secure_seq.h +/* This function exists only for tap drivers that must support broken + * clients requesting UFO without specifying an IPv6 fragment ID. + * + * This is similar to ipv6_select_ident() but we use an independent hash + * seed to limit information leakage. + * + * The network header must be set before calling this. + */ +void ipv6_proxy_select_ident(struct sk_buff *skb) +{ + static u32 ip6_proxy_idents_hashrnd __read_mostly; + struct in6_addr buf[2]; + struct in6_addr *addrs; + u32 hash, id; + + addrs = skb_header_pointer(skb, + skb_network_offset(skb) + + offsetof(struct ipv6hdr, saddr), + sizeof(buf), buf); + if (!addrs) + return; + + net_get_random_once(ip6_proxy_idents_hashrnd, + sizeof(ip6_proxy_idents_hashrnd)); + + hash = __ipv6_addr_jhash(addrs[1], ip6_proxy_idents_hashrnd); + hash = __ipv6_addr_jhash(addrs[0], hash); + + id = ip_idents_reserve(hash, 1); + skb_shinfo(skb)-ip6_frag_id = htonl(id); +} +EXPORT_SYMBOL_GPL(ipv6_proxy_select_ident); + int ip6_find_1stfragopt(struct sk_buff *skb, u8 **nexthdr) { u16 offset = sizeof(struct ipv6hdr); -- Ben Hutchings The program is absolutely right
Re: [PATCH v2 net 1/2] drivers/net: Disable UFO through virtio
On Thu, 2014-10-30 at 11:47 -0700, Eric Dumazet wrote: On Thu, 2014-10-30 at 18:27 +, Ben Hutchings wrote: + { + static bool warned; + + if (!warned) { + warned = true; + netdev_warn(tun-dev, + %s: using disabled UFO feature; please fix this program\n, + current-comm); + } It might be time to add netdev_warn_once() ;) Could do. I'm trying to make small fixes that are suitable for stable. Alternatively, you could use pr_warn_once(%s: using disabled UFO feature; please fix this program\n, tun-dev-name, current-comm); That's missing a %s: , but yes that would also work. Ben. -- Ben Hutchings The program is absolutely right; therefore, the computer must be wrong. signature.asc Description: This is a digitally signed message part ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: IPv6 UFO for VMs
On Wed, 2014-10-22 at 11:35 +0200, Hannes Frederic Sowa wrote: On Mi, 2014-10-22 at 00:44 +0100, Ben Hutchings wrote: There are several ways that VMs can take advantage of UFO and get the host to do fragmentation for them: drivers/net/macvtap.c: gso_type = SKB_GSO_UDP; drivers/net/tun.c: skb_shinfo(skb)-gso_type = SKB_GSO_UDP; drivers/net/virtio_net.c: skb_shinfo(skb)-gso_type = SKB_GSO_UDP; Our implementation of UFO for IPv6 does: fptr = (struct frag_hdr *)(skb_network_header(skb) + unfrag_ip6hlen); fptr-nexthdr = nexthdr; fptr-reserved = 0; fptr-identification = skb_shinfo(skb)-ip6_frag_id; which assumes ip6_frag_id has been set. That's only true if the local stack constructed the skb; otherwise it appears we get zero. This seems to be a regression as a result of: commit 916e4cf46d0204806c062c8c6c4d1f633852c5b6 Author: Hannes Frederic Sowa han...@stressinduktion.org Date: Fri Feb 21 02:55:35 2014 +0100 ipv6: reuse ip6_frag_id from ip6_ufo_append_data However, that change seems reasonable - we *shouldn't* be choosing IDs for any other stack. Any paravirt net driver that can use IPv6 UFO needs to have some way of passing a fragmentation ID to put in skb_shared_info::ip6_frag_id. Do we really gain a lot of performance by enabling UFO on those devices or would it make sense to just drop support? It only helps fragmenting large UDP packets, so I don't think it is worth it. It's not been important enough for anyone to bother implementing it in hardware/firmware aside from Neterion. I'll shortly post patches to disable it. Ben. Otherwise I agree with Ben, we need to pass a fragmentation id from the host over to the system segmenting the gso frame. Fragmentation ids must be generated by the end system. Hmm... -- Ben Hutchings Theory and practice are closer in theory than in practice. - John Levine, moderator of comp.compilers signature.asc Description: This is a digitally signed message part ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH net 0/2] drivers/net,ipv6: Fix IPv6 fragment ID selection for virtio
The virtio net protocol supports UFO but does not provide for passing a fragment ID for fragmentation of IPv6 packets. We used to generate a fragment ID wherever such a packet was fragmented, but currently we always use ID=0! Ben. Ben Hutchings (2): drivers/net: Disable UFO through virtio drivers/net,ipv6: Select IPv6 fragment idents for virtio UFO packets drivers/net/macvtap.c| 16 drivers/net/tun.c| 24 +++- drivers/net/virtio_net.c | 23 +-- include/net/ipv6.h | 2 ++ net/ipv6/output_core.c | 34 ++ 5 files changed, 72 insertions(+), 27 deletions(-) -- Ben Hutchings Theory and practice are closer in theory than in practice. - John Levine, moderator of comp.compilers signature.asc Description: This is a digitally signed message part ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH net 2/2] drivers/net,ipv6: Select IPv6 fragment idents for virtio UFO packets
UFO is now disabled on all drivers that work with virtio net headers, but userland may try to send UFO/IPv6 packets anyway. Instead of sending with ID=0, we should select identifiers on their behalf (as we used to). Signed-off-by: Ben Hutchings b...@decadent.org.uk Fixes: 916e4cf46d02 (ipv6: reuse ip6_frag_id from ip6_ufo_append_data) --- drivers/net/macvtap.c | 3 +++ drivers/net/tun.c | 6 +- include/net/ipv6.h | 2 ++ net/ipv6/output_core.c | 34 ++ 4 files changed, 44 insertions(+), 1 deletion(-) diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c index 2aeaa61..6f226de 100644 --- a/drivers/net/macvtap.c +++ b/drivers/net/macvtap.c @@ -16,6 +16,7 @@ #include linux/idr.h #include linux/fs.h +#include net/ipv6.h #include net/net_namespace.h #include net/rtnetlink.h #include net/sock.h @@ -572,6 +573,8 @@ static int macvtap_skb_from_vnet_hdr(struct sk_buff *skb, pr_warn_once(macvtap: %s: using disabled UFO feature; please fix this program\n, current-comm); gso_type = SKB_GSO_UDP; + if (skb-protocol == htons(ETH_P_IPV6)) + ipv6_proxy_select_ident(skb); break; default: return -EINVAL; diff --git a/drivers/net/tun.c b/drivers/net/tun.c index e8b949a..f8356b0 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -65,6 +65,7 @@ #include linux/nsproxy.h #include linux/virtio_net.h #include linux/rcupdate.h +#include net/ipv6.h #include net/net_namespace.h #include net/netns/generic.h #include net/rtnetlink.h @@ -1139,6 +1140,8 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile, break; } + skb_reset_network_header(skb); + if (gso.gso_type != VIRTIO_NET_HDR_GSO_NONE) { pr_debug(GSO!\n); switch (gso.gso_type ~VIRTIO_NET_HDR_GSO_ECN) { @@ -1158,6 +1161,8 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile, current-comm); } skb_shinfo(skb)-gso_type = SKB_GSO_UDP; + if (skb-protocol == htons(ETH_P_IPV6)) + ipv6_proxy_select_ident(skb); break; } default: @@ -1188,7 +1193,6 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile, skb_shinfo(skb)-tx_flags |= SKBTX_SHARED_FRAG; } - skb_reset_network_header(skb); skb_probe_transport_header(skb, 0); rxhash = skb_get_hash(skb); diff --git a/include/net/ipv6.h b/include/net/ipv6.h index 97f4720..4292929 100644 --- a/include/net/ipv6.h +++ b/include/net/ipv6.h @@ -671,6 +671,8 @@ static inline int ipv6_addr_diff(const struct in6_addr *a1, const struct in6_add return __ipv6_addr_diff(a1, a2, sizeof(struct in6_addr)); } +void ipv6_proxy_select_ident(struct sk_buff *skb); + int ip6_dst_hoplimit(struct dst_entry *dst); static inline int ip6_sk_dst_hoplimit(struct ipv6_pinfo *np, struct flowi6 *fl6, diff --git a/net/ipv6/output_core.c b/net/ipv6/output_core.c index fc24c39..97f41a3 100644 --- a/net/ipv6/output_core.c +++ b/net/ipv6/output_core.c @@ -3,11 +3,45 @@ * not configured or static. These functions are needed by GSO/GRO implementation. */ #include linux/export.h +#include net/ip.h #include net/ipv6.h #include net/ip6_fib.h #include net/addrconf.h #include net/secure_seq.h +/* This function exists only for tap drivers that must support broken + * clients requesting UFO without specifying an IPv6 fragment ID. + * + * This is similar to ipv6_select_ident() but we use an independent hash + * seed to limit information leakage. + * + * The network header must be set before calling this. + */ +void ipv6_proxy_select_ident(struct sk_buff *skb) +{ + static u32 ip6_proxy_idents_hashrnd __read_mostly; + struct in6_addr buf[2]; + struct in6_addr *addrs; + u32 hash, id; + + addrs = skb_header_pointer(skb, + skb_network_offset(skb) + + offsetof(struct ipv6hdr, saddr), + sizeof(buf), buf); + if (!addrs) + return; + + net_get_random_once(ip6_proxy_idents_hashrnd, + sizeof(ip6_proxy_idents_hashrnd)); + + hash = __ipv6_addr_jhash(addrs[1], ip6_proxy_idents_hashrnd); + hash = __ipv6_addr_jhash(addrs[0], hash); + + id = ip_idents_reserve(hash, 1); + skb_shinfo(skb)-ip6_frag_id = htonl(id); +} +EXPORT_SYMBOL_GPL(ipv6_proxy_select_ident); + int ip6_find_1stfragopt(struct sk_buff *skb, u8 **nexthdr) { u16 offset = sizeof(struct ipv6hdr); -- Ben Hutchings Theory and practice are closer
IPv6 UFO for VMs
There are several ways that VMs can take advantage of UFO and get the host to do fragmentation for them: drivers/net/macvtap.c: gso_type = SKB_GSO_UDP; drivers/net/tun.c: skb_shinfo(skb)-gso_type = SKB_GSO_UDP; drivers/net/virtio_net.c: skb_shinfo(skb)-gso_type = SKB_GSO_UDP; Our implementation of UFO for IPv6 does: fptr = (struct frag_hdr *)(skb_network_header(skb) + unfrag_ip6hlen); fptr-nexthdr = nexthdr; fptr-reserved = 0; fptr-identification = skb_shinfo(skb)-ip6_frag_id; which assumes ip6_frag_id has been set. That's only true if the local stack constructed the skb; otherwise it appears we get zero. This seems to be a regression as a result of: commit 916e4cf46d0204806c062c8c6c4d1f633852c5b6 Author: Hannes Frederic Sowa han...@stressinduktion.org Date: Fri Feb 21 02:55:35 2014 +0100 ipv6: reuse ip6_frag_id from ip6_ufo_append_data However, that change seems reasonable - we *shouldn't* be choosing IDs for any other stack. Any paravirt net driver that can use IPv6 UFO needs to have some way of passing a fragmentation ID to put in skb_shared_info::ip6_frag_id. Ben. -- Ben Hutchings For every action, there is an equal and opposite criticism. - Harrison signature.asc Description: This is a digitally signed message part ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH RFC] virtio_pci: fix virtio spec compliance on restore
On Tue, 2014-09-23 at 13:32 +0300, Michael S. Tsirkin wrote: On restore, virtio pci does the following: + set features + init vqs etc - device can be used at this point! + set ACKNOWLEDGE,DRIVER and DRIVER_OK status bits This is in violation of the virtio spec, which requires the following order: - ACKNOWLEDGE - DRIVER - init vqs - DRIVER_OK Cc: sta...@vger.kernel.org Cc: Amit Shah amit.s...@redhat.com Signed-off-by: Michael S. Tsirkin m...@redhat.com [...] What concrete problem does this fix, such that it should be applied to stable branches? Ben. -- Ben Hutchings Everything should be made as simple as possible, but not simpler. - Albert Einstein signature.asc Description: This is a digitally signed message part ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH net-next v3 4/5] net-sysfs: add support for device-specific rx queue sysfs attributes
On Thu, 2014-01-16 at 01:38 -0800, Michael Dalton wrote: [...] --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h [...] @@ -2401,6 +2416,23 @@ static inline int netif_copy_real_num_queues(struct net_device *to_dev, #endif } +#ifdef CONFIG_SYSFS +static inline unsigned int get_netdev_rx_queue_index( + struct netdev_rx_queue *queue) +{ + struct net_device *dev = queue-dev; + int i; + + for (i = 0; i dev-num_rx_queues; i++) + if (queue == dev-_rx[i]) + break; Why write a loop when you can do: i = queue - dev-_rx; Ben. + BUG_ON(i = dev-num_rx_queues); + + return i; +} +#endif -- Ben Hutchings, Staff Engineer, Solarflare Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH net-next v3 4/5] net-sysfs: add support for device-specific rx queue sysfs attributes
On Thu, 2014-01-16 at 11:07 -0800, Michael Dalton wrote: On Jan 16, 2014 at 10:57 AM, Ben Hutchings bhutchi...@solarflare.com wrote: Why write a loop when you can do: i = queue - dev-_rx; Good point, the loop approach was done in get_netdev_queue_index -- I agree your fix is faster and simpler. I'll fix in next patchset. It's simpler but we don't know if it's faster (and I don't believe that matters for the current usage). If one of these functions starts to be used in the data path, at that point it could be worth optimising, e.g. by doing a test for queue 0 and only then doing the pointer arithmetic with its implicit division. Ben. -- Ben Hutchings, Staff Engineer, Solarflare Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH net-next v2 4/4] virtio-net: initial debugfs support, export mergeable rx buffer size
On Mon, 2014-01-13 at 11:40 +0200, Michael S. Tsirkin wrote: On Sun, Jan 12, 2014 at 03:32:28PM -0800, Michael Dalton wrote: [...] The last issue is how the rx_queue_attribute 'show' function implementation for mergeable_rx_buffer_size will access the appropriate per-receive queue EWMA data. The arguments to the show function will be the netdev_rx_queue and the attribute itself. We can get to the struct net_device from the netdev_rx_queue. If we extended netdev_rx_queue to indicate the queue_index or to store a void *priv_data pointer, that would be sufficient to allow us to resolve this issue. Hmm netdev_rx_queue is not defined unless CONFIG_RPS is set. Maybe we should use a different structure. [...] I don't think RPS should own this structure. It's just that there are currently no per-RX-queue attributes other than those defined by RPS. By the way, CONFIG_RPS is equivalent to CONFIG_SMP so will likely be enabled already in most places where virtio_net is used. Ben. -- Ben Hutchings, Staff Engineer, Solarflare Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH V2 5/6] vhost_net: poll vhost queue after marking DMA is done
On Fri, 2013-08-30 at 12:29 +0800, Jason Wang wrote: We used to poll vhost queue before making DMA is done, this is racy if vhost thread were waked up before marking DMA is done which can result the signal to be missed. Fix this by always poll the vhost thread before DMA is done. Does this bug only exist in net-next or is it older? Should the fix go to net and stable branches? Ben. Signed-off-by: Jason Wang jasow...@redhat.com --- drivers/vhost/net.c |9 + 1 files changed, 5 insertions(+), 4 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index ff60c2a..d09c17c 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -308,6 +308,11 @@ static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success) struct vhost_virtqueue *vq = ubufs-vq; int cnt = atomic_read(ubufs-kref.refcount); + /* set len to mark this desc buffers done DMA */ + vq-heads[ubuf-desc].len = success ? + VHOST_DMA_DONE_LEN : VHOST_DMA_FAILED_LEN; + vhost_net_ubuf_put(ubufs); + /* * Trigger polling thread if guest stopped submitting new buffers: * in this case, the refcount after decrement will eventually reach 1 @@ -318,10 +323,6 @@ static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success) */ if (cnt = 2 || !(cnt % 16)) vhost_poll_queue(vq-poll); - /* set len to mark this desc buffers done DMA */ - vq-heads[ubuf-desc].len = success ? - VHOST_DMA_DONE_LEN : VHOST_DMA_FAILED_LEN; - vhost_net_ubuf_put(ubufs); } /* Expects to be always run from workqueue - which acts as -- Ben Hutchings, Staff Engineer, Solarflare Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: Merge of virtio_net: fix race in RX VQ processing for linux-3.2.48
On Fri, 2013-07-12 at 23:13 +0200, Wolfram Gloger wrote: Hi, Today I merged M. Tsirkin's patch v2 virtio_net: fix race in RX VQ processing: http://lkml.indiana.edu/hypermail/linux/kernel/1307.1/00503.html into 3.2.48 and lightly tested the result (vhost-net, virtio-disk and 9pfs using virtio). This sounds like it could be suitable for stable, but that doesn't seem to have been requested by the author. I'm cc'ing those involved so they can make a decision whether this should be included in 3.2.y or other stable branches. Merge is not completely trivial, so might save you some time, if only for a comparison that you arrive at the same result. Thanks, but these are not in quite the right patch format. The filenames should always have a leading directory name which will be stripped (patch -p1). The original patch header must be preserved and a reference to the upstream commit inserted e.g. 'commit 0123456789abcdef0123456789abcdef012345678 upstream.' (Attachments copied for reference.) Ben. -- Ben Hutchings Once a job is fouled up, anything done to improve it makes it worse. --- drivers/virtio/virtio_ring.c.orig 2013-06-30 11:35:44.0 +0200 +++ drivers/virtio/virtio_ring.c 2013-07-12 15:18:48.0 +0200 @@ -360,9 +360,22 @@ void virtqueue_disable_cb(struct virtque } EXPORT_SYMBOL_GPL(virtqueue_disable_cb); -bool virtqueue_enable_cb(struct virtqueue *_vq) +/** + * virtqueue_enable_cb_prepare - restart callbacks after disable_cb + * @vq: the struct virtqueue we're talking about. + * + * This re-enables callbacks; it returns current queue state + * in an opaque unsigned value. This value should be later tested by + * virtqueue_poll, to detect a possible race between the driver checking for + * more work, and enabling callbacks. + * + * Caller must ensure we don't call this with other virtqueue + * operations at the same time (except where noted). + */ +unsigned virtqueue_enable_cb_prepare(struct virtqueue *_vq) { struct vring_virtqueue *vq = to_vvq(_vq); + u16 last_used_idx; START_USE(vq); @@ -372,15 +385,45 @@ bool virtqueue_enable_cb(struct virtqueu * either clear the flags bit or point the event index at the next * entry. Always do both to keep code simple. */ vq-vring.avail-flags = ~VRING_AVAIL_F_NO_INTERRUPT; - vring_used_event(vq-vring) = vq-last_used_idx; + vring_used_event(vq-vring) = last_used_idx = vq-last_used_idx; + END_USE(vq); + return last_used_idx; +} +EXPORT_SYMBOL_GPL(virtqueue_enable_cb_prepare); + +/** + * virtqueue_poll - query pending used buffers + * @vq: the struct virtqueue we're talking about. + * @last_used_idx: virtqueue state (from call to virtqueue_enable_cb_prepare). + * + * Returns true if there are pending used buffers in the queue. + * + * This does not need to be serialized. + */ +bool virtqueue_poll(struct virtqueue *_vq, unsigned last_used_idx) +{ + struct vring_virtqueue *vq = to_vvq(_vq); + virtio_mb(); - if (unlikely(more_used(vq))) { - END_USE(vq); - return false; - } + return (u16)last_used_idx != vq-vring.used-idx; +} +EXPORT_SYMBOL_GPL(virtqueue_poll); - END_USE(vq); - return true; +/** + * virtqueue_enable_cb - restart callbacks after disable_cb. + * @vq: the struct virtqueue we're talking about. + * + * This re-enables callbacks; it returns false if there are pending + * buffers in the queue, to detect a possible race between the driver + * checking for more work, and enabling callbacks. + * + * Caller must ensure we don't call this with other virtqueue + * operations at the same time (except where noted). + */ +bool virtqueue_enable_cb(struct virtqueue *_vq) +{ + unsigned last_used_idx = virtqueue_enable_cb_prepare(_vq); + return !virtqueue_poll(_vq, last_used_idx); } EXPORT_SYMBOL_GPL(virtqueue_enable_cb); --- include/linux/virtio.h.orig 2012-01-05 00:55:44.0 +0100 +++ include/linux/virtio.h 2013-07-12 14:42:26.0 +0200 @@ -96,6 +96,10 @@ void virtqueue_disable_cb(struct virtque bool virtqueue_enable_cb(struct virtqueue *vq); +unsigned virtqueue_enable_cb_prepare(struct virtqueue *vq); + +bool virtqueue_poll(struct virtqueue *vq, unsigned); + bool virtqueue_enable_cb_delayed(struct virtqueue *vq); void *virtqueue_detach_unused_buf(struct virtqueue *vq); --- drivers/net/virtio_net.c.orig 2012-01-05 00:55:44.0 +0100 +++ drivers/net/virtio_net.c 2013-07-12 15:39:06.0 +0200 @@ -508,7 +508,7 @@ static int virtnet_poll(struct napi_stru { struct virtnet_info *vi = container_of(napi, struct virtnet_info, napi); void *buf; - unsigned int len, received = 0; + unsigned int r, len, received = 0; again: while (received budget @@ -525,8 +525,9 @@ again: /* Out of packets? */ if (received budget) { + r = virtqueue_enable_cb_prepare(vi-rvq); napi_complete(napi); - if (unlikely(!virtqueue_enable_cb(vi-rvq)) + if (unlikely(virtqueue_poll(vi-rvq, r)) napi_schedule_prep(napi)) { virtqueue_disable_cb(vi-rvq); __napi_schedule(napi
Re: [PATCH 0/2] fix kernel crash with macvtap on top of LRO
On Mon, 2013-06-10 at 10:07 +0300, Michael S. Tsirkin wrote: On Thu, Feb 07, 2013 at 01:14:20PM -0500, David Miller wrote: From: Ben Hutchings bhutchi...@solarflare.com Date: Thu, 7 Feb 2013 16:20:46 + If the consensus is still that we must preserve packets exactly (aside from the usual modifications by IP routers) then LRO should be disabled on all devices for which forwarding is enabled. I believe this is still undoubtedly the consensus. With virtio we are getting packets from a linux host, so we could thinkably preserve packets exactly even with LRO. I am guessing other hardware could be doing this as well. I am not sure what information would need to be preserved - could someone help clarify please? Some LRO implementations may not preserve: - Packet boundaries - TSO/GSO produces packets all the same size, except possibly for the last one. GRO therefore flushes a flow after merging a packet with a different segment size. - IPv4 TTL, IPv6 hop-limit, TCP timestamp - TSO/GSO will put the same values in all packets. GRO flushes a flow if they change. - IPv4 fragment ID - TSO/GSO produces consecutive fragment IDs. GRO flushes a flow if it sees a non-consecutive fragment ID. - MAC header, IPv4 TOS, IPv6 traffic class - Should be the same for all packets in a flow. GRO actually checks and flushes a flow if they change. Ben. -- Ben Hutchings, Staff Engineer, Solarflare Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] virtio-net: Reporting traffic queue distribution statistics through ethtool
On Thu, 2013-05-16 at 13:24 -0700, Sriram Narasimhan wrote: This patch allows virtio-net driver to report traffic distribution to inbound/outbound queues through ethtool -S. The per_cpu virtnet_stats is split into receive and transmit stats and are maintained on a per receive_queue and send_queue basis. virtnet_stats() is modified to aggregate interface level statistics from per-queue statistics. Sample output below: NIC statistics: rxq0: rx_packets: 4357802 rxq0: rx_bytes: 292642052 txq0: tx_packets: 824540 txq0: tx_bytes: 55256404 rxq1: rx_packets: 0 rxq1: rx_bytes: 0 txq1: tx_packets: 1094268 txq1: tx_bytes: 73328316 rxq2: rx_packets: 0 rxq2: rx_bytes: 0 txq2: tx_packets: 1091466 txq2: tx_bytes: 73140566 rxq3: rx_packets: 0 rxq3: rx_bytes: 0 txq3: tx_packets: 1093043 txq3: tx_bytes: 73246142 Signed-off-by: Sriram Narasimhan sriram.narasim...@hp.com [...] That ordering is totally unreadable. I want to see patches for ethtool to let the user order and aggregate the queue statistics in a sensible way: $ ethtool -S eth0 # default would show aggregated statistics NIC statistics: rx_packets: 4357802 rx_bytes: 292642052 tx_packets: 4103317 tx_bytes: 274971428 $ ethtool -S eth0 full group queue # full statistics, grouped by queue name NIC statistics: rxq0: rx_packets: 4357802 rx_bytes: 292642052 rxq1: rx_packets: 0 rx_bytes: 0 [...] txq0: tx_packets: 824540 tx_bytes: 55256404 txq1: tx_packets: 1094268 tx_bytes: 73328316 [...] $ ethtool -S eth0 full group statistic # full statistics, grouped by statistic name NIC statistics: rx_packets: rxq0: 4357802 rxq1: 0 rxq2: 0 rxq3: 0 rx_bytes: rxq0: 292642052 rxq1: 0 rxq2: 0 rxq3: 0 [...] Maybe even: $ ethtool -S eth0 full table NIC statistics: rx_packets rx_bytes rxq0: 4357802 292642052 rxq1:0 0 rxq2:0 0 rxq3:0 0 tx_packets tx_bytes txq0: 824540 55256404 txq1: 1094268 73328316 txq2: 1091466 73140566 txq3: 1093043 73246142 (Difficult to do, but probably very useful!) The queue names should be rx-index and tx-index like in sysfs. We'll need to reach a consensus on the naming scheme for per-queue and otherwise disaggregated statistics (see previous discussions in http://thread.gmane.org/gmane.linux.kernel.virtualization/15572/focus=15608). I don't much care what they look like as long as there's an implementation for the ethtool side and they don't look too awful in older versions of ethtool. Ben. -- Ben Hutchings, Staff Engineer, Solarflare Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH net] vhost/net: fix heads usage of ubuf_info
On Thu, 2013-03-21 at 08:02 +0200, Michael S. Tsirkin wrote: On Sun, Mar 17, 2013 at 02:29:55PM -0400, David Miller wrote: From: Michael S. Tsirkin m...@redhat.com Date: Sun, 17 Mar 2013 14:46:09 +0200 ubuf info allocator uses guest controlled head as an index, so a malicious guest could put the same head entry in the ring twice, and we will get two callbacks on the same value. To fix use upend_idx which is guaranteed to be unique. Reported-by: Rusty Russell ru...@rustcorp.com.au Signed-off-by: Michael S. Tsirkin m...@redhat.com Applied and queued up for -stable, thanks. And thankfully you got the stable URL wrong, Yes I wrote sta...@kernel.org that's what an old copy says here: https://www.kernel.org/doc/Documentation/stable_kernel_rules.txt I should have known better than look at it on the 'net. The top 'Everything you ever wanted to know about Linux 2.6 -stable releases.' is a big hint that it's stale. Any idea who maintains this? Better update it or remove it or redirect to git. Rob Landley maintains it, but he's been having trouble updating it since all the upload mechanisms were changed on kernel.org. (My stable maintenance scripts still match the old address, anyway. Not sure about Greg's.) Ben. -- Ben Hutchings, Staff Engineer, Solarflare Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH net] vhost/net: fix heads usage of ubuf_info
On Thu, 2013-03-21 at 18:28 +0200, Michael S. Tsirkin wrote: On Thu, Mar 21, 2013 at 04:23:48PM +, Ben Hutchings wrote: On Thu, 2013-03-21 at 08:02 +0200, Michael S. Tsirkin wrote: On Sun, Mar 17, 2013 at 02:29:55PM -0400, David Miller wrote: From: Michael S. Tsirkin m...@redhat.com Date: Sun, 17 Mar 2013 14:46:09 +0200 ubuf info allocator uses guest controlled head as an index, so a malicious guest could put the same head entry in the ring twice, and we will get two callbacks on the same value. To fix use upend_idx which is guaranteed to be unique. Reported-by: Rusty Russell ru...@rustcorp.com.au Signed-off-by: Michael S. Tsirkin m...@redhat.com Applied and queued up for -stable, thanks. And thankfully you got the stable URL wrong, Yes I wrote sta...@kernel.org that's what an old copy says here: https://www.kernel.org/doc/Documentation/stable_kernel_rules.txt I should have known better than look at it on the 'net. The top 'Everything you ever wanted to know about Linux 2.6 -stable releases.' is a big hint that it's stale. Any idea who maintains this? Better update it or remove it or redirect to git. Rob Landley maintains it, but he's been having trouble updating it since all the upload mechanisms were changed on kernel.org. (My stable maintenance scripts still match the old address, anyway. Not sure about Greg's.) Ben. I hope you mean it will match both the old and the new address? Yes, of course! Ben. -- Ben Hutchings, Staff Engineer, Solarflare Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH][v3.2.y] xen-netfront: delay gARP until backend switches to Connected
On Fri, 2013-03-15 at 13:56 -0400, Joseph Salisbury wrote: Hello, Please consider including upstream commit 08e34eb14fe4cfd934b5c169a7682a969457c4ea in the next v3.2.y release. It was included upstream as of v3.3-rc1. It has been tested and confirmed to resolve http://bugs.launchpad.net/bugs/1154608 . commit 08e34eb14fe4cfd934b5c169a7682a969457c4ea Author: Laszlo Ersek ler...@redhat.com Date: Sun Dec 11 01:48:59 2011 + xen-netfront: delay gARP until backend switches to Connected Added to the queue, thanks. Ben. -- Ben Hutchings Never attribute to conspiracy what can adequately be explained by stupidity. signature.asc Description: This is a digitally signed message part ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] [PATCH] x86/xen: don't assume %ds is usable in xen_iret for 32-bit PVOPS.
On Thu, 2013-02-14 at 17:55 -0500, Konrad Rzeszutek Wilk wrote: On Thu, Feb 14, 2013 at 01:52:16PM -0800, Greg KH wrote: Jan, any reason why this patch isn't in Linus's tree already, and why it Sent out the GIT PULL this week to Linus wasn't marked for inclusion in a -stable kernel release? I forgot to stick that. Do please include it in the stable tree. [...] I've queued this up for 3.2. Ben. -- Ben Hutchings Experience is what causes a person to make new mistakes instead of old ones. signature.asc Description: This is a digitally signed message part ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH V3 1/2] virtio-net: fix the set affinity bug when CPU IDs are not consecutive
On Thu, 2013-01-10 at 11:19 +1030, Rusty Russell wrote: Wanlong Gao gaowanl...@cn.fujitsu.com writes: On 01/09/2013 07:31 AM, Rusty Russell wrote: Wanlong Gao gaowanl...@cn.fujitsu.com writes: */ static u16 virtnet_select_queue(struct net_device *dev, struct sk_buff *skb) { - int txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : - smp_processor_id(); + int txq = 0; + + if (skb_rx_queue_recorded(skb)) + txq = skb_get_rx_queue(skb); + else if ((txq = per_cpu(vq_index, smp_processor_id())) == -1) + txq = 0; You should use __get_cpu_var() instead of smp_processor_id() here, ie: else if ((txq = __get_cpu_var(vq_index)) == -1) And AFAICT, no reason to initialize txq to 0 to start with. So: int txq; if (skb_rx_queue_recorded(skb)) txq = skb_get_rx_queue(skb); else { txq = __get_cpu_var(vq_index); if (txq == -1) txq = 0; } Got it, thank you. Now, just to confirm, I assume this can happen even if we use vq_index, right, because of races with virtnet_set_channels? I still can't understand this race, could you explain more? thank you. I assume that someone can call virtnet_set_channels() while we are inside virtnet_select_queue(), so they reduce dev-real_num_tx_queues, causing virtnet_set_channels to do: while (unlikely(txq = dev-real_num_tx_queues)) txq -= dev-real_num_tx_queues; Otherwise, when is this loop called? In fact, this race can result in the TX scheduler using a queue that has been disabled, or other weirdness (consider what happens if real_num_tx_queues increases between those two uses). virtnet_set_channels() really must disable TX temporarily: netif_tx_lock(dev); netif_device_detach(dev); netif_tx_unlock(dev); ... netif_device_attach(dev); Ben. -- Ben Hutchings, Staff Engineer, Solarflare Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCHv5] virtio-spec: virtio network device RFS support
On Thu, 2012-12-06 at 10:13 +0200, Michael S. Tsirkin wrote: On Wed, Dec 05, 2012 at 08:39:26PM +, Ben Hutchings wrote: On Mon, 2012-12-03 at 12:58 +0200, Michael S. Tsirkin wrote: Add RFS support to virtio network device. Add a new feature flag VIRTIO_NET_F_RFS for this feature, a new configuration field max_virtqueue_pairs to detect supported number of virtqueues as well as a new command VIRTIO_NET_CTRL_RFS to program packet steering for unidirectional protocols. [...] +Programming of the receive flow classificator is implicit. + Transmitting a packet of a specific flow on transmitqX will cause incoming + packets for this flow to be steered to receiveqX. + For uni-directional protocols, or where no packets have been transmitted + yet, device will steer a packet to a random queue out of the specified + receiveq0..receiveqn. [...] It doesn't seem like this is usable to implement accelerated RFS in the guest, though perhaps that doesn't matter. What is the issue? Could you be more explicit please? It seems to work pretty well: if we have # of queues = # of cpus, incoming TCP_STREAM into guest scales very nicely without manual tweaks in guest. The way it works is, when guest sends a packet driver select the rx queue that we want to use for incoming packets for this slow, and transmit on the matching tx queue. This is exactly what text above suggests no? Yes, I get that. On the host side, presumably you'll want vhost_net to do the equivalent of sock_rps_record_flow() - only without a socket? But in any case, that requires an rxhash, so I don't see how this is supposed to work. Ben. Host should just do what guest tells it to. On the host side we build up the steering table as we get packets to transmit. See the code in drivers/net/tun.c in recent kernels. Again this actually works fine - what are the problems that you see? Could you give an example please? I'm not saying it doesn't work in its own way, I just don't see how you would make it work with the existing RFS! Since this doesn't seem to be intended to have *any* connection with the existing core networking feature called RFS, perhaps you could find a different name for it. Ben. -- Ben Hutchings, Staff Engineer, Solarflare Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCHv5] virtio-spec: virtio network device RFS support
On Thu, 2012-12-06 at 22:29 +0200, Michael S. Tsirkin wrote: On Thu, Dec 06, 2012 at 08:03:14PM +, Ben Hutchings wrote: [...] Since this doesn't seem to be intended to have *any* connection with the existing core networking feature called RFS, perhaps you could find a different name for it. Ben. Ah I see what you mean. We started out calling this feature multiqueue Rusty suggested RFS since it gives similar functionality to RFS but in device: it has receive steering logic per flow as part of the device. The name is quite generic, but in the context of Linux it has so far been used for a specific software feature and not as a generic name for flow steering by hardware (or drivers). The existing documentation (Documentation/networking/scaling.txt) states quite clearly that 'RFS' means that specific software implementation (with optional driver integration) and configuration interface. Maybe simply adding a statement similar to the one above would be sufficient to avoid confusion? No, I don't think it's sufficient. We have documentation that says how to configure 'RFS', and you're proposing to add a very similar feature called 'RFS' that is configured differently. No matter how clearly you distinguish them in new documentation, this will make the old documentation confusing. Ben. -- Ben Hutchings, Staff Engineer, Solarflare Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCHv5] virtio-spec: virtio network device RFS support
On Thu, 2012-12-06 at 23:01 +0200, Michael S. Tsirkin wrote: On Thu, Dec 06, 2012 at 08:53:59PM +, Ben Hutchings wrote: On Thu, 2012-12-06 at 22:29 +0200, Michael S. Tsirkin wrote: On Thu, Dec 06, 2012 at 08:03:14PM +, Ben Hutchings wrote: [...] Since this doesn't seem to be intended to have *any* connection with the existing core networking feature called RFS, perhaps you could find a different name for it. Ben. Ah I see what you mean. We started out calling this feature multiqueue Rusty suggested RFS since it gives similar functionality to RFS but in device: it has receive steering logic per flow as part of the device. The name is quite generic, but in the context of Linux it has so far been used for a specific software feature and not as a generic name for flow steering by hardware (or drivers). The existing documentation (Documentation/networking/scaling.txt) states quite clearly that 'RFS' means that specific software implementation (with optional driver integration) and configuration interface. Maybe simply adding a statement similar to the one above would be sufficient to avoid confusion? No, I don't think it's sufficient. We have documentation that says how to configure 'RFS', and you're proposing to add a very similar feature called 'RFS' that is configured differently. No matter how clearly you distinguish them in new documentation, this will make the old documentation confusing. Ben. I don't mind, renaming is just s/RFS/whatever/ away - how should hardware call this in your opinion? If by 'this' you mean the use of perfect filters or a large hash table to select the RX queue per flow, then 'flow steering'. But that is usually combined with the fall-back of a simple mapping from hash to queue ('RSS' or 'flow hashing') in case there is no specific queue selection yet, which I can see tun has. And you're specifying multiple transmit queues too. If you want a name for the whole set of features involved, I don't see any better name than 'multiqueue'/'MQ'. If you want a name for this specific flow steering mechanism, add some distinguishing adjective(s) like 'virtual' or 'automatic'. Ben. -- Ben Hutchings, Staff Engineer, Solarflare Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCHv5] virtio-spec: virtio network device RFS support
On Mon, 2012-12-03 at 12:58 +0200, Michael S. Tsirkin wrote: Add RFS support to virtio network device. Add a new feature flag VIRTIO_NET_F_RFS for this feature, a new configuration field max_virtqueue_pairs to detect supported number of virtqueues as well as a new command VIRTIO_NET_CTRL_RFS to program packet steering for unidirectional protocols. [...] +Programming of the receive flow classificator is implicit. + Transmitting a packet of a specific flow on transmitqX will cause incoming + packets for this flow to be steered to receiveqX. + For uni-directional protocols, or where no packets have been transmitted + yet, device will steer a packet to a random queue out of the specified + receiveq0..receiveqn. [...] It doesn't seem like this is usable to implement accelerated RFS in the guest, though perhaps that doesn't matter. On the host side, presumably you'll want vhost_net to do the equivalent of sock_rps_record_flow() - only without a socket? But in any case, that requires an rxhash, so I don't see how this is supposed to work. Ben. -- Ben Hutchings, Staff Engineer, Solarflare Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [net-next rfc v7 3/3] virtio-net: change the number of queues through ethtool
On Mon, 2012-12-03 at 13:25 +0200, Michael S. Tsirkin wrote: On Mon, Dec 03, 2012 at 02:09:28PM +0800, Jason Wang wrote: On Sunday, December 02, 2012 06:09:06 PM Michael S. Tsirkin wrote: On Tue, Nov 27, 2012 at 06:16:00PM +0800, Jason Wang wrote: This patch implement the {set|get}_channels method of ethool to allow user to change the number of queues dymaically when the device is running. This would let the user to configure it on demand. Signed-off-by: Jason Wang jasow...@redhat.com --- drivers/net/virtio_net.c | 41 + 1 files changed, 41 insertions(+), 0 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index bcaa6e5..f08ec2a 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -1578,10 +1578,51 @@ static struct virtio_driver virtio_net_driver = { #endif }; +/* TODO: Eliminate OOO packets during switching */ +static int virtnet_set_channels(struct net_device *dev, + struct ethtool_channels *channels) +{ + struct virtnet_info *vi = netdev_priv(dev); + u16 queue_pairs = channels-combined_count; by the way shouldn't this be combined_count / 2? And below channels-max_combined = vi-max_queue_pairs * 2; ? [...] In this ethtool API, 'channel' means an IRQ and set of queues that trigger it. So each ethtool-channel will correspond to one queue-pair and not one virtio channel. Ben. -- Ben Hutchings, Staff Engineer, Solarflare Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [rfc net-next v6 2/3] virtio_net: multiqueue support
On Sun, 2012-11-18 at 11:13 +0200, Michael S. Tsirkin wrote: On Sat, Nov 17, 2012 at 12:35:29AM +, Ben Hutchings wrote: On Tue, 2012-11-13 at 08:40 +0200, Michael S. Tsirkin wrote: On Mon, Nov 05, 2012 at 11:38:39AM +1030, Rusty Russell wrote: @@ -924,11 +1032,10 @@ static void virtnet_get_ringparam(struct net_device *dev, { struct virtnet_info *vi = netdev_priv(dev); - ring-rx_max_pending = virtqueue_get_vring_size(vi-rvq); - ring-tx_max_pending = virtqueue_get_vring_size(vi-svq); + ring-rx_max_pending = virtqueue_get_vring_size(vi-rq[0].vq); + ring-tx_max_pending = virtqueue_get_vring_size(vi-sq[0].vq); ring-rx_pending = ring-rx_max_pending; ring-tx_pending = ring-tx_max_pending; - } This assumes all vqs are the same size. I think this should probably check: for mq mode, use the first vq, otherewise use the 0th. For rx_pending/tx_pending I think what is required here is the actual number of outstanding buffers. Dave, Eric - right? So this should be the total over all rings and to be useful, rx_max_pending/tx_max_pending should be the total too. So far as I know, all current implementations use the number of descriptors per ring here. virtio_net should be consistent with this. Ben. Problem is, it could in theory be different between rings. I guess we could use the maximum. What's the right thing to do for rx_pending - I am guessing we want the current outstanding packets right? The 'max_pending' fields are for the maximum ring sizes supported; the 'pending' fields are for the current ring sizes. The current number of descriptors pending is not included (would be a bit pointless). Ben. -- Ben Hutchings, Staff Engineer, Solarflare Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [rfc net-next v6 2/3] virtio_net: multiqueue support
On Tue, 2012-11-13 at 08:40 +0200, Michael S. Tsirkin wrote: On Mon, Nov 05, 2012 at 11:38:39AM +1030, Rusty Russell wrote: @@ -924,11 +1032,10 @@ static void virtnet_get_ringparam(struct net_device *dev, { struct virtnet_info *vi = netdev_priv(dev); - ring-rx_max_pending = virtqueue_get_vring_size(vi-rvq); - ring-tx_max_pending = virtqueue_get_vring_size(vi-svq); + ring-rx_max_pending = virtqueue_get_vring_size(vi-rq[0].vq); + ring-tx_max_pending = virtqueue_get_vring_size(vi-sq[0].vq); ring-rx_pending = ring-rx_max_pending; ring-tx_pending = ring-tx_max_pending; - } This assumes all vqs are the same size. I think this should probably check: for mq mode, use the first vq, otherewise use the 0th. For rx_pending/tx_pending I think what is required here is the actual number of outstanding buffers. Dave, Eric - right? So this should be the total over all rings and to be useful, rx_max_pending/tx_max_pending should be the total too. So far as I know, all current implementations use the number of descriptors per ring here. virtio_net should be consistent with this. Ben. For bonus points, check this assertion at probe time. Looks like we can easily support different queues too. -- Ben Hutchings, Staff Engineer, Solarflare Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [rfc net-next v6 3/3] virtio-net: change the number of queues through ethtool
On Tue, 2012-10-30 at 18:03 +0800, Jason Wang wrote: This patch implement the {set|get}_channels method of ethool to allow user to change the number of queues dymaically when the device is running. This would let the user to tune the device for specific applications. Signed-off-by: Jason Wang jasow...@redhat.com --- drivers/net/virtio_net.c | 43 +++ 1 files changed, 43 insertions(+), 0 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 8cc43e5..66fc129 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -1616,10 +1616,53 @@ static struct virtio_driver virtio_net_driver = { #endif }; +/* TODO: Eliminate OOO packets during switching */ +static int virtnet_set_channels(struct net_device *dev, + struct ethtool_channels *channels) +{ + struct virtnet_info *vi = netdev_priv(dev); + u16 queue_pairs = channels-combined_count; + + /* Only two modes were support currently */ + if (queue_pairs == 0) + return -EINVAL; + if (queue_pairs != vi-total_queue_pairs - 1 queue_pairs != 1) + return -EINVAL; You should also check that all of the other counts are == 0. + vi-num_queue_pairs = queue_pairs 1 ? queue_pairs + 1 : 1; This doesn't make sense. You're setting vi-num_queue_pairs to be combined_count + 1... + BUG_ON(virtnet_set_queues(vi)); + + netif_set_real_num_tx_queues(dev, vi-num_queue_pairs); + netif_set_real_num_rx_queues(dev, vi-num_queue_pairs); + + return 0; +} + +static void virtnet_get_channels(struct net_device *dev, + struct ethtool_channels *channels) +{ + struct virtnet_info *vi = netdev_priv(dev); + + if (vi-total_queue_pairs != 1) { + channels-combined_count = vi-num_queue_pairs; but here you report combined_count as being vi-num_queue_pairs. Do you need to subtract 1 here? Ben. + channels-max_combined = vi-total_queue_pairs - 1; + } else { + channels-combined_count = 1; + channels-max_combined = 1; + } + + channels-max_other = 0; + channels-rx_count = 0; + channels-tx_count = 0; + channels-other_count = 0; +} + static const struct ethtool_ops virtnet_ethtool_ops = { .get_drvinfo = virtnet_get_drvinfo, .get_link = ethtool_op_get_link, .get_ringparam = virtnet_get_ringparam, + .set_channels = virtnet_set_channels, + .get_channels = virtnet_get_channels, }; static int __init init(void) -- Ben Hutchings, Staff Engineer, Solarflare Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCHv4] virtio-spec: virtio network device multiqueue support
On Wed, 2012-09-12 at 07:40 -0700, Tom Herbert wrote: On Wed, Sep 12, 2012 at 12:57 AM, Michael S. Tsirkin m...@redhat.com wrote: On Wed, Sep 12, 2012 at 03:19:11PM +0930, Rusty Russell wrote: [...] Perhaps Tom can explain how we avoid out-of-order receive for the accelerated RFS case? It's not clear to me, but we need to be able to do that for virtio-net if it implements accelerated RFS. AFAIK ooo RX is still possible with accelerated RFS. We have an algorithm that prevents this for RFS by deferring a migration to a new queue as long as it's possible that a flow might have outstanding packets on the old queue. I suppose this could be implemented in the device for the HW queues, but I don't think it would be easy to cover all cases where packets were already in transit to the host or other cases where host and device queues are out of sync. [...] Yes, I couldn't see any way to eliminate the possibility of OOO. The software queue check in RFS should redirect the flow only when it is new or has had an idle period, when I hope only a few packets will be received before we send some kind of response (transport or application layer ACK). So I think that OOO is not that likely in practice, but I don't have the evidence to back that up. If the filter update latency is high enough that a response can overtake the filter update, there may be more of a problem. Ben. -- Ben Hutchings, Staff Engineer, Solarflare Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [net-next RFC V5 5/5] virtio_net: support negotiating the number of queues through ctrl vq
On Thu, 2012-07-05 at 18:29 +0800, Jason Wang wrote: This patch let the virtio_net driver can negotiate the number of queues it wishes to use through control virtqueue and export an ethtool interface to let use tweak it. As current multiqueue virtio-net implementation has optimizations on per-cpu virtuqueues, so only two modes were support: - single queue pair mode - multiple queue paris mode, the number of queues matches the number of vcpus The single queue mode were used by default currently due to regression of multiqueue mode in some test (especially in stream test). Since virtio core does not support paritially deleting virtqueues, so during mode switching the whole virtqueue were deleted and the driver would re-create the virtqueues it would used. btw. The queue number negotiating were defered to .ndo_open(), this is because only after feature negotitaion could we send the command to control virtqueue (as it may also use event index). [...] +static int virtnet_set_channels(struct net_device *dev, + struct ethtool_channels *channels) +{ + struct virtnet_info *vi = netdev_priv(dev); + u16 queues = channels-rx_count; + unsigned status = VIRTIO_CONFIG_S_ACKNOWLEDGE | VIRTIO_CONFIG_S_DRIVER; + + if (channels-rx_count != channels-tx_count) + return -EINVAL; [...] +static void virtnet_get_channels(struct net_device *dev, + struct ethtool_channels *channels) +{ + struct virtnet_info *vi = netdev_priv(dev); + + channels-max_rx = vi-total_queue_pairs; + channels-max_tx = vi-total_queue_pairs; + channels-max_other = 0; + channels-max_combined = 0; + channels-rx_count = vi-num_queue_pairs; + channels-tx_count = vi-num_queue_pairs; + channels-other_count = 0; + channels-combined_count = 0; +} [...] It looks like the queue-pairs should be treated as 'combined channels', not separate RX and TX channels. Also you don't need to clear the other members; you can assume that the ethtool core will zero-initialise structures for 'get' operations. Ben. -- Ben Hutchings, Staff Engineer, Solarflare Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [patch net-next 1/4] net: introduce new priv_flag indicating iface capable of change mac when running
On Thu, 2012-06-28 at 16:10 +0200, Jiri Pirko wrote: Introduce IFF_LIFE_ADDR_CHANGE priv_flag and use it to disable netif_running() check in eth_mac_addr() Signed-off-by: Jiri Pirko jpi...@redhat.com --- include/linux/if.h |2 ++ net/ethernet/eth.c |2 +- 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/include/linux/if.h b/include/linux/if.h index f995c66..fd9ee7c 100644 --- a/include/linux/if.h +++ b/include/linux/if.h @@ -81,6 +81,8 @@ #define IFF_UNICAST_FLT 0x2 /* Supports unicast filtering */ #define IFF_TEAM_PORT0x4 /* device used as team port */ #define IFF_SUPP_NOFCS 0x8 /* device supports sending custom FCS */ +#define IFF_LIFE_ADDR_CHANGE 0x10/* device supports hardware address + * change when it's running */ [...] Any device that has IFF_UNICAST_FLT can update the unicast MAC filter while it's running; doesn't that go hand-in-hand with being able to handle changes to the primary MAC address? Is the new flag really necessary at all? Ben. -- Ben Hutchings, Staff Engineer, Solarflare Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [V2 RFC net-next PATCH 2/2] virtio_net: export more statistics through ethtool
On Wed, 2012-06-06 at 15:52 +0800, Jason Wang wrote: Satistics counters is useful for debugging and performance optimization, so this patch lets virtio_net driver collect following and export them to userspace through ethtool -S: - number of packets sent/received - number of bytes sent/received - number of callbacks for tx/rx - number of kick for tx/rx - number of bytes/packets queued for tx As virtnet_stats were per-cpu, so both per-cpu and gloabl satistics were collected like: [...] I would really like to see some sort of convention for presenting per-queue statistics through ethtool. At the moment we have a complete mess of different formats: bnx2x:[${index}]: ${name} be2net: ${qtype}q${index}: ${name} ehea: PR${index} ${name} mlx4_en: ${qtype}${index}_${name} myri10ge: dummy stat names as headings niu: dummy stat names as headings s2io: ring_${index}_${name} vmxnet3: dummy stat names as headings vxge: ${name}_${index}; also dummy stat names as headings And you're introducing yet another format! (Additionally some of the drivers are playing games with spaces and tabs to make ethtool indent the stats the way they like. Ethtool statistics are inconsistent enough already without drivers pulling that sort of crap. I'm inclined to make ethtool start stripping whitespace from stat names, and *if* people can agree on a common format for per-queue statistic names then I'll indent them *consistently*. Also, I would make such stats optional, so you don't get hundreds of lines of crap by default.) Ben. -- Ben Hutchings, Staff Engineer, Solarflare Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [V2 RFC net-next PATCH 2/2] virtio_net: export more statistics through ethtool
On Thu, 2012-06-07 at 13:05 -0700, David Miller wrote: From: Ben Hutchings bhutchi...@solarflare.com Date: Thu, 7 Jun 2012 18:15:06 +0100 I would really like to see some sort of convention for presenting per-queue statistics through ethtool. At the moment we have a complete mess of different formats: Indeed. Probably ${QUEUE_TYPE}-${INDEX}-${STATISTIC} is best. With an agreed upon list of queue types such as rx, tx, rxtx etc. I think we should leave the type names open-ended, as there are other useful groupings like per-virtual-port. In that case the separator should be chosen to allow arbitrary type names without ambiguity. Ben. -- Ben Hutchings, Staff Engineer, Solarflare Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [V2 RFC net-next PATCH 2/2] virtio_net: export more statistics through ethtool
On Thu, 2012-06-07 at 13:39 -0700, Rick Jones wrote: On 06/07/2012 01:24 PM, Ben Hutchings wrote: On Thu, 2012-06-07 at 13:05 -0700, David Miller wrote: From: Ben Hutchingsbhutchi...@solarflare.com Date: Thu, 7 Jun 2012 18:15:06 +0100 I would really like to see some sort of convention for presenting per-queue statistics through ethtool. At the moment we have a complete mess of different formats: Indeed. Probably ${QUEUE_TYPE}-${INDEX}-${STATISTIC} is best. With an agreed upon list of queue types such as rx, tx, rxtx etc. I think we should leave the type names open-ended, as there are other useful groupings like per-virtual-port. In that case the separator should be chosen to allow arbitrary type names without ambiguity. So you mean like something along the lines of the presence of say '.' indicating indent a level: rx_bytes: 1234 myqueue1.rx_bytes: 234 myqueue2.rx_bytes: 345 ... Most drivers seem to want this sort of ordering/grouping: group0.foo group0.bar ... group1.foo group1.bar ... but if we have a standard way of indicating groups of statistics then the user can choose whether they want to reorder by type name. Ben. -- Ben Hutchings, Staff Engineer, Solarflare Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [V2 RFC net-next PATCH 2/2] virtio_net: export more statistics through ethtool
On Thu, 2012-06-07 at 21:56 +0100, Ben Hutchings wrote: On Thu, 2012-06-07 at 13:39 -0700, Rick Jones wrote: On 06/07/2012 01:24 PM, Ben Hutchings wrote: On Thu, 2012-06-07 at 13:05 -0700, David Miller wrote: From: Ben Hutchingsbhutchi...@solarflare.com Date: Thu, 7 Jun 2012 18:15:06 +0100 I would really like to see some sort of convention for presenting per-queue statistics through ethtool. At the moment we have a complete mess of different formats: Indeed. Probably ${QUEUE_TYPE}-${INDEX}-${STATISTIC} is best. With an agreed upon list of queue types such as rx, tx, rxtx etc. I think we should leave the type names open-ended, as there are other useful groupings like per-virtual-port. In that case the separator should be chosen to allow arbitrary type names without ambiguity. So you mean like something along the lines of the presence of say '.' indicating indent a level: rx_bytes: 1234 myqueue1.rx_bytes: 234 myqueue2.rx_bytes: 345 ... Most drivers seem to want this sort of ordering/grouping: group0.foo group0.bar ... group1.foo group1.bar ... but if we have a standard way of indicating groups of statistics then the user can choose whether they want to reorder by type name. I mean, whether they want to reorder/regroup by the final part of the statistic name. Ben. -- Ben Hutchings, Staff Engineer, Solarflare Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] virtio-net: fix a race on 32bit arches
On Wed, 2012-06-06 at 22:08 +0200, Eric Dumazet wrote: On Wed, 2012-06-06 at 22:58 +0300, Michael S. Tsirkin wrote: Absolutely, I am talking about virtio here. I'm not kicking u64_stats_sync idea I am just saying that simple locking would work for virtio and might be better as it gives us a way to get counters atomically. Which lock do you own in the RX path ? You'll have to add a lock in fast path. This sounds really a bad choice to me. You have the NAPI 'lock', so when gathering stats you can synchronise using napi_disable() ;-) Ben. -- Ben Hutchings, Staff Engineer, Solarflare Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] virtio-net: fix a race on 32bit arches
On Wed, 2012-06-06 at 23:16 +0300, Michael S. Tsirkin wrote: On Wed, Jun 06, 2012 at 10:08:09PM +0200, Eric Dumazet wrote: On Wed, 2012-06-06 at 22:58 +0300, Michael S. Tsirkin wrote: Absolutely, I am talking about virtio here. I'm not kicking u64_stats_sync idea I am just saying that simple locking would work for virtio and might be better as it gives us a way to get counters atomically. Which lock do you own in the RX path ? We can just disable napi, everything is updated from napi callback. Seriously, though: don't do that; this is going to hurt performance for minimal benefit. Ben. You'll have to add a lock in fast path. This sounds really a bad choice to me. .ndo_get_stats64 is not data path though, is it? -- Ben Hutchings, Staff Engineer, Solarflare Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] vhost-net: add module alias (v2)
On Wed, 2012-01-11 at 09:16 -0800, Stephen Hemminger wrote: By adding the correct module alias, programs won't have to explicitly call modprobe. Vhost-net will always be available if built into the kernel. It does require assigning a permanent minor number for depmod to work. Choose one next to TUN since this driver is related to it. Also, use C99 style initialization. Signed-off-by: Stephen Hemminger shemmin...@vyatta.com --- v2 - document minor number and make sure to not overlap [...] --- a/include/linux/miscdevice.h 2012-01-10 10:56:59.779189436 -0800 +++ b/include/linux/miscdevice.h 2012-01-11 09:13:20.803694316 -0800 @@ -42,6 +42,7 @@ #define AUTOFS_MINOR 235 #define MAPPER_CTRL_MINOR236 #define LOOP_CTRL_MINOR 237 +#define VHOST_NET_MINOR 238 #define MISC_DYNAMIC_MINOR 255 struct device; --- a/Documentation/devices.txt 2012-01-10 10:56:53.399116518 -0800 +++ b/Documentation/devices.txt 2012-01-11 09:12:49.251197653 -0800 @@ -447,6 +447,8 @@ Your cooperation is appreciated. 234 = /dev/btrfs-controlBtrfs control device 235 = /dev/autofs Autofs control device 236 = /dev/mapper/control Device-Mapper control device + 237 = /dev/vhost-netHost kernel accelerator for virtio net [...] 238 != 237. It looks like someone forgot to add loopctrl here. Ben. -- Ben Hutchings, Staff Engineer, Solarflare Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [net-next RFC PATCH 0/5] Series short description
On Fri, 2011-12-09 at 16:01 +1030, Rusty Russell wrote: On Wed, 7 Dec 2011 17:02:04 +, Ben Hutchings bhutchi...@solarflare.com wrote: Solarflare controllers (sfc driver) have 8192 perfect filters for TCP/IPv4 and UDP/IPv4 which can be used for flow steering. (The filters are organised as a hash table, but matched based on 5-tuples.) I implemented the 'accelerated RFS' interface in this driver. I believe the Intel 82599 controllers (ixgbe driver) have both hash-based and perfect filter modes and the driver can be configured to use one or the other. The driver has its own independent mechanism for steering RX and TX flows which predates RFS; I don't know whether it uses hash-based or perfect filters. Thanks for this summary (and Jason, too). I've fallen a long way behind NIC state-of-the-art. Most multi-queue controllers could support a kind of hash-based filtering for TCP/IP by adjusting the RSS indirection table. However, this table is usually quite small (64-256 entries). This means that hash collisions will be quite common and this can result in reordering. The same applies to the small table Jason has proposed for virtio-net. But this happens on real hardware today. Better that real hardware is nice, but is it overkill? What do you mean, it happens on real hardware today? So far as I know, the only cases where we have dynamic adjustment of flow steering are in ixgbe (big table of hash filters, I think) and sfc (perfect filters). I don't think that anyone's currently doing flow steering with the RSS indirection table. (At least, not on Linux. I think that Microsoft was intending to do so on Windows, but I don't know whether they ever did.) And can't you reorder even with perfect matching, since prior packets will be on the old queue and more recent ones on the new queue? Does it discard or requeue old ones? Or am I missing a trick? Yes, that is possible. RFS is careful to avoid such reordering by only changing the steering of a flow when none of its packets can be in a software receive queue. It is not generally possible to do the same for hardware receive queues. However, when the first condition is met it is likely that there won't be a whole lot of packets for that flow in the hardware receive queue either. (But if there are, then I think as a side-effect of commit 09994d1 RFS will repeatedly ask the driver to steer the flow. Which isn't ideal.) Ben. -- Ben Hutchings, Staff Engineer, Solarflare Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [net-next RFC PATCH 0/5] Series short description
On Wed, 2011-12-07 at 19:31 +0800, Jason Wang wrote: On 12/07/2011 03:30 PM, Rusty Russell wrote: On Mon, 05 Dec 2011 16:58:37 +0800, Jason Wangjasow...@redhat.com wrote: multiple queue virtio-net: flow steering through host/guest cooperation Hello all: This is a rough series adds the guest/host cooperation of flow steering support based on Krish Kumar's multiple queue virtio-net driver patch 3/3 (http://lwn.net/Articles/467283/). Is there a real (physical) device which does this kind of thing? How do they do it? Can we copy them? Cheers, Rusty. As far as I see, ixgbe and sfc have similar but much more sophisticated mechanism. The idea was originally suggested by Ben and it was just borrowed form those real physical nic cards who can dispatch packets based on their hash. All of theses cards can filter the flow based on the hash of L2/L3/L4 header and the stack would tell the card which queue should this flow goes. Solarflare controllers (sfc driver) have 8192 perfect filters for TCP/IPv4 and UDP/IPv4 which can be used for flow steering. (The filters are organised as a hash table, but matched based on 5-tuples.) I implemented the 'accelerated RFS' interface in this driver. I believe the Intel 82599 controllers (ixgbe driver) have both hash-based and perfect filter modes and the driver can be configured to use one or the other. The driver has its own independent mechanism for steering RX and TX flows which predates RFS; I don't know whether it uses hash-based or perfect filters. Most multi-queue controllers could support a kind of hash-based filtering for TCP/IP by adjusting the RSS indirection table. However, this table is usually quite small (64-256 entries). This means that hash collisions will be quite common and this can result in reordering. The same applies to the small table Jason has proposed for virtio-net. So in host, a simple hash to queue table were introduced in tap/macvtap and in guest, the guest driver would tell the desired queue of a flow through changing this table. I don't think accelerated RFS can work well without the use of perfect filtering or hash-based filtering with a very low rate of collisions. Ben. -- Ben Hutchings, Staff Engineer, Solarflare Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [net-next RFC PATCH 2/5] tuntap: simple flow director support
On Tue, 2011-12-06 at 15:21 +0800, Jason Wang wrote: On 12/06/2011 04:09 AM, Ben Hutchings wrote: On Mon, 2011-12-05 at 16:58 +0800, Jason Wang wrote: This patch adds a simple flow director to tun/tap device. It is just a page that contains the hash to queue mapping which could be changed by user-space. The backend (tap/macvtap) would query this table to get the desired queue of a packets when it send packets to userspace. This is just flow hashing (RSS), not flow steering. The page address were set through a new kind of ioctl - TUNSETFD and were pinned until device exit or another new page were specified. [...] You should implement ethtool ETHTOOL_{G,S}RXFHINDIR instead. Ben. I'm not fully understanding this. The page belongs to guest, and the idea is to let guest driver can easily change any entry. Looks like if ethtool_set_rxfh_indir() is used, this kind of change is not easy as it needs one copy and can only accept the whole table as its parameters. Sorry, yes, I was misreading this. Ben. -- Ben Hutchings, Staff Engineer, Solarflare Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [net-next RFC PATCH 5/5] virtio-net: flow director support
On Tue, 2011-12-06 at 15:25 +0800, Jason Wang wrote: On 12/06/2011 04:42 AM, Ben Hutchings wrote: [...] This is not a proper implementation of ndo_rx_flow_steer. If you steer a flow by changing the RSS table this can easily cause packet reordering in other flows. The filtering should be more precise, ideally matching exactly a single flow by e.g. VID and IP 5-tuple. I think you need to add a second hash table which records exactly which flow is supposed to be steered. Also, you must call rps_may_expire_flow() to check whether an entry in this table may be replaced; otherwise you can cause packet reordering in the flow that was previously being steered. Finally, this function must return the table index it assigned, so that rps_may_expire_flow() works. Thanks for the explanation, how about document this briefly in scaling.txt? [...] I believe scaling.txt is intended for users/administrators, not developers. The documentation for implementers of accelerated RFS is in the comment for struct net_device_ops and the commit message adding it. But I really should improve that comment. Ben. -- Ben Hutchings, Staff Engineer, Solarflare Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [net-next RFC PATCH 2/5] tuntap: simple flow director support
On Mon, 2011-12-05 at 16:58 +0800, Jason Wang wrote: This patch adds a simple flow director to tun/tap device. It is just a page that contains the hash to queue mapping which could be changed by user-space. The backend (tap/macvtap) would query this table to get the desired queue of a packets when it send packets to userspace. This is just flow hashing (RSS), not flow steering. The page address were set through a new kind of ioctl - TUNSETFD and were pinned until device exit or another new page were specified. [...] You should implement ethtool ETHTOOL_{G,S}RXFHINDIR instead. Ben. -- Ben Hutchings, Staff Engineer, Solarflare Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [net-next RFC PATCH 3/5] macvtap: flow director support
Similarly, macvtap chould implement the ethtool {get,set}_rxfh_indir operations. Ben. -- Ben Hutchings, Staff Engineer, Solarflare Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [net-next RFC PATCH 5/5] virtio-net: flow director support
On Mon, 2011-12-05 at 16:59 +0800, Jason Wang wrote: In order to let the packets of a flow to be passed to the desired guest cpu, we can co-operate with devices through programming the flow director which was just a hash to queue table. This kinds of co-operation is done through the accelerate RFS support, a device specific flow sterring method virtnet_fd() is used to modify the flow director based on rfs mapping. The desired queue were calculated through reverse mapping of the irq affinity table. In order to parallelize the ingress path, irq affinity of rx queue were also provides by the driver. In addition to accelerate RFS, we can also use the guest scheduler to balance the load of TX and reduce the lock contention on egress path, so the processor_id() were used to tx queue selection. [...] +#ifdef CONFIG_RFS_ACCEL + +int virtnet_fd(struct net_device *net_dev, const struct sk_buff *skb, +u16 rxq_index, u32 flow_id) +{ + struct virtnet_info *vi = netdev_priv(net_dev); + u16 *table = NULL; + + if (skb-protocol != htons(ETH_P_IP) || !skb-rxhash) + return -EPROTONOSUPPORT; Why only IPv4? + table = kmap_atomic(vi-fd_page); + table[skb-rxhash TAP_HASH_MASK] = rxq_index; + kunmap_atomic(table); + + return 0; +} +#endif This is not a proper implementation of ndo_rx_flow_steer. If you steer a flow by changing the RSS table this can easily cause packet reordering in other flows. The filtering should be more precise, ideally matching exactly a single flow by e.g. VID and IP 5-tuple. I think you need to add a second hash table which records exactly which flow is supposed to be steered. Also, you must call rps_may_expire_flow() to check whether an entry in this table may be replaced; otherwise you can cause packet reordering in the flow that was previously being steered. Finally, this function must return the table index it assigned, so that rps_may_expire_flow() works. +static u16 virtnet_select_queue(struct net_device *dev, struct sk_buff *skb) +{ + int txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : +smp_processor_id(); + + /* As we make use of the accelerate rfs which let the scheduler to + * balance the load, it make sense to choose the tx queue also based on + * theprocessor id? + */ + while (unlikely(txq = dev-real_num_tx_queues)) + txq -= dev-real_num_tx_queues; + return txq; +} [...] Don't do this, let XPS handle it. Ben. -- Ben Hutchings, Staff Engineer, Solarflare Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RFC] [ver3 PATCH 3/6] virtio_net: virtio_net driver changes
On Fri, 2011-11-11 at 18:34 +0530, Krishna Kumar wrote: Changes for multiqueue virtio_net driver. [...] @@ -677,25 +730,35 @@ static struct rtnl_link_stats64 *virtnet { struct virtnet_info *vi = netdev_priv(dev); int cpu; - unsigned int start; for_each_possible_cpu(cpu) { - struct virtnet_stats __percpu *stats - = per_cpu_ptr(vi-stats, cpu); - u64 tpackets, tbytes, rpackets, rbytes; - - do { - start = u64_stats_fetch_begin(stats-syncp); - tpackets = stats-tx_packets; - tbytes = stats-tx_bytes; - rpackets = stats-rx_packets; - rbytes = stats-rx_bytes; - } while (u64_stats_fetch_retry(stats-syncp, start)); - - tot-rx_packets += rpackets; - tot-tx_packets += tpackets; - tot-rx_bytes += rbytes; - tot-tx_bytes += tbytes; + int qpair; + + for (qpair = 0; qpair vi-num_queue_pairs; qpair++) { + struct virtnet_send_stats __percpu *tx_stat; + struct virtnet_recv_stats __percpu *rx_stat; While you're at it, you can drop the per-CPU stats and make them only per-queue. There is unlikely to be any benefit in maintaining them per-CPU while receive and transmit processing is serialised per-queue. [...] +static int invoke_find_vqs(struct virtnet_info *vi) +{ + vq_callback_t **callbacks; + struct virtqueue **vqs; + int ret = -ENOMEM; + int i, total_vqs; + char **names; + + /* + * We expect 1 RX virtqueue followed by 1 TX virtqueue, followed + * by the same 'vi-num_queue_pairs-1' more times, and optionally + * one control virtqueue. + */ + total_vqs = vi-num_queue_pairs * 2 + + virtio_has_feature(vi-vdev, VIRTIO_NET_F_CTRL_VQ); + + /* Allocate space for find_vqs parameters */ + vqs = kmalloc(total_vqs * sizeof(*vqs), GFP_KERNEL); + callbacks = kmalloc(total_vqs * sizeof(*callbacks), GFP_KERNEL); + names = kmalloc(total_vqs * sizeof(*names), GFP_KERNEL); + if (!vqs || !callbacks || !names) + goto err; + + /* Allocate/initialize parameters for recv virtqueues */ + for (i = 0; i vi-num_queue_pairs * 2; i += 2) { + callbacks[i] = skb_recv_done; + names[i] = kasprintf(GFP_KERNEL, input.%d, i / 2); + if (!names[i]) + goto err; + } + + /* Allocate/initialize parameters for send virtqueues */ + for (i = 1; i vi-num_queue_pairs * 2; i += 2) { + callbacks[i] = skb_xmit_done; + names[i] = kasprintf(GFP_KERNEL, output.%d, i / 2); + if (!names[i]) + goto err; + } [...] The RX and TX interrupt names for a multiqueue device should follow the formats device-rx-index and device-tx-index. Ben. -- Ben Hutchings, Staff Engineer, Solarflare Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RFC] [ver3 PATCH 3/6] virtio_net: virtio_net driver changes
On Fri, 2011-11-18 at 08:24 +0200, Sasha Levin wrote: On Fri, 2011-11-18 at 01:08 +, Ben Hutchings wrote: On Fri, 2011-11-11 at 18:34 +0530, Krishna Kumar wrote: Changes for multiqueue virtio_net driver. [...] @@ -677,25 +730,35 @@ static struct rtnl_link_stats64 *virtnet { struct virtnet_info *vi = netdev_priv(dev); int cpu; - unsigned int start; for_each_possible_cpu(cpu) { - struct virtnet_stats __percpu *stats - = per_cpu_ptr(vi-stats, cpu); - u64 tpackets, tbytes, rpackets, rbytes; - - do { - start = u64_stats_fetch_begin(stats-syncp); - tpackets = stats-tx_packets; - tbytes = stats-tx_bytes; - rpackets = stats-rx_packets; - rbytes = stats-rx_bytes; - } while (u64_stats_fetch_retry(stats-syncp, start)); - - tot-rx_packets += rpackets; - tot-tx_packets += tpackets; - tot-rx_bytes += rbytes; - tot-tx_bytes += tbytes; + int qpair; + + for (qpair = 0; qpair vi-num_queue_pairs; qpair++) { + struct virtnet_send_stats __percpu *tx_stat; + struct virtnet_recv_stats __percpu *rx_stat; While you're at it, you can drop the per-CPU stats and make them only per-queue. There is unlikely to be any benefit in maintaining them per-CPU while receive and transmit processing is serialised per-queue. It allows you to update stats without a lock. But you'll already be holding a lock related to the queue. Whats the benefit of having them per queue? It should save some memory (and a little time when summing stats, though that's unlikely to matter much). The important thing is that splitting up stats per-CPU *and* per-queue is a waste. Ben. -- Ben Hutchings, Staff Engineer, Solarflare Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RFC] [ver3 PATCH 3/6] virtio_net: virtio_net driver changes
On Fri, 2011-11-18 at 18:18 +0200, Sasha Levin wrote: On Fri, 2011-11-18 at 15:40 +, Ben Hutchings wrote: On Fri, 2011-11-18 at 08:24 +0200, Sasha Levin wrote: On Fri, 2011-11-18 at 01:08 +, Ben Hutchings wrote: On Fri, 2011-11-11 at 18:34 +0530, Krishna Kumar wrote: Changes for multiqueue virtio_net driver. [...] @@ -677,25 +730,35 @@ static struct rtnl_link_stats64 *virtnet { struct virtnet_info *vi = netdev_priv(dev); int cpu; - unsigned int start; for_each_possible_cpu(cpu) { - struct virtnet_stats __percpu *stats - = per_cpu_ptr(vi-stats, cpu); - u64 tpackets, tbytes, rpackets, rbytes; - - do { - start = u64_stats_fetch_begin(stats-syncp); - tpackets = stats-tx_packets; - tbytes = stats-tx_bytes; - rpackets = stats-rx_packets; - rbytes = stats-rx_bytes; - } while (u64_stats_fetch_retry(stats-syncp, start)); - - tot-rx_packets += rpackets; - tot-tx_packets += tpackets; - tot-rx_bytes += rbytes; - tot-tx_bytes += tbytes; + int qpair; + + for (qpair = 0; qpair vi-num_queue_pairs; qpair++) { + struct virtnet_send_stats __percpu *tx_stat; + struct virtnet_recv_stats __percpu *rx_stat; While you're at it, you can drop the per-CPU stats and make them only per-queue. There is unlikely to be any benefit in maintaining them per-CPU while receive and transmit processing is serialised per-queue. It allows you to update stats without a lock. But you'll already be holding a lock related to the queue. Right, but now you're holding a queue lock just when playing with the queue, we don't hold it when we process the data - which is when we usually need to update stats. [...] The *stack* is holding the appropriate lock when calling the NAPI poll function or ndo_start_xmit function. Ben. -- Ben Hutchings, Staff Engineer, Solarflare Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RFC PATCH net-next] enable virtio_net to return bus_info in ethtool -i consistent with emulated NICs
On Mon, 2011-11-14 at 13:52 -0800, Rick Jones wrote: From: Rick Jones rick.jon...@hp.com Add a new .bus_name to virtio_config_ops then modify virtio_net to call through to it in an ethtool .get_drvinfo routine to report bus_info in ethtool -i output which is consistent with other emulated NICs and the output of lspci. [...] diff --git a/drivers/lguest/lguest_device.c b/drivers/lguest/lguest_device.c index 0dc30ff..3724d45 100644 --- a/drivers/lguest/lguest_device.c +++ b/drivers/lguest/lguest_device.c @@ -381,6 +381,11 @@ error: return PTR_ERR(vqs[i]); } +static const char *lg_bus_name(struct virtio_device *vdev) +{ + return Not Implemented; +} [...] +static const char *kvm_bus_name(struct virtio_device *vdev) +{ + return Not Implemented; +} [...] Please use the existing 'not implemented' value, which is the empty string. If you think ethtool should print some helpful message instead of an empty string, please submit a patch for ethtool. Ben. -- Ben Hutchings, Staff Engineer, Solarflare Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RFC PATCH net-next] enable virtio_net to return bus_info in ethtool -i consistent with emulated NICs
On Mon, 2011-11-14 at 16:06 -0800, Rick Jones wrote: On 11/14/2011 02:30 PM, Ben Hutchings wrote: On Mon, 2011-11-14 at 13:52 -0800, Rick Jones wrote: From: Rick Jonesrick.jon...@hp.com Add a new .bus_name to virtio_config_ops then modify virtio_net to call through to it in an ethtool .get_drvinfo routine to report bus_info in ethtool -i output which is consistent with other emulated NICs and the output of lspci. [...] diff --git a/drivers/lguest/lguest_device.c b/drivers/lguest/lguest_device.c index 0dc30ff..3724d45 100644 --- a/drivers/lguest/lguest_device.c +++ b/drivers/lguest/lguest_device.c @@ -381,6 +381,11 @@ error: return PTR_ERR(vqs[i]); } +static const char *lg_bus_name(struct virtio_device *vdev) +{ + return Not Implemented; +} [...] +static const char *kvm_bus_name(struct virtio_device *vdev) +{ + return Not Implemented; +} [...] Please use the existing 'not implemented' value, which is the empty string. If you think ethtool should print some helpful message instead of an empty string, please submit a patch for ethtool. One question - will those actually be called via an ethtool path? In my poking about through the virtio code, I got the impression those modules were for other than networking sorts of things. I don't know; I just assumed that was why you were adding them! In other contexts such as dev_printk() this string would make even less sense. Ben. -- Ben Hutchings, Staff Engineer, Solarflare Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization