Re: virtio_balloon regression in 5.19-rc3

2022-06-21 Thread Ben Hutchings
On Tue, 2022-06-21 at 17:34 +0800, Jason Wang wrote:
> On Tue, Jun 21, 2022 at 5:24 PM David Hildenbrand  wrote:
> > 
> > On 20.06.22 20:49, Ben Hutchings wrote:
> > > I've tested a 5.19-rc3 kernel on top of QEMU/KVM with machine type
> > > pc-q35-5.2.  It has a virtio balloon device defined in libvirt as:
> > > 
> > > 
> > >> > function="0x0"/>
> > > 
> > > 
> > > but the virtio_balloon driver fails to bind to it:
> > > 
> > > virtio_balloon virtio4: init_vqs: add stat_vq failed
> > > virtio_balloon: probe of virtio4 failed with error -5
> > > 
> > 
> > Hmm, I don't see any recent changes to drivers/virtio/virtio_balloon.c
> > 
> > virtqueue_add_outbuf() fails with -EIO if I'm not wrong. That's the
> > first call of virtqueue_add_outbuf() when virtio_balloon initializes.
> > 
> > 
> > Maybe something in generic virtio code changed?
> 
> Yes, we introduced the IRQ hardening. That could be the root cause and
> we've received lots of reports so we decide to disable it by default.
> 
> Ben, could you please try this patch: (and make sure
> CONFIG_VIRTIO_HARDEN_NOTIFICATION is not set)
> 
> https://lore.kernel.org/lkml/20220620024158.2505-1-jasow...@redhat.com/T/

Yes, that patch fixes the regression for me.

Ben.

-- 
Ben Hutchings
Any smoothly functioning technology is indistinguishable
from a rigged demo.


signature.asc
Description: This is a digitally signed message part
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH 3.16 093/245] virtio-balloon: fix managed page counts when migrating pages between zones

2020-04-23 Thread Ben Hutchings
  [  186.026321] Offlined Pages 32768
  [  187.684861] Offlined Pages 32768
  [  189.227013] Offlined Pages 32768
  [  190.830303] Offlined Pages 32768
  [  190.833071] Built 1 zonelists, mobility grouping on.  Total pages: 
-36920272750453009

In another instance (older kernel), I was no longer able to start any
process:
  [root@vm ~]# [  214.348068] Offlined Pages 32768
  [  215.973009] Offlined Pages 32768
  cat /proc/meminfo
  -bash: fork: Cannot allocate memory
  [root@vm ~]# cat /proc/meminfo
  -bash: fork: Cannot allocate memory

Fix it by properly adjusting the managed page count when migrating if
the zone changed. The managed page count of the zones now looks after
unplug of the DIMM (and after deflating the balloon) just like before
inflating the balloon (and plugging+onlining the DIMM).

We'll temporarily modify the totalram page count. If this ever becomes a
problem, we can fine tune by providing helpers that don't touch
the totalram pages (e.g., adjust_zone_managed_page_count()).

Please note that fixing up the managed page count is only necessary when
we adjusted the managed page count when inflating - only if we
don't have VIRTIO_BALLOON_F_DEFLATE_ON_OOM. With that feature, the
managed page count is not touched when inflating/deflating.

Reported-by: Yumei Huang 
Fixes: 3dcc0571cd64 ("mm: correctly update zone->managed_pages")
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: Jiang Liu 
Cc: Andrew Morton 
Cc: Igor Mammedov 
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: David Hildenbrand 
Signed-off-by: Michael S. Tsirkin 
[bwh: Backported to 3.16: Deflate-on-OOM is not supported at all so don't
 check that flag]
Signed-off-by: Ben Hutchings 
---
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -403,6 +403,16 @@ static int virtballoon_migratepage(struc
 
get_page(newpage); /* balloon reference */
 
+   /*
+ * When we migrate a page to a different zone and adjusted the
+ * managed page count when inflating, we have to fixup the count of
+ * both involved zones.
+ */
+   if (page_zone(page) != page_zone(newpage)) {
+   adjust_managed_page_count(page, 1);
+   adjust_managed_page_count(newpage, -1);
+   }
+
/* balloon's page migration 1st step  -- inflate "newpage" */
spin_lock_irqsave(_dev_info->pages_lock, flags);
balloon_page_insert(newpage, mapping, _dev_info->pages);

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH 3.16 140/305] x86/hyper-v: Enable PIT shutdown quirk

2019-02-04 Thread Ben Hutchings
3.16.63-rc1 review patch.  If anyone has any objections, please let me know.

--

From: Michael Kelley 

commit 1de72c706488b7be664a601cf3843bd01e327e58 upstream.

Hyper-V emulation of the PIT has a quirk such that the normal PIT shutdown
path doesn't work, because clearing the counter register restarts the
timer.

Disable the counter clearing on PIT shutdown.

Signed-off-by: Michael Kelley 
Signed-off-by: Thomas Gleixner 
Cc: "gre...@linuxfoundation.org" 
Cc: "de...@linuxdriverproject.org" 
Cc: "daniel.lezc...@linaro.org" 
Cc: "virtualization@lists.linux-foundation.org" 

Cc: "jgr...@suse.com" 
Cc: "akata...@vmware.com" 
Cc: "o...@aepfle.de" 
Cc: "a...@canonical.com" 
Cc: vkuznets 
Cc: "jasow...@redhat.com" 
Cc: "marcelo.ce...@canonical.com" 
Cc: KY Srinivasan 
Link: 
https://lkml.kernel.org/r/1541303219-11142-3-git-send-email-mikel...@microsoft.com
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings 
---
 arch/x86/kernel/cpu/mshyperv.c | 11 +++
 1 file changed, 11 insertions(+)

--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -143,6 +144,16 @@ static void __init ms_hyperv_init_platfo
no_timer_check = 1;
 #endif
 
+   /*
+* Hyper-V VMs have a PIT emulation quirk such that zeroing the
+* counter register during PIT shutdown restarts the PIT. So it
+* continues to interrupt @18.2 HZ. Setting i8253_clear_counter
+* to false tells pit_shutdown() not to zero the counter so that
+* the PIT really is shutdown. Generation 2 VMs don't have a PIT,
+* and setting this value has no effect.
+*/
+   i8253_clear_counter_on_shutdown = false;
+
 }
 
 const __refconst struct hypervisor_x86 x86_hyper_ms_hyperv = {

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH 3.16 139/305] clockevents/drivers/i8253: Add support for PIT shutdown quirk

2019-02-04 Thread Ben Hutchings
3.16.63-rc1 review patch.  If anyone has any objections, please let me know.

--

From: Michael Kelley 

commit 35b69a420bfb56b7b74cb635ea903db05e357bec upstream.

Add support for platforms where pit_shutdown() doesn't work because of a
quirk in the PIT emulation. On these platforms setting the counter register
to zero causes the PIT to start running again, negating the shutdown.

Provide a global variable that controls whether the counter register is
zero'ed, which platform specific code can override.

Signed-off-by: Michael Kelley 
Signed-off-by: Thomas Gleixner 
Cc: "gre...@linuxfoundation.org" 
Cc: "de...@linuxdriverproject.org" 
Cc: "daniel.lezc...@linaro.org" 
Cc: "virtualization@lists.linux-foundation.org" 

Cc: "jgr...@suse.com" 
Cc: "akata...@vmware.com" 
Cc: "o...@aepfle.de" 
Cc: "a...@canonical.com" 
Cc: vkuznets 
Cc: "jasow...@redhat.com" 
Cc: "marcelo.ce...@canonical.com" 
Cc: KY Srinivasan 
Link: 
https://lkml.kernel.org/r/1541303219-11142-2-git-send-email-mikel...@microsoft.com
[bwh: Backported to 3.16:
 - Don't use __ro_after_init
 - Adjust context, indentation]
Signed-off-by: Ben Hutchings 
---
 drivers/clocksource/i8253.c | 14 --
 include/linux/i8253.h   |  1 +
 2 files changed, 13 insertions(+), 2 deletions(-)

--- a/drivers/clocksource/i8253.c
+++ b/drivers/clocksource/i8253.c
@@ -19,6 +19,13 @@
 DEFINE_RAW_SPINLOCK(i8253_lock);
 EXPORT_SYMBOL(i8253_lock);
 
+/*
+ * Handle PIT quirk in pit_shutdown() where zeroing the counter register
+ * restarts the PIT, negating the shutdown. On platforms with the quirk,
+ * platform specific code can set this to false.
+ */
+bool i8253_clear_counter_on_shutdown = true;
+
 #ifdef CONFIG_CLKSRC_I8253
 /*
  * Since the PIT overflows every tick, its not very useful
@@ -123,8 +130,11 @@ static void init_pit_timer(enum clock_ev
if (evt->mode == CLOCK_EVT_MODE_PERIODIC ||
evt->mode == CLOCK_EVT_MODE_ONESHOT) {
outb_p(0x30, PIT_MODE);
-   outb_p(0, PIT_CH0);
-   outb_p(0, PIT_CH0);
+
+   if (i8253_clear_counter_on_shutdown) {
+   outb_p(0, PIT_CH0);
+   outb_p(0, PIT_CH0);
+   }
}
break;
 
--- a/include/linux/i8253.h
+++ b/include/linux/i8253.h
@@ -21,6 +21,7 @@
 #define PIT_LATCH  ((PIT_TICK_RATE + HZ/2) / HZ)
 
 extern raw_spinlock_t i8253_lock;
+extern bool i8253_clear_counter_on_shutdown;
 extern struct clock_event_device i8253_clockevent;
 extern void clockevent_i8253_init(bool oneshot);
 

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH 3.16 249/328] x86/paravirt: Fix some warning messages

2018-12-09 Thread Ben Hutchings
3.16.62-rc1 review patch.  If anyone has any objections, please let me know.

--

From: Dan Carpenter 

commit 571d0563c8881595f4ab027aef9ed1c55e3e7b7c upstream.

The first argument to WARN_ONCE() is a condition.

Fixes: 5800dc5c19f3 ("x86/paravirt: Fix spectre-v2 mitigations for paravirt 
guests")
Signed-off-by: Dan Carpenter 
Signed-off-by: Thomas Gleixner 
Reviewed-by: Juergen Gross 
Cc: Peter Zijlstra 
Cc: Alok Kataria 
Cc: "H. Peter Anvin" 
Cc: virtualization@lists.linux-foundation.org
Cc: kernel-janit...@vger.kernel.org
Link: https://lkml.kernel.org/r/20180919103553.GD9238@mwanda
Signed-off-by: Ben Hutchings 
---
 arch/x86/kernel/paravirt.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -99,7 +99,7 @@ unsigned paravirt_patch_call(void *insnb
 
if (len < 5) {
 #ifdef CONFIG_RETPOLINE
-   WARN_ONCE("Failing to patch indirect CALL in %ps\n", (void 
*)addr);
+   WARN_ONCE(1, "Failing to patch indirect CALL in %ps\n", (void 
*)addr);
 #endif
return len; /* call too long for patch site */
}
@@ -119,7 +119,7 @@ unsigned paravirt_patch_jmp(void *insnbu
 
if (len < 5) {
 #ifdef CONFIG_RETPOLINE
-   WARN_ONCE("Failing to patch indirect JMP in %ps\n", (void 
*)addr);
+   WARN_ONCE(1, "Failing to patch indirect JMP in %ps\n", (void 
*)addr);
 #endif
return len; /* call too long for patch site */
}

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 1/3] ipv6: Select fragment id during UFO/GSO segmentation if not set.

2015-01-28 Thread Ben Hutchings
On Wed, 2015-01-28 at 11:46 +0200, Michael S. Tsirkin wrote:
 On Wed, Jan 28, 2015 at 09:25:08AM +0100, Hannes Frederic Sowa wrote:
[...]
  I see fragmentation id generation still as security critical:
  When Eric patched the frag id generator in 04ca6973f7c1a0d (ip: make IP
  identifiers less predictable) I could patch my kernels and use the
  patch regardless of the machine being virtualized or not. It was not
  dependent on the hypervisor.
 
 And now it's even easier - just patch the hypervisor, and all VMs
 automatically benefit.
[...]

You are advocating that the hypervisor should continue to act as a
middle-box that quietly modifies packets.  This may be useful to protect
guests that have poor fragment ID generation, but then that should be an
optional netfilter module and *not* the default.  The default should be
that UFO has no effect on the packet headers on the wire, and therefore
that the fragment ID is chosen by the IPv6 stack in the guest.

Ben.

-- 
Ben Hutchings
Teamwork is essential - it allows you to blame someone else.


signature.asc
Description: This is a digitally signed message part
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH 1/3] ipv6: Select fragment id during UFO/GSO segmentation if not set.

2015-01-26 Thread Ben Hutchings
On Mon, 2015-01-26 at 09:37 -0500, Vladislav Yasevich wrote:
 If the IPv6 fragment id has not been set and we perform
 fragmentation due to UFO, select a new fragment id.
 When we store the fragment id into skb_shinfo, set the bit
 in the skb so we can re-use the selected id.
 This preserves the behavior of UFO packets generated on the
 host and solves the issue of id generation for packet sockets
 and tap/macvtap devices.
 
 This patch moves ipv6_select_ident() back in to the header file.  
 It also provides the helper function that sets skb_shinfo() frag
 id and sets the bit.
 
 It also makes sure that we select the fragment id when doing
 just gso validation, since it's possible for the packet to
 come from an untrusted source (VM) and be forwarded through
 a UFO enabled device which will expect the fragment id.
 
 CC: Eric Dumazet eduma...@google.com
 Signed-off-by: Vladislav Yasevich vyase...@redhat.com
 ---
  include/linux/skbuff.h |  3 ++-
  include/net/ipv6.h |  2 ++
  net/ipv6/ip6_output.c  |  4 ++--
  net/ipv6/output_core.c |  9 -
  net/ipv6/udp_offload.c | 10 +-
  5 files changed, 23 insertions(+), 5 deletions(-)
 
 diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
 index 85ab7d7..3ad5203 100644
 --- a/include/linux/skbuff.h
 +++ b/include/linux/skbuff.h
 @@ -605,7 +605,8 @@ struct sk_buff {
   __u8ipvs_property:1;
   __u8inner_protocol_type:1;
   __u8remcsum_offload:1;
 - /* 3 or 5 bit hole */
 + __u8ufo_fragid_set:1;
[...]

Doesn't the flag belong in struct skb_shared_info, rather than struct
sk_buff?  Otherwise this looks fine.

Ben.

-- 
Ben Hutchings
When in doubt, use brute force. - Ken Thompson


signature.asc
Description: This is a digitally signed message part
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH 3/3] Revert drivers/net: Disable UFO through virtio

2015-01-26 Thread Ben Hutchings
On Mon, 2015-01-26 at 09:37 -0500, Vladislav Yasevich wrote:
 This reverts commit 3d0ad09412ffe00c9afa201d01effdb6023d09b4.
 
 Now that UFO functionality can correctly track the fragment
 id has been selected and select a fragment id if necessary,
 we can re-enable UFO on tap/macvap and virtio devices.
 
 Signed-off-by: Vladislav Yasevich vyase...@redhat.com
[...]

Acked-by: Ben Hutchings b...@decadent.org.uk

I think this is sensible even before your other patches.

Ben.

-- 
Ben Hutchings
When in doubt, use brute force. - Ken Thompson


signature.asc
Description: This is a digitally signed message part
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH 00/10] Split UFO into v4 and v6 versions.

2014-12-24 Thread Ben Hutchings
On Thu, 2014-12-18 at 00:28 -0500, Jason Wang wrote:
 
 - Original Message -
  UFO support in the kernel applies to both IPv4 and IPv6 protocols
  with the same device feature.  However some devices may not be able
  to support one of the offloads.  For this we split the UFO offload
  feature into 2 pieces.  NETIF_F_UFO now controlls the IPv4 part and
  this series introduces NETIF_F_UFO6.
  
  As a result of this work, we can now re-enable NETIF_F_UFO on
  virtio_net devices and restore UDP over IPv4 performance for guests.
  We also continue to support legacy guests that assume that UFO6
  support included into UFO(4).
  
  Without this work, migrating a guest to a 3.18 kernel fails.
  
 
 This series eliminate the ambiguous NETIF_F_UFO.
 
 But VIRTIO_NET_F_{HOST|GUEST}_UFO and VIRTIO_NET_F_HDR_GSO_UDP is still
 ambigious. I know it was used to keep compatibility for legacy guest. But
 what's the future plan? Differentiate UFOv4 and UFOv6 in virtio features and
 gso type in vnet header looks sufficient?
[...]

The IPv6 fragmentation ID needs to be added to the vnet header, to do
UFOv6 properly.  If it wasn't for that lack, we wouldn't have to split
the feature flag.

Ben.

-- 
Ben Hutchings
If more than one person is responsible for a bug, no one is at fault.


signature.asc
Description: This is a digitally signed message part
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH 01/10] core: Split out UFO6 support

2014-12-17 Thread Ben Hutchings
On Wed, 2014-12-17 at 13:20 -0500, Vladislav Yasevich wrote:
 Split IPv6 support for UFO into its own feature similiar to TSO.
 This will later allow us to re-enable UFO support for virtio-net
 devices.
[...]
 diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
 index 6c8b6f6..8538b67 100644
 --- a/include/linux/skbuff.h
 +++ b/include/linux/skbuff.h
 @@ -372,6 +372,7 @@ enum {
  
   SKB_GSO_MPLS = 1  12,
  
 + SKB_GSO_UDP6 = 1  13

It seems like it would be cleaner to use the names SKB_GSO_UDPV{4,6},
similarly to SKB_GSO_TCPV{4,6}.

  };
  
  #if BITS_PER_LONG  32
 diff --git a/net/core/dev.c b/net/core/dev.c
 index 945bbd0..fa4d2ee 100644
 --- a/net/core/dev.c
 +++ b/net/core/dev.c
[...]
 @@ -5952,24 +5958,21 @@ static netdev_features_t netdev_fix_features(struct 
 net_device *dev,
[...]
 + /* UFO also needs checksumming */
 + if ((features  NETIF_F_UFO)  !(features  NETIF_F_GEN_CSUM) 
 + !(features  NETIF_F_IP_CSUM)) {

You can use !(features  NETIF_F_V4_CSUM) instead of the last two terms.

 + netdev_dbg(dev,
 +Dropping NETIF_F_UFO since no checksum offload 
 features.\n);
 + features = ~NETIF_F_UFO;
 + }
 + if ((features  NETIF_F_UFO6)  !(features  NETIF_F_GEN_CSUM) 
 +  !(features  NETIF_F_IPV6_CSUM)) {
[...]

Similarly you can use !(features  NETIF_F_V6_CSUM) instead of the last
two terms.

Aside from those minor points, this looks fine.

Ben.

-- 
Ben Hutchings
Absolutum obsoletum. (If it works, it's out of date.) - Stafford Beer


signature.asc
Description: This is a digitally signed message part
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH v7 28/46] vhost: make features 64 bit

2014-11-30 Thread Ben Hutchings
On Sun, 2014-11-30 at 18:44 +0300, Sergei Shtylyov wrote:
 Hello.
 
 On 11/30/2014 6:11 PM, Michael S. Tsirkin wrote:
 
  We need to use bit 32 for virtio 1.0
 
  Signed-off-by: Michael S. Tsirkin m...@redhat.com
  Reviewed-by: Jason Wang jasow...@redhat.com
  ---
drivers/vhost/vhost.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
 
  diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
  index 3eda654..c624b09 100644
  --- a/drivers/vhost/vhost.h
  +++ b/drivers/vhost/vhost.h
  @@ -106,7 +106,7 @@ struct vhost_virtqueue {
  /* Protected by virtqueue mutex. */
  struct vhost_memory *memory;
  void *private_data;
  -   unsigned acked_features;
  +   u64 acked_features;
  /* Log write descriptors */
  void __user *log_base;
  struct vhost_log *log;
  @@ -174,6 +174,6 @@ enum {
 
static inline int vhost_has_feature(struct vhost_virtqueue *vq, int bit)
{
  -   return vq-acked_features  (1  bit);
  +   return vq-acked_features  (1ULL  bit);
 
 Erm, wouldn't the high word be just dropped when returning *int*? I think 
 you need !!(vq-acked_features  (1ULL  bit)).

Or change the return type to bool.

Ben.

-- 
Ben Hutchings
The first rule of tautology club is the first rule of tautology club.


signature.asc
Description: This is a digitally signed message part
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH v2 net 1/2] drivers/net: Disable UFO through virtio

2014-11-21 Thread Ben Hutchings
On Wed, 2014-11-19 at 11:14 +0200, Michael S. Tsirkin wrote:
 On Thu, Oct 30, 2014 at 06:27:12PM +, Ben Hutchings wrote:
  IPv6 does not allow fragmentation by routers, so there is no
  fragmentation ID in the fixed header.  UFO for IPv6 requires the ID to
  be passed separately, but there is no provision for this in the virtio
  net protocol.
  
  Until recently our software implementation of UFO/IPv6 generated a new
  ID, but this was a bug.  Now we will use ID=0 for any UFO/IPv6 packet
  passed through a tap, which is even worse.
  
  Unfortunately there is no distinction between UFO/IPv4 and v6
  features, so disable UFO on taps and virtio_net completely until we
  have a proper solution.
  
  We cannot depend on VM managers respecting the tap feature flags, so
  keep accepting UFO packets but log a warning the first time we do
  this.
  
  Signed-off-by: Ben Hutchings b...@decadent.org.uk
  Fixes: 916e4cf46d02 (ipv6: reuse ip6_frag_id from ip6_ufo_append_data)
 
 
 There's something I don't understand here. I see:
 
 NETIF_F_UFO_BIT,/* ... UDPv4 fragmentation */
 
 this comment is wrong then?

Yes.

 The patches drastically regress performance for UDPv4 for VMs only, but
 isn't it likely many other devices based their code on this comment?

There's only one hardware driver that implements UFO (s2io), and it does
handle IPv6.

 How about we disable UFO for IPv6 globally, and put the
 flag back in?
 We can then gradually add NETIF_F_UFO6_BIT for devices that
 actually support UFO for IPv6.

Since the corresponding virtio feature bit is understood to include
UFO/IPv6, and existing VMs rely on that, I don't see what this solves.

Ben.

-- 
Ben Hutchings
Beware of bugs in the above code;
I have only proved it correct, not tried it. - Donald Knuth


signature.asc
Description: This is a digitally signed message part
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: TUN_F_UFO change breaks live migration

2014-11-11 Thread Ben Hutchings
On Tue, 2014-11-11 at 10:58 +, Stefan Hajnoczi wrote:
 Commit 3d0ad09412ffe00c9afa201d01effdb6023d09b4 (drivers/net: Disable
 UFO through virtio) breaks live migration of KVM guests from older to
 newer host kernels:
 
 https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=3d0ad09412ffe00c9afa201d01effdb6023d09b4
 
 The problem occurs when a guest running on a host kernel without commit
 3d0ad0941 in tun.ko attempts to live migration to a host with commit
 3d0ad0941.
 
 Live migration fails in QEMU with the following error message:
 
   virtio-net: saved image requires TUN_F_UFO support
 
 The old host tun.ko advertised support for TUN_F_UFO.  The new host
 tun.ko does not and that's why QEMU aborts live migration.  QEMU cannot
 change the features of a running virtio-net device.

Yes, this is known and was mentioned in the DSA.

 tuxcrafter provided logs from two Debian hosts migrating from
 3.2.60-1+deb7u3 to 3.2.63-2+deb7u1:
 
 http://paste.debian.net/131264/
 
 I haven't investigated enough to suggest a fix, just wanted to bring it
 to your attention.  Soon a lot of people will be hitting this problem as
 they upgrade their infrastructure and migrate guests - seems like a
 critical issue.

You can work around this by making macvtap and tun still claim to
support UFO.  They continue to support it even if it's not advertised
because the tap features don't reliably get propagated to virtio
devices.

Ben.

-- 
Ben Hutchings
Experience is directly proportional to the value of equipment destroyed.
 - Carolyn Scheppner


signature.asc
Description: This is a digitally signed message part
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: TUN_F_UFO change breaks live migration

2014-11-11 Thread Ben Hutchings
On Tue, 2014-11-11 at 17:57 +0200, Michael S. Tsirkin wrote:
 On Tue, Nov 11, 2014 at 12:17:26PM +, Ben Hutchings wrote:
  On Tue, 2014-11-11 at 10:58 +, Stefan Hajnoczi wrote:
   Commit 3d0ad09412ffe00c9afa201d01effdb6023d09b4 (drivers/net: Disable
   UFO through virtio) breaks live migration of KVM guests from older to
   newer host kernels:
   
   https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=3d0ad09412ffe00c9afa201d01effdb6023d09b4
   
   The problem occurs when a guest running on a host kernel without commit
   3d0ad0941 in tun.ko attempts to live migration to a host with commit
   3d0ad0941.
   
   Live migration fails in QEMU with the following error message:
   
 virtio-net: saved image requires TUN_F_UFO support
   
   The old host tun.ko advertised support for TUN_F_UFO.  The new host
   tun.ko does not and that's why QEMU aborts live migration.  QEMU cannot
   change the features of a running virtio-net device.
  
  Yes, this is known and was mentioned in the DSA.
  
   tuxcrafter provided logs from two Debian hosts migrating from
   3.2.60-1+deb7u3 to 3.2.63-2+deb7u1:
   
   http://paste.debian.net/131264/
   
   I haven't investigated enough to suggest a fix, just wanted to bring it
   to your attention.  Soon a lot of people will be hitting this problem as
   they upgrade their infrastructure and migrate guests - seems like a
   critical issue.
  
  You can work around this by making macvtap and tun still claim to
  support UFO.
 
 If this is what we want userspace to do, let's just put the
 feature flag back?

Let's not *just* put the feature flag back.  I accept this is probably
needed as a workaround, but UFO/IPv6 still won't work correctly over
virtio.

 Basically userspace assumed that features will only
 ever be added, never removed, so this change is
 breaking it.

OK.

   They continue to support it even if it's not advertised
  because the tap features don't reliably get propagated to virtio
  devices.
  
  Ben.
 
 Hmm I don't understand this last sentence.
 features are actually reliably propagated to virtio devices.

They might be when using current QEMU and libvirt on the host.  They
weren't when I tested on Debian stable.  The warnings about 'using
disabled UFO feature' are reliably triggered if the host's tap driver is
patched and the guest's virtio_net driver is not.

Ben.

-- 
Ben Hutchings
Experience is directly proportional to the value of equipment destroyed.
 - Carolyn Scheppner


signature.asc
Description: This is a digitally signed message part
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH net] Revert drivers/net: Disable UFO through virtio in macvtap and tun

2014-11-11 Thread Ben Hutchings
This reverts commit 88e0e0e5aa722b193c8758c8b45d041de5316924 for
the tap drivers, but leaves UFO disabled in virtio_net.

libvirt at least assumes that tap features will never be dropped
in new kernel versions, and doing so prevents migration of VMs to
the never kernel version while they are running with virtio net
devices.

Fixes: 88e0e0e5aa7a (drivers/net: Disable UFO through virtio)
Signed-off-by: Ben Hutchings b...@decadent.org.uk
---
Compile-tested only.

Ben.

 drivers/net/macvtap.c | 13 -
 drivers/net/tun.c | 19 ---
 2 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 6f226de..aeaeb6d 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -66,7 +66,7 @@ static struct cdev macvtap_cdev;
 static const struct proto_ops macvtap_socket_ops;
 
 #define TUN_OFFLOADS (NETIF_F_HW_CSUM | NETIF_F_TSO_ECN | NETIF_F_TSO | \
- NETIF_F_TSO6)
+ NETIF_F_TSO6 | NETIF_F_UFO)
 #define RX_OFFLOADS (NETIF_F_GRO | NETIF_F_LRO)
 #define TAP_FEATURES (NETIF_F_GSO | NETIF_F_SG)
 
@@ -570,8 +570,6 @@ static int macvtap_skb_from_vnet_hdr(struct sk_buff *skb,
gso_type = SKB_GSO_TCPV6;
break;
case VIRTIO_NET_HDR_GSO_UDP:
-   pr_warn_once(macvtap: %s: using disabled UFO feature; 
please fix this program\n,
-current-comm);
gso_type = SKB_GSO_UDP;
if (skb-protocol == htons(ETH_P_IPV6))
ipv6_proxy_select_ident(skb);
@@ -619,6 +617,8 @@ static void macvtap_skb_to_vnet_hdr(const struct sk_buff 
*skb,
vnet_hdr-gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
else if (sinfo-gso_type  SKB_GSO_TCPV6)
vnet_hdr-gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
+   else if (sinfo-gso_type  SKB_GSO_UDP)
+   vnet_hdr-gso_type = VIRTIO_NET_HDR_GSO_UDP;
else
BUG();
if (sinfo-gso_type  SKB_GSO_TCP_ECN)
@@ -953,6 +953,9 @@ static int set_offload(struct macvtap_queue *q, unsigned 
long arg)
if (arg  TUN_F_TSO6)
feature_mask |= NETIF_F_TSO6;
}
+
+   if (arg  TUN_F_UFO)
+   feature_mask |= NETIF_F_UFO;
}
 
/* tun/tap driver inverts the usage for TSO offloads, where
@@ -963,7 +966,7 @@ static int set_offload(struct macvtap_queue *q, unsigned 
long arg)
 * When user space turns off TSO, we turn off GSO/LRO so that
 * user-space will not receive TSO frames.
 */
-   if (feature_mask  (NETIF_F_TSO | NETIF_F_TSO6))
+   if (feature_mask  (NETIF_F_TSO | NETIF_F_TSO6 | NETIF_F_UFO))
features |= RX_OFFLOADS;
else
features = ~RX_OFFLOADS;
@@ -1064,7 +1067,7 @@ static long macvtap_ioctl(struct file *file, unsigned int 
cmd,
case TUNSETOFFLOAD:
/* let the user check for future flags */
if (arg  ~(TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6 |
-   TUN_F_TSO_ECN))
+   TUN_F_TSO_ECN | TUN_F_UFO))
return -EINVAL;
 
rtnl_lock();
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 7302398..a0987d1 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -175,7 +175,7 @@ struct tun_struct {
struct net_device   *dev;
netdev_features_t   set_features;
 #define TUN_USER_FEATURES (NETIF_F_HW_CSUM|NETIF_F_TSO_ECN|NETIF_F_TSO| \
- NETIF_F_TSO6)
+ NETIF_F_TSO6|NETIF_F_UFO)
 
int vnet_hdr_sz;
int sndbuf;
@@ -1152,20 +1152,10 @@ static ssize_t tun_get_user(struct tun_struct *tun, 
struct tun_file *tfile,
skb_shinfo(skb)-gso_type = SKB_GSO_TCPV6;
break;
case VIRTIO_NET_HDR_GSO_UDP:
-   {
-   static bool warned;
-
-   if (!warned) {
-   warned = true;
-   netdev_warn(tun-dev,
-   %s: using disabled UFO feature; 
please fix this program\n,
-   current-comm);
-   }
skb_shinfo(skb)-gso_type = SKB_GSO_UDP;
if (skb-protocol == htons(ETH_P_IPV6))
ipv6_proxy_select_ident(skb);
break;
-   }
default:
tun-dev-stats.rx_frame_errors++;
kfree_skb(skb);
@@ -1265,6 +1255,8 @@ static ssize_t tun_put_user(struct tun_struct *tun

[PATCH v2 net 0/2] drivers/net,ipv6: Fix IPv6 fragment ID selection for virtio

2014-10-30 Thread Ben Hutchings
The virtio net protocol supports UFO but does not provide for passing a
fragment ID for fragmentation of IPv6 packets.  We used to generate a
fragment ID wherever such a packet was fragmented, but currently we
always use ID=0!

v2: Add blank lines after declarations

Ben.

Ben Hutchings (2):
  drivers/net: Disable UFO through virtio
  drivers/net,ipv6: Select IPv6 fragment idents for virtio UFO packets

 drivers/net/macvtap.c| 16 
 drivers/net/tun.c| 25 -
 drivers/net/virtio_net.c | 24 ++--
 include/net/ipv6.h   |  2 ++
 net/ipv6/output_core.c   | 34 ++
 5 files changed, 74 insertions(+), 27 deletions(-)

-- 
Ben Hutchings
The program is absolutely right; therefore, the computer must be wrong.


signature.asc
Description: This is a digitally signed message part
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH v2 net 1/2] drivers/net: Disable UFO through virtio

2014-10-30 Thread Ben Hutchings
IPv6 does not allow fragmentation by routers, so there is no
fragmentation ID in the fixed header.  UFO for IPv6 requires the ID to
be passed separately, but there is no provision for this in the virtio
net protocol.

Until recently our software implementation of UFO/IPv6 generated a new
ID, but this was a bug.  Now we will use ID=0 for any UFO/IPv6 packet
passed through a tap, which is even worse.

Unfortunately there is no distinction between UFO/IPv4 and v6
features, so disable UFO on taps and virtio_net completely until we
have a proper solution.

We cannot depend on VM managers respecting the tap feature flags, so
keep accepting UFO packets but log a warning the first time we do
this.

Signed-off-by: Ben Hutchings b...@decadent.org.uk
Fixes: 916e4cf46d02 (ipv6: reuse ip6_frag_id from ip6_ufo_append_data)
---
 drivers/net/macvtap.c| 13 +
 drivers/net/tun.c| 19 +++
 drivers/net/virtio_net.c | 24 ++--
 3 files changed, 30 insertions(+), 26 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 65e2892..2aeaa61 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -65,7 +65,7 @@ static struct cdev macvtap_cdev;
 static const struct proto_ops macvtap_socket_ops;
 
 #define TUN_OFFLOADS (NETIF_F_HW_CSUM | NETIF_F_TSO_ECN | NETIF_F_TSO | \
- NETIF_F_TSO6 | NETIF_F_UFO)
+ NETIF_F_TSO6)
 #define RX_OFFLOADS (NETIF_F_GRO | NETIF_F_LRO)
 #define TAP_FEATURES (NETIF_F_GSO | NETIF_F_SG)
 
@@ -569,6 +569,8 @@ static int macvtap_skb_from_vnet_hdr(struct sk_buff *skb,
gso_type = SKB_GSO_TCPV6;
break;
case VIRTIO_NET_HDR_GSO_UDP:
+   pr_warn_once(macvtap: %s: using disabled UFO feature; 
please fix this program\n,
+current-comm);
gso_type = SKB_GSO_UDP;
break;
default:
@@ -614,8 +616,6 @@ static void macvtap_skb_to_vnet_hdr(const struct sk_buff 
*skb,
vnet_hdr-gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
else if (sinfo-gso_type  SKB_GSO_TCPV6)
vnet_hdr-gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
-   else if (sinfo-gso_type  SKB_GSO_UDP)
-   vnet_hdr-gso_type = VIRTIO_NET_HDR_GSO_UDP;
else
BUG();
if (sinfo-gso_type  SKB_GSO_TCP_ECN)
@@ -950,9 +950,6 @@ static int set_offload(struct macvtap_queue *q, unsigned 
long arg)
if (arg  TUN_F_TSO6)
feature_mask |= NETIF_F_TSO6;
}
-
-   if (arg  TUN_F_UFO)
-   feature_mask |= NETIF_F_UFO;
}
 
/* tun/tap driver inverts the usage for TSO offloads, where
@@ -963,7 +960,7 @@ static int set_offload(struct macvtap_queue *q, unsigned 
long arg)
 * When user space turns off TSO, we turn off GSO/LRO so that
 * user-space will not receive TSO frames.
 */
-   if (feature_mask  (NETIF_F_TSO | NETIF_F_TSO6 | NETIF_F_UFO))
+   if (feature_mask  (NETIF_F_TSO | NETIF_F_TSO6))
features |= RX_OFFLOADS;
else
features = ~RX_OFFLOADS;
@@ -1064,7 +1061,7 @@ static long macvtap_ioctl(struct file *file, unsigned int 
cmd,
case TUNSETOFFLOAD:
/* let the user check for future flags */
if (arg  ~(TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6 |
-   TUN_F_TSO_ECN | TUN_F_UFO))
+   TUN_F_TSO_ECN))
return -EINVAL;
 
rtnl_lock();
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 186ce54..280d3d2 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -174,7 +174,7 @@ struct tun_struct {
struct net_device   *dev;
netdev_features_t   set_features;
 #define TUN_USER_FEATURES (NETIF_F_HW_CSUM|NETIF_F_TSO_ECN|NETIF_F_TSO| \
- NETIF_F_TSO6|NETIF_F_UFO)
+ NETIF_F_TSO6)
 
int vnet_hdr_sz;
int sndbuf;
@@ -1149,8 +1149,18 @@ static ssize_t tun_get_user(struct tun_struct *tun, 
struct tun_file *tfile,
skb_shinfo(skb)-gso_type = SKB_GSO_TCPV6;
break;
case VIRTIO_NET_HDR_GSO_UDP:
+   {
+   static bool warned;
+
+   if (!warned) {
+   warned = true;
+   netdev_warn(tun-dev,
+   %s: using disabled UFO feature; 
please fix this program\n,
+   current-comm);
+   }
skb_shinfo(skb)-gso_type = SKB_GSO_UDP;
break

[PATCH v2 net 2/2] drivers/net,ipv6: Select IPv6 fragment idents for virtio UFO packets

2014-10-30 Thread Ben Hutchings
UFO is now disabled on all drivers that work with virtio net headers,
but userland may try to send UFO/IPv6 packets anyway.  Instead of
sending with ID=0, we should select identifiers on their behalf (as we
used to).

Signed-off-by: Ben Hutchings b...@decadent.org.uk
Fixes: 916e4cf46d02 (ipv6: reuse ip6_frag_id from ip6_ufo_append_data)
---
 drivers/net/macvtap.c  |  3 +++
 drivers/net/tun.c  |  6 +-
 include/net/ipv6.h |  2 ++
 net/ipv6/output_core.c | 34 ++
 4 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 2aeaa61..6f226de 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -16,6 +16,7 @@
 #include linux/idr.h
 #include linux/fs.h
 
+#include net/ipv6.h
 #include net/net_namespace.h
 #include net/rtnetlink.h
 #include net/sock.h
@@ -572,6 +573,8 @@ static int macvtap_skb_from_vnet_hdr(struct sk_buff *skb,
pr_warn_once(macvtap: %s: using disabled UFO feature; 
please fix this program\n,
 current-comm);
gso_type = SKB_GSO_UDP;
+   if (skb-protocol == htons(ETH_P_IPV6))
+   ipv6_proxy_select_ident(skb);
break;
default:
return -EINVAL;
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 280d3d2..7302398 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -65,6 +65,7 @@
 #include linux/nsproxy.h
 #include linux/virtio_net.h
 #include linux/rcupdate.h
+#include net/ipv6.h
 #include net/net_namespace.h
 #include net/netns/generic.h
 #include net/rtnetlink.h
@@ -1139,6 +1140,8 @@ static ssize_t tun_get_user(struct tun_struct *tun, 
struct tun_file *tfile,
break;
}
 
+   skb_reset_network_header(skb);
+
if (gso.gso_type != VIRTIO_NET_HDR_GSO_NONE) {
pr_debug(GSO!\n);
switch (gso.gso_type  ~VIRTIO_NET_HDR_GSO_ECN) {
@@ -1159,6 +1162,8 @@ static ssize_t tun_get_user(struct tun_struct *tun, 
struct tun_file *tfile,
current-comm);
}
skb_shinfo(skb)-gso_type = SKB_GSO_UDP;
+   if (skb-protocol == htons(ETH_P_IPV6))
+   ipv6_proxy_select_ident(skb);
break;
}
default:
@@ -1189,7 +1194,6 @@ static ssize_t tun_get_user(struct tun_struct *tun, 
struct tun_file *tfile,
skb_shinfo(skb)-tx_flags |= SKBTX_SHARED_FRAG;
}
 
-   skb_reset_network_header(skb);
skb_probe_transport_header(skb, 0);
 
rxhash = skb_get_hash(skb);
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 97f4720..4292929 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -671,6 +671,8 @@ static inline int ipv6_addr_diff(const struct in6_addr *a1, 
const struct in6_add
return __ipv6_addr_diff(a1, a2, sizeof(struct in6_addr));
 }
 
+void ipv6_proxy_select_ident(struct sk_buff *skb);
+
 int ip6_dst_hoplimit(struct dst_entry *dst);
 
 static inline int ip6_sk_dst_hoplimit(struct ipv6_pinfo *np, struct flowi6 
*fl6,
diff --git a/net/ipv6/output_core.c b/net/ipv6/output_core.c
index fc24c39..97f41a3 100644
--- a/net/ipv6/output_core.c
+++ b/net/ipv6/output_core.c
@@ -3,11 +3,45 @@
  * not configured or static.  These functions are needed by GSO/GRO 
implementation.
  */
 #include linux/export.h
+#include net/ip.h
 #include net/ipv6.h
 #include net/ip6_fib.h
 #include net/addrconf.h
 #include net/secure_seq.h
 
+/* This function exists only for tap drivers that must support broken
+ * clients requesting UFO without specifying an IPv6 fragment ID.
+ *
+ * This is similar to ipv6_select_ident() but we use an independent hash
+ * seed to limit information leakage.
+ *
+ * The network header must be set before calling this.
+ */
+void ipv6_proxy_select_ident(struct sk_buff *skb)
+{
+   static u32 ip6_proxy_idents_hashrnd __read_mostly;
+   struct in6_addr buf[2];
+   struct in6_addr *addrs;
+   u32 hash, id;
+
+   addrs = skb_header_pointer(skb,
+  skb_network_offset(skb) +
+  offsetof(struct ipv6hdr, saddr),
+  sizeof(buf), buf);
+   if (!addrs)
+   return;
+
+   net_get_random_once(ip6_proxy_idents_hashrnd,
+   sizeof(ip6_proxy_idents_hashrnd));
+
+   hash = __ipv6_addr_jhash(addrs[1], ip6_proxy_idents_hashrnd);
+   hash = __ipv6_addr_jhash(addrs[0], hash);
+
+   id = ip_idents_reserve(hash, 1);
+   skb_shinfo(skb)-ip6_frag_id = htonl(id);
+}
+EXPORT_SYMBOL_GPL(ipv6_proxy_select_ident);
+
 int ip6_find_1stfragopt(struct sk_buff *skb, u8 **nexthdr)
 {
u16 offset = sizeof(struct ipv6hdr);

-- 
Ben Hutchings
The program is absolutely right

Re: [PATCH v2 net 1/2] drivers/net: Disable UFO through virtio

2014-10-30 Thread Ben Hutchings
On Thu, 2014-10-30 at 11:47 -0700, Eric Dumazet wrote:
 On Thu, 2014-10-30 at 18:27 +, Ben Hutchings wrote:
 
 
  +   {
  +   static bool warned;
  +
  +   if (!warned) {
  +   warned = true;
  +   netdev_warn(tun-dev,
  +   %s: using disabled UFO feature; 
  please fix this program\n,
  +   current-comm);
  +   }
 
 
 It might be time to add netdev_warn_once() ;)

Could do.  I'm trying to make small fixes that are suitable for stable.

 Alternatively, you could use 
   pr_warn_once(%s: using disabled UFO feature; please fix this 
 program\n,
tun-dev-name, current-comm);

That's missing a %s: , but yes that would also work.

Ben.

-- 
Ben Hutchings
The program is absolutely right; therefore, the computer must be wrong.


signature.asc
Description: This is a digitally signed message part
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: IPv6 UFO for VMs

2014-10-26 Thread Ben Hutchings
On Wed, 2014-10-22 at 11:35 +0200, Hannes Frederic Sowa wrote:
 On Mi, 2014-10-22 at 00:44 +0100, Ben Hutchings wrote:
  There are several ways that VMs can take advantage of UFO and get the
  host to do fragmentation for them:
  
  drivers/net/macvtap.c:  gso_type = SKB_GSO_UDP;
  drivers/net/tun.c:  skb_shinfo(skb)-gso_type = 
  SKB_GSO_UDP;
  drivers/net/virtio_net.c:   skb_shinfo(skb)-gso_type = 
  SKB_GSO_UDP;
  
  Our implementation of UFO for IPv6 does:
  
  fptr = (struct frag_hdr *)(skb_network_header(skb) + 
  unfrag_ip6hlen);
  fptr-nexthdr = nexthdr;
  fptr-reserved = 0;
  fptr-identification = skb_shinfo(skb)-ip6_frag_id;
  
  which assumes ip6_frag_id has been set.  That's only true if the local
  stack constructed the skb; otherwise it appears we get zero.
  
  This seems to be a regression as a result of:
  
  commit 916e4cf46d0204806c062c8c6c4d1f633852c5b6
  Author: Hannes Frederic Sowa han...@stressinduktion.org
  Date:   Fri Feb 21 02:55:35 2014 +0100
  
  ipv6: reuse ip6_frag_id from ip6_ufo_append_data
  
  However, that change seems reasonable - we *shouldn't* be choosing IDs
  for any other stack.  Any paravirt net driver that can use IPv6 UFO
  needs to have some way of passing a fragmentation ID to put in
  skb_shared_info::ip6_frag_id.
 
 Do we really gain a lot of performance by enabling UFO on those devices
 or would it make sense to just drop support? It only helps fragmenting
 large UDP packets, so I don't think it is worth it.

It's not been important enough for anyone to bother implementing it in
hardware/firmware aside from Neterion.

I'll shortly post patches to disable it.

Ben.

 Otherwise I agree with Ben, we need to pass a fragmentation id from the
 host over to the system segmenting the gso frame. Fragmentation ids must
 be generated by the end system.
 
 Hmm...

-- 
Ben Hutchings
Theory and practice are closer in theory than in practice.
- John Levine, moderator of comp.compilers


signature.asc
Description: This is a digitally signed message part
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH net 0/2] drivers/net,ipv6: Fix IPv6 fragment ID selection for virtio

2014-10-26 Thread Ben Hutchings
The virtio net protocol supports UFO but does not provide for passing a
fragment ID for fragmentation of IPv6 packets.  We used to generate a
fragment ID wherever such a packet was fragmented, but currently we
always use ID=0!

Ben.

Ben Hutchings (2):
  drivers/net: Disable UFO through virtio
  drivers/net,ipv6: Select IPv6 fragment idents for virtio UFO packets

 drivers/net/macvtap.c| 16 
 drivers/net/tun.c| 24 +++-
 drivers/net/virtio_net.c | 23 +--
 include/net/ipv6.h   |  2 ++
 net/ipv6/output_core.c   | 34 ++
 5 files changed, 72 insertions(+), 27 deletions(-)


-- 
Ben Hutchings
Theory and practice are closer in theory than in practice.
- John Levine, moderator of comp.compilers


signature.asc
Description: This is a digitally signed message part
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH net 2/2] drivers/net,ipv6: Select IPv6 fragment idents for virtio UFO packets

2014-10-26 Thread Ben Hutchings
UFO is now disabled on all drivers that work with virtio net headers,
but userland may try to send UFO/IPv6 packets anyway.  Instead of
sending with ID=0, we should select identifiers on their behalf (as we
used to).

Signed-off-by: Ben Hutchings b...@decadent.org.uk
Fixes: 916e4cf46d02 (ipv6: reuse ip6_frag_id from ip6_ufo_append_data)
---
 drivers/net/macvtap.c  |  3 +++
 drivers/net/tun.c  |  6 +-
 include/net/ipv6.h |  2 ++
 net/ipv6/output_core.c | 34 ++
 4 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 2aeaa61..6f226de 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -16,6 +16,7 @@
 #include linux/idr.h
 #include linux/fs.h
 
+#include net/ipv6.h
 #include net/net_namespace.h
 #include net/rtnetlink.h
 #include net/sock.h
@@ -572,6 +573,8 @@ static int macvtap_skb_from_vnet_hdr(struct sk_buff *skb,
pr_warn_once(macvtap: %s: using disabled UFO feature; 
please fix this program\n,
 current-comm);
gso_type = SKB_GSO_UDP;
+   if (skb-protocol == htons(ETH_P_IPV6))
+   ipv6_proxy_select_ident(skb);
break;
default:
return -EINVAL;
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index e8b949a..f8356b0 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -65,6 +65,7 @@
 #include linux/nsproxy.h
 #include linux/virtio_net.h
 #include linux/rcupdate.h
+#include net/ipv6.h
 #include net/net_namespace.h
 #include net/netns/generic.h
 #include net/rtnetlink.h
@@ -1139,6 +1140,8 @@ static ssize_t tun_get_user(struct tun_struct *tun, 
struct tun_file *tfile,
break;
}
 
+   skb_reset_network_header(skb);
+
if (gso.gso_type != VIRTIO_NET_HDR_GSO_NONE) {
pr_debug(GSO!\n);
switch (gso.gso_type  ~VIRTIO_NET_HDR_GSO_ECN) {
@@ -1158,6 +1161,8 @@ static ssize_t tun_get_user(struct tun_struct *tun, 
struct tun_file *tfile,
current-comm);
}
skb_shinfo(skb)-gso_type = SKB_GSO_UDP;
+   if (skb-protocol == htons(ETH_P_IPV6))
+   ipv6_proxy_select_ident(skb);
break;
}
default:
@@ -1188,7 +1193,6 @@ static ssize_t tun_get_user(struct tun_struct *tun, 
struct tun_file *tfile,
skb_shinfo(skb)-tx_flags |= SKBTX_SHARED_FRAG;
}
 
-   skb_reset_network_header(skb);
skb_probe_transport_header(skb, 0);
 
rxhash = skb_get_hash(skb);
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 97f4720..4292929 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -671,6 +671,8 @@ static inline int ipv6_addr_diff(const struct in6_addr *a1, 
const struct in6_add
return __ipv6_addr_diff(a1, a2, sizeof(struct in6_addr));
 }
 
+void ipv6_proxy_select_ident(struct sk_buff *skb);
+
 int ip6_dst_hoplimit(struct dst_entry *dst);
 
 static inline int ip6_sk_dst_hoplimit(struct ipv6_pinfo *np, struct flowi6 
*fl6,
diff --git a/net/ipv6/output_core.c b/net/ipv6/output_core.c
index fc24c39..97f41a3 100644
--- a/net/ipv6/output_core.c
+++ b/net/ipv6/output_core.c
@@ -3,11 +3,45 @@
  * not configured or static.  These functions are needed by GSO/GRO 
implementation.
  */
 #include linux/export.h
+#include net/ip.h
 #include net/ipv6.h
 #include net/ip6_fib.h
 #include net/addrconf.h
 #include net/secure_seq.h
 
+/* This function exists only for tap drivers that must support broken
+ * clients requesting UFO without specifying an IPv6 fragment ID.
+ *
+ * This is similar to ipv6_select_ident() but we use an independent hash
+ * seed to limit information leakage.
+ *
+ * The network header must be set before calling this.
+ */
+void ipv6_proxy_select_ident(struct sk_buff *skb)
+{
+   static u32 ip6_proxy_idents_hashrnd __read_mostly;
+   struct in6_addr buf[2];
+   struct in6_addr *addrs;
+   u32 hash, id;
+
+   addrs = skb_header_pointer(skb,
+  skb_network_offset(skb) +
+  offsetof(struct ipv6hdr, saddr),
+  sizeof(buf), buf);
+   if (!addrs)
+   return;
+
+   net_get_random_once(ip6_proxy_idents_hashrnd,
+   sizeof(ip6_proxy_idents_hashrnd));
+
+   hash = __ipv6_addr_jhash(addrs[1], ip6_proxy_idents_hashrnd);
+   hash = __ipv6_addr_jhash(addrs[0], hash);
+
+   id = ip_idents_reserve(hash, 1);
+   skb_shinfo(skb)-ip6_frag_id = htonl(id);
+}
+EXPORT_SYMBOL_GPL(ipv6_proxy_select_ident);
+
 int ip6_find_1stfragopt(struct sk_buff *skb, u8 **nexthdr)
 {
u16 offset = sizeof(struct ipv6hdr);

-- 
Ben Hutchings
Theory and practice are closer

IPv6 UFO for VMs

2014-10-21 Thread Ben Hutchings
There are several ways that VMs can take advantage of UFO and get the
host to do fragmentation for them:

drivers/net/macvtap.c:  gso_type = SKB_GSO_UDP;
drivers/net/tun.c:  skb_shinfo(skb)-gso_type = SKB_GSO_UDP;
drivers/net/virtio_net.c:   skb_shinfo(skb)-gso_type = 
SKB_GSO_UDP;

Our implementation of UFO for IPv6 does:

fptr = (struct frag_hdr *)(skb_network_header(skb) + 
unfrag_ip6hlen);
fptr-nexthdr = nexthdr;
fptr-reserved = 0;
fptr-identification = skb_shinfo(skb)-ip6_frag_id;

which assumes ip6_frag_id has been set.  That's only true if the local
stack constructed the skb; otherwise it appears we get zero.

This seems to be a regression as a result of:

commit 916e4cf46d0204806c062c8c6c4d1f633852c5b6
Author: Hannes Frederic Sowa han...@stressinduktion.org
Date:   Fri Feb 21 02:55:35 2014 +0100

ipv6: reuse ip6_frag_id from ip6_ufo_append_data

However, that change seems reasonable - we *shouldn't* be choosing IDs
for any other stack.  Any paravirt net driver that can use IPv6 UFO
needs to have some way of passing a fragmentation ID to put in
skb_shared_info::ip6_frag_id.

Ben.

-- 
Ben Hutchings
For every action, there is an equal and opposite criticism. - Harrison


signature.asc
Description: This is a digitally signed message part
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH RFC] virtio_pci: fix virtio spec compliance on restore

2014-09-23 Thread Ben Hutchings
On Tue, 2014-09-23 at 13:32 +0300, Michael S. Tsirkin wrote:
 On restore, virtio pci does the following:
 + set features
 + init vqs etc - device can be used at this point!
 + set ACKNOWLEDGE,DRIVER and DRIVER_OK status bits
 
 This is in violation of the virtio spec, which
 requires the following order:
 - ACKNOWLEDGE
 - DRIVER
 - init vqs
 - DRIVER_OK

 Cc: sta...@vger.kernel.org
 Cc: Amit Shah amit.s...@redhat.com
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
[...]

What concrete problem does this fix, such that it should be applied to
stable branches?

Ben.

-- 
Ben Hutchings
Everything should be made as simple as possible, but not simpler.
   - Albert Einstein


signature.asc
Description: This is a digitally signed message part
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH net-next v3 4/5] net-sysfs: add support for device-specific rx queue sysfs attributes

2014-01-16 Thread Ben Hutchings
On Thu, 2014-01-16 at 01:38 -0800, Michael Dalton wrote:
[...]
 --- a/include/linux/netdevice.h
 +++ b/include/linux/netdevice.h
[...]
 @@ -2401,6 +2416,23 @@ static inline int netif_copy_real_num_queues(struct 
 net_device *to_dev,
  #endif
  }
  
 +#ifdef CONFIG_SYSFS
 +static inline unsigned int get_netdev_rx_queue_index(
 + struct netdev_rx_queue *queue)
 +{
 + struct net_device *dev = queue-dev;
 + int i;
 +
 + for (i = 0; i  dev-num_rx_queues; i++)
 + if (queue == dev-_rx[i])
 + break;

Why write a loop when you can do:
i = queue - dev-_rx;

Ben.

 + BUG_ON(i = dev-num_rx_queues);
 +
 + return i;
 +}
 +#endif

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next v3 4/5] net-sysfs: add support for device-specific rx queue sysfs attributes

2014-01-16 Thread Ben Hutchings
On Thu, 2014-01-16 at 11:07 -0800, Michael Dalton wrote:
 On Jan 16, 2014 at 10:57 AM, Ben Hutchings bhutchi...@solarflare.com wrote:
  Why write a loop when you can do:
  i = queue - dev-_rx;
 Good point, the loop approach was done in get_netdev_queue_index --
 I agree your fix is faster and simpler. I'll fix in next patchset.

It's simpler but we don't know if it's faster (and I don't believe that
matters for the current usage).

If one of these functions starts to be used in the data path, at that
point it could be worth optimising, e.g. by doing a test for queue 0 and
only then doing the pointer arithmetic with its implicit division.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next v2 4/4] virtio-net: initial debugfs support, export mergeable rx buffer size

2014-01-13 Thread Ben Hutchings
On Mon, 2014-01-13 at 11:40 +0200, Michael S. Tsirkin wrote:
 On Sun, Jan 12, 2014 at 03:32:28PM -0800, Michael Dalton wrote:
[...]
  The last issue is how the rx_queue_attribute 'show' function
  implementation for mergeable_rx_buffer_size will access the appropriate
  per-receive queue EWMA data. The arguments to the show function will be
  the netdev_rx_queue and the attribute itself. We can get to the
  struct net_device from the netdev_rx_queue.  If we extended
  netdev_rx_queue to indicate the queue_index or to store a void *priv_data
  pointer, that would be sufficient to allow us to resolve this issue.
 
 Hmm netdev_rx_queue is not defined unless CONFIG_RPS is set.
 Maybe we should use a different structure.
[...]

I don't think RPS should own this structure.  It's just that there are
currently no per-RX-queue attributes other than those defined by RPS.

By the way, CONFIG_RPS is equivalent to CONFIG_SMP so will likely be
enabled already in most places where virtio_net is used.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH V2 5/6] vhost_net: poll vhost queue after marking DMA is done

2013-08-30 Thread Ben Hutchings
On Fri, 2013-08-30 at 12:29 +0800, Jason Wang wrote:
 We used to poll vhost queue before making DMA is done, this is racy if vhost
 thread were waked up before marking DMA is done which can result the signal to
 be missed. Fix this by always poll the vhost thread before DMA is done.

Does this bug only exist in net-next or is it older?  Should the fix go
to net and stable branches?

Ben.

 Signed-off-by: Jason Wang jasow...@redhat.com
 ---
  drivers/vhost/net.c |9 +
  1 files changed, 5 insertions(+), 4 deletions(-)
 
 diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
 index ff60c2a..d09c17c 100644
 --- a/drivers/vhost/net.c
 +++ b/drivers/vhost/net.c
 @@ -308,6 +308,11 @@ static void vhost_zerocopy_callback(struct ubuf_info 
 *ubuf, bool success)
   struct vhost_virtqueue *vq = ubufs-vq;
   int cnt = atomic_read(ubufs-kref.refcount);
  
 + /* set len to mark this desc buffers done DMA */
 + vq-heads[ubuf-desc].len = success ?
 + VHOST_DMA_DONE_LEN : VHOST_DMA_FAILED_LEN;
 + vhost_net_ubuf_put(ubufs);
 +
   /*
* Trigger polling thread if guest stopped submitting new buffers:
* in this case, the refcount after decrement will eventually reach 1
 @@ -318,10 +323,6 @@ static void vhost_zerocopy_callback(struct ubuf_info 
 *ubuf, bool success)
*/
   if (cnt = 2 || !(cnt % 16))
   vhost_poll_queue(vq-poll);
 - /* set len to mark this desc buffers done DMA */
 - vq-heads[ubuf-desc].len = success ?
 - VHOST_DMA_DONE_LEN : VHOST_DMA_FAILED_LEN;
 - vhost_net_ubuf_put(ubufs);
  }
  
  /* Expects to be always run from workqueue - which acts as

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: Merge of virtio_net: fix race in RX VQ processing for linux-3.2.48

2013-07-27 Thread Ben Hutchings
On Fri, 2013-07-12 at 23:13 +0200, Wolfram Gloger wrote:
 Hi,
 
 Today I merged M. Tsirkin's patch v2 virtio_net: fix race in RX VQ
 processing:
 
 http://lkml.indiana.edu/hypermail/linux/kernel/1307.1/00503.html
 
 into 3.2.48 and lightly tested the result (vhost-net, virtio-disk
 and 9pfs using virtio).

This sounds like it could be suitable for stable, but that doesn't seem
to have been requested by the author.  I'm cc'ing those involved so they
can make a decision whether this should be included in 3.2.y or other
stable branches.

 Merge is not completely trivial, so might save you some time, if only
 for a comparison that you arrive at the same result.

Thanks, but these are not in quite the right patch format.  The
filenames should always have a leading directory name which will be
stripped (patch -p1).  The original patch header must be preserved and a
reference to the upstream commit inserted e.g.
'commit 0123456789abcdef0123456789abcdef012345678 upstream.'

(Attachments copied for reference.)

Ben.

-- 
Ben Hutchings
Once a job is fouled up, anything done to improve it makes it worse.

--- drivers/virtio/virtio_ring.c.orig	2013-06-30 11:35:44.0 +0200
+++ drivers/virtio/virtio_ring.c	2013-07-12 15:18:48.0 +0200
@@ -360,9 +360,22 @@ void virtqueue_disable_cb(struct virtque
 }
 EXPORT_SYMBOL_GPL(virtqueue_disable_cb);
 
-bool virtqueue_enable_cb(struct virtqueue *_vq)
+/**
+ * virtqueue_enable_cb_prepare - restart callbacks after disable_cb
+ * @vq: the struct virtqueue we're talking about.
+ *
+ * This re-enables callbacks; it returns current queue state
+ * in an opaque unsigned value. This value should be later tested by
+ * virtqueue_poll, to detect a possible race between the driver checking for
+ * more work, and enabling callbacks.
+ *
+ * Caller must ensure we don't call this with other virtqueue
+ * operations at the same time (except where noted).
+ */
+unsigned virtqueue_enable_cb_prepare(struct virtqueue *_vq)
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
+	u16 last_used_idx;
 
 	START_USE(vq);
 
@@ -372,15 +385,45 @@ bool virtqueue_enable_cb(struct virtqueu
 	 * either clear the flags bit or point the event index at the next
 	 * entry. Always do both to keep code simple. */
 	vq-vring.avail-flags = ~VRING_AVAIL_F_NO_INTERRUPT;
-	vring_used_event(vq-vring) = vq-last_used_idx;
+	vring_used_event(vq-vring) = last_used_idx = vq-last_used_idx;
+	END_USE(vq);
+	return last_used_idx;
+}
+EXPORT_SYMBOL_GPL(virtqueue_enable_cb_prepare);
+
+/**
+ * virtqueue_poll - query pending used buffers
+ * @vq: the struct virtqueue we're talking about.
+ * @last_used_idx: virtqueue state (from call to virtqueue_enable_cb_prepare).
+ *
+ * Returns true if there are pending used buffers in the queue.
+ *
+ * This does not need to be serialized.
+ */
+bool virtqueue_poll(struct virtqueue *_vq, unsigned last_used_idx)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+
 	virtio_mb();
-	if (unlikely(more_used(vq))) {
-		END_USE(vq);
-		return false;
-	}
+	return (u16)last_used_idx != vq-vring.used-idx;
+}
+EXPORT_SYMBOL_GPL(virtqueue_poll);
 
-	END_USE(vq);
-	return true;
+/**
+ * virtqueue_enable_cb - restart callbacks after disable_cb.
+ * @vq: the struct virtqueue we're talking about.
+ *
+ * This re-enables callbacks; it returns false if there are pending
+ * buffers in the queue, to detect a possible race between the driver
+ * checking for more work, and enabling callbacks.
+ *
+ * Caller must ensure we don't call this with other virtqueue
+ * operations at the same time (except where noted).
+ */
+bool virtqueue_enable_cb(struct virtqueue *_vq)
+{
+	unsigned last_used_idx = virtqueue_enable_cb_prepare(_vq);
+	return !virtqueue_poll(_vq, last_used_idx);
 }
 EXPORT_SYMBOL_GPL(virtqueue_enable_cb);
 
--- include/linux/virtio.h.orig	2012-01-05 00:55:44.0 +0100
+++ include/linux/virtio.h	2013-07-12 14:42:26.0 +0200
@@ -96,6 +96,10 @@ void virtqueue_disable_cb(struct virtque
 
 bool virtqueue_enable_cb(struct virtqueue *vq);
 
+unsigned virtqueue_enable_cb_prepare(struct virtqueue *vq);
+
+bool virtqueue_poll(struct virtqueue *vq, unsigned);
+
 bool virtqueue_enable_cb_delayed(struct virtqueue *vq);
 
 void *virtqueue_detach_unused_buf(struct virtqueue *vq);
--- drivers/net/virtio_net.c.orig	2012-01-05 00:55:44.0 +0100
+++ drivers/net/virtio_net.c	2013-07-12 15:39:06.0 +0200
@@ -508,7 +508,7 @@ static int virtnet_poll(struct napi_stru
 {
 	struct virtnet_info *vi = container_of(napi, struct virtnet_info, napi);
 	void *buf;
-	unsigned int len, received = 0;
+	unsigned int r, len, received = 0;
 
 again:
 	while (received  budget 
@@ -525,8 +525,9 @@ again:
 
 	/* Out of packets? */
 	if (received  budget) {
+		r = virtqueue_enable_cb_prepare(vi-rvq);
 		napi_complete(napi);
-		if (unlikely(!virtqueue_enable_cb(vi-rvq)) 
+		if (unlikely(virtqueue_poll(vi-rvq, r)) 
 		napi_schedule_prep(napi)) {
 			virtqueue_disable_cb(vi-rvq);
 			__napi_schedule(napi

Re: [PATCH 0/2] fix kernel crash with macvtap on top of LRO

2013-06-12 Thread Ben Hutchings
On Mon, 2013-06-10 at 10:07 +0300, Michael S. Tsirkin wrote:
 On Thu, Feb 07, 2013 at 01:14:20PM -0500, David Miller wrote:
  From: Ben Hutchings bhutchi...@solarflare.com
  Date: Thu, 7 Feb 2013 16:20:46 +
  
   If the consensus is still that we must preserve packets exactly (aside
   from the usual modifications by IP routers) then LRO should be disabled
   on all devices for which forwarding is enabled.
  
  I believe this is still undoubtedly the consensus.
 
 With virtio we are getting packets from a linux host,
 so we could thinkably preserve packets exactly
 even with LRO. I am guessing other hardware could be
 doing this as well.
 
 I am not sure what information would need to be preserved -
 could someone help clarify please?

Some LRO implementations may not preserve:

- Packet boundaries
  - TSO/GSO produces packets all the same size, except possibly for the
last one.  GRO therefore flushes a flow after merging a packet with
a different segment size.
- IPv4 TTL, IPv6 hop-limit, TCP timestamp
  - TSO/GSO will put the same values in all packets.  GRO flushes a flow
if they change.
- IPv4 fragment ID
  - TSO/GSO produces consecutive fragment IDs.  GRO flushes a flow
if it sees a non-consecutive fragment ID.
- MAC header, IPv4 TOS, IPv6 traffic class
  - Should be the same for all packets in a flow.  GRO actually checks
and flushes a flow if they change.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] virtio-net: Reporting traffic queue distribution statistics through ethtool

2013-05-16 Thread Ben Hutchings
On Thu, 2013-05-16 at 13:24 -0700, Sriram Narasimhan wrote:
 This patch allows virtio-net driver to report traffic distribution
 to inbound/outbound queues through ethtool -S.  The per_cpu
 virtnet_stats is split into receive and transmit stats and are
 maintained on a per receive_queue and send_queue basis.
 virtnet_stats() is modified to aggregate interface level statistics
 from per-queue statistics.  Sample output below:
 
 NIC statistics:
  rxq0: rx_packets: 4357802
  rxq0: rx_bytes: 292642052
  txq0: tx_packets: 824540
  txq0: tx_bytes: 55256404
  rxq1: rx_packets: 0
  rxq1: rx_bytes: 0
  txq1: tx_packets: 1094268
  txq1: tx_bytes: 73328316
  rxq2: rx_packets: 0
  rxq2: rx_bytes: 0
  txq2: tx_packets: 1091466
  txq2: tx_bytes: 73140566
  rxq3: rx_packets: 0
  rxq3: rx_bytes: 0
  txq3: tx_packets: 1093043
  txq3: tx_bytes: 73246142
 
 Signed-off-by: Sriram Narasimhan sriram.narasim...@hp.com
[...]

That ordering is totally unreadable.  I want to see patches for ethtool
to let the user order and aggregate the queue statistics in a sensible
way:

$ ethtool -S eth0   # default would show aggregated statistics
NIC statistics:
rx_packets: 4357802
rx_bytes: 292642052
tx_packets: 4103317
tx_bytes: 274971428

$ ethtool -S eth0 full group queue  # full statistics, grouped by queue name
NIC statistics:
rxq0:
rx_packets: 4357802
rx_bytes: 292642052
rxq1:
rx_packets: 0
rx_bytes: 0
[...]
txq0:
tx_packets: 824540
tx_bytes: 55256404
txq1:
tx_packets: 1094268
tx_bytes: 73328316
[...]

$ ethtool -S eth0 full group statistic  # full statistics, grouped by statistic 
name
NIC statistics:
rx_packets:
rxq0: 4357802
rxq1: 0
rxq2: 0
rxq3: 0
rx_bytes:
rxq0: 292642052
rxq1: 0
rxq2: 0
rxq3: 0
[...]

Maybe even:

$ ethtool -S eth0 full table
NIC statistics:
rx_packets   rx_bytes
rxq0:  4357802  292642052
rxq1:0  0
rxq2:0  0
rxq3:0  0
tx_packets   tx_bytes
txq0:   824540   55256404
txq1:  1094268   73328316
txq2:  1091466   73140566
txq3:  1093043   73246142

(Difficult to do, but probably very useful!)

The queue names should be rx-index and tx-index like in sysfs.

We'll need to reach a consensus on the naming scheme for per-queue and
otherwise disaggregated statistics (see previous discussions in
http://thread.gmane.org/gmane.linux.kernel.virtualization/15572/focus=15608). 
 I don't much care what they look like as long as there's an implementation for 
the ethtool side and they don't look too awful in older versions of ethtool.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net] vhost/net: fix heads usage of ubuf_info

2013-03-21 Thread Ben Hutchings
On Thu, 2013-03-21 at 08:02 +0200, Michael S. Tsirkin wrote:
 On Sun, Mar 17, 2013 at 02:29:55PM -0400, David Miller wrote:
  From: Michael S. Tsirkin m...@redhat.com
  Date: Sun, 17 Mar 2013 14:46:09 +0200
  
   ubuf info allocator uses guest controlled head as an index,
   so a malicious guest could put the same head entry in the ring twice,
   and we will get two callbacks on the same value.
   To fix use upend_idx which is guaranteed to be unique.
   
   Reported-by: Rusty Russell ru...@rustcorp.com.au
   Signed-off-by: Michael S. Tsirkin m...@redhat.com
  
  Applied and queued up for -stable, thanks.
  
  And thankfully you got the stable URL wrong,
 
 Yes I wrote sta...@kernel.org that's what an old copy
 says here:
 https://www.kernel.org/doc/Documentation/stable_kernel_rules.txt
 
 I should have known better than look at it on the 'net.  The top
 'Everything you ever wanted to know about Linux 2.6 -stable releases.'
 is a big hint that it's stale.
 Any idea who maintains this? Better update it or remove it or redirect to git.

Rob Landley maintains it, but he's been having trouble updating it since
all the upload mechanisms were changed on kernel.org.

(My stable maintenance scripts still match the old address, anyway.  Not
sure about Greg's.)

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net] vhost/net: fix heads usage of ubuf_info

2013-03-21 Thread Ben Hutchings
On Thu, 2013-03-21 at 18:28 +0200, Michael S. Tsirkin wrote:
 On Thu, Mar 21, 2013 at 04:23:48PM +, Ben Hutchings wrote:
  On Thu, 2013-03-21 at 08:02 +0200, Michael S. Tsirkin wrote:
   On Sun, Mar 17, 2013 at 02:29:55PM -0400, David Miller wrote:
From: Michael S. Tsirkin m...@redhat.com
Date: Sun, 17 Mar 2013 14:46:09 +0200

 ubuf info allocator uses guest controlled head as an index,
 so a malicious guest could put the same head entry in the ring twice,
 and we will get two callbacks on the same value.
 To fix use upend_idx which is guaranteed to be unique.
 
 Reported-by: Rusty Russell ru...@rustcorp.com.au
 Signed-off-by: Michael S. Tsirkin m...@redhat.com

Applied and queued up for -stable, thanks.

And thankfully you got the stable URL wrong,
   
   Yes I wrote sta...@kernel.org that's what an old copy
   says here:
   https://www.kernel.org/doc/Documentation/stable_kernel_rules.txt
   
   I should have known better than look at it on the 'net.  The top
   'Everything you ever wanted to know about Linux 2.6 -stable releases.'
   is a big hint that it's stale.
   Any idea who maintains this? Better update it or remove it or redirect to 
   git.
  
  Rob Landley maintains it, but he's been having trouble updating it since
  all the upload mechanisms were changed on kernel.org.
  
  (My stable maintenance scripts still match the old address, anyway.  Not
  sure about Greg's.)
  
  Ben.
 
 I hope you mean it will match both the old and the new address?

Yes, of course!

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH][v3.2.y] xen-netfront: delay gARP until backend switches to Connected

2013-03-17 Thread Ben Hutchings
On Fri, 2013-03-15 at 13:56 -0400, Joseph Salisbury wrote:
 Hello,
 
 Please consider including upstream commit 
 08e34eb14fe4cfd934b5c169a7682a969457c4ea in the next v3.2.y release.
 It was included upstream as of v3.3-rc1.  It has been tested and 
 confirmed to resolve http://bugs.launchpad.net/bugs/1154608 .
 
 commit 08e34eb14fe4cfd934b5c169a7682a969457c4ea
 Author: Laszlo Ersek ler...@redhat.com
 Date:   Sun Dec 11 01:48:59 2011 +
 
  xen-netfront: delay gARP until backend switches to Connected

Added to the queue, thanks.

Ben.

-- 
Ben Hutchings
Never attribute to conspiracy what can adequately be explained by stupidity.


signature.asc
Description: This is a digitally signed message part
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [Xen-devel] [PATCH] x86/xen: don't assume %ds is usable in xen_iret for 32-bit PVOPS.

2013-02-18 Thread Ben Hutchings
On Thu, 2013-02-14 at 17:55 -0500, Konrad Rzeszutek Wilk wrote:
 On Thu, Feb 14, 2013 at 01:52:16PM -0800, Greg KH wrote:
  Jan, any reason why this patch isn't in Linus's tree already, and why it
 
 Sent out the GIT PULL this week to Linus
  wasn't marked for inclusion in a -stable kernel release?
 
 I forgot to stick that. Do please include it in the stable tree.
[...]

I've queued this up for 3.2.

Ben.

-- 
Ben Hutchings
Experience is what causes a person to make new mistakes instead of old ones.


signature.asc
Description: This is a digitally signed message part
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH V3 1/2] virtio-net: fix the set affinity bug when CPU IDs are not consecutive

2013-01-10 Thread Ben Hutchings
On Thu, 2013-01-10 at 11:19 +1030, Rusty Russell wrote:
 Wanlong Gao gaowanl...@cn.fujitsu.com writes:
  On 01/09/2013 07:31 AM, Rusty Russell wrote:
  Wanlong Gao gaowanl...@cn.fujitsu.com writes:
*/
   static u16 virtnet_select_queue(struct net_device *dev, struct sk_buff 
  *skb)
   {
  - int txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) :
  -   smp_processor_id();
  + int txq = 0;
  +
  + if (skb_rx_queue_recorded(skb))
  + txq = skb_get_rx_queue(skb);
  + else if ((txq = per_cpu(vq_index, smp_processor_id())) == -1)
  + txq = 0;
  
  You should use __get_cpu_var() instead of smp_processor_id() here, ie:
  
  else if ((txq = __get_cpu_var(vq_index)) == -1)
  
  And AFAICT, no reason to initialize txq to 0 to start with.
  
  So:
  
  int txq;
  
  if (skb_rx_queue_recorded(skb))
 txq = skb_get_rx_queue(skb);
  else {
  txq = __get_cpu_var(vq_index);
  if (txq == -1)
  txq = 0;
  }
 
  Got it, thank you.
 
  
  Now, just to confirm, I assume this can happen even if we use vq_index,
  right, because of races with virtnet_set_channels?
 
  I still can't understand this race, could you explain more? thank you.
 
 I assume that someone can call virtnet_set_channels() while we are
 inside virtnet_select_queue(), so they reduce dev-real_num_tx_queues,
 causing virtnet_set_channels to do:
 
   while (unlikely(txq = dev-real_num_tx_queues))
   txq -= dev-real_num_tx_queues;
 
 Otherwise, when is this loop called?

In fact, this race can result in the TX scheduler using a queue that has
been disabled, or other weirdness (consider what happens if
real_num_tx_queues increases between those two uses).

virtnet_set_channels() really must disable TX temporarily:

netif_tx_lock(dev);
netif_device_detach(dev);
netif_tx_unlock(dev);
...
netif_device_attach(dev);

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCHv5] virtio-spec: virtio network device RFS support

2012-12-06 Thread Ben Hutchings
On Thu, 2012-12-06 at 10:13 +0200, Michael S. Tsirkin wrote:
 On Wed, Dec 05, 2012 at 08:39:26PM +, Ben Hutchings wrote:
  On Mon, 2012-12-03 at 12:58 +0200, Michael S. Tsirkin wrote:
   Add RFS support to virtio network device.
   Add a new feature flag VIRTIO_NET_F_RFS for this feature, a new
   configuration field max_virtqueue_pairs to detect supported number of
   virtqueues as well as a new command VIRTIO_NET_CTRL_RFS to program
   packet steering for unidirectional protocols.
  [...]
   +Programming of the receive flow classificator is implicit.
   + Transmitting a packet of a specific flow on transmitqX will cause 
   incoming
   + packets for this flow to be steered to receiveqX.
   + For uni-directional protocols, or where no packets have been transmitted
   + yet, device will steer a packet to a random queue out of the specified
   + receiveq0..receiveqn.
  [...]
  
  It doesn't seem like this is usable to implement accelerated RFS in the
  guest, though perhaps that doesn't matter.
 
 What is the issue? Could you be more explicit please?
 
 It seems to work pretty well: if we have
 # of queues = # of cpus, incoming TCP_STREAM into
 guest scales very nicely without manual tweaks in guest.
 
 The way it works is, when guest sends a packet driver
 select the rx queue that we want to use for incoming
 packets for this slow, and transmit on the matching tx queue.
 This is exactly what text above suggests no?

Yes, I get that.

   On the host side, presumably
  you'll want vhost_net to do the equivalent of sock_rps_record_flow() -
  only without a socket?  But in any case, that requires an rxhash, so I
  don't see how this is supposed to work.
  
  Ben.
 
 Host should just do what guest tells it to.
 On the host side we build up the steering table as we get packets
 to transmit. See the code in drivers/net/tun.c in recent
 kernels.
 
 Again this actually works fine - what are the problems that you see?
 Could you give an example please?

I'm not saying it doesn't work in its own way, I just don't see how you
would make it work with the existing RFS!

Since this doesn't seem to be intended to have *any* connection with the
existing core networking feature called RFS, perhaps you could find a
different name for it.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCHv5] virtio-spec: virtio network device RFS support

2012-12-06 Thread Ben Hutchings
On Thu, 2012-12-06 at 22:29 +0200, Michael S. Tsirkin wrote:
 On Thu, Dec 06, 2012 at 08:03:14PM +, Ben Hutchings wrote:
[...]
  Since this doesn't seem to be intended to have *any* connection with the
  existing core networking feature called RFS, perhaps you could find a
  different name for it.
  
  Ben.
 
 
 Ah I see what you mean. We started out calling this feature multiqueue
 Rusty suggested RFS since it gives similar functionality to RFS but in
 device: it has receive steering logic per flow as part of the device.

The name is quite generic, but in the context of Linux it has so far
been used for a specific software feature and not as a generic name for
flow steering by hardware (or drivers).  The existing documentation
(Documentation/networking/scaling.txt) states quite clearly that 'RFS'
means that specific software implementation (with optional driver
integration) and configuration interface.

 Maybe simply adding a statement similar to the one above would be
 sufficient to avoid confusion?

No, I don't think it's sufficient.  We have documentation that says how
to configure 'RFS', and you're proposing to add a very similar feature
called 'RFS' that is configured differently.  No matter how clearly you
distinguish them in new documentation, this will make the old
documentation confusing.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCHv5] virtio-spec: virtio network device RFS support

2012-12-06 Thread Ben Hutchings
On Thu, 2012-12-06 at 23:01 +0200, Michael S. Tsirkin wrote:
 On Thu, Dec 06, 2012 at 08:53:59PM +, Ben Hutchings wrote:
  On Thu, 2012-12-06 at 22:29 +0200, Michael S. Tsirkin wrote:
   On Thu, Dec 06, 2012 at 08:03:14PM +, Ben Hutchings wrote:
  [...]
Since this doesn't seem to be intended to have *any* connection with the
existing core networking feature called RFS, perhaps you could find a
different name for it.

Ben.
   
   
   Ah I see what you mean. We started out calling this feature multiqueue
   Rusty suggested RFS since it gives similar functionality to RFS but in
   device: it has receive steering logic per flow as part of the device.
  
  The name is quite generic, but in the context of Linux it has so far
  been used for a specific software feature and not as a generic name for
  flow steering by hardware (or drivers).  The existing documentation
  (Documentation/networking/scaling.txt) states quite clearly that 'RFS'
  means that specific software implementation (with optional driver
  integration) and configuration interface.
 
   Maybe simply adding a statement similar to the one above would be
   sufficient to avoid confusion?
  
  No, I don't think it's sufficient.  We have documentation that says how
  to configure 'RFS', and you're proposing to add a very similar feature
  called 'RFS' that is configured differently.  No matter how clearly you
  distinguish them in new documentation, this will make the old
  documentation confusing.
  
  Ben.
 
 I don't mind, renaming is just s/RFS/whatever/ away -
 how should hardware call this in your opinion?

If by 'this' you mean the use of perfect filters or a large hash table
to select the RX queue per flow, then 'flow steering'.

But that is usually combined with the fall-back of a simple mapping from
hash to queue ('RSS' or 'flow hashing') in case there is no specific
queue selection yet, which I can see tun has.  And you're specifying
multiple transmit queues too.  If you want a name for the whole set of
features involved, I don't see any better name than 'multiqueue'/'MQ'.

If you want a name for this specific flow steering mechanism, add some
distinguishing adjective(s) like 'virtual' or 'automatic'.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCHv5] virtio-spec: virtio network device RFS support

2012-12-05 Thread Ben Hutchings
On Mon, 2012-12-03 at 12:58 +0200, Michael S. Tsirkin wrote:
 Add RFS support to virtio network device.
 Add a new feature flag VIRTIO_NET_F_RFS for this feature, a new
 configuration field max_virtqueue_pairs to detect supported number of
 virtqueues as well as a new command VIRTIO_NET_CTRL_RFS to program
 packet steering for unidirectional protocols.
[...]
 +Programming of the receive flow classificator is implicit.
 + Transmitting a packet of a specific flow on transmitqX will cause incoming
 + packets for this flow to be steered to receiveqX.
 + For uni-directional protocols, or where no packets have been transmitted
 + yet, device will steer a packet to a random queue out of the specified
 + receiveq0..receiveqn.
[...]

It doesn't seem like this is usable to implement accelerated RFS in the
guest, though perhaps that doesn't matter.  On the host side, presumably
you'll want vhost_net to do the equivalent of sock_rps_record_flow() -
only without a socket?  But in any case, that requires an rxhash, so I
don't see how this is supposed to work.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [net-next rfc v7 3/3] virtio-net: change the number of queues through ethtool

2012-12-03 Thread Ben Hutchings
On Mon, 2012-12-03 at 13:25 +0200, Michael S. Tsirkin wrote:
 On Mon, Dec 03, 2012 at 02:09:28PM +0800, Jason Wang wrote:
  On Sunday, December 02, 2012 06:09:06 PM Michael S. Tsirkin wrote:
   On Tue, Nov 27, 2012 at 06:16:00PM +0800, Jason Wang wrote:
This patch implement the {set|get}_channels method of ethool to allow 
user
to change the number of queues dymaically when the device is running.
This would let the user to configure it on demand.

Signed-off-by: Jason Wang jasow...@redhat.com
---

 drivers/net/virtio_net.c |   41 
+
 1 files changed, 41 insertions(+), 0 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index bcaa6e5..f08ec2a 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1578,10 +1578,51 @@ static struct virtio_driver virtio_net_driver = 
{

 #endif
 };

+/* TODO: Eliminate OOO packets during switching */
+static int virtnet_set_channels(struct net_device *dev,
+   struct ethtool_channels *channels)
+{
+   struct virtnet_info *vi = netdev_priv(dev);
+   u16 queue_pairs = channels-combined_count;
 
 by the way shouldn't this be combined_count / 2?
 
 And below channels-max_combined = vi-max_queue_pairs * 2; ?
[...]

In this ethtool API, 'channel' means an IRQ and set of queues that
trigger it.  So each ethtool-channel will correspond to one queue-pair
and not one virtio channel.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [rfc net-next v6 2/3] virtio_net: multiqueue support

2012-11-19 Thread Ben Hutchings
On Sun, 2012-11-18 at 11:13 +0200, Michael S. Tsirkin wrote:
 On Sat, Nov 17, 2012 at 12:35:29AM +, Ben Hutchings wrote:
  On Tue, 2012-11-13 at 08:40 +0200, Michael S. Tsirkin wrote:
   On Mon, Nov 05, 2012 at 11:38:39AM +1030, Rusty Russell wrote:
 @@ -924,11 +1032,10 @@ static void virtnet_get_ringparam(struct 
 net_device *dev,
  {
   struct virtnet_info *vi = netdev_priv(dev);
  
 - ring-rx_max_pending = virtqueue_get_vring_size(vi-rvq);
 - ring-tx_max_pending = virtqueue_get_vring_size(vi-svq);
 + ring-rx_max_pending = virtqueue_get_vring_size(vi-rq[0].vq);
 + ring-tx_max_pending = virtqueue_get_vring_size(vi-sq[0].vq);
   ring-rx_pending = ring-rx_max_pending;
   ring-tx_pending = ring-tx_max_pending;
 -
  }

This assumes all vqs are the same size.  I think this should probably
check: for mq mode, use the first vq, otherewise use the 0th.
   
   For rx_pending/tx_pending I think what is required here is the
   actual number of outstanding buffers.
   Dave, Eric - right?
   
   So this should be the total over all rings and to be useful,
   rx_max_pending/tx_max_pending should be the total too.
  
  So far as I know, all current implementations use the number of
  descriptors per ring here. virtio_net should be consistent with this.
  
  Ben.
 
 Problem is, it could in theory be different between rings. I guess we
 could use the maximum.
 
 What's the right thing to do for rx_pending - I am guessing
 we want the current outstanding packets right?

The 'max_pending' fields are for the maximum ring sizes supported; the
'pending' fields are for the current ring sizes.  The current number of
descriptors pending is not included (would be a bit pointless).

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.



___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [rfc net-next v6 2/3] virtio_net: multiqueue support

2012-11-16 Thread Ben Hutchings
On Tue, 2012-11-13 at 08:40 +0200, Michael S. Tsirkin wrote:
 On Mon, Nov 05, 2012 at 11:38:39AM +1030, Rusty Russell wrote:
   @@ -924,11 +1032,10 @@ static void virtnet_get_ringparam(struct 
   net_device *dev,
{
 struct virtnet_info *vi = netdev_priv(dev);

   - ring-rx_max_pending = virtqueue_get_vring_size(vi-rvq);
   - ring-tx_max_pending = virtqueue_get_vring_size(vi-svq);
   + ring-rx_max_pending = virtqueue_get_vring_size(vi-rq[0].vq);
   + ring-tx_max_pending = virtqueue_get_vring_size(vi-sq[0].vq);
 ring-rx_pending = ring-rx_max_pending;
 ring-tx_pending = ring-tx_max_pending;
   -
}
  
  This assumes all vqs are the same size.  I think this should probably
  check: for mq mode, use the first vq, otherewise use the 0th.
 
 For rx_pending/tx_pending I think what is required here is the
 actual number of outstanding buffers.
 Dave, Eric - right?
 
 So this should be the total over all rings and to be useful,
 rx_max_pending/tx_max_pending should be the total too.

So far as I know, all current implementations use the number of
descriptors per ring here. virtio_net should be consistent with this.

Ben.

  For bonus points, check this assertion at probe time.
 
 Looks like we can easily support different queues too.
 

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [rfc net-next v6 3/3] virtio-net: change the number of queues through ethtool

2012-11-08 Thread Ben Hutchings
On Tue, 2012-10-30 at 18:03 +0800, Jason Wang wrote:
 This patch implement the {set|get}_channels method of ethool to allow user to
 change the number of queues dymaically when the device is running. This would
 let the user to tune the device for specific applications.
 
 Signed-off-by: Jason Wang jasow...@redhat.com
 ---
  drivers/net/virtio_net.c |   43 +++
  1 files changed, 43 insertions(+), 0 deletions(-)
 
 diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
 index 8cc43e5..66fc129 100644
 --- a/drivers/net/virtio_net.c
 +++ b/drivers/net/virtio_net.c
 @@ -1616,10 +1616,53 @@ static struct virtio_driver virtio_net_driver = {
  #endif
  };
  
 +/* TODO: Eliminate OOO packets during switching */
 +static int virtnet_set_channels(struct net_device *dev,
 + struct ethtool_channels *channels)
 +{
 + struct virtnet_info *vi = netdev_priv(dev);
 + u16 queue_pairs = channels-combined_count;
 +
 + /* Only two modes were support currently */
 + if (queue_pairs == 0)
 + return -EINVAL;
 + if (queue_pairs != vi-total_queue_pairs - 1  queue_pairs != 1)
 + return -EINVAL;

You should also check that all of the other counts are == 0.

 + vi-num_queue_pairs = queue_pairs  1 ? queue_pairs + 1 : 1;

This doesn't make sense.  You're setting vi-num_queue_pairs to be
combined_count + 1...

 + BUG_ON(virtnet_set_queues(vi));
 +
 + netif_set_real_num_tx_queues(dev, vi-num_queue_pairs);
 + netif_set_real_num_rx_queues(dev, vi-num_queue_pairs);
 +
 + return 0;
 +}
 +
 +static void virtnet_get_channels(struct net_device *dev,
 +  struct ethtool_channels *channels)
 +{
 + struct virtnet_info *vi = netdev_priv(dev);
 +
 + if (vi-total_queue_pairs != 1) {
 + channels-combined_count = vi-num_queue_pairs;

but here you report combined_count as being vi-num_queue_pairs.  Do you
need to subtract 1 here?

Ben.

 + channels-max_combined = vi-total_queue_pairs - 1;
 + } else {
 + channels-combined_count = 1;
 + channels-max_combined = 1;
 + }
 +
 + channels-max_other = 0;
 + channels-rx_count = 0;
 + channels-tx_count = 0;
 + channels-other_count = 0;
 +}
 +
  static const struct ethtool_ops virtnet_ethtool_ops = {
   .get_drvinfo = virtnet_get_drvinfo,
   .get_link = ethtool_op_get_link,
   .get_ringparam = virtnet_get_ringparam,
 + .set_channels = virtnet_set_channels,
 + .get_channels = virtnet_get_channels,
  };
  
  static int __init init(void)

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCHv4] virtio-spec: virtio network device multiqueue support

2012-09-12 Thread Ben Hutchings
On Wed, 2012-09-12 at 07:40 -0700, Tom Herbert wrote:
 On Wed, Sep 12, 2012 at 12:57 AM, Michael S. Tsirkin m...@redhat.com wrote:
  On Wed, Sep 12, 2012 at 03:19:11PM +0930, Rusty Russell wrote:
[...]
  Perhaps Tom can explain how we avoid out-of-order receive for the
  accelerated RFS case?  It's not clear to me, but we need to be able to
  do that for virtio-net if it implements accelerated RFS.
 
 AFAIK ooo RX is still possible with accelerated RFS.  We have an
 algorithm that prevents this for RFS by deferring a migration to a new
 queue as long as it's possible that a flow might have outstanding
 packets on the old queue.  I suppose this could be implemented in the
 device for the HW queues, but I don't think it would be easy to cover
 all cases where packets were already in transit to the host or other
 cases where host and device queues are out of sync.
[...]

Yes, I couldn't see any way to eliminate the possibility of OOO.  The
software queue check in RFS should redirect the flow only when it is new
or has had an idle period, when I hope only a few packets will be
received before we send some kind of response (transport or application
layer ACK).  So I think that OOO is not that likely in practice, but I
don't have the evidence to back that up.

If the filter update latency is high enough that a response can overtake
the filter update, there may be more of a problem.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [net-next RFC V5 5/5] virtio_net: support negotiating the number of queues through ctrl vq

2012-07-09 Thread Ben Hutchings
On Thu, 2012-07-05 at 18:29 +0800, Jason Wang wrote:
 This patch let the virtio_net driver can negotiate the number of queues it
 wishes to use through control virtqueue and export an ethtool interface to let
 use tweak it.
 
 As current multiqueue virtio-net implementation has optimizations on per-cpu
 virtuqueues, so only two modes were support:
 
 - single queue pair mode
 - multiple queue paris mode, the number of queues matches the number of vcpus
 
 The single queue mode were used by default currently due to regression of
 multiqueue mode in some test (especially in stream test).
 
 Since virtio core does not support paritially deleting virtqueues, so during
 mode switching the whole virtqueue were deleted and the driver would re-create
 the virtqueues it would used.
 
 btw. The queue number negotiating were defered to .ndo_open(), this is because
 only after feature negotitaion could we send the command to control virtqueue
 (as it may also use event index).
[...]
 +static int virtnet_set_channels(struct net_device *dev,
 + struct ethtool_channels *channels)
 +{
 + struct virtnet_info *vi = netdev_priv(dev);
 + u16 queues = channels-rx_count;
 + unsigned status = VIRTIO_CONFIG_S_ACKNOWLEDGE | VIRTIO_CONFIG_S_DRIVER;
 +
 + if (channels-rx_count != channels-tx_count)
 + return -EINVAL;
[...]
 +static void virtnet_get_channels(struct net_device *dev,
 +  struct ethtool_channels *channels)
 +{
 + struct virtnet_info *vi = netdev_priv(dev);
 +
 + channels-max_rx = vi-total_queue_pairs;
 + channels-max_tx = vi-total_queue_pairs;
 + channels-max_other = 0;
 + channels-max_combined = 0;
 + channels-rx_count = vi-num_queue_pairs;
 + channels-tx_count = vi-num_queue_pairs;
 + channels-other_count = 0;
 + channels-combined_count = 0;
 +}
[...]

It looks like the queue-pairs should be treated as 'combined channels',
not separate RX and TX channels.  Also you don't need to clear the other
members; you can assume that the ethtool core will zero-initialise
structures for 'get' operations.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [patch net-next 1/4] net: introduce new priv_flag indicating iface capable of change mac when running

2012-06-28 Thread Ben Hutchings
On Thu, 2012-06-28 at 16:10 +0200, Jiri Pirko wrote:
 Introduce IFF_LIFE_ADDR_CHANGE priv_flag and use it to disable
 netif_running() check in eth_mac_addr()

 Signed-off-by: Jiri Pirko jpi...@redhat.com
 ---
  include/linux/if.h |2 ++
  net/ethernet/eth.c |2 +-
  2 files changed, 3 insertions(+), 1 deletion(-)
 
 diff --git a/include/linux/if.h b/include/linux/if.h
 index f995c66..fd9ee7c 100644
 --- a/include/linux/if.h
 +++ b/include/linux/if.h
 @@ -81,6 +81,8 @@
  #define IFF_UNICAST_FLT  0x2 /* Supports unicast filtering   
 */
  #define IFF_TEAM_PORT0x4 /* device used as team port */
  #define IFF_SUPP_NOFCS   0x8 /* device supports sending 
 custom FCS */
 +#define IFF_LIFE_ADDR_CHANGE 0x10/* device supports hardware 
 address
 +  * change when it's running */
[...]

Any device that has IFF_UNICAST_FLT can update the unicast MAC filter
while it's running; doesn't that go hand-in-hand with being able to
handle changes to the primary MAC address?  Is the new flag really
necessary at all?

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [V2 RFC net-next PATCH 2/2] virtio_net: export more statistics through ethtool

2012-06-07 Thread Ben Hutchings
On Wed, 2012-06-06 at 15:52 +0800, Jason Wang wrote:
 Satistics counters is useful for debugging and performance optimization, so 
 this
 patch lets virtio_net driver collect following and export them to userspace
 through ethtool -S:
 
 - number of packets sent/received
 - number of bytes sent/received
 - number of callbacks for tx/rx
 - number of kick for tx/rx
 - number of bytes/packets queued for tx
 
 As virtnet_stats were per-cpu, so both per-cpu and gloabl satistics were
 collected like:
[...]

I would really like to see some sort of convention for presenting
per-queue statistics through ethtool.  At the moment we have a complete
mess of different formats:

bnx2x:[${index}]: ${name}
be2net:   ${qtype}q${index}: ${name}
ehea: PR${index} ${name}
mlx4_en:  ${qtype}${index}_${name}
myri10ge: dummy stat names as headings
niu:  dummy stat names as headings
s2io: ring_${index}_${name}
vmxnet3:  dummy stat names as headings
vxge: ${name}_${index}; also dummy stat names as headings

And you're introducing yet another format!

(Additionally some of the drivers are playing games with spaces and tabs
to make ethtool indent the stats the way they like.  Ethtool statistics
are inconsistent enough already without drivers pulling that sort of
crap.

I'm inclined to make ethtool start stripping whitespace from stat names,
and *if* people can agree on a common format for per-queue statistic
names then I'll indent them *consistently*.  Also, I would make such
stats optional, so you don't get hundreds of lines of crap by default.)

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [V2 RFC net-next PATCH 2/2] virtio_net: export more statistics through ethtool

2012-06-07 Thread Ben Hutchings
On Thu, 2012-06-07 at 13:05 -0700, David Miller wrote:
 From: Ben Hutchings bhutchi...@solarflare.com
 Date: Thu, 7 Jun 2012 18:15:06 +0100
 
  I would really like to see some sort of convention for presenting
  per-queue statistics through ethtool.  At the moment we have a complete
  mess of different formats:
 
 Indeed.  Probably ${QUEUE_TYPE}-${INDEX}-${STATISTIC} is best.
 With an agreed upon list of queue types such as rx, tx, rxtx
 etc.

I think we should leave the type names open-ended, as there are other
useful groupings like per-virtual-port.  In that case the separator
should be chosen to allow arbitrary type names without ambiguity.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [V2 RFC net-next PATCH 2/2] virtio_net: export more statistics through ethtool

2012-06-07 Thread Ben Hutchings
On Thu, 2012-06-07 at 13:39 -0700, Rick Jones wrote:
 On 06/07/2012 01:24 PM, Ben Hutchings wrote:
  On Thu, 2012-06-07 at 13:05 -0700, David Miller wrote:
  From: Ben Hutchingsbhutchi...@solarflare.com
  Date: Thu, 7 Jun 2012 18:15:06 +0100
 
  I would really like to see some sort of convention for presenting
  per-queue statistics through ethtool.  At the moment we have a complete
  mess of different formats:
 
  Indeed.  Probably ${QUEUE_TYPE}-${INDEX}-${STATISTIC} is best.
  With an agreed upon list of queue types such as rx, tx, rxtx
  etc.
 
  I think we should leave the type names open-ended, as there are other
  useful groupings like per-virtual-port.  In that case the separator
  should be chosen to allow arbitrary type names without ambiguity.
 
 So you mean like something along the lines of the presence of say '.' 
 indicating indent a level:
 
 rx_bytes:  1234
  myqueue1.rx_bytes: 234
  myqueue2.rx_bytes: 345
  ...

Most drivers seem to want this sort of ordering/grouping:

group0.foo
group0.bar
...
group1.foo
group1.bar
...

but if we have a standard way of indicating groups of statistics then
the user can choose whether they want to reorder by type name.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [V2 RFC net-next PATCH 2/2] virtio_net: export more statistics through ethtool

2012-06-07 Thread Ben Hutchings
On Thu, 2012-06-07 at 21:56 +0100, Ben Hutchings wrote:
 On Thu, 2012-06-07 at 13:39 -0700, Rick Jones wrote:
  On 06/07/2012 01:24 PM, Ben Hutchings wrote:
   On Thu, 2012-06-07 at 13:05 -0700, David Miller wrote:
   From: Ben Hutchingsbhutchi...@solarflare.com
   Date: Thu, 7 Jun 2012 18:15:06 +0100
  
   I would really like to see some sort of convention for presenting
   per-queue statistics through ethtool.  At the moment we have a complete
   mess of different formats:
  
   Indeed.  Probably ${QUEUE_TYPE}-${INDEX}-${STATISTIC} is best.
   With an agreed upon list of queue types such as rx, tx, rxtx
   etc.
  
   I think we should leave the type names open-ended, as there are other
   useful groupings like per-virtual-port.  In that case the separator
   should be chosen to allow arbitrary type names without ambiguity.
  
  So you mean like something along the lines of the presence of say '.' 
  indicating indent a level:
  
  rx_bytes:  1234
   myqueue1.rx_bytes: 234
   myqueue2.rx_bytes: 345
   ...
 
 Most drivers seem to want this sort of ordering/grouping:
 
 group0.foo
 group0.bar
 ...
 group1.foo
 group1.bar
 ...
 
 but if we have a standard way of indicating groups of statistics then
 the user can choose whether they want to reorder by type name.

I mean, whether they want to reorder/regroup by the final part of the
statistic name.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] virtio-net: fix a race on 32bit arches

2012-06-06 Thread Ben Hutchings
On Wed, 2012-06-06 at 22:08 +0200, Eric Dumazet wrote:
 On Wed, 2012-06-06 at 22:58 +0300, Michael S. Tsirkin wrote:
 
  Absolutely, I am talking about virtio here.  I'm not kicking
  u64_stats_sync idea I am just saying that simple locking
  would work for virtio and might be better as it
  gives us a way to get counters atomically.
 
 Which lock do you own in the RX path ?
 
 You'll have to add a lock in fast path. This sounds really a bad choice
 to me.

You have the NAPI 'lock', so when gathering stats you can synchronise
using napi_disable() ;-)

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] virtio-net: fix a race on 32bit arches

2012-06-06 Thread Ben Hutchings
On Wed, 2012-06-06 at 23:16 +0300, Michael S. Tsirkin wrote:
 On Wed, Jun 06, 2012 at 10:08:09PM +0200, Eric Dumazet wrote:
  On Wed, 2012-06-06 at 22:58 +0300, Michael S. Tsirkin wrote:
  
   Absolutely, I am talking about virtio here.  I'm not kicking
   u64_stats_sync idea I am just saying that simple locking
   would work for virtio and might be better as it
   gives us a way to get counters atomically.
  
  Which lock do you own in the RX path ?
 
 We can just disable napi, everything is updated from napi callback.

Seriously, though: don't do that; this is going to hurt performance for
minimal benefit.

Ben.

  You'll have to add a lock in fast path. This sounds really a bad choice
  to me.
 
 .ndo_get_stats64 is not data path though, is it?
 

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] vhost-net: add module alias (v2)

2012-01-11 Thread Ben Hutchings
On Wed, 2012-01-11 at 09:16 -0800, Stephen Hemminger wrote:
 By adding the correct module alias, programs won't have to explicitly
 call modprobe. Vhost-net will always be available if built into the kernel.
 It does require assigning a permanent minor number for depmod to work.
 Choose one next to TUN since this driver is related to it.
 
 Also, use C99 style initialization.
 
 Signed-off-by: Stephen Hemminger shemmin...@vyatta.com
 
 ---
 v2 - document minor number and make sure to not overlap
[...]
 --- a/include/linux/miscdevice.h  2012-01-10 10:56:59.779189436 -0800
 +++ b/include/linux/miscdevice.h  2012-01-11 09:13:20.803694316 -0800
 @@ -42,6 +42,7 @@
  #define AUTOFS_MINOR 235
  #define MAPPER_CTRL_MINOR236
  #define LOOP_CTRL_MINOR  237
 +#define VHOST_NET_MINOR  238
  #define MISC_DYNAMIC_MINOR   255
  
  struct device;
 --- a/Documentation/devices.txt   2012-01-10 10:56:53.399116518 -0800
 +++ b/Documentation/devices.txt   2012-01-11 09:12:49.251197653 -0800
 @@ -447,6 +447,8 @@ Your cooperation is appreciated.
   234 = /dev/btrfs-controlBtrfs control device
   235 = /dev/autofs   Autofs control device
   236 = /dev/mapper/control   Device-Mapper control device
 + 237 = /dev/vhost-netHost kernel accelerator for virtio net
[...]

238 != 237.  It looks like someone forgot to add loopctrl here.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [net-next RFC PATCH 0/5] Series short description

2011-12-14 Thread Ben Hutchings
On Fri, 2011-12-09 at 16:01 +1030, Rusty Russell wrote:
 On Wed, 7 Dec 2011 17:02:04 +, Ben Hutchings bhutchi...@solarflare.com 
 wrote:
  Solarflare controllers (sfc driver) have 8192 perfect filters for
  TCP/IPv4 and UDP/IPv4 which can be used for flow steering.  (The filters
  are organised as a hash table, but matched based on 5-tuples.)  I
  implemented the 'accelerated RFS' interface in this driver.
  
  I believe the Intel 82599 controllers (ixgbe driver) have both
  hash-based and perfect filter modes and the driver can be configured to
  use one or the other.  The driver has its own independent mechanism for
  steering RX and TX flows which predates RFS; I don't know whether it
  uses hash-based or perfect filters.
 
 Thanks for this summary (and Jason, too).  I've fallen a long way behind
 NIC state-of-the-art.
  
  Most multi-queue controllers could support a kind of hash-based
  filtering for TCP/IP by adjusting the RSS indirection table.  However,
  this table is usually quite small (64-256 entries).  This means that
  hash collisions will be quite common and this can result in reordering.
  The same applies to the small table Jason has proposed for virtio-net.
 
 But this happens on real hardware today.  Better that real hardware is
 nice, but is it overkill?

What do you mean, it happens on real hardware today?  So far as I know,
the only cases where we have dynamic adjustment of flow steering are in
ixgbe (big table of hash filters, I think) and sfc (perfect filters).
I don't think that anyone's currently doing flow steering with the RSS
indirection table.  (At least, not on Linux.  I think that Microsoft was
intending to do so on Windows, but I don't know whether they ever did.)

 And can't you reorder even with perfect matching, since prior packets
 will be on the old queue and more recent ones on the new queue?  Does it
 discard or requeue old ones?  Or am I missing a trick?

Yes, that is possible.  RFS is careful to avoid such reordering by only
changing the steering of a flow when none of its packets can be in a
software receive queue.  It is not generally possible to do the same for
hardware receive queues.  However, when the first condition is met it is
likely that there won't be a whole lot of packets for that flow in the
hardware receive queue either.  (But if there are, then I think as a
side-effect of commit 09994d1 RFS will repeatedly ask the driver to
steer the flow.  Which isn't ideal.)

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [net-next RFC PATCH 0/5] Series short description

2011-12-07 Thread Ben Hutchings
On Wed, 2011-12-07 at 19:31 +0800, Jason Wang wrote:
 On 12/07/2011 03:30 PM, Rusty Russell wrote:
  On Mon, 05 Dec 2011 16:58:37 +0800, Jason Wangjasow...@redhat.com  wrote:
  multiple queue virtio-net: flow steering through host/guest cooperation
 
  Hello all:
 
  This is a rough series adds the guest/host cooperation of flow
  steering support based on Krish Kumar's multiple queue virtio-net
  driver patch 3/3 (http://lwn.net/Articles/467283/).
  Is there a real (physical) device which does this kind of thing?  How do
  they do it?  Can we copy them?
 
  Cheers,
  Rusty.
 As far as I see, ixgbe and sfc have similar but much more sophisticated 
 mechanism.
 
 The idea was originally suggested by Ben and it was just borrowed form 
 those real physical nic cards who can dispatch packets based on their 
 hash. All of theses cards can filter the flow based on the hash of 
 L2/L3/L4 header and the stack would tell the card which queue should 
 this flow goes.

Solarflare controllers (sfc driver) have 8192 perfect filters for
TCP/IPv4 and UDP/IPv4 which can be used for flow steering.  (The filters
are organised as a hash table, but matched based on 5-tuples.)  I
implemented the 'accelerated RFS' interface in this driver.

I believe the Intel 82599 controllers (ixgbe driver) have both
hash-based and perfect filter modes and the driver can be configured to
use one or the other.  The driver has its own independent mechanism for
steering RX and TX flows which predates RFS; I don't know whether it
uses hash-based or perfect filters.

Most multi-queue controllers could support a kind of hash-based
filtering for TCP/IP by adjusting the RSS indirection table.  However,
this table is usually quite small (64-256 entries).  This means that
hash collisions will be quite common and this can result in reordering.
The same applies to the small table Jason has proposed for virtio-net.

 So in host, a simple hash to queue table were introduced in tap/macvtap 
 and in guest, the guest driver would tell the desired queue of a flow 
 through changing this table.

I don't think accelerated RFS can work well without the use of perfect
filtering or hash-based filtering with a very low rate of collisions.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [net-next RFC PATCH 2/5] tuntap: simple flow director support

2011-12-06 Thread Ben Hutchings
On Tue, 2011-12-06 at 15:21 +0800, Jason Wang wrote:
 On 12/06/2011 04:09 AM, Ben Hutchings wrote:
  On Mon, 2011-12-05 at 16:58 +0800, Jason Wang wrote:
  This patch adds a simple flow director to tun/tap device. It is just a
  page that contains the hash to queue mapping which could be changed by
  user-space. The backend (tap/macvtap) would query this table to get
  the desired queue of a packets when it send packets to userspace.
  This is just flow hashing (RSS), not flow steering.
 
  The page address were set through a new kind of ioctl - TUNSETFD and
  were pinned until device exit or another new page were specified.
  [...]
 
  You should implement ethtool ETHTOOL_{G,S}RXFHINDIR instead.
 
  Ben.
 
 
 I'm not fully understanding this. The page belongs to guest, and the 
 idea is to let guest driver can easily change any entry. Looks like if 
 ethtool_set_rxfh_indir() is used, this kind of change is not easy as it 
 needs one copy and can only accept the whole table as its parameters.

Sorry, yes, I was misreading this.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [net-next RFC PATCH 5/5] virtio-net: flow director support

2011-12-06 Thread Ben Hutchings
On Tue, 2011-12-06 at 15:25 +0800, Jason Wang wrote:
 On 12/06/2011 04:42 AM, Ben Hutchings wrote:
[...]
  This is not a proper implementation of ndo_rx_flow_steer.  If you steer
  a flow by changing the RSS table this can easily cause packet reordering
  in other flows.  The filtering should be more precise, ideally matching
  exactly a single flow by e.g. VID and IP 5-tuple.
 
  I think you need to add a second hash table which records exactly which
  flow is supposed to be steered.  Also, you must call
  rps_may_expire_flow() to check whether an entry in this table may be
  replaced; otherwise you can cause packet reordering in the flow that was
  previously being steered.
 
  Finally, this function must return the table index it assigned, so that
  rps_may_expire_flow() works.
 
 Thanks for the explanation, how about document this briefly in scaling.txt?
[...]

I believe scaling.txt is intended for users/administrators, not
developers.

The documentation for implementers of accelerated RFS is in the comment
for struct net_device_ops and the commit message adding it.  But I
really should improve that comment.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [net-next RFC PATCH 2/5] tuntap: simple flow director support

2011-12-05 Thread Ben Hutchings
On Mon, 2011-12-05 at 16:58 +0800, Jason Wang wrote:
 This patch adds a simple flow director to tun/tap device. It is just a
 page that contains the hash to queue mapping which could be changed by
 user-space. The backend (tap/macvtap) would query this table to get
 the desired queue of a packets when it send packets to userspace.

This is just flow hashing (RSS), not flow steering.

 The page address were set through a new kind of ioctl - TUNSETFD and
 were pinned until device exit or another new page were specified.
[...]

You should implement ethtool ETHTOOL_{G,S}RXFHINDIR instead.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [net-next RFC PATCH 3/5] macvtap: flow director support

2011-12-05 Thread Ben Hutchings
Similarly, macvtap chould implement the ethtool {get,set}_rxfh_indir
operations.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [net-next RFC PATCH 5/5] virtio-net: flow director support

2011-12-05 Thread Ben Hutchings
On Mon, 2011-12-05 at 16:59 +0800, Jason Wang wrote:
 In order to let the packets of a flow to be passed to the desired
 guest cpu, we can co-operate with devices through programming the flow
 director which was just a hash to queue table.
 
 This kinds of co-operation is done through the accelerate RFS support,
 a device specific flow sterring method virtnet_fd() is used to modify
 the flow director based on rfs mapping. The desired queue were
 calculated through reverse mapping of the irq affinity table. In order
 to parallelize the ingress path, irq affinity of rx queue were also
 provides by the driver.
 
 In addition to accelerate RFS, we can also use the guest scheduler to
 balance the load of TX and reduce the lock contention on egress path,
 so the processor_id() were used to tx queue selection.
[...]
 +#ifdef CONFIG_RFS_ACCEL
 +
 +int virtnet_fd(struct net_device *net_dev, const struct sk_buff *skb,
 +u16 rxq_index, u32 flow_id)
 +{
 + struct virtnet_info *vi = netdev_priv(net_dev);
 + u16 *table = NULL;
 +
 + if (skb-protocol != htons(ETH_P_IP) || !skb-rxhash)
 + return -EPROTONOSUPPORT;

Why only IPv4?

 + table = kmap_atomic(vi-fd_page);
 + table[skb-rxhash  TAP_HASH_MASK] = rxq_index;
 + kunmap_atomic(table);
 +
 + return 0;
 +}
 +#endif

This is not a proper implementation of ndo_rx_flow_steer.  If you steer
a flow by changing the RSS table this can easily cause packet reordering
in other flows.  The filtering should be more precise, ideally matching
exactly a single flow by e.g. VID and IP 5-tuple.

I think you need to add a second hash table which records exactly which
flow is supposed to be steered.  Also, you must call
rps_may_expire_flow() to check whether an entry in this table may be
replaced; otherwise you can cause packet reordering in the flow that was
previously being steered.

Finally, this function must return the table index it assigned, so that
rps_may_expire_flow() works.

 +static u16 virtnet_select_queue(struct net_device *dev, struct sk_buff *skb)
 +{
 + int txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) :
 +smp_processor_id();
 +
 + /* As we make use of the accelerate rfs which let the scheduler to
 +  * balance the load, it make sense to choose the tx queue also based on
 +  * theprocessor id?
 +  */
 + while (unlikely(txq = dev-real_num_tx_queues))
 + txq -= dev-real_num_tx_queues;
 + return txq;
 +}
[...]

Don't do this, let XPS handle it.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC] [ver3 PATCH 3/6] virtio_net: virtio_net driver changes

2011-11-21 Thread Ben Hutchings
On Fri, 2011-11-11 at 18:34 +0530, Krishna Kumar wrote:
 Changes for multiqueue virtio_net driver.
[...]
 @@ -677,25 +730,35 @@ static struct rtnl_link_stats64 *virtnet
  {
   struct virtnet_info *vi = netdev_priv(dev);
   int cpu;
 - unsigned int start;
  
   for_each_possible_cpu(cpu) {
 - struct virtnet_stats __percpu *stats
 - = per_cpu_ptr(vi-stats, cpu);
 - u64 tpackets, tbytes, rpackets, rbytes;
 -
 - do {
 - start = u64_stats_fetch_begin(stats-syncp);
 - tpackets = stats-tx_packets;
 - tbytes   = stats-tx_bytes;
 - rpackets = stats-rx_packets;
 - rbytes   = stats-rx_bytes;
 - } while (u64_stats_fetch_retry(stats-syncp, start));
 -
 - tot-rx_packets += rpackets;
 - tot-tx_packets += tpackets;
 - tot-rx_bytes   += rbytes;
 - tot-tx_bytes   += tbytes;
 + int qpair;
 +
 + for (qpair = 0; qpair  vi-num_queue_pairs; qpair++) {
 + struct virtnet_send_stats __percpu *tx_stat;
 + struct virtnet_recv_stats __percpu *rx_stat;

While you're at it, you can drop the per-CPU stats and make them only
per-queue.  There is unlikely to be any benefit in maintaining them
per-CPU while receive and transmit processing is serialised per-queue.

[...]
 +static int invoke_find_vqs(struct virtnet_info *vi)
 +{
 + vq_callback_t **callbacks;
 + struct virtqueue **vqs;
 + int ret = -ENOMEM;
 + int i, total_vqs;
 + char **names;
 +
 + /*
 +  * We expect 1 RX virtqueue followed by 1 TX virtqueue, followed
 +  * by the same 'vi-num_queue_pairs-1' more times, and optionally
 +  * one control virtqueue.
 +  */
 + total_vqs = vi-num_queue_pairs * 2 +
 + virtio_has_feature(vi-vdev, VIRTIO_NET_F_CTRL_VQ);
 +
 + /* Allocate space for find_vqs parameters */
 + vqs = kmalloc(total_vqs * sizeof(*vqs), GFP_KERNEL);
 + callbacks = kmalloc(total_vqs * sizeof(*callbacks), GFP_KERNEL);
 + names = kmalloc(total_vqs * sizeof(*names), GFP_KERNEL);
 + if (!vqs || !callbacks || !names)
 + goto err;
 +
 + /* Allocate/initialize parameters for recv virtqueues */
 + for (i = 0; i  vi-num_queue_pairs * 2; i += 2) {
 + callbacks[i] = skb_recv_done;
 + names[i] = kasprintf(GFP_KERNEL, input.%d, i / 2);
 + if (!names[i])
 + goto err;
 + }
 +
 + /* Allocate/initialize parameters for send virtqueues */
 + for (i = 1; i  vi-num_queue_pairs * 2; i += 2) {
 + callbacks[i] = skb_xmit_done;
 + names[i] = kasprintf(GFP_KERNEL, output.%d, i / 2);
 + if (!names[i])
 + goto err;
 + }
[...]

The RX and TX interrupt names for a multiqueue device should follow the
formats device-rx-index and device-tx-index.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC] [ver3 PATCH 3/6] virtio_net: virtio_net driver changes

2011-11-21 Thread Ben Hutchings
On Fri, 2011-11-18 at 08:24 +0200, Sasha Levin wrote:
 On Fri, 2011-11-18 at 01:08 +, Ben Hutchings wrote:
  On Fri, 2011-11-11 at 18:34 +0530, Krishna Kumar wrote:
   Changes for multiqueue virtio_net driver.
  [...]
   @@ -677,25 +730,35 @@ static struct rtnl_link_stats64 *virtnet
{
 struct virtnet_info *vi = netdev_priv(dev);
 int cpu;
   - unsigned int start;

 for_each_possible_cpu(cpu) {
   - struct virtnet_stats __percpu *stats
   - = per_cpu_ptr(vi-stats, cpu);
   - u64 tpackets, tbytes, rpackets, rbytes;
   -
   - do {
   - start = u64_stats_fetch_begin(stats-syncp);
   - tpackets = stats-tx_packets;
   - tbytes   = stats-tx_bytes;
   - rpackets = stats-rx_packets;
   - rbytes   = stats-rx_bytes;
   - } while (u64_stats_fetch_retry(stats-syncp, start));
   -
   - tot-rx_packets += rpackets;
   - tot-tx_packets += tpackets;
   - tot-rx_bytes   += rbytes;
   - tot-tx_bytes   += tbytes;
   + int qpair;
   +
   + for (qpair = 0; qpair  vi-num_queue_pairs; qpair++) {
   + struct virtnet_send_stats __percpu *tx_stat;
   + struct virtnet_recv_stats __percpu *rx_stat;
  
  While you're at it, you can drop the per-CPU stats and make them only
  per-queue.  There is unlikely to be any benefit in maintaining them
  per-CPU while receive and transmit processing is serialised per-queue.
 
 It allows you to update stats without a lock.

But you'll already be holding a lock related to the queue.

 Whats the benefit of having them per queue?

It should save some memory (and a little time when summing stats, though
that's unlikely to matter much).

The important thing is that splitting up stats per-CPU *and* per-queue
is a waste.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC] [ver3 PATCH 3/6] virtio_net: virtio_net driver changes

2011-11-21 Thread Ben Hutchings
On Fri, 2011-11-18 at 18:18 +0200, Sasha Levin wrote:
 On Fri, 2011-11-18 at 15:40 +, Ben Hutchings wrote:
  On Fri, 2011-11-18 at 08:24 +0200, Sasha Levin wrote:
   On Fri, 2011-11-18 at 01:08 +, Ben Hutchings wrote:
On Fri, 2011-11-11 at 18:34 +0530, Krishna Kumar wrote:
 Changes for multiqueue virtio_net driver.
[...]
 @@ -677,25 +730,35 @@ static struct rtnl_link_stats64 *virtnet
  {
   struct virtnet_info *vi = netdev_priv(dev);
   int cpu;
 - unsigned int start;
  
   for_each_possible_cpu(cpu) {
 - struct virtnet_stats __percpu *stats
 - = per_cpu_ptr(vi-stats, cpu);
 - u64 tpackets, tbytes, rpackets, rbytes;
 -
 - do {
 - start = u64_stats_fetch_begin(stats-syncp);
 - tpackets = stats-tx_packets;
 - tbytes   = stats-tx_bytes;
 - rpackets = stats-rx_packets;
 - rbytes   = stats-rx_bytes;
 - } while (u64_stats_fetch_retry(stats-syncp, start));
 -
 - tot-rx_packets += rpackets;
 - tot-tx_packets += tpackets;
 - tot-rx_bytes   += rbytes;
 - tot-tx_bytes   += tbytes;
 + int qpair;
 +
 + for (qpair = 0; qpair  vi-num_queue_pairs; qpair++) {
 + struct virtnet_send_stats __percpu *tx_stat;
 + struct virtnet_recv_stats __percpu *rx_stat;

While you're at it, you can drop the per-CPU stats and make them only
per-queue.  There is unlikely to be any benefit in maintaining them
per-CPU while receive and transmit processing is serialised per-queue.
   
   It allows you to update stats without a lock.
  
  But you'll already be holding a lock related to the queue.
 
 Right, but now you're holding a queue lock just when playing with the
 queue, we don't hold it when we process the data - which is when we
 usually need to update stats.
[...]

The *stack* is holding the appropriate lock when calling the NAPI poll
function or ndo_start_xmit function.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH net-next] enable virtio_net to return bus_info in ethtool -i consistent with emulated NICs

2011-11-14 Thread Ben Hutchings
On Mon, 2011-11-14 at 13:52 -0800, Rick Jones wrote:
 From: Rick Jones rick.jon...@hp.com
 
 Add a new .bus_name to virtio_config_ops then modify virtio_net to 
 call through to it in an ethtool .get_drvinfo routine to report
 bus_info in ethtool -i output which is consistent with other
 emulated NICs and the output of lspci.  
[...]
 diff --git a/drivers/lguest/lguest_device.c b/drivers/lguest/lguest_device.c
 index 0dc30ff..3724d45 100644
 --- a/drivers/lguest/lguest_device.c
 +++ b/drivers/lguest/lguest_device.c
 @@ -381,6 +381,11 @@ error:
   return PTR_ERR(vqs[i]);
  }
  
 +static const char *lg_bus_name(struct virtio_device *vdev)
 +{
 + return Not Implemented;
 +}
[...]
 +static const char *kvm_bus_name(struct virtio_device *vdev)
 +{
 + return Not Implemented;
 +}
[...]

Please use the existing 'not implemented' value, which is the empty
string.   If you think ethtool should print some helpful message instead
of an empty string, please submit a patch for ethtool.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH net-next] enable virtio_net to return bus_info in ethtool -i consistent with emulated NICs

2011-11-14 Thread Ben Hutchings
On Mon, 2011-11-14 at 16:06 -0800, Rick Jones wrote:
 On 11/14/2011 02:30 PM, Ben Hutchings wrote:
  On Mon, 2011-11-14 at 13:52 -0800, Rick Jones wrote:
  From: Rick Jonesrick.jon...@hp.com
 
  Add a new .bus_name to virtio_config_ops then modify virtio_net to
  call through to it in an ethtool .get_drvinfo routine to report
  bus_info in ethtool -i output which is consistent with other
  emulated NICs and the output of lspci.
  [...]
  diff --git a/drivers/lguest/lguest_device.c 
  b/drivers/lguest/lguest_device.c
  index 0dc30ff..3724d45 100644
  --- a/drivers/lguest/lguest_device.c
  +++ b/drivers/lguest/lguest_device.c
  @@ -381,6 +381,11 @@ error:
 return PTR_ERR(vqs[i]);
}
 
  +static const char *lg_bus_name(struct virtio_device *vdev)
  +{
  +  return Not Implemented;
  +}
  [...]
  +static const char *kvm_bus_name(struct virtio_device *vdev)
  +{
  +  return Not Implemented;
  +}
  [...]
 
  Please use the existing 'not implemented' value, which is the empty
  string.   If you think ethtool should print some helpful message instead
  of an empty string, please submit a patch for ethtool.
 
 
 One question - will those actually be called via an ethtool path?  In my 
 poking about through the virtio code, I got the impression those modules 
 were for other than networking sorts of things.

I don't know; I just assumed that was why you were adding them!  In
other contexts such as dev_printk() this string would make even less
sense.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization