date:20170316

[PATCH net-next] ipv6: Implement optimized IPv6 masked address comparison for ARM64

2017-03-16 Thread Subash Abhinov Kasiviswanathan

Android devices use multiple ip[6]tables for statistics, UID matching
and other functionality. Perf output indicated that ip6_do_table
was taking a considerable amount of CPU and more that ip_do_table
for an equivalent rate. ipv6_masked_addr_cmp was chosen for
optimization as there are more instructions required than the
equivalent operation in ip_packet_match.

Using 128 bit operations helps to reduce the number of instructions
for the match on an ARM64 system. This helps to improve UDPv6 DL
performance by 40Mbps (860Mbps -> 900Mbps) on a CPU limited system.

Tested on x86_64 UML to check if generic version is used and ARM64
to verify that ARM64 version is used.

Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 arch/alpha/include/asm/Kbuild  |  1 +
 arch/arc/include/asm/Kbuild|  1 +
 arch/arm/include/asm/Kbuild|  1 +
 arch/arm64/include/asm/ipv6.h  | 29 +
 arch/avr32/include/asm/Kbuild  |  1 +
 arch/blackfin/include/asm/Kbuild   |  1 +
 arch/c6x/include/asm/Kbuild|  1 +
 arch/cris/include/asm/Kbuild   |  1 +
 arch/frv/include/asm/Kbuild|  1 +
 arch/h8300/include/asm/Kbuild  |  1 +
 arch/hexagon/include/asm/Kbuild|  1 +
 arch/ia64/include/asm/Kbuild   |  1 +
 arch/m32r/include/asm/Kbuild   |  1 +
 arch/m68k/include/asm/Kbuild   |  1 +
 arch/metag/include/asm/Kbuild  |  1 +
 arch/microblaze/include/asm/Kbuild |  1 +
 arch/mips/include/asm/Kbuild   |  1 +
 arch/mn10300/include/asm/Kbuild|  1 +
 arch/nios2/include/asm/Kbuild  |  1 +
 arch/openrisc/include/asm/Kbuild   |  1 +
 arch/parisc/include/asm/Kbuild |  1 +
 arch/powerpc/include/asm/Kbuild|  1 +
 arch/s390/include/asm/Kbuild   |  1 +
 arch/score/include/asm/Kbuild  |  1 +
 arch/sh/include/asm/Kbuild |  1 +
 arch/sparc/include/asm/Kbuild  |  1 +
 arch/tile/include/asm/Kbuild   |  1 +
 arch/um/include/asm/Kbuild |  1 +
 arch/unicore32/include/asm/Kbuild  |  1 +
 arch/x86/include/asm/Kbuild|  1 +
 arch/xtensa/include/asm/Kbuild |  1 +
 include/asm-generic/ipv6.h | 32 
 include/net/ipv6.h | 20 +---
 33 files changed, 92 insertions(+), 19 deletions(-)
 create mode 100644 arch/arm64/include/asm/ipv6.h
 create mode 100644 include/asm-generic/ipv6.h

diff --git a/arch/alpha/include/asm/Kbuild b/arch/alpha/include/asm/Kbuild
index d103db5..5b7e92b 100644
--- a/arch/alpha/include/asm/Kbuild
+++ b/arch/alpha/include/asm/Kbuild
@@ -3,6 +3,7 @@
 generic-y += clkdev.h
 generic-y += exec.h
 generic-y += export.h
+generic-y += ipv6.h
 generic-y += irq_work.h
 generic-y += mcs_spinlock.h
 generic-y += mm-arch-hooks.h
diff --git a/arch/arc/include/asm/Kbuild b/arch/arc/include/asm/Kbuild
index 63a0401..99f1456 100644
--- a/arch/arc/include/asm/Kbuild
+++ b/arch/arc/include/asm/Kbuild
@@ -14,6 +14,7 @@ generic-y += hw_irq.h
 generic-y += ioctl.h
 generic-y += ioctls.h
 generic-y += ipcbuf.h
+generic-y += ipv6.h
 generic-y += irq_regs.h
 generic-y += irq_work.h
 generic-y += kmap_types.h
diff --git a/arch/arm/include/asm/Kbuild b/arch/arm/include/asm/Kbuild
index b14e8c7..a0ba9ac 100644
--- a/arch/arm/include/asm/Kbuild
+++ b/arch/arm/include/asm/Kbuild
@@ -9,6 +9,7 @@ generic-y += errno.h
 generic-y += exec.h
 generic-y += ioctl.h
 generic-y += ipcbuf.h
+generic-y += ipv6.h
 generic-y += irq_regs.h
 generic-y += kdebug.h
 generic-y += local.h
diff --git a/arch/arm64/include/asm/ipv6.h b/arch/arm64/include/asm/ipv6.h
new file mode 100644
index 000..d49dec6
--- /dev/null
+++ b/arch/arm64/include/asm/ipv6.h
@@ -0,0 +1,29 @@
+/* Copyright (c) 2017, The Linux Foundation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 and
+ * only version 2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef __ASM_IPV6_H
+#define __ASM_IPV6_H
+
+#include 
+
+static inline bool
+ipv6_masked_addr_cmp(const struct in6_addr *a1, const struct in6_addr *m,
+const struct in6_addr *a2)
+{
+   const __uint128_t *ul1 = (const __uint128_t *)a1;
+   const __uint128_t *ulm = (const __uint128_t *)m;
+   const __uint128_t *ul2 = (const __uint128_t *)a1;
+
+   return !!((*ul1 ^ *ul2) & *ulm);
+}
+#define ipv6_masked_addr_cmp ipv6_masked_addr_cmp
+#endif /* __ASM_IPV6_H */
diff --git a/arch/avr32/include/asm/Kbuild b/arch/avr32/include/asm/Kbuild
index 3d7ef2c..fd6a964 100644
--- a/arch/avr32/include/asm/Kbuild
+++ b/arch/avr32/include/asm/Kbuild
@@ -6,6 +6,7 @@ generic-y += div64.h
 generic-y += emergency-restart.h
 generic-y += exec.h

[PATCH net-next v3 2/2]L2TP:Adjust intf MTU, add underlay L3, L2 hdrs

2017-03-16 Thread R. Parameswaran


In existing kernel code, when setting up the L2TP interface, all of the
tunnel encapsulation headers are not taken into account when setting
up the MTU on the  L2TP logical interface device. Due to this, the
packets created by the applications on top of the L2TP layer are larger
than they ought to be, relative to the underlay MTU, which leads to
needless fragmentation once the L2TP packet is encapsulated in an outer IP
packet.

Specifically, the MTU calculation  does not take into account the (outer)
IP header imposed on the encapsulated L2TP packet, and the Layer 2 header
imposed on the inner L2TP packet prior to encapsulation. The patch posted
here takes care of these.

Existing code also seems to assume an Ethernet (non-jumbo) underlay. The
patch uses the PMTU mechanism and the dst entry in the L2TP tunnel socket
to directly pull up the underlay MTU (as the baseline number on top of
which the encapsulation headers are factored in).  Ethernet MTU is
assumed as a fallback only if this fails.

Picked up review comments from James Chapman, added a function
to compute ip header + ip option overhead on a socket, and factored it
into L2TP change-set.

Signed-off-by: R. Parameswaran 
---
 net/l2tp/l2tp_eth.c | 51 +++
 1 file changed, 47 insertions(+), 4 deletions(-)

diff --git a/net/l2tp/l2tp_eth.c b/net/l2tp/l2tp_eth.c
index 8bf18a5..f512d97 100644
--- a/net/l2tp/l2tp_eth.c
+++ b/net/l2tp/l2tp_eth.c
@@ -30,6 +30,9 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 
 #include "l2tp_core.h"
 
@@ -204,6 +207,49 @@ static void l2tp_eth_show(struct seq_file *m, void *arg)
 }
 #endif
 
+static void l2tp_eth_adjust_mtu(struct l2tp_tunnel *tunnel,
+   struct l2tp_session *session,
+   struct net_device *dev)
+{
+   unsigned int overhead = 0;
+   struct dst_entry *dst;
+   u32 l3_overhead = 0;
+
+   if (session->mtu != 0) {
+   dev->mtu = session->mtu;
+   dev->needed_headroom += session->hdr_len;
+   if (tunnel->encap == L2TP_ENCAPTYPE_UDP)
+   dev->needed_headroom += sizeof(struct udphdr);
+   return;
+   }
+   overhead = session->hdr_len;
+   l3_overhead = kernel_sock_ip_overhead(tunnel->sock);
+   if (!tunnel->sock || (l3_overhead == 0)) {
+   /* L3 Overhead couldn't be identified, dev mtu stays at 1500 */
+   return;
+   }
+   /* Adjust MTU, factor overhead - underlay L3, overlay L2 hdr */
+   overhead += ETH_HLEN + l3_overhead;
+   /* Additionally, if the encap is UDP, account for UDP header size */
+   if (tunnel->encap == L2TP_ENCAPTYPE_UDP) {
+   overhead += sizeof(struct udphdr);
+   dev->needed_headroom += sizeof(struct udphdr);
+   }
+   /* If PMTU discovery was enabled, use discovered MTU on L2TP device */
+   dst = sk_dst_get(tunnel->sock);
+   if (dst) {
+   /* dst_mtu will use PMTU if found, else fallback to intf MTU */
+   u32 pmtu = dst_mtu(dst);
+
+   if (pmtu != 0)
+   dev->mtu = pmtu;
+   dst_release(dst);
+   }
+   session->mtu = dev->mtu - overhead;
+   dev->mtu = session->mtu;
+   dev->needed_headroom += session->hdr_len;
+}
+
 static int l2tp_eth_create(struct net *net, u32 tunnel_id, u32 session_id, u32 
peer_session_id, struct l2tp_session_cfg *cfg)
 {
struct net_device *dev;
@@ -253,12 +299,9 @@ static int l2tp_eth_create(struct net *net, u32 tunnel_id, 
u32 session_id, u32 p
}
 
dev_net_set(dev, net);
-   if (session->mtu == 0)
-   session->mtu = dev->mtu - session->hdr_len;
-   dev->mtu = session->mtu;
-   dev->needed_headroom += session->hdr_len;
dev->min_mtu = 0;
dev->max_mtu = ETH_MAX_MTU;
+   l2tp_eth_adjust_mtu(tunnel, session, dev);
 
priv = netdev_priv(dev);
priv->dev = dev;
-- 
2.1.4

Re: [PATCH net-next 0/3] netvsc: small changesfor net-next

2017-03-16 Thread David Miller

From: Stephen Hemminger 
Date: Thu, 16 Mar 2017 16:12:36 -0700

> One bugfix, and two non-code patches

Series applied.

Re: [PATCH net-next] liquidio: fix wrong information about link modes reported to ethtool

2017-03-16 Thread David Miller

From: Felix Manlunas 
Date: Thu, 16 Mar 2017 16:16:17 -0700

> From: Manish Awasthi 
> 
> Information reported to ethtool about link modes is wrong for 25G NIC.  Fix
> it by checking for presence of 25G NIC, checking the link speed reported by
> NIC firmware, and then assigning proper values to the
> ethtool_link_ksettings struct.
> 
> Signed-off-by: Manish Awasthi 
> Signed-off-by: Felix Manlunas 

Applied.

Re: [PATCH net] tcp: tcp_get_info() should read tcp_time_stamp later

2017-03-16 Thread David Miller

From: Eric Dumazet 
Date: Thu, 16 Mar 2017 15:43:19 -0700

> From: Eric Dumazet 
> 
> Commit b369e7fd41f7 ("tcp: make TCP_INFO more consistent") moved
> lock_sock_fast() earlier in tcp_get_info()
> 
> This has the minor effect that jiffies value being sampled at the
> beginning of tcp_get_info() is more likely to be off by one, and we
> report big tcpi_last_data_sent values (like 0x).
> 
> Since we lock the socket, fetching tcp_time_stamp right before
> doing the jiffies_to_msecs() calls is enough to remove these
> wrong values.
> 
> Signed-off-by: Eric Dumazet 

Applied, thanks Eric.

Re: [PATCH net] netvsc: fix race during initialization

2017-03-16 Thread David Miller

From: Stephen Hemminger 
Date: Thu, 16 Mar 2017 12:21:32 -0700

> When device is being setup on boot, there is a small race where
> network device callback is registered, but the netvsc_device pointer
> is not set yet.  This can cause a NULL ptr dereference if packet
> arrives during this window.
> 
> Fixes: 46b4f7f5d1f7 ("netvsc: eliminate per-device outstanding send counter")
> Signed-off-by: Stephen Hemminger 
> ---
> Applies to 4.11 only, not in earlier versions.
> net-next needs different patch

Applied, thanks Stephen.

Re: [PATCH net] rxrpc: Ignore BUSY packets on old calls

2017-03-16 Thread David Miller

From: David Howells 
Date: Thu, 16 Mar 2017 16:27:10 +

> If we receive a BUSY packet for a call we think we've just completed, the
> packet is handed off to the connection processor to deal with - but the
> connection processor doesn't expect a BUSY packet and so flags a protocol
> error.
> 
> Fix this by simply ignoring the BUSY packet for the moment.
> 
> The symptom of this may appear as a system call failing with EPROTO.  This
> may be triggered by pressing ctrl-C under some circumstances.
> 
> This comes about we abort calls due to interruption by a signal (which we
> shouldn't do, but that's going to be a large fix and mostly in fs/afs/).
> What happens is that we abort the call and may also abort follow up calls
> too (this needs offloading somehoe).  So we see a transmission of something
> like the following sequence of packets:
> 
>   DATA for call N
>   ABORT call N
>   DATA for call N+1
>   ABORT call N+1
> 
> in very quick succession on the same channel.  However, the peer may have
> deferred the processing of the ABORT from the call N to a background thread
> and thus sees the DATA message from the call N+1 coming in before it has
> cleared the channel.  Thus it sends a BUSY packet[*].
> 
> [*] Note that some implementations (OpenAFS, for example) mark the BUSY
> packet with one plus the callNumber of the call prior to call N.
> Ordinarily, this would be call N, but there's no requirement for the
> calls on a channel to be numbered strictly sequentially (the number is
> required to increase).
> 
> This is wrong and means that the callNumber in the BUSY packet should
> be ignored (it really ought to be N+1 since that's what it's in
> response to).
> 
> Signed-off-by: David Howells 

Applied, thanks David.

[PATCH net-next v3 1/2]L2TP:Adjust intf MTU, add underlay L3, L2 hdrs

2017-03-16 Thread R. Parameswaran



In existing kernel code, when setting up the L2TP interface, all of the
tunnel encapsulation headers are not taken into account when setting
up the MTU on the  L2TP logical interface device. Due to this, the
packets created by the applications on top of the L2TP layer are larger
than they ought to be, relative to the underlay MTU, which leads to
needless fragmentation once the L2TP packet is encapsulated in an outer IP
packet.

Specifically, the MTU calculation  does not take into account the (outer)
IP header imposed on the encapsulated L2TP packet, and the Layer 2 header
imposed on the inner L2TP packet prior to encapsulation. The patch posted
here takes care of these.

Existing code also seems to assume an Ethernet (non-jumbo) underlay. The
patch uses the PMTU mechanism and the dst entry in the L2TP tunnel socket
to directly pull up the underlay MTU (as the baseline number on top of
which the encapsulation headers are factored in).  Ethernet MTU is
assumed as a fallback only if this fails.

Picked up review comments from James Chapman, added a function
to compute ip header + ip option overhead on a socket, and factored it
into L2TP change-set.

Signed-off-by: R. Parameswaran 
---
 include/linux/net.h |  3 +++
 net/socket.c| 41 +
 2 files changed, 44 insertions(+)

diff --git a/include/linux/net.h b/include/linux/net.h
index 0620f5e..a42fab2 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -298,6 +298,9 @@ int kernel_sendpage(struct socket *sock, struct page *page, 
int offset,
 int kernel_sock_ioctl(struct socket *sock, int cmd, unsigned long arg);
 int kernel_sock_shutdown(struct socket *sock, enum sock_shutdown_cmd how);
 
+/* Following routine returns the IP overhead imposed by a socket.  */
+u32 kernel_sock_ip_overhead(struct sock *sk);
+
 #define MODULE_ALIAS_NETPROTO(proto) \
MODULE_ALIAS("net-pf-" __stringify(proto))
 
diff --git a/net/socket.c b/net/socket.c
index e034fe4..af54b12 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -3345,3 +3345,44 @@ int kernel_sock_shutdown(struct socket *sock, enum 
sock_shutdown_cmd how)
return sock->ops->shutdown(sock, how);
 }
 EXPORT_SYMBOL(kernel_sock_shutdown);
+
+/* This routine returns the IP overhead imposed by a socket i.e.
+ * the length of the underlying IP header, depending on whether
+ * this is an IPv4 or IPv6 socket and the length from IP options turned
+ * on at the socket.
+ */
+u32 kernel_sock_ip_overhead(struct sock *sk)
+{
+   struct inet_sock *inet;
+   struct ipv6_pinfo *np;
+   struct ip_options_rcu *opt = NULL;
+   struct ipv6_txoptions *optv6 = NULL;
+   u32 overhead = 0;
+   bool owned_by_user = sock_owned_by_user(sk);
+
+   if (!sk)
+   return overhead;
+   switch (sk->sk_family) {
+   case AF_INET:
+   inet = inet_sk(sk);
+   overhead += sizeof(struct iphdr);
+   if (inet)
+   opt = rcu_dereference_protected(inet->inet_opt,
+   owned_by_user);
+   if (opt)
+   overhead += opt->opt.optlen;
+   return overhead;
+   case AF_INET6:
+   np = inet6_sk(sk);
+   overhead += sizeof(struct ipv6hdr);
+   if (np)
+   optv6 = rcu_dereference_protected(np->opt,
+ owned_by_user);
+   if (optv6)
+   overhead += (optv6->opt_flen + optv6->opt_nflen);
+   return overhead;
+   default: /* Returns 0 overhead if the socket is not ipv4 or ipv6 */
+   return overhead;
+   }
+}
+EXPORT_SYMBOL(kernel_sock_ip_overhead);
-- 
2.1.4

Re: [PATCH 05/10] netfilter: nf_tables: fix mismatch in big-endian system

2017-03-16 Thread Liping Zhang

Hi David,
2017-03-16 18:58 GMT+08:00 David Laight :
[...]
>> For the similar reason, when loading an u16 value from the u32 data
>> register, we should use "*(u16 *) sreg;" instead of "(u16)*sreg;",
>> the 2nd method will get the wrong value in the big-endian system.
> ...
>
> That seems to be papering over some of the obvious cracks.
>
> The fact that the entire 32bits are zeroed makes me suspect that
> in some places values are written as 8 bits and read later as 32.
> The only way that can work on BE is if the values are always written
> as 32 bit ones (assignment style 2).

In nftables, user can use concatenations to put two or more selectors together.
For example, we can use "nft add rule x y tcp dport . ip saddr 22 . 1.1.1.1"
to match the tcp dest port and ip source addr at the same time.

In such case, tcp dport will be stored to REG0 and ip saddr will be stored to
REG1, so the layout will be like this(from the lowest address):
  PORT(2 bytes) - (2 bytes) - IPADDR(4 bytes)

Later we will use the statement "memcmp(, , 8)" to do compare,
so the  part must be filled with zero to avoid garbage value.

>
> OTOH using memcmp(,,2) relies on the data being in the lower addressed
> bytes.
>
> If the data does need to be in the lower addressed bytes I'd suggest
> using a union rather than all the casts.
>
> David
>

Re: [Patch net] bridge: resolve a false alarm of lockdep

2017-03-16 Thread David Miller

From: Nikolay Aleksandrov 
Date: Thu, 16 Mar 2017 19:36:07 +0200

> On 16/03/17 19:32, Cong Wang wrote:
>> Andrei reported a false alarm of lockdep at net/bridge/br_fdb.c:109,
>> this is because in Andrei's case, a spin_bug() was already triggered
>> before this, therefore the debug_locks is turned off, lockdep_is_held()
>> is no longer accurate after that. We should use lockdep_assert_held_once()
>> instead of lockdep_is_held() to respect debug_locks.
>> 
>> Reported-by: Andrei Vagin 
>> Signed-off-by: Cong Wang 
>> ---
>>  net/bridge/br_fdb.c | 2 +-
>>  net/bridge/br_private.h | 9 -
>>  2 files changed, 1 insertion(+), 10 deletions(-)
>> 
> 
> Interesting, thanks Cong!
> 
> Fixes: 410b3d48f5111 ("bridge: fdb: add proper lock checks in searching 
> functions")
> 
> Acked-by: Nikolay Aleksandrov 

Applied, thanks.

Re: [PATCH net] ibmvnic: Free tx/rx scrq pointer array when releasing sub-crqs

2017-03-16 Thread David Miller

From: Nathan Fontenot 
Date: Wed, 15 Mar 2017 23:38:07 -0400

> The pointer array for the tx/rx sub crqs should be free'ed when
> releasing the tx/rx sub crqs.
> 
> Signed-off-by: Nathan Fontenot 

Applied, thanks.

Re: Performance issue with igb with lots of different src-ip addrs.

2017-03-16 Thread Ben Greear


I think we can, might take us a day or two to get time to do it.

Thanks,
Ben

On 03/16/2017 08:05 PM, Alexander Duyck wrote:

I'm not really interested in installing a custom version of pktgen.
Any chance you can recreate the issue with standard pktgen?

You might try running perf to get a snapshot of what is using CPU time
on the system.  It will probably give you a pretty good idea where the
code is that is eating up all your CPU time.

- Alex

On Thu, Mar 16, 2017 at 7:46 PM, Ben Greear  wrote:

I'm actually using a hacked up version of pktgen nicely driven by our
GUI tool, but the crux is that you need to set min and max src IP to some
large
range.

We are driving pktgen from a separate machine.  Stock pktgen isn't good at
reporting
received pkts last I checked, so it may be more difficult to easily view the
problem.

I'll be happy to set up my tool on your Fedora 24 or similar VM or machine
if you
want.

Thanks,
Ben


On 03/16/2017 07:35 PM, Alexander Duyck wrote:


Can you include the pktgen script you are running?

Also when you say you are driving traffic through the bridge are you
sending from something external on the system or are you actually
directing the traffic from pktgen into the bridge directly?

- Alex

On Thu, Mar 16, 2017 at 3:49 PM, Ben Greear 
wrote:


Hello,

We notice that when using two igb ports as a bridge, if we use pktgen to
drive traffic through the bridge and randomize (or use a very large
range)
for the source IP addr in pktgen, then performance of igb is very poor
(like
150Mbps
throughput instead of 1Gbps).  It runs right at line speed if we use same
src/dest
IP addr in pktgen.  So, seems it is related to lots of src/dest IP
addresses.

We see same problem when using pktgen to send to itself, and we see this
in
several different kernels.  We specifically tested bridge mode in this
stock
Fedora kernel:

 Linux lfo350-59cc 4.9.13-101.fc24.x86_64 #1 SMP Tue Mar 7 23:48:32 UTC
2017
x86_64 x86_64 x86_64 GNU/Linux

e1000e does not show this problem in our testing.

Any ideas what the issue might be and how to fix it?

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com





--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com




--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com

Re: [PATCH] net: ipv6: set route type for anycast routes

2017-03-16 Thread David Miller

From: David Ahern 
Date: Thu, 16 Mar 2017 21:45:12 -0600

> On 3/16/17 9:41 PM, David Miller wrote:
>> From: David Ahern 
>> Date: Wed, 15 Mar 2017 18:14:33 -0700
>> 
>>> Anycast routes have the RTF_ANYCAST flag set, but when dumping routes
>>> for userspace the route type is not set to RTN_ANYCAST. Make it so.
>>>
>>> Fixes: 58c4fb86eabcb ("[IPV6]: Flag RTF_ANYCAST for anycast routes")
>>> CC: Hideaki YOSHIFUJI 
>>> Signed-off-by: David Ahern 
>> 
>> Good catch, applied, thanks.
> 
> Can you put this into stable for back to say 4.4?

I am only submitting patches for v4.10 and v4.9 -stable.

Re: [PATCH] net: ipv6: set route type for anycast routes

2017-03-16 Thread David Ahern

On 3/16/17 9:41 PM, David Miller wrote:
> From: David Ahern 
> Date: Wed, 15 Mar 2017 18:14:33 -0700
> 
>> Anycast routes have the RTF_ANYCAST flag set, but when dumping routes
>> for userspace the route type is not set to RTN_ANYCAST. Make it so.
>>
>> Fixes: 58c4fb86eabcb ("[IPV6]: Flag RTF_ANYCAST for anycast routes")
>> CC: Hideaki YOSHIFUJI 
>> Signed-off-by: David Ahern 
> 
> Good catch, applied, thanks.

Can you put this into stable for back to say 4.4? that way tools can
directly discriminate anycast routes from host routes for at least
new'ish kernels. Thanks,

Re: [PATCH net-next 0/6] bpf: inline bpf_map_lookup_elem()

2017-03-16 Thread David Miller

From: Alexei Starovoitov 
Date: Wed, 15 Mar 2017 18:26:38 -0700

> bpf_map_lookup_elem() is one of the most frequently used helper functions.
> Improve JITed program performance by inlining this helper.
> 
> bpf_map_type  before  after
> hash  58M 74M
> array 174M280M
> 
> The values are number of lookups per second in ideal conditions
> measured by micro-benchmark in patch 6.
> 
> The 'perf report' for HASH map type:
> before:
> 54.23%  map_perf_test  [kernel.kallsyms]  [k] __htab_map_lookup_elem
> 14.24%  map_perf_test  [kernel.kallsyms]  [k] lookup_elem_raw
>  8.84%  map_perf_test  [kernel.kallsyms]  [k] htab_map_lookup_elem
>  5.93%  map_perf_test  [kernel.kallsyms]  [k] bpf_map_lookup_elem
>  2.30%  map_perf_test  [kernel.kallsyms]  [k] bpf_prog_da4fc6a3f41761a2
>  1.49%  map_perf_test  [kernel.kallsyms]  [k] kprobe_ftrace_handler
> 
> after:
> 60.03%  map_perf_test  [kernel.kallsyms]  [k] __htab_map_lookup_elem
> 18.07%  map_perf_test  [kernel.kallsyms]  [k] lookup_elem_raw
>  2.91%  map_perf_test  [kernel.kallsyms]  [k] bpf_prog_da4fc6a3f41761a2
>  1.94%  map_perf_test  [kernel.kallsyms]  [k] _einittext
>  1.90%  map_perf_test  [kernel.kallsyms]  [k] __audit_syscall_exit
>  1.72%  map_perf_test  [kernel.kallsyms]  [k] kprobe_ftrace_handler
> 
> so the cost of htab_map_lookup_elem() and bpf_map_lookup_elem()
> is gone after inlining.
> 
> 'per-cpu' and 'lru' map types can be optimized similarly in the future.
> 
> Note the sparse will complain that bpf is addictive ;)
> kernel/bpf/hashtab.c:438:19: sparse: subtraction of functions? Share your 
> drugs
> kernel/bpf/verifier.c:3342:38: sparse: subtraction of functions? Share your 
> drugs
> it's not a new warning, just in new places.

Series applied, thanks.

Re: [PATCH net-next v3] net: ipv4: add support for ECMP hash policy choice

2017-03-16 Thread David Miller

From: Tom Herbert 
Date: Tue, 14 Mar 2017 19:30:47 -0700

> Agreed, but then why do we even need any complexity here by that
> argument? RSS is specifically defined to do 5-tuple hashing for TCP
> (and UDP), and 3-tuple. No one has ever complained that doing per flow
> RSS for TCP is bad thing AFAIK. We followed that same model for RPS,
> RFS, and XPS via state in the connection context. The skb_hash is
> often given to us for free, whereas in order to do a 3-tuple we have
> to actually do more work and do at least an extra jhash. I suppose the
> argument is probably that switches allow this configuration and
> somehow we want to have feature parity, but it would be very
> interesting to know if anyone is not doing per flow ECMP in real life
> and why...

Anyone doing any kind of hardware offload work requires the
possibility of getting feature parity on the software side.  It's the
only way to properly test and debug things.

Re: [PATCH] net: ipv6: set route type for anycast routes

2017-03-16 Thread David Miller

From: David Ahern 
Date: Wed, 15 Mar 2017 18:14:33 -0700

> Anycast routes have the RTF_ANYCAST flag set, but when dumping routes
> for userspace the route type is not set to RTN_ANYCAST. Make it so.
> 
> Fixes: 58c4fb86eabcb ("[IPV6]: Flag RTF_ANYCAST for anycast routes")
> CC: Hideaki YOSHIFUJI 
> Signed-off-by: David Ahern 

Good catch, applied, thanks.

Re: [PATCH v2] net: sun: sungem: rix a possible null dereference

2017-03-16 Thread David Miller

From: Philippe Reynes 
Date: Wed, 15 Mar 2017 22:51:13 +0100

> The function gem_begin_auto_negotiation dereference
> the pointer ep before testing if it's null. This
> patch add a check on ep before dereferencing it.
> 
> Fixes: 92552fdd557 ("net: sun: sungem: use new api

[davem@localhost net-next]$ git describe 92552fdd557
fatal: Not a valid object name 92552fdd557
[davem@localhost net-next]$

I don't know where you got that SHA1-ID, but it's not the one that
was actually used.

It is advised that you use Linus's or one of my GIT trees to find the
proper commit IDs.

RE: [PATCH net-next] r8152: simply the arguments

2017-03-16 Thread Hayes Wang

David Laight [mailto:david.lai...@aculab.com]
> Sent: Thursday, March 16, 2017 10:28 PM
[...]
> If you are really lucky the compiler will optimise it away.
> Otherwise it will generate another local variable and possibly
> a register spill to stack.

However, I could reduce the time for calculating the address,
because I only calculate it once and save the result to a variable.
Right?

Best Regards,
Hayes

[PATCH net-next] r8152: check hw version first

2017-03-16 Thread Hayes Wang

Check hw version first in probe(). Do nothing if the driver doesn't
support the chip.

Signed-off-by: Hayes Wang 
---
 drivers/net/usb/r8152.c | 102 ++--
 1 file changed, 63 insertions(+), 39 deletions(-)

diff --git a/drivers/net/usb/r8152.c b/drivers/net/usb/r8152.c
index 4b85e95..3262a32 100644
--- a/drivers/net/usb/r8152.c
+++ b/drivers/net/usb/r8152.c
@@ -4236,44 +4236,6 @@ static const struct net_device_ops rtl8152_netdev_ops = {
.ndo_features_check = rtl8152_features_check,
 };
 
-static void r8152b_get_version(struct r8152 *tp)
-{
-   u32 ocp_data;
-   u16 version;
-
-   ocp_data = ocp_read_word(tp, MCU_TYPE_PLA, PLA_TCR1);
-   version = (u16)(ocp_data & VERSION_MASK);
-
-   switch (version) {
-   case 0x4c00:
-   tp->version = RTL_VER_01;
-   break;
-   case 0x4c10:
-   tp->version = RTL_VER_02;
-   break;
-   case 0x5c00:
-   tp->version = RTL_VER_03;
-   tp->mii.supports_gmii = 1;
-   break;
-   case 0x5c10:
-   tp->version = RTL_VER_04;
-   tp->mii.supports_gmii = 1;
-   break;
-   case 0x5c20:
-   tp->version = RTL_VER_05;
-   tp->mii.supports_gmii = 1;
-   break;
-   case 0x5c30:
-   tp->version = RTL_VER_06;
-   tp->mii.supports_gmii = 1;
-   break;
-   default:
-   netif_info(tp, probe, tp->netdev,
-  "Unknown version 0x%04x\n", version);
-   break;
-   }
-}
-
 static void rtl8152_unload(struct r8152 *tp)
 {
if (test_bit(RTL8152_UNPLUG, >flags))
@@ -4338,14 +4300,66 @@ static int rtl_ops_init(struct r8152 *tp)
return ret;
 }
 
+static u8 rtl_get_version(struct usb_interface *intf)
+{
+   struct usb_device *udev = interface_to_usbdev(intf);
+   u32 ocp_data = 0;
+   __le32 *tmp;
+   u8 version;
+   int ret;
+
+   tmp = kmalloc(sizeof(*tmp), GFP_KERNEL);
+   if (!tmp)
+   return 0;
+
+   ret = usb_control_msg(udev, usb_rcvctrlpipe(udev, 0),
+ RTL8152_REQ_GET_REGS, RTL8152_REQT_READ,
+ PLA_TCR0, MCU_TYPE_PLA, tmp, sizeof(*tmp), 500);
+   if (ret > 0)
+   ocp_data = (__le32_to_cpu(*tmp) >> 16) & VERSION_MASK;
+
+   kfree(tmp);
+
+   switch (ocp_data) {
+   case 0x4c00:
+   version = RTL_VER_01;
+   break;
+   case 0x4c10:
+   version = RTL_VER_02;
+   break;
+   case 0x5c00:
+   version = RTL_VER_03;
+   break;
+   case 0x5c10:
+   version = RTL_VER_04;
+   break;
+   case 0x5c20:
+   version = RTL_VER_05;
+   break;
+   case 0x5c30:
+   version = RTL_VER_06;
+   break;
+   default:
+   version = RTL_VER_UNKNOWN;
+   dev_info(>dev, "Unknown version 0x%04x\n", ocp_data);
+   break;
+   }
+
+   return version;
+}
+
 static int rtl8152_probe(struct usb_interface *intf,
 const struct usb_device_id *id)
 {
struct usb_device *udev = interface_to_usbdev(intf);
+   u8 version = rtl_get_version(intf);
struct r8152 *tp;
struct net_device *netdev;
int ret;
 
+   if (version == RTL_VER_UNKNOWN)
+   return -ENODEV;
+
if (udev->actconfig->desc.bConfigurationValue != 1) {
usb_driver_set_configuration(udev, 1);
return -ENODEV;
@@ -4365,8 +4379,18 @@ static int rtl8152_probe(struct usb_interface *intf,
tp->udev = udev;
tp->netdev = netdev;
tp->intf = intf;
+   tp->version = version;
+
+   switch (version) {
+   case RTL_VER_01:
+   case RTL_VER_02:
+   tp->mii.supports_gmii = 0;
+   break;
+   default:
+   tp->mii.supports_gmii = 1;
+   break;
+   }
 
-   r8152b_get_version(tp);
ret = rtl_ops_init(tp);
if (ret)
goto out;
-- 
2.7.4

Re: [PATCH v2 net-next 0/5] sunvnet: better connection management

2017-03-16 Thread David Miller

From: Shannon Nelson 
Date: Tue, 14 Mar 2017 10:24:38 -0700

> These patches remove some problems in handling of carrier state
> with the ldmvsw vswitch, remove  an xoff misuse in sunvnet, and
> add stats for debug and tracking of point-to-point connections
> between the ldom VMs.
> 
> v2:
>  - added ldmvsw ndo_open to reset the LDC channel
>  - updated copyrights

Series applied, thanks Shannon.

Re: [PATCH] net: ethernet: aquantia: set net_device mtu when mtu is changed

2017-03-16 Thread David Miller

From: David Arcari 
Date: Mon, 13 Mar 2017 19:07:16 -0400

> When the aquantia device mtu is changed the net_device structure is not
> updated.  As a result the ip command does not properly reflect the mtu change.
> 
> Commit 5513e16421cb incorrectly assumed that __dev_set_mtu() was making the
> assignment ndev->mtu = new_mtu;  This is not true in the case where the driver
> has a ndo_change_mtu routine.
> 
> Fixes: 5513e16421cb ("net: ethernet: aquantia: Fixes for aq_ndev_change_mtu")
> 
> Cc: Pavel Belous 
> Signed-off-by: David Arcari 

Applied.

Re: [PATCH net-next] liquidio: use meaningful names for IRQs

2017-03-16 Thread David Miller

From: Felix Manlunas 
Date: Mon, 13 Mar 2017 12:58:04 -0700

> From: Rick Farrington 
> 
> All IRQs owned by the PF and VF drivers share the same nondescript name
> "octeon"; this makes it difficult to setup interrupt affinity.
> 
> Change the IRQ names to reflect their specific purpose:
> 
> LiquidIO---
> 
> Examples:
> LiquidIO0-pf0-rxtx-3
> LiquidIO1-vf1-rxtx-0
> LiquidIO0-pf0-aux
> 
> We cannot use netdev->name for naming the IRQs because:
> 
> 1.  Early during init, the PF and VF drivers require interrupts to
> send/receive control data from the NIC firmware; so the PF and VF
> must request IRQs long before the netdev struct is registered.
> 
> 2.  The IRQ name can only be specified at the time it is requested.
> It cannot be changed after that.
> 
> Signed-off-by: Rick Farrington 
> Signed-off-by: Felix Manlunas 
> Signed-off-by: Satanand Burla 

Applied, thank you.

Re: [PATCH net-next] bonding: add 802.3ad support for 25G speeds

2017-03-16 Thread David Miller

From: Jarod Wilson 
Date: Tue, 14 Mar 2017 11:48:32 -0400

> Cut-n-paste enablement of 802.3ad bonding on 25G NICs, which currently
> report 0 as their bandwidth.
> 
> CC: Jay Vosburgh 
> CC: Veaceslav Falico 
> CC: Andy Gospodarek 
> CC: netdev@vger.kernel.org
> Signed-off-by: Jarod Wilson 

Applied.

Re: [PATCH] tcp_westwood: fix tcp_westwood_info() style mistakes

2017-03-16 Thread David Miller

From: Chun Long 
Date: Tue, 14 Mar 2017 15:26:24 +0800

> From: chun Long 
> 
> replace comma to semi colons in tcp_westwood_info().

Applied.

Re: [PATCH] net: mpls: Fix nexthop alive tracking on down events

2017-03-16 Thread David Miller

From: David Ahern 
Date: Mon, 13 Mar 2017 16:49:10 -0700

> Alive tracking of nexthops can account for a link twice if the carrier
> goes down followed by an admin down of the same link rendering multipath
> routes useless. This is similar to 79099aab38c8 for UNREGISTER events and
> DOWN events.
> 
> Fix by tracking number of alive nexthops in mpls_ifdown similar to the
> logic in mpls_ifup. Checking the flags per nexthop once after all events
> have been processed is simpler than trying to maintian a running count
> through all event combinations.
> 
> Also, WRITE_ONCE is used instead of ACCESS_ONCE to set rt_nhn_alive
> per a comment from checkpatch:
> WARNING: Prefer WRITE_ONCE(, ) over ACCESS_ONCE() = 
> 
> Fixes: c89359a42e2a4 ("mpls: support for dead routes")
> Signed-off-by: David Ahern 

Applied.

Re: Performance issue with igb with lots of different src-ip addrs.

2017-03-16 Thread Ben Greear


I'm actually using a hacked up version of pktgen nicely driven by our
GUI tool, but the crux is that you need to set min and max src IP to some large
range.

We are driving pktgen from a separate machine.  Stock pktgen isn't good at 
reporting
received pkts last I checked, so it may be more difficult to easily view the 
problem.

I'll be happy to set up my tool on your Fedora 24 or similar VM or machine if 
you
want.

Thanks,
Ben

On 03/16/2017 07:35 PM, Alexander Duyck wrote:

Can you include the pktgen script you are running?

Also when you say you are driving traffic through the bridge are you
sending from something external on the system or are you actually
directing the traffic from pktgen into the bridge directly?

- Alex

On Thu, Mar 16, 2017 at 3:49 PM, Ben Greear  wrote:

Hello,

We notice that when using two igb ports as a bridge, if we use pktgen to
drive traffic through the bridge and randomize (or use a very large range)
for the source IP addr in pktgen, then performance of igb is very poor (like
150Mbps
throughput instead of 1Gbps).  It runs right at line speed if we use same
src/dest
IP addr in pktgen.  So, seems it is related to lots of src/dest IP
addresses.

We see same problem when using pktgen to send to itself, and we see this in
several different kernels.  We specifically tested bridge mode in this stock
Fedora kernel:

 Linux lfo350-59cc 4.9.13-101.fc24.x86_64 #1 SMP Tue Mar 7 23:48:32 UTC 2017
x86_64 x86_64 x86_64 GNU/Linux

e1000e does not show this problem in our testing.

Any ideas what the issue might be and how to fix it?

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com





--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com

Re: [PATCH net-next] liquidio: remove/replace invalid code

2017-03-16 Thread David Miller

From: Felix Manlunas 
Date: Mon, 13 Mar 2017 12:07:32 -0700

> From: Rick Farrington 
> 
> Remove invalid call to dma_sync_single_for_cpu() because previous DMA
> allocation was coherent--not streaming.  Remove code that references fields
> in struct list_head; replace it with calls to list_empty() and
> list_first_entry().  Also, add comment to clarify complicated if statement.
> 
> Signed-off-by: Rick Farrington 
> Signed-off-by: Felix Manlunas 
> Signed-off-by: Derek Chickles 

Applied.

Re: [PATCH net] net/mlx4_core: Avoid delays during VF driver device shutdown

2017-03-16 Thread David Miller

From: Tariq Toukan 
Date: Mon, 13 Mar 2017 19:29:08 +0200

> From: Jack Morgenstein 
> 
> Some Hypervisors detach VFs from VMs by instantly causing an FLR event
> to be generated for a VF.
> 
> In the mlx4 case, this will cause that VF's comm channel to be disabled
> before the VM has an opportunity to invoke the VF device's "shutdown"
> method.
> 
> For such Hypervisors, there is a race condition between the VF's
> shutdown method and its internal-error detection/reset thread.
> 
> The internal-error detection/reset thread (which runs every 5 seconds) also
> detects a disabled comm channel. If the internal-error detection/reset
> flow wins the race, we still get delays (while that flow tries repeatedly
> to detect comm-channel recovery).
> 
> The cited commit fixed the command timeout problem when the
> internal-error detection/reset flow loses the race.
> 
> This commit avoids the unneeded delays when the internal-error
> detection/reset flow wins.
> 
> Fixes: d585df1c5ccf ("net/mlx4_core: Avoid command timeouts during VF driver 
> device shutdown")
> Signed-off-by: Jack Morgenstein 
> Reported-by: Simon Xiao 
> Signed-off-by: Tariq Toukan 

Applied.

Re: [PATCH net-next] netem: apply correct delay when rate throttling

2017-03-16 Thread David Miller

From: Stephen Hemminger 
Date: Mon, 13 Mar 2017 10:16:58 -0700

> From: Nik Unger 
> 
> I recently reported on the netem list that iperf network benchmarks
> show unexpected results when a bandwidth throttling rate has been
> configured for netem. Specifically:
> 
> 1) The measured link bandwidth *increases* when a higher delay is added
> 2) The measured link bandwidth appears higher than the specified limit
> 3) The measured link bandwidth for the same very slow settings varies 
> significantly across
>   machines
> 
> The issue can be reproduced by using tc to configure netem with a
> 512kbit rate and various (none, 1us, 50ms, 100ms, 200ms) delays on a
> veth pair between network namespaces, and then using iperf (or any
> other network benchmarking tool) to test throughput. Complete detailed
> instructions are in the original email chain here:
> https://lists.linuxfoundation.org/pipermail/netem/2017-February/001672.html
> 
> There appear to be two underlying bugs causing these effects:
> 
> - The first issue causes long delays when the rate is slow and no
>   delay is configured (e.g., "rate 512kbit"). This is because SKBs are
>   not orphaned when no delay is configured, so orphaning does not
>   occur until *after* the rate-induced delay has been applied. For
>   this reason, adding a tiny delay (e.g., "rate 512kbit delay 1us")
>   dramatically increases the measured bandwidth.
> 
> - The second issue is that rate-induced delays are not correctly
>   applied, allowing SKB delays to occur in parallel. The indended
>   approach is to compute the delay for an SKB and to add this delay to
>   the end of the current queue. However, the code does not detect
>   existing SKBs in the queue due to improperly testing sch->q.qlen,
>   which is nonzero even when packets exist only in the
>   rbtree. Consequently, new SKBs do not wait for the current queue to
>   empty. When packet delays vary significantly (e.g., if packet sizes
>   are different), then this also causes unintended reordering.
> 
> I modified the code to expect a delay (and orphan the SKB) when a rate
> is configured. I also added some defensive tests that correctly find
> the latest scheduled delivery time, even if it is (unexpectedly) for a
> packet in sch->q. I have tested these changes on the latest kernel
> (4.11.0-rc1+) and the iperf / ping test results are as expected.
> 
> Signed-off-by: Nik Unger 
> Signed-off-by: Stephen Hemminger 

Applied.

Re: [RESEND PATCH -net] cpsw/netcp: work around reverse cpts dependency

2017-03-16 Thread David Miller

From: Arnd Bergmann 
Date: Mon, 13 Mar 2017 17:59:04 +0100

> The dependency is reversed: cpsw and netcp call into cpts,
> but cpts depends on the other two in Kconfig. This can lead
> to cpts being a loadable module and its callers built-in:
> 
> drivers/net/ethernet/ti/cpsw.o: In function `cpsw_remove':
> cpsw.c:(.text.cpsw_remove+0xd0): undefined reference to `cpts_release'
> drivers/net/ethernet/ti/cpsw.o: In function `cpsw_rx_handler':
> cpsw.c:(.text.cpsw_rx_handler+0x2dc): undefined reference to 
> `cpts_rx_timestamp'
> drivers/net/ethernet/ti/cpsw.o: In function `cpsw_tx_handler':
> cpsw.c:(.text.cpsw_tx_handler+0x7c): undefined reference to 
> `cpts_tx_timestamp'
> drivers/net/ethernet/ti/cpsw.o: In function `cpsw_ndo_stop':
> 
> As a workaround, I'm introducing another Kconfig symbol to
> control the compilation of cpts, while making the actual
> module controlled by a silent symbol that is =y when necessary.
> 
> Fixes: 6246168b4a38 ("net: ethernet: ti: netcp: add support of cpts")
> Signed-off-by: Arnd Bergmann 
> Reviewed-by: Grygorii Strashko 
> ---
> Originally submitted on Dec 16, still needed for v4.10 and v4.11-rc2

I'm fine with this change, but please keep the user visible Kconfig
symbol name the same, use the new name for the internal one.

I know that this means you have to make more changes elsewhere in
order to accomplish this.

Re: [net-next PATCH 1/5] net: Do not record sender_cpu as napi_id in socket receive paths

2017-03-16 Thread Eric Dumazet

On Thu, 2017-03-16 at 19:40 -0700, Alexander Duyck wrote:

> What I probably can do is go through and replace all the spots where
> we where checking for sk_napi_id being 0, and instead replace it with
> a check against NR_CPUS.

This seems a good idea.

Re: Performance issue with igb with lots of different src-ip addrs.

2017-03-16 Thread Alexander Duyck

I'm not really interested in installing a custom version of pktgen.
Any chance you can recreate the issue with standard pktgen?

You might try running perf to get a snapshot of what is using CPU time
on the system.  It will probably give you a pretty good idea where the
code is that is eating up all your CPU time.

- Alex

On Thu, Mar 16, 2017 at 7:46 PM, Ben Greear  wrote:
> I'm actually using a hacked up version of pktgen nicely driven by our
> GUI tool, but the crux is that you need to set min and max src IP to some
> large
> range.
>
> We are driving pktgen from a separate machine.  Stock pktgen isn't good at
> reporting
> received pkts last I checked, so it may be more difficult to easily view the
> problem.
>
> I'll be happy to set up my tool on your Fedora 24 or similar VM or machine
> if you
> want.
>
> Thanks,
> Ben
>
>
> On 03/16/2017 07:35 PM, Alexander Duyck wrote:
>>
>> Can you include the pktgen script you are running?
>>
>> Also when you say you are driving traffic through the bridge are you
>> sending from something external on the system or are you actually
>> directing the traffic from pktgen into the bridge directly?
>>
>> - Alex
>>
>> On Thu, Mar 16, 2017 at 3:49 PM, Ben Greear 
>> wrote:
>>>
>>> Hello,
>>>
>>> We notice that when using two igb ports as a bridge, if we use pktgen to
>>> drive traffic through the bridge and randomize (or use a very large
>>> range)
>>> for the source IP addr in pktgen, then performance of igb is very poor
>>> (like
>>> 150Mbps
>>> throughput instead of 1Gbps).  It runs right at line speed if we use same
>>> src/dest
>>> IP addr in pktgen.  So, seems it is related to lots of src/dest IP
>>> addresses.
>>>
>>> We see same problem when using pktgen to send to itself, and we see this
>>> in
>>> several different kernels.  We specifically tested bridge mode in this
>>> stock
>>> Fedora kernel:
>>>
>>>  Linux lfo350-59cc 4.9.13-101.fc24.x86_64 #1 SMP Tue Mar 7 23:48:32 UTC
>>> 2017
>>> x86_64 x86_64 x86_64 GNU/Linux
>>>
>>> e1000e does not show this problem in our testing.
>>>
>>> Any ideas what the issue might be and how to fix it?
>>>
>>> Thanks,
>>> Ben
>>>
>>> --
>>> Ben Greear 
>>> Candela Technologies Inc  http://www.candelatech.com
>>>
>>
>
> --
> Ben Greear 
> Candela Technologies Inc  http://www.candelatech.com

Re: [net-next PATCH 1/5] net: Do not record sender_cpu as napi_id in socket receive paths

2017-03-16 Thread Alexander Duyck

On Thu, Mar 16, 2017 at 7:57 PM, Eric Dumazet  wrote:
> On Thu, 2017-03-16 at 19:40 -0700, Alexander Duyck wrote:
>
>> What I probably can do is go through and replace all the spots where
>> we where checking for sk_napi_id being 0, and instead replace it with
>> a check against NR_CPUS.
>
> This seems a good idea.

Right.  This is the path I am planning to go with.  It will require
about the same amount of code change but replaces those checks.  I
just have to make sure I catch all the spots where we were checking it
for 0 but that shouldn't be too difficult of an issue.

Re: [net-next PATCH 1/5] net: Do not record sender_cpu as napi_id in socket receive paths

2017-03-16 Thread Eric Dumazet

On Thu, 2017-03-16 at 19:40 -0700, Alexander Duyck wrote:

> I don't know.  My concern here is about the cost of going through all
> that code just for something that we know shouldn't be valid.  If
> nothing else I might update sk_can_busy_loop so that it doesn't try
> busy looping on a sk_napi_id that is NR_CPU or less.

But why would that be a win ?

if napi_by_id() returns NULL, we immediately give up, (goto out;)

So why should we add a code that will add something that will not be
useful for the vast majority of the cases where the ID is valid and we
need to do the hash look up ?

Is libc trying to avoid doing syscalls like close(-1) ?

Re: [net-next PATCH 1/5] net: Do not record sender_cpu as napi_id in socket receive paths

2017-03-16 Thread Alexander Duyck

On Thu, Mar 16, 2017 at 3:50 PM, Eric Dumazet  wrote:
> On Thu, 2017-03-16 at 15:33 -0700, Alexander Duyck wrote:
>> On Thu, Mar 16, 2017 at 3:05 PM, Eric Dumazet  wrote:
>
>> > It is not clear why this patch is needed .
>> >
>> > What you describe here is the case we might receive packets for a socket
>> > coming from different interfaces ?
>> >
>> > If skb->napi_id is a sender_cpu, why should we prevent overwriting the
>> > sk_napi_id with it, knowing that busy polling will simply ignore the
>> > invalid value ?
>> >
>> > Do not get me wrong :
>> >
>> > I simply try to understand why the test about napi_id validity is now
>> > done twice :
>> >
>> > 1) At the time we are writing into sk->sk_napi_id
>>
>> I would argue that this is the one piece we were missing.
>>
>> > 2) At busy polling time when we read sk->sk_napi_id
>>
>> Unless there was something recently added I don't think this was ever
>> checked.  Instead we start digging into the hash looking for the ID
>> that won't ever be there.  Maybe we should add something to napi_by_id
>> that just returns NULL in these cases.
>
> But this is exactly what should happen.
>
> For invalid ID, we return NULL from napi_by_id()
>
> No need to add code for that, since the function is meant to deal with
> valid cases.

I don't know.  My concern here is about the cost of going through all
that code just for something that we know shouldn't be valid.  If
nothing else I might update sk_can_busy_loop so that it doesn't try
busy looping on a sk_napi_id that is NR_CPU or less.

>> On top of that I think there end up being several spots where once we
>> lock in a non-NAPI ID it is stuck such as the sk_mark_napi_id_once
>> call.  I figure we are better off locking in an actual NAPI ID rather
>> than getting a sender_cpu stuck in there.
>
> Are you referring to sk_mark_napi_id_once() ?
>
> Since this is only used by UDP, I would be OK to avoid the 'locking' for
> 'sender_cpu'  ids.

What I probably can do is go through and replace all the spots where
we where checking for sk_napi_id being 0, and instead replace it with
a check against NR_CPUS.

Re: Performance issue with igb with lots of different src-ip addrs.

2017-03-16 Thread Alexander Duyck

Can you include the pktgen script you are running?

Also when you say you are driving traffic through the bridge are you
sending from something external on the system or are you actually
directing the traffic from pktgen into the bridge directly?

- Alex

On Thu, Mar 16, 2017 at 3:49 PM, Ben Greear  wrote:
> Hello,
>
> We notice that when using two igb ports as a bridge, if we use pktgen to
> drive traffic through the bridge and randomize (or use a very large range)
> for the source IP addr in pktgen, then performance of igb is very poor (like
> 150Mbps
> throughput instead of 1Gbps).  It runs right at line speed if we use same
> src/dest
> IP addr in pktgen.  So, seems it is related to lots of src/dest IP
> addresses.
>
> We see same problem when using pktgen to send to itself, and we see this in
> several different kernels.  We specifically tested bridge mode in this stock
> Fedora kernel:
>
>  Linux lfo350-59cc 4.9.13-101.fc24.x86_64 #1 SMP Tue Mar 7 23:48:32 UTC 2017
> x86_64 x86_64 x86_64 GNU/Linux
>
> e1000e does not show this problem in our testing.
>
> Any ideas what the issue might be and how to fix it?
>
> Thanks,
> Ben
>
> --
> Ben Greear 
> Candela Technologies Inc  http://www.candelatech.com
>

[PATCH net-stable] net: solve a NAPI race

2017-03-16 Thread Eric Dumazet

commit 39e6c8208d7b6fb9d2047850fb3327db567b564b upstream.

While playing with mlx4 hardware timestamping of RX packets, I found
that some packets were received by TCP stack with a ~200 ms delay...

Since the timestamp was provided by the NIC, and my probe was added
in tcp_v4_rcv() while in BH handler, I was confident it was not
a sender issue, or a drop in the network.

This would happen with a very low probability, but hurting RPC
workloads.

A NAPI driver normally arms the IRQ after the napi_complete_done(),
after NAPI_STATE_SCHED is cleared, so that the hard irq handler can grab
it.

Problem is that if another point in the stack grabs NAPI_STATE_SCHED bit
while IRQ are not disabled, we might have later an IRQ firing and
finding this bit set, right before napi_complete_done() clears it.

This can happen with busy polling users, or if gro_flush_timeout is
used. But some other uses of napi_schedule() in drivers can cause this
as well.

thread 1 thread 2 (could be on same cpu, or not)

// busy polling or napi_watchdog()
napi_schedule();
...
napi->poll()

device polling:
read 2 packets from ring buffer
  Additional 3rd packet is
available.
  device hard irq

  // does nothing because
NAPI_STATE_SCHED bit is owned by thread 1
  napi_schedule();

napi_complete_done(napi, 2);
rearm_irq();

Note that rearm_irq() will not force the device to send an additional
IRQ for the packet it already signaled (3rd packet in my example)

This patch adds a new NAPI_STATE_MISSED bit, that napi_schedule_prep()
can set if it could not grab NAPI_STATE_SCHED

Then napi_complete_done() properly reschedules the napi to make sure
we do not miss something.

Since we manipulate multiple bits at once, use cmpxchg() like in
sk_busy_loop() to provide proper transactions.

In v2, I changed napi_watchdog() to use a relaxed variant of
napi_schedule_prep() : No need to set NAPI_STATE_MISSED from this point.

In v3, I added more details in the changelog and clears
NAPI_STATE_MISSED in busy_poll_stop()

In v4, I added the ideas given by Alexander Duyck in v3 review

Signed-off-by: Eric Dumazet 
Cc: Alexander Duyck 
Signed-off-by: David S. Miller 
---
David, here is the backport for 4.10.

 include/linux/netdevice.h | 29 ++---
 net/core/dev.c| 81 +++
 2 files changed, 83 insertions(+), 27 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 27914672602d..bdef8b7d4305 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -330,6 +330,7 @@ struct napi_struct {
 
 enum {
NAPI_STATE_SCHED,   /* Poll is scheduled */
+   NAPI_STATE_MISSED,  /* reschedule a napi */
NAPI_STATE_DISABLE, /* Disable pending */
NAPI_STATE_NPSVC,   /* Netpoll - don't dequeue from poll_list */
NAPI_STATE_HASHED,  /* In NAPI hash (busy polling possible) */
@@ -338,12 +339,13 @@ enum {
 };
 
 enum {
-   NAPIF_STATE_SCHED= (1UL << NAPI_STATE_SCHED),
-   NAPIF_STATE_DISABLE  = (1UL << NAPI_STATE_DISABLE),
-   NAPIF_STATE_NPSVC= (1UL << NAPI_STATE_NPSVC),
-   NAPIF_STATE_HASHED   = (1UL << NAPI_STATE_HASHED),
-   NAPIF_STATE_NO_BUSY_POLL = (1UL << NAPI_STATE_NO_BUSY_POLL),
-   NAPIF_STATE_IN_BUSY_POLL = (1UL << NAPI_STATE_IN_BUSY_POLL),
+   NAPIF_STATE_SCHED= BIT(NAPI_STATE_SCHED),
+   NAPIF_STATE_MISSED   = BIT(NAPI_STATE_MISSED),
+   NAPIF_STATE_DISABLE  = BIT(NAPI_STATE_DISABLE),
+   NAPIF_STATE_NPSVC= BIT(NAPI_STATE_NPSVC),
+   NAPIF_STATE_HASHED   = BIT(NAPI_STATE_HASHED),
+   NAPIF_STATE_NO_BUSY_POLL = BIT(NAPI_STATE_NO_BUSY_POLL),
+   NAPIF_STATE_IN_BUSY_POLL = BIT(NAPI_STATE_IN_BUSY_POLL),
 };
 
 enum gro_result {
@@ -413,20 +415,7 @@ static inline bool napi_disable_pending(struct napi_struct 
*n)
return test_bit(NAPI_STATE_DISABLE, >state);
 }
 
-/**
- * napi_schedule_prep - check if NAPI can be scheduled
- * @n: NAPI context
- *
- * Test if NAPI routine is already running, and if not mark
- * it as running.  This is used as a condition variable to
- * insure only one NAPI poll instance runs.  We also make
- * sure there is no pending NAPI disable.
- */
-static inline bool napi_schedule_prep(struct napi_struct *n)
-{
-   return !napi_disable_pending(n) &&
-   !test_and_set_bit(NAPI_STATE_SCHED, >state);
-}
+bool napi_schedule_prep(struct napi_struct *n);
 
 /**
  * napi_schedule - schedule NAPI poll
diff --git a/net/core/dev.c b/net/core/dev.c
index 29101c98399f..465b902fd136 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4886,6 +4886,39 @@ void __napi_schedule(struct napi_struct *n)

Re: [RFC] [net]openvswitch: Clear the ct flow key for the recirculated packet

2017-03-16 Thread Joe Stringer

On 16 March 2017 at 05:25, Numan Siddique  wrote:
> It is possible that the ct flow key information would have
> gone stale for the packets received from the userspace due to
> clone or ct_clear actions.
>
> In the case of OVN, it adds ping responder flows, which modifies
> the original icmp4 request packet to a reply packet. It uses the
> OVS actions - clone and ct_clear. When the reply packet hits the
> "ovs_ct_execute" function, and since the ct flow key info is not
> cleared, the connection tracker doesn't set the state to
> ESTABLISHED state.
>
> Note: This patch is marked as RFC, as I am not sure if this is the correct
> place to address this issue or it should be addressed in ovs-vswitchd
> to set the OVS_KEY_ATTR_CT_STATE and other related attributes
> properly for ct_clear action.

Hmm. I see a couple of options.

At the moment the ct_clear at the higher layer is not backed by any
action in the datapath layer. Basically we just clear the fields in
the flow metadata then from the OpenFlow point of view there is no
such state, so we can match on it as though it hasn't been through the
connection tracker, and for instance, try to send it through the
connection tracker again. If we were to introduce an openvswitch
kernel API to actually clear the 'skb->nfct' and perhaps the related
ct flow key fields in the kernel, then subsequent ct lookups would not
assume that the skb is already cached, and it should run through the
connection tracker and establish the state.

Alternatively, without changing the OVS API, we could make the actions
a bit smarter. If you change fields related to CT lookup (ie, the
5tuple) via set actions, then the conntrack entry associated with the
skb is no longer valid for the current packet. Clear the skb->nfct
when manipulating those fields, then when the CT action gets executed,
it can treat this packet as completely new and run it through the
connection tracker to establish the state.

I tend to like the latter because it doesn't involve extending the
API, but I can't help but wonder if the fact we've now got a ct_clear
action at the layer above means that eventually we're going to need
the equivalent functionality in the datapath API. Ie, we could solve
this problem but maybe something more detailed and obscure comes up
later where we're not really satisfying the expectations of the
OpenFlow 'ct_clear' action.

mlx5e backports for v4.9 -stable

2017-03-16 Thread David Miller


Commits:


>From b0d4660b4cc52e6477ca3a43435351d565dfcedc Mon Sep 17 00:00:00 2001
From: Tariq Toukan 
Date: Wed, 22 Feb 2017 17:20:14 +0200
Subject: [PATCH] net/mlx5e: Fix broken CQE compression initialization


and


>From 6dc4b54e77282caf17f0ff72aa32dd296037fbc0 Mon Sep 17 00:00:00 2001
From: Saeed Mahameed 
Date: Wed, 22 Feb 2017 17:20:15 +0200
Subject: [PATCH] net/mlx5e: Update MPWQE stride size when modifying CQE
 compress state


do not apply even closely to v4.9 while I was working on -stable backports.

Please provide proper backports of these two patches if you want them to
show up in v4.9 -stable.

Thanks.

[PATCH net-next 1/3] netvsc: avoid race with callback

2017-03-16 Thread Stephen Hemminger

Change the argument to channel callback from the channel pointer
to the internal data structure containing per-channel info.
This avoids any possible races when callback happens during
initialization and makes IRQ code simpler.

Signed-off-by: Stephen Hemminger 
---
 drivers/net/hyperv/netvsc.c   | 18 +-
 drivers/net/hyperv/rndis_filter.c | 15 ++-
 2 files changed, 15 insertions(+), 18 deletions(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 0e71164849dd..0a2e9bd98d2c 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -1250,21 +1250,12 @@ int netvsc_poll(struct napi_struct *napi, int budget)
 
 void netvsc_channel_cb(void *context)
 {
-   struct vmbus_channel *channel = context;
-   struct hv_device *device = netvsc_channel_to_device(channel);
-   u16 q_idx = channel->offermsg.offer.sub_channel_index;
-   struct netvsc_device *net_device;
-   struct net_device *ndev;
-
-   ndev = hv_get_drvdata(device);
-   if (unlikely(!ndev))
-   return;
+   struct netvsc_channel *nvchan = context;
 
/* disable interupts from host */
-   hv_begin_read(>inbound);
+   hv_begin_read(>channel->inbound);
 
-   net_device = net_device_to_netvsc_device(ndev);
-   napi_schedule(_device->chan_table[q_idx].napi);
+   napi_schedule(>napi);
 }
 
 /*
@@ -1294,7 +1285,8 @@ int netvsc_device_add(struct hv_device *device,
/* Open the channel */
ret = vmbus_open(device->channel, ring_size * PAGE_SIZE,
 ring_size * PAGE_SIZE, NULL, 0,
-netvsc_channel_cb, device->channel);
+netvsc_channel_cb,
+net_device->chan_table);
 
if (ret != 0) {
netdev_err(ndev, "unable to open channel: %d\n", ret);
diff --git a/drivers/net/hyperv/rndis_filter.c 
b/drivers/net/hyperv/rndis_filter.c
index d7b6311e6c19..382b9a62e3c4 100644
--- a/drivers/net/hyperv/rndis_filter.c
+++ b/drivers/net/hyperv/rndis_filter.c
@@ -996,23 +996,28 @@ static void netvsc_sc_open(struct vmbus_channel *new_sc)
hv_get_drvdata(new_sc->primary_channel->device_obj);
struct netvsc_device *nvscdev = net_device_to_netvsc_device(ndev);
u16 chn_index = new_sc->offermsg.offer.sub_channel_index;
-   int ret;
+   struct netvsc_channel *nvchan;
unsigned long flags;
+   int ret;
 
if (chn_index >= nvscdev->num_chn)
return;
 
-   nvscdev->chan_table[chn_index].mrc.buf
+   nvchan = nvscdev->chan_table + chn_index;
+   nvchan->mrc.buf
= vzalloc(NETVSC_RECVSLOT_MAX * sizeof(struct recv_comp_data));
 
+   if (!nvchan->mrc.buf)
+   return;
+
ret = vmbus_open(new_sc, nvscdev->ring_size * PAGE_SIZE,
 nvscdev->ring_size * PAGE_SIZE, NULL, 0,
-netvsc_channel_cb, new_sc);
+netvsc_channel_cb, nvchan);
 
if (ret == 0)
-   nvscdev->chan_table[chn_index].channel = new_sc;
+   nvchan->channel = new_sc;
 
-   napi_enable(>chan_table[chn_index].napi);
+   napi_enable(>napi);
 
spin_lock_irqsave(>sc_lock, flags);
nvscdev->num_sc_offered--;
-- 
2.11.0

Re: netdev level filtering? perhaps pushing socket filters down?

2017-03-16 Thread Johannes Berg


> Attach at just above the driver, before it ever gets to
> stations/vdevs, and ignore radiotap headers and/or add special
> processing for metadata like rx-info?

That'd be a different feature, perhaps more like XDP.

There anyway is no radiotap header at that point, but not changing
whether or not it makes it to other virtual interfaces is actually not
what we're looking for here at all.

johannes

Re: NAPI complete race backport?

2017-03-16 Thread Eric Dumazet

On Thu, 2017-03-16 at 16:25 -0700, David Miller wrote:
> Eric, could you please give me a backport of commit:
> 
> 
> From 39e6c8208d7b6fb9d2047850fb3327db567b564b Mon Sep 17 00:00:00 2001
> From: Eric Dumazet 
> Date: Tue, 28 Feb 2017 10:34:50 -0800
> Subject: [PATCH] net: solve a NAPI race
> 
> 
> for 4.10 and earlier?
> 
> This is before __napi_complete() is removed, so the empty poll list
> handling in napi_complete_done() needs to be a bit different.
> 
> Thanks!

Sure, I will send a patch by tomorrow I think.

Thanks.

Re: [PATCH] [PATCH net] net: Do not hold the reference for the same sk_rx_dst.

2017-03-16 Thread Eric Dumazet

On Thu, 2017-03-16 at 16:13 -0700, Cong Wang wrote:
> On Thu, Mar 16, 2017 at 2:33 PM, Eric Dumazet  wrote:
> > Have you backported the redirect fix ?
> >
> > commit 45caeaa5ac0b4b11784ac6f932c0ad4c6b67cda0
> >
> > Or other fixes that went very recently (pick David Miller net tree)
> 
> Why the commit above is relevant here? It fixes a double-release,
> while Kevin's case is a double-hold... Not to mention it sk_rx_dst
> instead of sk_dst_cache, according to Kevin.

Yes you are right.

I was lazy, the other fix I wanted to mention was 
02b2faaf0af1d85585f

In any case I wanted to make sure the reported issue is not for some old
kernels.

NAPI complete race backport?

2017-03-16 Thread David Miller

Eric, could you please give me a backport of commit:

>From 39e6c8208d7b6fb9d2047850fb3327db567b564b Mon Sep 17 00:00:00 2001
From: Eric Dumazet 
Date: Tue, 28 Feb 2017 10:34:50 -0800
Subject: [PATCH] net: solve a NAPI race

for 4.10 and earlier?

This is before __napi_complete() is removed, so the empty poll list
handling in napi_complete_done() needs to be a bit different.

Thanks!

[PATCH net-next 3/3] netvsc: remove unused #define

2017-03-16 Thread Stephen Hemminger

Not used anywhere.

Signed-off-by: Stephen Hemminger 
---
 drivers/net/hyperv/hyperv_net.h | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index b09c4fca1805..6b5f75217694 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -1427,9 +1427,6 @@ struct rndis_message {
((void *) rndis_msg)
 
 
-#define __struct_bcount(x)
-
-
 
 #define RNDIS_HEADER_SIZE  (sizeof(struct rndis_message) - \
 sizeof(union rndis_message_container))
-- 
2.11.0

[PATCH net-next] liquidio: fix wrong information about link modes reported to ethtool

2017-03-16 Thread Felix Manlunas

From: Manish Awasthi 

Information reported to ethtool about link modes is wrong for 25G NIC.  Fix
it by checking for presence of 25G NIC, checking the link speed reported by
NIC firmware, and then assigning proper values to the
ethtool_link_ksettings struct.

Signed-off-by: Manish Awasthi 
Signed-off-by: Felix Manlunas 
---
 drivers/net/ethernet/cavium/liquidio/lio_ethtool.c | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/lio_ethtool.c 
b/drivers/net/ethernet/cavium/liquidio/lio_ethtool.c
index 50384ce..6eef3b9 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_ethtool.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_ethtool.c
@@ -213,17 +213,23 @@ static int lio_get_link_ksettings(struct net_device 
*netdev,
struct lio *lio = GET_LIO(netdev);
struct octeon_device *oct = lio->oct_dev;
struct oct_link_info *linfo;
-   u32 supported, advertising;
+   u32 supported = 0, advertising = 0;
 
linfo = >linfo;
 
if (linfo->link.s.if_mode == INTERFACE_MODE_XAUI ||
linfo->link.s.if_mode == INTERFACE_MODE_RXAUI ||
+   linfo->link.s.if_mode == INTERFACE_MODE_XLAUI ||
linfo->link.s.if_mode == INTERFACE_MODE_XFI) {
ecmd->base.port = PORT_FIBRE;
-   supported = (SUPPORTED_1baseT_Full | SUPPORTED_FIBRE |
-SUPPORTED_Pause);
-   advertising = (ADVERTISED_1baseT_Full | ADVERTISED_Pause);
+
+   if (linfo->link.s.speed == SPEED_1) {
+   supported = SUPPORTED_1baseT_Full;
+   advertising = ADVERTISED_1baseT_Full;
+   }
+
+   supported |= SUPPORTED_FIBRE | SUPPORTED_Pause;
+   advertising |= ADVERTISED_Pause;
ethtool_convert_legacy_u32_to_link_mode(
ecmd->link_modes.supported, supported);
ethtool_convert_legacy_u32_to_link_mode(

Re: [PATCH] [PATCH net] net: Do not hold the reference for the same sk_rx_dst.

2017-03-16 Thread Cong Wang

On Thu, Mar 16, 2017 at 2:33 PM, Eric Dumazet  wrote:
> Have you backported the redirect fix ?
>
> commit 45caeaa5ac0b4b11784ac6f932c0ad4c6b67cda0
>
> Or other fixes that went very recently (pick David Miller net tree)

Why the commit above is relevant here? It fixes a double-release,
while Kevin's case is a double-hold... Not to mention it sk_rx_dst
instead of sk_dst_cache, according to Kevin.

[PATCH net-next 0/3] netvsc: small changesfor net-next

2017-03-16 Thread Stephen Hemminger

One bugfix, and two non-code patches

Stephen Hemminger (3):
  netvsc: avoid race with callback
  netvsc: add comments about callback's and NAPI
  netvsc: remove unused #define

 drivers/net/hyperv/hyperv_net.h   |  3 ---
 drivers/net/hyperv/netvsc.c   | 31 +--
 drivers/net/hyperv/rndis_filter.c | 15 ++-
 3 files changed, 27 insertions(+), 22 deletions(-)

-- 
2.11.0

[PATCH net-next 2/3] netvsc: add comments about callback's and NAPI

2017-03-16 Thread Stephen Hemminger

Add some short description of how callback's and NAPI interoperate.

Signed-off-by: Stephen Hemminger 
---
 drivers/net/hyperv/netvsc.c | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 0a2e9bd98d2c..989b7cd99380 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -1209,6 +1209,10 @@ static struct hv_device *netvsc_channel_to_device(struct 
vmbus_channel *channel)
return primary ? primary->device_obj : channel->device_obj;
 }
 
+/* Network processing softirq
+ * Process data in incoming ring buffer from host
+ * Stops when ring is empty or budget is met or exceeded.
+ */
 int netvsc_poll(struct napi_struct *napi, int budget)
 {
struct netvsc_channel *nvchan
@@ -1238,7 +1242,11 @@ int netvsc_poll(struct napi_struct *napi, int budget)
}
hv_pkt_iter_close(channel);
 
-   /* If ring is empty and NAPI is not doing polling */
+   /* If budget was not exhausted and
+* not doing busy poll
+* then re-enable host interrupts
+*  and reschedule if ring is not empty.
+*/
if (work_done < budget &&
napi_complete_done(napi, work_done) &&
hv_end_read(>inbound) != 0)
@@ -1248,6 +1256,9 @@ int netvsc_poll(struct napi_struct *napi, int budget)
return work_done;
 }
 
+/* Call back when data is available in host ring buffer.
+ * Processing is deferred until network softirq (NAPI)
+ */
 void netvsc_channel_cb(void *context)
 {
struct netvsc_channel *nvchan = context;
-- 
2.11.0

Re: [net-next PATCH 3/5] net: Introduce SO_INCOMING_NAPI_ID

2017-03-16 Thread Eric Dumazet

On Thu, 2017-03-16 at 11:32 -0700, Alexander Duyck wrote:

>  
> + case SO_INCOMING_NAPI_ID:
> + v.val = sk->sk_napi_id;
> + break;

I guess that here you should filter invalid values.

(So that you no longer need the first patch in this series)

Also, it looks like eBPF will need to get access to skb->napi_id for
efficient SO_REUSEPORT support ?

Thanks.

[net-next sample action optimization v3 2/4] openvswitch: Refactor recirc key allocation.

2017-03-16 Thread Andy Zhou

The logic of allocating and copy key for each 'exec_actions_level'
was specific to execute_recirc(). However, future patches will reuse
as well.  Refactor the logic into its own function clone_key().

Signed-off-by: Andy Zhou 
Acked-by: Pravin B Shelar 
---
 net/openvswitch/actions.c | 66 ---
 1 file changed, 40 insertions(+), 26 deletions(-)

diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 75182e9..8c9c60c 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2007-2014 Nicira, Inc.
+ * Copyright (c) 2007-2017 Nicira, Inc.
  *
  * This program is free software; you can redistribute it and/or
  * modify it under the terms of version 2 of the GNU General Public
@@ -83,14 +83,31 @@ struct action_fifo {
struct deferred_action fifo[DEFERRED_ACTION_FIFO_SIZE];
 };
 
-struct recirc_keys {
+struct action_flow_keys {
struct sw_flow_key key[OVS_DEFERRED_ACTION_THRESHOLD];
 };
 
 static struct action_fifo __percpu *action_fifos;
-static struct recirc_keys __percpu *recirc_keys;
+static struct action_flow_keys __percpu *flow_keys;
 static DEFINE_PER_CPU(int, exec_actions_level);
 
+/* Make a clone of the 'key', using the pre-allocated percpu 'flow_keys'
+ * space. Return NULL if out of key spaces.
+ */
+static struct sw_flow_key *clone_key(const struct sw_flow_key *key_)
+{
+   struct action_flow_keys *keys = this_cpu_ptr(flow_keys);
+   int level = this_cpu_read(exec_actions_level);
+   struct sw_flow_key *key = NULL;
+
+   if (level <= OVS_DEFERRED_ACTION_THRESHOLD) {
+   key = >key[level - 1];
+   *key = *key_;
+   }
+
+   return key;
+}
+
 static void action_fifo_init(struct action_fifo *fifo)
 {
fifo->head = 0;
@@ -1090,8 +1107,8 @@ static int execute_recirc(struct datapath *dp, struct 
sk_buff *skb,
  struct sw_flow_key *key,
  const struct nlattr *a, int rem)
 {
+   struct sw_flow_key *recirc_key;
struct deferred_action *da;
-   int level;
 
if (!is_flow_key_valid(key)) {
int err;
@@ -1115,29 +1132,26 @@ static int execute_recirc(struct datapath *dp, struct 
sk_buff *skb,
return 0;
}
 
-   level = this_cpu_read(exec_actions_level);
-   if (level <= OVS_DEFERRED_ACTION_THRESHOLD) {
-   struct recirc_keys *rks = this_cpu_ptr(recirc_keys);
-   struct sw_flow_key *recirc_key = >key[level - 1];
-
-   *recirc_key = *key;
+   /* If within the limit of 'OVS_DEFERRED_ACTION_THRESHOLD',
+* recirc immediately, otherwise, defer it for later execution.
+*/
+   recirc_key = clone_key(key);
+   if (recirc_key) {
recirc_key->recirc_id = nla_get_u32(a);
ovs_dp_process_packet(skb, recirc_key);
-
-   return 0;
-   }
-
-   da = add_deferred_actions(skb, key, NULL, 0);
-   if (da) {
-   da->pkt_key.recirc_id = nla_get_u32(a);
} else {
-   kfree_skb(skb);
-
-   if (net_ratelimit())
-   pr_warn("%s: deferred action limit reached, drop recirc 
action\n",
-   ovs_dp_name(dp));
+   da = add_deferred_actions(skb, key, NULL, 0);
+   if (da) {
+   recirc_key = >pkt_key;
+   recirc_key->recirc_id = nla_get_u32(a);
+   } else {
+   /* Log an error in case action fifo is full.  */
+   kfree_skb(skb);
+   if (net_ratelimit())
+   pr_warn("%s: deferred action limit reached, 
drop recirc action\n",
+   ovs_dp_name(dp));
+   }
}
-
return 0;
 }
 
@@ -1327,8 +1341,8 @@ int action_fifos_init(void)
if (!action_fifos)
return -ENOMEM;
 
-   recirc_keys = alloc_percpu(struct recirc_keys);
-   if (!recirc_keys) {
+   flow_keys = alloc_percpu(struct action_flow_keys);
+   if (!flow_keys) {
free_percpu(action_fifos);
return -ENOMEM;
}
@@ -1339,5 +1353,5 @@ int action_fifos_init(void)
 void action_fifos_exit(void)
 {
free_percpu(action_fifos);
-   free_percpu(recirc_keys);
+   free_percpu(flow_keys);
 }
-- 
1.8.3.1

[net-next sample action optimization v3 1/4] openvswitch: Deferred fifo API change.

2017-03-16 Thread Andy Zhou

add_deferred_actions() API currently requires actions to be passed in
as a fully encoded netlink message. So far both 'sample' and 'recirc'
actions happens to carry actions as fully encoded netlink messages.
However, this requirement is more restrictive than necessary, future
patch will need to pass in action lists that are not fully encoded
by themselves.

Signed-off-by: Andy Zhou 
Acked-by: Joe Stringer 
---
 net/openvswitch/actions.c | 18 +++---
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index c82301c..75182e9 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -51,6 +51,7 @@ static int do_execute_actions(struct datapath *dp, struct 
sk_buff *skb,
 struct deferred_action {
struct sk_buff *skb;
const struct nlattr *actions;
+   int actions_len;
 
/* Store pkt_key clone when creating deferred action. */
struct sw_flow_key pkt_key;
@@ -119,8 +120,9 @@ static struct deferred_action *action_fifo_put(struct 
action_fifo *fifo)
 
 /* Return true if fifo is not full */
 static struct deferred_action *add_deferred_actions(struct sk_buff *skb,
-   const struct sw_flow_key 
*key,
-   const struct nlattr *attr)
+   const struct sw_flow_key *key,
+   const struct nlattr *actions,
+   const int actions_len)
 {
struct action_fifo *fifo;
struct deferred_action *da;
@@ -129,7 +131,8 @@ static struct deferred_action *add_deferred_actions(struct 
sk_buff *skb,
da = action_fifo_put(fifo);
if (da) {
da->skb = skb;
-   da->actions = attr;
+   da->actions = actions;
+   da->actions_len = actions_len;
da->pkt_key = *key;
}
 
@@ -966,7 +969,8 @@ static int sample(struct datapath *dp, struct sk_buff *skb,
/* Skip the sample action when out of memory. */
return 0;
 
-   if (!add_deferred_actions(skb, key, a)) {
+   if (!add_deferred_actions(skb, key, nla_data(acts_list),
+ nla_len(acts_list))) {
if (net_ratelimit())
pr_warn("%s: deferred actions limit reached, dropping 
sample action\n",
ovs_dp_name(dp));
@@ -1123,7 +1127,7 @@ static int execute_recirc(struct datapath *dp, struct 
sk_buff *skb,
return 0;
}
 
-   da = add_deferred_actions(skb, key, NULL);
+   da = add_deferred_actions(skb, key, NULL, 0);
if (da) {
da->pkt_key.recirc_id = nla_get_u32(a);
} else {
@@ -1278,10 +1282,10 @@ static void process_deferred_actions(struct datapath 
*dp)
struct sk_buff *skb = da->skb;
struct sw_flow_key *key = >pkt_key;
const struct nlattr *actions = da->actions;
+   int actions_len = da->actions_len;
 
if (actions)
-   do_execute_actions(dp, skb, key, actions,
-  nla_len(actions));
+   do_execute_actions(dp, skb, key, actions, actions_len);
else
ovs_dp_process_packet(skb, key);
} while (!action_fifo_is_empty(fifo));
-- 
1.8.3.1

Re: [net-next PATCH 1/5] net: Do not record sender_cpu as napi_id in socket receive paths

2017-03-16 Thread Eric Dumazet

On Thu, 2017-03-16 at 15:33 -0700, Alexander Duyck wrote:
> On Thu, Mar 16, 2017 at 3:05 PM, Eric Dumazet  wrote:

> > It is not clear why this patch is needed .
> >
> > What you describe here is the case we might receive packets for a socket
> > coming from different interfaces ?
> >
> > If skb->napi_id is a sender_cpu, why should we prevent overwriting the
> > sk_napi_id with it, knowing that busy polling will simply ignore the
> > invalid value ?
> >
> > Do not get me wrong :
> >
> > I simply try to understand why the test about napi_id validity is now
> > done twice :
> >
> > 1) At the time we are writing into sk->sk_napi_id
> 
> I would argue that this is the one piece we were missing.
> 
> > 2) At busy polling time when we read sk->sk_napi_id
> 
> Unless there was something recently added I don't think this was ever
> checked.  Instead we start digging into the hash looking for the ID
> that won't ever be there.  Maybe we should add something to napi_by_id
> that just returns NULL in these cases.

But this is exactly what should happen.

For invalid ID, we return NULL from napi_by_id()

No need to add code for that, since the function is meant to deal with
valid cases.



> On top of that I think there end up being several spots where once we
> lock in a non-NAPI ID it is stuck such as the sk_mark_napi_id_once
> call.  I figure we are better off locking in an actual NAPI ID rather
> than getting a sender_cpu stuck in there.

Are you referring to sk_mark_napi_id_once() ?

Since this is only used by UDP, I would be OK to avoid the 'locking' for
'sender_cpu'  ids.

[net-next sample action optimization v3 0/4]

2017-03-16 Thread Andy Zhou

The sample action can be used for translating Openflow 'clone' action.
However its implementation has not been sufficiently optimized for this
use case. This series attempts to close the gap.

Patch 3 commit message has more details on the specific optimizations
implemented.

---
v1->v2: Address pravin's comment, Refactor recirc and sample
to share more common code

v2->v3: Enhace patch 4, add more loigc to the common code

Andy Zhou (4):
  openvswitch: Deferred fifo API change.
  openvswitch: Refactor recirc key allocation.
  openvswitch: Optimize sample action for the clone use cases
  Openvswitch: Refactor sample and recirc actions implementation

 include/uapi/linux/openvswitch.h |  15 +++
 net/openvswitch/actions.c| 270 ++-
 net/openvswitch/datapath.h   |   2 -
 net/openvswitch/flow_netlink.c   | 141 +---
 4 files changed, 262 insertions(+), 166 deletions(-)

-- 
1.8.3.1

[net-next sample action optimization v3 3/4] openvswitch: Optimize sample action for the clone use cases

2017-03-16 Thread Andy Zhou

With the introduction of open flow 'clone' action, the OVS user space
can now translate the 'clone' action into kernel datapath 'sample'
action, with 100% probability, to ensure that the clone semantics,
which is that the packet seen by the clone action is the same as the
packet seen by the action after clone, is faithfully carried out
in the datapath.

While the sample action in the datpath has the matching semantics,
its implementation is only optimized for its original use.
Specifically, there are two limitation: First, there is a 3 level of
nesting restriction, enforced at the flow downloading time. This
limit turns out to be too restrictive for the 'clone' use case.
Second, the implementation avoid recursive call only if the sample
action list has a single userspace action.

The main optimization implemented in this series removes the static
nesting limit check, instead, implement the run time recursion limit
check, and recursion avoidance similar to that of the 'recirc' action.
This optimization solve both #1 and #2 issues above.

One related optimization attempts to avoid copying flow key as
long as the actions enclosed does not change the flow key. The
detection is performed only once at the flow downloading time.

Another related optimization is to rewrite the action list
at flow downloading time in order to save the fast path from parsing
the sample action list in its original form repeatedly.

Signed-off-by: Andy Zhou 
---
 include/uapi/linux/openvswitch.h |  15 +
 net/openvswitch/actions.c| 107 ++---
 net/openvswitch/datapath.h   |   2 -
 net/openvswitch/flow_netlink.c   | 141 +++
 4 files changed, 167 insertions(+), 98 deletions(-)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 7f41f7d..66d1c3c 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -578,10 +578,25 @@ enum ovs_sample_attr {
OVS_SAMPLE_ATTR_PROBABILITY, /* u32 number */
OVS_SAMPLE_ATTR_ACTIONS, /* Nested OVS_ACTION_ATTR_* attributes. */
__OVS_SAMPLE_ATTR_MAX,
+
+#ifdef __KERNEL__
+   OVS_SAMPLE_ATTR_ARG  /* struct sample_arg  */
+#endif
 };
 
 #define OVS_SAMPLE_ATTR_MAX (__OVS_SAMPLE_ATTR_MAX - 1)
 
+#ifdef __KERNEL__
+struct sample_arg {
+   bool exec;   /* When true, actions in sample will not
+ * change flow keys. False otherwise.
+ */
+   u32  probability;/* Same value as
+ * 'OVS_SAMPLE_ATTR_PROBABILITY'.
+ */
+};
+#endif
+
 /**
  * enum ovs_userspace_attr - Attributes for %OVS_ACTION_ATTR_USERSPACE action.
  * @OVS_USERSPACE_ATTR_PID: u32 Netlink PID to which the %OVS_PACKET_CMD_ACTION
diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 8c9c60c..3529f7b 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -928,73 +928,70 @@ static int output_userspace(struct datapath *dp, struct 
sk_buff *skb,
return ovs_dp_upcall(dp, skb, key, , cutlen);
 }
 
+/* When 'last' is true, sample() should always consume the 'skb'.
+ * Otherwise, sample() should keep 'skb' intact regardless what
+ * actions are executed within sample().
+ */
 static int sample(struct datapath *dp, struct sk_buff *skb,
  struct sw_flow_key *key, const struct nlattr *attr,
- const struct nlattr *actions, int actions_len)
+ bool last)
 {
-   const struct nlattr *acts_list = NULL;
-   const struct nlattr *a;
-   int rem;
-   u32 cutlen = 0;
+   struct nlattr *actions;
+   struct nlattr *sample_arg;
+   struct sw_flow_key *orig_key = key;
+   int rem = nla_len(attr);
+   int err = 0;
+   const struct sample_arg *arg;
 
-   for (a = nla_data(attr), rem = nla_len(attr); rem > 0;
-a = nla_next(a, )) {
-   u32 probability;
+   /* The first action is always 'OVS_SAMPLE_ATTR_ARG'. */
+   sample_arg = nla_data(attr);
+   arg = nla_data(sample_arg);
+   actions = nla_next(sample_arg, );
 
-   switch (nla_type(a)) {
-   case OVS_SAMPLE_ATTR_PROBABILITY:
-   probability = nla_get_u32(a);
-   if (!probability || prandom_u32() > probability)
-   return 0;
-   break;
-
-   case OVS_SAMPLE_ATTR_ACTIONS:
-   acts_list = a;
-   break;
-   }
+   if ((arg->probability != U32_MAX) &&
+   (!arg->probability || prandom_u32() > arg->probability)) {
+   if (last)
+   consume_skb(skb);
+   return 0;
}
 
-   rem = nla_len(acts_list);
-   a = nla_data(acts_list);
-
-   /* Actions

[net-next sample action optimization v3 4/4] Openvswitch: Refactor sample and recirc actions implementation

2017-03-16 Thread Andy Zhou

Added clone_execute() that both the sample and the recirc
action implementation can use.

Signed-off-by: Andy Zhou 
---
 net/openvswitch/actions.c | 175 --
 1 file changed, 92 insertions(+), 83 deletions(-)

diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 3529f7b..e38fa7f 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -44,10 +44,6 @@
 #include "conntrack.h"
 #include "vport.h"
 
-static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
- struct sw_flow_key *key,
- const struct nlattr *attr, int len);
-
 struct deferred_action {
struct sk_buff *skb;
const struct nlattr *actions;
@@ -166,6 +162,12 @@ static bool is_flow_key_valid(const struct sw_flow_key 
*key)
return !(key->mac_proto & SW_FLOW_KEY_INVALID);
 }
 
+static int clone_execute(struct datapath *dp, struct sk_buff *skb,
+struct sw_flow_key *key,
+u32 recirc_id,
+const struct nlattr *actions, int len,
+bool last, bool clone_flow_key);
+
 static void update_ethertype(struct sk_buff *skb, struct ethhdr *hdr,
 __be16 ethertype)
 {
@@ -938,10 +940,9 @@ static int sample(struct datapath *dp, struct sk_buff *skb,
 {
struct nlattr *actions;
struct nlattr *sample_arg;
-   struct sw_flow_key *orig_key = key;
int rem = nla_len(attr);
-   int err = 0;
const struct sample_arg *arg;
+   bool clone_flow_key;
 
/* The first action is always 'OVS_SAMPLE_ATTR_ARG'. */
sample_arg = nla_data(attr);
@@ -955,43 +956,9 @@ static int sample(struct datapath *dp, struct sk_buff *skb,
return 0;
}
 
-   /* Unless the last action, sample works on the clone of SKB.  */
-   skb = last ? skb : skb_clone(skb, GFP_ATOMIC);
-   if (!skb) {
-   /* Out of memory, skip this sample action.
-*/
-   return 0;
-   }
-
-   /* In case the sample actions won't change 'key',
-* it can be used directly to execute sample actions.
-* Otherwise, allocate a new key from the
-* next recursion level of 'flow_keys'. If
-* successful, execute the sample actions without
-* deferring.
-*
-* Defer the sample actions if the recursion
-* limit has been reached.
-*/
-   if (!arg->exec) {
-   __this_cpu_inc(exec_actions_level);
-   key = clone_key(key);
-   }
-
-   if (key) {
-   err = do_execute_actions(dp, skb, key, actions, rem);
-   } else if (!add_deferred_actions(skb, orig_key, actions, rem)) {
-
-   if (net_ratelimit())
-   pr_warn("%s: deferred action limit reached, drop sample 
action\n",
-   ovs_dp_name(dp));
-   kfree_skb(skb);
-   }
-
-   if (!arg->exec)
-   __this_cpu_dec(exec_actions_level);
-
-   return err;
+   clone_flow_key = !arg->exec;
+   return clone_execute(dp, skb, key, 0, actions, rem, last,
+clone_flow_key);
 }
 
 static void execute_hash(struct sk_buff *skb, struct sw_flow_key *key,
@@ -1102,10 +1069,9 @@ static int execute_masked_set_action(struct sk_buff *skb,
 
 static int execute_recirc(struct datapath *dp, struct sk_buff *skb,
  struct sw_flow_key *key,
- const struct nlattr *a, int rem)
+ const struct nlattr *a, bool last)
 {
-   struct sw_flow_key *recirc_key;
-   struct deferred_action *da;
+   u32 recirc_id;
 
if (!is_flow_key_valid(key)) {
int err;
@@ -1116,40 +1082,8 @@ static int execute_recirc(struct datapath *dp, struct 
sk_buff *skb,
}
BUG_ON(!is_flow_key_valid(key));
 
-   if (!nla_is_last(a, rem)) {
-   /* Recirc action is the not the last action
-* of the action list, need to clone the skb.
-*/
-   skb = skb_clone(skb, GFP_ATOMIC);
-
-   /* Skip the recirc action when out of memory, but
-* continue on with the rest of the action list.
-*/
-   if (!skb)
-   return 0;
-   }
-
-   /* If within the limit of 'OVS_DEFERRED_ACTION_THRESHOLD',
-* recirc immediately, otherwise, defer it for later execution.
-*/
-   recirc_key = clone_key(key);
-   if (recirc_key) {
-   recirc_key->recirc_id = nla_get_u32(a);
-   ovs_dp_process_packet(skb, recirc_key);
-   } else {
-   da = add_deferred_actions(skb, key, NULL, 0);
-   if (da) {
-   recirc_key = >pkt_key;
-   recirc_key->recirc_id =

[PATCH net] tcp: tcp_get_info() should read tcp_time_stamp later

2017-03-16 Thread Eric Dumazet

From: Eric Dumazet 

Commit b369e7fd41f7 ("tcp: make TCP_INFO more consistent") moved
lock_sock_fast() earlier in tcp_get_info()

This has the minor effect that jiffies value being sampled at the
beginning of tcp_get_info() is more likely to be off by one, and we
report big tcpi_last_data_sent values (like 0x).

Since we lock the socket, fetching tcp_time_stamp right before
doing the jiffies_to_msecs() calls is enough to remove these
wrong values.

Signed-off-by: Eric Dumazet 
---
 net/ipv4/tcp.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 
cf481282c608f920254078264e36e18584c6..1e319a525d51b0b603a5ccc5143381c752b9f2c7
 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2770,7 +2770,7 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info)
 {
const struct tcp_sock *tp = tcp_sk(sk); /* iff sk_type == SOCK_STREAM */
const struct inet_connection_sock *icsk = inet_csk(sk);
-   u32 now = tcp_time_stamp, intv;
+   u32 now, intv;
u64 rate64;
bool slow;
u32 rate;
@@ -2839,6 +2839,7 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info)
info->tcpi_retrans = tp->retrans_out;
info->tcpi_fackets = tp->fackets_out;
 
+   now = tcp_time_stamp;
info->tcpi_last_data_sent = jiffies_to_msecs(now - tp->lsndtime);
info->tcpi_last_data_recv = jiffies_to_msecs(now - 
icsk->icsk_ack.lrcvtime);
info->tcpi_last_ack_recv = jiffies_to_msecs(now - tp->rcv_tstamp);

Re: [net-next PATCH 1/5] net: Do not record sender_cpu as napi_id in socket receive paths

2017-03-16 Thread Alexander Duyck

On Thu, Mar 16, 2017 at 3:05 PM, Eric Dumazet  wrote:
> On Thu, 2017-03-16 at 11:32 -0700, Alexander Duyck wrote:
>> From: Sridhar Samudrala 
>>
>> Fix sk_mark_napi_id() and sk_mark_napi_id_once() to set sk_napi_id only if
>> skb->napi_id is a valid value.
>>
>> This happens in loopback paths where skb->napi_id is not updated in
>> rx path and holds sender_cpu that is set in xmit path.
>>
>> Signed-off-by: Sridhar Samudrala 
>> Signed-off-by: Alexander Duyck 
>> ---
>>  include/net/busy_poll.h |5 +++--
>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
>> index c0452de83086..67991635953e 100644
>> --- a/include/net/busy_poll.h
>> +++ b/include/net/busy_poll.h
>> @@ -116,7 +116,8 @@ static inline bool sk_busy_loop(struct sock *sk, int 
>> nonblock)
>>  static inline void sk_mark_napi_id(struct sock *sk, const struct sk_buff 
>> *skb)
>>  {
>>  #ifdef CONFIG_NET_RX_BUSY_POLL
>> - sk->sk_napi_id = skb->napi_id;
>> + if (skb->napi_id > (u32)NR_CPUS)
>> + sk->sk_napi_id = skb->napi_id;
>>  #endif
>>  }
>>
>> @@ -125,7 +126,7 @@ static inline void sk_mark_napi_id_once(struct sock *sk,
>>   const struct sk_buff *skb)
>>  {
>>  #ifdef CONFIG_NET_RX_BUSY_POLL
>> - if (!sk->sk_napi_id)
>> + if (!sk->sk_napi_id && (skb->napi_id > (u32)NR_CPUS))
>>   sk->sk_napi_id = skb->napi_id;
>>  #endif
>>  }
>>
>
> It is not clear why this patch is needed .
>
> What you describe here is the case we might receive packets for a socket
> coming from different interfaces ?
>
> If skb->napi_id is a sender_cpu, why should we prevent overwriting the
> sk_napi_id with it, knowing that busy polling will simply ignore the
> invalid value ?
>
> Do not get me wrong :
>
> I simply try to understand why the test about napi_id validity is now
> done twice :
>
> 1) At the time we are writing into sk->sk_napi_id

I would argue that this is the one piece we were missing.

> 2) At busy polling time when we read sk->sk_napi_id

Unless there was something recently added I don't think this was ever
checked.  Instead we start digging into the hash looking for the ID
that won't ever be there.  Maybe we should add something to napi_by_id
that just returns NULL in these cases.

On top of that I think there end up being several spots where once we
lock in a non-NAPI ID it is stuck such as the sk_mark_napi_id_once
call.  I figure we are better off locking in an actual NAPI ID rather
than getting a sender_cpu stuck in there.

Re: netdev level filtering? perhaps pushing socket filters down?

2017-03-16 Thread Ben Greear


On 03/16/2017 03:33 PM, Johannes Berg wrote:

Hi all,

Occasionally - we just had another case - people want to hook into
packets received and processed by the mac80211 stack, but because they
don't need all of them (e.g. not data packets), even adding a monitor
interface and bringing it up has too high a cost because SKBs need to
be prepared to send them to the monitor interface, even if no socket is
consuming them.

Ideally, we'd be able to detect that there are filter programs attached
to the socket(s) that are looking at the frames coming in on the
monitor interface, and we could somehow magically run those before we
create a new SKB.
One problem here is that we wouldn't really want to prepare all the
radiotap header just to throw it away, so we'd have to be able to
analyse the filter program to make sure it doesn't access anything but
the radiotap header length, and that only in order to jump over it.
That seems ... difficult, but we don't even know the header length -
although we could fudge that and make a very long constant-size header,
which might make it possible to do such analysis, or handle it by
trapping on such access. But it seems rather difficult to implement
this.

The next best thing would be to install a filter program on the virtual
monitor *interface* (netdev), but say that it doesn't get frames with
radiotap, but pure 802.11 frames. We already have those in SKB format
at this point, so it'd be simple to run such a program and only pass
the SKB to the monitor netdev's RX when the program asked to do that.

This now seems a bit like XDP, but for XDP this header difference
doesn't seem appropriate either.

Anyone have any other thoughts?


Attach at just above the driver, before it ever gets to stations/vdevs,
and ignore radiotap headers and/or add special processing for metadata like
rx-info?

Thanks,
Ben



Thanks,
johannes




--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com

Re: [net-next PATCH 1/5] net: Do not record sender_cpu as napi_id in socket receive paths

2017-03-16 Thread Samudrala, Sridhar



On 3/16/2017 3:05 PM, Eric Dumazet wrote:

On Thu, 2017-03-16 at 11:32 -0700, Alexander Duyck wrote:

From: Sridhar Samudrala 

Fix sk_mark_napi_id() and sk_mark_napi_id_once() to set sk_napi_id only if
skb->napi_id is a valid value.

This happens in loopback paths where skb->napi_id is not updated in
rx path and holds sender_cpu that is set in xmit path.

Signed-off-by: Sridhar Samudrala 
Signed-off-by: Alexander Duyck 
---
  include/net/busy_poll.h |5 +++--
  1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index c0452de83086..67991635953e 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -116,7 +116,8 @@ static inline bool sk_busy_loop(struct sock *sk, int 
nonblock)
  static inline void sk_mark_napi_id(struct sock *sk, const struct sk_buff *skb)
  {
  #ifdef CONFIG_NET_RX_BUSY_POLL
-   sk->sk_napi_id = skb->napi_id;
+   if (skb->napi_id > (u32)NR_CPUS)
+   sk->sk_napi_id = skb->napi_id;
  #endif
  }
  
@@ -125,7 +126,7 @@ static inline void sk_mark_napi_id_once(struct sock *sk,

const struct sk_buff *skb)
  {
  #ifdef CONFIG_NET_RX_BUSY_POLL
-   if (!sk->sk_napi_id)
+   if (!sk->sk_napi_id && (skb->napi_id > (u32)NR_CPUS))
sk->sk_napi_id = skb->napi_id;
  #endif
  }


It is not clear why this patch is needed .

What you describe here is the case we might receive packets for a socket
coming from different interfaces ?


This is seen with AF_UNIX or AF_INET sockets over loopback.


If skb->napi_id is a sender_cpu, why should we prevent overwriting the
sk_napi_id with it, knowing that busy polling will simply ignore the
invalid value ?


We are not checking for invalid VALUE(< NR_CPUs) in busy_poll,
Non-zero sk->napi_id is considered valid.

If we don't want to add this check while setting sk->sk_napi_Id, we 
could change the
check in ep_set_busy_poll_napi_id() to check for invalid value rather 
than non-zero value.




Do not get me wrong :

I simply try to understand why the test about napi_id validity is now
done twice :

1) At the time we are writing into sk->sk_napi_id

2) At busy polling time when we read sk->sk_napi_id

Performance issue with igb with lots of different src-ip addrs.

2017-03-16 Thread Ben Greear


Hello,

We notice that when using two igb ports as a bridge, if we use pktgen to
drive traffic through the bridge and randomize (or use a very large range)
for the source IP addr in pktgen, then performance of igb is very poor (like 
150Mbps
throughput instead of 1Gbps).  It runs right at line speed if we use same 
src/dest
IP addr in pktgen.  So, seems it is related to lots of src/dest IP addresses.

We see same problem when using pktgen to send to itself, and we see this in
several different kernels.  We specifically tested bridge mode in this stock
Fedora kernel:

 Linux lfo350-59cc 4.9.13-101.fc24.x86_64 #1 SMP Tue Mar 7 23:48:32 UTC 2017 
x86_64 x86_64 x86_64 GNU/Linux

e1000e does not show this problem in our testing.

Any ideas what the issue might be and how to fix it?

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com

Re: [net-next PATCH 5/5] epoll: Add busy poll support to epoll with socket fds.

2017-03-16 Thread Alexander Duyck

On Thu, Mar 16, 2017 at 3:11 PM, Eric Dumazet  wrote:
> On Thu, 2017-03-16 at 11:33 -0700, Alexander Duyck wrote:
>> From: Sridhar Samudrala 
>
>> +/*
>> + * If busy polling is on and the file is a socket, return a pointer to
>> + * struct sock
>> + */
>> +static inline struct sock *ep_sk_from_file(struct file *file)
>> +{
>> + struct inode *inode = file_inode(file);
>> +
>> + if (!S_ISSOCK(inode->i_mode))
>> + return NULL;
>> +
>> + return ((struct socket *)file->private_data)->sk;
>> +}
>
> I believe a more standard way is to use sock_from_file()
>
> (This does not have to fetch a cache line to read i_mode ))

Thanks.  Will make sure to update to use that for v2.

- Alex

Re: [net-next PATCH 2/5] net: Call sk_mark_napi_id() in the ACK receive path

2017-03-16 Thread Alexander Duyck

On Thu, Mar 16, 2017 at 3:04 PM, Eric Dumazet  wrote:
> On Thu, 2017-03-16 at 11:32 -0700, Alexander Duyck wrote:
>> From: Sridhar Samudrala 
>>
>> Call sk_mark_napi_id() in the ACK receive path of a TCP_NEW_SYN_RECV
>> socket, so that sk->napi_id is set even if the socket hasn't yet received
>> any data.  With this change we should be able to start busy polling
>> slightly earlier.
>>
>> Signed-off-by: Sridhar Samudrala 
>> Signed-off-by: Alexander Duyck 
>> ---
>>  net/ipv4/tcp_ipv4.c |1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
>> index 08d870e45658..b86002a296f1 100644
>> --- a/net/ipv4/tcp_ipv4.c
>> +++ b/net/ipv4/tcp_ipv4.c
>> @@ -1687,6 +1687,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
>>   tcp_v4_send_reset(nsk, skb);
>>   goto discard_and_relse;
>>   } else {
>> + sk_mark_napi_id(nsk, skb);
>>   sock_put(sk);
>>   return 0;
>>   }
>>
>
> Seems good, but what about IPv6 ?

Sorry, I spaced out and overlooked that this would also be an issue for IPv6.

> Frankly this calls for the sk_mark_napi_id() being done in
> tcp_child_process() instead of its four callers.

We can look into that for the next version.

netdev level filtering? perhaps pushing socket filters down?

2017-03-16 Thread Johannes Berg

Hi all,

Occasionally - we just had another case - people want to hook into
packets received and processed by the mac80211 stack, but because they
don't need all of them (e.g. not data packets), even adding a monitor
interface and bringing it up has too high a cost because SKBs need to
be prepared to send them to the monitor interface, even if no socket is
consuming them.

Ideally, we'd be able to detect that there are filter programs attached
to the socket(s) that are looking at the frames coming in on the
monitor interface, and we could somehow magically run those before we
create a new SKB.
One problem here is that we wouldn't really want to prepare all the
radiotap header just to throw it away, so we'd have to be able to
analyse the filter program to make sure it doesn't access anything but
the radiotap header length, and that only in order to jump over it.
That seems ... difficult, but we don't even know the header length -
although we could fudge that and make a very long constant-size header,
which might make it possible to do such analysis, or handle it by
trapping on such access. But it seems rather difficult to implement
this.

The next best thing would be to install a filter program on the virtual
monitor *interface* (netdev), but say that it doesn't get frames with
radiotap, but pure 802.11 frames. We already have those in SKB format
at this point, so it'd be simple to run such a program and only pass
the SKB to the monitor netdev's RX when the program asked to do that.

This now seems a bit like XDP, but for XDP this header difference
doesn't seem appropriate either.

Anyone have any other thoughts?

Thanks,
johannes

Re: [PATCH v2 2/2] can: spi: hi311x: Add Holt HI-311x CAN driver

2017-03-16 Thread Akshay Bhat

Hi Wolfgang,

On 03/16/2017 04:02 PM, Wolfgang Grandegger wrote:
> 
> Looks much better now! There are message for state changes to error
> warning and passive. Just the following message is not correct:
> 
>  (000.200824)  can0  2004   [8]  00 40 00 00 00 00 5F 19   ERRORFRAME
> controller-problem{}
> error-counter-tx-rx{{95}{25}}
> 
> Sorry, forgot to mention... the function can_change_state() [1]
> should be used for that purpose, if possible. It fixes the issue
> above as well.
> 

The updated driver (the one used to create the above log) is using
can_change_state() function. data[1] 40 corresponds to
CAN_ERR_CRTL_ACTIVE, so looks correct? Could it be that the can-utils I
am using is old and not reporting state change?

Tentative v4 driver for reference:
http://pastebin.com/4xFVL1Sj

>> berr-reporting off case:
>> http://pastebin.com/fUn3j7qU
> 
> Ditto.
> 
> I just had another look to the manual and there is this undocumented
> STATFE register at offset 0x1E. It's mentioned in some other parts of
> the doc as interrupt enable register for STATF events. I would assume
> the same bit layout than STATF. If you set bit 2 (BUSOFF), 3 (ERRP)
> and 4 (ERRW), you may get interrupts. It's worth a try, I think. If
> it works, it's the much better solution.
> 

The HI-311x has a INT pin and a STAT pin. The hardware I have has only
the INT pin connected to the processor. If the STAT pin was also
connected, then like you mentioned it could be a much better solution to
use STAT for state changes.

Enabling BUSOFF/ERRP/ERRW bits in STATFE did not generate any interrupts
on the INT pin. Should we make it a requirement that both INT and STAT
pins need to be connected in hardware for the driver to do the error
reporting?

Thanks,
Akshay

> Wolfgang.
> 
> [1] http://lxr.free-electrons.com/ident?i=can_change_state
> 
> Wolfgang.
>

[PATCH 00/11] net: usbnet: move to new api ethtool_{get|set}_link_ksettings

2017-03-16 Thread Philippe Reynes

The ethtool api {get|set}_settings is deprecated. On usbnet, it
was often implemented with usbnet_{get|set}_settings.

In this serie, I add usbnet_{get|set}_link_ksettings
in the first patch, then I update all the driver to
use this new api, and in the last patch I remove the
old api usbnet_{get|set}_settings.


Philippe Reynes (11):
  net: usb: usbnet: add new api ethtool_{get|set}_link_ksettings
  net: usb: smsc95xx: use new api ethtool_{get|set}_link_ksettings
  net: usb: sr9800: use new api ethtool_{get|set}_link_ksettings
  net: usb: cdc_ncm: use new api ethtool_{get|set}_link_ksettings
  net: usb: dm9601: use new api ethtool_{get|set}_link_ksettings
  net: usb: mcs7830: use new api ethtool_{get|set}_link_ksettings
  net: usb: sierra_net: use new api ethtool_{get|set}_link_ksettings
  net: usb: smsc75xx: use new api ethtool_{get|set}_link_ksettings
  net: usb: sr9700: use new api ethtool_{get|set}_link_ksettings
  net: usb: asix: use new api ethtool_{get|set}_link_ksettings
  net: usb: usb: remove old api ethtool_{get|set}_settings

 drivers/net/usb/asix_devices.c |   12 ++--
 drivers/net/usb/cdc_ncm.c  |4 ++--
 drivers/net/usb/dm9601.c   |4 ++--
 drivers/net/usb/mcs7830.c  |4 ++--
 drivers/net/usb/sierra_net.c   |4 ++--
 drivers/net/usb/smsc75xx.c |4 ++--
 drivers/net/usb/smsc95xx.c |   24 
 drivers/net/usb/sr9700.c   |4 ++--
 drivers/net/usb/sr9800.c   |4 ++--
 drivers/net/usb/usbnet.c   |   19 ++-
 include/linux/usb/usbnet.h |8 
 11 files changed, 46 insertions(+), 45 deletions(-)

-- 
1.7.4.4

[PATCH 04/11] net: usb: cdc_ncm: use new api ethtool_{get|set}_link_ksettings

2017-03-16 Thread Philippe Reynes

The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

As I don't have the hardware, I'd be very pleased if
someone may test this patch.

Signed-off-by: Philippe Reynes 
---
 drivers/net/usb/cdc_ncm.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/usb/cdc_ncm.c b/drivers/net/usb/cdc_ncm.c
index f317984..b6c1d3a 100644
--- a/drivers/net/usb/cdc_ncm.c
+++ b/drivers/net/usb/cdc_ncm.c
@@ -131,8 +131,6 @@ static void cdc_ncm_get_strings(struct net_device 
__always_unused *netdev, u32 s
 static void cdc_ncm_update_rxtx_max(struct usbnet *dev, u32 new_rx, u32 
new_tx);
 
 static const struct ethtool_ops cdc_ncm_ethtool_ops = {
-   .get_settings  = usbnet_get_settings,
-   .set_settings  = usbnet_set_settings,
.get_link  = usbnet_get_link,
.nway_reset= usbnet_nway_reset,
.get_drvinfo   = usbnet_get_drvinfo,
@@ -142,6 +140,8 @@ static void cdc_ncm_get_strings(struct net_device 
__always_unused *netdev, u32 s
.get_sset_count= cdc_ncm_get_sset_count,
.get_strings   = cdc_ncm_get_strings,
.get_ethtool_stats = cdc_ncm_get_ethtool_stats,
+   .get_link_ksettings  = usbnet_get_link_ksettings,
+   .set_link_ksettings  = usbnet_set_link_ksettings,
 };
 
 static u32 cdc_ncm_check_rx_max(struct usbnet *dev, u32 new_rx)
-- 
1.7.4.4

[PATCH 01/11] net: usb: usbnet: add new api ethtool_{get|set}_link_ksettings

2017-03-16 Thread Philippe Reynes

The ethtool api {get|set}_settings is deprecated.
We add the new api {get|set}_link_ksettings to this driver.

As I don't have the hardware, I'd be very pleased if
someone may test this patch.

Signed-off-by: Philippe Reynes 
---
 drivers/net/usb/usbnet.c   |   36 
 include/linux/usb/usbnet.h |4 
 2 files changed, 40 insertions(+), 0 deletions(-)

diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
index 3de65ea..1b40b18 100644
--- a/drivers/net/usb/usbnet.c
+++ b/drivers/net/usb/usbnet.c
@@ -980,6 +980,40 @@ int usbnet_set_settings (struct net_device *net, struct 
ethtool_cmd *cmd)
 }
 EXPORT_SYMBOL_GPL(usbnet_set_settings);
 
+int usbnet_get_link_ksettings(struct net_device *net,
+ struct ethtool_link_ksettings *cmd)
+{
+   struct usbnet *dev = netdev_priv(net);
+
+   if (!dev->mii.mdio_read)
+   return -EOPNOTSUPP;
+
+   return mii_ethtool_get_link_ksettings(>mii, cmd);
+}
+EXPORT_SYMBOL_GPL(usbnet_get_link_ksettings);
+
+int usbnet_set_link_ksettings(struct net_device *net,
+ const struct ethtool_link_ksettings *cmd)
+{
+   struct usbnet *dev = netdev_priv(net);
+   int retval;
+
+   if (!dev->mii.mdio_write)
+   return -EOPNOTSUPP;
+
+   retval = mii_ethtool_set_link_ksettings(>mii, cmd);
+
+   /* link speed/duplex might have changed */
+   if (dev->driver_info->link_reset)
+   dev->driver_info->link_reset(dev);
+
+   /* hard_mtu or rx_urb_size may change in link_reset() */
+   usbnet_update_max_qlen(dev);
+
+   return retval;
+}
+EXPORT_SYMBOL_GPL(usbnet_set_link_ksettings);
+
 u32 usbnet_get_link (struct net_device *net)
 {
struct usbnet *dev = netdev_priv(net);
@@ -1046,6 +1080,8 @@ void usbnet_set_msglevel (struct net_device *net, u32 
level)
.get_msglevel   = usbnet_get_msglevel,
.set_msglevel   = usbnet_set_msglevel,
.get_ts_info= ethtool_op_get_ts_info,
+   .get_link_ksettings = usbnet_get_link_ksettings,
+   .set_link_ksettings = usbnet_set_link_ksettings,
 };
 
 /*-*/
diff --git a/include/linux/usb/usbnet.h b/include/linux/usb/usbnet.h
index 6e0ce8c..5bd8007 100644
--- a/include/linux/usb/usbnet.h
+++ b/include/linux/usb/usbnet.h
@@ -265,6 +265,10 @@ extern int usbnet_get_settings(struct net_device *net,
   struct ethtool_cmd *cmd);
 extern int usbnet_set_settings(struct net_device *net,
   struct ethtool_cmd *cmd);
+extern int usbnet_get_link_ksettings(struct net_device *net,
+struct ethtool_link_ksettings *cmd);
+extern int usbnet_set_link_ksettings(struct net_device *net,
+const struct ethtool_link_ksettings *cmd);
 extern u32 usbnet_get_link(struct net_device *net);
 extern u32 usbnet_get_msglevel(struct net_device *);
 extern void usbnet_set_msglevel(struct net_device *, u32);
-- 
1.7.4.4

[PATCH 11/11] net: usb: usb: remove old api ethtool_{get|set}_settings

2017-03-16 Thread Philippe Reynes

The function usbnet_{get|set}_settings is no longer used,
so we remove it.

Signed-off-by: Philippe Reynes 
---
 drivers/net/usb/usbnet.c   |   35 ---
 include/linux/usb/usbnet.h |4 
 2 files changed, 0 insertions(+), 39 deletions(-)

diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
index 1b40b18..13d4ec5 100644
--- a/drivers/net/usb/usbnet.c
+++ b/drivers/net/usb/usbnet.c
@@ -947,39 +947,6 @@ int usbnet_open (struct net_device *net)
  * they'll probably want to use this base set.
  */
 
-int usbnet_get_settings (struct net_device *net, struct ethtool_cmd *cmd)
-{
-   struct usbnet *dev = netdev_priv(net);
-
-   if (!dev->mii.mdio_read)
-   return -EOPNOTSUPP;
-
-   return mii_ethtool_gset(>mii, cmd);
-}
-EXPORT_SYMBOL_GPL(usbnet_get_settings);
-
-int usbnet_set_settings (struct net_device *net, struct ethtool_cmd *cmd)
-{
-   struct usbnet *dev = netdev_priv(net);
-   int retval;
-
-   if (!dev->mii.mdio_write)
-   return -EOPNOTSUPP;
-
-   retval = mii_ethtool_sset(>mii, cmd);
-
-   /* link speed/duplex might have changed */
-   if (dev->driver_info->link_reset)
-   dev->driver_info->link_reset(dev);
-
-   /* hard_mtu or rx_urb_size may change in link_reset() */
-   usbnet_update_max_qlen(dev);
-
-   return retval;
-
-}
-EXPORT_SYMBOL_GPL(usbnet_set_settings);
-
 int usbnet_get_link_ksettings(struct net_device *net,
  struct ethtool_link_ksettings *cmd)
 {
@@ -1072,8 +1039,6 @@ void usbnet_set_msglevel (struct net_device *net, u32 
level)
 
 /* drivers may override default ethtool_ops in their bind() routine */
 static const struct ethtool_ops usbnet_ethtool_ops = {
-   .get_settings   = usbnet_get_settings,
-   .set_settings   = usbnet_set_settings,
.get_link   = usbnet_get_link,
.nway_reset = usbnet_nway_reset,
.get_drvinfo= usbnet_get_drvinfo,
diff --git a/include/linux/usb/usbnet.h b/include/linux/usb/usbnet.h
index 5bd8007..e2b5691 100644
--- a/include/linux/usb/usbnet.h
+++ b/include/linux/usb/usbnet.h
@@ -261,10 +261,6 @@ extern netdev_tx_t usbnet_start_xmit(struct sk_buff *skb,
 extern void usbnet_resume_rx(struct usbnet *);
 extern void usbnet_purge_paused_rxq(struct usbnet *);
 
-extern int usbnet_get_settings(struct net_device *net,
-  struct ethtool_cmd *cmd);
-extern int usbnet_set_settings(struct net_device *net,
-  struct ethtool_cmd *cmd);
 extern int usbnet_get_link_ksettings(struct net_device *net,
 struct ethtool_link_ksettings *cmd);
 extern int usbnet_set_link_ksettings(struct net_device *net,
-- 
1.7.4.4

[PATCH 10/11] net: usb: asix: use new api ethtool_{get|set}_link_ksettings

2017-03-16 Thread Philippe Reynes

The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

As I don't have the hardware, I'd be very pleased if
someone may test this patch.

Signed-off-by: Philippe Reynes 
---
 drivers/net/usb/asix_devices.c |   12 ++--
 1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/usb/asix_devices.c b/drivers/net/usb/asix_devices.c
index 0dd5106..38456d0 100644
--- a/drivers/net/usb/asix_devices.c
+++ b/drivers/net/usb/asix_devices.c
@@ -136,9 +136,9 @@ static int asix_ioctl (struct net_device *net, struct ifreq 
*rq, int cmd)
.get_eeprom_len = asix_get_eeprom_len,
.get_eeprom = asix_get_eeprom,
.set_eeprom = asix_set_eeprom,
-   .get_settings   = usbnet_get_settings,
-   .set_settings   = usbnet_set_settings,
.nway_reset = usbnet_nway_reset,
+   .get_link_ksettings = usbnet_get_link_ksettings,
+   .set_link_ksettings = usbnet_set_link_ksettings,
 };
 
 static void ax88172_set_multicast(struct net_device *net)
@@ -301,9 +301,9 @@ static int ax88172_bind(struct usbnet *dev, struct 
usb_interface *intf)
.get_eeprom_len = asix_get_eeprom_len,
.get_eeprom = asix_get_eeprom,
.set_eeprom = asix_set_eeprom,
-   .get_settings   = usbnet_get_settings,
-   .set_settings   = usbnet_set_settings,
.nway_reset = usbnet_nway_reset,
+   .get_link_ksettings = usbnet_get_link_ksettings,
+   .set_link_ksettings = usbnet_set_link_ksettings,
 };
 
 static int ax88772_link_reset(struct usbnet *dev)
@@ -775,9 +775,9 @@ static void ax88772_unbind(struct usbnet *dev, struct 
usb_interface *intf)
.get_eeprom_len = asix_get_eeprom_len,
.get_eeprom = asix_get_eeprom,
.set_eeprom = asix_set_eeprom,
-   .get_settings   = usbnet_get_settings,
-   .set_settings   = usbnet_set_settings,
.nway_reset = usbnet_nway_reset,
+   .get_link_ksettings = usbnet_get_link_ksettings,
+   .set_link_ksettings = usbnet_set_link_ksettings,
 };
 
 static int marvell_phy_init(struct usbnet *dev)
-- 
1.7.4.4

[PATCH 09/11] net: usb: sr9700: use new api ethtool_{get|set}_link_ksettings

2017-03-16 Thread Philippe Reynes

The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

As I don't have the hardware, I'd be very pleased if
someone may test this patch.

Signed-off-by: Philippe Reynes 
---
 drivers/net/usb/sr9700.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/usb/sr9700.c b/drivers/net/usb/sr9700.c
index 4a1e9c4..950a3a9 100644
--- a/drivers/net/usb/sr9700.c
+++ b/drivers/net/usb/sr9700.c
@@ -249,9 +249,9 @@ static int sr9700_ioctl(struct net_device *netdev, struct 
ifreq *rq, int cmd)
.set_msglevel   = usbnet_set_msglevel,
.get_eeprom_len = sr9700_get_eeprom_len,
.get_eeprom = sr9700_get_eeprom,
-   .get_settings   = usbnet_get_settings,
-   .set_settings   = usbnet_set_settings,
.nway_reset = usbnet_nway_reset,
+   .get_link_ksettings = usbnet_get_link_ksettings,
+   .set_link_ksettings = usbnet_set_link_ksettings,
 };
 
 static void sr9700_set_multicast(struct net_device *netdev)
-- 
1.7.4.4

[PATCH 07/11] net: usb: sierra_net: use new api ethtool_{get|set}_link_ksettings

2017-03-16 Thread Philippe Reynes

The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

As I don't have the hardware, I'd be very pleased if
someone may test this patch.

Signed-off-by: Philippe Reynes 
---
 drivers/net/usb/sierra_net.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/usb/sierra_net.c b/drivers/net/usb/sierra_net.c
index ac69f28..c8f60b8 100644
--- a/drivers/net/usb/sierra_net.c
+++ b/drivers/net/usb/sierra_net.c
@@ -648,9 +648,9 @@ static u32 sierra_net_get_link(struct net_device *net)
.get_link = sierra_net_get_link,
.get_msglevel = usbnet_get_msglevel,
.set_msglevel = usbnet_set_msglevel,
-   .get_settings = usbnet_get_settings,
-   .set_settings = usbnet_set_settings,
.nway_reset = usbnet_nway_reset,
+   .get_link_ksettings = usbnet_get_link_ksettings,
+   .set_link_ksettings = usbnet_set_link_ksettings,
 };
 
 static int sierra_net_get_fw_attr(struct usbnet *dev, u16 *datap)
-- 
1.7.4.4

[PATCH 08/11] net: usb: smsc75xx: use new api ethtool_{get|set}_link_ksettings

2017-03-16 Thread Philippe Reynes

The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

As I don't have the hardware, I'd be very pleased if
someone may test this patch.

Signed-off-by: Philippe Reynes 
---
 drivers/net/usb/smsc75xx.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/usb/smsc75xx.c b/drivers/net/usb/smsc75xx.c
index 0b17b40..1ab0ff4 100644
--- a/drivers/net/usb/smsc75xx.c
+++ b/drivers/net/usb/smsc75xx.c
@@ -743,13 +743,13 @@ static int smsc75xx_ethtool_set_wol(struct net_device 
*net,
.get_drvinfo= usbnet_get_drvinfo,
.get_msglevel   = usbnet_get_msglevel,
.set_msglevel   = usbnet_set_msglevel,
-   .get_settings   = usbnet_get_settings,
-   .set_settings   = usbnet_set_settings,
.get_eeprom_len = smsc75xx_ethtool_get_eeprom_len,
.get_eeprom = smsc75xx_ethtool_get_eeprom,
.set_eeprom = smsc75xx_ethtool_set_eeprom,
.get_wol= smsc75xx_ethtool_get_wol,
.set_wol= smsc75xx_ethtool_set_wol,
+   .get_link_ksettings = usbnet_get_link_ksettings,
+   .set_link_ksettings = usbnet_set_link_ksettings,
 };
 
 static int smsc75xx_ioctl(struct net_device *netdev, struct ifreq *rq, int cmd)
-- 
1.7.4.4

[PATCH 06/11] net: usb: mcs7830: use new api ethtool_{get|set}_link_ksettings

2017-03-16 Thread Philippe Reynes

The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

As I don't have the hardware, I'd be very pleased if
someone may test this patch.

Signed-off-by: Philippe Reynes 
---
 drivers/net/usb/mcs7830.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/usb/mcs7830.c b/drivers/net/usb/mcs7830.c
index 4f345bd..5771ff2 100644
--- a/drivers/net/usb/mcs7830.c
+++ b/drivers/net/usb/mcs7830.c
@@ -464,9 +464,9 @@ static void mcs7830_get_regs(struct net_device *net, struct 
ethtool_regs *regs,
.get_link   = usbnet_get_link,
.get_msglevel   = usbnet_get_msglevel,
.set_msglevel   = usbnet_set_msglevel,
-   .get_settings   = usbnet_get_settings,
-   .set_settings   = usbnet_set_settings,
.nway_reset = usbnet_nway_reset,
+   .get_link_ksettings = usbnet_get_link_ksettings,
+   .set_link_ksettings = usbnet_set_link_ksettings,
 };
 
 static const struct net_device_ops mcs7830_netdev_ops = {
-- 
1.7.4.4

[PATCH 03/11] net: usb: sr9800: use new api ethtool_{get|set}_link_ksettings

2017-03-16 Thread Philippe Reynes

The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

As I don't have the hardware, I'd be very pleased if
someone may test this patch.

Signed-off-by: Philippe Reynes 
---
 drivers/net/usb/sr9800.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/usb/sr9800.c b/drivers/net/usb/sr9800.c
index a50df0d..a696b62 100644
--- a/drivers/net/usb/sr9800.c
+++ b/drivers/net/usb/sr9800.c
@@ -524,9 +524,9 @@ static int sr_set_mac_address(struct net_device *net, void 
*p)
.set_wol= sr_set_wol,
.get_eeprom_len = sr_get_eeprom_len,
.get_eeprom = sr_get_eeprom,
-   .get_settings   = usbnet_get_settings,
-   .set_settings   = usbnet_set_settings,
.nway_reset = usbnet_nway_reset,
+   .get_link_ksettings = usbnet_get_link_ksettings,
+   .set_link_ksettings = usbnet_set_link_ksettings,
 };
 
 static int sr9800_link_reset(struct usbnet *dev)
-- 
1.7.4.4

[PATCH 02/11] net: usb: smsc95xx: use new api ethtool_{get|set}_link_ksettings

2017-03-16 Thread Philippe Reynes

The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

As I don't have the hardware, I'd be very pleased if
someone may test this patch.

Signed-off-by: Philippe Reynes 
---
 drivers/net/usb/smsc95xx.c |   24 
 1 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/drivers/net/usb/smsc95xx.c b/drivers/net/usb/smsc95xx.c
index 831aa33..4a8bf96 100644
--- a/drivers/net/usb/smsc95xx.c
+++ b/drivers/net/usb/smsc95xx.c
@@ -853,32 +853,32 @@ static void set_mdix_status(struct net_device *net, __u8 
mdix_ctrl)
pdata->mdix_ctrl = mdix_ctrl;
 }
 
-static int smsc95xx_get_settings(struct net_device *net,
-struct ethtool_cmd *cmd)
+static int smsc95xx_get_link_ksettings(struct net_device *net,
+  struct ethtool_link_ksettings *cmd)
 {
struct usbnet *dev = netdev_priv(net);
struct smsc95xx_priv *pdata = (struct smsc95xx_priv *)(dev->data[0]);
int retval;
 
-   retval = usbnet_get_settings(net, cmd);
+   retval = usbnet_get_link_ksettings(net, cmd);
 
-   cmd->eth_tp_mdix = pdata->mdix_ctrl;
-   cmd->eth_tp_mdix_ctrl = pdata->mdix_ctrl;
+   cmd->base.eth_tp_mdix = pdata->mdix_ctrl;
+   cmd->base.eth_tp_mdix_ctrl = pdata->mdix_ctrl;
 
return retval;
 }
 
-static int smsc95xx_set_settings(struct net_device *net,
-struct ethtool_cmd *cmd)
+static int smsc95xx_set_link_ksettings(struct net_device *net,
+  const struct ethtool_link_ksettings *cmd)
 {
struct usbnet *dev = netdev_priv(net);
struct smsc95xx_priv *pdata = (struct smsc95xx_priv *)(dev->data[0]);
int retval;
 
-   if (pdata->mdix_ctrl != cmd->eth_tp_mdix_ctrl)
-   set_mdix_status(net, cmd->eth_tp_mdix_ctrl);
+   if (pdata->mdix_ctrl != cmd->base.eth_tp_mdix_ctrl)
+   set_mdix_status(net, cmd->base.eth_tp_mdix_ctrl);
 
-   retval = usbnet_set_settings(net, cmd);
+   retval = usbnet_set_link_ksettings(net, cmd);
 
return retval;
 }
@@ -889,8 +889,6 @@ static int smsc95xx_set_settings(struct net_device *net,
.get_drvinfo= usbnet_get_drvinfo,
.get_msglevel   = usbnet_get_msglevel,
.set_msglevel   = usbnet_set_msglevel,
-   .get_settings   = smsc95xx_get_settings,
-   .set_settings   = smsc95xx_set_settings,
.get_eeprom_len = smsc95xx_ethtool_get_eeprom_len,
.get_eeprom = smsc95xx_ethtool_get_eeprom,
.set_eeprom = smsc95xx_ethtool_set_eeprom,
@@ -898,6 +896,8 @@ static int smsc95xx_set_settings(struct net_device *net,
.get_regs   = smsc95xx_ethtool_getregs,
.get_wol= smsc95xx_ethtool_get_wol,
.set_wol= smsc95xx_ethtool_set_wol,
+   .get_link_ksettings = smsc95xx_get_link_ksettings,
+   .set_link_ksettings = smsc95xx_set_link_ksettings,
 };
 
 static int smsc95xx_ioctl(struct net_device *netdev, struct ifreq *rq, int cmd)
-- 
1.7.4.4

[PATCH 05/11] net: usb: dm9601: use new api ethtool_{get|set}_link_ksettings

2017-03-16 Thread Philippe Reynes

The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

As I don't have the hardware, I'd be very pleased if
someone may test this patch.

Signed-off-by: Philippe Reynes 
---
 drivers/net/usb/dm9601.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/usb/dm9601.c b/drivers/net/usb/dm9601.c
index 0b4bdd3..fea1b64 100644
--- a/drivers/net/usb/dm9601.c
+++ b/drivers/net/usb/dm9601.c
@@ -281,9 +281,9 @@ static int dm9601_ioctl(struct net_device *net, struct 
ifreq *rq, int cmd)
.set_msglevel   = usbnet_set_msglevel,
.get_eeprom_len = dm9601_get_eeprom_len,
.get_eeprom = dm9601_get_eeprom,
-   .get_settings   = usbnet_get_settings,
-   .set_settings   = usbnet_set_settings,
.nway_reset = usbnet_nway_reset,
+   .get_link_ksettings = usbnet_get_link_ksettings,
+   .set_link_ksettings = usbnet_set_link_ksettings,
 };
 
 static void dm9601_set_multicast(struct net_device *net)
-- 
1.7.4.4

Re: [net-next PATCH 5/5] epoll: Add busy poll support to epoll with socket fds.

2017-03-16 Thread Eric Dumazet

On Thu, 2017-03-16 at 11:33 -0700, Alexander Duyck wrote:
> From: Sridhar Samudrala 

> +/*
> + * If busy polling is on and the file is a socket, return a pointer to
> + * struct sock
> + */
> +static inline struct sock *ep_sk_from_file(struct file *file)
> +{
> + struct inode *inode = file_inode(file);
> +
> + if (!S_ISSOCK(inode->i_mode))
> + return NULL;
> +
> + return ((struct socket *)file->private_data)->sk;
> +}

I believe a more standard way is to use sock_from_file()

(This does not have to fetch a cache line to read i_mode ))

Re: [net-next PATCH 1/5] net: Do not record sender_cpu as napi_id in socket receive paths

2017-03-16 Thread Eric Dumazet

On Thu, 2017-03-16 at 11:32 -0700, Alexander Duyck wrote:
> From: Sridhar Samudrala 
> 
> Fix sk_mark_napi_id() and sk_mark_napi_id_once() to set sk_napi_id only if
> skb->napi_id is a valid value.
> 
> This happens in loopback paths where skb->napi_id is not updated in
> rx path and holds sender_cpu that is set in xmit path.
> 
> Signed-off-by: Sridhar Samudrala 
> Signed-off-by: Alexander Duyck 
> ---
>  include/net/busy_poll.h |5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
> index c0452de83086..67991635953e 100644
> --- a/include/net/busy_poll.h
> +++ b/include/net/busy_poll.h
> @@ -116,7 +116,8 @@ static inline bool sk_busy_loop(struct sock *sk, int 
> nonblock)
>  static inline void sk_mark_napi_id(struct sock *sk, const struct sk_buff 
> *skb)
>  {
>  #ifdef CONFIG_NET_RX_BUSY_POLL
> - sk->sk_napi_id = skb->napi_id;
> + if (skb->napi_id > (u32)NR_CPUS)
> + sk->sk_napi_id = skb->napi_id;
>  #endif
>  }
>  
> @@ -125,7 +126,7 @@ static inline void sk_mark_napi_id_once(struct sock *sk,
>   const struct sk_buff *skb)
>  {
>  #ifdef CONFIG_NET_RX_BUSY_POLL
> - if (!sk->sk_napi_id)
> + if (!sk->sk_napi_id && (skb->napi_id > (u32)NR_CPUS))
>   sk->sk_napi_id = skb->napi_id;
>  #endif
>  }
> 

It is not clear why this patch is needed .

What you describe here is the case we might receive packets for a socket
coming from different interfaces ?

If skb->napi_id is a sender_cpu, why should we prevent overwriting the
sk_napi_id with it, knowing that busy polling will simply ignore the
invalid value ?

Do not get me wrong :

I simply try to understand why the test about napi_id validity is now
done twice :

1) At the time we are writing into sk->sk_napi_id

2) At busy polling time when we read sk->sk_napi_id

Re: [net-next PATCH 2/5] net: Call sk_mark_napi_id() in the ACK receive path

2017-03-16 Thread Eric Dumazet

On Thu, 2017-03-16 at 11:32 -0700, Alexander Duyck wrote:
> From: Sridhar Samudrala 
> 
> Call sk_mark_napi_id() in the ACK receive path of a TCP_NEW_SYN_RECV
> socket, so that sk->napi_id is set even if the socket hasn't yet received
> any data.  With this change we should be able to start busy polling
> slightly earlier.
> 
> Signed-off-by: Sridhar Samudrala 
> Signed-off-by: Alexander Duyck 
> ---
>  net/ipv4/tcp_ipv4.c |1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 08d870e45658..b86002a296f1 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1687,6 +1687,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
>   tcp_v4_send_reset(nsk, skb);
>   goto discard_and_relse;
>   } else {
> + sk_mark_napi_id(nsk, skb);
>   sock_put(sk);
>   return 0;
>   }
> 

Seems good, but what about IPv6 ?

Frankly this calls for the sk_mark_napi_id() being done in
tcp_child_process() instead of its four callers.

Re: [PATCH] [PATCH net] net: Do not hold the reference for the same sk_rx_dst.

2017-03-16 Thread Eric Dumazet

On Thu, 2017-03-16 at 01:08 -0700, Kevin Xu wrote:
> In some rare cases, inet_sk_rx_dst_set() may be called multiple times
> on the same dst, causing reference count leakage. Eventually, it
> prevents net_device to be destroyed. The bug then manifested as
> 
> unregister_netdevice: waiting for lo to become free. Usage count = 1
> 
> in the kernel log, preventing new network namespace creation.
> 
> The patch works around the issue by checking whether the socket already
> has the same dst set.
> 
> Signed-off-by: Kevin Xu 
> ---
>  net/ipv4/tcp_ipv4.c | 11 ---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 575e19d..f425c14 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1807,9 +1807,14 @@ void inet_sk_rx_dst_set(struct sock *sk, const struct 
> sk_buff *skb)
>  {
>   struct dst_entry *dst = skb_dst(skb);
>  
> - if (dst && dst_hold_safe(dst)) {
> - sk->sk_rx_dst = dst;
> - inet_sk(sk)->rx_dst_ifindex = skb->skb_iif;
> + if (dst) {
> + if (unlikely(dst == sk->sk_rx_dst))
> + return;
> +
> + if (dst_hold_safe(dst)) {
> + sk->sk_rx_dst = dst;
> + inet_sk(sk)->rx_dst_ifindex = skb->skb_iif;
> + }


That should not be needed.

Have you backported the redirect fix ?

commit 45caeaa5ac0b4b11784ac6f932c0ad4c6b67cda0

Or other fixes that went very recently (pick David Miller net tree)

Re: [RFC] [net]openvswitch: Clear the ct flow key for the recirculated packet

2017-03-16 Thread Lance Richardson

> From: "Numan Siddique" 
> To: netdev@vger.kernel.org, "ovs dev" 
> Cc: "Joe Stringer" , "Andy Zhou" , ja...@ovn.org
> Sent: Thursday, March 16, 2017 8:25:06 AM
> Subject: [RFC] [net]openvswitch: Clear the ct flow key for the recirculated 
> packet
> 
> It is possible that the ct flow key information would have
> gone stale for the packets received from the userspace due to
> clone or ct_clear actions.
> 
> In the case of OVN, it adds ping responder flows, which modifies
> the original icmp4 request packet to a reply packet. It uses the
> OVS actions - clone and ct_clear. When the reply packet hits the
> "ovs_ct_execute" function, and since the ct flow key info is not
> cleared, the connection tracker doesn't set the state to
> ESTABLISHED state.
> 
> Note: This patch is marked as RFC, as I am not sure if this is the correct
> place to address this issue or it should be addressed in ovs-vswitchd
> to set the OVS_KEY_ATTR_CT_STATE and other related attributes
> properly for ct_clear action.
> 
> Signed-off-by: Numan Siddique 
> ---

Hi Numan,

With this patch applied I'm consistently seeing failures for two of the
kernel datapath unit tests (via "make check-kernel"):

 16: conntrack - force commitFAILED 
(system-traffic.at:692)
 54: conntrack - SNAT with ct_mark change on reply   FAILED 
(system-traffic.at:2446)

Re: [PATCH net-next V2 0/2] small set of sched cleanups

2017-03-16 Thread Cong Wang

On Thu, Mar 16, 2017 at 12:02 PM, David Miller  wrote:
> From: Or Gerlitz 
> Date: Thu, 16 Mar 2017 12:53:40 +0200
>
>> Just two cleanups -- but for the 2nd one I think we need ack from
>> Cong Wang to make sure this isn't actually a bug report..
>  ...
>> changes from V1:
>>   - addressed comment from Sergei to use 12 hex digits etc
>
> Series applied since Eric Dumazet reviewed and ACK's patch #2,
> although I still want Cong to take a look at it.

I already did and gave an Ack. Is it lost?

Re: [PATCH v2 2/2] can: spi: hi311x: Add Holt HI-311x CAN driver

2017-03-16 Thread Wolfgang Grandegger

Hello Akshay,

Am 16.03.2017 um 18:06 schrieb Akshay Bhat:
> Hi Wolfgang,
> 
> On 03/15/2017 05:42 AM, Wolfgang Grandegger wrote:
>> Hello Akshay,
>>
> ..snip..
>>>
>>> So here is my plan:
>>> - Have the bus error interrupt always enabled
>>> - If berr-reporting off, then have the isr checks/reports state changes
>>
>> Error state change messages should always be there. These are the
>> important one.
>>
>>> - if berr-reporting on, then have the isr checks/reports bus errors
>>> and state changes (Does it make sense packing the error message, if
>>> the ISR finds both bus and state changes?)
>>
>> If berr-reporting is off, simply do not create an error message for bus
>> errors, and only if the state changed. If it's "on" create an additional
>> bus error message.
>>
>> http://lxr.free-electrons.com/source/drivers/net/can/flexcan.c#L334
>>
>>
> 
> I have fixed the driver to handle the error reporting. Also thanks for
> your tip for generating bus-off by setting the host device at a
> different CAN bit rate! Below are logs with the updated driver. Let me
> know if you have any concerns, if not I will submit the v4 patch.
> 
> berr-reporting on case:
> http://pastebin.com/qDRLERmW

Looks much better now! There are message for state changes to error
warning and passive. Just the following message is not correct:

 (000.200824)  can0  2004   [8]  00 40 00 00 00 00 5F 19   ERRORFRAME
controller-problem{}
error-counter-tx-rx{{95}{25}}

Sorry, forgot to mention... the function can_change_state() [1]
should be used for that purpose, if possible. It fixes the issue
above as well.

> berr-reporting off case:
> http://pastebin.com/fUn3j7qU

Ditto.

I just had another look to the manual and there is this undocumented
STATFE register at offset 0x1E. It's mentioned in some other parts of
the doc as interrupt enable register for STATF events. I would assume
the same bit layout than STATF. If you set bit 2 (BUSOFF), 3 (ERRP)
and 4 (ERRW), you may get interrupts. It's worth a try, I think. If
it works, it's the much better solution.

Wolfgang.

[1] http://lxr.free-electrons.com/ident?i=can_change_state

Wolfgang.

Re: port mirror on dsa switches?

2017-03-16 Thread Florian Fainelli

On 03/16/2017 12:43 PM, Maxim Uvarov wrote:
> 2017-03-16 19:47 GMT+03:00 Florian Fainelli :
>> Hi,
>>
>> On 03/16/2017 03:32 AM, Maxim Uvarov wrote:
>>> Hello,
>>>
>>> Some dsa switches can support port mirror in hardware. Does somebody
>>> have any idea how to
>>> work with it from linux side in generic way?
>>
>> It has been implemented with commit
>> f50f212749e8a28803af3628acbeb85ee0458ed5 ("net: dsa: Add plumbing for
>> port mirroring") and you can see an implementation example with the b53
>> driver.
>>
> 
> thanks, see that in newer kernel. It will be good to update doc with
> commands to use.
> I see from cover latter you mirrored eth1 to eth2. I  assume it's dsa
> ports names for b53?

Yes, DSA will refuse mirroring to network devices that are not ports of
the switch (arguable we could allow mirroring to master network device,
but that's TODO).

> Do I also need updated tc?

Yes you do need a reasonably recent iproute2, like 4.11 or something close.

> 
>> For a DSA driver you should be implementing port_mirror_add and
>> port_mirror_del operations which provide you with the necessary
>> information. Hopefully the switch you are working with (mv88e6xxx?) is
>> also supportable using that API, if not, please submit changes to extend it.
>>
>> Thanks!
>> --
>> Florian
> 
> Yes, I think it will match mv88e6xxx. Will try to play with it. Thanks a lot!
> 


-- 
Florian

[PATCH net] netvsc: fix race during initialization

2017-03-16 Thread Stephen Hemminger

When device is being setup on boot, there is a small race where
network device callback is registered, but the netvsc_device pointer
is not set yet.  This can cause a NULL ptr dereference if packet
arrives during this window.

Fixes: 46b4f7f5d1f7 ("netvsc: eliminate per-device outstanding send counter")
Signed-off-by: Stephen Hemminger 
---
Applies to 4.11 only, not in earlier versions.
net-next needs different patch

 drivers/net/hyperv/netvsc.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 4c1d8cca247b..8dd0b8770328 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -1231,8 +1231,11 @@ void netvsc_channel_cb(void *context)
return;
 
net_device = net_device_to_netvsc_device(ndev);
-   if (unlikely(net_device->destroy) &&
-   netvsc_channel_idle(net_device, q_idx))
+   if (unlikely(!net_device))
+   return;
+
+   if (unlikely(net_device->destroy &&
+netvsc_channel_idle(net_device, q_idx)))
return;
 
/* commit_rd_index() -> hv_signal_on_read() needs this. */
-- 
2.11.0

[PATCH] tun: fix inability to set offloads after disabling them via ethtool

2017-03-16 Thread Yaroslav Isakov

Added missing logic in tun driver, which prevents apps to set
offloads using tun ioctl, if offloads were previously disabled via ethtool

Signed-of-by: Yaroslav Isakov (yaroslav.isa...@gmail.com)
---
 drivers/net/tun.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 34cc3c5..cc88cd7 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1931,6 +1931,8 @@ static int set_offload(struct tun_struct *tun, unsigned 
long arg)
return -EINVAL;
 
tun->set_features = features;
+   tun->dev->wanted_features &= ~TUN_USER_FEATURES;
+   tun->dev->wanted_features |= features;
netdev_update_features(tun->dev);
 
return 0;
-- 
2.10.2

[PATCH] tun: fix inability to set offloads after disabling them via ethtool

2017-03-16 Thread Yaroslav Isakov

Added missing logic in tun driver, which prevents apps to set
offloads using tun ioctl, if offloads were previously disabled via ethtool

Signed-off-by: Yaroslav Isakov 
---
 drivers/net/tun.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 34cc3c5..cc88cd7 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1931,6 +1931,8 @@ static int set_offload(struct tun_struct *tun, unsigned 
long arg)
return -EINVAL;
 
tun->set_features = features;
+   tun->dev->wanted_features &= ~TUN_USER_FEATURES;
+   tun->dev->wanted_features |= features;
netdev_update_features(tun->dev);
 
return 0;
-- 
2.10.2

Re: port mirror on dsa switches?

2017-03-16 Thread Maxim Uvarov

2017-03-16 19:47 GMT+03:00 Florian Fainelli :
> Hi,
>
> On 03/16/2017 03:32 AM, Maxim Uvarov wrote:
>> Hello,
>>
>> Some dsa switches can support port mirror in hardware. Does somebody
>> have any idea how to
>> work with it from linux side in generic way?
>
> It has been implemented with commit
> f50f212749e8a28803af3628acbeb85ee0458ed5 ("net: dsa: Add plumbing for
> port mirroring") and you can see an implementation example with the b53
> driver.
>

thanks, see that in newer kernel. It will be good to update doc with
commands to use.
I see from cover latter you mirrored eth1 to eth2. I  assume it's dsa
ports names for b53?
Do I also need updated tc?

> For a DSA driver you should be implementing port_mirror_add and
> port_mirror_del operations which provide you with the necessary
> information. Hopefully the switch you are working with (mv88e6xxx?) is
> also supportable using that API, if not, please submit changes to extend it.
>
> Thanks!
> --
> Florian

Yes, I think it will match mv88e6xxx. Will try to play with it. Thanks a lot!

-- 
Best regards,
Maxim Uvarov

Re: [PATCH] tun: fix inability to set offloads after disabling them via ethtool

2017-03-16 Thread Yaroslav Isakov

Sorry for spamming mailing list, forgot to check via checkpatch.pl,
will resend it one more time

2017-03-16 21:08 GMT+03:00 Yaroslav Isakov :
> Added missing logic in tun driver, which prevents apps to set
> offloads using tun ioctl, if offloads were previously disabled via ethtool
>
> Signed-of-by: Yaroslav Isakov (yaroslav.isa...@gmail.com)
> ---
>  drivers/net/tun.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 34cc3c5..cc88cd7 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -1931,6 +1931,8 @@ static int set_offload(struct tun_struct *tun, unsigned 
> long arg)
> return -EINVAL;
>
> tun->set_features = features;
> +   tun->dev->wanted_features &= ~TUN_USER_FEATURES;
> +   tun->dev->wanted_features |= features;
> netdev_update_features(tun->dev);
>
> return 0;
> --
> 2.10.2
>

Re: [net-next sample action optimization v2 4/4] Openvswitch: Refactor sample and recirc actions implementation

2017-03-16 Thread Pravin Shelar

On Tue, Mar 14, 2017 at 4:08 PM, Andy Zhou  wrote:
> Added execute_or_defer_actions() that both sample and recirc
> action's implementation can use.
>
> Signed-off-by: Andy Zhou 
> ---
>  net/openvswitch/actions.c | 96 
> +--
>  1 file changed, 59 insertions(+), 37 deletions(-)
>
> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
> index 1638370..fd7d903 100644
> --- a/net/openvswitch/actions.c
> +++ b/net/openvswitch/actions.c
...

> @@ -1286,6 +1262,52 @@ static int do_execute_actions(struct datapath *dp, 
> struct sk_buff *skb,
> return 0;
>  }
>
> +static int execute_or_defer_actions(struct datapath *dp, struct sk_buff *skb,
> +   struct sw_flow_key *clone,
> +   struct sw_flow_key *key,
> +   u32 *recirc_id,
> +   const struct nlattr *actions, int len)
> +{
> +   struct deferred_action *da;
> +
> +   /* If within the limit of 'OVS_DEFERRED_ACTION_THRESHOLD',
> +* recirc immediately, otherwise, defer it for later execution.
> +*/
> +   if (clone) {
> +   if (recirc_id) {
> +   clone->recirc_id = *recirc_id;
> +   ovs_dp_process_packet(skb, clone);
> +   return 0;
> +   } else {
> +   return do_execute_actions(dp, skb, clone,
> + actions, len);

the exec_actions_level inc and dec should be moved here, since it is
only needed around this function.

Re: [net-next sample action optimization v2 1/4] openvswitch: Deferred fifo API change.

2017-03-16 Thread Andy Zhou

On Thu, Mar 16, 2017 at 10:28 AM, Pravin Shelar  wrote:
> On Tue, Mar 14, 2017 at 4:08 PM, Andy Zhou  wrote:
>> add_deferred_actions() API currently requires actions to be passed in
>> as a fully encoded netlink message. So far both 'sample' and 'recirc'
>> actions happens to carry actions as fully encoded netlink messages.
>> However, this requirement is more restrictive than necessary, future
>> patch will need to pass in action lists that are not fully encoded
>> by themselves.
>
> It is not obvious why this change is required?
> can you explain it.

The original 'attr' requires a nested netlink message for the callee
to get the size of
the actions list.  In the sample case, since we rewrite sample into an
internal format that
does not have actions encoded as a nested netlink, we will need to
pass both pointer
to the first action, and the size of the actions list.

Re: tun offloads bug

2017-03-16 Thread Yaroslav Isakov

Sorry for this mess! I'll do it properly this time, I hope :)

2017-03-16 22:12 GMT+03:00 David Miller :
>
> This is not how you resubmit a patch.
>
> You must make a fresh patch posting to the mailing list, provide the
> commit message and patch as inline text, and not do it as a reply to
> another posting.
>
> Thank you.

Re: tun offloads bug

2017-03-16 Thread David Miller


This is not how you resubmit a patch.

You must make a fresh patch posting to the mailing list, provide the
commit message and patch as inline text, and not do it as a reply to
another posting.

Thank you.

Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t

2017-03-16 Thread David Miller

From: Kees Cook 
Date: Thu, 16 Mar 2017 11:38:25 -0600

> I am, of course, biased, but I think the evidence of actual
> refcounting attacks outweighs the theoretical performance cost of
> these changes.

This is not theoretical at all.

We count the nanoseconds that every packet takes to get processed and
you are adding quite a bit.

I understand your point of view, but this is knowingly going to add
performance regressions to the networking code.

Re: [PATCH v2] openvswitch: Add missing case OVS_TUNNEL_KEY_ATTR_PAD

2017-03-16 Thread David Miller

From: Kris Murphy 
Date: Thu, 16 Mar 2017 10:51:28 -0500

> Added a case for OVS_TUNNEL_KEY_ATTR_PAD to the switch statement
> in ip_tun_from_nlattr in order to prevent the default case
> returning an error.
> 
> Fixes: b46f6ded906e ("libnl: nla_put_be64(): align on a 64-bit area")
> Signed-off-by: Kris Murphy 

Applied and queued up for -stable, thanks.

Re: [PATCH 0/2] pull request for net: batman-adv 2017-03-16

2017-03-16 Thread David Miller

From: Simon Wunderlich 
Date: Thu, 16 Mar 2017 13:55:10 +0100

> here are two bugfixes for batman-adv which we would like to see
> integrated into net.
> 
> Please pull or let me know of any problem!

Pulled, thanks Simon.

Re: [PATCH net-next] bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registers

2017-03-16 Thread Daniel Borkmann

On 03/16/2017 07:23 PM, David Miller wrote:

From: Daniel Borkmann 
Date: Thu, 16 Mar 2017 11:32:31 +0100

Do you have a chance to queue this one and it's follow-up fixes
to -stable? I checked 4.10 and they seem to have it already, but
not 4.9 kernels.

Commits:

   6760bf2 bpf: fix mark_reg_unknown_value for spilled regs on map value
   marking
   a08dd0d bpf: fix regression on verifier pruning wrt map lookups
   d2a4dd3 bpf: fix state equivalence
   57a09bf bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registers

Ok, queued up.

Great, thanks!

Re: [PATCH net-next V2 0/2] small set of sched cleanups

2017-03-16 Thread David Miller

From: Or Gerlitz 
Date: Thu, 16 Mar 2017 12:53:40 +0200

> Just two cleanups -- but for the 2nd one I think we need ack from 
> Cong Wang to make sure this isn't actually a bug report..
 ...
> changes from V1:
>   - addressed comment from Sergei to use 12 hex digits etc

Series applied since Eric Dumazet reviewed and ACK's patch #2,
although I still want Cong to take a look at it.

Thanks.

1 2 3 >

1 - 100 of 247 matches

Mail list logo