Re: slab corruption with current -git

2016-10-09 Thread David Miller
From: Linus Torvalds 
Date: Sun, 9 Oct 2016 20:41:17 -0700

> Note that the "correct way" of doing list operations also almost
> inevitably is the shortest way by far, since it gets rid of all the
> special cases. So the patch looks nice. It gets rid of the magic
> "nf_set_hooks_head()" thing too, because once you do list following
> right, the head is no different from any other pointer in the list.

Perhaps we should have some "slist" primitives added to
include/linux/list.h but since the comparison differs for each user I
guess it's hard to abstract in a way that's generic and inlines
properly.

I'll start taking a look at your patch and this stuff as well, thanks
Linus.


Re: slab corruption with current -git (was Re: [git pull] vfs pile 1 (splice))

2016-10-09 Thread Linus Torvalds
On Sun, Oct 9, 2016 at 7:49 PM, Linus Torvalds
 wrote:
>
> There is one *correct* way to remove an entry from a singly linked
> list, and it looks like this:
>
> struct entry **pp, *p;
>
> pp = 
> while ((p = *pp) != NULL) {
> if (right_entry(p)) {
> *pp = p->next;
> break;
> }
> pp = >next;
> }
>
> and that's it. Nothing else.

This COMPLETELY UNTESTED patch tries to fix the nf_hook_entry code to do this.

I repeat: it's ENTIRELY UNTESTED. I just converted the insertion and
deletion to the proper pattern, but I could easily have gotten the
insertion priority test the wrong way around entirely, for example. Or
it could simply have some other completely broken bug in it. It
compiles for me, but that's all I actually checked.

Note that the "correct way" of doing list operations also almost
inevitably is the shortest way by far, since it gets rid of all the
special cases. So the patch looks nice. It gets rid of the magic
"nf_set_hooks_head()" thing too, because once you do list following
right, the head is no different from any other pointer in the list.

So the patch stats look good:

 net/netfilter/core.c | 108 ---
 1 file changed, 33 insertions(+), 75 deletions(-)

but again, it's entirely *entirely* untested. Please consider this
just a "this is generally how list insert/delete operations should be
done, avoiding special cases for the first entry".

ALSO NOTE! The code assumes that the "nf_hook_mutex" locking only
protects the actual *lists*, and that the address to the list can be
looked up without holding the lock. That's generally how things are
done, and it simplifies error handling (because you can do the "there
is no such list at all" test before you do anything else. But again, I
don't actually know the code, and if there is something that actually
expands the number of lists etc that depends on that mutex, then the
list head lookup may need to be inside the lock too.

   Linus
 net/netfilter/core.c | 108 ---
 1 file changed, 33 insertions(+), 75 deletions(-)

diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index c9d90eb64046..814258641fcc 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -65,49 +65,24 @@ static DEFINE_MUTEX(nf_hook_mutex);
 #define nf_entry_dereference(e) \
rcu_dereference_protected(e, lockdep_is_held(_hook_mutex))
 
-static struct nf_hook_entry *nf_hook_entry_head(struct net *net,
-   const struct nf_hook_ops *reg)
+static struct nf_hook_entry __rcu **nf_hook_entry_head(struct net *net, const 
struct nf_hook_ops *reg)
 {
-   struct nf_hook_entry *hook_head = NULL;
-
if (reg->pf != NFPROTO_NETDEV)
-   hook_head = nf_entry_dereference(net->nf.hooks[reg->pf]
-[reg->hooknum]);
-   else if (reg->hooknum == NF_NETDEV_INGRESS) {
+   return net->nf.hooks[reg->pf]+reg->hooknum;
+
 #ifdef CONFIG_NETFILTER_INGRESS
+   if (reg->hooknum == NF_NETDEV_INGRESS) {
if (reg->dev && dev_net(reg->dev) == net)
-   hook_head =
-   nf_entry_dereference(
-   reg->dev->nf_hooks_ingress);
-#endif
+   return >dev->nf_hooks_ingress;
}
-   return hook_head;
-}
-
-/* must hold nf_hook_mutex */
-static void nf_set_hooks_head(struct net *net, const struct nf_hook_ops *reg,
- struct nf_hook_entry *entry)
-{
-   switch (reg->pf) {
-   case NFPROTO_NETDEV:
-#ifdef CONFIG_NETFILTER_INGRESS
-   /* We already checked in nf_register_net_hook() that this is
-* used from ingress.
-*/
-   rcu_assign_pointer(reg->dev->nf_hooks_ingress, entry);
 #endif
-   break;
-   default:
-   rcu_assign_pointer(net->nf.hooks[reg->pf][reg->hooknum],
-  entry);
-   break;
-   }
+   return NULL;
 }
 
 int nf_register_net_hook(struct net *net, const struct nf_hook_ops *reg)
 {
-   struct nf_hook_entry *hooks_entry;
-   struct nf_hook_entry *entry;
+   struct nf_hook_entry __rcu **pp;
+   struct nf_hook_entry *entry, *p;
 
if (reg->pf == NFPROTO_NETDEV) {
 #ifndef CONFIG_NETFILTER_INGRESS
@@ -119,6 +94,10 @@ int nf_register_net_hook(struct net *net, const struct 
nf_hook_ops *reg)
return -EINVAL;
}
 
+   pp = nf_hook_entry_head(net, reg);
+   if (!pp)
+   return -EINVAL;
+
entry = kmalloc(sizeof(*entry), GFP_KERNEL);
if (!entry)
return -ENOMEM;
@@ -128,26 +107,15 @@ int nf_register_net_hook(struct net *net, const struct 
nf_hook_ops *reg)
entry->next = NULL;
 
 

[PATCH net] net_sched: do not broadcast RTM_GETTFILTER result

2016-10-09 Thread Eric Dumazet
From: Eric Dumazet 

There are two ways to get tc filters from kernel to user space.

1) Full dump (tc_dump_tfilter())
2) RTM_GETTFILTER to get one precise filter, reducing overhead.

The second operation is unfortunately broadcasting its result,
polluting "tc monitor" users.

This patch makes sure only the requester gets the result, using
netlink_unicast() instead of rtnetlink_send()

Jamal cooked an iproute2 patch to implement "tc filter get" operation,
but other user space libraries already use RTM_GETTFILTER when a single
filter is queried, instead of dumping all filters.

Signed-off-by: Eric Dumazet 
Cc: Jamal Hadi Salim 
---
 net/sched/cls_api.c |   18 +++---
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 11da7da0b7c4..2ee29a3375f6 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -101,7 +101,7 @@ EXPORT_SYMBOL(unregister_tcf_proto_ops);
 
 static int tfilter_notify(struct net *net, struct sk_buff *oskb,
  struct nlmsghdr *n, struct tcf_proto *tp,
- unsigned long fh, int event);
+ unsigned long fh, int event, bool unicast);
 
 static void tfilter_notify_chain(struct net *net, struct sk_buff *oskb,
 struct nlmsghdr *n,
@@ -112,7 +112,7 @@ static void tfilter_notify_chain(struct net *net, struct 
sk_buff *oskb,
 
for (it_chain = chain; (tp = rtnl_dereference(*it_chain)) != NULL;
 it_chain = >next)
-   tfilter_notify(net, oskb, n, tp, 0, event);
+   tfilter_notify(net, oskb, n, tp, 0, event, false);
 }
 
 /* Select new prio value from the range, managed by kernel. */
@@ -319,7 +319,8 @@ replay:
 
RCU_INIT_POINTER(*back, next);
 
-   tfilter_notify(net, skb, n, tp, fh, RTM_DELTFILTER);
+   tfilter_notify(net, skb, n, tp, fh,
+  RTM_DELTFILTER, false);
tcf_destroy(tp, true);
err = 0;
goto errout;
@@ -345,14 +346,14 @@ replay:
struct tcf_proto *next = 
rtnl_dereference(tp->next);
 
tfilter_notify(net, skb, n, tp, fh,
-  RTM_DELTFILTER);
+  RTM_DELTFILTER, false);
if (tcf_destroy(tp, false))
RCU_INIT_POINTER(*back, next);
}
goto errout;
case RTM_GETTFILTER:
err = tfilter_notify(net, skb, n, tp, fh,
-RTM_NEWTFILTER);
+RTM_NEWTFILTER, true);
goto errout;
default:
err = -EINVAL;
@@ -367,7 +368,7 @@ replay:
RCU_INIT_POINTER(tp->next, rtnl_dereference(*back));
rcu_assign_pointer(*back, tp);
}
-   tfilter_notify(net, skb, n, tp, fh, RTM_NEWTFILTER);
+   tfilter_notify(net, skb, n, tp, fh, RTM_NEWTFILTER, false);
} else {
if (tp_created)
tcf_destroy(tp, true);
@@ -419,7 +420,7 @@ nla_put_failure:
 
 static int tfilter_notify(struct net *net, struct sk_buff *oskb,
  struct nlmsghdr *n, struct tcf_proto *tp,
- unsigned long fh, int event)
+ unsigned long fh, int event, bool unicast)
 {
struct sk_buff *skb;
u32 portid = oskb ? NETLINK_CB(oskb).portid : 0;
@@ -433,6 +434,9 @@ static int tfilter_notify(struct net *net, struct sk_buff 
*oskb,
return -EINVAL;
}
 
+   if (unicast)
+   return netlink_unicast(net->rtnl, skb, portid, MSG_DONTWAIT);
+
return rtnetlink_send(skb, net, portid, RTNLGRP_TC,
  n->nlmsg_flags & NLM_F_ECHO);
 }




Re: net: BUG still has locks held in unix_stream_splice_read

2016-10-09 Thread Al Viro
On Mon, Oct 10, 2016 at 03:46:07AM +0100, Al Viro wrote:
> On Sun, Oct 09, 2016 at 12:06:14PM +0200, Dmitry Vyukov wrote:
> > I suspect this is:
> > 
> > commit 25869262ef7af24ccde988867ac3eb1c3d4b88d4
> > Author: Al Viro 
> > Date:   Sat Sep 17 21:02:10 2016 -0400
> > skb_splice_bits(): get rid of callback
> > since pipe_lock is the outermost now, we don't need to drop/regain
> > socket locks around the call of splice_to_pipe() from skb_splice_bits(),
> > which kills the need to have a socket-specific callback; we can just
> > call splice_to_pipe() and be done with that.
> 
> Unlikely, since that particular commit removes unlocking/relocking ->iolock
> around the call of splice_to_pipe().  Original would've retaken the same
> lock on the way out; it's not as if we could leave the syscall there.
> 
> It might be splice-related, but I don't believe that you've got the right
> commit here.

It's not that commit, all right - it's "can't call unix_stream_read_generic()
with any locks held" stepped onto a couple of commits prior by
"splice: lift pipe_lock out of splice_to_pipe()".  Could somebody explain
what is that about?

E.g what will happen if some code does a read on AF_UNIX socket with
some local mutex held?  AFAICS, there are exactly two callers of
freezable_schedule_timeout() - this one and one in XFS; the latter is
in a kernel thread where we do have good warranties about the locking
environment, but here it's in the bleeding ->recvmsg/->splice_read and
for those assumption that caller doesn't hold any locks is pretty
strong, especially since it's not documented anywhere.

What's going on there?


Re: slab corruption with current -git (was Re: [git pull] vfs pile 1 (splice))

2016-10-09 Thread Linus Torvalds
On Sun, Oct 9, 2016 at 6:35 PM, Aaron Conole  wrote:
>
> I was just about to build and test something similar:

So I haven't actually tested that one, but looking at the code, it
really looks very bogus. In fact, that code just looks like crap. It
does *not* do a proper "remove singly linked list entry". It's exactly
the kind of code that I rail against, and that people should never
write.

Any code that can't even traverse a linked list is not worth looking at.

There is one *correct* way to remove an entry from a singly linked
list, and it looks like this:

struct entry **pp, *p;

pp = 
while ((p = *pp) != NULL) {
if (right_entry(p)) {
*pp = p->next;
break;
}
pp = >next;
}

and that's it. Nothing else. The above code exits the loop with "p"
containing the entry that was removed, or NULL if nothing was. It
can't get any simpler than that, but more importantly, anything more
complicated than that is WRONG.

Seriously, nothing else is acceptable. In particular, any linked list
traversal that makes a special case of the first entry or the last
entry should not be allowed to exist. Note how there is not a single
special case in the above correct code. It JustWorks(tm).

That nf_unregister_net_hook() code has all the signs of exactly that
kind of broken list-handling code: special-casing the head of the
loop, and having the loop condition test both current and that odd
"next to next" pointer etc. It's all very very wrong.

So I really see two options:

 - do that singly-linked list traversal right (and I'm serious:
nothing but the code above can ever be right)

 - don't make up your own list handling code at all, and use the
standard linux list code.

So either e3b37f11e6e4 needs to be reverted, or it needs to be taught
to use real list handling.  If the code doesn't want to use the
regular list.h (either the doubly linked one, or the hlist one), it
needs to at least learn to do list removal right.

   Linus


Re: net: BUG still has locks held in unix_stream_splice_read

2016-10-09 Thread Al Viro
On Sun, Oct 09, 2016 at 12:06:14PM +0200, Dmitry Vyukov wrote:
> I suspect this is:
> 
> commit 25869262ef7af24ccde988867ac3eb1c3d4b88d4
> Author: Al Viro 
> Date:   Sat Sep 17 21:02:10 2016 -0400
> skb_splice_bits(): get rid of callback
> since pipe_lock is the outermost now, we don't need to drop/regain
> socket locks around the call of splice_to_pipe() from skb_splice_bits(),
> which kills the need to have a socket-specific callback; we can just
> call splice_to_pipe() and be done with that.

Unlikely, since that particular commit removes unlocking/relocking ->iolock
around the call of splice_to_pipe().  Original would've retaken the same
lock on the way out; it's not as if we could leave the syscall there.

It might be splice-related, but I don't believe that you've got the right
commit here.


Re: [PATCH iproute2 0/2] ip rule: merger iprule_flush and add selector support

2016-10-09 Thread Stephen Hemminger
On Thu, 22 Sep 2016 14:28:47 +0800
Hangbin Liu  wrote:

> When merge iprule_flush() and iprule_list_or_save(). Renamed
> rtnl_filter_t filter to filter_fn because we want to use global
> variable 'filter' to filter nlmsg in the next patch.
> 
> Hangbin Liu (2):
>   ip rule: merge ip rule flush and list, save together
>   ip rule: add selector support
> 
>  ip/iprule.c| 295 
> +
>  man/man8/ip-rule.8 |   6 +-
>  2 files changed, 231 insertions(+), 70 deletions(-)
> 
> -- 

Applied to net-next (for 4.9)


Re: [PATCH iproute2] ip link: Add support to configure SR-IOV VF to vlan protocol 802.1ad (VST QinQ)

2016-10-09 Thread Stephen Hemminger
On Wed, 28 Sep 2016 10:58:59 +0300
Tariq Toukan  wrote:

> From: Moshe Shemesh 
> 
> Introduce a new API that exposes a list of vlans per VF (IFLA_VF_VLAN_LIST),
> giving the ability for user-space application to specify it for the VF as
> an option to support 802.1ad (VST QinQ).
> 
> We introduce struct vf_vlan_info, which extends struct vf_vlan and adds
> an optional VF VLAN proto parameter.
> Default VLAN-protocol is 802.1Q.
> 
> Add IFLA_VF_VLAN_LIST in addition to IFLA_VF_VLAN to keep backward
> compatibility with older kernel versions.
> 
> Suitable ip link tool command examples:
>  - Set vf vlan protocol 802.1ad (S-TAG)
>   ip link set eth0 vf 1 vlan 100 proto 802.1ad
>  - Set vf vlan S-TAG and vlan C-TAG (VST QinQ)
>   ip link set eth0 vf 1 vlan 100 proto 802.1ad vlan 30 proto 802.1Q
>  - Set vf to VST (802.1Q) mode
>   ip link set eth0 vf 1 vlan 100 proto 802.1Q
>  - Or by omitting the new parameter (backward compatible)
>   ip link set eth0 vf 1 vlan 100
> 
> Signed-off-by: Moshe Shemesh 
> Signed-off-by: Tariq Toukan 

Applied to net-next (for 4.9)


Re: [PATCH v2 iproute2 net-next] tc: m_vlan: Add vlan modify action

2016-10-09 Thread Stephen Hemminger
On Thu, 22 Sep 2016 21:01:05 +0300
Shmulik Ladkani  wrote:

> The 'vlan modify' action allows to replace an existing 802.1q tag
> according to user provided settings.
> It accepts same arguments as the 'vlan push' action.
> 
> For example, this replaces vid 6 with vid 5:
> 
>  # tc filter add dev veth0 parent : pref 1 protocol 802.1q \
>   basic match 'meta(vlan mask 0xfff eq 6)' \
>   action vlan modify id 5 continue
> 
> Signed-off-by: Shmulik Ladkani 

Applied to net-next (for 4.9)


Re: [PATCH iproute2 net-next] ipmroute: add support for age dumping

2016-10-09 Thread Stephen Hemminger
On Wed, 21 Sep 2016 11:45:58 +0200
Nikolay Aleksandrov  wrote:

> Add support to dump the mroute cache entry age if the show_stats (-s)
> switch is provided.
> Example:
> $ ip -s mroute
> (0.0.0.0, 239.10.10.10)  Iif: eth0   Oifs: eth0
>   0 packets, 0 bytes, Age  245.44
> 
> Signed-off-by: Nikolay Aleksandrov 

Applied to net-next (pending for 4.9)


Re: [PATCH iproute2-next] tc: fq: display unthrottle latency

2016-10-09 Thread Stephen Hemminger
On Wed, 28 Sep 2016 06:23:15 -0700
Eric Dumazet  wrote:

> From: Eric Dumazet 
> 
> In linux-4.9 fq packet scheduler got a new stat :
> 
> unthrottle_latency in nano second units.
> 
> Gives a good indication of system load or timer implementation
> latencies.
> 
> Signed-off-by: Eric Dumazet 

Applied to net-next (for 4.9)



Re: [PATCH iproute2 6/9] actions: add skbmod action

2016-10-09 Thread Stephen Hemminger
On Sat,  1 Oct 2016 16:48:34 -0400
Jamal Hadi Salim  wrote:

> From: Jamal Hadi Salim 
> 
> This action is intended to be an upgrade from a usability perspective
> from pedit (as well as operational debugability).
> Compare this:
> 
> sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
> u32 match ip protocol 1 0xff flowid 1:2 \
> action pedit munge offset -14 u8 set 0x02 \
> munge offset -13 u8 set 0x15 \
> munge offset -12 u8 set 0x15 \
> munge offset -11 u8 set 0x15 \
> munge offset -10 u16 set 0x1515 \
> pipe
> 
> to:
> 
> sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
> u32 match ip protocol 1 0xff flowid 1:2 \
> action skbmod dmac 02:15:15:15:15:15
> 
> Or worse, try to debug a policy with destination mac, source mac and
> etherype. Then make that a hundred rules and you'll get my point.
> 
> The most important ethernet use case at the moment is when redirecting or
> mirroring packets to a remote machine. The dst mac address needs a re-write
> so that it doesnt get dropped or confuse an interconnecting (learning) switch
> or dropped by a target machine (which looks at the dst mac).
> 
> In the future common use cases on pedit can be migrated to this action
> (as an example different fields in ip v4/6, transports like tcp/udp/sctp
> etc). For this first cut, this allows modifying basic ethernet header.
> 
> Signed-off-by: Jamal Hadi Salim 

Lots of checkpatch errors on this. Please fix and resubmit series.
For example:

ERROR: spaces required around that '+=' (ctx:WxV)
#442: FILE: tc/m_skbmod.c:79:
+   ok +=1;


ERROR: code indent should use tabs where possible
#567: FILE: tc/m_skbmod.c:204:
+SPRINT_BUF(b1);$

WARNING: please, no spaces at the start of a line
#567: FILE: tc/m_skbmod.c:204:
+SPRINT_BUF(b1);$

ERROR: code indent should use tabs where possible
#568: FILE: tc/m_skbmod.c:205:
+SPRINT_BUF(b2);$

WARNING: please, no spaces at the start of a line
#568: FILE: tc/m_skbmod.c:205:
+SPRINT_BUF(b2);$

WARNING: braces {} are not necessary for single statement blocks
#610: FILE: tc/m_skbmod.c:247:
+   if (p->flags & SKBMOD_F_SWAPMAC) {
+   fprintf(f, "swap mac ");
+   }


ERROR: trailing whitespace
#816: FILE: man/man8/tc-skbmod.8:28:
+.IR CONTROL " := {"
 $


[ANNOUNCE] iproute 4.8

2016-10-09 Thread Stephen Hemminger
Release of iproute2 for Linux 4.8, slightly late because of netdev.

Update to iproute2 utility to support new features in Linux 4.8.
Includes support for MACSEC and ILA.
Plus the usual array of documentation and minor fixes.

Source:
  http://www.kernel.org/pub/linux/utils/net/iproute2/iproute2-4.8.0.tar.gz

Repository:
  git://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git

Report problems (or enhancements) to the netdev@vger.kernel.org mailing list.

---
Andrey Jr. Melnikov (1):
  iproute: disallow ip rule del without parameters

David Ahern (1):
  ip rule: Add support for l3mdev rules

Davide Caratti (5):
  macsec: fix input of 'port', improve documentation of 'address'
  man: ip.8: add missing 'macsec' item to OBJECT list
  macsec: fix byte ordering on input/display of 'sci'
  tc: don't accept qdisc 'handle' greater than 
  macsec: fix input range of 'icvlen' parameter

Eric Dumazet (1):
  ip: report IFLA_GSO_MAX_SIZE and IFLA_GSO_MAX_SEGS

Gustavo Zacarias (1):
  ss: fix build with musl libc

Hangbin Liu (5):
  nstat: add sctp snmp support
  gitignore: Ignore 'tags' file generated by ctags
  ip route: check ftell, fseek return value
  misc/ss: tcp cwnd should be unsigned
  ip: Use specific slave id

Hannes Frederic Sowa (1):
  iptuntap: show processes using tuntap interface

Igor Ryzhov (1):
  fix netlink message length checks

Iskren Chernev (1):
  iproute: fix documentation for ip rule scan order

Jamal Hadi Salim (2):
  actions: skbedit add support for mod-ing skb pkt_type
  tc classifiers: Modernize tcindex classifier

Jiri Benc (2):
  vxlan: group address requires net device
  tunnels: use macros for IPv6 address comparison

Liping Zhang (1):
  ipmonitor: fix ip monitor can't work when NET_NS is not enabled

Nikolay Aleksandrov (1):
  ip: route: fix multicast route dumps

Or Gerlitz (1):
  devlink: Add e-switch support

Phil Sutter (4):
  man: ip-link.8: Document missing geneve options
  ip-link: add missing {min,max}_tx_rate to help text
  ip-route: Prevent some double spaces in output
  iproute: fix documentation for ip rule scan order

Richard Alpe (2):
  tipc: fix UDP bearer synopsis
  tipc: refactor bearer identification

Roman Mashak (3):
  police: add extra space to improve police result printing
  police: improve usage message
  police: bug fix man page

Roopa Prabhu (2):
  bridge: print_vlan: add missing check for json instance
  bridge: print_vlan: add missing check for json instance

Sabrina Dubroca (4):
  libgenl: introduce genl_init_handle
  macsec: show usage even if the module is not available
  fou: show usage even if the module is not available
  ila: show usage even if the module is not available

Simon Horman (1):
  iproute2: correct port in FOU/GRE example

Stephen Hemminger (14):
  minor header update from net-next
  fib_rules.h update header file
  iprule: whitespace cleanup
  update kernel headers (net-next)
  update kernel header (4.7 net-next)
  update headers files to current net-next
  include: update net-next XDP headers
  update kernel headers
  update BPF headers
  devlink: whitespace cleanup
  remove useless return statement
  ip: iptuntap cleanup
  update kernel headers from 4.8-rc4
  v4.8.0

Sushma Sitaram (1):
  tc: f_u32: Fill in 'linkid' provided by user

Thomas Graf (1):
  tuntap: Add name attribute to usage text

Tom Herbert (6):
  ila: Support for checksum neutral translation
  ila: Support for configuring ila to use netfilter hook
  ip6tnl: Support for fou encapsulation
  gre6: Support for fou encapsulation
  fou: Allowing configuring IPv6 listener
  ipila: Fixed unitialized variables

WANG Cong (1):
  tc: fix a misleading failure

Xin Long (1):
  ip route: restore_handler should check tb[RTA_PREFSRC] for local networks

Yotam Gigi (2):
  tc: Add support for the matchall traffic classifier.
  tc: man: Add man entry for the matchall classifier.

anuradhak (1):
  bridge: Fix garbled json output seen if a vlan filter is specified



Re: [PATCH iproute2] fix netlink message length checks

2016-10-09 Thread Stephen Hemminger
On Tue,  4 Oct 2016 13:16:55 +0300
Igor Ryzhov  wrote:

> Signed-off-by: Igor Ryzhov 

Makes sense applied, I wonder why one of the static checkers didn't see this.


Re: [iproute2] bridge: Fix garbled json output seen if a vlan filter is specified

2016-10-09 Thread Stephen Hemminger
On Fri,  7 Oct 2016 09:40:18 -0700
Anuradha Karuppiah  wrote:

> From: anuradhak 
> 
> json objects were started but not completed if the fdb vlan did not
> match the specified filter vlan.
> 
> Sample output:
> $ bridge -j fdb show vlan 111
> [{
> "mac": "44:38:39:00:69:88",
> "dev": "br0",
> "vlan": 111,
> "master": "br0",
> "state": "permanent"
> }
> ]
> $ bridge -j fdb show vlan 100
> []
> $
> 
> Signed-off-by: Anuradha Karuppiah 

Applied for 4.8 version


Re: [PATCH] tc: f_u32: Fill in 'linkid' provided by user

2016-10-09 Thread Stephen Hemminger
On Wed, 28 Sep 2016 11:30:16 -0700
Sushma Sitaram  wrote:

> Currently, 'linkid' input by the user is parsed but 'handle' is appended to 
> the netlink message.
> 
> # tc filter add dev enp1s0f1 protocol ip parent : prio 99 u32 ht 800: \
>   order 1 link 1: offset at 0 mask 0f00 shift 6 plus 0 eat match ip \
>   protocol 6 ff
> 
> resulted in:
> filter protocol ip pref 99 u32 fh 800::1 order 1 key ht 800 bkt 0
>   match 0006/00ff at 8
> offset 0f00>>6 at 0  eat
> 
> This patch results in:
> filter protocol ip pref 99 u32 fh 800::1 order 1 key ht 800 bkt 0 link 1:
>   match 0006/00ff at 8
> offset 0f00>>6 at 0  eat
> 
> 
> Signed-off-by Sushma Sitaram: Sushma Sitaram 

Applied (for 4.8).


Re: [PATCH v3] iproute2: build nsid-name cache only for commands that need it

2016-10-09 Thread Stephen Hemminger
On Tue, 20 Sep 2016 06:01:27 +
Anton Aksola  wrote:

> The calling of netns_map_init() before command parsing introduced
> a performance issue with large number of namespaces.
> 
> As commands such as add, del and exec do not need to iterate through
> /var/run/netns it would be good not no build the cache before executing
> these commands.
> 
> Example:
> unpatched:
> time seq 1 1000 | xargs -n 1 ip netns add
> 
> real0m16.832s
> user0m1.350s
> sys0m15.029s
> 
> patched:
> time seq 1 1000 | xargs -n 1 ip netns add
> 
> real0m3.859s
> user0m0.132s
> sys0m3.205s
> 
> Signed-off-by: Anton Aksola 
> ---
>  ip/ip_common.h|  1 +
>  ip/ipmonitor.c|  1 +
>  ip/ipnetns.c  | 31 
> ++-
>  testsuite/tests/ip/netns/set_nsid.t   | 22 ++
>  testsuite/tests/ip/netns/set_nsid_batch.t | 18 ++
>  5 files changed, 64 insertions(+), 9 deletions(-)
>  create mode 100755 testsuite/tests/ip/netns/set_nsid.t
>  create mode 100755 testsuite/tests/ip/netns/set_nsid_batch.t
> 

Applied to net-next (ie for 4.9)


Re: slab corruption with current -git (was Re: [git pull] vfs pile 1 (splice))

2016-10-09 Thread Aaron Conole
Florian Westphal  writes:

> Linus Torvalds  wrote:
>> On Sun, Oct 9, 2016 at 12:11 PM, Linus Torvalds
>>  wrote:
>> >
>> > Anyway, I don't think I can bisect it, but I'll try to narrow it down
>> > a *bit* at least.
>> >
>> > Not doing any more pulls on this unstable base, I've been puttering
>> > around in trying to clean up some stupid printk logging issues
>> > instead.
>> 
>> So I finally got a oops with slub debugging enabled. It doesn't really
>> narrow things down, though, it kind of extends on the possible
>> suspects. Now adding David Miller and Pablo, because it looks like it
>> may be netfilter that does something bad and corrupts memory.
>
> Quite possible, the netns interactions are not nice :-/
>
>> Without further ado, here's the new oops:
>> 
>>general protection fault:  [#1] SMP
>>CPU: 7 PID: 169 Comm: kworker/u16:7 Not tainted
>> 4.8.0-11288-gb66484cd7470 #1
>>Hardware name: System manufacturer System Product Name/Z170-K, BIOS
> ..
>>Call Trace:
>>  netfilter_net_exit+0x2f/0x60
>>  ops_exit_list.isra.4+0x38/0x60
>>  cleanup_net+0x1ba/0x2a0
>>  process_one_work+0x1f1/0x480
>>  worker_thread+0x48/0x4d0
>>  ? process_one_work+0x480/0x480
>
> ..
>
>> like it's a pointer loaded from a free'd allocation.
>> 
>> The code disassembles to
>> 
>>0: 0f b6 ca movzbl %dl,%ecx
>>3: 48 8d 84 c8 00 01 00 lea0x100(%rax,%rcx,8),%rax
>>a: 00
>>b: 49 8b 5c c5 00   mov0x0(%r13,%rax,8),%rbx
>>   10: 48 85 db test   %rbx,%rbx
>>   13: 0f 84 cb 00 00 00 je 0xe4
>>   19: 4c 3b 63 40   cmp0x40(%rbx),%r12
>>   1d: 48 8b 03 mov(%rbx),%rax
>>   20: 0f 84 e9 00 00 00 je 0x10f
>>   26: 48 85 c0 test   %rax,%rax
>>   29: 74 26 je 0x51
>>   2b:* 4c 3b 60 40   cmp0x40(%rax),%r12 <-- trapping instruction
>>   2f: 75 08 jne0x39
>>   31: e9 ef 00 00 00   jmpq   0x125
>>   36: 48 89 d8 mov%rbx,%rax
>>   39: 48 8b 18 mov(%rax),%rbx
>>   3c: 48 85 db test   %rbx,%rbx
>> 
>> and that oopsing instruction seems to be the compare of
>> "hooks_entry->orig_ops" from hooks_entry in this expression:
>> 
>> if (hooks_entry && hooks_entry->orig_ops == reg) {
>> 
>> so hooks_entry() is bogus. It was gotten from
>> 
>> hooks_entry = nf_hook_entry_head(net, reg);
>> 
>> but that's as far as I dug. And yes, I do have
>> CONFIG_NETFILTER_INGRESS=y in case that matters.
>> 
>> And all this code has changed pretty radically in commit e3b37f11e6e4
>> ("netfilter: replace list_head with single linked list"), and there
>> was clearly already something wrong with that code, with commit
>> 5119e4381a90 ("netfilter: Fix potential null pointer dereference")
>> adding the test against NULL. But I suspect that only hid the "oops,
>> it's actually not NULL, it loaded some uninitialized value" problem.
>> 
>> Over to the networking guys.. Ideas?
>
> Sorry, not off the top of my head.
> Pablo is currently travelling back home from netdev 1.2 in Tokyo,
> I can help starting Wednesday when I am back.
>
> One shot in the dark (not even compile tested; wonder if we can end up
> zapping bogus hook ...)
>

I was just about to build and test something similar:

diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index c9d90eb..e84103f 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -189,7 +189,7 @@ void nf_unregister_net_hook(struct net *net, const struct 
nf_hook_ops *reg)
 
 unlock:
mutex_unlock(_hook_mutex);
-   if (!hooks_entry) {
+   if (!hooks_entry || hooks_entry->orig_ops != reg) {
WARN(1, "nf_unregister_net_hook: hook not found!\n");
return;
}


Re: [PATCH kernel v2] PCI: Enable access to custom VPD for Chelsio devices (cxgb3)

2016-10-09 Thread Alexey Kardashevskiy
Anyone, ping?

On 29/09/16 15:21, Alexey Kardashevskiy wrote:
> There is at least one Chelsio 10Gb card which uses VPD area to store
> some custom blocks (example below). However pci_vpd_size() returns
> the length of the first block only assuming that there can be only
> one VPD "End Tag" and VFIO blocks access beyond that offset
> (since 4e1a63555) which leads to the situation when the guest "cxgb3"
> driver fails to probe the device. The host system does not have this
> problem as the drives accesses the config space directly without
> pci_read_vpd()/...
> 
> This adds a quirk to override the VPD size to a bigger value.
> The maximum size is taken from EEPROMSIZE in
> drivers/net/ethernet/chelsio/cxgb3/common.h. We do not read the tag
> as the cxgb3 driver does as the driver supports writing to EEPROM/VPD
> and when it writes, it only checks for 8192 bytes boundary. The quirk
> is registerted for all devices supported by the cxgb3 driver.
> 
> This adds a quirk to the PCI layer (not to the cxgb3 driver) as
> the cxgb3 driver itself accesses VPD directly and the problem only exists
> with the vfio-pci driver (when cxgb3 is not running on the host and
> may not be even loaded) which blocks accesses beyond the first block
> of VPD data. However vfio-pci itself does not have quirks mechanism so
> we add it to PCI.
> 
> Tested on:
> Ethernet controller [0200]: Chelsio Communications Inc T310 10GbE Single Port 
> Adapter [1425:0030]
> 
> This is its VPD:
>  Large item 42 bytes; name 0x2 Identifier String
>   b'10 Gigabit Ethernet-SR PCI Express Adapter'
>   #00 [EC] len=7: b'D76809 '
>   #0a [FN] len=7: b'46K7897'
>   #14 [PN] len=7: b'46K7897'
>   #1e [MN] len=4: b'1037'
>   #25 [FC] len=4: b'5769'
>   #2c [SN] len=12: b'YL102035603V'
>   #3b [NA] len=12: b'00145E992ED1'
> 
> 0c00 Large item 16 bytes; name 0x2 Identifier String
> b'S310E-SR-X  '
> 0c13 Large item 234 bytes; name 0x10
> #00 [PN] len=16: b'TBD '
> #13 [EC] len=16: b'110107730D2 '
> #26 [SN] len=16: b'97YL102035603V  '
> #39 [NA] len=12: b'00145E992ED1'
> #48 [V0] len=6: b'175000'
> #51 [V1] len=6: b'26'
> #5a [V2] len=6: b'26'
> #63 [V3] len=6: b'2000  '
> #6c [V4] len=2: b'1 '
> #71 [V5] len=6: b'c2'
> #7a [V6] len=6: b'0 '
> #83 [V7] len=2: b'1 '
> #88 [V8] len=2: b'0 '
> #8d [V9] len=2: b'0 '
> #92 [VA] len=2: b'0 '
> #97 [RV] len=80: 
> b's\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'...
> 0d00 Large item 252 bytes; name 0x11
> #00 [VC] len=16: b'122310_1222 dp  '
> #13 [VD] len=16: b'610-0001-00 H1\x00\x00'
> #26 [VE] len=16: b'122310_1353 fp  '
> #39 [VF] len=16: b'610-0001-00 H1\x00\x00'
> #4c [RW] len=173: 
> b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'...
> 0dff Small item 0 bytes; name 0xf End Tag
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
> Changes:
> v2:
> * used pci_set_vpd_size() helper
> * added explicit list of IDs from cxgb3 driver
> * added a note in the commit log why the quirk is not in cxgb3
> ---
>  drivers/pci/quirks.c | 22 ++
>  1 file changed, 22 insertions(+)
> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 44e0ff3..b22fce5 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -3243,6 +3243,28 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 
> PCI_DEVICE_ID_INTEL_CACTUS_RIDGE_4C
>  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_PORT_RIDGE,
>   quirk_thunderbolt_hotplug_msi);
>  
> +static void quirk_chelsio_extend_vpd(struct pci_dev *dev)
> +{
> + if (!dev->vpd)
> + return;
> +
> + pci_set_vpd_size(dev, max_t(unsigned int, dev->vpd->len, 8192));
> +}
> +
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CHELSIO, 0x20, 
> quirk_chelsio_extend_vpd);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CHELSIO, 0x21, 
> quirk_chelsio_extend_vpd);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CHELSIO, 0x22, 
> quirk_chelsio_extend_vpd);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CHELSIO, 0x23, 
> quirk_chelsio_extend_vpd);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CHELSIO, 0x24, 
> quirk_chelsio_extend_vpd);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CHELSIO, 0x25, 
> quirk_chelsio_extend_vpd);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CHELSIO, 0x26, 
> quirk_chelsio_extend_vpd);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CHELSIO, 0x30, 
> quirk_chelsio_extend_vpd);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CHELSIO, 0x31, 
> quirk_chelsio_extend_vpd);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CHELSIO, 0x32, 
> quirk_chelsio_extend_vpd);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CHELSIO, 0x35, 
> quirk_chelsio_extend_vpd);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CHELSIO, 0x36, 
> quirk_chelsio_extend_vpd);
> 

Re: slab corruption with current -git (was Re: [git pull] vfs pile 1 (splice))

2016-10-09 Thread Florian Westphal
Linus Torvalds  wrote:
> On Sun, Oct 9, 2016 at 12:11 PM, Linus Torvalds
>  wrote:
> >
> > Anyway, I don't think I can bisect it, but I'll try to narrow it down
> > a *bit* at least.
> >
> > Not doing any more pulls on this unstable base, I've been puttering
> > around in trying to clean up some stupid printk logging issues
> > instead.
> 
> So I finally got a oops with slub debugging enabled. It doesn't really
> narrow things down, though, it kind of extends on the possible
> suspects. Now adding David Miller and Pablo, because it looks like it
> may be netfilter that does something bad and corrupts memory.

Quite possible, the netns interactions are not nice :-/

> Without further ado, here's the new oops:
> 
>general protection fault:  [#1] SMP
>CPU: 7 PID: 169 Comm: kworker/u16:7 Not tainted 4.8.0-11288-gb66484cd7470 
> #1
>Hardware name: System manufacturer System Product Name/Z170-K, BIOS
..
>Call Trace:
>  netfilter_net_exit+0x2f/0x60
>  ops_exit_list.isra.4+0x38/0x60
>  cleanup_net+0x1ba/0x2a0
>  process_one_work+0x1f1/0x480
>  worker_thread+0x48/0x4d0
>  ? process_one_work+0x480/0x480

..

> like it's a pointer loaded from a free'd allocation.
> 
> The code disassembles to
> 
>0: 0f b6 ca movzbl %dl,%ecx
>3: 48 8d 84 c8 00 01 00 lea0x100(%rax,%rcx,8),%rax
>a: 00
>b: 49 8b 5c c5 00   mov0x0(%r13,%rax,8),%rbx
>   10: 48 85 db test   %rbx,%rbx
>   13: 0f 84 cb 00 00 00 je 0xe4
>   19: 4c 3b 63 40   cmp0x40(%rbx),%r12
>   1d: 48 8b 03 mov(%rbx),%rax
>   20: 0f 84 e9 00 00 00 je 0x10f
>   26: 48 85 c0 test   %rax,%rax
>   29: 74 26 je 0x51
>   2b:* 4c 3b 60 40   cmp0x40(%rax),%r12 <-- trapping instruction
>   2f: 75 08 jne0x39
>   31: e9 ef 00 00 00   jmpq   0x125
>   36: 48 89 d8 mov%rbx,%rax
>   39: 48 8b 18 mov(%rax),%rbx
>   3c: 48 85 db test   %rbx,%rbx
> 
> and that oopsing instruction seems to be the compare of
> "hooks_entry->orig_ops" from hooks_entry in this expression:
> 
> if (hooks_entry && hooks_entry->orig_ops == reg) {
> 
> so hooks_entry() is bogus. It was gotten from
> 
> hooks_entry = nf_hook_entry_head(net, reg);
> 
> but that's as far as I dug. And yes, I do have
> CONFIG_NETFILTER_INGRESS=y in case that matters.
> 
> And all this code has changed pretty radically in commit e3b37f11e6e4
> ("netfilter: replace list_head with single linked list"), and there
> was clearly already something wrong with that code, with commit
> 5119e4381a90 ("netfilter: Fix potential null pointer dereference")
> adding the test against NULL. But I suspect that only hid the "oops,
> it's actually not NULL, it loaded some uninitialized value" problem.
> 
> Over to the networking guys.. Ideas?

Sorry, not off the top of my head.
Pablo is currently travelling back home from netdev 1.2 in Tokyo,
I can help starting Wednesday when I am back.

One shot in the dark (not even compile tested; wonder if we can end up
zapping bogus hook ...)

diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index c9d90eb..fd6a2ce 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -189,6 +189,9 @@ void nf_unregister_net_hook(struct net *net, const struct 
nf_hook_ops *reg)
 
 unlock:
mutex_unlock(_hook_mutex);
+
+   WARN_ON(hooks_entry && hooks_entry->orig_ops != reg);
+
if (!hooks_entry) {
WARN(1, "nf_unregister_net_hook: hook not found!\n");
return;



Re: [PATCH net-next 5/6] qed: Allow chance for fast ramrod completions

2016-10-09 Thread Eric Dumazet
On Sun, 2016-10-09 at 18:25 +0300, Yuval Mintz wrote:
> From: Yuval Mintz 
> 
> Whenever a ramrod is being sent for some device configuration,
> the driver is going to sleep at least 5ms between each iteration
> of polling on the completion of the ramrod.
> 
> However, in almost every configuration scenario the firmware
> would be able to comply and complete the ramrod in a manner of
> several usecs. This is especially important in cases where there
> might be a lot of sequential configurations applying to the hardware
> [e.g., RoCE], in which case the existing scheme might cause some
> visible user delays.
> 
> This patch changes the completion scheme - instead of immediately
> starting to sleep for a 'long' period, allow the device to quickly
> poll on the first iteration after a couple of usecs.
> 
> Signed-off-by: Yuval Mintz 
> ---
>  drivers/net/ethernet/qlogic/qed/qed_spq.c | 85 
> +--
>  1 file changed, 59 insertions(+), 26 deletions(-)
> 
> diff --git a/drivers/net/ethernet/qlogic/qed/qed_spq.c 
> b/drivers/net/ethernet/qlogic/qed/qed_spq.c
> index caff415..259a615 100644
> --- a/drivers/net/ethernet/qlogic/qed/qed_spq.c
> +++ b/drivers/net/ethernet/qlogic/qed/qed_spq.c
> @@ -37,7 +37,11 @@
>  ***/
>  
>  #define SPQ_HIGH_PRI_RESERVE_DEFAULT(1)
> -#define SPQ_BLOCK_SLEEP_LENGTH  (1000)
> +
> +#define SPQ_BLOCK_DELAY_MAX_ITER(10)
> +#define SPQ_BLOCK_DELAY_US  (10)
> +#define SPQ_BLOCK_SLEEP_MAX_ITER(1000)
> +#define SPQ_BLOCK_SLEEP_MS  (5)
>  
>  /***
>  * Blocking Imp. (BLOCK/EBLOCK mode)
> @@ -57,53 +61,81 @@ static void qed_spq_blocking_cb(struct qed_hwfn *p_hwfn,
>   smp_wmb();
>  }
>  
> -static int qed_spq_block(struct qed_hwfn *p_hwfn,
> -  struct qed_spq_entry *p_ent,
> -  u8 *p_fw_ret)
> +static int __qed_spq_block(struct qed_hwfn *p_hwfn,
> +struct qed_spq_entry *p_ent,
> +u8 *p_fw_ret, bool sleep_between_iter)
>  {
> - int sleep_count = SPQ_BLOCK_SLEEP_LENGTH;
>   struct qed_spq_comp_done *comp_done;
> - int rc;
> + u32 iter_cnt;
>  
>   comp_done = (struct qed_spq_comp_done *)p_ent->comp_cb.cookie;
> - while (sleep_count) {
> - /* validate we receive completion update */
> + iter_cnt = sleep_between_iter ? SPQ_BLOCK_SLEEP_MAX_ITER
> +   : SPQ_BLOCK_DELAY_MAX_ITER;
> +
> + while (iter_cnt--) {
> + /* Validate we receive completion update */
>   smp_rmb();
>   if (comp_done->done == 1) {
>   if (p_fw_ret)
>   *p_fw_ret = comp_done->fw_return_code;
>   return 0;
>   }

Note that this smp_rmb() and accesses to ->done and ->fw_return_code are
racy.

fq_return_code needs to be written _before_ done.

Something like the following patch would make sense...

diff --git a/drivers/net/ethernet/qlogic/qed/qed_spq.c 
b/drivers/net/ethernet/qlogic/qed/qed_spq.c
index 
caff41544898baed09f45a41829cb0ba9c719fb9..eefa45eab3728791f4d15a088c4550f318a1d1da
 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_spq.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_spq.c
@@ -50,11 +50,9 @@ static void qed_spq_blocking_cb(struct qed_hwfn *p_hwfn,
 
comp_done = (struct qed_spq_comp_done *)cookie;
 
-   comp_done->done = 0x1;
-   comp_done->fw_return_code   = fw_return_code;
+   comp_done->fw_return_code = fw_return_code;
 
-   /* make update visible to waiting thread */
-   smp_wmb();
+   smp_store_release(_done->done, 0x1);
 }
 
 static int qed_spq_block(struct qed_hwfn *p_hwfn,
@@ -68,8 +66,8 @@ static int qed_spq_block(struct qed_hwfn *p_hwfn,
comp_done = (struct qed_spq_comp_done *)p_ent->comp_cb.cookie;
while (sleep_count) {
/* validate we receive completion update */
-   smp_rmb();
-   if (comp_done->done == 1) {
+   if (READ_ONCE(comp_done->done) == 1) {
+   smp_read_barrier_depends();
if (p_fw_ret)
*p_fw_ret = comp_done->fw_return_code;
return 0;
@@ -87,8 +85,8 @@ static int qed_spq_block(struct qed_hwfn *p_hwfn,
sleep_count = SPQ_BLOCK_SLEEP_LENGTH;
while (sleep_count) {
/* validate we receive completion update */
-   smp_rmb();
-   if (comp_done->done == 1) {
+   if (READ_ONCE(comp_done->done) == 1) {
+   smp_read_barrier_depends();
if (p_fw_ret)
*p_fw_ret = comp_done->fw_return_code;

Re: Accelerated receive flow steering (aRFS) for UDP

2016-10-09 Thread Eric Dumazet
On Sun, 2016-10-09 at 19:48 +, Chopra, Manish wrote:

> Hi Eric, I used "-n" as well with "-N" but still the problem doesn't
> go away.
> 
> This is what I have done -
> 
> Started "netserver" on local/test setup
> 
> #netserver
> Starting netserver with host 'IN(6)ADDR_ANY' port '12865' and family
> AF_UNSPEC
> 
> It starts listening on port "12865"
> 
> From remote setup, started multiple netperf using different ports for
> data sockets specified using "-P" with "-N" and "-n" options specified
> as well.
> netperf -H 192.168.200.40 -l 150 -t UDP_STREAM -T 8,8 -- -N -n -m 1400
> -P 6660,5550 &
> netperf -H 192.168.200.40 -l 150 -t UDP_STREAM -T 9,9 -- -N -n -m 1400
> -P 9990,9880 &
> netperf -H 192.168.200.40 -l 150 -t UDP_STREAM -T 10,10 -- -N -n -m
> 1400 -P 4455,4400 &
> netperf -H 192.168.200.40 -l 150 -t UDP_STREAM -T 11,11 -- -N -n -m
> 1400 -P 3300,7800 &
> netperf -H 192.168.200.40 -l 150 -t UDP_STREAM -T 12,12 -- -N -n -m
> 1400 -P 50512,4 &
> netperf -H 192.168.200.40 -l 150 -t UDP_STREAM -T 13,13 -- -N -n -m
> 1400 -P 10512,45672 &
> netperf -H 192.168.200.40 -l 150 -t UDP_STREAM -T 14,14 -- -N -n -m
> 1400 -P ,56721 &
> netperf -H 192.168.200.40 -l 150 -t UDP_STREAM -T 15,15 -- -N -n -m
> 1400 -P 9300,8899 &
> 
> When on local/test receiving setup, I dump skb's IP header protocol
> field in .ndo_rx_flow_steer() handler - it is still always
> IPPROTO_TCP.
> Which has destined port 12865. But that handler never receives a SKB
> whose IP header protocol field is set to IPPROTO_UDP.
> 
> As suspected, I believe in receive flow, packets always go in the path
> where it never match any entry in global flow table in get_rps_cpu()
> function
> ,possibly due to packets don't get received from the flow of
> inet_recvmsg() which updates the global flow table ?
> 
> 3571 /* First check into global flow table if there is
> a match */
> 3572 ident = sock_flow_table->ents[hash &
> sock_flow_table->mask];
> 3573 if ((ident ^ hash) & ~rps_cpu_mask)
> 3574 goto try_rps;
> 
> Hence, it never call set_rps_cpu() which internally is supposed to
> call .ndo_rx_flow_steer() for the SKB's whose flows to be steered.
> 
> On another side, when I use "Iperf" for sending UDP stream, which I
> believe receives the packets from the intet_recvmsg() flow
> and I do see flows getting steered for UDP packets. [Actually seeing
> SKB's whose IP header protocol set to IPPROTO_UDP arriving
> in .ndo-rx_flow_steer()].
> 
> iperf -s -u
> iperf -u -c 192.168.200.40 -t 3000 -i 10 -P 8


OK, I am adding/CC Rick Jones, netperf author, since it seems a netperf
bug, not a kernel one.

I believe I already mentioned fact that "UDP_STREAM -- -N" was not doing
a connect() on the receiver side.


> 
> 
> 




Re: [PATCH net-next v9 1/1] net: phy: Cleanup the Edge-Rate feature in Microsemi PHYs.

2016-10-09 Thread Allan W. Nielsen
Hi,

On 07/10/16 22:22, Andrew Lunn wrote:
> Overall, this is much better. I just have a few nitpicks.
It is good to hear that we are getting closer ;-)

> dt-bindings/net/mscc-phy-vsc8531.h is removed by this patch. It would
> be good to also remove the reference.
My bad.

> You don't need edge_slowdown and vddmac in the private structure,
> since they are never used after determining what rate_magic is.
Year... I was actually a bit confused about this... But assumed that you had
some conventions about saving "input" configuration.

Well, clearly not - I will clean this up.

Actually, there is no need to store rate_magic either... It is only used in the
init function. This means that there is no need for private date...

The code could look somthing like this (it is not tested, so just look at see if
you like the idea):

#ifdef CONFIG_OF_MDIO
static int vsc85xx_edge_rate_magic_get(struct phy_device *phydev)
{
int rc;
u8 sd;
u16 vdd;
struct device *dev = >mdio.dev;
struct device_node *of_node = dev->of_node;
u8 sd_array_size = ARRAY_SIZE(edge_table[0].slowdown);

if (!of_node)
return -ENODEV;

rc = of_property_read_u16(of_node, "vsc8531,vddmac", );
if (rc != 0)
vdd = MSCC_VDDMAC_3300;

rc = of_property_read_u8(of_node, "vsc8531,edge-slowdown", );
if (rc != 0)
sd = 0;

for (vdd = 0; vdd < ARRAY_SIZE(edge_table); vdd++)
if (edge_table[vdd].vddmac == vddmac)
for (sd = 0; sd < sd_array_size; sd++)
if (edge_table[vdd].slowdown[sd] == slowdown)
return (sd_array_size - sd - 1);

return -EINVAL;
}
#else
static int vsc85xx_edge_rate_magic_get(struct phy_device *phydev)
{
return 0;
}
#endif /* CONFIG_OF_MDIO */


static int vsc85xx_config_init(struct phy_device *phydev)
{
int rc, rate_magic;
...
rate_magic = vsc85xx_edge_rate_magic_get(phydev);
if (rate_magic < 0)
return rate_magic;

rc = vsc85xx_edge_rate_cntl_set(phydev, rate_magic);
if (rc)
return rc;
...
}

This is clearly how I would prefere it, as it would simplify the implementation.

It should be no problem to have this tested, and have a new patch avialable
tomorrow.

/Allan



Re: [PATCH v2 1/3] net: smc91x: isolate u16 writes alignment workaround

2016-10-09 Thread Andy Shevchenko
On Sun, Oct 9, 2016 at 11:33 PM, Robert Jarzmik  wrote:
> Writes to u16 has a special handling on 3 PXA platforms, where the
> hardware wiring forces these writes to be u32 aligned.
>
> This patch isolates this handling for PXA platforms as before, but
> enables this "workaround" to be set up dynamically, which will be the
> case in device-tree build types.
>
> This patch was tested on 2 PXA platforms : mainstone, which relies on
> the workaround, and lubbock, which doesn't.

> @@ -2276,6 +2277,9 @@ static int smc_drv_probe(struct platform_device *pdev)
> memcpy(>cfg, pd, sizeof(lp->cfg));
> lp->io_shift = SMC91X_IO_SHIFT(lp->cfg.flags);
> }
> +   lp->half_word_align4 =
> +   machine_is_mainstone() || machine_is_stargate2() ||
> +   machine_is_pxa_idp();

>  /* We actually can't write halfwords properly if not word aligned */
> -static inline void SMC_outw(u16 val, void __iomem *ioaddr, int reg)
> +static inline void _SMC_outw_align4(u16 val, void __iomem *ioaddr, int reg,
> +   bool use_align4_workaround)
>  {
> -   if ((machine_is_mainstone() || machine_is_stargate2() ||
> -machine_is_pxa_idp()) && reg & 2) {
> +   if (use_align4_workaround) {
> unsigned int v = val << 16;
> v |= readl(ioaddr + (reg & ~2)) & 0x;
> writel(v, ioaddr + (reg & ~2));

> +#define SMC_outw(lp, v, a, r)  \
> +   _SMC_outw_align4((v), (a), (r), \
> +IS_BUILTIN(CONFIG_ARCH_PXA) && ((r) & 2) &&\
> +lp->half_word_align4)

Hmm... Isn't enough to have just (r) & 2 && lp->half_word_align4 ?


-- 
With Best Regards,
Andy Shevchenko


slab corruption with current -git (was Re: [git pull] vfs pile 1 (splice))

2016-10-09 Thread Linus Torvalds
On Sun, Oct 9, 2016 at 12:11 PM, Linus Torvalds
 wrote:
>
> Anyway, I don't think I can bisect it, but I'll try to narrow it down
> a *bit* at least.
>
> Not doing any more pulls on this unstable base, I've been puttering
> around in trying to clean up some stupid printk logging issues
> instead.

So I finally got a oops with slub debugging enabled. It doesn't really
narrow things down, though, it kind of extends on the possible
suspects. Now adding David Miller and Pablo, because it looks like it
may be netfilter that does something bad and corrupts memory.

Of course, maybe this is another symptom, and not the root cause for
my troubles, but it does look like it might be getting closer to the
cause... In particular, now it very much looks like a use-after-free
in the netfilter code, which *could* explain my original symptom with
later allocation users oopsing randomly.

Without further ado, here's the new oops:

   general protection fault:  [#1] SMP
   CPU: 7 PID: 169 Comm: kworker/u16:7 Not tainted 4.8.0-11288-gb66484cd7470 #1
   Hardware name: System manufacturer System Product Name/Z170-K, BIOS
1803 05/06/2016
   Workqueue: netns cleanup_net
   task: 91935e001fc0 task.stack: b4e2c213c000
   RIP: nf_unregister_net_hook+0x5f/0x190
   RSP: :b4e2c213fd40  EFLAGS: 00010202
   RAX: 6b6b6b6b6b6b6b6b RBX: 91933c4ab968 RCX: 0002
   RDX: 0002 RSI: c0642280 RDI: 91cf9820
   RBP: b4e2c213fd58 R08: 91933c4a86c8 R09: 0025
   R10: 00cc R11: 91935dd22000 R12: c0642280
   R13: 91934cc0ea80 R14: 91cf97e0 R15: 
   FS:  () GS:919376dc() knlGS:
   CS:  0010 DS:  ES:  CR0: 80050033
   CR2: 03e7c000 CR3: 0003fdb62000 CR4: 003406e0
   DR0:  DR1:  DR2: 
   DR3:  DR6: fffe0ff0 DR7: 0400
   Call Trace:
 netfilter_net_exit+0x2f/0x60
 ops_exit_list.isra.4+0x38/0x60
 cleanup_net+0x1ba/0x2a0
 process_one_work+0x1f1/0x480
 worker_thread+0x48/0x4d0
 ? process_one_work+0x480/0x480
 ? process_one_work+0x480/0x480
 kthread+0xd9/0xf0
 ? kthread_park+0x60/0x60
 ret_from_fork+0x22/0x30
   Code: 0f b6 ca 48 8d 84 c8 00 01 00 00 49 8b 5c c5 00 48 85 db 0f
84 cb 00 00 00 4c 3b 63 40 48 8b 03 0f 84 e9 00 00 00 48 85 c0 74 26
<4c> 3b 60 40 75 08 e9 ef 00 00 00 48 89 d8 48 8b 18 48 85 db 0f
   RIP  [] nf_unregister_net_hook+0x5f/0x190

and note the value in %rax: 6b is POISON_FREE, so it very much looks
like it's a pointer loaded from a free'd allocation.

The code disassembles to

   0: 0f b6 ca movzbl %dl,%ecx
   3: 48 8d 84 c8 00 01 00 lea0x100(%rax,%rcx,8),%rax
   a: 00
   b: 49 8b 5c c5 00   mov0x0(%r13,%rax,8),%rbx
  10: 48 85 db test   %rbx,%rbx
  13: 0f 84 cb 00 00 00 je 0xe4
  19: 4c 3b 63 40   cmp0x40(%rbx),%r12
  1d: 48 8b 03 mov(%rbx),%rax
  20: 0f 84 e9 00 00 00 je 0x10f
  26: 48 85 c0 test   %rax,%rax
  29: 74 26 je 0x51
  2b:* 4c 3b 60 40   cmp0x40(%rax),%r12 <-- trapping instruction
  2f: 75 08 jne0x39
  31: e9 ef 00 00 00   jmpq   0x125
  36: 48 89 d8 mov%rbx,%rax
  39: 48 8b 18 mov(%rax),%rbx
  3c: 48 85 db test   %rbx,%rbx

and that oopsing instruction seems to be the compare of
"hooks_entry->orig_ops" from hooks_entry in this expression:

if (hooks_entry && hooks_entry->orig_ops == reg) {

so hooks_entry() is bogus. It was gotten from

hooks_entry = nf_hook_entry_head(net, reg);

but that's as far as I dug. And yes, I do have
CONFIG_NETFILTER_INGRESS=y in case that matters.

And all this code has changed pretty radically in commit e3b37f11e6e4
("netfilter: replace list_head with single linked list"), and there
was clearly already something wrong with that code, with commit
5119e4381a90 ("netfilter: Fix potential null pointer dereference")
adding the test against NULL. But I suspect that only hid the "oops,
it's actually not NULL, it loaded some uninitialized value" problem.

Over to the networking guys.. Ideas?

   Linus


Please Check.

2016-10-09 Thread Jose Wong
I have a proposal to share

Joseph


[PATCH v2 1/3] net: smc91x: isolate u16 writes alignment workaround

2016-10-09 Thread Robert Jarzmik
Writes to u16 has a special handling on 3 PXA platforms, where the
hardware wiring forces these writes to be u32 aligned.

This patch isolates this handling for PXA platforms as before, but
enables this "workaround" to be set up dynamically, which will be the
case in device-tree build types.

This patch was tested on 2 PXA platforms : mainstone, which relies on
the workaround, and lubbock, which doesn't.

Signed-off-by: Robert Jarzmik 
---
 drivers/net/ethernet/smsc/smc91x.c |  6 ++-
 drivers/net/ethernet/smsc/smc91x.h | 80 ++
 2 files changed, 52 insertions(+), 34 deletions(-)

diff --git a/drivers/net/ethernet/smsc/smc91x.c 
b/drivers/net/ethernet/smsc/smc91x.c
index 77ad2a3f59db..b1f74e06d98e 100644
--- a/drivers/net/ethernet/smsc/smc91x.c
+++ b/drivers/net/ethernet/smsc/smc91x.c
@@ -602,7 +602,8 @@ static void smc_hardware_send_pkt(unsigned long data)
SMC_PUSH_DATA(lp, buf, len & ~1);
 
/* Send final ctl word with the last byte if there is one */
-   SMC_outw(((len & 1) ? (0x2000 | buf[len-1]) : 0), ioaddr, DATA_REG(lp));
+   SMC_outw(lp, ((len & 1) ? (0x2000 | buf[len-1]) : 0), ioaddr,
+DATA_REG(lp));
 
/*
 * If THROTTLE_TX_PKTS is set, we stop the queue here. This will
@@ -2276,6 +2277,9 @@ static int smc_drv_probe(struct platform_device *pdev)
memcpy(>cfg, pd, sizeof(lp->cfg));
lp->io_shift = SMC91X_IO_SHIFT(lp->cfg.flags);
}
+   lp->half_word_align4 =
+   machine_is_mainstone() || machine_is_stargate2() ||
+   machine_is_pxa_idp();
 
 #if IS_BUILTIN(CONFIG_OF)
match = of_match_device(of_match_ptr(smc91x_match), >dev);
diff --git a/drivers/net/ethernet/smsc/smc91x.h 
b/drivers/net/ethernet/smsc/smc91x.h
index 1a55c7976df0..2b7752db8635 100644
--- a/drivers/net/ethernet/smsc/smc91x.h
+++ b/drivers/net/ethernet/smsc/smc91x.h
@@ -66,10 +66,10 @@
 #define SMC_IRQ_FLAGS  (-1)/* from resource */
 
 /* We actually can't write halfwords properly if not word aligned */
-static inline void SMC_outw(u16 val, void __iomem *ioaddr, int reg)
+static inline void _SMC_outw_align4(u16 val, void __iomem *ioaddr, int reg,
+   bool use_align4_workaround)
 {
-   if ((machine_is_mainstone() || machine_is_stargate2() ||
-machine_is_pxa_idp()) && reg & 2) {
+   if (use_align4_workaround) {
unsigned int v = val << 16;
v |= readl(ioaddr + (reg & ~2)) & 0x;
writel(v, ioaddr + (reg & ~2));
@@ -78,6 +78,12 @@ static inline void SMC_outw(u16 val, void __iomem *ioaddr, 
int reg)
}
 }
 
+#define SMC_outw(lp, v, a, r)  \
+   _SMC_outw_align4((v), (a), (r), \
+IS_BUILTIN(CONFIG_ARCH_PXA) && ((r) & 2) &&\
+lp->half_word_align4)
+
+
 #elif  defined(CONFIG_SH_SH4202_MICRODEV)
 
 #define SMC_CAN_USE_8BIT   0
@@ -88,7 +94,8 @@ static inline void SMC_outw(u16 val, void __iomem *ioaddr, 
int reg)
 #define SMC_inw(a, r)  inw((a) + (r) - 0xa000)
 #define SMC_inl(a, r)  inl((a) + (r) - 0xa000)
 #define SMC_outb(v, a, r)  outb(v, (a) + (r) - 0xa000)
-#define SMC_outw(v, a, r)  outw(v, (a) + (r) - 0xa000)
+#define _SMC_outw(v, a, r) outw(v, (a) + (r) - 0xa000)
+#define SMC_outw(lp, v, a, r)  _SMC_outw((v), (a), (r))
 #define SMC_outl(v, a, r)  outl(v, (a) + (r) - 0xa000)
 #define SMC_insl(a, r, p, l)   insl((a) + (r) - 0xa000, p, l)
 #define SMC_outsl(a, r, p, l)  outsl((a) + (r) - 0xa000, p, l)
@@ -106,7 +113,8 @@ static inline void SMC_outw(u16 val, void __iomem *ioaddr, 
int reg)
 #define SMC_inb(a, r)  inb(((u32)a) + (r))
 #define SMC_inw(a, r)  inw(((u32)a) + (r))
 #define SMC_outb(v, a, r)  outb(v, ((u32)a) + (r))
-#define SMC_outw(v, a, r)  outw(v, ((u32)a) + (r))
+#define _SMC_outw(v, a, r) outw(v, ((u32)a) + (r))
+#define SMC_outw(lp, v, a, r)  _SMC_outw((v), (a), (r))
 #define SMC_insw(a, r, p, l)   insw(((u32)a) + (r), p, l)
 #define SMC_outsw(a, r, p, l)  outsw(((u32)a) + (r), p, l)
 
@@ -134,7 +142,8 @@ static inline void SMC_outw(u16 val, void __iomem *ioaddr, 
int reg)
 #define SMC_inw(a, r)   readw((a) + (r))
 #define SMC_inl(a, r)   readl((a) + (r))
 #define SMC_outb(v, a, r)   writeb(v, (a) + (r))
-#define SMC_outw(v, a, r)   writew(v, (a) + (r))
+#define _SMC_outw(v, a, r)   writew(v, (a) + (r))
+#define SMC_outw(lp, v, a, r)  _SMC_outw((v), (a), (r))
 #define SMC_outl(v, a, r)   writel(v, (a) + (r))
 #define SMC_insw(a, r, p, l)readsw((a) + (r), p, l)
 #define SMC_outsw(a, r, p, l)   writesw((a) + (r), p, l)
@@ -166,7 +175,8 @@ static inline void mcf_outsw(void *a, unsigned char *p, int 
l)
 }
 
 #define SMC_inw(a, r)  _swapw(readw((a) + (r)))

[PATCH v2 2/3] net: smc91x: take into account half-word workaround

2016-10-09 Thread Robert Jarzmik
For device-tree builds, platforms such as mainstone, idp and stargate2
must have their u16 writes all aligned on 32 bit boundaries. This is
already enabled in platform data builds, and this patch adds it to
device-tree builds.

Signed-off-by: Robert Jarzmik 
---
Since v1: rename dt property to pxa-u16-align4
---
 drivers/net/ethernet/smsc/smc91x.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/smsc/smc91x.c 
b/drivers/net/ethernet/smsc/smc91x.c
index b1f74e06d98e..e2f151258b38 100644
--- a/drivers/net/ethernet/smsc/smc91x.c
+++ b/drivers/net/ethernet/smsc/smc91x.c
@@ -2323,6 +2323,8 @@ static int smc_drv_probe(struct platform_device *pdev)
if (!device_property_read_u32(>dev, "reg-shift",
  ))
lp->io_shift = val;
+   lp->half_word_align4 =
+   device_property_read_bool(>dev, "pxa-u16-align4");
}
 #endif
 
-- 
2.1.4



[PATCH v2 3/3] net: smsc91x: add u16 workaround for pxa platforms

2016-10-09 Thread Robert Jarzmik
Add a workaround for mainstone, idp and stargate2 boards, for u16 writes
which must be aligned on 32 bits addresses.

Signed-off-by: Robert Jarzmik 
Cc: Jeremy Linton 
---
Since v1: rename dt property to pxa-u16-align4
  change the binding documentation file
---
 Documentation/devicetree/bindings/net/smsc-lan91c111.txt | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/Documentation/devicetree/bindings/net/smsc-lan91c111.txt 
b/Documentation/devicetree/bindings/net/smsc-lan91c111.txt
index e77e167593db..309e37eb7c7c 100644
--- a/Documentation/devicetree/bindings/net/smsc-lan91c111.txt
+++ b/Documentation/devicetree/bindings/net/smsc-lan91c111.txt
@@ -13,3 +13,5 @@ Optional properties:
   16-bit access only.
 - power-gpios: GPIO to control the PWRDWN pin
 - reset-gpios: GPIO to control the RESET pin
+- pxa-u16-align4 : Boolean, put in place the workaround the force all
+  u16 writes to be 32 bits aligned
-- 
2.1.4



Re: [PATCH net-next v9 1/1] net: phy: Cleanup the Edge-Rate feature in Microsemi PHYs.

2016-10-09 Thread Andrew Lunn
> Year... I was actually a bit confused about this... But assumed that you had
> some conventions about saving "input" configuration.

Nope.

Humm, actually, i messed up here and missed something. Sorry.

You should really do all the DT parsing in the probe function, or a
function it calls. We want the probe to return -EINVAL if the device
tree is invalid. Either you can program the magic value immediately,
if it will survive a reset, or save it in the private structure until
the init is called.

   Andrew


RE: Accelerated receive flow steering (aRFS) for UDP

2016-10-09 Thread Chopra, Manish
> -Original Message-
> From: Eric Dumazet [mailto:eric.duma...@gmail.com]
> Sent: Sunday, October 09, 2016 10:45 PM
> To: Chopra, Manish 
> Cc: netdev@vger.kernel.org; ma...@mellanox.com; t...@herbertland.com
> Subject: Re: Accelerated receive flow steering (aRFS) for UDP
> 
> On Sat, 2016-10-08 at 12:25 +, Chopra, Manish wrote:
> > > -Original Message-
> > > From: Eric Dumazet [mailto:eric.duma...@gmail.com]
> > > Sent: Saturday, October 08, 2016 5:08 AM
> > > To: Chopra, Manish 
> > > Cc: netdev@vger.kernel.org; ma...@mellanox.com; t...@herbertland.com
> > > Subject: Re: Accelerated receive flow steering (aRFS) for UDP
> > >
> > > On Fri, 2016-10-07 at 22:55 +, Chopra, Manish wrote:
> > > > Hello Folks,
> > > >
> > > > I am experimenting aRFS with our NIC devices, and for that I have
> > > > kernel 4.8.x installed with below config.
> > > >
> > > > CONFIG_RPS=y
> > > > CONFIG_RFS_ACCEL=y
> > > >
> > > > # cat /proc/cpuinfo  | grep processor
> > > > processor   : 0
> > > > processor   : 1
> > > > processor   : 2
> > > > processor   : 3
> > > > processor   : 4
> > > > processor   : 5
> > > > processor   : 6
> > > > processor   : 7
> > > > processor   : 8
> > > > processor   : 9
> > > > processor   : 10
> > > > processor   : 11
> > > > processor   : 12
> > > > processor   : 13
> > > > processor   : 14
> > > > processor   : 15
> > > >
> > > > I configured rps_sock_flow_entries  and our NIC rx queues with below
> > > > values
> > > >
> > > > echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
> > > > echo 4096 > /sys/class/net/p4p1/queues/rx-0/rps_flow_cnt
> > > > echo 4096 > /sys/class/net/p4p1/queues/rx-1/rps_flow_cnt
> > > > echo 4096 > /sys/class/net/p4p1/queues/rx-2/rps_flow_cnt
> > > > echo 4096 > /sys/class/net/p4p1/queues/rx-3/rps_flow_cnt
> > > > echo 4096 > /sys/class/net/p4p1/queues/rx-4/rps_flow_cnt
> > > > echo 4096 > /sys/class/net/p4p1/queues/rx-5/rps_flow_cnt
> > > > echo 4096 > /sys/class/net/p4p1/queues/rx-6/rps_flow_cnt
> > > > echo 4096 > /sys/class/net/p4p1/queues/rx-7/rps_flow_cnt
> > > >
> > > > echo  > /sys/class/net/p4p1/queues/rx-0/rps_cpus
> > > > echo  > /sys/class/net/p4p1/queues/rx-1/rps_cpus
> > > > echo  > /sys/class/net/p4p1/queues/rx-2/rps_cpus
> > > > echo  > /sys/class/net/p4p1/queues/rx-3/rps_cpus
> > > > echo  > /sys/class/net/p4p1/queues/rx-4/rps_cpus
> > > > echo  > /sys/class/net/p4p1/queues/rx-5/rps_cpus
> > > > echo  > /sys/class/net/p4p1/queues/rx-6/rps_cpus
> > > > echo  > /sys/class/net/p4p1/queues/rx-7/rps_cpus
> > > >
> > > > Below is IRQ affinity configuration for NIC irqs used.
> > > >
> > > > # cat /proc/irq/67/smp_affinity_list
> > > > 8
> > > > # cat /proc/irq/68/smp_affinity_list
> > > > 9
> > > > # cat /proc/irq/69/smp_affinity_list
> > > > 10
> > > > # cat /proc/irq/70/smp_affinity_list
> > > > 11
> > > > # cat /proc/irq/71/smp_affinity_list
> > > > 12
> > > > # cat /proc/irq/72/smp_affinity_list
> > > > 13
> > > > # cat /proc/irq/73/smp_affinity_list
> > > > 14
> > > > # cat /proc/irq/74/smp_affinity_list
> > > > 15
> > > >
> > > > Driver has required feature NETIF_F_NTUPLE set, ndo_rx_flow_steer()
> > > > registered and I am running UDP multiple connections stream using
> > > > netperf to the host where I am experimenting aRFS.
> > > >
> > > > # netperf -V
> > > > Netperf version 2.7.0
> > > >
> > > > netperf -H 192.168.200.40 -t UDP_STREAM -l 150 -T 8,8 -- -m 1470 -P
> > > > 5001,48512 &
> > > > netperf -H 192.168.200.40 -t UDP_STREAM -l 150 -T 9,9 -- -m 1470 -P
> > > > 5001,37990 &
> > > > netperf -H 192.168.200.40 -t UDP_STREAM -l 150 -T 10,10 -- -m 1470 -P
> > > > 5001,40302 &
> > > > netperf -H 192.168.200.40 -t UDP_STREAM -l 150 -T 11,11 -- -m 1470 -P
> > > > 5001,39071 &
> > > > netperf -H 192.168.200.40 -t UDP_STREAM -l 150 -T 12,12 -- -m 1470 -P
> > > > 5001,58994 &
> > > > netperf -H 192.168.200.40 -t UDP_STREAM -l 150 -T 13,13 -- -m 1470 -P
> > > > 5001,59884 &
> > > > netperf -H 192.168.200.40 -t UDP_STREAM -l 150 -T 14,14 -- -m 1470 -P
> > > > 5001,40282 &
> > > > netperf -H 192.168.200.40 -t UDP_STREAM -l 150 -T 15,15 -- -m 1470 -P
> > > > 5001,56042 &
> > > >
> > > > I see that our registered callback for ndo_rx_flow_steer() "NEVER"
> > > > gets invoked for UDP packets, with TCP_STREAM I do see it gets
> > > > invoked.
> > > > But while running UDP_STREAM I see it gets invoked for some of TCP
> > > > packets as netperf also uses TCP managed connections while running
> > > > UDP_STREAM.
> > > >
> > > > My initial investigation suspects that while running UDP_STREAM with
> > > > netperf, rps_sock_flow_table doesn't get updated, as packets never
> > > > reach to the flow of inet_recvmsg()
> > > > where it gets updated using sock_rps_record_flow(). Which might be the
> > > > reason it never invokes NIC's flow steering handler ?
> > > >
> > > > Please note that when I 

Re: Accelerated receive flow steering (aRFS) for UDP

2016-10-09 Thread Eric Dumazet
On Sat, 2016-10-08 at 12:25 +, Chopra, Manish wrote:
> > -Original Message-
> > From: Eric Dumazet [mailto:eric.duma...@gmail.com]
> > Sent: Saturday, October 08, 2016 5:08 AM
> > To: Chopra, Manish 
> > Cc: netdev@vger.kernel.org; ma...@mellanox.com; t...@herbertland.com
> > Subject: Re: Accelerated receive flow steering (aRFS) for UDP
> > 
> > On Fri, 2016-10-07 at 22:55 +, Chopra, Manish wrote:
> > > Hello Folks,
> > >
> > > I am experimenting aRFS with our NIC devices, and for that I have
> > > kernel 4.8.x installed with below config.
> > >
> > > CONFIG_RPS=y
> > > CONFIG_RFS_ACCEL=y
> > >
> > > # cat /proc/cpuinfo  | grep processor
> > > processor   : 0
> > > processor   : 1
> > > processor   : 2
> > > processor   : 3
> > > processor   : 4
> > > processor   : 5
> > > processor   : 6
> > > processor   : 7
> > > processor   : 8
> > > processor   : 9
> > > processor   : 10
> > > processor   : 11
> > > processor   : 12
> > > processor   : 13
> > > processor   : 14
> > > processor   : 15
> > >
> > > I configured rps_sock_flow_entries  and our NIC rx queues with below
> > > values
> > >
> > > echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
> > > echo 4096 > /sys/class/net/p4p1/queues/rx-0/rps_flow_cnt
> > > echo 4096 > /sys/class/net/p4p1/queues/rx-1/rps_flow_cnt
> > > echo 4096 > /sys/class/net/p4p1/queues/rx-2/rps_flow_cnt
> > > echo 4096 > /sys/class/net/p4p1/queues/rx-3/rps_flow_cnt
> > > echo 4096 > /sys/class/net/p4p1/queues/rx-4/rps_flow_cnt
> > > echo 4096 > /sys/class/net/p4p1/queues/rx-5/rps_flow_cnt
> > > echo 4096 > /sys/class/net/p4p1/queues/rx-6/rps_flow_cnt
> > > echo 4096 > /sys/class/net/p4p1/queues/rx-7/rps_flow_cnt
> > >
> > > echo  > /sys/class/net/p4p1/queues/rx-0/rps_cpus
> > > echo  > /sys/class/net/p4p1/queues/rx-1/rps_cpus
> > > echo  > /sys/class/net/p4p1/queues/rx-2/rps_cpus
> > > echo  > /sys/class/net/p4p1/queues/rx-3/rps_cpus
> > > echo  > /sys/class/net/p4p1/queues/rx-4/rps_cpus
> > > echo  > /sys/class/net/p4p1/queues/rx-5/rps_cpus
> > > echo  > /sys/class/net/p4p1/queues/rx-6/rps_cpus
> > > echo  > /sys/class/net/p4p1/queues/rx-7/rps_cpus
> > >
> > > Below is IRQ affinity configuration for NIC irqs used.
> > >
> > > # cat /proc/irq/67/smp_affinity_list
> > > 8
> > > # cat /proc/irq/68/smp_affinity_list
> > > 9
> > > # cat /proc/irq/69/smp_affinity_list
> > > 10
> > > # cat /proc/irq/70/smp_affinity_list
> > > 11
> > > # cat /proc/irq/71/smp_affinity_list
> > > 12
> > > # cat /proc/irq/72/smp_affinity_list
> > > 13
> > > # cat /proc/irq/73/smp_affinity_list
> > > 14
> > > # cat /proc/irq/74/smp_affinity_list
> > > 15
> > >
> > > Driver has required feature NETIF_F_NTUPLE set, ndo_rx_flow_steer()
> > > registered and I am running UDP multiple connections stream using
> > > netperf to the host where I am experimenting aRFS.
> > >
> > > # netperf -V
> > > Netperf version 2.7.0
> > >
> > > netperf -H 192.168.200.40 -t UDP_STREAM -l 150 -T 8,8 -- -m 1470 -P
> > > 5001,48512 &
> > > netperf -H 192.168.200.40 -t UDP_STREAM -l 150 -T 9,9 -- -m 1470 -P
> > > 5001,37990 &
> > > netperf -H 192.168.200.40 -t UDP_STREAM -l 150 -T 10,10 -- -m 1470 -P
> > > 5001,40302 &
> > > netperf -H 192.168.200.40 -t UDP_STREAM -l 150 -T 11,11 -- -m 1470 -P
> > > 5001,39071 &
> > > netperf -H 192.168.200.40 -t UDP_STREAM -l 150 -T 12,12 -- -m 1470 -P
> > > 5001,58994 &
> > > netperf -H 192.168.200.40 -t UDP_STREAM -l 150 -T 13,13 -- -m 1470 -P
> > > 5001,59884 &
> > > netperf -H 192.168.200.40 -t UDP_STREAM -l 150 -T 14,14 -- -m 1470 -P
> > > 5001,40282 &
> > > netperf -H 192.168.200.40 -t UDP_STREAM -l 150 -T 15,15 -- -m 1470 -P
> > > 5001,56042 &
> > >
> > > I see that our registered callback for ndo_rx_flow_steer() "NEVER"
> > > gets invoked for UDP packets, with TCP_STREAM I do see it gets
> > > invoked.
> > > But while running UDP_STREAM I see it gets invoked for some of TCP
> > > packets as netperf also uses TCP managed connections while running
> > > UDP_STREAM.
> > >
> > > My initial investigation suspects that while running UDP_STREAM with
> > > netperf, rps_sock_flow_table doesn't get updated, as packets never
> > > reach to the flow of inet_recvmsg()
> > > where it gets updated using sock_rps_record_flow(). Which might be the
> > > reason it never invokes NIC's flow steering handler ?
> > >
> > > Please note that when I run UDP stream using "iperf" - I do see that
> > > our registered callback function for flow steering gets invoked for
> > > "UDP" packets.
> > > I am not sure if I am missing something in configuration or something
> > > else which is I am unware of  ?
> > >
> > > I appreciate any help for this.
> > 
> > Make sure you use connected UDP flows
> > 
> > 
> > netperf -t UDP_STREAM ... -- -N -n
> > 
> > Otherwise, one UDP socket can be involved in millions of 4-tuples (aka
> > flows)
> > 
> > 
> 
> Hi Eric, I tried with it but it doesn't 

[PATCH net-next 3/6] qede: Prevent GSO on long Geneve headers

2016-10-09 Thread Yuval Mintz
From: Manish Chopra 

Due to hardware limitation, when transmitting a geneve-encapsulated
packet with more than 32 bytes worth of geneve options the hardware
would not be able to crack the packet and consider it a regular UDP
packet.

This implements the ndo_features_check() in qede in order to prevent
GSO on said transmitted packets.

Signed-off-by: Manish Chopra 
Signed-off-by: Yuval Mintz 
---
 drivers/net/ethernet/qlogic/qede/qede_main.c | 35 
 1 file changed, 35 insertions(+)

diff --git a/drivers/net/ethernet/qlogic/qede/qede_main.c 
b/drivers/net/ethernet/qlogic/qede/qede_main.c
index 7d5dc1e..6c2b09c 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_main.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_main.c
@@ -2240,6 +2240,40 @@ static void qede_udp_tunnel_del(struct net_device *dev,
schedule_delayed_work(>sp_task, 0);
 }
 
+/* 8B udp header + 8B base tunnel header + 32B option length */
+#define QEDE_MAX_TUN_HDR_LEN 48
+
+static netdev_features_t qede_features_check(struct sk_buff *skb,
+struct net_device *dev,
+netdev_features_t features)
+{
+   if (skb->encapsulation) {
+   u8 l4_proto = 0;
+
+   switch (vlan_get_protocol(skb)) {
+   case htons(ETH_P_IP):
+   l4_proto = ip_hdr(skb)->protocol;
+   break;
+   case htons(ETH_P_IPV6):
+   l4_proto = ipv6_hdr(skb)->nexthdr;
+   break;
+   default:
+   return features;
+   }
+
+   /* Disable offloads for geneve tunnels, as HW can't parse
+* the geneve header which has option length greater than 32B.
+*/
+   if ((l4_proto == IPPROTO_UDP) &&
+   ((skb_inner_mac_header(skb) -
+ skb_transport_header(skb)) > QEDE_MAX_TUN_HDR_LEN))
+   return features & ~(NETIF_F_CSUM_MASK |
+   NETIF_F_GSO_MASK);
+   }
+
+   return features;
+}
+
 static const struct net_device_ops qede_netdev_ops = {
.ndo_open = qede_open,
.ndo_stop = qede_close,
@@ -2264,6 +2298,7 @@ static const struct net_device_ops qede_netdev_ops = {
 #endif
.ndo_udp_tunnel_add = qede_udp_tunnel_add,
.ndo_udp_tunnel_del = qede_udp_tunnel_del,
+   .ndo_features_check = qede_features_check,
 };
 
 /* -
-- 
1.9.3



[PATCH net-next 5/6] qed: Allow chance for fast ramrod completions

2016-10-09 Thread Yuval Mintz
From: Yuval Mintz 

Whenever a ramrod is being sent for some device configuration,
the driver is going to sleep at least 5ms between each iteration
of polling on the completion of the ramrod.

However, in almost every configuration scenario the firmware
would be able to comply and complete the ramrod in a manner of
several usecs. This is especially important in cases where there
might be a lot of sequential configurations applying to the hardware
[e.g., RoCE], in which case the existing scheme might cause some
visible user delays.

This patch changes the completion scheme - instead of immediately
starting to sleep for a 'long' period, allow the device to quickly
poll on the first iteration after a couple of usecs.

Signed-off-by: Yuval Mintz 
---
 drivers/net/ethernet/qlogic/qed/qed_spq.c | 85 +--
 1 file changed, 59 insertions(+), 26 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_spq.c 
b/drivers/net/ethernet/qlogic/qed/qed_spq.c
index caff415..259a615 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_spq.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_spq.c
@@ -37,7 +37,11 @@
 ***/
 
 #define SPQ_HIGH_PRI_RESERVE_DEFAULT(1)
-#define SPQ_BLOCK_SLEEP_LENGTH  (1000)
+
+#define SPQ_BLOCK_DELAY_MAX_ITER(10)
+#define SPQ_BLOCK_DELAY_US  (10)
+#define SPQ_BLOCK_SLEEP_MAX_ITER(1000)
+#define SPQ_BLOCK_SLEEP_MS  (5)
 
 /***
 * Blocking Imp. (BLOCK/EBLOCK mode)
@@ -57,53 +61,81 @@ static void qed_spq_blocking_cb(struct qed_hwfn *p_hwfn,
smp_wmb();
 }
 
-static int qed_spq_block(struct qed_hwfn *p_hwfn,
-struct qed_spq_entry *p_ent,
-u8 *p_fw_ret)
+static int __qed_spq_block(struct qed_hwfn *p_hwfn,
+  struct qed_spq_entry *p_ent,
+  u8 *p_fw_ret, bool sleep_between_iter)
 {
-   int sleep_count = SPQ_BLOCK_SLEEP_LENGTH;
struct qed_spq_comp_done *comp_done;
-   int rc;
+   u32 iter_cnt;
 
comp_done = (struct qed_spq_comp_done *)p_ent->comp_cb.cookie;
-   while (sleep_count) {
-   /* validate we receive completion update */
+   iter_cnt = sleep_between_iter ? SPQ_BLOCK_SLEEP_MAX_ITER
+ : SPQ_BLOCK_DELAY_MAX_ITER;
+
+   while (iter_cnt--) {
+   /* Validate we receive completion update */
smp_rmb();
if (comp_done->done == 1) {
if (p_fw_ret)
*p_fw_ret = comp_done->fw_return_code;
return 0;
}
-   usleep_range(5000, 1);
-   sleep_count--;
+
+   if (sleep_between_iter)
+   msleep(SPQ_BLOCK_SLEEP_MS);
+   else
+   udelay(SPQ_BLOCK_DELAY_US);
}
 
+   return -EBUSY;
+}
+
+static int qed_spq_block(struct qed_hwfn *p_hwfn,
+struct qed_spq_entry *p_ent,
+u8 *p_fw_ret, bool skip_quick_poll)
+{
+   struct qed_spq_comp_done *comp_done;
+   int rc;
+
+   /* A relatively short polling period w/o sleeping, to allow the FW to
+* complete the ramrod and thus possibly to avoid the following sleeps.
+*/
+   if (!skip_quick_poll) {
+   rc = __qed_spq_block(p_hwfn, p_ent, p_fw_ret, false);
+   if (!rc)
+   return 0;
+   }
+
+   /* Move to polling with a sleeping period between iterations */
+   rc = __qed_spq_block(p_hwfn, p_ent, p_fw_ret, true);
+   if (!rc)
+   return 0;
+
DP_INFO(p_hwfn, "Ramrod is stuck, requesting MCP drain\n");
rc = qed_mcp_drain(p_hwfn, p_hwfn->p_main_ptt);
-   if (rc != 0)
+   if (rc) {
DP_NOTICE(p_hwfn, "MCP drain failed\n");
+   goto err;
+   }
 
/* Retry after drain */
-   sleep_count = SPQ_BLOCK_SLEEP_LENGTH;
-   while (sleep_count) {
-   /* validate we receive completion update */
-   smp_rmb();
-   if (comp_done->done == 1) {
-   if (p_fw_ret)
-   *p_fw_ret = comp_done->fw_return_code;
-   return 0;
-   }
-   usleep_range(5000, 1);
-   sleep_count--;
-   }
+   rc = __qed_spq_block(p_hwfn, p_ent, p_fw_ret, true);
+   if (!rc)
+   return 0;
 
+   comp_done = (struct qed_spq_comp_done *)p_ent->comp_cb.cookie;
if (comp_done->done == 1) {
if (p_fw_ret)
*p_fw_ret = comp_done->fw_return_code;
return 0;
}
-
-   DP_NOTICE(p_hwfn, 

[PATCH net-next 1/6] qed: Pass MAC hints to VFs

2016-10-09 Thread Yuval Mintz
From: Yuval Mintz 

Some hypervisors can support MAC hints to their VFs.
Even though we don't have such a hypervisor API in linux, we add
sufficient logic for the VF to be able to receive such hints and
set the mac accordingly - as long as the VF has not been set with
a MAC already.

Signed-off-by: Yuval Mintz 
---
 drivers/net/ethernet/qlogic/qed/qed_vf.c | 4 ++--
 drivers/net/ethernet/qlogic/qede/qede_main.c | 6 +-
 include/linux/qed/qed_eth_if.h   | 2 +-
 3 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_vf.c 
b/drivers/net/ethernet/qlogic/qed/qed_vf.c
index abf5bf1..f580bf4 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_vf.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_vf.c
@@ -1230,8 +1230,8 @@ static void qed_handle_bulletin_change(struct qed_hwfn 
*hwfn)
 
is_mac_exist = qed_vf_bulletin_get_forced_mac(hwfn, mac,
  _mac_forced);
-   if (is_mac_exist && is_mac_forced && cookie)
-   ops->force_mac(cookie, mac);
+   if (is_mac_exist && cookie)
+   ops->force_mac(cookie, mac, !!is_mac_forced);
 
/* Always update link configuration according to bulletin */
qed_link_update(hwfn);
diff --git a/drivers/net/ethernet/qlogic/qede/qede_main.c 
b/drivers/net/ethernet/qlogic/qede/qede_main.c
index 343038c..9866d95 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_main.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_main.c
@@ -171,10 +171,14 @@ static struct pci_driver qede_pci_driver = {
 #endif
 };
 
-static void qede_force_mac(void *dev, u8 *mac)
+static void qede_force_mac(void *dev, u8 *mac, bool forced)
 {
struct qede_dev *edev = dev;
 
+   /* MAC hints take effect only if we haven't set one already */
+   if (is_valid_ether_addr(edev->ndev->dev_addr) && !forced)
+   return;
+
ether_addr_copy(edev->ndev->dev_addr, mac);
ether_addr_copy(edev->primary_mac, mac);
 }
diff --git a/include/linux/qed/qed_eth_if.h b/include/linux/qed/qed_eth_if.h
index 33c24eb..1c77948 100644
--- a/include/linux/qed/qed_eth_if.h
+++ b/include/linux/qed/qed_eth_if.h
@@ -129,7 +129,7 @@ struct qed_tunn_params {
 
 struct qed_eth_cb_ops {
struct qed_common_cb_ops common;
-   void (*force_mac) (void *dev, u8 *mac);
+   void (*force_mac) (void *dev, u8 *mac, bool forced);
 };
 
 #ifdef CONFIG_DCB
-- 
1.9.3



[PATCH net-next 2/6] qede: GSO support for tunnels with outer csum

2016-10-09 Thread Yuval Mintz
From: Manish Chopra 

This patch adds GSO support for GRE and UDP tunnels
where outer checksums are enabled.

Signed-off-by: Manish Chopra 
Signed-off-by: Yuval Mintz 
---
 drivers/net/ethernet/qlogic/qede/qede.h  |  1 +
 drivers/net/ethernet/qlogic/qede/qede_main.c | 26 +++---
 2 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qede/qede.h 
b/drivers/net/ethernet/qlogic/qede/qede.h
index 28c0e9f..f50e527 100644
--- a/drivers/net/ethernet/qlogic/qede/qede.h
+++ b/drivers/net/ethernet/qlogic/qede/qede.h
@@ -320,6 +320,7 @@ struct qede_fastpath {
 #define XMIT_L4_CSUM   BIT(0)
 #define XMIT_LSO   BIT(1)
 #define XMIT_ENC   BIT(2)
+#define XMIT_ENC_GSO_L4_CSUM   BIT(3)
 
 #define QEDE_CSUM_ERRORBIT(0)
 #define QEDE_CSUM_UNNECESSARY  BIT(1)
diff --git a/drivers/net/ethernet/qlogic/qede/qede_main.c 
b/drivers/net/ethernet/qlogic/qede/qede_main.c
index 9866d95..7d5dc1e 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_main.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_main.c
@@ -400,8 +400,19 @@ static u32 qede_xmit_type(struct qede_dev *edev,
(ipv6_hdr(skb)->nexthdr == NEXTHDR_IPV6))
*ipv6_ext = 1;
 
-   if (skb->encapsulation)
+   if (skb->encapsulation) {
rc |= XMIT_ENC;
+   if (skb_is_gso(skb)) {
+   unsigned short gso_type = skb_shinfo(skb)->gso_type;
+
+   if ((gso_type & SKB_GSO_UDP_TUNNEL_CSUM) ||
+   (gso_type & SKB_GSO_GRE_CSUM))
+   rc |= XMIT_ENC_GSO_L4_CSUM;
+
+   rc |= XMIT_LSO;
+   return rc;
+   }
+   }
 
if (skb_is_gso(skb))
rc |= XMIT_LSO;
@@ -637,6 +648,12 @@ static netdev_tx_t qede_start_xmit(struct sk_buff *skb,
if (unlikely(xmit_type & XMIT_ENC)) {
first_bd->data.bd_flags.bitfields |=
1 << ETH_TX_1ST_BD_FLAGS_TUNN_IP_CSUM_SHIFT;
+
+   if (xmit_type & XMIT_ENC_GSO_L4_CSUM) {
+   u8 tmp = ETH_TX_1ST_BD_FLAGS_TUNN_L4_CSUM_SHIFT;
+
+   first_bd->data.bd_flags.bitfields |= 1 << tmp;
+   }
hlen = qede_get_skb_hlen(skb, true);
} else {
first_bd->data.bd_flags.bitfields |=
@@ -2320,11 +2337,14 @@ static void qede_init_ndev(struct qede_dev *edev)
 
/* Encap features*/
hw_features |= NETIF_F_GSO_GRE | NETIF_F_GSO_UDP_TUNNEL |
-  NETIF_F_TSO_ECN;
+  NETIF_F_TSO_ECN | NETIF_F_GSO_UDP_TUNNEL_CSUM |
+  NETIF_F_GSO_GRE_CSUM;
ndev->hw_enc_features = NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM |
NETIF_F_SG | NETIF_F_TSO | NETIF_F_TSO_ECN |
NETIF_F_TSO6 | NETIF_F_GSO_GRE |
-   NETIF_F_GSO_UDP_TUNNEL | NETIF_F_RXCSUM;
+   NETIF_F_GSO_UDP_TUNNEL | NETIF_F_RXCSUM |
+   NETIF_F_GSO_UDP_TUNNEL_CSUM |
+   NETIF_F_GSO_GRE_CSUM;
 
ndev->vlan_features = hw_features | NETIF_F_RXHASH | NETIF_F_RXCSUM |
  NETIF_F_HIGHDMA;
-- 
1.9.3



[PATCH net-next 0/6] qed*: driver updates

2016-10-09 Thread Yuval Mintz
From: Yuval Mintz 

There are several new additions in this series;
Most are connected to either Tx offloading or Rx classifications
[either fastpath changes or supporting configuration].

In addition, there's a single IOV enhancement.

Dave,

Please consider applying this series to `net-next'.

Thanks,
Yuval

Manish Chopra (2):
  qede: GSO support for tunnels with outer csum
  qede: Prevent GSO on long Geneve headers

Yuval Mintz (4):
  qed: Pass MAC hints to VFs
  qed*: Allow unicast filtering
  qed: Allow chance for fast ramrod completions
  qed: Handle malicious VFs events

 drivers/net/ethernet/qlogic/qed/qed_l2.c |  12 ++-
 drivers/net/ethernet/qlogic/qed/qed_spq.c|  85 ++--
 drivers/net/ethernet/qlogic/qed/qed_sriov.c  | 114 ++-
 drivers/net/ethernet/qlogic/qed/qed_sriov.h  |   1 +
 drivers/net/ethernet/qlogic/qed/qed_vf.c |   4 +-
 drivers/net/ethernet/qlogic/qed/qed_vf.h |   1 +
 drivers/net/ethernet/qlogic/qede/qede.h  |   1 +
 drivers/net/ethernet/qlogic/qede/qede_main.c |  71 +++--
 include/linux/qed/qed_eth_if.h   |   3 +-
 9 files changed, 236 insertions(+), 56 deletions(-)

-- 
1.9.3



[PATCH net-next 6/6] qed: Handle malicious VFs events

2016-10-09 Thread Yuval Mintz
From: Yuval Mintz 

Malicious VFs might be caught in several different methods:
  - Misusing their bar permission and being blocked by hardware.
  - Misusing their fastpath logic and being blocked by firmware.
  - Misusing their interaction with their PF via hw-channel,
and being blocked by PF driver.

On the first two items, firmware would indicate to driver that
the VF is to be considered malicious, but would sometime still
allow the VF to communicate with the PF [depending on the exact
nature of the malicious activity done by the VF].
The current existing logic on the PF side lacks handling of such events,
and might allow the PF to perform some incorrect configuration on behalf
of a VF that was previously indicated as malicious.

The new scheme is simple -
Once the PF determines a VF is malicious it would:
 a. Ignore any further requests on behalf of the VF-driver.
 b. Prevent any configurations initiated by the hyperuser for
the malicious VF, as firmware isn't willing to serve such.

The malicious indication would be cleared upon the VF flr,
after which it would become usable once again.

Signed-off-by: Yuval Mintz 
---
 drivers/net/ethernet/qlogic/qed/qed_sriov.c | 114 +++-
 drivers/net/ethernet/qlogic/qed/qed_sriov.h |   1 +
 drivers/net/ethernet/qlogic/qed/qed_vf.h|   1 +
 3 files changed, 96 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_sriov.c 
b/drivers/net/ethernet/qlogic/qed/qed_sriov.c
index d2d6621..6f029f9 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_sriov.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_sriov.c
@@ -109,7 +109,8 @@ static int qed_sp_vf_stop(struct qed_hwfn *p_hwfn,
 }
 
 static bool qed_iov_is_valid_vfid(struct qed_hwfn *p_hwfn,
- int rel_vf_id, bool b_enabled_only)
+ int rel_vf_id,
+ bool b_enabled_only, bool b_non_malicious)
 {
if (!p_hwfn->pf_iov_info) {
DP_NOTICE(p_hwfn->cdev, "No iov info\n");
@@ -124,6 +125,10 @@ static bool qed_iov_is_valid_vfid(struct qed_hwfn *p_hwfn,
b_enabled_only)
return false;
 
+   if ((p_hwfn->pf_iov_info->vfs_array[rel_vf_id].b_malicious) &&
+   b_non_malicious)
+   return false;
+
return true;
 }
 
@@ -138,7 +143,8 @@ static struct qed_vf_info *qed_iov_get_vf_info(struct 
qed_hwfn *p_hwfn,
return NULL;
}
 
-   if (qed_iov_is_valid_vfid(p_hwfn, relative_vf_id, b_enabled_only))
+   if (qed_iov_is_valid_vfid(p_hwfn, relative_vf_id,
+ b_enabled_only, false))
vf = _hwfn->pf_iov_info->vfs_array[relative_vf_id];
else
DP_ERR(p_hwfn, "qed_iov_get_vf_info: VF[%d] is not enabled\n",
@@ -542,7 +548,8 @@ int qed_iov_hw_info(struct qed_hwfn *p_hwfn)
return 0;
 }
 
-static bool qed_iov_pf_sanity_check(struct qed_hwfn *p_hwfn, int vfid)
+bool _qed_iov_pf_sanity_check(struct qed_hwfn *p_hwfn,
+ int vfid, bool b_fail_malicious)
 {
/* Check PF supports sriov */
if (IS_VF(p_hwfn->cdev) || !IS_QED_SRIOV(p_hwfn->cdev) ||
@@ -550,12 +557,17 @@ static bool qed_iov_pf_sanity_check(struct qed_hwfn 
*p_hwfn, int vfid)
return false;
 
/* Check VF validity */
-   if (!qed_iov_is_valid_vfid(p_hwfn, vfid, true))
+   if (!qed_iov_is_valid_vfid(p_hwfn, vfid, true, b_fail_malicious))
return false;
 
return true;
 }
 
+bool qed_iov_pf_sanity_check(struct qed_hwfn *p_hwfn, int vfid)
+{
+   return _qed_iov_pf_sanity_check(p_hwfn, vfid, true);
+}
+
 static void qed_iov_set_vf_to_disable(struct qed_dev *cdev,
  u16 rel_vf_id, u8 to_disable)
 {
@@ -652,6 +664,9 @@ static int qed_iov_enable_vf_access(struct qed_hwfn *p_hwfn,
 
qed_iov_vf_igu_reset(p_hwfn, p_ptt, vf);
 
+   /* It's possible VF was previously considered malicious */
+   vf->b_malicious = false;
+
rc = qed_mcp_config_vf_msix(p_hwfn, p_ptt, vf->abs_vf_id, vf->num_sbs);
if (rc)
return rc;
@@ -2804,6 +2819,13 @@ qed_iov_execute_vf_flr_cleanup(struct qed_hwfn *p_hwfn,
return rc;
}
 
+   /* Workaround to make VF-PF channel ready, as FW
+* doesn't do that as a part of FLR.
+*/
+   REG_WR(p_hwfn,
+  GTT_BAR0_MAP_REG_USDM_RAM +
+  USTORM_VF_PF_CHANNEL_READY_OFFSET(vfid), 1);
+
/* VF_STOPPED has to be set only after final cleanup
 * but prior to re-enabling the VF.
 */
@@ -2942,7 +2964,8 @@ static void qed_iov_process_mbx_req(struct qed_hwfn 
*p_hwfn,
mbx->first_tlv = mbx->req_virt->first_tlv;
 
/* check if tlv type is known */

[PATCH net-next 4/6] qed*: Allow unicast filtering

2016-10-09 Thread Yuval Mintz
From: Yuval Mintz 

Apparently qede fails to set IFF_UNICAST_FLT, and as a result is not
actually performing unicast MAC filtering.
While we're at it - relax a hard-coded limitation that limits each
interface into using at most 15 unicast MAC addresses before turning
promiscuous. Instead utilize the HW resources to their limit.

Signed-off-by: Yuval Mintz 
---
 drivers/net/ethernet/qlogic/qed/qed_l2.c | 12 ++--
 drivers/net/ethernet/qlogic/qede/qede_main.c |  4 +++-
 include/linux/qed/qed_eth_if.h   |  1 +
 3 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_l2.c 
b/drivers/net/ethernet/qlogic/qed/qed_l2.c
index ddd410a..572ed04 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_l2.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_l2.c
@@ -1652,6 +1652,7 @@ static int qed_fill_eth_dev_info(struct qed_dev *cdev,
 
if (IS_PF(cdev)) {
int max_vf_vlan_filters = 0;
+   int max_vf_mac_filters = 0;
 
if (cdev->int_params.out.int_mode == QED_INT_MODE_MSIX) {
for_each_hwfn(cdev, i)
@@ -1665,11 +1666,18 @@ static int qed_fill_eth_dev_info(struct qed_dev *cdev,
info->num_queues = cdev->num_hwfns;
}
 
-   if (IS_QED_SRIOV(cdev))
+   if (IS_QED_SRIOV(cdev)) {
max_vf_vlan_filters = cdev->p_iov_info->total_vfs *
  QED_ETH_VF_NUM_VLAN_FILTERS;
-   info->num_vlan_filters = RESC_NUM(>hwfns[0], QED_VLAN) -
+   max_vf_mac_filters = cdev->p_iov_info->total_vfs *
+QED_ETH_VF_NUM_MAC_FILTERS;
+   }
+   info->num_vlan_filters = RESC_NUM(QED_LEADING_HWFN(cdev),
+ QED_VLAN) -
 max_vf_vlan_filters;
+   info->num_mac_filters = RESC_NUM(QED_LEADING_HWFN(cdev),
+QED_MAC) -
+   max_vf_mac_filters;
 
ether_addr_copy(info->port_mac,
cdev->hwfns[0].hw_info.hw_mac_addr);
diff --git a/drivers/net/ethernet/qlogic/qede/qede_main.c 
b/drivers/net/ethernet/qlogic/qede/qede_main.c
index 6c2b09c..0e483af 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_main.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_main.c
@@ -2365,6 +2365,8 @@ static void qede_init_ndev(struct qede_dev *edev)
 
qede_set_ethtool_ops(ndev);
 
+   ndev->priv_flags = IFF_UNICAST_FLT;
+
/* user-changeble features */
hw_features = NETIF_F_GRO | NETIF_F_SG |
  NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM |
@@ -3937,7 +3939,7 @@ static void qede_config_rx_mode(struct net_device *ndev)
 
/* Check for promiscuous */
if ((ndev->flags & IFF_PROMISC) ||
-   (uc_count > 15)) { /* @@@TBD resource allocation - 1 */
+   (uc_count > edev->dev_info.num_mac_filters - 1)) {
accept_flags = QED_FILTER_RX_MODE_TYPE_PROMISC;
} else {
/* Add MAC filters according to the unicast secondary macs */
diff --git a/include/linux/qed/qed_eth_if.h b/include/linux/qed/qed_eth_if.h
index 1c77948..1513080 100644
--- a/include/linux/qed/qed_eth_if.h
+++ b/include/linux/qed/qed_eth_if.h
@@ -23,6 +23,7 @@ struct qed_dev_eth_info {
 
u8  port_mac[ETH_ALEN];
u8  num_vlan_filters;
+   u16 num_mac_filters;
 
/* Legacy VF - this affects the datapath, so qede has to know */
bool is_legacy;
-- 
1.9.3



[PATCH] net: dsa: slave: use new api ethtool_{get|set}_link_ksettings

2016-10-09 Thread Philippe Reynes
The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes 
---
 net/dsa/slave.c |   14 --
 1 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 6b1282c..68714a5 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -641,7 +641,8 @@ static netdev_tx_t dsa_slave_xmit(struct sk_buff *skb, 
struct net_device *dev)
 
 /* ethtool operations ***/
 static int
-dsa_slave_get_settings(struct net_device *dev, struct ethtool_cmd *cmd)
+dsa_slave_get_link_ksettings(struct net_device *dev,
+struct ethtool_link_ksettings *cmd)
 {
struct dsa_slave_priv *p = netdev_priv(dev);
int err;
@@ -650,19 +651,20 @@ dsa_slave_get_settings(struct net_device *dev, struct 
ethtool_cmd *cmd)
if (p->phy != NULL) {
err = phy_read_status(p->phy);
if (err == 0)
-   err = phy_ethtool_gset(p->phy, cmd);
+   err = phy_ethtool_ksettings_get(p->phy, cmd);
}
 
return err;
 }
 
 static int
-dsa_slave_set_settings(struct net_device *dev, struct ethtool_cmd *cmd)
+dsa_slave_set_link_ksettings(struct net_device *dev,
+const struct ethtool_link_ksettings *cmd)
 {
struct dsa_slave_priv *p = netdev_priv(dev);
 
if (p->phy != NULL)
-   return phy_ethtool_sset(p->phy, cmd);
+   return phy_ethtool_ksettings_set(p->phy, cmd);
 
return -EOPNOTSUPP;
 }
@@ -990,8 +992,6 @@ void dsa_cpu_port_ethtool_init(struct ethtool_ops *ops)
 }
 
 static const struct ethtool_ops dsa_slave_ethtool_ops = {
-   .get_settings   = dsa_slave_get_settings,
-   .set_settings   = dsa_slave_set_settings,
.get_drvinfo= dsa_slave_get_drvinfo,
.get_regs_len   = dsa_slave_get_regs_len,
.get_regs   = dsa_slave_get_regs,
@@ -1007,6 +1007,8 @@ static const struct ethtool_ops dsa_slave_ethtool_ops = {
.get_wol= dsa_slave_get_wol,
.set_eee= dsa_slave_set_eee,
.get_eee= dsa_slave_get_eee,
+   .get_link_ksettings = dsa_slave_get_link_ksettings,
+   .set_link_ksettings = dsa_slave_set_link_ksettings,
 };
 
 static const struct net_device_ops dsa_slave_netdev_ops = {
-- 
1.7.4.4



[PATCH] qed: Fix to use list_for_each_entry_safe() when delete items

2016-10-09 Thread Wei Yongjun
From: Wei Yongjun 

Since we will remove items off the list using list_del() we need
to use a safe version of the list_for_each_entry() macro aptly named
list_for_each_entry_safe().

Fixes: 0a7fb11c23c0 ("qed: Add Light L2 support")
Signed-off-by: Wei Yongjun 
---
 drivers/net/ethernet/qlogic/qed/qed_ll2.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_ll2.c 
b/drivers/net/ethernet/qlogic/qed/qed_ll2.c
index a6db107..4428333 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_ll2.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_ll2.c
@@ -1517,7 +1517,7 @@ static void qed_ll2_register_cb_ops(struct qed_dev *cdev,
 static int qed_ll2_start(struct qed_dev *cdev, struct qed_ll2_params *params)
 {
struct qed_ll2_info ll2_info;
-   struct qed_ll2_buffer *buffer;
+   struct qed_ll2_buffer *buffer, *tmp;
enum qed_ll2_conn_type conn_type;
struct qed_ptt *p_ptt;
int rc, i;
@@ -1587,7 +1587,7 @@ static int qed_ll2_start(struct qed_dev *cdev, struct 
qed_ll2_params *params)
 
/* Post all Rx buffers to FW */
spin_lock_bh(>ll2->lock);
-   list_for_each_entry(buffer, >ll2->list, list) {
+   list_for_each_entry_safe(buffer, tmp, >ll2->list, list) {
rc = qed_ll2_post_rx_buffer(QED_LEADING_HWFN(cdev),
cdev->ll2->handle,
buffer->phys_addr, 0, buffer, 1);



RE: [PATCH] qed: Fix to use list_for_each_entry_safe() when delete items

2016-10-09 Thread Mintz, Yuval
> From: Wei Yongjun 
> 
> Since we will remove items off the list using list_del() we need to use a safe
> version of the list_for_each_entry() macro aptly named
> list_for_each_entry_safe().
> 
> Fixes: 0a7fb11c23c0 ("qed: Add Light L2 support")
> Signed-off-by: Wei Yongjun 
> ---
>  drivers/net/ethernet/qlogic/qed/qed_ll2.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/qlogic/qed/qed_ll2.c
> b/drivers/net/ethernet/qlogic/qed/qed_ll2.c
> index a6db107..4428333 100644
> --- a/drivers/net/ethernet/qlogic/qed/qed_ll2.c
> +++ b/drivers/net/ethernet/qlogic/qed/qed_ll2.c
> @@ -1517,7 +1517,7 @@ static void qed_ll2_register_cb_ops(struct qed_dev
> *cdev,  static int qed_ll2_start(struct qed_dev *cdev, struct qed_ll2_params
> *params)  {
>   struct qed_ll2_info ll2_info;
> - struct qed_ll2_buffer *buffer;
> + struct qed_ll2_buffer *buffer, *tmp;
>   enum qed_ll2_conn_type conn_type;
>   struct qed_ptt *p_ptt;
>   int rc, i;
> @@ -1587,7 +1587,7 @@ static int qed_ll2_start(struct qed_dev *cdev, struct
> qed_ll2_params *params)
> 
>   /* Post all Rx buffers to FW */
>   spin_lock_bh(>ll2->lock);
> - list_for_each_entry(buffer, >ll2->list, list) {
> + list_for_each_entry_safe(buffer, tmp, >ll2->list, list) {
>   rc = qed_ll2_post_rx_buffer(QED_LEADING_HWFN(cdev),
>   cdev->ll2->handle,
>   buffer->phys_addr, 0, buffer, 1);

Thanks for the catch.

A single petty comment - having the variable called 'tmp' in such a long
function is far from informative. Perhaps 'tmp_buffer' would have been
better.

Regardless,
Acked-by: Yuval Mintz 



[PATCH] staging: net: netlogic: use new api ethtool_{get|set}_link_ksettings

2016-10-09 Thread Philippe Reynes
The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes 
---
 drivers/staging/netlogic/xlr_net.c |   14 --
 1 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/staging/netlogic/xlr_net.c 
b/drivers/staging/netlogic/xlr_net.c
index 552a7dc..cdf01b9 100644
--- a/drivers/staging/netlogic/xlr_net.c
+++ b/drivers/staging/netlogic/xlr_net.c
@@ -172,29 +172,31 @@ static struct phy_device *xlr_get_phydev(struct 
xlr_net_priv *priv)
 /*
  * Ethtool operation
  */
-static int xlr_get_settings(struct net_device *ndev, struct ethtool_cmd *ecmd)
+static int xlr_get_link_ksettings(struct net_device *ndev,
+ struct ethtool_link_ksettings *ecmd)
 {
struct xlr_net_priv *priv = netdev_priv(ndev);
struct phy_device *phydev = xlr_get_phydev(priv);
 
if (!phydev)
return -ENODEV;
-   return phy_ethtool_gset(phydev, ecmd);
+   return phy_ethtool_ksettings_get(phydev, ecmd);
 }
 
-static int xlr_set_settings(struct net_device *ndev, struct ethtool_cmd *ecmd)
+static int xlr_set_link_ksettings(struct net_device *ndev,
+ const struct ethtool_link_ksettings *ecmd)
 {
struct xlr_net_priv *priv = netdev_priv(ndev);
struct phy_device *phydev = xlr_get_phydev(priv);
 
if (!phydev)
return -ENODEV;
-   return phy_ethtool_sset(phydev, ecmd);
+   return phy_ethtool_ksettings_set(phydev, ecmd);
 }
 
 static const struct ethtool_ops xlr_ethtool_ops = {
-   .get_settings = xlr_get_settings,
-   .set_settings = xlr_set_settings,
+   .get_link_ksettings = xlr_get_link_ksettings,
+   .set_link_ksettings = xlr_set_link_ksettings,
 };
 
 /*
-- 
1.7.4.4



Re: Code quality and XDP

2016-10-09 Thread David Miller
From: Tom Herbert 
Date: Sat, 8 Oct 2016 12:11:51 -0700

> On Fri, Oct 7, 2016 at 9:28 PM, Jesper Dangaard Brouer
>  wrote:
>>
>> On Sat, 8 Oct 2016 07:25:01 +0900 Tom Herbert  wrote:
>>
>>> One concern raised at netdev concerning XDP is how are we going to
>>> maintain code quality, security, and correctness of programs being
>>> loaded. With kernel bypass it is not just the kernel code path that is
>>> being bypassed, but also the processes that hold the quality of code
>>> being accepted to a high bar. Our users expect that quality to be
>>> maintained.
>>>
>>> I suggest that we need XDP programs to be subject to the same scrutiny
>>> that any other kernel netdev code is. One idea is to sign programs
>>> that have been accepted into the kernel. By default only signed
>>> programs would be allowed to be loaded, the override to allow unsigned
>>> programs might be a kernel config or a least a boot parameter
>>> (enabling the override needs to be very explicit).
>>
>> Sorry, I think this "lock-down" will kill the DDoS use-case.  In the
>> DDoS mitigation use-case, is all about flexibility to adapt quickly to
>> changing attacks.  Thus, you need the ability to quickly modify your
>> programs to catch attack signatures.
>>
> As I mentioned the ability to run arbitrary programs can be explicitly
> be disabled for such use-cases. But not all use cases of XDP require
> such dynamic program-ability and not every user is going to need or
> want this capability. For instance, an ILA router should be a
> straightforward program to implement and not really need dynamic
> modification (it is controlled through configuration). If someone
> chooses to do a proprietary ILA router themselves they can do that by
> disabling the lock-down, but they shouldn't expect any support from
> the community when things got wrong. This is no different then when
> people post to netdev about problems they are having with proprietary
> modules. If they use an in-tree implementation then we could support
> that.

The only entities that care about lock down are the same entities that
use kernel module signing.

And these area people who put together distributions where they wish to
make sure only signed kernel modules and XDP programs can execute
without somehow "tainting" the system so that crash dumps sent in for
support and analysis are appropriately marked as such.

The default should definitely be to allow the administrator to load
any XDP program they want.

Administrators and distribution maintainers can both optionally change
the system to only run signed XDP programs.

I frankly am not so fearful of people running arbitrary XDP programs.

If a CAP_SYS_ADMIN privileged entity on a physical host does it, it's
his problem.  He could have messed up the system even more.

A CAP_SYS_ADMIN privileged entity on a virtual host can only screw up
his own traffic, and this doesn't hurt anyone else on the machine.

Powerful facilities come at a certain cost.  Let's not defang this
by default until we really know it will bite people.


Re: [PATCH net-next 0/5] be2net: patch-set

2016-10-09 Thread David Miller
From: Sriharsha Basavapatna 
Date: Sun,  9 Oct 2016 09:58:48 +0530

> The following patch set contains a few bug fixes.
> Please consider applying this to the net-next tree.
> Thanks.
> 
> Patch-1 Obtains proper PF number for BEx chips
> Patch-2 Fixes a FW update issue seen with BEx chips
> Patch-3 Updates copyright string
> Patch-4 Fixes TX stats for TSO packets
> Patch-5 Enables VF link state setting for BE3

Please do not target real bug fixes at net-next, they should
target 'net' instead.

And that's where I have applied this series.

Thanks.


[PATCH] net: usb: lan78xx: use new api ethtool_{get|set}_link_ksettings

2016-10-09 Thread Philippe Reynes
The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes 
---
 drivers/net/usb/lan78xx.c |   70 +---
 1 files changed, 40 insertions(+), 30 deletions(-)

diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
index db558b8..13f033c 100644
--- a/drivers/net/usb/lan78xx.c
+++ b/drivers/net/usb/lan78xx.c
@@ -1092,7 +1092,7 @@ static int lan78xx_update_flowcontrol(struct lan78xx_net 
*dev, u8 duplex,
 static int lan78xx_link_reset(struct lan78xx_net *dev)
 {
struct phy_device *phydev = dev->net->phydev;
-   struct ethtool_cmd ecmd = { .cmd = ETHTOOL_GSET };
+   struct ethtool_link_ksettings ecmd;
int ladv, radv, ret;
u32 buf;
 
@@ -1126,12 +1126,12 @@ static int lan78xx_link_reset(struct lan78xx_net *dev)
} else if (phydev->link && !dev->link_on) {
dev->link_on = true;
 
-   phy_ethtool_gset(phydev, );
+   phy_ethtool_ksettings_get(phydev, );
 
ret = phy_read(phydev, LAN88XX_INT_STS);
 
if (dev->udev->speed == USB_SPEED_SUPER) {
-   if (ethtool_cmd_speed() == 1000) {
+   if (ecmd.base.speed == 1000) {
/* disable U2 */
ret = lan78xx_read_reg(dev, USB_CFG1, );
buf &= ~USB_CFG1_DEV_U2_INIT_EN_;
@@ -1159,9 +1159,10 @@ static int lan78xx_link_reset(struct lan78xx_net *dev)
 
netif_dbg(dev, link, dev->net,
  "speed: %u duplex: %d anadv: 0x%04x anlpa: 0x%04x",
- ethtool_cmd_speed(), ecmd.duplex, ladv, radv);
+ ecmd.base.speed, ecmd.base.duplex, ladv, radv);
 
-   ret = lan78xx_update_flowcontrol(dev, ecmd.duplex, ladv, radv);
+   ret = lan78xx_update_flowcontrol(dev, ecmd.base.duplex, ladv,
+radv);
phy_mac_interrupt(phydev, 1);
 
if (!timer_pending(>stat_monitor)) {
@@ -1484,7 +1485,8 @@ static void lan78xx_set_mdix_status(struct net_device 
*net, __u8 mdix_ctrl)
dev->mdix_ctrl = mdix_ctrl;
 }
 
-static int lan78xx_get_settings(struct net_device *net, struct ethtool_cmd 
*cmd)
+static int lan78xx_get_link_ksettings(struct net_device *net,
+ struct ethtool_link_ksettings *cmd)
 {
struct lan78xx_net *dev = netdev_priv(net);
struct phy_device *phydev = net->phydev;
@@ -1495,20 +1497,20 @@ static int lan78xx_get_settings(struct net_device *net, 
struct ethtool_cmd *cmd)
if (ret < 0)
return ret;
 
-   ret = phy_ethtool_gset(phydev, cmd);
+   ret = phy_ethtool_ksettings_get(phydev, cmd);
 
buf = lan78xx_get_mdix_status(net);
 
buf &= LAN88XX_EXT_MODE_CTRL_MDIX_MASK_;
if (buf == LAN88XX_EXT_MODE_CTRL_AUTO_MDIX_) {
-   cmd->eth_tp_mdix = ETH_TP_MDI_AUTO;
-   cmd->eth_tp_mdix_ctrl = ETH_TP_MDI_AUTO;
+   cmd->base.eth_tp_mdix = ETH_TP_MDI_AUTO;
+   cmd->base.eth_tp_mdix_ctrl = ETH_TP_MDI_AUTO;
} else if (buf == LAN88XX_EXT_MODE_CTRL_MDI_) {
-   cmd->eth_tp_mdix = ETH_TP_MDI;
-   cmd->eth_tp_mdix_ctrl = ETH_TP_MDI;
+   cmd->base.eth_tp_mdix = ETH_TP_MDI;
+   cmd->base.eth_tp_mdix_ctrl = ETH_TP_MDI;
} else if (buf == LAN88XX_EXT_MODE_CTRL_MDI_X_) {
-   cmd->eth_tp_mdix = ETH_TP_MDI_X;
-   cmd->eth_tp_mdix_ctrl = ETH_TP_MDI_X;
+   cmd->base.eth_tp_mdix = ETH_TP_MDI_X;
+   cmd->base.eth_tp_mdix_ctrl = ETH_TP_MDI_X;
}
 
usb_autopm_put_interface(dev->intf);
@@ -1516,7 +1518,8 @@ static int lan78xx_get_settings(struct net_device *net, 
struct ethtool_cmd *cmd)
return ret;
 }
 
-static int lan78xx_set_settings(struct net_device *net, struct ethtool_cmd 
*cmd)
+static int lan78xx_set_link_ksettings(struct net_device *net,
+ const struct ethtool_link_ksettings *cmd)
 {
struct lan78xx_net *dev = netdev_priv(net);
struct phy_device *phydev = net->phydev;
@@ -1527,14 +1530,13 @@ static int lan78xx_set_settings(struct net_device *net, 
struct ethtool_cmd *cmd)
if (ret < 0)
return ret;
 
-   if (dev->mdix_ctrl != cmd->eth_tp_mdix_ctrl) {
-   lan78xx_set_mdix_status(net, cmd->eth_tp_mdix_ctrl);
-   }
+   if (dev->mdix_ctrl != cmd->base.eth_tp_mdix_ctrl)
+   lan78xx_set_mdix_status(net, cmd->base.eth_tp_mdix_ctrl);
 
/* change speed & duplex */
-   ret = phy_ethtool_sset(phydev, cmd);
+   ret = phy_ethtool_ksettings_set(phydev, cmd);
 
-   if (!cmd->autoneg) {
+   if (!cmd->base.autoneg) {
/* force link down */
   

Re: net: BUG still has locks held in unix_stream_splice_read

2016-10-09 Thread Dmitry Vyukov
Hello,

While running syzkaller fuzzer on commit
b66484cd74706fa8681d051840fe4b18a3da40ff (Oct 7), I am getting:

[ BUG: syz-executor/15138 still has locks held! ]
4.8.0+ #29 Not tainted
-
1 lock held by syz-executor/15138:
 #0:  (>mutex/1){+.+.+.}, at: [< inline >]
pipe_lock_nested fs/pipe.c:66
 #0:  (>mutex/1){+.+.+.}, at: []
pipe_lock+0x5b/0x70 fs/pipe.c:74

stack backtrace:
CPU: 1 PID: 15138 Comm: syz-executor Not tainted 4.8.0+ #29
Hardware name: Google Google Compute Engine/Google Compute Engine,
BIOS Google 01/01/2011
 880044d4fa38 82d383c9  fbfff1097248
 88005a44a3c0 88005a44a3c0 dc00 88005a44a3c0
 8800541fb9b8 880044d4fa58 81463cd5 
Call Trace:
 [< inline >] __dump_stack lib/dump_stack.c:15
 [] dump_stack+0x12e/0x185 lib/dump_stack.c:51
 [< inline >] print_held_locks_bug kernel/locking/lockdep.c:4296
 [] debug_check_no_locks_held+0x125/0x140
kernel/locking/lockdep.c:4302
 [< inline >] try_to_freeze include/linux/freezer.h:65
 [< inline >] freezer_count include/linux/freezer.h:127
 [< inline >] freezable_schedule_timeout include/linux/freezer.h:192
 [< inline >] unix_stream_data_wait net/unix/af_unix.c:2223
 [] unix_stream_read_generic+0x1317/0x1b70
net/unix/af_unix.c:2332
 [] unix_stream_splice_read+0x15b/0x1d0
net/unix/af_unix.c:2506
 [] sock_splice_read+0xbe/0x100 net/socket.c:775
 [] do_splice_to+0x10f/0x170 fs/splice.c:908
 [< inline >] do_splice fs/splice.c:1196
 [< inline >] SYSC_splice fs/splice.c:1420
 [] SyS_splice+0x114c/0x15b0 fs/splice.c:1403
 [] entry_SYSCALL_64_fastpath+0x23/0xc6


I suspect this is:

commit 25869262ef7af24ccde988867ac3eb1c3d4b88d4
Author: Al Viro 
Date:   Sat Sep 17 21:02:10 2016 -0400
skb_splice_bits(): get rid of callback
since pipe_lock is the outermost now, we don't need to drop/regain
socket locks around the call of splice_to_pipe() from skb_splice_bits(),
which kills the need to have a socket-specific callback; we can just
call splice_to_pipe() and be done with that.


net: BUG still has locks held in unix_stream_splice_read

2016-10-09 Thread Dmitry Vyukov
Hello,

While running syzkaller fuzzer on commit
b66484cd74706fa8681d051840fe4b18a3da40ff (Oct 7), I am getting:

[ BUG: syz-executor/15138 still has locks held! ]
4.8.0+ #29 Not tainted
-
1 lock held by syz-executor/15138:
 #0:  (>mutex/1){+.+.+.}, at: [< inline >]
pipe_lock_nested fs/pipe.c:66
 #0:  (>mutex/1){+.+.+.}, at: []
pipe_lock+0x5b/0x70 fs/pipe.c:74

stack backtrace:
CPU: 1 PID: 15138 Comm: syz-executor Not tainted 4.8.0+ #29
Hardware name: Google Google Compute Engine/Google Compute Engine,
BIOS Google 01/01/2011
 880044d4fa38 82d383c9  fbfff1097248
 88005a44a3c0 88005a44a3c0 dc00 88005a44a3c0
 8800541fb9b8 880044d4fa58 81463cd5 
Call Trace:
 [< inline >] __dump_stack lib/dump_stack.c:15
 [] dump_stack+0x12e/0x185 lib/dump_stack.c:51
 [< inline >] print_held_locks_bug kernel/locking/lockdep.c:4296
 [] debug_check_no_locks_held+0x125/0x140
kernel/locking/lockdep.c:4302
 [< inline >] try_to_freeze include/linux/freezer.h:65
 [< inline >] freezer_count include/linux/freezer.h:127
 [< inline >] freezable_schedule_timeout include/linux/freezer.h:192
 [< inline >] unix_stream_data_wait net/unix/af_unix.c:2223
 [] unix_stream_read_generic+0x1317/0x1b70
net/unix/af_unix.c:2332
 [] unix_stream_splice_read+0x15b/0x1d0
net/unix/af_unix.c:2506
 [] sock_splice_read+0xbe/0x100 net/socket.c:775
 [] do_splice_to+0x10f/0x170 fs/splice.c:908
 [< inline >] do_splice fs/splice.c:1196
 [< inline >] SYSC_splice fs/splice.c:1420
 [] SyS_splice+0x114c/0x15b0 fs/splice.c:1403
 [] entry_SYSCALL_64_fastpath+0x23/0xc6


I suspect this is:

commit 25869262ef7af24ccde988867ac3eb1c3d4b88d4
Author: Al Viro 
Date:   Sat Sep 17 21:02:10 2016 -0400
skb_splice_bits(): get rid of callback
since pipe_lock is the outermost now, we don't need to drop/regain
socket locks around the call of splice_to_pipe() from skb_splice_bits(),
which kills the need to have a socket-specific callback; we can just
call splice_to_pipe() and be done with that.


Re: [PATCH iproute2] devlink: Convert conditional in dl_argv_handle_port() to switch()

2016-10-09 Thread Phil Sutter
On Sun, Oct 09, 2016 at 10:14:18AM +0800, Hangbin Liu wrote:
> Discovered by Phil's covscan. The final return statement is never reached.
> This is not inherently clear from looking at the code, so change the
> conditional to a switch() statement which should clarify this.
> 
> CC: Phil Sutter 
> Signed-off-by: Hangbin Liu 

Acked-by: Phil Sutter 


[PATCH v2] Add support for ethtool operations to RDC R6040.

2016-10-09 Thread VENKAT PRASHANTH B U
Signed-off-by: Venkat Prashanth B U 

Changes since v1:
1. Made the commit message more clear
2. Add enumeration data type RTL_FLAG_MAX
3. Modified the locking interface used in r6040_get_regs()
   
4. Initialized mutex dynamically in a function r6040_get_regs()
   
5. Declared u32 msg_enable in struct r6040_private
---
 drivers/net/ethernet/rdc/r6040.c | 95 
 1 file changed, 95 insertions(+)

diff --git a/drivers/net/ethernet/rdc/r6040.c b/drivers/net/ethernet/rdc/r6040.c
index cb29ee2..167ff59 100644
--- a/drivers/net/ethernet/rdc/r6040.c
+++ b/drivers/net/ethernet/rdc/r6040.c
@@ -183,6 +183,10 @@ struct r6040_descriptor {
u32 rev2;   /* 1C-1F */
 } __aligned(32);
 
+enum rtl_flag {
+   RTL_FLAG_MAX
+};
+
 struct r6040_private {
spinlock_t lock;/* driver lock */
struct pci_dev *pdev;
@@ -196,12 +200,18 @@ struct r6040_private {
dma_addr_t tx_ring_dma;
u16 tx_free_desc;
u16 mcr0;
+   u32 msg_enable;
struct net_device *dev;
struct mii_bus *mii_bus;
struct napi_struct napi;
void __iomem *base;
int old_link;
int old_duplex;
+   struct {
+   DECLARE_BITMAP(flags, RTL_FLAG_MAX);
+   struct mutex mutex;
+   struct work_struct work;
+   } wk;
 };
 
 static char version[] = DRV_NAME
@@ -955,12 +965,97 @@ static void netdev_get_drvinfo(struct net_device *dev,
strlcpy(info->bus_info, pci_name(rp->pdev), sizeof(info->bus_info));
 }
 
+static void r6040_lock_work(struct r6040_private *tp)
+{
+   mutex_lock(>wk.mutex);
+}
+
+static void r6040_unlock_work(struct r6040_private *tp)
+{
+   mutex_unlock(>wk.mutex);
+}
+
+static int r6040_get_regs_len (struct net_device *dev)
+{
+  return R6040_IO_SIZE;
+}
+
+static void r6040_get_regs (struct net_device *dev, struct ethtool_regs *regs, 
void *p)
+{
+  struct r6040_private *tp = netdev_priv (dev);
+  u32 __iomem *data = tp->base;
+  u32 *dw = p;
+  int i;
+
+  r6040_lock_work (tp);
+  for (i = 0; i < R6040_IO_SIZE; i += 4)
+memcpy_fromio (dw++, data++, 4);
+  r6040_unlock_work (tp);
+}
+
+static u32 r6040_get_msglevel (struct net_device *dev)
+{
+  struct r6040_private *tp = netdev_priv (dev);
+
+  return tp->msg_enable;
+}
+
+static void r6040_set_msglevel (struct net_device *dev, u32 value)
+{
+  struct r6040_private *tp = netdev_priv (dev);
+
+  tp->msg_enable = value;
+}
+
+static const char r6040_gstrings[][ETH_GSTRING_LEN] = {
+  "tx_packets",
+  "rx_packets",
+  "tx_errors",
+  "rx_errors",
+  "rx_missed",
+  "align_errors",
+  "tx_single_collisions",
+  "tx_multi_collisions",
+  "unicast",
+  "broadcast",
+  "multicast",
+  "tx_aborted",
+  "tx_underrun",
+};
+
+static int r6040_get_sset_count (struct net_device *dev, int sset)
+{
+  switch (sset)
+{
+case ETH_SS_STATS:
+  return ARRAY_SIZE (r6040_gstrings);
+default:
+  return -EOPNOTSUPP;
+}
+}
+
+static void r6040_get_strings (struct net_device *dev, u32 stringset, u8 * 
data)
+{
+  switch (stringset)
+{
+case ETH_SS_STATS:
+  memcpy (data, *r6040_gstrings, sizeof (r6040_gstrings));
+  break;
+}
+}
+
 static const struct ethtool_ops netdev_ethtool_ops = {
.get_drvinfo= netdev_get_drvinfo,
.get_link   = ethtool_op_get_link,
.get_ts_info= ethtool_op_get_ts_info,
.get_link_ksettings = phy_ethtool_get_link_ksettings,
.set_link_ksettings = phy_ethtool_set_link_ksettings,
+   .get_regs_len = r6040_get_regs_len,
+   .get_msglevel = r6040_get_msglevel,
+   .set_msglevel = r6040_set_msglevel,
+   .get_regs = r6040_get_regs,
+   .get_strings = r6040_get_strings,
+   .get_sset_count = r6040_get_sset_count,
 };
 
 static const struct net_device_ops r6040_netdev_ops = {
-- 
1.9.2



Re: [PATCH net] Panic when tc_lookup_action_n finds a partially initialized action.

2016-10-09 Thread Krister Johansen
Hi Cong,
Thanks for the follow-up.

On Thu, Oct 06, 2016 at 12:01:15PM -0700, Cong Wang wrote:
> On Wed, Oct 5, 2016 at 11:11 PM, Krister Johansen
> > pernet_operations pointer.  The code in register_pernet_subsys() makes
> > no attempt to check for duplicates.  If we add a pointer that's already
> > in the list, and subsequently call unregister, the results seem
> > undefined.  It looks like we'll remove the pernet_operations for the
> > existing action, assuming we don't corrupt the list in the process.
> >
> > Is this actually safe?  If so, what corner case is the act->type /
> > act->kind protecting us from?
> 
> ops->type and ops->kind should be unique too, user-space already
> relies on this (tc action ls action xxx). The code exists probably just
> for sanity check.

With that in mind, would it make sense to change the check to a WARN/BUG
or some kind of assertion?  I mistakenly inferred that it was possible to
legtimately end up in this scenario.

> So please give that patch a try, let's see if we miss any other problem.

Will do.  I have not forgotten.  I hope to have results for you in a few
days.

> > Part of the desire to inhibit extra modprobe calls is that if hundreds
> > of these all start at once on boot, it's really unnecessary to have all
> > of the rest of them wait while lots of extra modprobe calls are forked
> > by the kernel.
> 
> You can tell systemd to load these modules before starting these
> containers to avoid blocking, no?

That was exactly what I did to work around the panic until I was able to
get a patch together.  The preload of the modules is still occurring,
but I was hoping to excise that workaround entirely.

Thanks,

-K