Re: safe skb resetting after decapsulation and encapsulation

2018-05-11 Thread Md. Islam
I'm not an expert on this, but it looks about right. You can take a
look at build_skb() or __build_skb(). They show the fields that need
to be set before passing the skb to netif_receive_skb()/netif_rx().
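As an illustration of the offsetof-based zeroing Jason uses below (the memset between headers_start and headers_end), here is a minimal userspace sketch; the struct and field names are invented stand-ins for sk_buff, not the real layout:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical struct mimicking sk_buff's headers_start/headers_end
 * markers: everything between the two zero-length markers can be wiped
 * in one memset, while fields outside the region survive. */
struct pkt {
	int alloc_state;        /* must survive the reset */
	char headers_start[0];  /* zero-length marker, as in sk_buff */
	int mark;
	int priority;
	char headers_end[0];
	int users;              /* must also survive */
};

static void reset_headers(struct pkt *p)
{
	memset(&p->headers_start, 0,
	       offsetof(struct pkt, headers_end) -
	       offsetof(struct pkt, headers_start));
}
```

The point of the marker fields is that new members added between them are zeroed automatically, with no per-field assignment to forget.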

On Fri, May 11, 2018 at 6:56 PM, Jason A. Donenfeld  wrote:
> Hey Netdev,
>
> A UDP skb comes in via the encap_rcv interface. I do a lot of wild
> things to the bytes in the skb -- change where the head starts, modify
> a few fragments, decrypt some stuff, trim off some things at the end,
> etc. In other words, I'm decapsulating the skb in a pretty intense
> way. I benefit from reusing the same skb, performance wise, but after
> I'm done processing it, it's really a totally new skb. Eventually it's
> time to pass off my skb to netif_receive_skb/netif_rx, but before I do
> that, I need to "reinitialize" the skb. (The same goes for when
> sending out an skb -- I get it from userspace via ndo_start_xmit, do
> crazy things to it, and eventually pass it off to the udp_tunnel send
> functions, but first "reinitializing" it.)
>
> At the moment I'm using a function that looks like this:
>
> static void jasons_wild_and_crazy_skb_reset(struct sk_buff *skb)
> {
>         skb_scrub_packet(skb, true); //1
>         memset(&skb->headers_start, 0, offsetof(struct sk_buff, headers_end) -
>                offsetof(struct sk_buff, headers_start)); //2
>         skb->queue_mapping = 0; //3
>         skb->nohdr = 0; //4
>         skb->peeked = 0; //5
>         skb->mac_len = 0; //6
>         skb->dev = NULL; //7
> #ifdef CONFIG_NET_SCHED
>         skb->tc_index = 0; //8
>         skb_reset_tc(skb); //9
> #endif
>         skb->hdr_len = skb_headroom(skb); //10
>         skb_reset_mac_header(skb); //11
>         skb_reset_network_header(skb); //12
>         skb_probe_transport_header(skb, 0); //13
>         skb_reset_inner_headers(skb); //14
> }
>
> I'm sure that some of this is wrong. Most of it is based on part of an
> Octeon ethernet driver I read a few years ago. I numbered each
> statement above, hoping to go through it with you all in detail here,
> and see what we can cut away and what we can improve.
>
> 1. Obviously correct and required.
> 2. This is probably wrong. At least it causes crashes when receiving
> packets from RHEL 7.5's latest i40e driver in their vendor
> frankenkernel, because those flags there have some critical bits
> related to allocation. But there are a lot of flags in there that I
> might consider going through one by one and zeroing out.
> 3-5. Fields that should be zero, I assume, after
> decapsulating/decrypting (and encapsulating/encrypting).
> 6. WireGuard is layer 3, so there's no mac.
> 7. We're later going to change the dev this came in on.
> 8-9: Same flakey rationale as 2,3-5.
> 10: Since the headroom has changed during the various modifications, I
> need to let the packet field know about it.
> 11-14: The beginning of the headers has changed, and so resetting and
> probing is necessary for this to work at all.
>
> So I'm wondering - how much of this is necessary? How much am I
> unnecessarily reinventing things that exist elsewhere? I'm pretty sure
> in most cases the driver would work with only 1,10-14, but I worry
> that bad things would happen in more unusual configurations. I've
> tried to systematically go through the entire stack and see where
> these might be used or not used, but it seems really inconsistent.
>
> So, I'm writing to ask whether somebody has an easy simplification or
> rule for handling this kind of intense decapsulation/decryption (and,
> in the other direction, encapsulation/encryption) operation. I'd
> like to make sure I get this down solid.
>
> Thanks,
> Jason



-- 
Tamim
PhD Candidate,
Kent State University
http://web.cs.kent.edu/~mislam4/


Re: [RFC bpf-next 07/11] bpf: Add helper to retrieve socket in BPF

2018-05-11 Thread Joe Stringer
On 11 May 2018 at 14:41, Martin KaFai Lau  wrote:
> On Fri, May 11, 2018 at 02:08:01PM -0700, Joe Stringer wrote:
>> On 10 May 2018 at 22:00, Martin KaFai Lau  wrote:
>> > On Wed, May 09, 2018 at 02:07:05PM -0700, Joe Stringer wrote:
>> >> This patch adds a new BPF helper function, sk_lookup() which allows BPF
>> >> programs to find out if there is a socket listening on this host, and
>> >> returns a socket pointer which the BPF program can then access to
>> >> determine, for instance, whether to forward or drop traffic. sk_lookup()
>> >> takes a reference on the socket, so when a BPF program makes use of this
>> >> function, it must subsequently pass the returned pointer into the newly
>> >> added sk_release() to return the reference.
>> >>
>> >> By way of example, the following pseudocode would filter inbound
>> >> connections at XDP if there is no corresponding service listening for
>> >> the traffic:
>> >>
>> >>   struct bpf_sock_tuple tuple;
>> >>   struct bpf_sock_ops *sk;
>> >>
>> >>   populate_tuple(ctx, &tuple); // Extract the 5-tuple from the packet
>> >>   sk = bpf_sk_lookup(ctx, &tuple, sizeof tuple, netns, 0);
>> >>   if (!sk) {
>> >> // Couldn't find a socket listening for this traffic. Drop.
>> >> return TC_ACT_SHOT;
>> >>   }
>> >>   bpf_sk_release(sk, 0);
>> >>   return TC_ACT_OK;
>> >>
>> >> Signed-off-by: Joe Stringer 
>> >> ---
>>
>> ...
>>
>> >> @@ -4032,6 +4036,96 @@ static const struct bpf_func_proto bpf_skb_get_xfrm_state_proto = {
>> >>  };
>> >>  #endif
>> >>
>> >> +struct sock *
>> >> +sk_lookup(struct net *net, struct bpf_sock_tuple *tuple) {
>> > Would it be possible to have another version that
>> > returns a sk without taking its refcnt?
>> > It may have performance benefit.
>>
>> Not really. The sockets are not RCU-protected, and established sockets
>> may be torn down without notice. If we don't take a reference, there's
>> no guarantee that the socket will continue to exist for the duration
>> of running the BPF program.
>>
>> From what I follow, the comment below has a hidden implication which
>> is that sockets without SOCK_RCU_FREE, eg established sockets, may be
>> directly freed regardless of RCU.
> Right, the SOCK_RCU_FREE sk is the one I am concerned about.
> For example, a TCP_LISTEN socket does not require taking a refcnt
> now.  Doing a bpf_sk_lookup() may have a rather big
> impact on handling a TCP syn flood.  Or is the usual intention
> to redirect instead of passing it up to the stack?

I see; if you're only interested in listen sockets, then this series
could probably be extended with a new flag, e.g. something like
BPF_F_SK_FIND_LISTENERS, which restricts the set of possible sockets
found to only listen sockets; the implementation would then call into
__inet_lookup_listener() instead of inet_lookup(). The presence of
that flag in the relevant register during the CALL instruction would
tell the verifier not to reference-track the result, and there'd then
need to be a check on the release to ensure that such an unreferenced
socket is never released. Just a thought, completely untested, and I
could still be missing some detail..

That said, I don't completely follow how you would expect to handle
traffic for sockets that are already established - the helper would no
longer find those sockets, so you wouldn't know whether or not to pass
their traffic up the stack.
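The take-a-reference contract discussed above (bpf_sk_lookup() bumps a refcount that the caller must return via bpf_sk_release()) can be modeled in plain userspace C. Everything here, the struct layout and the linear table, is a deliberately simplified stand-in, not the kernel implementation:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Simplified userspace model of the lookup-then-release contract:
 * sk_lookup() returns a socket with its refcount bumped, and the
 * caller must pair it with sk_release(). All names are hypothetical. */
struct tuple { unsigned saddr, daddr; unsigned short sport, dport; };

struct sock {
	struct tuple t;
	int refcnt;
	bool listening;
};

static struct sock *sk_lookup(struct sock *tbl, size_t n,
			      const struct tuple *t)
{
	for (size_t i = 0; i < n; i++) {
		if (memcmp(&tbl[i].t, t, sizeof(*t)) == 0) {
			tbl[i].refcnt++;   /* hold the socket for the caller */
			return &tbl[i];
		}
	}
	return NULL;               /* no listener: caller drops the packet */
}

static void sk_release(struct sock *sk)
{
	sk->refcnt--;              /* must always be paired with sk_lookup() */
}
```

The verifier-enforced pairing in the real series plays the role that discipline plays here: a lookup without a matching release would leak the reference.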


Re: KASAN: null-ptr-deref Read in rds_ib_get_mr

2018-05-11 Thread Yanjun Zhu



On 2018/5/12 0:58, Santosh Shilimkar wrote:

On 5/11/2018 12:48 AM, Yanjun Zhu wrote:



On 2018/5/11 13:20, DaeRyong Jeong wrote:

We report the crash: KASAN: null-ptr-deref Read in rds_ib_get_mr

Note that this bug was previously reported by syzkaller.
https://syzkaller.appspot.com/bug?id=0bb56a5a48b000b52aa2b0d8dd20b1f545214d91

Nonetheless, this bug has not been fixed yet, and we hope that this
report and our analysis, which benefits from RaceFuzzer's features,
will be helpful in fixing the crash.

This crash has been found in v4.17-rc1 using RaceFuzzer (a modified
version of Syzkaller), which we describe more at the end of this
report. Our analysis shows that the race occurs when invoking two
syscalls concurrently, bind$rds and setsockopt$RDS_GET_MR.


Analysis:
We think the concurrent execution of __rds_rdma_map() and rds_bind()
causes the problem. __rds_rdma_map() checks whether rs->rs_bound_addr
is 0 or not, but concurrent execution with rds_bind() can bypass this
check. Therefore, __rds_rdma_map() calls rs->rs_transport->get_mr(),
and rds_ib_get_mr() causes the null deref at ib_rdma.c:544 in
v4.17-rc1 when dereferencing rs_conn.

Thread interleaving:
CPU0 (__rds_rdma_map)                           CPU1 (rds_bind)
                                                // rds_add_bound() sets
                                                // rs->rs_bound_addr as non-zero
                                                ret = rds_add_bound(rs,
                                                        sin->sin_addr.s_addr,
                                                        &sin->sin_port);
if (rs->rs_bound_addr == 0 || !rs->rs_transport) {
        ret = -ENOTCONN; /* XXX not a great errno */
        goto out;
}
                                                if (rs->rs_transport) {
                                                        /* previously bound */
                                                        trans = rs->rs_transport;
                                                        if (trans->laddr_check(
                                                                sock_net(sock->sk),
                                                                sin->sin_addr.s_addr) != 0) {
                                                                ret = -ENOPROTOOPT;
                                                                // rds_remove_bound() sets
                                                                // rs->rs_bound_addr as 0
                                                                rds_remove_bound(rs);
...
trans_private = rs->rs_transport->get_mr(sg, nents, rs, &mr->r_key);
(in rds_ib_get_mr())
struct rds_ib_connection *ic = rs->rs_conn->c_transport_data;
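The interleaving above is a classic check-then-use race: the bound_addr check and the use of the connection pointer must share one critical section, otherwise a concurrent unbind can clear the pointer in between. A minimal userspace sketch of the safe shape follows; the structures and names are invented simplifications, not the actual RDS fix:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Userspace sketch of why the bound_addr check must be in the same
 * critical section as the use of conn. All names are hypothetical
 * simplifications of the RDS structures. */
struct rsock {
	atomic_flag lock;
	unsigned bound_addr;
	void *conn;              /* stands in for rs->rs_conn */
};

static void rs_lock(struct rsock *rs)
{
	while (atomic_flag_test_and_set(&rs->lock))
		;                /* spin until the flag is clear */
}

static void rs_unlock(struct rsock *rs)
{
	atomic_flag_clear(&rs->lock);
}

static int safe_get_mr(struct rsock *rs, void **out)
{
	int ret = 0;

	rs_lock(rs);
	if (rs->bound_addr == 0)
		ret = -1;        /* not bound: error instead of NULL deref */
	else
		*out = rs->conn; /* conn stays non-NULL while bound */
	rs_unlock(rs);
	return ret;
}

static void rs_unbind(struct rsock *rs)
{
	rs_lock(rs);
	rs->bound_addr = 0;
	rs->conn = NULL;
	rs_unlock(rs);
}
```

With the check and the dereference under one lock, the unbind path can no longer slip in between them, which is exactly the window the report exploits.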


Call sequence (v4.17-rc1):
CPU0
rds_setsockopt
rds_get_mr
    __rds_rdma_map
    rds_ib_get_mr


CPU1
rds_bind
rds_add_bound
...
rds_remove_bound


Crash log:
==
BUG: KASAN: null-ptr-deref in rds_ib_get_mr+0x3a/0x150 net/rds/ib_rdma.c:544

Read of size 8 at addr 0068 by task syz-executor0/32067

CPU: 0 PID: 32067 Comm: syz-executor0 Not tainted 4.17.0-rc1 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014

Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x166/0x21c lib/dump_stack.c:113
  kasan_report_error mm/kasan/report.c:352 [inline]
  kasan_report+0x140/0x360 mm/kasan/report.c:412
  check_memory_region_inline mm/kasan/kasan.c:260 [inline]
  __asan_load8+0x54/0x90 mm/kasan/kasan.c:699
  rds_ib_get_mr+0x3a/0x150 net/rds/ib_rdma.c:544
  __rds_rdma_map+0x521/0x9d0 net/rds/rdma.c:271
  rds_get_mr+0xad/0xf0 net/rds/rdma.c:333
  rds_setsockopt+0x57f/0x720 net/rds/af_rds.c:347
  __sys_setsockopt+0x147/0x230 net/socket.c:1903
  __do_sys_setsockopt net/socket.c:1914 [inline]
  __se_sys_setsockopt net/socket.c:1911 [inline]
  __x64_sys_setsockopt+0x67/0x80 net/socket.c:1911
  do_syscall_64+0x15f/0x4a0 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x4563f9
RSP: 002b:7f6a2b3c2b28 EFLAGS: 0246 ORIG_RAX: 0036
RAX: ffda RBX: 0072bee0 RCX: 004563f9
RDX: 0002 RSI: 0114 RDI: 0015
RBP: 0575 R08: 0020 R09: 
R10: 2140 R11: 0246 R12: 7f6a2b3c36d4
R13:  R14: 006fd398 R15: 
==

diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
index e678699..2228b50 100644
--- a/net/rds/ib_rdma.c
+++ b/net/rds/ib_rdma.c
@@ -539,11 +539,17 @@ void rds_ib_flush_mrs(void)
  void *rds_ib_get_mr(struct scatterlist *sg, unsigned long nents,
 struct rds_sock *rs, u32 *key_ret)
  {
-   struct rds_ib_device *rds_ibdev;
+   struct rds_ib_device *rds_ibdev = NULL;
 struct rds_ib_mr *ibmr = NULL;
-   struct rds_ib_connection *ic = rs->rs_conn->c_transport_data;
+   struct rds_ib_connection *ic = NULL;
 int ret;

+   if (rs->rs_bound_addr == 0) {
+   ret = -EPERM;
+   goto out;
+   }
+

No, you can't return such an error from this API; the socket-related
checks need to be done at the core layer.
I remember fixing this race but probably never pushed the
fix upstream.

OK. Wait for your patch. :-)


The MR code is due for an update with the optimized FRWR code,
which is now stable enough. We will address this issue as

Re: [GIT] Networking

2018-05-11 Thread Linus Torvalds
On Fri, May 11, 2018 at 5:10 PM David Miller  wrote:

> I guess this is my reward for trying to break the monotony of
> pull requests :-)

I actually went back and checked a few older pull requests to see if this
had been going on for a while and I just hadn't noticed.

It just took me by surprise :^p

   Linus


[PATCH bpf-next 2/4] samples: bpf: rename libbpf.h to bpf_insn.h

2018-05-11 Thread Jakub Kicinski
The libbpf.h file in samples is clashing with libbpf's header.
Since it only includes a subset of the filter.h instruction helpers,
rename it to bpf_insn.h.  Drop the unnecessary include of bpf/bpf.h.

Signed-off-by: Jakub Kicinski 
---
 samples/bpf/{libbpf.h => bpf_insn.h}| 8 +++-
 samples/bpf/cookie_uid_helper_example.c | 2 +-
 samples/bpf/fds_example.c   | 4 +++-
 samples/bpf/sock_example.c  | 3 ++-
 samples/bpf/test_cgrp2_attach.c | 3 ++-
 samples/bpf/test_cgrp2_attach2.c| 3 ++-
 samples/bpf/test_cgrp2_sock.c   | 3 ++-
 samples/bpf/test_cgrp2_sock2.c  | 3 ++-
 8 files changed, 17 insertions(+), 12 deletions(-)
 rename samples/bpf/{libbpf.h => bpf_insn.h} (98%)

diff --git a/samples/bpf/libbpf.h b/samples/bpf/bpf_insn.h
similarity index 98%
rename from samples/bpf/libbpf.h
rename to samples/bpf/bpf_insn.h
index 18bfee5aab6b..20dc5cefec84 100644
--- a/samples/bpf/libbpf.h
+++ b/samples/bpf/bpf_insn.h
@@ -1,9 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 */
-/* eBPF mini library */
-#ifndef __LIBBPF_H
-#define __LIBBPF_H
-
-#include 
+/* eBPF instruction mini library */
+#ifndef __BPF_INSN_H
+#define __BPF_INSN_H
 
 struct bpf_insn;
 
diff --git a/samples/bpf/cookie_uid_helper_example.c b/samples/bpf/cookie_uid_helper_example.c
index 8eca27e595ae..deb0e3e0324d 100644
--- a/samples/bpf/cookie_uid_helper_example.c
+++ b/samples/bpf/cookie_uid_helper_example.c
@@ -51,7 +51,7 @@
 #include 
 #include 
 #include 
-#include "libbpf.h"
+#include "bpf_insn.h"
 
 #define PORT 
 
diff --git a/samples/bpf/fds_example.c b/samples/bpf/fds_example.c
index e29bd52ff9e8..9854854f05d1 100644
--- a/samples/bpf/fds_example.c
+++ b/samples/bpf/fds_example.c
@@ -12,8 +12,10 @@
 #include 
 #include 
 
+#include 
+
+#include "bpf_insn.h"
 #include "bpf_load.h"
-#include "libbpf.h"
 #include "sock_example.h"
 
 #define BPF_F_PIN  (1 << 0)
diff --git a/samples/bpf/sock_example.c b/samples/bpf/sock_example.c
index 33a637507c00..60ec467c78ab 100644
--- a/samples/bpf/sock_example.c
+++ b/samples/bpf/sock_example.c
@@ -26,7 +26,8 @@
 #include 
 #include 
 #include 
-#include "libbpf.h"
+#include 
+#include "bpf_insn.h"
 #include "sock_example.h"
 
 char bpf_log_buf[BPF_LOG_BUF_SIZE];
diff --git a/samples/bpf/test_cgrp2_attach.c b/samples/bpf/test_cgrp2_attach.c
index 4bfcaf93fcf3..20fbd1241db3 100644
--- a/samples/bpf/test_cgrp2_attach.c
+++ b/samples/bpf/test_cgrp2_attach.c
@@ -28,8 +28,9 @@
 #include 
 
 #include 
+#include 
 
-#include "libbpf.h"
+#include "bpf_insn.h"
 
 enum {
MAP_KEY_PACKETS,
diff --git a/samples/bpf/test_cgrp2_attach2.c b/samples/bpf/test_cgrp2_attach2.c
index 1af412ec6007..b453e6a161be 100644
--- a/samples/bpf/test_cgrp2_attach2.c
+++ b/samples/bpf/test_cgrp2_attach2.c
@@ -24,8 +24,9 @@
 #include 
 
 #include 
+#include 
 
-#include "libbpf.h"
+#include "bpf_insn.h"
 #include "cgroup_helpers.h"
 
 #define FOO"/foo"
diff --git a/samples/bpf/test_cgrp2_sock.c b/samples/bpf/test_cgrp2_sock.c
index e79594dd629b..b0811da5a00f 100644
--- a/samples/bpf/test_cgrp2_sock.c
+++ b/samples/bpf/test_cgrp2_sock.c
@@ -21,8 +21,9 @@
 #include 
 #include 
 #include 
+#include 
 
-#include "libbpf.h"
+#include "bpf_insn.h"
 
 char bpf_log_buf[BPF_LOG_BUF_SIZE];
 
diff --git a/samples/bpf/test_cgrp2_sock2.c b/samples/bpf/test_cgrp2_sock2.c
index e53f1f6f0867..3b5be2364975 100644
--- a/samples/bpf/test_cgrp2_sock2.c
+++ b/samples/bpf/test_cgrp2_sock2.c
@@ -19,8 +19,9 @@
 #include 
 #include 
 #include 
+#include 
 
-#include "libbpf.h"
+#include "bpf_insn.h"
 #include "bpf_load.h"
 
 static int usage(const char *argv0)
-- 
2.17.0



[PATCH bpf-next 3/4] samples: bpf: fix build after move to compiling full libbpf.a

2018-05-11 Thread Jakub Kicinski
There are many ways users may compile samples, some of which got
broken by commit 5f9380572b4b ("samples: bpf: compile and link
against full libbpf").  Improve path resolution and make libbpf
building a dependency of the source files to force its build.

Samples should now again build with any of:
 cd samples/bpf; make
 make samples/bpf
 make -C samples/bpf
 cd samples/bpf; make O=builddir
 make samples/bpf O=builddir
 make -C samples/bpf O=builddir

Fixes: 5f9380572b4b ("samples: bpf: compile and link against full libbpf")
Reported-by: Björn Töpel 
Signed-off-by: Jakub Kicinski 
---
 samples/bpf/Makefile | 16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 9e255ca4059a..bed205ab1f81 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -1,4 +1,8 @@
 # SPDX-License-Identifier: GPL-2.0
+
+BPF_SAMPLES_PATH ?= $(abspath $(srctree)/$(src))
+TOOLS_PATH := $(BPF_SAMPLES_PATH)/../../tools
+
 # List of programs to build
 hostprogs-y := test_lru_dist
 hostprogs-y += sock_example
@@ -49,7 +53,8 @@ hostprogs-y += xdpsock
 hostprogs-y += xdp_fwd
 
 # Libbpf dependencies
-LIBBPF := ../../tools/lib/bpf/libbpf.a
+LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a
+
 CGROUP_HELPERS := ../../tools/testing/selftests/bpf/cgroup_helpers.o
 TRACE_HELPERS := ../../tools/testing/selftests/bpf/trace_helpers.o
 
@@ -233,15 +238,15 @@ CLANG_ARCH_ARGS = -target $(ARCH)
 endif
 
 # Trick to allow make to be run from this directory
-all: $(LIBBPF)
-   $(MAKE) -C ../../ $(CURDIR)/
+all:
+   $(MAKE) -C ../../ $(CURDIR)/ BPF_SAMPLES_PATH=$(CURDIR)
 
 clean:
$(MAKE) -C ../../ M=$(CURDIR) clean
@rm -f *~
 
 $(LIBBPF): FORCE
-   $(MAKE) -C $(dir $@)
+   $(MAKE) -C $(dir $@) O= srctree=$(BPF_SAMPLES_PATH)/../../
 
 $(obj)/syscall_nrs.s:  $(src)/syscall_nrs.c
$(call if_changed_dep,cc_s_c)
@@ -272,7 +277,8 @@ verify_target_bpf: verify_cmds
exit 2; \
else true; fi
 
-$(src)/*.c: verify_target_bpf
+$(BPF_SAMPLES_PATH)/*.c: verify_target_bpf $(LIBBPF)
+$(src)/*.c: verify_target_bpf $(LIBBPF)
 
 $(obj)/tracex5_kern.o: $(obj)/syscall_nrs.h
 
-- 
2.17.0



[PATCH bpf-next 4/4] samples: bpf: move libbpf from object dependencies to libs

2018-05-11 Thread Jakub Kicinski
Make complains that it doesn't know how to make libbpf.a:

scripts/Makefile.host:106: target 'samples/bpf/../../tools/lib/bpf/libbpf.a' doesn't match the target pattern

Now that we have it as a dependency of the sources, simply add
libbpf.a to the libraries, not the objects.

Signed-off-by: Jakub Kicinski 
---
 samples/bpf/Makefile | 145 +++
 1 file changed, 51 insertions(+), 94 deletions(-)

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index bed205ab1f81..64cdbb4d22a6 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -58,55 +58,53 @@ LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a
 CGROUP_HELPERS := ../../tools/testing/selftests/bpf/cgroup_helpers.o
 TRACE_HELPERS := ../../tools/testing/selftests/bpf/trace_helpers.o
 
-test_lru_dist-objs := test_lru_dist.o $(LIBBPF)
-sock_example-objs := sock_example.o $(LIBBPF)
-fds_example-objs := bpf_load.o $(LIBBPF) fds_example.o
-sockex1-objs := bpf_load.o $(LIBBPF) sockex1_user.o
-sockex2-objs := bpf_load.o $(LIBBPF) sockex2_user.o
-sockex3-objs := bpf_load.o $(LIBBPF) sockex3_user.o
-tracex1-objs := bpf_load.o $(LIBBPF) tracex1_user.o
-tracex2-objs := bpf_load.o $(LIBBPF) tracex2_user.o
-tracex3-objs := bpf_load.o $(LIBBPF) tracex3_user.o
-tracex4-objs := bpf_load.o $(LIBBPF) tracex4_user.o
-tracex5-objs := bpf_load.o $(LIBBPF) tracex5_user.o
-tracex6-objs := bpf_load.o $(LIBBPF) tracex6_user.o
-tracex7-objs := bpf_load.o $(LIBBPF) tracex7_user.o
-load_sock_ops-objs := bpf_load.o $(LIBBPF) load_sock_ops.o
-test_probe_write_user-objs := bpf_load.o $(LIBBPF) test_probe_write_user_user.o
-trace_output-objs := bpf_load.o $(LIBBPF) trace_output_user.o $(TRACE_HELPERS)
-lathist-objs := bpf_load.o $(LIBBPF) lathist_user.o
-offwaketime-objs := bpf_load.o $(LIBBPF) offwaketime_user.o $(TRACE_HELPERS)
-spintest-objs := bpf_load.o $(LIBBPF) spintest_user.o $(TRACE_HELPERS)
-map_perf_test-objs := bpf_load.o $(LIBBPF) map_perf_test_user.o
-test_overhead-objs := bpf_load.o $(LIBBPF) test_overhead_user.o
-test_cgrp2_array_pin-objs := test_cgrp2_array_pin.o $(LIBBPF)
-test_cgrp2_attach-objs := test_cgrp2_attach.o $(LIBBPF)
-test_cgrp2_attach2-objs := test_cgrp2_attach2.o $(LIBBPF) $(CGROUP_HELPERS)
-test_cgrp2_sock-objs := test_cgrp2_sock.o $(LIBBPF)
-test_cgrp2_sock2-objs := bpf_load.o $(LIBBPF) test_cgrp2_sock2.o
-xdp1-objs := xdp1_user.o $(LIBBPF)
+fds_example-objs := bpf_load.o fds_example.o
+sockex1-objs := bpf_load.o sockex1_user.o
+sockex2-objs := bpf_load.o sockex2_user.o
+sockex3-objs := bpf_load.o sockex3_user.o
+tracex1-objs := bpf_load.o tracex1_user.o
+tracex2-objs := bpf_load.o tracex2_user.o
+tracex3-objs := bpf_load.o tracex3_user.o
+tracex4-objs := bpf_load.o tracex4_user.o
+tracex5-objs := bpf_load.o tracex5_user.o
+tracex6-objs := bpf_load.o tracex6_user.o
+tracex7-objs := bpf_load.o tracex7_user.o
+load_sock_ops-objs := bpf_load.o load_sock_ops.o
+test_probe_write_user-objs := bpf_load.o test_probe_write_user_user.o
+trace_output-objs := bpf_load.o trace_output_user.o $(TRACE_HELPERS)
+lathist-objs := bpf_load.o lathist_user.o
+offwaketime-objs := bpf_load.o offwaketime_user.o $(TRACE_HELPERS)
+spintest-objs := bpf_load.o spintest_user.o $(TRACE_HELPERS)
+map_perf_test-objs := bpf_load.o map_perf_test_user.o
+test_overhead-objs := bpf_load.o test_overhead_user.o
+test_cgrp2_array_pin-objs := test_cgrp2_array_pin.o
+test_cgrp2_attach-objs := test_cgrp2_attach.o
+test_cgrp2_attach2-objs := test_cgrp2_attach2.o $(CGROUP_HELPERS)
+test_cgrp2_sock-objs := test_cgrp2_sock.o
+test_cgrp2_sock2-objs := bpf_load.o test_cgrp2_sock2.o
+xdp1-objs := xdp1_user.o
 # reuse xdp1 source intentionally
-xdp2-objs := xdp1_user.o $(LIBBPF)
-xdp_router_ipv4-objs := bpf_load.o $(LIBBPF) xdp_router_ipv4_user.o
-test_current_task_under_cgroup-objs := bpf_load.o $(LIBBPF) $(CGROUP_HELPERS) \
+xdp2-objs := xdp1_user.o
+xdp_router_ipv4-objs := bpf_load.o xdp_router_ipv4_user.o
+test_current_task_under_cgroup-objs := bpf_load.o $(CGROUP_HELPERS) \
   test_current_task_under_cgroup_user.o
-trace_event-objs := bpf_load.o $(LIBBPF) trace_event_user.o $(TRACE_HELPERS)
-sampleip-objs := bpf_load.o $(LIBBPF) sampleip_user.o $(TRACE_HELPERS)
-tc_l2_redirect-objs := bpf_load.o $(LIBBPF) tc_l2_redirect_user.o
-lwt_len_hist-objs := bpf_load.o $(LIBBPF) lwt_len_hist_user.o
-xdp_tx_iptunnel-objs := bpf_load.o $(LIBBPF) xdp_tx_iptunnel_user.o
-test_map_in_map-objs := bpf_load.o $(LIBBPF) test_map_in_map_user.o
-per_socket_stats_example-objs := cookie_uid_helper_example.o $(LIBBPF)
-xdp_redirect-objs := bpf_load.o $(LIBBPF) xdp_redirect_user.o
-xdp_redirect_map-objs := bpf_load.o $(LIBBPF) xdp_redirect_map_user.o
-xdp_redirect_cpu-objs := bpf_load.o $(LIBBPF) xdp_redirect_cpu_user.o
-xdp_monitor-objs := bpf_load.o $(LIBBPF) xdp_monitor_user.o
-xdp_rxq_info-objs := xdp_rxq_info_user.o $(LIBBPF)
-syscall_tp-objs := bpf_load.o $(LIBBPF) syscall_tp_user.o
-cpustat-objs := bpf_load.o $(LIBBPF) 

[PATCH bpf-next 0/4] samples: bpf: fix build after move to full libbpf

2018-05-11 Thread Jakub Kicinski
Hi!

Following patches address build issues after recent move to libbpf.
For out-of-tree builds we would see the following error:

gcc: error: samples/bpf/../../tools/lib/bpf/libbpf.a: No such file or directory

The mini-library called libbpf.h in samples is renamed to bpf_insn.h;
using linux/filter.h seems not completely trivial, since some samples
get upset when the order of the include search path is changed.  We do
have to rename libbpf.h, however, because otherwise it's hard to
reliably get to libbpf's header in out-of-tree builds.

Jakub Kicinski (4):
  samples: bpf: include bpf/bpf.h instead of local libbpf.h
  samples: bpf: rename libbpf.h to bpf_insn.h
  samples: bpf: fix build after move to compiling full libbpf.a
  samples: bpf: move libbpf from object dependencies to libs

 samples/bpf/Makefile  | 170 +++---
 samples/bpf/{libbpf.h => bpf_insn.h}  |   8 +-
 samples/bpf/bpf_load.c|   2 +-
 samples/bpf/bpf_load.h|   2 +-
 samples/bpf/cookie_uid_helper_example.c   |   2 +-
 samples/bpf/cpustat_user.c|   2 +-
 samples/bpf/fds_example.c |   4 +-
 samples/bpf/lathist_user.c|   2 +-
 samples/bpf/load_sock_ops.c   |   2 +-
 samples/bpf/lwt_len_hist_user.c   |   2 +-
 samples/bpf/map_perf_test_user.c  |   2 +-
 samples/bpf/sock_example.c|   3 +-
 samples/bpf/sock_example.h|   1 -
 samples/bpf/sockex1_user.c|   2 +-
 samples/bpf/sockex2_user.c|   2 +-
 samples/bpf/sockex3_user.c|   2 +-
 samples/bpf/syscall_tp_user.c |   2 +-
 samples/bpf/tc_l2_redirect_user.c |   2 +-
 samples/bpf/test_cgrp2_array_pin.c|   2 +-
 samples/bpf/test_cgrp2_attach.c   |   3 +-
 samples/bpf/test_cgrp2_attach2.c  |   3 +-
 samples/bpf/test_cgrp2_sock.c |   3 +-
 samples/bpf/test_cgrp2_sock2.c|   3 +-
 .../bpf/test_current_task_under_cgroup_user.c |   2 +-
 samples/bpf/test_lru_dist.c   |   2 +-
 samples/bpf/test_map_in_map_user.c|   2 +-
 samples/bpf/test_overhead_user.c  |   2 +-
 samples/bpf/test_probe_write_user_user.c  |   2 +-
 samples/bpf/trace_output_user.c   |   2 +-
 samples/bpf/tracex1_user.c|   2 +-
 samples/bpf/tracex2_user.c|   2 +-
 samples/bpf/tracex3_user.c|   2 +-
 samples/bpf/tracex4_user.c|   2 +-
 samples/bpf/tracex5_user.c|   2 +-
 samples/bpf/tracex6_user.c|   2 +-
 samples/bpf/tracex7_user.c|   2 +-
 samples/bpf/xdp_fwd_user.c|   2 +-
 samples/bpf/xdp_monitor_user.c|   2 +-
 samples/bpf/xdp_redirect_cpu_user.c   |   2 +-
 samples/bpf/xdp_redirect_map_user.c   |   2 +-
 samples/bpf/xdp_redirect_user.c   |   2 +-
 samples/bpf/xdp_router_ipv4_user.c|   2 +-
 samples/bpf/xdp_tx_iptunnel_user.c|   2 +-
 samples/bpf/xdpsock_user.c|   2 +-
 44 files changed, 117 insertions(+), 151 deletions(-)
 rename samples/bpf/{libbpf.h => bpf_insn.h} (98%)

-- 
2.17.0



[PATCH bpf-next 1/4] samples: bpf: include bpf/bpf.h instead of local libbpf.h

2018-05-11 Thread Jakub Kicinski
There are two files in the tree called libbpf.h, which is becoming
problematic.  Most samples don't actually need the local libbpf.h;
they simply include it to get to bpf/bpf.h.  Include bpf/bpf.h
directly instead.

Signed-off-by: Jakub Kicinski 
---
 samples/bpf/bpf_load.c| 2 +-
 samples/bpf/bpf_load.h| 2 +-
 samples/bpf/cpustat_user.c| 2 +-
 samples/bpf/lathist_user.c| 2 +-
 samples/bpf/load_sock_ops.c   | 2 +-
 samples/bpf/lwt_len_hist_user.c   | 2 +-
 samples/bpf/map_perf_test_user.c  | 2 +-
 samples/bpf/sock_example.h| 1 -
 samples/bpf/sockex1_user.c| 2 +-
 samples/bpf/sockex2_user.c| 2 +-
 samples/bpf/sockex3_user.c| 2 +-
 samples/bpf/syscall_tp_user.c | 2 +-
 samples/bpf/tc_l2_redirect_user.c | 2 +-
 samples/bpf/test_cgrp2_array_pin.c| 2 +-
 samples/bpf/test_current_task_under_cgroup_user.c | 2 +-
 samples/bpf/test_lru_dist.c   | 2 +-
 samples/bpf/test_map_in_map_user.c| 2 +-
 samples/bpf/test_overhead_user.c  | 2 +-
 samples/bpf/test_probe_write_user_user.c  | 2 +-
 samples/bpf/trace_output_user.c   | 2 +-
 samples/bpf/tracex1_user.c| 2 +-
 samples/bpf/tracex2_user.c| 2 +-
 samples/bpf/tracex3_user.c| 2 +-
 samples/bpf/tracex4_user.c| 2 +-
 samples/bpf/tracex5_user.c| 2 +-
 samples/bpf/tracex6_user.c| 2 +-
 samples/bpf/tracex7_user.c| 2 +-
 samples/bpf/xdp_fwd_user.c| 2 +-
 samples/bpf/xdp_monitor_user.c| 2 +-
 samples/bpf/xdp_redirect_cpu_user.c   | 2 +-
 samples/bpf/xdp_redirect_map_user.c   | 2 +-
 samples/bpf/xdp_redirect_user.c   | 2 +-
 samples/bpf/xdp_router_ipv4_user.c| 2 +-
 samples/bpf/xdp_tx_iptunnel_user.c| 2 +-
 samples/bpf/xdpsock_user.c| 2 +-
 35 files changed, 34 insertions(+), 35 deletions(-)

diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index a6b290de5632..89161c9ed466 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -24,7 +24,7 @@
 #include 
 #include 
 #include 
-#include "libbpf.h"
+#include 
 #include "bpf_load.h"
 #include "perf-sys.h"
 
diff --git a/samples/bpf/bpf_load.h b/samples/bpf/bpf_load.h
index f9da59bca0cc..814894a12974 100644
--- a/samples/bpf/bpf_load.h
+++ b/samples/bpf/bpf_load.h
@@ -2,7 +2,7 @@
 #ifndef __BPF_LOAD_H
 #define __BPF_LOAD_H
 
-#include "libbpf.h"
+#include 
 
 #define MAX_MAPS 32
 #define MAX_PROGS 32
diff --git a/samples/bpf/cpustat_user.c b/samples/bpf/cpustat_user.c
index 2b4cd1ae57c5..869a99406dbf 100644
--- a/samples/bpf/cpustat_user.c
+++ b/samples/bpf/cpustat_user.c
@@ -17,7 +17,7 @@
 #include 
 #include 
 
-#include "libbpf.h"
+#include 
 #include "bpf_load.h"
 
 #define MAX_CPU8
diff --git a/samples/bpf/lathist_user.c b/samples/bpf/lathist_user.c
index 6477bad5b4e2..c8e88cc84e61 100644
--- a/samples/bpf/lathist_user.c
+++ b/samples/bpf/lathist_user.c
@@ -10,7 +10,7 @@
 #include 
 #include 
 #include 
-#include "libbpf.h"
+#include 
 #include "bpf_load.h"
 
 #define MAX_ENTRIES20
diff --git a/samples/bpf/load_sock_ops.c b/samples/bpf/load_sock_ops.c
index e5da6cf71a3e..8ecb41ea0c03 100644
--- a/samples/bpf/load_sock_ops.c
+++ b/samples/bpf/load_sock_ops.c
@@ -8,7 +8,7 @@
 #include 
 #include 
 #include 
-#include "libbpf.h"
+#include 
 #include "bpf_load.h"
 #include 
 #include 
diff --git a/samples/bpf/lwt_len_hist_user.c b/samples/bpf/lwt_len_hist_user.c
index 7fcb94c09112..587b68b1f8dd 100644
--- a/samples/bpf/lwt_len_hist_user.c
+++ b/samples/bpf/lwt_len_hist_user.c
@@ -9,7 +9,7 @@
 #include 
 #include 
 
-#include "libbpf.h"
+#include 
 #include "bpf_util.h"
 
 #define MAX_INDEX 64
diff --git a/samples/bpf/map_perf_test_user.c b/samples/bpf/map_perf_test_user.c
index 519d9af4b04a..38b7b1a96cc2 100644
--- a/samples/bpf/map_perf_test_user.c
+++ b/samples/bpf/map_perf_test_user.c
@@ -21,7 +21,7 @@
 #include 
 #include 
 
-#include "libbpf.h"
+#include 
 #include "bpf_load.h"
 
 #define TEST_BIT(t) (1U << (t))
diff --git a/samples/bpf/sock_example.h b/samples/bpf/sock_example.h
index 772d5dad8465..a27d7579bc73 100644
--- a/samples/bpf/sock_example.h
+++ b/samples/bpf/sock_example.h
@@ -9,7 +9,6 @@
 #include 
 #include 
 #include 
-#include "libbpf.h"
 
 static inline int open_raw_sock(const char *name)
 {
diff --git a/samples/bpf/sockex1_user.c b/samples/bpf/sockex1_user.c
index 2be935c2627d..93ec01c56104 100644
--- a/samples/bpf/sockex1_user.c
+++ 

Re: [GIT] Networking

2018-05-11 Thread David Miller
From: Linus Torvalds 
Date: Fri, 11 May 2018 14:25:59 -0700

> David, is there something you want to tell us?
> 
> Drugs are bad, m'kay..

I guess this is my reward for trying to break the monotony of
pull requests :-)


Re: [PATCH net] net: dsa: bcm_sf2: Fix RX_CLS_LOC_ANY overwrite for last rule

2018-05-11 Thread David Miller
From: Florian Fainelli 
Date: Fri, 11 May 2018 16:38:02 -0700

> David, please discard that for now, the IPv4 part is correct, but I am
> not fixing the bug correctly for the IPv6 part. v2 coming some time next
> week. Thank you!

Ok.


Re: [PATCH net] net: dsa: bcm_sf2: Fix RX_CLS_LOC_ANY overwrite for last rule

2018-05-11 Thread Florian Fainelli
On 05/11/2018 04:24 PM, Florian Fainelli wrote:
> When we let the kernel pick a rule location with RX_CLS_LOC_ANY, we
> would be able to overwrite the last rules because of a number of issues:
> 
> - the IPv4 code path would not check that rule_index is within
>   bounds, and the IPv6 code path would only check the second index,
>   not the first one
> 
> - find_first_zero_bit() needs to operate on the full bitmap size
>   (priv->num_cfp_rules), otherwise it would be off by one in the
>   results it returns and the checks against bcm_sf2_cfp_rule_size()
>   would be non-functional
> 
> Fixes: 3306145866b6 ("net: dsa: bcm_sf2: Move IPv4 CFP processing to specific functions")
> Fixes: ba0696c22e7c ("net: dsa: bcm_sf2: Add support for IPv6 CFP rules")
> Signed-off-by: Florian Fainelli 

David, please discard that for now, the IPv4 part is correct, but I am
not fixing the bug correctly for the IPv6 part. v2 coming some time next
week. Thank you!
-- 
Florian
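As an aside, the find_first_zero_bit() off-by-one described in the quoted commit message is easy to demonstrate with a userspace re-implementation (deliberately simplified, linear-scan semantics only; not the kernel's optimized version):

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace stand-in for the kernel's find_first_zero_bit():
 * returns the index of the first clear bit, or `size` if none. Passing
 * a size smaller than the real bitmap makes the last slot unfindable,
 * which is the off-by-one described in the commit message. */
static size_t find_first_zero_bit(const unsigned long *bitmap, size_t size)
{
	for (size_t i = 0; i < size; i++) {
		size_t word = i / (8 * sizeof(unsigned long));
		size_t bit  = i % (8 * sizeof(unsigned long));

		if (!(bitmap[word] & (1UL << bit)))
			return i;
	}
	return size;   /* no free slot found */
}
```

Searching over `size - 1` bits reports "bitmap full" even when the last slot is free, so a caller allocating rule locations would never hand out the final rule index.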


Proposal

2018-05-11 Thread Zeliha Omer Faruk



--
Hello

Greetings to you please i have a business proposal for you contact me
for more detailes asap thanks.

Best Regards,
Miss.Zeliha ömer faruk
Esentepe Mahallesi Büyükdere
Caddesi Kristal Kule Binasi
No:215
Sisli - Istanbul, Turkey


[PATCH net-next 0/3] sctp: Introduce sctp_flush_ctx

2018-05-11 Thread Marcelo Ricardo Leitner
This struct will hold all the context used during the outq flush, so we
don't have to pass lots of pointers all around.

Checked on x86_64: the compiler inlines all these functions and there is no
extra dereference added because of the struct.

Marcelo Ricardo Leitner (3):
  sctp: add sctp_flush_ctx, a context struct on outq_flush routines
  sctp: add asoc and packet to sctp_flush_ctx
  sctp: checkpatch fixups

 net/sctp/outqueue.c | 259 
 1 file changed, 119 insertions(+), 140 deletions(-)

--
2.14.3


[PATCH net-next 3/3] sctp: checkpatch fixups

2018-05-11 Thread Marcelo Ricardo Leitner
A collection of fixups for previous patches, left for later so as not to
introduce unnecessary changes while moving code around.

Signed-off-by: Marcelo Ricardo Leitner 
---
 net/sctp/outqueue.c | 20 +++-
 1 file changed, 7 insertions(+), 13 deletions(-)

diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
index 
a594d181fa1178c34cf477e13d700f7b37e72e21..9a2fa7d6d68b1d695cd745ed612eb32193f947e0
 100644
--- a/net/sctp/outqueue.c
+++ b/net/sctp/outqueue.c
@@ -812,8 +812,7 @@ static void sctp_outq_select_transport(struct 
sctp_flush_ctx *ctx,
 
if (!new_transport) {
if (!sctp_chunk_is_data(chunk)) {
-   /*
-* If we have a prior transport pointer, see if
+   /* If we have a prior transport pointer, see if
 * the destination address of the chunk
 * matches the destination address of the
 * current transport.  If not a match, then
@@ -912,8 +911,7 @@ static void sctp_outq_flush_ctrl(struct sctp_flush_ctx *ctx)
sctp_outq_select_transport(ctx, chunk);
 
switch (chunk->chunk_hdr->type) {
-   /*
-* 6.10 Bundling
+   /* 6.10 Bundling
 *   ...
 *   An endpoint MUST NOT bundle INIT, INIT ACK or SHUTDOWN
 *   COMPLETE with any other chunks.  [Send them immediately.]
@@ -1061,8 +1059,7 @@ static void sctp_outq_flush_data(struct sctp_flush_ctx 
*ctx,
return;
}
 
-   /*
-* RFC 2960 6.1  Transmission of DATA Chunks
+   /* RFC 2960 6.1  Transmission of DATA Chunks
 *
 * C) When the time comes for the sender to transmit,
 * before sending new DATA chunks, the sender MUST
@@ -1101,8 +1098,7 @@ static void sctp_outq_flush_data(struct sctp_flush_ctx 
*ctx,
 
sctp_outq_select_transport(ctx, chunk);
 
-   pr_debug("%s: outq:%p, chunk:%p[%s], tx-tsn:0x%x skb->head:%p "
-"skb->users:%d\n",
+   pr_debug("%s: outq:%p, chunk:%p[%s], tx-tsn:0x%x skb->head:%p 
skb->users:%d\n",
 __func__, ctx->q, chunk, chunk && chunk->chunk_hdr ?
 sctp_cname(SCTP_ST_CHUNK(chunk->chunk_hdr->type)) :
 "illegal chunk", ntohl(chunk->subh.data_hdr->tsn),
@@ -1175,8 +1171,7 @@ static void sctp_outq_flush_transports(struct 
sctp_flush_ctx *ctx)
}
 }
 
-/*
- * Try to flush an outqueue.
+/* Try to flush an outqueue.
  *
  * Description: Send everything in q which we legally can, subject to
  * congestion limitations.
@@ -1196,8 +1191,7 @@ static void sctp_outq_flush(struct sctp_outq *q, int 
rtx_timeout, gfp_t gfp)
.gfp = gfp,
};
 
-   /*
-* 6.10 Bundling
+   /* 6.10 Bundling
 *   ...
 *   When bundling control chunks with DATA chunks, an
 *   endpoint MUST place control chunks first in the outbound
@@ -1768,7 +1762,7 @@ static int sctp_acked(struct sctp_sackhdr *sack, __u32 
tsn)
if (TSN_lte(tsn, ctsn))
goto pass;
 
-   /* 3.3.4 Selective Acknowledgement (SACK) (3):
+   /* 3.3.4 Selective Acknowledgment (SACK) (3):
 *
 * Gap Ack Blocks:
 *  These fields contain the Gap Ack Blocks. They are repeated
-- 
2.14.3



[PATCH net-next 2/3] sctp: add asoc and packet to sctp_flush_ctx

2018-05-11 Thread Marcelo Ricardo Leitner
Pre-compute these so the compiler won't reload them (due to
no-strict-aliasing).

Signed-off-by: Marcelo Ricardo Leitner 
---
 net/sctp/outqueue.c | 99 -
 1 file changed, 45 insertions(+), 54 deletions(-)

diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
index 
db94a2513dd874149aa77c4936f68537e97f8855..a594d181fa1178c34cf477e13d700f7b37e72e21
 100644
--- a/net/sctp/outqueue.c
+++ b/net/sctp/outqueue.c
@@ -798,16 +798,17 @@ struct sctp_flush_ctx {
struct sctp_transport *transport;
/* These transports have chunks to send. */
struct list_head transport_list;
+   struct sctp_association *asoc;
+   /* Packet on the current transport above */
+   struct sctp_packet *packet;
gfp_t gfp;
 };
 
 /* transport: current transport */
-static bool sctp_outq_select_transport(struct sctp_flush_ctx *ctx,
+static void sctp_outq_select_transport(struct sctp_flush_ctx *ctx,
   struct sctp_chunk *chunk)
 {
struct sctp_transport *new_transport = chunk->transport;
-   struct sctp_association *asoc = ctx->q->asoc;
-   bool changed = false;
 
if (!new_transport) {
if (!sctp_chunk_is_data(chunk)) {
@@ -825,7 +826,7 @@ static bool sctp_outq_select_transport(struct 
sctp_flush_ctx *ctx,

>transport->ipaddr))
new_transport = ctx->transport;
else
-   new_transport = sctp_assoc_lookup_paddr(asoc,
+   new_transport = 
sctp_assoc_lookup_paddr(ctx->asoc,
  >dest);
}
 
@@ -833,7 +834,7 @@ static bool sctp_outq_select_transport(struct 
sctp_flush_ctx *ctx,
 * use the current active path.
 */
if (!new_transport)
-   new_transport = asoc->peer.active_path;
+   new_transport = ctx->asoc->peer.active_path;
} else {
__u8 type;
 
@@ -858,7 +859,7 @@ static bool sctp_outq_select_transport(struct 
sctp_flush_ctx *ctx,
if (type != SCTP_CID_HEARTBEAT &&
type != SCTP_CID_HEARTBEAT_ACK &&
type != SCTP_CID_ASCONF_ACK)
-   new_transport = asoc->peer.active_path;
+   new_transport = ctx->asoc->peer.active_path;
break;
default:
break;
@@ -867,27 +868,25 @@ static bool sctp_outq_select_transport(struct 
sctp_flush_ctx *ctx,
 
/* Are we switching transports? Take care of transport locks. */
if (new_transport != ctx->transport) {
-   changed = true;
ctx->transport = new_transport;
+   ctx->packet = >transport->packet;
+
if (list_empty(>transport->send_ready))
list_add_tail(>transport->send_ready,
  >transport_list);
 
-   sctp_packet_config(>transport->packet, 
asoc->peer.i.init_tag,
-  asoc->peer.ecn_capable);
+   sctp_packet_config(ctx->packet,
+  ctx->asoc->peer.i.init_tag,
+  ctx->asoc->peer.ecn_capable);
/* We've switched transports, so apply the
 * Burst limit to the new transport.
 */
sctp_transport_burst_limited(ctx->transport);
}
-
-   return changed;
 }
 
 static void sctp_outq_flush_ctrl(struct sctp_flush_ctx *ctx)
 {
-   struct sctp_association *asoc = ctx->q->asoc;
-   struct sctp_packet *packet = NULL;
struct sctp_chunk *chunk, *tmp;
enum sctp_xmit status;
int one_packet, error;
@@ -901,7 +900,7 @@ static void sctp_outq_flush_ctrl(struct sctp_flush_ctx *ctx)
 * NOT use the new IP address as a source for ANY SCTP
 * packet except on carrying an ASCONF Chunk.
 */
-   if (asoc->src_out_of_asoc_ok &&
+   if (ctx->asoc->src_out_of_asoc_ok &&
chunk->chunk_hdr->type != SCTP_CID_ASCONF)
continue;
 
@@ -910,8 +909,7 @@ static void sctp_outq_flush_ctrl(struct sctp_flush_ctx *ctx)
/* Pick the right transport to use. Should always be true for
 * the first chunk as we don't have a transport by then.
 */
-   if (sctp_outq_select_transport(ctx, chunk))
-   packet = >transport->packet;
+   sctp_outq_select_transport(ctx, chunk);
 
switch (chunk->chunk_hdr->type) {
/*
@@ -926,14 +924,14 @@ static void 

[PATCH net-next 1/3] sctp: add sctp_flush_ctx, a context struct on outq_flush routines

2018-05-11 Thread Marcelo Ricardo Leitner
With this struct we avoid passing lots of variables around and taking care
of updating the current transport/packet.

Signed-off-by: Marcelo Ricardo Leitner 
---
 net/sctp/outqueue.c | 182 +---
 1 file changed, 88 insertions(+), 94 deletions(-)

diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
index 
c7f65bcd7bd6ee6996080d091bda1651f7bb8c44..db94a2513dd874149aa77c4936f68537e97f8855
 100644
--- a/net/sctp/outqueue.c
+++ b/net/sctp/outqueue.c
@@ -791,13 +791,22 @@ static int sctp_packet_singleton(struct sctp_transport 
*transport,
return sctp_packet_transmit(, gfp);
 }
 
-static bool sctp_outq_select_transport(struct sctp_chunk *chunk,
-  struct sctp_association *asoc,
-  struct sctp_transport **transport,
-  struct list_head *transport_list)
+/* Struct to hold the context during sctp outq flush */
+struct sctp_flush_ctx {
+   struct sctp_outq *q;
+   /* Current transport being used. It's NOT the same as curr active one */
+   struct sctp_transport *transport;
+   /* These transports have chunks to send. */
+   struct list_head transport_list;
+   gfp_t gfp;
+};
+
+/* transport: current transport */
+static bool sctp_outq_select_transport(struct sctp_flush_ctx *ctx,
+  struct sctp_chunk *chunk)
 {
struct sctp_transport *new_transport = chunk->transport;
-   struct sctp_transport *curr = *transport;
+   struct sctp_association *asoc = ctx->q->asoc;
bool changed = false;
 
if (!new_transport) {
@@ -812,9 +821,9 @@ static bool sctp_outq_select_transport(struct sctp_chunk 
*chunk,
 * after processing ASCONFs, we may have new
 * transports created.
 */
-   if (curr && sctp_cmp_addr_exact(>dest,
-   >ipaddr))
-   new_transport = curr;
+   if (ctx->transport && sctp_cmp_addr_exact(>dest,
+   
>transport->ipaddr))
+   new_transport = ctx->transport;
else
new_transport = sctp_assoc_lookup_paddr(asoc,
  >dest);
@@ -857,37 +866,33 @@ static bool sctp_outq_select_transport(struct sctp_chunk 
*chunk,
}
 
/* Are we switching transports? Take care of transport locks. */
-   if (new_transport != curr) {
+   if (new_transport != ctx->transport) {
changed = true;
-   curr = new_transport;
-   *transport = curr;
-   if (list_empty(>send_ready))
-   list_add_tail(>send_ready, transport_list);
+   ctx->transport = new_transport;
+   if (list_empty(>transport->send_ready))
+   list_add_tail(>transport->send_ready,
+ >transport_list);
 
-   sctp_packet_config(>packet, asoc->peer.i.init_tag,
+   sctp_packet_config(>transport->packet, 
asoc->peer.i.init_tag,
   asoc->peer.ecn_capable);
/* We've switched transports, so apply the
 * Burst limit to the new transport.
 */
-   sctp_transport_burst_limited(curr);
+   sctp_transport_burst_limited(ctx->transport);
}
 
return changed;
 }
 
-static void sctp_outq_flush_ctrl(struct sctp_outq *q,
-struct sctp_transport **_transport,
-struct list_head *transport_list,
-gfp_t gfp)
+static void sctp_outq_flush_ctrl(struct sctp_flush_ctx *ctx)
 {
-   struct sctp_transport *transport = *_transport;
-   struct sctp_association *asoc = q->asoc;
+   struct sctp_association *asoc = ctx->q->asoc;
struct sctp_packet *packet = NULL;
struct sctp_chunk *chunk, *tmp;
enum sctp_xmit status;
int one_packet, error;
 
-   list_for_each_entry_safe(chunk, tmp, >control_chunk_list, list) {
+   list_for_each_entry_safe(chunk, tmp, >q->control_chunk_list, list) 
{
one_packet = 0;
 
/* RFC 5061, 5.3
@@ -905,11 +910,8 @@ static void sctp_outq_flush_ctrl(struct sctp_outq *q,
/* Pick the right transport to use. Should always be true for
 * the first chunk as we don't have a transport by then.
 */
-   if (sctp_outq_select_transport(chunk, asoc, ,
-  _list)) {
-   transport = *_transport;
-   packet = >packet;
- 

[PATCH net] net: dsa: bcm_sf2: Fix RX_CLS_LOC_ANY overwrite for last rule

2018-05-11 Thread Florian Fainelli
When we let the kernel pick up a rule location with RX_CLS_LOC_ANY, we
would be able to overwrite the last rules because of a number of issues:

- the IPv4 code path would not check that rule_index is within bounds,
  while the IPv6 code path would only check the second index and not
  the first one

- find_first_zero_bit() needs to operate on the full bitmap size
  (priv->num_cfp_rules), otherwise it would be off by one in the results
  it returns, and the checks against bcm_sf2_cfp_rule_size() would be
  non-functional

Fixes: 3306145866b6 ("net: dsa: bcm_sf2: Move IPv4 CFP processing to specific functions")
Fixes: ba0696c22e7c ("net: dsa: bcm_sf2: Add support for IPv6 CFP rules")
Signed-off-by: Florian Fainelli 
---
 drivers/net/dsa/bcm_sf2_cfp.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/net/dsa/bcm_sf2_cfp.c b/drivers/net/dsa/bcm_sf2_cfp.c
index 23b45da784cb..ade5fa3d747d 100644
--- a/drivers/net/dsa/bcm_sf2_cfp.c
+++ b/drivers/net/dsa/bcm_sf2_cfp.c
@@ -354,10 +354,13 @@ static int bcm_sf2_cfp_ipv4_rule_set(struct bcm_sf2_priv 
*priv, int port,
/* Locate the first rule available */
if (fs->location == RX_CLS_LOC_ANY)
rule_index = find_first_zero_bit(priv->cfp.used,
-bcm_sf2_cfp_rule_size(priv));
+priv->num_cfp_rules);
else
rule_index = fs->location;
 
+   if (rule_index > bcm_sf2_cfp_rule_size(priv))
+   return -ENOSPC;
+
layout = _tcpip4_layout;
/* We only use one UDF slice for now */
slice_num = bcm_sf2_get_slice_number(layout, 0);
@@ -563,9 +566,11 @@ static int bcm_sf2_cfp_ipv6_rule_set(struct bcm_sf2_priv 
*priv, int port,
 */
if (fs->location == RX_CLS_LOC_ANY)
rule_index[0] = find_first_zero_bit(priv->cfp.used,
-   
bcm_sf2_cfp_rule_size(priv));
+   priv->num_cfp_rules);
else
rule_index[0] = fs->location;
+   if (rule_index[0] > bcm_sf2_cfp_rule_size(priv))
+   return -ENOSPC;
 
/* Flag it as used (cleared on error path) such that we can immediately
 * obtain a second one to chain from.
@@ -573,7 +578,7 @@ static int bcm_sf2_cfp_ipv6_rule_set(struct bcm_sf2_priv 
*priv, int port,
set_bit(rule_index[0], priv->cfp.used);
 
rule_index[1] = find_first_zero_bit(priv->cfp.used,
-   bcm_sf2_cfp_rule_size(priv));
+   priv->num_cfp_rules);
if (rule_index[1] > bcm_sf2_cfp_rule_size(priv)) {
ret = -ENOSPC;
goto out_err;
-- 
2.14.1



[PATCH net-next 8/8] sctp: rework switch cases in sctp_outq_flush_data

2018-05-11 Thread Marcelo Ricardo Leitner
Remove the inner one, which tended to be error-prone due to the cascading,
and replace it with a simple if ().

Rework the outer one so that the actual flush code is not inside it. Now
we first validate whether we can send data, return if not, and only then
run the flush code.

Suggested-by: Xin Long 
Signed-off-by: Marcelo Ricardo Leitner 
---
 net/sctp/outqueue.c | 191 +---
 1 file changed, 93 insertions(+), 98 deletions(-)

diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
index 
388e0665057be6ca7864b8bfdc0925e95e8b2858..c7f65bcd7bd6ee6996080d091bda1651f7bb8c44
 100644
--- a/net/sctp/outqueue.c
+++ b/net/sctp/outqueue.c
@@ -1058,122 +1058,117 @@ static void sctp_outq_flush_data(struct sctp_outq *q,
 * chunk.
 */
if (!packet || !packet->has_cookie_echo)
-   break;
+   return;
 
/* fallthru */
case SCTP_STATE_ESTABLISHED:
case SCTP_STATE_SHUTDOWN_PENDING:
case SCTP_STATE_SHUTDOWN_RECEIVED:
-   /*
-* RFC 2960 6.1  Transmission of DATA Chunks
-*
-* C) When the time comes for the sender to transmit,
-* before sending new DATA chunks, the sender MUST
-* first transmit any outstanding DATA chunks which
-* are marked for retransmission (limited by the
-* current cwnd).
-*/
-   if (!list_empty(>retransmit)) {
-   if (!sctp_outq_flush_rtx(q, _transport, transport_list,
-rtx_timeout, gfp))
-   break;
-   /* We may have switched current transport */
-   transport = *_transport;
-   packet = >packet;
-   }
+   break;
 
-   /* Apply Max.Burst limitation to the current transport in
-* case it will be used for new data.  We are going to
-* rest it before we return, but we want to apply the limit
-* to the currently queued data.
-*/
-   if (transport)
-   sctp_transport_burst_limited(transport);
-
-   /* Finally, transmit new packets.  */
-   while ((chunk = sctp_outq_dequeue_data(q)) != NULL) {
-   __u32 sid = ntohs(chunk->subh.data_hdr->stream);
-
-   /* Has this chunk expired? */
-   if (sctp_chunk_abandoned(chunk)) {
-   sctp_sched_dequeue_done(q, chunk);
-   sctp_chunk_fail(chunk, 0);
-   sctp_chunk_free(chunk);
-   continue;
-   }
+   default:
+   /* Do nothing. */
+   return;
+   }
 
-   if (asoc->stream.out[sid].state == SCTP_STREAM_CLOSED) {
-   sctp_outq_head_data(q, chunk);
-   break;
-   }
+   /*
+* RFC 2960 6.1  Transmission of DATA Chunks
+*
+* C) When the time comes for the sender to transmit,
+* before sending new DATA chunks, the sender MUST
+* first transmit any outstanding DATA chunks which
+* are marked for retransmission (limited by the
+* current cwnd).
+*/
+   if (!list_empty(>retransmit)) {
+   if (!sctp_outq_flush_rtx(q, _transport, transport_list,
+rtx_timeout, gfp))
+   return;
+   /* We may have switched current transport */
+   transport = *_transport;
+   packet = >packet;
+   }
 
-   if (sctp_outq_select_transport(chunk, asoc, ,
-  _list)) {
-   transport = *_transport;
-   packet = >packet;
-   }
+   /* Apply Max.Burst limitation to the current transport in
+* case it will be used for new data.  We are going to
+* rest it before we return, but we want to apply the limit
+* to the currently queued data.
+*/
+   if (transport)
+   sctp_transport_burst_limited(transport);
 
-   pr_debug("%s: outq:%p, chunk:%p[%s], tx-tsn:0x%x 
skb->head:%p "
-"skb->users:%d\n",
-__func__, q, chunk, chunk && chunk->chunk_hdr ?
-
sctp_cname(SCTP_ST_CHUNK(chunk->chunk_hdr->type)) :
-"illegal chunk", 
ntohl(chunk->subh.data_hdr->tsn),
-chunk->skb ? 

[PATCH net-next 5/8] sctp: move flushing of data chunks out of sctp_outq_flush

2018-05-11 Thread Marcelo Ricardo Leitner
To the new sctp_outq_flush_data. Again, smaller functions with
well-defined objectives.

Signed-off-by: Marcelo Ricardo Leitner 
---
 net/sctp/outqueue.c | 144 ++--
 1 file changed, 73 insertions(+), 71 deletions(-)

diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
index 
74c3961eec4fca8b4ce9bb380f8465fae4625763..e445a59db26004553984088d50e458a93b03dcb8
 100644
--- a/net/sctp/outqueue.c
+++ b/net/sctp/outqueue.c
@@ -1038,45 +1038,17 @@ static bool sctp_outq_flush_rtx(struct sctp_outq *q,
 
return true;
 }
-/*
- * Try to flush an outqueue.
- *
- * Description: Send everything in q which we legally can, subject to
- * congestion limitations.
- * * Note: This function can be called from multiple contexts so appropriate
- * locking concerns must be made.  Today we use the sock lock to protect
- * this function.
- */
-static void sctp_outq_flush(struct sctp_outq *q, int rtx_timeout, gfp_t gfp)
+
+static void sctp_outq_flush_data(struct sctp_outq *q,
+struct sctp_transport **_transport,
+struct list_head *transport_list,
+int rtx_timeout, gfp_t gfp)
 {
-   struct sctp_packet *packet;
+   struct sctp_transport *transport = *_transport;
+   struct sctp_packet *packet = transport ? >packet : NULL;
struct sctp_association *asoc = q->asoc;
-   struct sctp_transport *transport = NULL;
struct sctp_chunk *chunk;
enum sctp_xmit status;
-   int error = 0;
-
-   /* These transports have chunks to send. */
-   struct list_head transport_list;
-   struct list_head *ltransport;
-
-   INIT_LIST_HEAD(_list);
-   packet = NULL;
-
-   /*
-* 6.10 Bundling
-*   ...
-*   When bundling control chunks with DATA chunks, an
-*   endpoint MUST place control chunks first in the outbound
-*   SCTP packet.  The transmitter MUST transmit DATA chunks
-*   within a SCTP packet in increasing order of TSN.
-*   ...
-*/
-
-   sctp_outq_flush_ctrl(q, , _list, gfp);
-
-   if (q->asoc->src_out_of_asoc_ok)
-   goto sctp_flush_out;
 
/* Is it OK to send data chunks?  */
switch (asoc->state) {
@@ -1105,6 +1077,7 @@ static void sctp_outq_flush(struct sctp_outq *q, int 
rtx_timeout, gfp_t gfp)
 rtx_timeout))
break;
/* We may have switched current transport */
+   transport = *_transport;
packet = >packet;
}
 
@@ -1130,12 +1103,14 @@ static void sctp_outq_flush(struct sctp_outq *q, int 
rtx_timeout, gfp_t gfp)
 
if (asoc->stream.out[sid].state == SCTP_STREAM_CLOSED) {
sctp_outq_head_data(q, chunk);
-   goto sctp_flush_out;
+   break;
}
 
if (sctp_outq_select_transport(chunk, asoc, ,
-  _list))
+  _list)) {
+   transport = *_transport;
packet = >packet;
+   }
 
pr_debug("%s: outq:%p, chunk:%p[%s], tx-tsn:0x%x 
skb->head:%p "
 "skb->users:%d\n",
@@ -1147,8 +1122,10 @@ static void sctp_outq_flush(struct sctp_outq *q, int 
rtx_timeout, gfp_t gfp)
 
/* Add the chunk to the packet.  */
status = sctp_packet_transmit_chunk(packet, chunk, 0, 
gfp);
-
switch (status) {
+   case SCTP_XMIT_OK:
+   break;
+
case SCTP_XMIT_PMTU_FULL:
case SCTP_XMIT_RWND_FULL:
case SCTP_XMIT_DELAY:
@@ -1160,41 +1137,25 @@ static void sctp_outq_flush(struct sctp_outq *q, int 
rtx_timeout, gfp_t gfp)
 status);
 
sctp_outq_head_data(q, chunk);
-   goto sctp_flush_out;
-
-   case SCTP_XMIT_OK:
-   /* The sender is in the SHUTDOWN-PENDING state,
-* The sender MAY set the I-bit in the DATA
-* chunk header.
-*/
-   if (asoc->state == SCTP_STATE_SHUTDOWN_PENDING)
-   chunk->chunk_hdr->flags |= 
SCTP_DATA_SACK_IMM;
-   if (chunk->chunk_hdr->flags & 
SCTP_DATA_UNORDERED)
-   asoc->stats.ouodchunks++;
-   else
-  

[PATCH net-next 2/8] sctp: factor out sctp_outq_select_transport

2018-05-11 Thread Marcelo Ricardo Leitner
We had two spots doing this complex operation, and they were very close to
each other, each slightly tailored to its context.

This patch unifies these under the same function,
sctp_outq_select_transport, which knows how to handle control chunks and
original transmissions (but not retransmissions).

Signed-off-by: Marcelo Ricardo Leitner 
---
 net/sctp/outqueue.c | 187 +---
 1 file changed, 90 insertions(+), 97 deletions(-)

diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
index 
300bd0dfc7c14c9df579dbe2f9e78dd8356ae1a3..bda50596d4bfebeac03966c5a161473df1c1986a
 100644
--- a/net/sctp/outqueue.c
+++ b/net/sctp/outqueue.c
@@ -791,6 +791,90 @@ static int sctp_packet_singleton(struct sctp_transport 
*transport,
return sctp_packet_transmit(, gfp);
 }
 
+static bool sctp_outq_select_transport(struct sctp_chunk *chunk,
+  struct sctp_association *asoc,
+  struct sctp_transport **transport,
+  struct list_head *transport_list)
+{
+   struct sctp_transport *new_transport = chunk->transport;
+   struct sctp_transport *curr = *transport;
+   bool changed = false;
+
+   if (!new_transport) {
+   if (!sctp_chunk_is_data(chunk)) {
+   /*
+* If we have a prior transport pointer, see if
+* the destination address of the chunk
+* matches the destination address of the
+* current transport.  If not a match, then
+* try to look up the transport with a given
+* destination address.  We do this because
+* after processing ASCONFs, we may have new
+* transports created.
+*/
+   if (curr && sctp_cmp_addr_exact(>dest,
+   >ipaddr))
+   new_transport = curr;
+   else
+   new_transport = sctp_assoc_lookup_paddr(asoc,
+ >dest);
+   }
+
+   /* if we still don't have a new transport, then
+* use the current active path.
+*/
+   if (!new_transport)
+   new_transport = asoc->peer.active_path;
+   } else {
+   __u8 type;
+
+   switch (new_transport->state) {
+   case SCTP_INACTIVE:
+   case SCTP_UNCONFIRMED:
+   case SCTP_PF:
+   /* If the chunk is Heartbeat or Heartbeat Ack,
+* send it to chunk->transport, even if it's
+* inactive.
+*
+* 3.3.6 Heartbeat Acknowledgement:
+* ...
+* A HEARTBEAT ACK is always sent to the source IP
+* address of the IP datagram containing the
+* HEARTBEAT chunk to which this ack is responding.
+* ...
+*
+* ASCONF_ACKs also must be sent to the source.
+*/
+   type = chunk->chunk_hdr->type;
+   if (type != SCTP_CID_HEARTBEAT &&
+   type != SCTP_CID_HEARTBEAT_ACK &&
+   type != SCTP_CID_ASCONF_ACK)
+   new_transport = asoc->peer.active_path;
+   break;
+   default:
+   break;
+   }
+   }
+
+   /* Are we switching transports? Take care of transport locks. */
+   if (new_transport != curr) {
+   changed = true;
+   curr = new_transport;
+   *transport = curr;
+   if (list_empty(>send_ready))
+   list_add_tail(>send_ready, transport_list);
+
+   sctp_packet_config(>packet, asoc->peer.i.init_tag,
+  asoc->peer.ecn_capable);
+   /* We've switched transports, so apply the
+* Burst limit to the new transport.
+*/
+   sctp_transport_burst_limited(curr);
+   }
+
+   return changed;
+}
+
 /*
  * Try to flush an outqueue.
  *
@@ -806,7 +890,6 @@ static void sctp_outq_flush(struct sctp_outq *q, int 
rtx_timeout, gfp_t gfp)
struct sctp_association *asoc = q->asoc;
__u32 vtag = asoc->peer.i.init_tag;
struct sctp_transport *transport = NULL;
-   struct sctp_transport *new_transport;
struct sctp_chunk *chunk, *tmp;
enum sctp_xmit status;
int error = 0;
@@ -843,68 +926,12 @@ static void 

[PATCH net-next 4/8] sctp: move outq data rtx code out of sctp_outq_flush

2018-05-11 Thread Marcelo Ricardo Leitner
This patch renames the current sctp_outq_flush_rtx to __sctp_outq_flush_rtx
and creates a new sctp_outq_flush_rtx with the code that was in
sctp_outq_flush. Again, the idea is to have small functions with
well-defined objectives.

Yes, there is an open-coded path selection in what is now
sctp_outq_flush_rtx. That is kept as-is for now because it may change
significantly when we implement retransmission path selection algorithms
for CMT-SCTP.

Signed-off-by: Marcelo Ricardo Leitner 
---
 net/sctp/outqueue.c | 101 ++--
 1 file changed, 58 insertions(+), 43 deletions(-)

diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
index 
1081e1eea703be5d65d9828c3e4265fbb7a155f9..74c3961eec4fca8b4ce9bb380f8465fae4625763
 100644
--- a/net/sctp/outqueue.c
+++ b/net/sctp/outqueue.c
@@ -601,14 +601,14 @@ void sctp_retransmit(struct sctp_outq *q, struct 
sctp_transport *transport,
 
 /*
  * Transmit DATA chunks on the retransmit queue.  Upon return from
- * sctp_outq_flush_rtx() the packet 'pkt' may contain chunks which
+ * __sctp_outq_flush_rtx() the packet 'pkt' may contain chunks which
  * need to be transmitted by the caller.
  * We assume that pkt->transport has already been set.
  *
  * The return value is a normal kernel error return value.
  */
-static int sctp_outq_flush_rtx(struct sctp_outq *q, struct sctp_packet *pkt,
-  int rtx_timeout, int *start_timer)
+static int __sctp_outq_flush_rtx(struct sctp_outq *q, struct sctp_packet *pkt,
+int rtx_timeout, int *start_timer)
 {
struct sctp_transport *transport = pkt->transport;
struct sctp_chunk *chunk, *chunk1;
@@ -987,6 +987,57 @@ static void sctp_outq_flush_ctrl(struct sctp_outq *q,
}
 }
 
+/* Returns false if new data shouldn't be sent */
+static bool sctp_outq_flush_rtx(struct sctp_outq *q,
+   struct sctp_transport **_transport,
+   struct list_head *transport_list,
+   int rtx_timeout)
+{
+   struct sctp_transport *transport = *_transport;
+   struct sctp_packet *packet = transport ? >packet : NULL;
+   struct sctp_association *asoc = q->asoc;
+   int error, start_timer = 0;
+
+   if (asoc->peer.retran_path->state == SCTP_UNCONFIRMED)
+   return false;
+
+   if (transport != asoc->peer.retran_path) {
+   /* Switch transports & prepare the packet.  */
+   transport = asoc->peer.retran_path;
+   *_transport = transport;
+
+   if (list_empty(>send_ready))
+   list_add_tail(>send_ready,
+ transport_list);
+
+   packet = >packet;
+   sctp_packet_config(packet, asoc->peer.i.init_tag,
+  asoc->peer.ecn_capable);
+   }
+
+   error = __sctp_outq_flush_rtx(q, packet, rtx_timeout, _timer);
+   if (error < 0)
+   asoc->base.sk->sk_err = -error;
+
+   if (start_timer) {
+   sctp_transport_reset_t3_rtx(transport);
+   transport->last_time_sent = jiffies;
+   }
+
+   /* This can happen on COOKIE-ECHO resend.  Only
+* one chunk can get bundled with a COOKIE-ECHO.
+*/
+   if (packet->has_cookie_echo)
+   return false;
+
+   /* Don't send new data if there is still data
+* waiting to retransmit.
+*/
+   if (!list_empty(>retransmit))
+   return false;
+
+   return true;
+}
 /*
  * Try to flush an outqueue.
  *
@@ -1000,12 +1051,10 @@ static void sctp_outq_flush(struct sctp_outq *q, int 
rtx_timeout, gfp_t gfp)
 {
struct sctp_packet *packet;
struct sctp_association *asoc = q->asoc;
-   __u32 vtag = asoc->peer.i.init_tag;
struct sctp_transport *transport = NULL;
struct sctp_chunk *chunk;
enum sctp_xmit status;
int error = 0;
-   int start_timer = 0;
 
/* These transports have chunks to send. */
struct list_head transport_list;
@@ -1052,45 +1101,11 @@ static void sctp_outq_flush(struct sctp_outq *q, int 
rtx_timeout, gfp_t gfp)
 * current cwnd).
 */
if (!list_empty(>retransmit)) {
-   if (asoc->peer.retran_path->state == SCTP_UNCONFIRMED)
-   goto sctp_flush_out;
-   if (transport == asoc->peer.retran_path)
-   goto retran;
-
-   /* Switch transports & prepare the packet.  */
-
-   transport = asoc->peer.retran_path;
-
-   if (list_empty(>send_ready)) {
-   list_add_tail(>send_ready,
- _list);
-   }
-
+   if (!sctp_outq_flush_rtx(q, _transport, 

[PATCH net-next 3/8] sctp: move the flush of ctrl chunks into its own function

2018-05-11 Thread Marcelo Ricardo Leitner
Named sctp_outq_flush_ctrl and, with that, keep the context contained.

One small fix embedded is the reset of one_packet at every iteration.
This allows bundling of some control chunks in case they were preceded by
another control chunk that cannot be bundled.

Other than this, it has the same behavior.

Signed-off-by: Marcelo Ricardo Leitner 
---
 net/sctp/outqueue.c | 89 -
 1 file changed, 54 insertions(+), 35 deletions(-)

diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
index 
bda50596d4bfebeac03966c5a161473df1c1986a..1081e1eea703be5d65d9828c3e4265fbb7a155f9
 100644
--- a/net/sctp/outqueue.c
+++ b/net/sctp/outqueue.c
@@ -875,45 +875,21 @@ static bool sctp_outq_select_transport(struct sctp_chunk 
*chunk,
return changed;
 }

-/*
- * Try to flush an outqueue.
- *
- * Description: Send everything in q which we legally can, subject to
- * congestion limitations.
- * * Note: This function can be called from multiple contexts so appropriate
- * locking concerns must be made.  Today we use the sock lock to protect
- * this function.
- */
-static void sctp_outq_flush(struct sctp_outq *q, int rtx_timeout, gfp_t gfp)
+static void sctp_outq_flush_ctrl(struct sctp_outq *q,
+struct sctp_transport **_transport,
+struct list_head *transport_list,
+gfp_t gfp)
 {
-   struct sctp_packet *packet;
+   struct sctp_transport *transport = *_transport;
struct sctp_association *asoc = q->asoc;
-   __u32 vtag = asoc->peer.i.init_tag;
-   struct sctp_transport *transport = NULL;
+   struct sctp_packet *packet = NULL;
struct sctp_chunk *chunk, *tmp;
enum sctp_xmit status;
-   int error = 0;
-   int start_timer = 0;
-   int one_packet = 0;
-
-   /* These transports have chunks to send. */
-   struct list_head transport_list;
-   struct list_head *ltransport;
-
-   INIT_LIST_HEAD(_list);
-   packet = NULL;
-
-   /*
-* 6.10 Bundling
-*   ...
-*   When bundling control chunks with DATA chunks, an
-*   endpoint MUST place control chunks first in the outbound
-*   SCTP packet.  The transmitter MUST transmit DATA chunks
-*   within a SCTP packet in increasing order of TSN.
-*   ...
-*/
+   int one_packet, error;

list_for_each_entry_safe(chunk, tmp, >control_chunk_list, list) {
+   one_packet = 0;
+
/* RFC 5061, 5.3
 * F1) This means that until such time as the ASCONF
 * containing the add is acknowledged, the sender MUST
@@ -930,8 +906,10 @@ static void sctp_outq_flush(struct sctp_outq *q, int 
rtx_timeout, gfp_t gfp)
 * the first chunk as we don't have a transport by then.
 */
if (sctp_outq_select_transport(chunk, asoc, ,
-  _list))
+  _list)) {
+   transport = *_transport;
packet = >packet;
+   }

switch (chunk->chunk_hdr->type) {
/*
@@ -954,6 +932,7 @@ static void sctp_outq_flush(struct sctp_outq *q, int 
rtx_timeout, gfp_t gfp)
if (sctp_test_T_bit(chunk))
packet->vtag = asoc->c.my_vtag;
/* fallthru */
+
/* The following chunks are "response" chunks, i.e.
 * they are generated in response to something we
 * received.  If we are sending these, then we can
@@ -979,7 +958,7 @@ static void sctp_outq_flush(struct sctp_outq *q, int 
rtx_timeout, gfp_t gfp)
case SCTP_CID_RECONF:
status = sctp_packet_transmit_chunk(packet, chunk,
one_packet, gfp);
-   if (status  != SCTP_XMIT_OK) {
+   if (status != SCTP_XMIT_OK) {
/* put the chunk back */
list_add(>list, >control_chunk_list);
break;
@@ -1006,6 +985,46 @@ static void sctp_outq_flush(struct sctp_outq *q, int 
rtx_timeout, gfp_t gfp)
BUG();
}
}
+}
+
+/*
+ * Try to flush an outqueue.
+ *
+ * Description: Send everything in q which we legally can, subject to
+ * congestion limitations.
+ * * Note: This function can be called from multiple contexts so appropriate
+ * locking concerns must be made.  Today we use the sock lock to protect
+ * this function.
+ */
+static void sctp_outq_flush(struct sctp_outq *q, int rtx_timeout, gfp_t gfp)
+{
+   struct sctp_packet *packet;
+   struct sctp_association *asoc = q->asoc;
+   __u32 vtag = asoc->peer.i.init_tag;
+   

[PATCH net-next 6/8] sctp: move transport flush code out of sctp_outq_flush

2018-05-11 Thread Marcelo Ricardo Leitner
To the new sctp_outq_flush_transports.

Comment on Nagle is outdated and removed. Nagle is performed earlier, while
checking if the chunk fits the packet: if the outq length is not enough to
fill the packet, it returns SCTP_XMIT_DELAY.

So by the time it gets to sctp_outq_flush_transports, it has to go through
all enlisted transports.
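As a userspace illustration of the Nagle check described above — a small chunk that cannot fill the packet while data is still in flight gets delayed so more chunks can be bundled — here is a toy sketch. The names and the exact condition are made up for illustration; this is not the kernel's implementation:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the Nagle-style decision: delay a chunk that cannot
 * fill the packet (even with everything left in the outqueue) while
 * other data is still in flight. Illustrative only. */
enum xmit_verdict { XMIT_OK, XMIT_DELAY };

struct toy_packet { size_t size, max_size; };

static enum xmit_verdict toy_nagle_check(const struct toy_packet *pkt,
                                         size_t chunk_len,
                                         size_t outq_len,
                                         bool has_data_in_flight)
{
    size_t space = pkt->max_size - pkt->size;

    if (has_data_in_flight && chunk_len + outq_len < space)
        return XMIT_DELAY;  /* wait and bundle more chunks instead */
    return XMIT_OK;
}
```

Under this model the delay decision is made per chunk, before the packet is handed to the transport flush, which is why the transport loop no longer needs a Nagle comment.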

Signed-off-by: Marcelo Ricardo Leitner 
---
 net/sctp/outqueue.c | 56 +
 1 file changed, 26 insertions(+), 30 deletions(-)

diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
index e445a59db26004553984088d50e458a93b03dcb8..e867bde0b2d93f730f0cb89ad2f54a2094f47833 100644
--- a/net/sctp/outqueue.c
+++ b/net/sctp/outqueue.c
@@ -1176,6 +1176,29 @@ static void sctp_outq_flush_data(struct sctp_outq *q,
}
 }
 
+static void sctp_outq_flush_transports(struct sctp_outq *q,
+  struct list_head *transport_list,
+  gfp_t gfp)
+{
+   struct list_head *ltransport;
+   struct sctp_packet *packet;
+   struct sctp_transport *t;
+   int error = 0;
+
+   while ((ltransport = sctp_list_dequeue(transport_list)) != NULL) {
+   t = list_entry(ltransport, struct sctp_transport, send_ready);
+   packet = &t->packet;
+   if (!sctp_packet_empty(packet)) {
+   error = sctp_packet_transmit(packet, gfp);
+   if (error < 0)
+   q->asoc->base.sk->sk_err = -error;
+   }
+
+   /* Clear the burst limited state, if any */
+   sctp_transport_burst_reset(t);
+   }
+}
+
 /*
  * Try to flush an outqueue.
  *
@@ -1187,17 +1210,10 @@ static void sctp_outq_flush_data(struct sctp_outq *q,
  */
 static void sctp_outq_flush(struct sctp_outq *q, int rtx_timeout, gfp_t gfp)
 {
-   struct sctp_packet *packet;
-   struct sctp_association *asoc = q->asoc;
+   /* Current transport being used. It's NOT the same as curr active one */
struct sctp_transport *transport = NULL;
-   int error = 0;
-
/* These transports have chunks to send. */
-   struct list_head transport_list;
-   struct list_head *ltransport;
-
-   INIT_LIST_HEAD(&transport_list);
-   packet = NULL;
+   LIST_HEAD(transport_list);
 
/*
 * 6.10 Bundling
@@ -1218,27 +1234,7 @@ static void sctp_outq_flush(struct sctp_outq *q, int rtx_timeout, gfp_t gfp)
 
 sctp_flush_out:
 
-   /* Before returning, examine all the transports touched in
-* this call.  Right now, we bluntly force clear all the
-* transports.  Things might change after we implement Nagle.
-* But such an examination is still required.
-*
-* --xguo
-*/
-   while ((ltransport = sctp_list_dequeue(&transport_list)) != NULL) {
-   struct sctp_transport *t = list_entry(ltransport,
- struct sctp_transport,
- send_ready);
-   packet = &t->packet;
-   if (!sctp_packet_empty(packet)) {
-   error = sctp_packet_transmit(packet, gfp);
-   if (error < 0)
-   asoc->base.sk->sk_err = -error;
-   }
-
-   /* Clear the burst limited state, if any */
-   sctp_transport_burst_reset(t);
-   }
+   sctp_outq_flush_transports(q, &transport_list, gfp);
 }
 
 /* Update unack_data based on the incoming SACK chunk */
-- 
2.14.3



[PATCH net-next 7/8] sctp: make use of gfp on retransmissions

2018-05-11 Thread Marcelo Ricardo Leitner
Retransmissions may be triggered when in user context, so let's make use
of gfp.
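The point of the patch — thread the caller's allocation-context flag through the helpers instead of hardcoding the atomic variant — can be sketched in a toy userspace model. The names here are illustrative, not the kernel API:

```c
#include <assert.h>

/* Toy model of gfp propagation: the leaf allocator should see the
 * flag the outermost caller chose, not a hardcoded ATOMIC. */
enum toy_gfp { TOY_GFP_KERNEL, TOY_GFP_ATOMIC };

static enum toy_gfp last_alloc_flag;

static void toy_alloc(enum toy_gfp gfp) { last_alloc_flag = gfp; }

/* Before: the retransmit path always allocated atomically. */
static void retransmit_hardcoded(void) { toy_alloc(TOY_GFP_ATOMIC); }

/* After: the caller's context decides, so user-context callers may
 * pass the cheaper, can-sleep TOY_GFP_KERNEL variant. */
static void retransmit_gfp(enum toy_gfp gfp) { toy_alloc(gfp); }
```

Interrupt-context callers keep passing the atomic flag, so behavior only changes for callers that know they may sleep.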

Signed-off-by: Marcelo Ricardo Leitner 
---
 net/sctp/outqueue.c | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
index e867bde0b2d93f730f0cb89ad2f54a2094f47833..388e0665057be6ca7864b8bfdc0925e95e8b2858 100644
--- a/net/sctp/outqueue.c
+++ b/net/sctp/outqueue.c
@@ -608,7 +608,7 @@ void sctp_retransmit(struct sctp_outq *q, struct sctp_transport *transport,
  * The return value is a normal kernel error return value.
  */
 static int __sctp_outq_flush_rtx(struct sctp_outq *q, struct sctp_packet *pkt,
-int rtx_timeout, int *start_timer)
+int rtx_timeout, int *start_timer, gfp_t gfp)
 {
struct sctp_transport *transport = pkt->transport;
struct sctp_chunk *chunk, *chunk1;
@@ -684,12 +684,12 @@ static int __sctp_outq_flush_rtx(struct sctp_outq *q, struct sctp_packet *pkt,
 * control chunks are already freed so there
 * is nothing we can do.
 */
-   sctp_packet_transmit(pkt, GFP_ATOMIC);
+   sctp_packet_transmit(pkt, gfp);
goto redo;
}
 
/* Send this packet.  */
-   error = sctp_packet_transmit(pkt, GFP_ATOMIC);
+   error = sctp_packet_transmit(pkt, gfp);
 
/* If we are retransmitting, we should only
 * send a single packet.
@@ -705,7 +705,7 @@ static int __sctp_outq_flush_rtx(struct sctp_outq *q, struct sctp_packet *pkt,
 
case SCTP_XMIT_RWND_FULL:
/* Send this packet. */
-   error = sctp_packet_transmit(pkt, GFP_ATOMIC);
+   error = sctp_packet_transmit(pkt, gfp);
 
/* Stop sending DATA as there is no more room
 * at the receiver.
@@ -715,7 +715,7 @@ static int __sctp_outq_flush_rtx(struct sctp_outq *q, struct sctp_packet *pkt,
 
case SCTP_XMIT_DELAY:
/* Send this packet. */
-   error = sctp_packet_transmit(pkt, GFP_ATOMIC);
+   error = sctp_packet_transmit(pkt, gfp);
 
/* Stop sending DATA because of nagle delay. */
done = 1;
@@ -991,7 +991,7 @@ static void sctp_outq_flush_ctrl(struct sctp_outq *q,
 static bool sctp_outq_flush_rtx(struct sctp_outq *q,
struct sctp_transport **_transport,
struct list_head *transport_list,
-   int rtx_timeout)
+   int rtx_timeout, gfp_t gfp)
 {
struct sctp_transport *transport = *_transport;
struct sctp_packet *packet = transport ? &transport->packet : NULL;
@@ -1015,7 +1015,8 @@ static bool sctp_outq_flush_rtx(struct sctp_outq *q,
   asoc->peer.ecn_capable);
}
 
-   error = __sctp_outq_flush_rtx(q, packet, rtx_timeout, &start_timer);
+   error = __sctp_outq_flush_rtx(q, packet, rtx_timeout, &start_timer,
+ gfp);
if (error < 0)
asoc->base.sk->sk_err = -error;
 
@@ -1074,7 +1075,7 @@ static void sctp_outq_flush_data(struct sctp_outq *q,
 */
if (!list_empty(&q->retransmit)) {
if (!sctp_outq_flush_rtx(q, _transport, transport_list,
-rtx_timeout))
+rtx_timeout, gfp))
break;
/* We may have switched current transport */
transport = *_transport;
-- 
2.14.3



[PATCH net-next 1/8] sctp: add sctp_packet_singleton

2018-05-11 Thread Marcelo Ricardo Leitner
Factor out the code for generating singletons. It's used only once, but
helps to keep the context contained.

The const variables are to ease the reading of subsequent calls in there.

Signed-off-by: Marcelo Ricardo Leitner 
---
 net/sctp/outqueue.c | 22 +++---
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
index dee7cbd5483149024f2f3195db2fe4d473b1a00a..300bd0dfc7c14c9df579dbe2f9e78dd8356ae1a3 100644
--- a/net/sctp/outqueue.c
+++ b/net/sctp/outqueue.c
@@ -776,6 +776,20 @@ void sctp_outq_uncork(struct sctp_outq *q, gfp_t gfp)
sctp_outq_flush(q, 0, gfp);
 }
 
+static int sctp_packet_singleton(struct sctp_transport *transport,
+struct sctp_chunk *chunk, gfp_t gfp)
+{
+   const struct sctp_association *asoc = transport->asoc;
+   const __u16 sport = asoc->base.bind_addr.port;
+   const __u16 dport = asoc->peer.port;
+   const __u32 vtag = asoc->peer.i.init_tag;
+   struct sctp_packet singleton;
+
+   sctp_packet_init(&singleton, transport, sport, dport);
+   sctp_packet_config(&singleton, vtag, 0);
+   sctp_packet_append_chunk(&singleton, chunk);
+   return sctp_packet_transmit(&singleton, gfp);
+}
 
 /*
  * Try to flush an outqueue.
@@ -789,10 +803,7 @@ void sctp_outq_uncork(struct sctp_outq *q, gfp_t gfp)
 static void sctp_outq_flush(struct sctp_outq *q, int rtx_timeout, gfp_t gfp)
 {
struct sctp_packet *packet;
-   struct sctp_packet singleton;
struct sctp_association *asoc = q->asoc;
-   __u16 sport = asoc->base.bind_addr.port;
-   __u16 dport = asoc->peer.port;
__u32 vtag = asoc->peer.i.init_tag;
struct sctp_transport *transport = NULL;
struct sctp_transport *new_transport;
@@ -905,10 +916,7 @@ static void sctp_outq_flush(struct sctp_outq *q, int rtx_timeout, gfp_t gfp)
case SCTP_CID_INIT:
case SCTP_CID_INIT_ACK:
case SCTP_CID_SHUTDOWN_COMPLETE:
-   sctp_packet_init(&singleton, transport, sport, dport);
-   sctp_packet_config(&singleton, vtag, 0);
-   sctp_packet_append_chunk(&singleton, chunk);
-   error = sctp_packet_transmit(&singleton, gfp);
+   error = sctp_packet_singleton(transport, chunk, gfp);
if (error < 0) {
asoc->base.sk->sk_err = -error;
return;
-- 
2.14.3



[PATCH net-next 0/8] sctp: refactor sctp_outq_flush

2018-05-11 Thread Marcelo Ricardo Leitner
Currently sctp_outq_flush does many different and arguably unrelated
things, such as transport selection and outq dequeueing.

This patchset refactors it into smaller and more dedicated functions.
The end behavior should be the same.

The next patchset will rework the function parameters.

Marcelo Ricardo Leitner (8):
  sctp: add sctp_packet_singleton
  sctp: factor out sctp_outq_select_transport
  sctp: move the flush of ctrl chunks into its own function
  sctp: move outq data rtx code out of sctp_outq_flush
  sctp: move flushing of data chunks out of sctp_outq_flush
  sctp: move transport flush code out of sctp_outq_flush
  sctp: make use of gfp on retransmissions
  sctp: rework switch cases in sctp_outq_flush_data

 net/sctp/outqueue.c | 593 +++-
 1 file changed, 311 insertions(+), 282 deletions(-)

--
2.14.3


Re: [PATCH net-next] udp: Fix kernel panic in UDP GSO path

2018-05-11 Thread Willem de Bruijn
On Thu, May 10, 2018 at 8:51 PM, Eric Dumazet  wrote:
>
>
> On 05/10/2018 05:38 PM, Sean Tranchetti wrote:
>> Using GSO in the UDP path on a device with
>> scatter-gather netdevice feature disabled will result in a kernel
>> panic with the following call stack:
>>
>> This panic is the result of allocating SKBs with small size
>> for the newly segmented SKB. If the scatter-gather feature is
>> disabled, the code attempts to call skb_put() on the small SKB
>> with an argument of nearly the entire unsegmented SKB length.
>>
>> After this patch, attempting to use GSO with scatter-gather
>> disabled will result in -EINVAL being returned.
>>
>> Fixes: 15e36f5b8e98 ("udp: paged allocation with gso")
>> Signed-off-by: Sean Tranchetti 
>> Signed-off-by: Subash Abhinov Kasiviswanathan 
>> ---
>>  net/ipv4/ip_output.c | 8 
>>  1 file changed, 8 insertions(+)
>>
>> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
>> index b5e21eb..0d63690 100644
>> --- a/net/ipv4/ip_output.c
>> +++ b/net/ipv4/ip_output.c
>> @@ -1054,8 +1054,16 @@ static int __ip_append_data(struct sock *sk,
>>   copy = length;
>>
>>   if (!(rt->dst.dev->features & NETIF_F_SG)) {
>> + struct sk_buff *tmp;
>>   unsigned int off;
>>
>> + if (paged) {
>> + err = -EINVAL;
>> + while ((tmp = __skb_dequeue(queue)) != NULL)
>> + kfree(tmp);
>> + goto error;
>> + }
>> +
>>   off = skb->len;
>>   if (getfrag(from, skb_put(skb, copy),
>>   offset, copy, off, skb) < 0) {
>>
>
>
> Hmm, no, we absolutely need to fix GSO instead.
>
> Think of a bonding device (or any virtual devices), your patch wont avoid the 
> crash.

Thanks for reporting the issue.

Paged skbuffs are an optimization for GSO, but the feature should
indeed continue to work even if GSO skbs are linear (albeit at the
cost of copying during skb_segment).

We need to make paged contingent on scatter-gather. Rough
patch below. That is for ipv4 only, the same will be needed for ipv6.

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index b5e21eb198d8..b38731d8a44f 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -884,7 +884,7 @@ static int __ip_append_data(struct sock *sk,

exthdrlen = !skb ? rt->dst.header_len : 0;
mtu = cork->gso_size ? IP_MAX_MTU : cork->fragsize;
-   paged = !!cork->gso_size;
+   paged = cork->gso_size && (rt->dst.dev->features & NETIF_F_SG);


safe skb resetting after decapsulation and encapsulation

2018-05-11 Thread Jason A. Donenfeld
Hey Netdev,

A UDP skb comes in via the encap_rcv interface. I do a lot of wild
things to the bytes in the skb -- change where the head starts, modify
a few fragments, decrypt some stuff, trim off some things at the end,
etc. In other words, I'm decapsulating the skb in a pretty intense
way. I benefit from reusing the same skb, performance wise, but after
I'm done processing it, it's really a totally new skb. Eventually it's
time to pass off my skb to netif_receive_skb/netif_rx, but before I do
that, I need to "reinitialize" the skb. (The same goes for when
sending out an skb -- I get it from userspace via ndo_start_xmit, do
crazy things to it, and eventually pass it off to the udp_tunnel send
functions, but first "reinitializing" it.)

At the moment I'm using a function that looks like this:

static void jasons_wild_and_crazy_skb_reset(struct sk_buff *skb)
{
skb_scrub_packet(skb, true); //1
memset(&skb->headers_start, 0, offsetof(struct sk_buff,
headers_end) - offsetof(struct sk_buff, headers_start)); //2
skb->queue_mapping = 0; //3
skb->nohdr = 0; //4
skb->peeked = 0; //5
skb->mac_len = 0; //6
skb->dev = NULL; //7
#ifdef CONFIG_NET_SCHED
skb->tc_index = 0; //8
skb_reset_tc(skb); //9
#endif
skb->hdr_len = skb_headroom(skb); //10
skb_reset_mac_header(skb); //11
skb_reset_network_header(skb); //12
skb_probe_transport_header(skb, 0); //13
skb_reset_inner_headers(skb); //14
}

I'm sure that some of this is wrong. Most of it is based on part of an
Octeon ethernet driver I read a few years ago. I numbered each
statement above, hoping to go through it with you all in detail here,
and see what we can cut away and see what we can approve.

1. Obviously correct and required.
2. This is probably wrong. At least it causes crashes when receiving
packets from RHEL 7.5's latest i40e driver in their vendor
frankenkernel, because those flags there have some critical bits
related to allocation. But there are a lot of flags in there that I might
consider going through one by one and zeroing out.
3-5. Fields that should be zero, I assume, after
decapsulating/decrypting (and encapsulating/encrypting).
6. WireGuard is layer 3, so there's no mac.
7. We're later going to change the dev this came in on.
8-9: Same flakey rationale as 2,3-5.
10: Since the headroom has changed during the various modifications, I
need to let the packet field know about it.
11-14: The beginning of the headers has changed, and so resetting and
probing is necessary for this to work at all.

So I'm wondering - how much of this is necessary? How much am I
unnecessarily reinventing things that exist elsewhere? I'm pretty sure
in most cases the driver would work with only 1,10-14, but I worry
that bad things would happen in more unusual configurations. I've
tried to systematically go through the entire stack and see where
these might be used or not used, but it seems really inconsistent.

So, I'm writing to ask whether somebody has an easy simplification or
rule for handling this kind of intense decapsulation/decryption
operation (and, in the other direction, encapsulation/encryption). I'd
like to make sure I get this down solid.

Thanks,
Jason


Re: [PATCH v6 1/6] net: phy: at803x: Export at803x_debug_reg_mask()

2018-05-11 Thread Paul Burton
Hi Andrew,

On Fri, May 11, 2018 at 09:24:46PM +0200, Andrew Lunn wrote:
> > I could reorder the probe function a little to initialize the PHY before
> > performing the MAC reset, drop this patch and the AR803X hibernation
> > stuff from patch 2 if you like. But again, I can't actually test the
> > result on the affected hardware.
> 
> Hi Paul
> 
> I don't like a MAC driver poking around in PHY registers.
> 
> So if you can rearrange the code, that would be great.
> 
>Thanks
>   Andrew

Sure, I'll give it a shot.

After digging into it I see 2 ways to go here:

  1) We could just always reset the PHY before we reset the MAC. That
 would give us a window of however long the PHY takes to enter its
 low power state & stop providing the RX clock during which we'd
 need the MAC reset to complete. In the case of the AR8031 that's
 "about 10 seconds" according to its data sheet. In this particular
 case that feels like plenty, but it does also feel a bit icky to
 rely on the timing chosen by the PHY manufacturer to line up with
 that of the MAC reset.

  2) We could introduce a couple of new phy_* functions to disable &
 enable low power states like the AR8031's hibernation feature, by
 calling new function pointers in struct phy_driver. Then pch_gbe &
 other MACs could call those to have the PHY driver disable
 hibernation at times where we know we'll need the RX clock and
 re-enable it afterwards.

I'm currently leaning towards option 2. How does that sound to you? Or
can you see another way to handle this?

Thanks,
Paul
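A rough userspace sketch of option 2 above: an optional per-driver hook wrapped by a generic helper, so a MAC driver never pokes PHY registers directly and PHYs without a low-power feature transparently succeed. The names (toy_phy_driver, set_low_power) are hypothetical, not an existing kernel API:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-driver hook; NULL means "no low-power feature". */
struct toy_phy_driver {
    int (*set_low_power)(int enable);
};

static int hooked_calls;

/* A driver such as at803x would implement the hook; here it just
 * counts invocations where it would touch the hibernation register. */
static int toy_ar8031_set_low_power(int enable)
{
    (void)enable;
    hooked_calls++;
    return 0;
}

/* Generic wrapper a MAC driver would call: no-op success when the
 * PHY driver provides no hook. */
static int toy_phy_set_low_power(const struct toy_phy_driver *drv, int enable)
{
    if (drv && drv->set_low_power)
        return drv->set_low_power(enable);
    return 0;
}
```

A MAC like pch_gbe would call the wrapper to disable low power around its reset and re-enable it afterwards, without knowing which PHY is attached.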


Re: [PATCH ghak81 RFC V1 1/5] audit: normalize loginuid read access

2018-05-11 Thread Richard Guy Briggs
On 2018-05-10 17:21, Richard Guy Briggs wrote:
> On 2018-05-09 11:13, Paul Moore wrote:
> > On Fri, May 4, 2018 at 4:54 PM, Richard Guy Briggs  wrote:
> > > Recognizing that the loginuid is an internal audit value, use an access
> > > function to retrieve the audit loginuid value for the task rather than
> > > reaching directly into the task struct to get it.
> > >
> > > Signed-off-by: Richard Guy Briggs 
> > > ---
> > >  kernel/auditsc.c | 16 
> > >  1 file changed, 8 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> > > index 479c031..f3817d0 100644
> > > --- a/kernel/auditsc.c
> > > +++ b/kernel/auditsc.c
> > > @@ -374,7 +374,7 @@ static int audit_field_compare(struct task_struct 
> > > *tsk,
> > > case AUDIT_COMPARE_EGID_TO_OBJ_GID:
> > > return audit_compare_gid(cred->egid, name, f, ctx);
> > > case AUDIT_COMPARE_AUID_TO_OBJ_UID:
> > > -   return audit_compare_uid(tsk->loginuid, name, f, ctx);
> > > +   return audit_compare_uid(audit_get_loginuid(tsk), name, 
> > > f, ctx);
> > > case AUDIT_COMPARE_SUID_TO_OBJ_UID:
> > > return audit_compare_uid(cred->suid, name, f, ctx);
> > > case AUDIT_COMPARE_SGID_TO_OBJ_GID:
> > > @@ -385,7 +385,7 @@ static int audit_field_compare(struct task_struct 
> > > *tsk,
> > > return audit_compare_gid(cred->fsgid, name, f, ctx);
> > > /* uid comparisons */
> > > case AUDIT_COMPARE_UID_TO_AUID:
> > > -   return audit_uid_comparator(cred->uid, f->op, 
> > > tsk->loginuid);
> > > +   return audit_uid_comparator(cred->uid, f->op, 
> > > audit_get_loginuid(tsk));
> > > case AUDIT_COMPARE_UID_TO_EUID:
> > > return audit_uid_comparator(cred->uid, f->op, cred->euid);
> > > case AUDIT_COMPARE_UID_TO_SUID:
> > > @@ -394,11 +394,11 @@ static int audit_field_compare(struct task_struct 
> > > *tsk,
> > > return audit_uid_comparator(cred->uid, f->op, 
> > > cred->fsuid);
> > > /* auid comparisons */
> > > case AUDIT_COMPARE_AUID_TO_EUID:
> > > -   return audit_uid_comparator(tsk->loginuid, f->op, 
> > > cred->euid);
> > > +   return audit_uid_comparator(audit_get_loginuid(tsk), 
> > > f->op, cred->euid);
> > > case AUDIT_COMPARE_AUID_TO_SUID:
> > > -   return audit_uid_comparator(tsk->loginuid, f->op, 
> > > cred->suid);
> > > +   return audit_uid_comparator(audit_get_loginuid(tsk), 
> > > f->op, cred->suid);
> > > case AUDIT_COMPARE_AUID_TO_FSUID:
> > > -   return audit_uid_comparator(tsk->loginuid, f->op, 
> > > cred->fsuid);
> > > +   return audit_uid_comparator(audit_get_loginuid(tsk), 
> > > f->op, cred->fsuid);
> > > /* euid comparisons */
> > > case AUDIT_COMPARE_EUID_TO_SUID:
> > > return audit_uid_comparator(cred->euid, f->op, 
> > > cred->suid);
> > > @@ -611,7 +611,7 @@ static int audit_filter_rules(struct task_struct *tsk,
> > > result = match_tree_refs(ctx, rule->tree);
> > > break;
> > > case AUDIT_LOGINUID:
> > > -   result = audit_uid_comparator(tsk->loginuid, 
> > > f->op, f->uid);
> > > +   result = 
> > > audit_uid_comparator(audit_get_loginuid(tsk), f->op, f->uid);
> > > break;
> > > case AUDIT_LOGINUID_SET:
> > > result = 
> > > audit_comparator(audit_loginuid_set(tsk), f->op, f->val);
> > > @@ -2287,8 +2287,8 @@ int audit_signal_info(int sig, struct task_struct 
> > > *t)
> > > (sig == SIGTERM || sig == SIGHUP ||
> > >  sig == SIGUSR1 || sig == SIGUSR2)) {
> > > audit_sig_pid = task_tgid_nr(tsk);
> > > -   if (uid_valid(tsk->loginuid))
> > > -   audit_sig_uid = tsk->loginuid;
> > > +   if (uid_valid(audit_get_loginuid(tsk)))
> > > +   audit_sig_uid = audit_get_loginuid(tsk);
> > 
> > I realize this comment is a little silly given the nature of loginuid,
> > but if we are going to abstract away loginuid accesses (which I think
> > is good), we should probably access it once, store it in a local
> > variable, perform the validity check on the local variable, then
> > commit the local variable to audit_sig_uid.  I realize a TOCTOU
> > problem is unlikely here, but with this new layer of abstraction it
> > seems that some additional safety might be a good thing.
> 
> Ok, I'll just assign it to where it is going and check it there, holding
> the audit_ctl_lock the whole time, since that should have been done
> for all of audit_sig_{pid,uid,sid} anyway to get a consistent
> view from the AUDIT_SIGNAL_INFO fetch.

Hmmm, holding audit_ctl_lock won't work because it could sleep 
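Independent of the locking question, the read-once pattern Paul suggests can be sketched generically: fetch the shared value into a local once, validate the local, and commit that same local, so the value checked and the value stored cannot diverge:

```c
#include <assert.h>

/* Generic sketch of the read-once pattern; these are plain globals
 * standing in for tsk->loginuid and audit_sig_uid, not the kernel's
 * actual fields or locking. */
static int shared_loginuid = 1000;
static int audit_sig_uid = -1;

static int toy_uid_valid(int uid) { return uid >= 0; }

static void record_signal_uid(void)
{
    int uid = shared_loginuid;   /* single read of the shared value */

    if (toy_uid_valid(uid))
        audit_sig_uid = uid;     /* commit exactly what was checked */
}
```

Reading `shared_loginuid` twice — once for the validity check and once for the assignment — is the TOCTOU window the abstraction is meant to close.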

Re: [PATCH net-next 4/4] bonding: allow carrier and link status to determine link state

2018-05-11 Thread Jay Vosburgh
Debabrata Banerjee  wrote:

>In a mixed environment it may be difficult to tell if your hardware
>supports carrier; if it does not, it can always report true. With a new
>use_carrier option of 2, we can check both carrier and link status
>sequentially, instead of one or the other.

What do you mean by "mixed environment," and under what
circumstances are you seeing an actual benefit from doing the MII /
ethtool test in addition to the standard netif_carrier_ok test?

The use_carrier option was meant for backwards compatibility
with old-in-2005 device drivers, so this seem counterintuitive to me.  I
don't recall seeing any devices lacking netif_carrier support for some
time.  At this point, I would tend to argue that a new device driver
that does not implement netif_carrier support should be fixed, and not
have another hack added to bonding to work around it.

-J

>Signed-off-by: Debabrata Banerjee 
>---
> Documentation/networking/bonding.txt |  4 ++--
> drivers/net/bonding/bond_main.c  | 12 
> drivers/net/bonding/bond_options.c   |  7 ---
> 3 files changed, 14 insertions(+), 9 deletions(-)
>
>diff --git a/Documentation/networking/bonding.txt 
>b/Documentation/networking/bonding.txt
>index 9ba04c0bab8d..f063730e7e73 100644
>--- a/Documentation/networking/bonding.txt
>+++ b/Documentation/networking/bonding.txt
>@@ -828,8 +828,8 @@ use_carrier
>   MII / ETHTOOL ioctl method to determine the link state.
> 
>   A value of 1 enables the use of netif_carrier_ok(), a value of
>-  0 will use the deprecated MII / ETHTOOL ioctls.  The default
>-  value is 1.
>+  0 will use the deprecated MII / ETHTOOL ioctls. A value of 2
>+  will check both.  The default value is 1.
> 
> xmit_hash_policy
> 
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index f7f8a49cb32b..7e9652c4b35c 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -132,7 +132,7 @@ MODULE_PARM_DESC(downdelay, "Delay before considering link 
>down, "
>   "in milliseconds");
> module_param(use_carrier, int, 0);
> MODULE_PARM_DESC(use_carrier, "Use netif_carrier_ok (vs MII ioctls) in 
> miimon; "
>-"0 for off, 1 for on (default)");
>+"0 for off, 1 for on (default), 2 for carrier 
>then legacy checks");
> module_param(mode, charp, 0);
> MODULE_PARM_DESC(mode, "Mode of operation; 0 for balance-rr, "
>  "1 for active-backup, 2 for balance-xor, "
>@@ -434,12 +434,16 @@ static int bond_check_dev_link(struct bonding *bond,
>   int (*ioctl)(struct net_device *, struct ifreq *, int);
>   struct ifreq ifr;
>   struct mii_ioctl_data *mii;
>+  bool carrier = true;
> 
>   if (!reporting && !netif_running(slave_dev))
>   return 0;
> 
>   if (bond->params.use_carrier)
>-  return netif_carrier_ok(slave_dev) ? BMSR_LSTATUS : 0;
>+  carrier = netif_carrier_ok(slave_dev) ? BMSR_LSTATUS : 0;
>+
>+  if (!carrier)
>+  return carrier;
> 
>   /* Try to get link status using Ethtool first. */
>   if (slave_dev->ethtool_ops->get_link)
>@@ -4399,8 +4403,8 @@ static int bond_check_params(struct bond_params *params)
>   downdelay = 0;
>   }
> 
>-  if ((use_carrier != 0) && (use_carrier != 1)) {
>-  pr_warn("Warning: use_carrier module parameter (%d), not of 
>valid value (0/1), so it was set to 1\n",
>+  if (use_carrier < 0 || use_carrier > 2) {
>+  pr_warn("Warning: use_carrier module parameter (%d), not of 
>valid value (0-2), so it was set to 1\n",
>   use_carrier);
>   use_carrier = 1;
>   }
>diff --git a/drivers/net/bonding/bond_options.c 
>b/drivers/net/bonding/bond_options.c
>index 8a945c9341d6..dba6cef05134 100644
>--- a/drivers/net/bonding/bond_options.c
>+++ b/drivers/net/bonding/bond_options.c
>@@ -164,9 +164,10 @@ static const struct bond_opt_value 
>bond_primary_reselect_tbl[] = {
> };
> 
> static const struct bond_opt_value bond_use_carrier_tbl[] = {
>-  { "off", 0,  0},
>-  { "on",  1,  BOND_VALFLAG_DEFAULT},
>-  { NULL,  -1, 0}
>+  { "off",  0,  0},
>+  { "on",   1,  BOND_VALFLAG_DEFAULT},
>+  { "both", 2,  0},
>+  { NULL,  -1,  0}
> };
> 
> static const struct bond_opt_value bond_all_slaves_active_tbl[] = {
>-- 
>2.17.0
>

---
-Jay Vosburgh, jay.vosbu...@canonical.com
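For reference, the proposed use_carrier=2 semantics boil down to a short decision function: consult the carrier state first and, only if it reports up, fall through to the MII/ethtool-style check. This toy model uses stubbed boolean inputs in place of netif_carrier_ok() and the real ethtool query:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of bond_check_dev_link()'s proposed mode selection:
 *   0 - legacy MII/ethtool check only
 *   1 - netif_carrier check only (default)
 *   2 - carrier first, then MII/ethtool ("both")
 * carrier_ok / ethtool_ok stand in for the real device queries. */
static bool toy_link_up(int use_carrier, bool carrier_ok, bool ethtool_ok)
{
    if (use_carrier == 1)
        return carrier_ok;       /* carrier only */
    if (use_carrier == 2 && !carrier_ok)
        return false;            /* carrier down: skip the MII check */
    return ethtool_ok;           /* 0: legacy only; 2: both agreed so far */
}
```

Mode 2 can only ever report link-up when both sources agree, which matches the "check both ... sequentially" description in the patch.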


Re: [PATCH net] macmace: Set platform device coherent_dma_mask

2018-05-11 Thread Michael Schmitz
Hi Finn,

Am 11.05.2018 um 22:06 schrieb Finn Thain:
>> You would have to be careful not to overwrite a pdev->dev.dma_mask and 
>> pdev->dev.dma_coherent_mask that might have been set in a platform 
>> device passed via platform_device_register here. Coldfire is the only 
>> m68k platform currently using that, but there might be others in future.
>>
> 
> That Coldfire patch could be reverted if this is a better solution.

True, but there might be other uses for deviating from a platform
default (I'm thinking of Atari SCSI and floppy drivers here). But we
could choose the correct mask to set in arch_setup_pdev_archdata()
instead, as it's a platform property not a driver property in that case.

>> ... But I don't think there are smaller DMA masks used by m68k drivers 
>> that use the platform device mechanism at present. I've only looked at 
>> arch/m68k though.
> 
> So we're back at the same problem that Geert's suggestion also raised: how 
> to identify potentially affected platform devices and drivers?
> 
> Maybe we can take a leaf out of Christoph's book, and leave a noisy 
> WARNING splat in the log.
> 
> void arch_setup_pdev_archdata(struct platform_device *pdev)
> {
> WARN_ON_ONCE(pdev->dev.coherent_dma_mask != DMA_MASK_NONE ||
>  pdev->dev.dma_mask != NULL);

I'd suggest using WARN_ON() so we catch all uses on a particular platform.

I initially thought it necessary to warn on unset mask here, but I see
that would throw up a lot of redundant false positives.

Cheers,

Michael


Re: [PATCH net-next 3/4] bonding: allow use of tx hashing in balance-alb

2018-05-11 Thread Jay Vosburgh
Debabrata Banerjee  wrote:

>The rx load balancing provided by balance-alb is not mutually
>exclusive with using hashing for tx selection, and should provide a decent
>speed increase because this eliminates spinlocks and cache contention.
>
>Signed-off-by: Debabrata Banerjee 
>---
> drivers/net/bonding/bond_alb.c | 20 ++--
> drivers/net/bonding/bond_main.c| 25 +++--
> drivers/net/bonding/bond_options.c |  2 +-
> include/net/bonding.h  | 10 +-
> 4 files changed, 43 insertions(+), 14 deletions(-)
>
>diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
>index 180e50f7806f..6228635880d5 100644
>--- a/drivers/net/bonding/bond_alb.c
>+++ b/drivers/net/bonding/bond_alb.c
>@@ -1478,8 +1478,24 @@ int bond_alb_xmit(struct sk_buff *skb, struct 
>net_device *bond_dev)
>   }
> 
>   if (do_tx_balance) {
>-  hash_index = _simple_hash(hash_start, hash_size);
>-  tx_slave = tlb_choose_channel(bond, hash_index, skb->len);
>+  if (bond->params.tlb_dynamic_lb) {
>+  hash_index = _simple_hash(hash_start, hash_size);
>+  tx_slave = tlb_choose_channel(bond, hash_index, 
>skb->len);
>+  } else {
>+  /*
>+   * do_tx_balance means we are free to select the 
>tx_slave
>+   * So we do exactly what tlb would do for hash selection
>+   */
>+
>+  struct bond_up_slave *slaves;
>+  unsigned int count;
>+
>+  slaves = rcu_dereference(bond->slave_arr);
>+  count = slaves ? READ_ONCE(slaves->count) : 0;
>+  if (likely(count))
>+  tx_slave = slaves->arr[bond_xmit_hash(bond, 
>skb) %
>+ count];
>+  }
>   }
> 
>   return bond_do_alb_xmit(skb, bond, tx_slave);
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index 1f1e97b26f95..f7f8a49cb32b 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -159,7 +159,7 @@ module_param(min_links, int, 0);
> MODULE_PARM_DESC(min_links, "Minimum number of available links before turning 
> on carrier");
> 
> module_param(xmit_hash_policy, charp, 0);
>-MODULE_PARM_DESC(xmit_hash_policy, "balance-xor and 802.3ad hashing method; "
>+MODULE_PARM_DESC(xmit_hash_policy, "balance-alb, balance-tlb, balance-xor, 
>802.3ad hashing method; "
>  "0 for layer 2 (default), 1 for layer 3+4, "
>  "2 for layer 2+3, 3 for encap layer 2+3, "
>  "4 for encap layer 3+4");
>@@ -1735,7 +1735,7 @@ int bond_enslave(struct net_device *bond_dev, struct 
>net_device *slave_dev,
>   unblock_netpoll_tx();
>   }
> 
>-  if (bond_mode_uses_xmit_hash(bond))
>+  if (bond_mode_can_use_xmit_hash(bond))
>   bond_update_slave_arr(bond, NULL);
> 
>   bond->nest_level = dev_get_nest_level(bond_dev);
>@@ -1870,7 +1870,7 @@ static int __bond_release_one(struct net_device 
>*bond_dev,
>   if (BOND_MODE(bond) == BOND_MODE_8023AD)
>   bond_3ad_unbind_slave(slave);
> 
>-  if (bond_mode_uses_xmit_hash(bond))
>+  if (bond_mode_can_use_xmit_hash(bond))
>   bond_update_slave_arr(bond, slave);
> 
>   netdev_info(bond_dev, "Releasing %s interface %s\n",
>@@ -3102,7 +3102,7 @@ static int bond_slave_netdev_event(unsigned long event,
>* events. If these (miimon/arpmon) parameters are configured
>* then array gets refreshed twice and that should be fine!
>*/
>-  if (bond_mode_uses_xmit_hash(bond))
>+  if (bond_mode_can_use_xmit_hash(bond))
>   bond_update_slave_arr(bond, NULL);
>   break;
>   case NETDEV_CHANGEMTU:
>@@ -3322,7 +3322,7 @@ static int bond_open(struct net_device *bond_dev)
>*/
>   if (bond_alb_initialize(bond, (BOND_MODE(bond) == 
> BOND_MODE_ALB)))
>   return -ENOMEM;
>-  if (bond->params.tlb_dynamic_lb)
>+  if (bond->params.tlb_dynamic_lb || BOND_MODE(bond) == 
>BOND_MODE_ALB)
>   queue_delayed_work(bond->wq, &bond->alb_work, 0);
>   }
> 
>@@ -3341,7 +3341,7 @@ static int bond_open(struct net_device *bond_dev)
>   bond_3ad_initiate_agg_selection(bond, 1);
>   }
> 
>-  if (bond_mode_uses_xmit_hash(bond))
>+  if (bond_mode_can_use_xmit_hash(bond))
>   bond_update_slave_arr(bond, NULL);
> 
>   return 0;
>@@ -3892,7 +3892,7 @@ static void bond_slave_arr_handler(struct work_struct 
>*work)
>  * to determine the slave interface -
>  * (a) BOND_MODE_8023AD
>  * (b) BOND_MODE_XOR
>- * (c) 

Re: [PATCH net v2] rps: Correct wrong skb_flow_limit check when enable RPS

2018-05-11 Thread Willem de Bruijn
On Thu, May 10, 2018 at 6:09 PM,   wrote:
> From: Gao Feng 
>
> The skb flow limit is implemented for each CPU independently. In the
> current code, the function skb_flow_limit gets the softnet_data via
> this_cpu_ptr. But the target cpu of enqueue_to_backlog need not be the
> current cpu when RPS is enabled. As a result, skb_flow_limit checks
> the stats of the current CPU, while the skb is about to be appended to
> the queue of another CPU. That isn't the expected behavior.
>
> Now pass the softnet_data as a param to make this consistent.
>
> Fixes: 99bbc7074190 ("rps: selective flow shedding during softnet overflow")
> Signed-off-by: Gao Feng 

See also the discussion in the v1 of this patch.

The merits of moving flow_limit state from the irq cpu to the rps cpu
can be argued, but the existing behavior is intentional and correct,
so this should not be applied to net or backported to stable
branches.

My bad for reviving the discussion in the v1 thread while v2 was
already pending, sorry.
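To make the disagreement concrete, here is a minimal userspace C sketch, not kernel code: the names `softnet`, `flow_limit_reached`, and the fields are invented stand-ins for the real softnet_data state that skb_flow_limit() consults. It shows the difference between checking the current CPU's state and the RPS target CPU's state, which is what the v2 patch changes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU backlog state; `qlen` and `limit` stand in for
 * the real softnet_data fields consulted by skb_flow_limit(). */
struct softnet {
    int qlen;   /* packets currently queued on this CPU's backlog */
    int limit;  /* flow-limit threshold */
};

/* The check as argued in the patch: it is only consistent against the
 * state of the CPU whose queue the skb will actually join. */
static bool flow_limit_reached(const struct softnet *sd)
{
    return sd->qlen >= sd->limit;
}

/* Enqueue to the RPS target CPU, passing the *target's* softnet state
 * (the patch's approach) rather than the current CPU's. */
static bool enqueue_to_backlog(struct softnet *target)
{
    if (flow_limit_reached(target))
        return false;  /* drop */
    target->qlen++;
    return true;
}
```

Willem's point stands regardless of the sketch: whether the flow-limit state should belong to the irq CPU or the rps CPU is a design question, not a bug.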


Re: [PATCH bpf-next 3/7] samples: bpf: compile and link against full libbpf

2018-05-11 Thread Jakub Kicinski
On Thu, 10 May 2018 10:24:39 -0700, Jakub Kicinski wrote:
> samples/bpf currently cherry-picks object files from tools/lib/bpf
> to link against.  Just compile the full library and link statically
> against it.
> 
> Signed-off-by: Jakub Kicinski 
> Reviewed-by: Quentin Monnet 

Looks like this breaks some build configs :(  Fix is forthcoming, sorry!


Re: [RFC bpf-next 07/11] bpf: Add helper to retrieve socket in BPF

2018-05-11 Thread Martin KaFai Lau
On Fri, May 11, 2018 at 02:08:01PM -0700, Joe Stringer wrote:
> On 10 May 2018 at 22:00, Martin KaFai Lau  wrote:
> > On Wed, May 09, 2018 at 02:07:05PM -0700, Joe Stringer wrote:
> >> This patch adds a new BPF helper function, sk_lookup() which allows BPF
> >> programs to find out if there is a socket listening on this host, and
> >> returns a socket pointer which the BPF program can then access to
> >> determine, for instance, whether to forward or drop traffic. sk_lookup()
> >> takes a reference on the socket, so when a BPF program makes use of this
> >> function, it must subsequently pass the returned pointer into the newly
> >> added sk_release() to return the reference.
> >>
> >> By way of example, the following pseudocode would filter inbound
> >> connections at XDP if there is no corresponding service listening for
> >> the traffic:
> >>
> >>   struct bpf_sock_tuple tuple;
> >>   struct bpf_sock_ops *sk;
> >>
> >>   populate_tuple(ctx, &tuple); // Extract the 5tuple from the packet
> >>   sk = bpf_sk_lookup(ctx, &tuple, sizeof tuple, netns, 0);
> >>   if (!sk) {
> >> // Couldn't find a socket listening for this traffic. Drop.
> >> return TC_ACT_SHOT;
> >>   }
> >>   bpf_sk_release(sk, 0);
> >>   return TC_ACT_OK;
> >>
> >> Signed-off-by: Joe Stringer 
> >> ---
> 
> ...
> 
> >> @@ -4032,6 +4036,96 @@ static const struct bpf_func_proto 
> >> bpf_skb_get_xfrm_state_proto = {
> >>  };
> >>  #endif
> >>
> >> +struct sock *
> >> +sk_lookup(struct net *net, struct bpf_sock_tuple *tuple) {
> > Would it be possible to have another version that
> > returns a sk without taking its refcnt?
> > It may have performance benefit.
> 
> Not really. The sockets are not RCU-protected, and established sockets
> may be torn down without notice. If we don't take a reference, there's
> no guarantee that the socket will continue to exist for the duration
> of running the BPF program.
> 
> From what I follow, the comment below has a hidden implication which
> is that sockets without SOCK_RCU_FREE, eg established sockets, may be
> directly freed regardless of RCU.
Right, a SOCK_RCU_FREE sk is the one I am concerned about.
For example, a TCP_LISTEN socket does not require taking a refcnt
now.  Doing a bpf_sk_lookup() may have a rather big
impact on handling a TCP syn flood.  Or is the usual intention
to redirect instead of passing it up to the stack?


> 
> /* Sockets having SOCK_RCU_FREE will call this function after one RCU
>  * grace period. This is the case for UDP sockets and TCP listeners.
>  */
> static void __sk_destruct(struct rcu_head *head)
> ...
> 
> Therefore without the refcount, it won't be safe.


Re: [PATCH net-next 2/4] bonding: use common mac addr checks

2018-05-11 Thread Jay Vosburgh
Banerjee, Debabrata  wrote:

>> From: Jay Vosburgh [mailto:jay.vosbu...@canonical.com]
>> Debabrata Banerjee  wrote:
>
>> >-   if (!ether_addr_equal_64bits(rx_hash_table[index].mac_dst,
>> >-mac_bcast) &&
>> >-   !is_zero_ether_addr(rx_hash_table[index].mac_dst)) {
>> >+   if (is_valid_ether_addr(rx_hash_table[index].mac_dst)) {
>> 
>>  This change and the similar ones below will now fail non-broadcast
>> multicast Ethernet addresses, where the prior code would not.  Is this an
>> intentional change?
>
>Yes I don't see how it makes sense to use multicast addresses at all, but I 
>may be missing something. It's also illegal according to rfc1812 3.3.2, but 
>obviously this balancing mode is trying to be very clever. We probably 
>shouldn't violate the rfc anyway.

Fair enough, but I think it would be good to call this out in
the change log just in case it does somehow cause a regression.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com


Re: [GIT] Networking

2018-05-11 Thread Linus Torvalds
David, is there something you want to tell us?

Drugs are bad, m'kay..

   Linus

On Fri, May 11, 2018 at 2:00 PM David Miller  wrote:

> "from Kevin Easton", "Thanks to Bhadram Varka", "courtesy of Gustavo A.
> R.  Silva", "To Eric Dumazet we are most grateful for this fix", "This
> fix from YU Bo, we do appreciate", "Once again we are blessed by the
> honorable Eric Dumazet with this fix", "This fix is bestowed upon us by
> Andrew Tomt", "another great gift from Eric Dumazet", "to Hangbin Liu we
> give thanks for this", "Paolo Abeni, he gave us this", "thank you Moshe
> Shemesh", "from our good brother David Howells", "Daniel Juergens,
> you're the best!", "Debabrata Benerjee saved us!", "The ship is now
> water tight, thanks to Andrey Ignatov", "from Colin Ian King, man we've
> got holes everywhere!", "Jiri Pirko what would we do without you!


RE: [PATCH net-next 2/4] bonding: use common mac addr checks

2018-05-11 Thread Banerjee, Debabrata
> From: Jay Vosburgh [mailto:jay.vosbu...@canonical.com]
> Debabrata Banerjee  wrote:

> >-if (!ether_addr_equal_64bits(rx_hash_table[index].mac_dst,
> >- mac_bcast) &&
> >-!is_zero_ether_addr(rx_hash_table[index].mac_dst)) {
> >+if (is_valid_ether_addr(rx_hash_table[index].mac_dst)) {
> 
>   This change and the similar ones below will now fail non-broadcast
> multicast Ethernet addresses, where the prior code would not.  Is this an
> intentional change?

Yes I don't see how it makes sense to use multicast addresses at all, but I may 
be missing something. It's also illegal according to rfc1812 3.3.2, but 
obviously this balancing mode is trying to be very clever. We probably 
shouldn't violate the rfc anyway.
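For reference, the semantic difference Jay points out can be demonstrated with a small userspace sketch. These are plain-C reimplementations of the idea, not the actual etherdevice.h helpers: the old code rejected only broadcast and all-zero destinations, while is_valid_ether_addr() also rejects every multicast address (group bit set in the first octet).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for the etherdevice.h predicates. */
static bool is_zero(const uint8_t *a)
{
    for (int i = 0; i < 6; i++)
        if (a[i])
            return false;
    return true;
}

static bool is_broadcast(const uint8_t *a)
{
    for (int i = 0; i < 6; i++)
        if (a[i] != 0xff)
            return false;
    return true;
}

static bool is_multicast(const uint8_t *a)
{
    return a[0] & 0x01;  /* group bit set in the first octet */
}

/* Old check: reject only broadcast and all-zero destinations. */
static bool old_check(const uint8_t *a)
{
    return !is_broadcast(a) && !is_zero(a);
}

/* is_valid_ether_addr()-style check: additionally rejects *all*
 * multicast (broadcast is a special case of multicast). */
static bool new_check(const uint8_t *a)
{
    return !is_multicast(a) && !is_zero(a);
}
```

The mac_v6_allmcast address left in bond_alb.c (33:33:00:00:00:01) is exactly the kind of address the two checks disagree on.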


[PATCH net-next 2/3] net: dsa: mv88e6xxx: add IEEE and IP mapping ops

2018-05-11 Thread Vivien Didelot
All Marvell switch families except 88E6390 have direct registers in
Global 1 for IEEE and IP priorities override mapping. The 88E6390 uses
indirect tables instead.

Add .ieee_pri_map and .ip_pri_map ops to distinguish that and call them
from a mv88e6xxx_pri_setup helper. Only non-6390 are concerned ATM.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/mv88e6xxx/chip.c| 94 +++--
 drivers/net/dsa/mv88e6xxx/chip.h|  3 +
 drivers/net/dsa/mv88e6xxx/global1.c | 58 ++
 drivers/net/dsa/mv88e6xxx/global1.h |  3 +
 4 files changed, 127 insertions(+), 31 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 1cebde80b101..df92fed44674 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -1104,6 +1104,25 @@ static void mv88e6xxx_port_stp_state_set(struct 
dsa_switch *ds, int port,
dev_err(ds->dev, "p%d: failed to update state\n", port);
 }
 
+static int mv88e6xxx_pri_setup(struct mv88e6xxx_chip *chip)
+{
+   int err;
+
+   if (chip->info->ops->ieee_pri_map) {
+   err = chip->info->ops->ieee_pri_map(chip);
+   if (err)
+   return err;
+   }
+
+   if (chip->info->ops->ip_pri_map) {
+   err = chip->info->ops->ip_pri_map(chip);
+   if (err)
+   return err;
+   }
+
+   return 0;
+}
+
 static int mv88e6xxx_devmap_setup(struct mv88e6xxx_chip *chip)
 {
int target, port;
@@ -2252,37 +2271,6 @@ static int mv88e6xxx_g1_setup(struct mv88e6xxx_chip 
*chip)
 {
int err;
 
-   /* Configure the IP ToS mapping registers. */
-   err = mv88e6xxx_g1_write(chip, MV88E6XXX_G1_IP_PRI_0, 0x0000);
-   if (err)
-   return err;
-   err = mv88e6xxx_g1_write(chip, MV88E6XXX_G1_IP_PRI_1, 0x0000);
-   if (err)
-   return err;
-   err = mv88e6xxx_g1_write(chip, MV88E6XXX_G1_IP_PRI_2, 0x0000);
-   if (err)
-   return err;
-   err = mv88e6xxx_g1_write(chip, MV88E6XXX_G1_IP_PRI_3, 0x0000);
-   if (err)
-   return err;
-   err = mv88e6xxx_g1_write(chip, MV88E6XXX_G1_IP_PRI_4, 0x0000);
-   if (err)
-   return err;
-   err = mv88e6xxx_g1_write(chip, MV88E6XXX_G1_IP_PRI_5, 0x0000);
-   if (err)
-   return err;
-   err = mv88e6xxx_g1_write(chip, MV88E6XXX_G1_IP_PRI_6, 0x0000);
-   if (err)
-   return err;
-   err = mv88e6xxx_g1_write(chip, MV88E6XXX_G1_IP_PRI_7, 0x0000);
-   if (err)
-   return err;
-
-   /* Configure the IEEE 802.1p priority mapping register. */
-   err = mv88e6xxx_g1_write(chip, MV88E6XXX_G1_IEEE_PRI, 0xfa41);
-   if (err)
-   return err;
-
/* Initialize the statistics unit */
err = mv88e6xxx_stats_set_histogram(chip);
if (err)
@@ -2365,6 +2353,10 @@ static int mv88e6xxx_setup(struct dsa_switch *ds)
if (err)
goto unlock;
 
+   err = mv88e6xxx_pri_setup(chip);
+   if (err)
+   goto unlock;
+
/* Setup PTP Hardware Clock and timestamping */
if (chip->info->ptp_support) {
err = mv88e6xxx_ptp_setup(chip);
@@ -2592,6 +2584,8 @@ static int mv88e6xxx_set_eeprom(struct dsa_switch *ds,
 
 static const struct mv88e6xxx_ops mv88e6085_ops = {
/* MV88E6XXX_FAMILY_6097 */
+   .ieee_pri_map = mv88e6085_g1_ieee_pri_map,
+   .ip_pri_map = mv88e6085_g1_ip_pri_map,
.irl_init_all = mv88e6352_g2_irl_init_all,
.set_switch_mac = mv88e6xxx_g1_set_switch_mac,
.phy_read = mv88e6185_phy_ppu_read,
@@ -2628,6 +2622,8 @@ static const struct mv88e6xxx_ops mv88e6085_ops = {
 
 static const struct mv88e6xxx_ops mv88e6095_ops = {
/* MV88E6XXX_FAMILY_6095 */
+   .ieee_pri_map = mv88e6085_g1_ieee_pri_map,
+   .ip_pri_map = mv88e6085_g1_ip_pri_map,
.set_switch_mac = mv88e6xxx_g1_set_switch_mac,
.phy_read = mv88e6185_phy_ppu_read,
.phy_write = mv88e6185_phy_ppu_write,
@@ -2652,6 +2648,8 @@ static const struct mv88e6xxx_ops mv88e6095_ops = {
 
 static const struct mv88e6xxx_ops mv88e6097_ops = {
/* MV88E6XXX_FAMILY_6097 */
+   .ieee_pri_map = mv88e6085_g1_ieee_pri_map,
+   .ip_pri_map = mv88e6085_g1_ip_pri_map,
.irl_init_all = mv88e6352_g2_irl_init_all,
.set_switch_mac = mv88e6xxx_g2_set_switch_mac,
.phy_read = mv88e6xxx_g2_smi_phy_read,
@@ -2686,6 +2684,8 @@ static const struct mv88e6xxx_ops mv88e6097_ops = {
 
 static const struct mv88e6xxx_ops mv88e6123_ops = {
/* MV88E6XXX_FAMILY_6165 */
+   .ieee_pri_map = mv88e6085_g1_ieee_pri_map,
+   .ip_pri_map = mv88e6085_g1_ip_pri_map,
.irl_init_all = mv88e6352_g2_irl_init_all,
.set_switch_mac = mv88e6xxx_g2_set_switch_mac,
.phy_read = mv88e6xxx_g2_smi_phy_read,
@@ -2714,6 +2714,8 @@ 

[PATCH net-next 0/3] net: dsa: mv88e6xxx: remove Global 1 setup

2018-05-11 Thread Vivien Didelot
The mv88e6xxx driver is still writing arbitrary registers at setup time,
e.g. priority override bits. Add ops for them and provide specific setup
functions for priority and stats before getting rid of the erroneous
mv88e6xxx_g1_setup code, as previously done with Global 2.

Vivien Didelot (3):
  net: dsa: mv88e6xxx: use helper for 6390 histogram
  net: dsa: mv88e6xxx: add IEEE and IP mapping ops
  net: dsa: mv88e6xxx: add a stats setup function

 drivers/net/dsa/mv88e6xxx/chip.c| 121 +---
 drivers/net/dsa/mv88e6xxx/chip.h|   3 +
 drivers/net/dsa/mv88e6xxx/global1.c |  73 ++---
 drivers/net/dsa/mv88e6xxx/global1.h |  15 +++-
 4 files changed, 149 insertions(+), 63 deletions(-)

-- 
2.17.0
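The pattern used throughout this series — optional per-chip ops plus a setup helper that skips absent ones — can be sketched in isolation. Names here (`chip_ops`, `pri_setup`, `fake_ieee_map`) are illustrative, not the driver's:

```c
#include <assert.h>
#include <stddef.h>

struct chip;

/* Per-family operations; a NULL pointer means the family has no such
 * register and the step is simply skipped. */
struct chip_ops {
    int (*ieee_pri_map)(struct chip *chip);
    int (*ip_pri_map)(struct chip *chip);
};

struct chip {
    const struct chip_ops *ops;
    int ieee_done, ip_done;  /* record which steps ran */
};

/* Setup helper: call each op if present, propagate the first error. */
static int pri_setup(struct chip *chip)
{
    int err;

    if (chip->ops->ieee_pri_map) {
        err = chip->ops->ieee_pri_map(chip);
        if (err)
            return err;
    }
    if (chip->ops->ip_pri_map) {
        err = chip->ops->ip_pri_map(chip);
        if (err)
            return err;
    }
    return 0;
}

static int fake_ieee_map(struct chip *chip)
{
    chip->ieee_done = 1;
    return 0;
}

/* A family that only implements the IEEE mapping. */
static const struct chip_ops ieee_only = { .ieee_pri_map = fake_ieee_map };
```

This is the same shape mv88e6xxx_pri_setup takes in patch 2/3, which is what lets the 88E6390 (indirect tables) simply leave the direct-register ops unset.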



[PATCH net-next 1/3] net: dsa: mv88e6xxx: use helper for 6390 histogram

2018-05-11 Thread Vivien Didelot
The Marvell 88E6390 model has its histogram mode bits moved in the
Global 1 Control 2 register. Use the previously introduced
mv88e6xxx_g1_ctl2_mask helper to set them.

At the same time complete the documentation of the said register.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/mv88e6xxx/global1.c | 15 +++
 drivers/net/dsa/mv88e6xxx/global1.h | 12 +---
 2 files changed, 12 insertions(+), 15 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/global1.c 
b/drivers/net/dsa/mv88e6xxx/global1.c
index 244ee1ff9edc..0f2b05342c18 100644
--- a/drivers/net/dsa/mv88e6xxx/global1.c
+++ b/drivers/net/dsa/mv88e6xxx/global1.c
@@ -393,18 +393,9 @@ int mv88e6390_g1_rmu_disable(struct mv88e6xxx_chip *chip)
 
 int mv88e6390_g1_stats_set_histogram(struct mv88e6xxx_chip *chip)
 {
-   u16 val;
-   int err;
-
-   err = mv88e6xxx_g1_read(chip, MV88E6XXX_G1_CTL2, &val);
-   if (err)
-   return err;
-
-   val |= MV88E6XXX_G1_CTL2_HIST_RX_TX;
-
-   err = mv88e6xxx_g1_write(chip, MV88E6XXX_G1_CTL2, val);
-
-   return err;
+   return mv88e6xxx_g1_ctl2_mask(chip, MV88E6390_G1_CTL2_HIST_MODE_MASK,
+ MV88E6390_G1_CTL2_HIST_MODE_RX |
+ MV88E6390_G1_CTL2_HIST_MODE_TX);
 }
 
 int mv88e6xxx_g1_set_device_number(struct mv88e6xxx_chip *chip, int index)
diff --git a/drivers/net/dsa/mv88e6xxx/global1.h 
b/drivers/net/dsa/mv88e6xxx/global1.h
index e186a026e1b1..c357b3ca9a09 100644
--- a/drivers/net/dsa/mv88e6xxx/global1.h
+++ b/drivers/net/dsa/mv88e6xxx/global1.h
@@ -201,12 +201,13 @@
 
 /* Offset 0x1C: Global Control 2 */
 #define MV88E6XXX_G1_CTL2  0x1c
-#define MV88E6XXX_G1_CTL2_HIST_RX  0x0040
-#define MV88E6XXX_G1_CTL2_HIST_TX  0x0080
-#define MV88E6XXX_G1_CTL2_HIST_RX_TX   0x00c0
 #define MV88E6185_G1_CTL2_CASCADE_PORT_MASK0xf000
 #define MV88E6185_G1_CTL2_CASCADE_PORT_NONE0xe000
 #define MV88E6185_G1_CTL2_CASCADE_PORT_MULTI   0xf000
+#define MV88E6352_G1_CTL2_HEADER_TYPE_MASK 0xc000
+#define MV88E6352_G1_CTL2_HEADER_TYPE_ORIG 0x
+#define MV88E6352_G1_CTL2_HEADER_TYPE_MGMT 0x4000
+#define MV88E6390_G1_CTL2_HEADER_TYPE_LAG  0x8000
 #define MV88E6352_G1_CTL2_RMU_MODE_MASK0x3000
 #define MV88E6352_G1_CTL2_RMU_MODE_DISABLED0x
 #define MV88E6352_G1_CTL2_RMU_MODE_PORT_4  0x1000
@@ -223,6 +224,11 @@
 #define MV88E6390_G1_CTL2_RMU_MODE_PORT_10 0x0300
 #define MV88E6390_G1_CTL2_RMU_MODE_ALL_DSA 0x0600
 #define MV88E6390_G1_CTL2_RMU_MODE_DISABLED0x0700
+#define MV88E6390_G1_CTL2_HIST_MODE_MASK   0x00c0
+#define MV88E6390_G1_CTL2_HIST_MODE_RX 0x0040
+#define MV88E6390_G1_CTL2_HIST_MODE_TX 0x0080
+#define MV88E6352_G1_CTL2_CTR_MODE_MASK0x0060
+#define MV88E6390_G1_CTL2_CTR_MODE 0x0020
 #define MV88E6XXX_G1_CTL2_DEVICE_NUMBER_MASK   0x001f
 
 /* Offset 0x1D: Stats Operation Register */
-- 
2.17.0
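The helper used above does a read-modify-write under a field mask. As a rough sketch with the register access mocked out (the real mv88e6xxx_g1_ctl2_mask goes through MDIO and can fail; the field values below are the ones from global1.h):

```c
#include <assert.h>
#include <stdint.h>

/* Mocked 16-bit Control 2 register; the real helper reads and writes
 * it over MDIO. */
static uint16_t ctl2_reg;

/* Clear the bits covered by `mask`, then set `val` within that field,
 * leaving all other bits of the register untouched. */
static int ctl2_mask(uint16_t mask, uint16_t val)
{
    ctl2_reg = (ctl2_reg & ~mask) | (val & mask);
    return 0;
}

#define HIST_MODE_MASK 0x00c0
#define HIST_MODE_RX   0x0040
#define HIST_MODE_TX   0x0080
```

The benefit over the open-coded read/OR/write it replaces is that re-running setup cannot accumulate stale bits: the whole field is rewritten each time.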



[PATCH net-next 3/3] net: dsa: mv88e6xxx: add a stats setup function

2018-05-11 Thread Vivien Didelot
Now that the Global 1 specific setup function only sets up the statistics
unit, kill it in favor of a mv88e6xxx_stats_setup function.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/mv88e6xxx/chip.c | 27 ++-
 1 file changed, 10 insertions(+), 17 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index df92fed44674..a4efc6544c0d 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -995,14 +995,6 @@ static void mv88e6xxx_get_ethtool_stats(struct dsa_switch 
*ds, int port,
 
 }
 
-static int mv88e6xxx_stats_set_histogram(struct mv88e6xxx_chip *chip)
-{
-   if (chip->info->ops->stats_set_histogram)
-   return chip->info->ops->stats_set_histogram(chip);
-
-   return 0;
-}
-
 static int mv88e6xxx_get_regs_len(struct dsa_switch *ds, int port)
 {
return 32 * sizeof(u16);
@@ -2267,14 +2259,16 @@ static int mv88e6xxx_set_ageing_time(struct dsa_switch 
*ds,
return err;
 }
 
-static int mv88e6xxx_g1_setup(struct mv88e6xxx_chip *chip)
+static int mv88e6xxx_stats_setup(struct mv88e6xxx_chip *chip)
 {
int err;
 
/* Initialize the statistics unit */
-   err = mv88e6xxx_stats_set_histogram(chip);
-   if (err)
-   return err;
+   if (chip->info->ops->stats_set_histogram) {
+   err = chip->info->ops->stats_set_histogram(chip);
+   if (err)
+   return err;
+   }
 
return mv88e6xxx_g1_stats_clear(chip);
 }
@@ -2300,11 +2294,6 @@ static int mv88e6xxx_setup(struct dsa_switch *ds)
goto unlock;
}
 
-   /* Setup Switch Global 1 Registers */
-   err = mv88e6xxx_g1_setup(chip);
-   if (err)
-   goto unlock;
-
err = mv88e6xxx_irl_setup(chip);
if (err)
goto unlock;
@@ -2368,6 +2357,10 @@ static int mv88e6xxx_setup(struct dsa_switch *ds)
goto unlock;
}
 
+   err = mv88e6xxx_stats_setup(chip);
+   if (err)
+   goto unlock;
+
 unlock:
mutex_unlock(&chip->reg_lock);
 
-- 
2.17.0



Re: [bpf-next V2 PATCH 4/4] xdp: change ndo_xdp_xmit API to support bulking

2018-05-11 Thread Jesper Dangaard Brouer
On Fri, 11 May 2018 20:12:12 +0200 Jesper Dangaard Brouer  
wrote:

> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 03ed492c4e14..debdb6286170 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1185,9 +1185,13 @@ struct dev_ifalias {
>   *   This function is used to set or query state related to XDP on the
>   *   netdevice and manage BPF offload. See definition of
>   *   enum bpf_netdev_command for details.
> - * int (*ndo_xdp_xmit)(struct net_device *dev, struct xdp_frame *xdp);
> - *   This function is used to submit a XDP packet for transmit on a
> - *   netdevice.
> + * int (*ndo_xdp_xmit)(struct net_device *dev, int n, struct xdp_frame 
> **xdp);
> + *   This function is used to submit @n XDP packets for transmit on a
> + *   netdevice. Returns the number of frames successfully transmitted;
> + *   frames that got dropped are freed/returned via xdp_return_frame().
> + *   A negative return means a general error invoking the ndo: no frames
> + *   were xmit'ed and the core caller will free all frames.
> + *   TODO: Consider add flag to allow sending flush operation.
> + *   TODO: Consider add flag to allow sending flush operation.

Another reason for adding a flag to ndo_xdp_xmit, is to allow calling
it from other contexts.  Like from AF_XDP TX code path, which in the
sendmsg is not protected by NAPI.


>   * void (*ndo_xdp_flush)(struct net_device *dev);
>   *   This function is used to inform the driver to flush a particular
>   *   xdp tx queue. Must be called on same CPU as xdp_xmit.
> @@ -1375,8 +1379,8 @@ struct net_device_ops {
>  int needed_headroom);
>   int (*ndo_bpf)(struct net_device *dev,
>  struct netdev_bpf *bpf);
> - int (*ndo_xdp_xmit)(struct net_device *dev,
> - struct xdp_frame *xdp);
> + int (*ndo_xdp_xmit)(struct net_device *dev, int n,
> + struct xdp_frame **xdp);
>   void(*ndo_xdp_flush)(struct net_device *dev);
>  };
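The return convention documented above can be sketched with a userspace mock. `mock_xdp_xmit` and its `capacity` parameter are invented here; the point is the ownership split: a non-negative return says how many frames the driver consumed (freeing the dropped tail itself), while a negative return leaves all frames with the caller.

```c
#include <assert.h>
#include <stddef.h>

struct xdp_frame { int freed; };

/* Mock of the bulk ndo: transmit up to `capacity` frames, free any it
 * has to drop, and return the number actually transmitted.  A negative
 * return means the ndo consumed nothing at all. */
static int mock_xdp_xmit(struct xdp_frame **frames, int n, int capacity)
{
    if (n < 0)
        return -1;  /* general error: caller keeps ownership of all frames */

    int sent = n < capacity ? n : capacity;
    for (int i = sent; i < n; i++)
        frames[i]->freed = 1;  /* stands in for xdp_return_frame() */
    return sent;
}
```

Bulking this way lets the driver amortize the per-call TX ring locking that the one-frame-per-call API paid on every packet.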



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [RFC bpf-next 07/11] bpf: Add helper to retrieve socket in BPF

2018-05-11 Thread Joe Stringer
On 10 May 2018 at 22:00, Martin KaFai Lau  wrote:
> On Wed, May 09, 2018 at 02:07:05PM -0700, Joe Stringer wrote:
>> This patch adds a new BPF helper function, sk_lookup() which allows BPF
>> programs to find out if there is a socket listening on this host, and
>> returns a socket pointer which the BPF program can then access to
>> determine, for instance, whether to forward or drop traffic. sk_lookup()
>> takes a reference on the socket, so when a BPF program makes use of this
>> function, it must subsequently pass the returned pointer into the newly
>> added sk_release() to return the reference.
>>
>> By way of example, the following pseudocode would filter inbound
>> connections at XDP if there is no corresponding service listening for
>> the traffic:
>>
>>   struct bpf_sock_tuple tuple;
>>   struct bpf_sock_ops *sk;
>>
>>   populate_tuple(ctx, &tuple); // Extract the 5tuple from the packet
>>   sk = bpf_sk_lookup(ctx, &tuple, sizeof tuple, netns, 0);
>>   if (!sk) {
>> // Couldn't find a socket listening for this traffic. Drop.
>> return TC_ACT_SHOT;
>>   }
>>   bpf_sk_release(sk, 0);
>>   return TC_ACT_OK;
>>
>> Signed-off-by: Joe Stringer 
>> ---

...

>> @@ -4032,6 +4036,96 @@ static const struct bpf_func_proto 
>> bpf_skb_get_xfrm_state_proto = {
>>  };
>>  #endif
>>
>> +struct sock *
>> +sk_lookup(struct net *net, struct bpf_sock_tuple *tuple) {
> Would it be possible to have another version that
> returns a sk without taking its refcnt?
> It may have performance benefit.

Not really. The sockets are not RCU-protected, and established sockets
may be torn down without notice. If we don't take a reference, there's
no guarantee that the socket will continue to exist for the duration
of running the BPF program.

From what I follow, the comment below has a hidden implication which
is that sockets without SOCK_RCU_FREE, eg established sockets, may be
directly freed regardless of RCU.

/* Sockets having SOCK_RCU_FREE will call this function after one RCU
 * grace period. This is the case for UDP sockets and TCP listeners.
 */
static void __sk_destruct(struct rcu_head *head)
...

Therefore without the refcount, it won't be safe.
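The acquire/release discipline being debated — every successful lookup pins the socket and must be paired with a release before the program exits — can be mocked up like so. These are not the BPF helpers themselves; `mock_sock`, `mock_lookup`, and `mock_release` are invented to show the refcount balance the verifier would have to enforce.

```c
#include <assert.h>
#include <stddef.h>

struct mock_sock { int refcnt; };

/* sk_lookup-style acquire: pin the socket so it cannot be freed while
 * the program still holds the pointer (the point of Joe's reply: without
 * this, a non-SOCK_RCU_FREE socket could be freed mid-program). */
static struct mock_sock *mock_lookup(struct mock_sock *sk)
{
    if (!sk)
        return NULL;  /* no listener found */
    sk->refcnt++;
    return sk;
}

/* sk_release-style drop of the pinned reference. */
static void mock_release(struct mock_sock *sk)
{
    sk->refcnt--;
}
```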


[GIT] Networking

2018-05-11 Thread David Miller

1) Verify lengths of keys provided by the user in AF_KEY, from
   Kevin Easton.

2) Add device ID for BCM89610 PHY.  Thanks to Bhadram Varka.

3) Add Spectre guards to some ATM code, courtesy of Gustavo
   A. R. Silva.

4) Fix infinite loop in NSH protocol code.  To Eric Dumazet
   we are most grateful for this fix.

5) Line up /proc/net/netlink headers properly.  This fix from YU Bo,
   we do appreciate.

6) Use after free in TLS code.  Once again we are blessed by the
   honorable Eric Dumazet with this fix.

7) Fix regression in TLS code causing stalls on partial TLS records.
   This fix is bestowed upon us by Andrew Tomt.

8) Deal with too small MTUs properly in LLC code, another great gift
   from Eric Dumazet.

9) Handle cached route flushing properly wrt. MTU locking in ipv4,
   to Hangbin Liu we give thanks for this.

10) Fix regression in SO_BINDTODEVICE handling wrt. UDP socket demux.
Paolo Abeni, he gave us this.

11) Range check coalescing parameters in mlx4 driver, thank you
Moshe Shemesh.

12) Some ipv6 ICMP error handling fixes in rxrpc, from our good
brother David Howells.

13) Fix kexec on mlx5 by freeing IRQs in shutdown path.  Daniel
Juergens, you're the best!

14) Don't send bonding RLB updates to invalid MAC addresses.
Debabrata Benerjee saved us!

15) Uh oh, we were leaking in udp_sendmsg and ping_v4_sendmsg.  The
ship is now water tight, thanks to Andrey Ignatov.

16) IPSEC memory leak in ixgbe from Colin Ian King, man we've got
holes everywhere!

17) Fix error path in tcf_proto_create, Jiri Pirko what would we
do without you!

Please pull, thanks a lot!

The following changes since commit 1504269814263c9676b4605a6a91e14dc6ceac21:

  Merge tag 'linux-kselftest-4.17-rc4' of 
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest (2018-05-03 
19:26:51 -1000)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git 

for you to fetch changes up to a52956dfc503f8cc5cfe6454959b7049fddb4413:

  net sched actions: fix refcnt leak in skbmod (2018-05-11 16:37:03 -0400)


Adi Nissim (1):
  net/mlx5: E-Switch, Include VF RDMA stats in vport statistics

Alexander Aring (1):
  net: ieee802154: 6lowpan: fix frag reassembly

Anders Roxell (1):
  selftests: net: use TEST_PROGS_EXTENDED

Andre Tomt (1):
  net/tls: Fix connection stall on partial tls record

Andrew Lunn (1):
  net: dsa: mv88e6xxx: Fix PHY interrupts by parameterising PHY base address

Andrey Ignatov (1):
  ipv4: fix memory leaks in udp_sendmsg, ping_v4_sendmsg

Antoine Tenart (1):
  net: phy: sfp: fix the BR,min computation

Bhadram Varka (1):
  net: phy: broadcom: add support for BCM89610 PHY

Christophe JAILLET (2):
  net/mlx4_en: Fix an error handling path in 'mlx4_en_init_netdev()'
  mlxsw: core: Fix an error handling path in 
'mlxsw_core_bus_device_register()'

Colin Ian King (5):
  firestream: fix spelling mistake: "reseverd" -> "reserved"
  sctp: fix spelling mistake: "max_retans" -> "max_retrans"
  net/9p: fix spelling mistake: "suspsend" -> "suspend"
  qed: fix spelling mistake: "taskelt" -> "tasklet"
  ixgbe: fix memory leak on ipsec allocation

Daniel Borkmann (1):
  bpf: use array_index_nospec in find_prog_type

Daniel Jurgens (1):
  net/mlx5: Free IRQs in shutdown path

David Howells (5):
  rxrpc: Fix missing start of call timeout
  rxrpc: Fix error reception on AF_INET6 sockets
  rxrpc: Fix the min security level for kernel calls
  rxrpc: Add a tracepoint to log ICMP/ICMP6 and error messages
  rxrpc: Trace UDP transmission failure

David S. Miller (13):
  Merge git://git.kernel.org/.../bpf/bpf
  Merge branch 'for-upstream' of 
git://git.kernel.org/.../bluetooth/bluetooth
  Merge branch 'master' of git://git.kernel.org/.../klassert/ipsec
  Merge branch 'Aquantia-various-patches-2018-05'
  Merge branch 'ieee802154-for-davem-2018-05-08' of 
git://git.kernel.org/.../sschmidt/wpan
  Merge tag 'linux-can-fixes-for-4.17-20180508' of 
ssh://gitolite.kernel.org/.../mkl/linux-can
  Merge branch 'qed-rdma-fixes'
  Merge tag 'mac80211-for-davem-2018-05-09' of 
git://git.kernel.org/.../jberg/mac80211
  Merge tag 'linux-can-fixes-for-4.17-20180510' of 
ssh://gitolite.kernel.org/.../mkl/linux-can
  Merge branch 'bonding-bug-fixes-and-regressions'
  Merge tag 'mlx5-fixes-2018-05-10' of git://git.kernel.org/.../saeed/linux
  Merge tag 'rxrpc-fixes-20180510' of 
git://git.kernel.org/.../dhowells/linux-fs
  Merge branch '10GbE' of git://git.kernel.org/.../jkirsher/net-queue

Davide Caratti (1):
  tc-testing: fix tdc tests for 'bpf' action

Debabrata Banerjee (2):
  bonding: do not allow rlb updates to invalid mac
  bonding: send learning packets for vlans on slave

Emil Tantilov (1):
  ixgbe: return error on unsupported SFP module 

Re: [PATCH net-next 2/4] bonding: use common mac addr checks

2018-05-11 Thread Jay Vosburgh
Debabrata Banerjee  wrote:

>Replace homegrown mac addr checks with faster defs from etherdevice.h
>
>Signed-off-by: Debabrata Banerjee 
>---
> drivers/net/bonding/bond_alb.c | 28 +---
> 1 file changed, 9 insertions(+), 19 deletions(-)
>
>diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
>index c2f6c58e4e6a..180e50f7806f 100644
>--- a/drivers/net/bonding/bond_alb.c
>+++ b/drivers/net/bonding/bond_alb.c
>@@ -40,11 +40,6 @@
> #include 
> #include 
> 
>-
>-
>-static const u8 mac_bcast[ETH_ALEN + 2] __long_aligned = {
>-  0xff, 0xff, 0xff, 0xff, 0xff, 0xff
>-};
> static const u8 mac_v6_allmcast[ETH_ALEN + 2] __long_aligned = {
>   0x33, 0x33, 0x00, 0x00, 0x00, 0x01
> };
>@@ -420,9 +415,7 @@ static void rlb_clear_slave(struct bonding *bond, struct 
>slave *slave)
> 
>   if (assigned_slave) {
>   rx_hash_table[index].slave = assigned_slave;
>-  if (!ether_addr_equal_64bits(rx_hash_table[index].mac_dst,
>-   mac_bcast) &&
>-  !is_zero_ether_addr(rx_hash_table[index].mac_dst)) {
>+  if (is_valid_ether_addr(rx_hash_table[index].mac_dst)) {

This change and the similar ones below will now fail
non-broadcast multicast Ethernet addresses, where the prior code would
not.  Is this an intentional change?

-J

>   bond_info->rx_hashtbl[index].ntt = 1;
>   bond_info->rx_ntt = 1;
>   /* A slave has been removed from the
>@@ -525,8 +518,7 @@ static void rlb_req_update_slave_clients(struct bonding 
>*bond, struct slave *sla
>   client_info = &(bond_info->rx_hashtbl[hash_index]);
> 
>   if ((client_info->slave == slave) &&
>-  !ether_addr_equal_64bits(client_info->mac_dst, mac_bcast) &&
>-  !is_zero_ether_addr(client_info->mac_dst)) {
>+  is_valid_ether_addr(client_info->mac_dst)) {
>   client_info->ntt = 1;
>   ntt = 1;
>   }
>@@ -567,8 +559,7 @@ static void rlb_req_update_subnet_clients(struct bonding 
>*bond, __be32 src_ip)
>   if ((client_info->ip_src == src_ip) &&
>   !ether_addr_equal_64bits(client_info->slave->dev->dev_addr,
>bond->dev->dev_addr) &&
>-  !ether_addr_equal_64bits(client_info->mac_dst, mac_bcast) &&
>-  !is_zero_ether_addr(client_info->mac_dst)) {
>+  is_valid_ether_addr(client_info->mac_dst)) {
>   client_info->ntt = 1;
>   bond_info->rx_ntt = 1;
>   }
>@@ -596,7 +587,7 @@ static struct slave *rlb_choose_channel(struct sk_buff 
>*skb, struct bonding *bon
>   if ((client_info->ip_src == arp->ip_src) &&
>   (client_info->ip_dst == arp->ip_dst)) {
>   /* the entry is already assigned to this client */
>-  if (!ether_addr_equal_64bits(arp->mac_dst, mac_bcast)) {
>+  if (!is_broadcast_ether_addr(arp->mac_dst)) {
>   /* update mac address from arp */
>   ether_addr_copy(client_info->mac_dst, 
> arp->mac_dst);
>   }
>@@ -644,8 +635,7 @@ static struct slave *rlb_choose_channel(struct sk_buff 
>*skb, struct bonding *bon
>   ether_addr_copy(client_info->mac_src, arp->mac_src);
>   client_info->slave = assigned_slave;
> 
>-  if (!ether_addr_equal_64bits(client_info->mac_dst, mac_bcast) &&
>-  !is_zero_ether_addr(client_info->mac_dst)) {
>+  if (is_valid_ether_addr(client_info->mac_dst)) {
>   client_info->ntt = 1;
>   bond->alb_info.rx_ntt = 1;
>   } else {
>@@ -1418,9 +1408,9 @@ int bond_alb_xmit(struct sk_buff *skb, struct net_device 
>*bond_dev)
>   case ETH_P_IP: {
>   const struct iphdr *iph = ip_hdr(skb);
> 
>-  if (ether_addr_equal_64bits(eth_data->h_dest, mac_bcast) ||
>-  (iph->daddr == ip_bcast) ||
>-  (iph->protocol == IPPROTO_IGMP)) {
>+  if (is_broadcast_ether_addr(eth_data->h_dest) ||
>+  iph->daddr == ip_bcast ||
>+  iph->protocol == IPPROTO_IGMP) {
>   do_tx_balance = false;
>   break;
>   }
>@@ -1432,7 +1422,7 @@ int bond_alb_xmit(struct sk_buff *skb, struct net_device 
>*bond_dev)
>   /* IPv6 doesn't really use broadcast mac address, but leave
>* that here just in case.
>*/
>-  if 

Re: INFO: rcu detected stall in kfree_skbmem

2018-05-11 Thread Marcelo Ricardo Leitner
On Fri, May 11, 2018 at 12:08:33PM -0700, Eric Dumazet wrote:
>
>
> On 05/11/2018 11:41 AM, Marcelo Ricardo Leitner wrote:
>
> > But calling ip6_xmit with rcu_read_lock is expected. tcp stack also
> > does it.
> > Thus I think this is more of an issue with IPv6 stack. If a host has
> > an extensive ip6tables ruleset, it probably generates this more
> > easily.
> >
> >>>  sctp_v6_xmit+0x4a5/0x6b0 net/sctp/ipv6.c:225
> >>>  sctp_packet_transmit+0x26f6/0x3ba0 net/sctp/output.c:650
> >>>  sctp_outq_flush+0x1373/0x4370 net/sctp/outqueue.c:1197
> >>>  sctp_outq_uncork+0x6a/0x80 net/sctp/outqueue.c:776
> >>>  sctp_cmd_interpreter net/sctp/sm_sideeffect.c:1820 [inline]
> >>>  sctp_side_effects net/sctp/sm_sideeffect.c:1220 [inline]
> >>>  sctp_do_sm+0x596/0x7160 net/sctp/sm_sideeffect.c:1191
> >>>  sctp_generate_heartbeat_event+0x218/0x450 net/sctp/sm_sideeffect.c:406
> >>>  call_timer_fn+0x230/0x940 kernel/time/timer.c:1326
> >>>  expire_timers kernel/time/timer.c:1363 [inline]
> >
> > Having this call from a timer means it wasn't processing sctp stack
> > for too long.
> >
>
> I feel the problem is that this part is looping, in some infinite loop.
>
> I have seen this stack traces in other reports.

Checked mail history now, seems at least two other reports on RCU
stalls had sctp_generate_heartbeat_event involved.

>
> Maybe some kind of list corruption.

Could be.
Do we know if it generated a flood of packets?

  Marcelo


Re: [PATCH net 1/1] net sched actions: fix refcnt leak in skbmod

2018-05-11 Thread David Miller
From: Roman Mashak 
Date: Fri, 11 May 2018 14:35:33 -0400

> When application fails to pass flags in netlink TLV when replacing
> existing skbmod action, the kernel will leak refcnt:
> 
> $ tc actions get action skbmod index 1
> total acts 0
> 
> action order 0: skbmod pipe set smac 00:11:22:33:44:55
>  index 1 ref 1 bind 0
> 
> For example, at this point a buggy application replaces the action with
> index 1 with new smac 00:aa:22:33:44:55, it fails because of zero flags,
> however refcnt gets bumped:
> 
> $ tc actions get actions skbmod index 1
> total acts 0
> 
> action order 0: skbmod pipe set smac 00:11:22:33:44:55
>  index 1 ref 2 bind 0
> $
> 
> The patch fixes this by calling tcf_idr_release() on existing actions.
> 
> Fixes: 86da71b57383d ("net_sched: Introduce skbmod action")
> Signed-off-by: Roman Mashak 

Applied and queued up for -stable, thanks.
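The bug pattern this patch fixes — a lookup that bumps the refcount but a failure path that forgets to drop it — looks roughly like this in miniature. This is an illustrative sketch, not the tc code; `act_get`/`act_put` stand in for tcf_idr_check/tcf_idr_release.

```c
#include <assert.h>

struct act { int refcnt; };

/* Lookup-by-index takes a reference, as tcf_idr_check() does. */
static void act_get(struct act *a) { a->refcnt++; }
static void act_put(struct act *a) { a->refcnt--; }

/* Replace an existing action; flags == 0 models the invalid input from
 * the buggy application.  The fix: release the reference taken by the
 * lookup on *every* error path, not just after success. */
static int act_replace(struct act *a, int flags)
{
    act_get(a);
    if (!flags) {
        act_put(a);   /* the release the patch adds; omitting it leaks */
        return -1;
    }
    /* ... apply the new parameters ... */
    act_put(a);
    return 0;
}
```

Without the release on the error path, each failed replace leaves the refcount one higher, which is exactly the `ref 1` → `ref 2` progression shown in the tc output above.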


Re: [PATCH v3] net: phy: DP83TC811: Introduce support for the DP83TC811 phy

2018-05-11 Thread David Miller
From: Dan Murphy 
Date: Fri, 11 May 2018 13:08:19 -0500

> Add support for the DP83811 phy.
> 
> The DP83811 supports both rgmii and sgmii interfaces.
> There are 2 part numbers for this: the DP83TC811R does not
> reliably support the SGMII interface, but the DP83TC811S does.
> 
> There is no way to differentiate these parts from the
> hardware or register set.  So this is controlled via the DT
> to indicate which phy mode is required.  Or the part can be
> strapped to a certain interface.
> 
> Data sheet can be found here:
> http://www.ti.com/product/DP83TC811S-Q1/description
> http://www.ti.com/product/DP83TC811R-Q1/description
> 
> Signed-off-by: Dan Murphy 

Applied to net-next, thank you.


Re: [patch net] net: sched: fix error path in tcf_proto_create() when modules are not configured

2018-05-11 Thread David Miller
From: Jiri Pirko 
Date: Fri, 11 May 2018 17:45:32 +0200

> From: Jiri Pirko 
> 
> In case modules are not configured, error out when tp->ops is null
> and prevent later null pointer dereference.
> 
> Fixes: 33a48927c193 ("sched: push TC filter protocol creation into a separate 
> function")
> Signed-off-by: Jiri Pirko 

Applied and queued up for -stable.


Re: [PATCH net-next 1/3] cxgb4: Fix {vxlan/geneve}_port initialization

2018-05-11 Thread David Miller
From: Ganesh Goudar 
Date: Fri, 11 May 2018 18:34:43 +0530

> From: Arjun Vynipadath 
> 
> adapter->rawf_cnt was not initialized, so
> ndo_udp_tunnel_{add/del} was returning immediately
> without initializing {vxlan/geneve}_port.
> Also initialize the mps_encap_entry refcnt.
> 
> Fixes: 846eac3fccec ("cxgb4: implement udp tunnel callbacks")
> Signed-off-by: Arjun Vynipadath 
> Signed-off-by: Ganesh Goudar 

Applied.


Re: [PATCH net-next 2/3] cxgb4: enable inner header checksum calculation

2018-05-11 Thread David Miller
From: Ganesh Goudar 
Date: Fri, 11 May 2018 18:35:33 +0530

> Set cntrl bits to indicate whether the inner header checksum
> needs to be calculated whenever the packet is an encapsulated
> packet, and enable the supported encap features.
> 
> Fixes: d0a1299c6bf7 ("cxgb4: add support for vxlan segmentation offload")
> Signed-off-by: Ganesh Goudar 

Applied.


Re: [PATCH net-next 3/3] cxgb4: avoid schedule while atomic

2018-05-11 Thread David Miller
From: Ganesh Goudar 
Date: Fri, 11 May 2018 18:36:16 +0530

> Do not sleep while adding or deleting a udp tunnel.
> 
> Fixes: 846eac3fccec ("cxgb4: implement udp tunnel callbacks")
> Signed-off-by: Ganesh Goudar 

Applied.


Re: [PATCH net-next] cxgb4: Add new T5 device id

2018-05-11 Thread David Miller
From: Ganesh Goudar 
Date: Fri, 11 May 2018 18:37:34 +0530

> Add 0x50ad device id for new T5 card.
> 
> Signed-off-by: Ganesh Goudar 

Applied.


Re: [PATCH 1/3] bonding: replace the return value type

2018-05-11 Thread David Miller
From: Tonghao Zhang 
Date: Fri, 11 May 2018 02:52:32 -0700

> The method ndo_start_xmit is defined as returning a
> netdev_tx_t, which is a typedef for an enum type,
> but the implementation in this driver returns an int.
> 
> Signed-off-by: Tonghao Zhang 

Applied to net-next


Re: [PATCH 2/3] bonding: use the skb_get/set_queue_mapping

2018-05-11 Thread David Miller
From: Tonghao Zhang 
Date: Fri, 11 May 2018 02:53:11 -0700

> Use the skb_get_queue_mapping, skb_set_queue_mapping
> and skb_rx_queue_recorded helpers for the skb queue_mapping in the
> bonding driver, instead of accessing the field directly.
> 
> Signed-off-by: Tonghao Zhang 

Applied to net-next


Re: [PATCH 3/3] net: doc: fix spelling mistake: "modrobe.d" -> "modprobe.d"

2018-05-11 Thread David Miller
From: Tonghao Zhang 
Date: Fri, 11 May 2018 02:53:12 -0700

> Signed-off-by: Tonghao Zhang 

Applied to net-next.


Re: [PATCH net-next] erspan: auto detect truncated ipv6 packets.

2018-05-11 Thread David Miller
From: William Tu 
Date: Fri, 11 May 2018 05:49:47 -0700

> Currently the truncated bit is set only when 1) the mirrored packet
> is larger than mtu and 2) the ipv4 packet tot_len is larger than
> the actual skb->len.  This patch adds another case for detecting
> whether an ipv6 packet is truncated, by checking the ipv6 header
> payload_len against the skb->len.
> 
> Reported-by: Xiaoyan Jin 
> Signed-off-by: William Tu 

Applied, thanks William.
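
The IPv6 case added by the patch boils down to comparing the packet length implied by the fixed header against the bytes actually present. A userspace sketch of that comparison (plain integers stand in for the skb and ipv6 header fields; this is not the driver code itself):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define IPV6_HDR_LEN 40	/* fixed IPv6 header size in bytes */

/* True when the on-wire length implied by payload_len exceeds the
 * bytes actually captured, i.e. the mirrored packet was truncated. */
static bool ipv6_pkt_truncated(uint16_t payload_len, size_t captured_len)
{
	return (size_t)IPV6_HDR_LEN + payload_len > captured_len;
}
```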


Re: [PATCH v2 1/3] selinux: add AF_UNSPEC and INADDR_ANY checks to selinux_socket_bind()

2018-05-11 Thread Richard Haines
On Fri, 2018-05-11 at 20:15 +0300, Alexey Kodanev wrote:
> Commit d452930fd3b9 ("selinux: Add SCTP support") breaks
> compatibility
> with the old programs that can pass sockaddr_in structure with
> AF_UNSPEC
> and INADDR_ANY to bind(). As a result, bind() returns EAFNOSUPPORT
> error.
> This was found with LTP/asapi_01 test.
> 
> Similar to commit 29c486df6a20 ("net: ipv4: relax AF_INET check in
> bind()"), which relaxed AF_INET check for compatibility, add
> AF_UNSPEC
> case to AF_INET and make sure that the address is INADDR_ANY.
> 
> Fixes: d452930fd3b9 ("selinux: Add SCTP support")
> Signed-off-by: Alexey Kodanev 
> ---
> 
> v2: As suggested by Paul:
> * return EINVAL for SCTP socket if sa_family is AF_UNSPEC and
>   address is not INADDR_ANY
> * add new 'sa_family' variable so that it equals either AF_INET
>   or AF_INET6. Besides, it will be used in the next patch that
>   fixes audit record.
> 
>  security/selinux/hooks.c | 29 +++--
>  1 file changed, 19 insertions(+), 10 deletions(-)
> 
> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> index 4cafe6a..1ed7004 100644
> --- a/security/selinux/hooks.c
> +++ b/security/selinux/hooks.c
> @@ -4576,6 +4576,7 @@ static int selinux_socket_post_create(struct
> socket *sock, int family,
>  static int selinux_socket_bind(struct socket *sock, struct sockaddr
> *address, int addrlen)
>  {
>   struct sock *sk = sock->sk;
> + struct sk_security_struct *sksec = sk->sk_security;
>   u16 family;
>   int err;
>  
> @@ -4587,11 +4588,11 @@ static int selinux_socket_bind(struct socket
> *sock, struct sockaddr *address, in
>   family = sk->sk_family;
>   if (family == PF_INET || family == PF_INET6) {
>   char *addrp;
> - struct sk_security_struct *sksec = sk->sk_security;
>   struct common_audit_data ad;
>   struct lsm_network_audit net = {0,};
>   struct sockaddr_in *addr4 = NULL;
>   struct sockaddr_in6 *addr6 = NULL;
> + u16 family_sa = address->sa_family;
>   unsigned short snum;
>   u32 sid, node_perm;
>  
> @@ -4601,11 +4602,20 @@ static int selinux_socket_bind(struct socket
> *sock, struct sockaddr *address, in
>* need to check address->sa_family as it is
> possible to have
>* sk->sk_family = PF_INET6 with addr->sa_family =
> AF_INET.
>*/
> - switch (address->sa_family) {
> + switch (family_sa) {
> + case AF_UNSPEC:
>   case AF_INET:
>   if (addrlen < sizeof(struct sockaddr_in))
>   return -EINVAL;
>   addr4 = (struct sockaddr_in *)address;
> + if (family_sa == AF_UNSPEC) {
> + /* see __inet_bind(), we only want
> to allow
> +  * AF_UNSPEC if the address is
> INADDR_ANY
> +  */
> + if (addr4->sin_addr.s_addr !=
> htonl(INADDR_ANY))
> + goto err_af;
> + family_sa = AF_INET;
> + }
>   snum = ntohs(addr4->sin_port);
>   addrp = (char *)>sin_addr.s_addr;
>   break;
> @@ -4617,13 +4627,7 @@ static int selinux_socket_bind(struct socket
> *sock, struct sockaddr *address, in
>   addrp = (char *)>sin6_addr.s6_addr;
>   break;
>   default:
> - /* Note that SCTP services expect -EINVAL,
> whereas
> -  * others expect -EAFNOSUPPORT.
> -  */
> - if (sksec->sclass == SECCLASS_SCTP_SOCKET)
> - return -EINVAL;
> - else
> - return -EAFNOSUPPORT;
> + goto err_af;
>   }
>  
>   if (snum) {
> @@ -4681,7 +4685,7 @@ static int selinux_socket_bind(struct socket
> *sock, struct sockaddr *address, in
>   ad.u.net->sport = htons(snum);
>   ad.u.net->family = family;
>  
> - if (address->sa_family == AF_INET)
> + if (family_sa == AF_INET)
>   ad.u.net->v4info.saddr = addr4-
> >sin_addr.s_addr;
>   else
>   ad.u.net->v6info.saddr = addr6->sin6_addr;
> @@ -4694,6 +4698,11 @@ static int selinux_socket_bind(struct socket
> *sock, struct sockaddr *address, in
>   }
>  out:
>   return err;
> +err_af:
> + /* Note that SCTP services expect -EINVAL, others
> -EAFNOSUPPORT. */
> + if (sksec->sclass == SECCLASS_SCTP_SOCKET)
> + return -EINVAL;
> + return -EAFNOSUPPORT;
>  }
>  
>  /* This supports connect(2) and SCTP connect services such as
> sctp_connectx(3)

Tested all 

Re: [PATCH net-next 0/2] mlxsw: spectrum_span: Two minor adjustments

2018-05-11 Thread David Miller
From: Ido Schimmel 
Date: Fri, 11 May 2018 11:57:29 +0300

> Petr says:
> 
> This patch set fixes a couple of nits in mlxsw's SPAN implementation:
> two counts of inaccurate variable name and one count of unsuitable error
> code, fixed, respectively, in patches #1 and #2.

Series applied, thanks.


Re: [PATCH] dt-bindings: net: ravb: Add support for r8a77990 SoC

2018-05-11 Thread David Miller
From: Yoshihiro Shimoda 
Date: Fri, 11 May 2018 12:18:56 +0900

> Add documentation for r8a77990 compatible string to renesas ravb device
> tree bindings documentation.
> 
> Signed-off-by: Yoshihiro Shimoda 

I'm assuming this isn't targeted at one of my trees.  Just FYI.


Re: [net 0/4][pull request] Intel Wired LAN Driver Updates 2018-05-11

2018-05-11 Thread David Miller
From: Jeff Kirsher 
Date: Fri, 11 May 2018 12:47:18 -0700

> This series contains fixes to the ice, ixgbe and ixgbevf drivers.
 ...
> The following are changes since commit 
> 5ae4bbf76928b401fe467e837073d939300adbf0:
>   Merge tag 'mlx5-fixes-2018-05-10' of 
> git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
> and are available in the git repository at:
>   git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue 10GbE

Pulled, thanks Jeff.


Re: [PATCH net 0/5] rxrpc: Fixes

2018-05-11 Thread David Miller
From: David Howells 
Date: Thu, 10 May 2018 23:45:17 +0100

> Here are three fixes for AF_RXRPC and two tracepoints that were useful for
> finding them:
 ...
> The patches are tagged here:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
>   rxrpc-fixes-20180510

Pulled, thanks David.


Re: [PATCH v2 net 1/1] net sched actions: fix invalid pointer dereferencing if skbedit flags missing

2018-05-11 Thread David Miller
From: Roman Mashak 
Date: Fri, 11 May 2018 10:55:09 -0400

> When an application fails to pass flags in the netlink TLV for a new skbedit
> action, the kernel crashes with the following oops:
 ...
> The caller calls the action's ->init() and passes a pointer to "struct tc_action *a",
> which later may be initialized to point at the existing action, otherwise
> "struct tc_action *a" is still invalid, and therefore dereferencing it is an
> error as happens in tcf_idr_release, where refcnt is decremented.
> 
> So in case of missing flags tcf_idr_release must be called only for
> existing actions.
> 
> v2:
> - prepare patch for net tree
> 
> Signed-off-by: Roman Mashak 

Applied and queued up for -stable.


[PATCH net-next 0/4] bonding: performance and reliability

2018-05-11 Thread Debabrata Banerjee
Series of fixes to how rlb updates are handled, code cleanup, allowing
higher performance tx hashing in balance-alb mode, and reliability of
link up/down monitoring.

Debabrata Banerjee (4):
  bonding: don't queue up extraneous rlb updates
  bonding: use common mac addr checks
  bonding: allow use of tx hashing in balance-alb
  bonding: allow carrier and link status to determine link state

 Documentation/networking/bonding.txt |  4 +--
 drivers/net/bonding/bond_alb.c   | 50 +---
 drivers/net/bonding/bond_main.c  | 37 
 drivers/net/bonding/bond_options.c   |  9 ++---
 include/net/bonding.h| 10 +-
 5 files changed, 70 insertions(+), 40 deletions(-)

-- 
2.17.0



Re: [PATCH] isdn: eicon: fix a missing-check bug

2018-05-11 Thread David Miller
From: Wenwen Wang 
Date: Sat,  5 May 2018 14:32:46 -0500

> To avoid such issues, this patch adds a check after the second copy in the
> function diva_xdi_write(). If the adapter number is not equal to the one
> obtained in the first copy, (-4) will be returned to divas_write(), which
> will then return an error code -EINVAL.

Better fix is to copy the msg header once into an on-stack buffer supplied
by diva_write() to diva_xdi_open_adapter(), which is then passed on to
diva_xdi_write() with an adjusted src pointer and length.
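
The bug class here is a double fetch: the driver reads the message header from user memory more than once, so the user can change the adapter number between the validation read and the use read. The shape David suggests — copy the header once into a trusted buffer and make every decision from that copy — can be sketched in userspace C (the header layout is hypothetical, and memcpy() stands in for copy_from_user()):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical header layout standing in for the diva XDI message
 * header; the real driver copies it from user space. */
struct xdi_msg_hdr {
	uint32_t adapter_nr;
	uint32_t data_len;
};

/* Copy the header exactly once, then validate and use only the local
 * copy, so a concurrent writer to user_buf cannot change the adapter
 * number after validation (no double fetch). */
static int diva_write_once(const void *user_buf, size_t len,
			   uint32_t expected_adapter)
{
	struct xdi_msg_hdr hdr;

	if (len < sizeof(hdr))
		return -1;
	memcpy(&hdr, user_buf, sizeof(hdr));	/* the single fetch */
	if (hdr.adapter_nr != expected_adapter)
		return -1;
	/* ... pass &hdr (the trusted copy) down; never re-read user_buf ... */
	return 0;
}
```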


[PATCH net-next 4/4] bonding: allow carrier and link status to determine link state

2018-05-11 Thread Debabrata Banerjee
In a mixed environment it may be difficult to tell whether your hardware
supports carrier detection; if it does not, it may always report true. With
the new use_carrier option value of 2, we can check both carrier and link
status sequentially, instead of one or the other.

Signed-off-by: Debabrata Banerjee 
---
 Documentation/networking/bonding.txt |  4 ++--
 drivers/net/bonding/bond_main.c  | 12 
 drivers/net/bonding/bond_options.c   |  7 ---
 3 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/Documentation/networking/bonding.txt 
b/Documentation/networking/bonding.txt
index 9ba04c0bab8d..f063730e7e73 100644
--- a/Documentation/networking/bonding.txt
+++ b/Documentation/networking/bonding.txt
@@ -828,8 +828,8 @@ use_carrier
MII / ETHTOOL ioctl method to determine the link state.
 
A value of 1 enables the use of netif_carrier_ok(), a value of
-   0 will use the deprecated MII / ETHTOOL ioctls.  The default
-   value is 1.
+   0 will use the deprecated MII / ETHTOOL ioctls. A value of 2
+   will check both.  The default value is 1.
 
 xmit_hash_policy
 
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index f7f8a49cb32b..7e9652c4b35c 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -132,7 +132,7 @@ MODULE_PARM_DESC(downdelay, "Delay before considering link 
down, "
"in milliseconds");
 module_param(use_carrier, int, 0);
 MODULE_PARM_DESC(use_carrier, "Use netif_carrier_ok (vs MII ioctls) in miimon; 
"
- "0 for off, 1 for on (default)");
+ "0 for off, 1 for on (default), 2 for carrier 
then legacy checks");
 module_param(mode, charp, 0);
 MODULE_PARM_DESC(mode, "Mode of operation; 0 for balance-rr, "
   "1 for active-backup, 2 for balance-xor, "
@@ -434,12 +434,16 @@ static int bond_check_dev_link(struct bonding *bond,
int (*ioctl)(struct net_device *, struct ifreq *, int);
struct ifreq ifr;
struct mii_ioctl_data *mii;
+   bool carrier = true;
 
if (!reporting && !netif_running(slave_dev))
return 0;
 
if (bond->params.use_carrier)
-   return netif_carrier_ok(slave_dev) ? BMSR_LSTATUS : 0;
+   carrier = netif_carrier_ok(slave_dev) ? BMSR_LSTATUS : 0;
+
+   if (!carrier)
+   return carrier;
 
/* Try to get link status using Ethtool first. */
if (slave_dev->ethtool_ops->get_link)
@@ -4399,8 +4403,8 @@ static int bond_check_params(struct bond_params *params)
downdelay = 0;
}
 
-   if ((use_carrier != 0) && (use_carrier != 1)) {
-   pr_warn("Warning: use_carrier module parameter (%d), not of 
valid value (0/1), so it was set to 1\n",
+   if (use_carrier < 0 || use_carrier > 2) {
+   pr_warn("Warning: use_carrier module parameter (%d), not of 
valid value (0-2), so it was set to 1\n",
use_carrier);
use_carrier = 1;
}
diff --git a/drivers/net/bonding/bond_options.c 
b/drivers/net/bonding/bond_options.c
index 8a945c9341d6..dba6cef05134 100644
--- a/drivers/net/bonding/bond_options.c
+++ b/drivers/net/bonding/bond_options.c
@@ -164,9 +164,10 @@ static const struct bond_opt_value 
bond_primary_reselect_tbl[] = {
 };
 
 static const struct bond_opt_value bond_use_carrier_tbl[] = {
-   { "off", 0,  0},
-   { "on",  1,  BOND_VALFLAG_DEFAULT},
-   { NULL,  -1, 0}
+   { "off",  0,  0},
+   { "on",   1,  BOND_VALFLAG_DEFAULT},
+   { "both", 2,  0},
+   { NULL,  -1,  0}
 };
 
 static const struct bond_opt_value bond_all_slaves_active_tbl[] = {
-- 
2.17.0



[net 2/4] ixgbe: return error on unsupported SFP module when resetting

2018-05-11 Thread Jeff Kirsher
From: Emil Tantilov 

Add check for unsupported module and return the error code.
This fixes a Coverity hit due to unused return status from setup_sfp.

Signed-off-by: Emil Tantilov 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c
index 3123267dfba9..9592f3e3e42e 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c
@@ -3427,6 +3427,9 @@ static s32 ixgbe_reset_hw_X550em(struct ixgbe_hw *hw)
hw->phy.sfp_setup_needed = false;
}
 
+   if (status == IXGBE_ERR_SFP_NOT_SUPPORTED)
+   return status;
+
/* Reset PHY */
if (!hw->phy.reset_disable && hw->phy.ops.reset)
hw->phy.ops.reset(hw);
-- 
2.17.0



[net 1/4] ice: Set rq_last_status when cleaning rq

2018-05-11 Thread Jeff Kirsher
From: Jeff Shaw 

Prior to this commit, the rq_last_status was only set when hardware
responded with an error. This leads to rq_last_status being invalid
in the future when hardware eventually responds without error. This
commit resolves the issue by unconditionally setting rq_last_status
with the value returned in the descriptor.

Fixes: 940b61af02f4 ("ice: Initialize PF and setup miscellaneous
interrupt")

Signed-off-by: Jeff Shaw 
Signed-off-by: Anirudh Venkataramanan 
Tested-by: Tony Brelinski 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ice/ice_controlq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_controlq.c 
b/drivers/net/ethernet/intel/ice/ice_controlq.c
index 5909a4407e38..7c511f144ed6 100644
--- a/drivers/net/ethernet/intel/ice/ice_controlq.c
+++ b/drivers/net/ethernet/intel/ice/ice_controlq.c
@@ -1014,10 +1014,10 @@ ice_clean_rq_elem(struct ice_hw *hw, struct 
ice_ctl_q_info *cq,
desc = ICE_CTL_Q_DESC(cq->rq, ntc);
desc_idx = ntc;
 
+   cq->rq_last_status = (enum ice_aq_err)le16_to_cpu(desc->retval);
flags = le16_to_cpu(desc->flags);
if (flags & ICE_AQ_FLAG_ERR) {
ret_code = ICE_ERR_AQ_ERROR;
-   cq->rq_last_status = (enum ice_aq_err)le16_to_cpu(desc->retval);
ice_debug(hw, ICE_DBG_AQ_MSG,
  "Control Receive Queue Event received with error 
0x%x\n",
  cq->rq_last_status);
-- 
2.17.0



[net 3/4] ixgbevf: fix ixgbevf_xmit_frame()'s return type

2018-05-11 Thread Jeff Kirsher
From: Luc Van Oostenryck 

The method ndo_start_xmit() is defined as returning an 'netdev_tx_t',
which is a typedef for an enum type, but the implementation in this
driver returns an 'int'.

Fix this by returning 'netdev_tx_t' in this driver too.

Signed-off-by: Luc Van Oostenryck 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index e3d04f226d57..850f8af95e49 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -4137,7 +4137,7 @@ static int ixgbevf_xmit_frame_ring(struct sk_buff *skb,
return NETDEV_TX_OK;
 }
 
-static int ixgbevf_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
+static netdev_tx_t ixgbevf_xmit_frame(struct sk_buff *skb, struct net_device 
*netdev)
 {
struct ixgbevf_adapter *adapter = netdev_priv(netdev);
struct ixgbevf_ring *tx_ring;
-- 
2.17.0



[net 4/4] ixgbe: fix memory leak on ipsec allocation

2018-05-11 Thread Jeff Kirsher
From: Colin Ian King 

The error cleanup path kfree's adapter->ipsec when it should
instead kfree ipsec. Fix this.  Also, the err1 error exit path
does not need to kfree ipsec because this failure path was for
the failed allocation of ipsec.

Detected by CoverityScan, CID#146424 ("Resource Leak")

Fixes: 63a67fe229ea ("ixgbe: add ipsec offload add and remove SA")
Signed-off-by: Colin Ian King 
Acked-by: Shannon Nelson 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
index 68af127987bc..cead23e3db0c 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
@@ -943,8 +943,8 @@ void ixgbe_init_ipsec_offload(struct ixgbe_adapter *adapter)
kfree(ipsec->ip_tbl);
kfree(ipsec->rx_tbl);
kfree(ipsec->tx_tbl);
+   kfree(ipsec);
 err1:
-   kfree(adapter->ipsec);
netdev_err(adapter->netdev, "Unable to allocate memory for SA tables");
 }
 
-- 
2.17.0



[PATCH net-next 1/4] bonding: don't queue up extraneous rlb updates

2018-05-11 Thread Debabrata Banerjee
ARPs for incomplete entries can't be sent anyway.

Signed-off-by: Debabrata Banerjee 
---
 drivers/net/bonding/bond_alb.c | 18 --
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
index 5eb0df2e5464..c2f6c58e4e6a 100644
--- a/drivers/net/bonding/bond_alb.c
+++ b/drivers/net/bonding/bond_alb.c
@@ -421,7 +421,8 @@ static void rlb_clear_slave(struct bonding *bond, struct 
slave *slave)
if (assigned_slave) {
rx_hash_table[index].slave = assigned_slave;
if 
(!ether_addr_equal_64bits(rx_hash_table[index].mac_dst,
-mac_bcast)) {
+mac_bcast) &&
+   
!is_zero_ether_addr(rx_hash_table[index].mac_dst)) {
bond_info->rx_hashtbl[index].ntt = 1;
bond_info->rx_ntt = 1;
/* A slave has been removed from the
@@ -524,7 +525,8 @@ static void rlb_req_update_slave_clients(struct bonding 
*bond, struct slave *sla
client_info = &(bond_info->rx_hashtbl[hash_index]);
 
if ((client_info->slave == slave) &&
-   !ether_addr_equal_64bits(client_info->mac_dst, mac_bcast)) {
+   !ether_addr_equal_64bits(client_info->mac_dst, mac_bcast) &&
+   !is_zero_ether_addr(client_info->mac_dst)) {
client_info->ntt = 1;
ntt = 1;
}
@@ -565,7 +567,8 @@ static void rlb_req_update_subnet_clients(struct bonding 
*bond, __be32 src_ip)
if ((client_info->ip_src == src_ip) &&
!ether_addr_equal_64bits(client_info->slave->dev->dev_addr,
 bond->dev->dev_addr) &&
-   !ether_addr_equal_64bits(client_info->mac_dst, mac_bcast)) {
+   !ether_addr_equal_64bits(client_info->mac_dst, mac_bcast) &&
+   !is_zero_ether_addr(client_info->mac_dst)) {
client_info->ntt = 1;
bond_info->rx_ntt = 1;
}
@@ -641,7 +644,8 @@ static struct slave *rlb_choose_channel(struct sk_buff 
*skb, struct bonding *bon
ether_addr_copy(client_info->mac_src, arp->mac_src);
client_info->slave = assigned_slave;
 
-   if (!ether_addr_equal_64bits(client_info->mac_dst, mac_bcast)) {
+   if (!ether_addr_equal_64bits(client_info->mac_dst, mac_bcast) &&
+   !is_zero_ether_addr(client_info->mac_dst)) {
client_info->ntt = 1;
bond->alb_info.rx_ntt = 1;
} else {
@@ -733,8 +737,10 @@ static void rlb_rebalance(struct bonding *bond)
assigned_slave = __rlb_next_rx_slave(bond);
if (assigned_slave && (client_info->slave != assigned_slave)) {
client_info->slave = assigned_slave;
-   client_info->ntt = 1;
-   ntt = 1;
+   if (!is_zero_ether_addr(client_info->mac_dst)) {
+   client_info->ntt = 1;
+   ntt = 1;
+   }
}
}
 
-- 
2.17.0



[net 0/4][pull request] Intel Wired LAN Driver Updates 2018-05-11

2018-05-11 Thread Jeff Kirsher
This series contains fixes to the ice, ixgbe and ixgbevf drivers.

Jeff Shaw provides a fix to ensure rq_last_status gets set, whether or
not the hardware responds with an error in the ice driver.

Emil adds a check for unsupported module during the reset routine for
ixgbe.

Luc Van Oostenryck fixes ixgbevf_xmit_frame(), which was returning an
int instead of the correct netdev_tx_t type.

Colin Ian King fixes a potential resource leak in ixgbe, where we were
not freeing ipsec in our cleanup path.

The following are changes since commit 5ae4bbf76928b401fe467e837073d939300adbf0:
  Merge tag 'mlx5-fixes-2018-05-10' of 
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue 10GbE

Colin Ian King (1):
  ixgbe: fix memory leak on ipsec allocation

Emil Tantilov (1):
  ixgbe: return error on unsupported SFP module when resetting

Jeff Shaw (1):
  ice: Set rq_last_status when cleaning rq

Luc Van Oostenryck (1):
  ixgbevf: fix ixgbevf_xmit_frame()'s return type

 drivers/net/ethernet/intel/ice/ice_controlq.c | 2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c| 2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c | 3 +++
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 2 +-
 4 files changed, 6 insertions(+), 3 deletions(-)

-- 
2.17.0



[PATCH net-next 3/4] bonding: allow use of tx hashing in balance-alb

2018-05-11 Thread Debabrata Banerjee
The rx load balancing provided by balance-alb is not mutually
exclusive with using hashing for tx selection, and should provide a decent
speed increase because this eliminates spinlocks and cache contention.

Signed-off-by: Debabrata Banerjee 
---
 drivers/net/bonding/bond_alb.c | 20 ++--
 drivers/net/bonding/bond_main.c| 25 +++--
 drivers/net/bonding/bond_options.c |  2 +-
 include/net/bonding.h  | 10 +-
 4 files changed, 43 insertions(+), 14 deletions(-)

diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
index 180e50f7806f..6228635880d5 100644
--- a/drivers/net/bonding/bond_alb.c
+++ b/drivers/net/bonding/bond_alb.c
@@ -1478,8 +1478,24 @@ int bond_alb_xmit(struct sk_buff *skb, struct net_device 
*bond_dev)
}
 
if (do_tx_balance) {
-   hash_index = _simple_hash(hash_start, hash_size);
-   tx_slave = tlb_choose_channel(bond, hash_index, skb->len);
+   if (bond->params.tlb_dynamic_lb) {
+   hash_index = _simple_hash(hash_start, hash_size);
+   tx_slave = tlb_choose_channel(bond, hash_index, 
skb->len);
+   } else {
+   /*
+* do_tx_balance means we are free to select the 
tx_slave
+* So we do exactly what tlb would do for hash selection
+*/
+
+   struct bond_up_slave *slaves;
+   unsigned int count;
+
+   slaves = rcu_dereference(bond->slave_arr);
+   count = slaves ? READ_ONCE(slaves->count) : 0;
+   if (likely(count))
+   tx_slave = slaves->arr[bond_xmit_hash(bond, 
skb) %
+  count];
+   }
}
 
return bond_do_alb_xmit(skb, bond, tx_slave);
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 1f1e97b26f95..f7f8a49cb32b 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -159,7 +159,7 @@ module_param(min_links, int, 0);
 MODULE_PARM_DESC(min_links, "Minimum number of available links before turning 
on carrier");
 
 module_param(xmit_hash_policy, charp, 0);
-MODULE_PARM_DESC(xmit_hash_policy, "balance-xor and 802.3ad hashing method; "
+MODULE_PARM_DESC(xmit_hash_policy, "balance-alb, balance-tlb, balance-xor, 
802.3ad hashing method; "
   "0 for layer 2 (default), 1 for layer 3+4, "
   "2 for layer 2+3, 3 for encap layer 2+3, "
   "4 for encap layer 3+4");
@@ -1735,7 +1735,7 @@ int bond_enslave(struct net_device *bond_dev, struct 
net_device *slave_dev,
unblock_netpoll_tx();
}
 
-   if (bond_mode_uses_xmit_hash(bond))
+   if (bond_mode_can_use_xmit_hash(bond))
bond_update_slave_arr(bond, NULL);
 
bond->nest_level = dev_get_nest_level(bond_dev);
@@ -1870,7 +1870,7 @@ static int __bond_release_one(struct net_device *bond_dev,
if (BOND_MODE(bond) == BOND_MODE_8023AD)
bond_3ad_unbind_slave(slave);
 
-   if (bond_mode_uses_xmit_hash(bond))
+   if (bond_mode_can_use_xmit_hash(bond))
bond_update_slave_arr(bond, slave);
 
netdev_info(bond_dev, "Releasing %s interface %s\n",
@@ -3102,7 +3102,7 @@ static int bond_slave_netdev_event(unsigned long event,
 * events. If these (miimon/arpmon) parameters are configured
 * then array gets refreshed twice and that should be fine!
 */
-   if (bond_mode_uses_xmit_hash(bond))
+   if (bond_mode_can_use_xmit_hash(bond))
bond_update_slave_arr(bond, NULL);
break;
case NETDEV_CHANGEMTU:
@@ -3322,7 +3322,7 @@ static int bond_open(struct net_device *bond_dev)
 */
if (bond_alb_initialize(bond, (BOND_MODE(bond) == 
BOND_MODE_ALB)))
return -ENOMEM;
-   if (bond->params.tlb_dynamic_lb)
+   if (bond->params.tlb_dynamic_lb || BOND_MODE(bond) == 
BOND_MODE_ALB)
queue_delayed_work(bond->wq, >alb_work, 0);
}
 
@@ -3341,7 +3341,7 @@ static int bond_open(struct net_device *bond_dev)
bond_3ad_initiate_agg_selection(bond, 1);
}
 
-   if (bond_mode_uses_xmit_hash(bond))
+   if (bond_mode_can_use_xmit_hash(bond))
bond_update_slave_arr(bond, NULL);
 
return 0;
@@ -3892,7 +3892,7 @@ static void bond_slave_arr_handler(struct work_struct 
*work)
  * to determine the slave interface -
  * (a) BOND_MODE_8023AD
  * (b) BOND_MODE_XOR
- * (c) BOND_MODE_TLB && tlb_dynamic_lb == 0
+ * (c) (BOND_MODE_TLB || BOND_MODE_ALB) && tlb_dynamic_lb == 0
  *
  * The 

[PATCH net-next 2/4] bonding: use common mac addr checks

2018-05-11 Thread Debabrata Banerjee
Replace homegrown mac addr checks with faster defs from etherdevice.h

Signed-off-by: Debabrata Banerjee 
---
 drivers/net/bonding/bond_alb.c | 28 +---
 1 file changed, 9 insertions(+), 19 deletions(-)

diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
index c2f6c58e4e6a..180e50f7806f 100644
--- a/drivers/net/bonding/bond_alb.c
+++ b/drivers/net/bonding/bond_alb.c
@@ -40,11 +40,6 @@
 #include 
 #include 
 
-
-
-static const u8 mac_bcast[ETH_ALEN + 2] __long_aligned = {
-   0xff, 0xff, 0xff, 0xff, 0xff, 0xff
-};
 static const u8 mac_v6_allmcast[ETH_ALEN + 2] __long_aligned = {
0x33, 0x33, 0x00, 0x00, 0x00, 0x01
 };
@@ -420,9 +415,7 @@ static void rlb_clear_slave(struct bonding *bond, struct 
slave *slave)
 
if (assigned_slave) {
rx_hash_table[index].slave = assigned_slave;
-   if 
(!ether_addr_equal_64bits(rx_hash_table[index].mac_dst,
-mac_bcast) &&
-   
!is_zero_ether_addr(rx_hash_table[index].mac_dst)) {
+   if 
(is_valid_ether_addr(rx_hash_table[index].mac_dst)) {
bond_info->rx_hashtbl[index].ntt = 1;
bond_info->rx_ntt = 1;
/* A slave has been removed from the
@@ -525,8 +518,7 @@ static void rlb_req_update_slave_clients(struct bonding 
*bond, struct slave *sla
client_info = &(bond_info->rx_hashtbl[hash_index]);
 
if ((client_info->slave == slave) &&
-   !ether_addr_equal_64bits(client_info->mac_dst, mac_bcast) &&
-   !is_zero_ether_addr(client_info->mac_dst)) {
+   is_valid_ether_addr(client_info->mac_dst)) {
client_info->ntt = 1;
ntt = 1;
}
@@ -567,8 +559,7 @@ static void rlb_req_update_subnet_clients(struct bonding 
*bond, __be32 src_ip)
if ((client_info->ip_src == src_ip) &&
!ether_addr_equal_64bits(client_info->slave->dev->dev_addr,
 bond->dev->dev_addr) &&
-   !ether_addr_equal_64bits(client_info->mac_dst, mac_bcast) &&
-   !is_zero_ether_addr(client_info->mac_dst)) {
+   is_valid_ether_addr(client_info->mac_dst)) {
client_info->ntt = 1;
bond_info->rx_ntt = 1;
}
@@ -596,7 +587,7 @@ static struct slave *rlb_choose_channel(struct sk_buff 
*skb, struct bonding *bon
if ((client_info->ip_src == arp->ip_src) &&
(client_info->ip_dst == arp->ip_dst)) {
/* the entry is already assigned to this client */
-   if (!ether_addr_equal_64bits(arp->mac_dst, mac_bcast)) {
+   if (!is_broadcast_ether_addr(arp->mac_dst)) {
/* update mac address from arp */
ether_addr_copy(client_info->mac_dst, 
arp->mac_dst);
}
@@ -644,8 +635,7 @@ static struct slave *rlb_choose_channel(struct sk_buff 
*skb, struct bonding *bon
ether_addr_copy(client_info->mac_src, arp->mac_src);
client_info->slave = assigned_slave;
 
-   if (!ether_addr_equal_64bits(client_info->mac_dst, mac_bcast) &&
-   !is_zero_ether_addr(client_info->mac_dst)) {
+   if (is_valid_ether_addr(client_info->mac_dst)) {
client_info->ntt = 1;
bond->alb_info.rx_ntt = 1;
} else {
@@ -1418,9 +1408,9 @@ int bond_alb_xmit(struct sk_buff *skb, struct net_device 
*bond_dev)
case ETH_P_IP: {
const struct iphdr *iph = ip_hdr(skb);
 
-   if (ether_addr_equal_64bits(eth_data->h_dest, mac_bcast) ||
-   (iph->daddr == ip_bcast) ||
-   (iph->protocol == IPPROTO_IGMP)) {
+   if (is_broadcast_ether_addr(eth_data->h_dest) ||
+   iph->daddr == ip_bcast ||
+   iph->protocol == IPPROTO_IGMP) {
do_tx_balance = false;
break;
}
@@ -1432,7 +1422,7 @@ int bond_alb_xmit(struct sk_buff *skb, struct net_device 
*bond_dev)
/* IPv6 doesn't really use broadcast mac address, but leave
 * that here just in case.
 */
-   if (ether_addr_equal_64bits(eth_data->h_dest, mac_bcast)) {
+   if (is_broadcast_ether_addr(eth_data->h_dest)) {
do_tx_balance = false;
break;
}
-- 
2.17.0



Re: [PATCH v6 1/6] net: phy: at803x: Export at803x_debug_reg_mask()

2018-05-11 Thread Andrew Lunn
> I could reorder the probe function a little to initialize the PHY before
> performing the MAC reset, drop this patch and the AR803X hibernation
> stuff from patch 2 if you like. But again, I can't actually test the
> result on the affected hardware.

Hi Paul

I don't like a MAC driver poking around in PHY registers.

So if you can rearrange the code, that would be great.

   Thanks
Andrew


[PATCH V2] mlx4_core: allocate ICM memory in page size chunks

2018-05-11 Thread Qing Huang
When a system is under memory pressure (high usage with fragmentation),
the original 256KB ICM chunk allocations will likely trigger kernel
memory management to enter slow path doing memory compact/migration
ops in order to complete high order memory allocations.

When that happens, user processes calling uverb APIs may get stuck
for more than 120s easily even though there are a lot of free pages
in smaller chunks available in the system.

Syslog:
...
Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
oracle_205573_e:205573 blocked for more than 120 seconds.
...

With 4KB ICM chunk size on x86_64 arch, the above issue is fixed.

However in order to support smaller ICM chunk size, we need to fix
another issue in large size kcalloc allocations.

E.g.
Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt
entry). So we need a 16MB allocation for a table->icm pointer array to
hold 2M pointers which can easily cause kcalloc to fail.

The solution is to use vzalloc to replace kcalloc. There is no need
for contiguous memory pages for a driver metadata structure (no need
for DMA ops).

Signed-off-by: Qing Huang 
Acked-by: Daniel Jurgens 
Reviewed-by: Zhu Yanjun 
---
v1 -> v2: adjusted chunk size to reflect different architectures.

 drivers/net/ethernet/mellanox/mlx4/icm.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c 
b/drivers/net/ethernet/mellanox/mlx4/icm.c
index a822f7a..ccb62b8 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -43,12 +43,12 @@
 #include "fw.h"
 
 /*
- * We allocate in as big chunks as we can, up to a maximum of 256 KB
- * per chunk.
+ * We allocate in page size (default 4KB on many archs) chunks to avoid high
+ * order memory allocations in fragmented/high usage memory situation.
  */
 enum {
-   MLX4_ICM_ALLOC_SIZE = 1 << 18,
-   MLX4_TABLE_CHUNK_SIZE   = 1 << 18
+   MLX4_ICM_ALLOC_SIZE = 1 << PAGE_SHIFT,
+   MLX4_TABLE_CHUNK_SIZE   = 1 << PAGE_SHIFT
 };
 
 static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk 
*chunk)
@@ -400,7 +400,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct 
mlx4_icm_table *table,
obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;
 
-   table->icm  = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL);
+   table->icm  = vzalloc(num_icm * sizeof(*table->icm));
if (!table->icm)
return -ENOMEM;
table->virt = virt;
@@ -446,7 +446,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct 
mlx4_icm_table *table,
mlx4_free_icm(dev, table->icm[i], use_coherent);
}
 
-   kfree(table->icm);
+   vfree(table->icm);
 
return -ENOMEM;
 }
@@ -462,5 +462,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct 
mlx4_icm_table *table)
mlx4_free_icm(dev, table->icm[i], table->coherent);
}
 
-   kfree(table->icm);
+   vfree(table->icm);
 }
-- 
2.9.3



Re: [PATCH] mlx4_core: allocate 4KB ICM chunks

2018-05-11 Thread Qing Huang


On 5/11/2018 3:27 AM, Håkon Bugge wrote:

On 11 May 2018, at 01:31, Qing Huang  wrote:

When a system is under memory pressure (high usage with fragmentation),
the original 256KB ICM chunk allocations will likely trigger kernel
memory management to enter slow path doing memory compact/migration
ops in order to complete high order memory allocations.

When that happens, user processes calling uverb APIs may get stuck
for more than 120s easily even though there are a lot of free pages
in smaller chunks available in the system.

Syslog:
...
Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
oracle_205573_e:205573 blocked for more than 120 seconds.
...

With 4KB ICM chunk size, the above issue is fixed.

However in order to support 4KB ICM chunk size, we need to fix another
issue in large size kcalloc allocations.

E.g.
Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt
entry). So we need a 16MB allocation for a table->icm pointer array to
hold 2M pointers which can easily cause kcalloc to fail.

The solution is to use vzalloc to replace kcalloc. There is no need
for contiguous memory pages for a driver metadata structure (no need
for DMA ops).

Signed-off-by: Qing Huang
Acked-by: Daniel Jurgens
---
drivers/net/ethernet/mellanox/mlx4/icm.c | 14 +++---
1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c 
b/drivers/net/ethernet/mellanox/mlx4/icm.c
index a822f7a..2b17a4b 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -43,12 +43,12 @@
#include "fw.h"

/*
- * We allocate in as big chunks as we can, up to a maximum of 256 KB
- * per chunk.
+ * We allocate in 4KB page size chunks to avoid high order memory
+ * allocations in fragmented/high usage memory situation.
  */
enum {
-   MLX4_ICM_ALLOC_SIZE = 1 << 18,
-   MLX4_TABLE_CHUNK_SIZE   = 1 << 18
+   MLX4_ICM_ALLOC_SIZE = 1 << 12,
+   MLX4_TABLE_CHUNK_SIZE   = 1 << 12

Shouldn’t these be the arch’s page size order? E.g., if running on SPARC, the 
hw page size is 8KiB.


Good point on supporting a wider range of architectures. I got tunnel 
vision when fixing this on our x64 lab machines.

Will send a v2 patch.

Thanks,
Qing


Thxs, Håkon


};

static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk 
*chunk)
@@ -400,7 +400,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct 
mlx4_icm_table *table,
obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;

-   table->icm  = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL);
+   table->icm  = vzalloc(num_icm * sizeof(*table->icm));
if (!table->icm)
return -ENOMEM;
table->virt = virt;
@@ -446,7 +446,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct 
mlx4_icm_table *table,
mlx4_free_icm(dev, table->icm[i], use_coherent);
}

-   kfree(table->icm);
+   vfree(table->icm);

return -ENOMEM;
}
@@ -462,5 +462,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct 
mlx4_icm_table *table)
mlx4_free_icm(dev, table->icm[i], table->coherent);
}

-   kfree(table->icm);
+   vfree(table->icm);
}
--
2.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html





Re: [PATCH v3] net: phy: DP83TC811: Introduce support for the DP83TC811 phy

2018-05-11 Thread David Miller
From: Andrew Lunn 
Date: Fri, 11 May 2018 21:10:11 +0200

> Humm, i thought i had given one. But i cannot find it in the mail
> archive. Going senile :-(

You aren't going senile, there is just a huge number of patches
being submitted since net-next opened up.


Re: [PATCH v3] net: phy: DP83TC811: Introduce support for the DP83TC811 phy

2018-05-11 Thread Florian Fainelli
On 05/11/2018 11:08 AM, Dan Murphy wrote:
> Add support for the DP83811 phy.
> 
> The DP83811 supports both rgmii and sgmii interfaces.
> There are 2 part numbers for this: the DP83TC811R does not
> reliably support the SGMII interface, but the DP83TC811S will.
> 
> There is not a way to differentiate these parts from the
> hardware or register set.  So this is controlled via the DT
> to indicate which phy mode is required.  Or the part can be
> strapped to a certain interface.
> 
> Data sheet can be found here:
> http://www.ti.com/product/DP83TC811S-Q1/description
> http://www.ti.com/product/DP83TC811R-Q1/description
> 
> Signed-off-by: Dan Murphy 

Reviewed-by: Florian Fainelli 
-- 
Florian

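The DT control described in the commit message normally comes from the standard phy-mode property on the MAC node. A hypothetical board fragment (node labels and values are placeholders, shown only to illustrate the selection):

```
/* Hypothetical board snippet: the MAC node selects the PHY interface */
&gmac0 {
	phy-mode = "rgmii";		/* use "sgmii" only on a DP83TC811S */
	phy-handle = <&dp83tc811>;
};
```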

Re: [PATCH v3] net: phy: DP83TC811: Introduce support for the DP83TC811 phy

2018-05-11 Thread Andrew Lunn
On Fri, May 11, 2018 at 01:51:28PM -0500, Dan Murphy wrote:
> Andrew
> 
> On 05/11/2018 01:30 PM, Andrew Lunn wrote:
> > On Fri, May 11, 2018 at 01:08:19PM -0500, Dan Murphy wrote:
> >> Add support for the DP83811 phy.
> >>
> >> The DP83811 supports both rgmii and sgmii interfaces.
> >> There are 2 part numbers for this: the DP83TC811R does not
> >> reliably support the SGMII interface, but the DP83TC811S will.
> >>
> >> There is not a way to differentiate these parts from the
> >> hardware or register set.  So this is controlled via the DT
> >> to indicate which phy mode is required.  Or the part can be
> >> strapped to a certain interface.
> >>
> >> Data sheet can be found here:
> >> http://www.ti.com/product/DP83TC811S-Q1/description
> >> http://www.ti.com/product/DP83TC811R-Q1/description
> >>
> >> Signed-off-by: Dan Murphy 
> > 
> > Hi Dan
> > 
> > It is normal to add any Reviewed-by, or Tested-by: tags you received,
> > so long as you don't make major changes.
> > 
> 
> Thanks for the reminder.
> 
> I usually add them if I get them explicitly stated in the review.

Humm, i thought i had given one. But i cannot find it in the mail
archive. Going senile :-(

Reviewed-by: Andrew Lunn 

Andrew


Re: [PATCH net 1/1] net sched actions: fix refcnt leak in skbmod

2018-05-11 Thread Cong Wang
On Fri, May 11, 2018 at 11:35 AM, Roman Mashak  wrote:
> When application fails to pass flags in netlink TLV when replacing
> existing skbmod action, the kernel will leak refcnt:
>
> $ tc actions get action skbmod index 1
> total acts 0
>
> action order 0: skbmod pipe set smac 00:11:22:33:44:55
>  index 1 ref 1 bind 0
>
> For example, at this point a buggy application replaces the action with
> index 1 with new smac 00:aa:22:33:44:55, it fails because of zero flags,
> however refcnt gets bumped:
>
> $ tc actions get actions skbmod index 1
> total acts 0
>
> action order 0: skbmod pipe set smac 00:11:22:33:44:55
>  index 1 ref 2 bind 0
> $
>
> The patch fixes this by calling tcf_idr_release() on existing actions.
>
> Fixes: 86da71b57383d ("net_sched: Introduce skbmod action")
> Signed-off-by: Roman Mashak 

Acked-by: Cong Wang 


Re: INFO: rcu detected stall in kfree_skbmem

2018-05-11 Thread Eric Dumazet


On 05/11/2018 11:41 AM, Marcelo Ricardo Leitner wrote:

> But calling ip6_xmit with rcu_read_lock is expected. tcp stack also
> does it.
> Thus I think this is more of an issue with IPv6 stack. If a host has
> an extensive ip6tables ruleset, it probably generates this more
> easily.
> 
>>>  sctp_v6_xmit+0x4a5/0x6b0 net/sctp/ipv6.c:225
>>>  sctp_packet_transmit+0x26f6/0x3ba0 net/sctp/output.c:650
>>>  sctp_outq_flush+0x1373/0x4370 net/sctp/outqueue.c:1197
>>>  sctp_outq_uncork+0x6a/0x80 net/sctp/outqueue.c:776
>>>  sctp_cmd_interpreter net/sctp/sm_sideeffect.c:1820 [inline]
>>>  sctp_side_effects net/sctp/sm_sideeffect.c:1220 [inline]
>>>  sctp_do_sm+0x596/0x7160 net/sctp/sm_sideeffect.c:1191
>>>  sctp_generate_heartbeat_event+0x218/0x450 net/sctp/sm_sideeffect.c:406
>>>  call_timer_fn+0x230/0x940 kernel/time/timer.c:1326
>>>  expire_timers kernel/time/timer.c:1363 [inline]
> 
> Having this call from a timer means it wasn't processing sctp stack
> for too long.
>

I feel the problem is that this part is looping, in some infinite loop.

I have seen these stack traces in other reports.

Maybe some kind of list corruption.



Re: [PATCH v3] net: phy: DP83TC811: Introduce support for the DP83TC811 phy

2018-05-11 Thread Dan Murphy
Andrew

On 05/11/2018 01:30 PM, Andrew Lunn wrote:
> On Fri, May 11, 2018 at 01:08:19PM -0500, Dan Murphy wrote:
>> Add support for the DP83811 phy.
>>
>> The DP83811 supports both rgmii and sgmii interfaces.
>> There are 2 part numbers for this: the DP83TC811R does not
>> reliably support the SGMII interface, but the DP83TC811S will.
>>
>> There is not a way to differentiate these parts from the
>> hardware or register set.  So this is controlled via the DT
>> to indicate which phy mode is required.  Or the part can be
>> strapped to a certain interface.
>>
>> Data sheet can be found here:
>> http://www.ti.com/product/DP83TC811S-Q1/description
>> http://www.ti.com/product/DP83TC811R-Q1/description
>>
>> Signed-off-by: Dan Murphy 
> 
> Hi Dan
> 
> It is normal to add any Reviewed-by, or Tested-by: tags you received,
> so long as you don't make major changes.
> 

Thanks for the reminder.

I usually add them if I get them explicitly stated in the review.

I have not seen any Reviewed-by or Tested-by tags in any of the replies for the
patch.  But I may have missed it.

Dan

> Andrew
> 


-- 
--
Dan Murphy


Re: INFO: rcu detected stall in kfree_skbmem

2018-05-11 Thread Marcelo Ricardo Leitner
On Fri, May 11, 2018 at 12:00:38PM +0200, Dmitry Vyukov wrote:
> On Mon, Apr 30, 2018 at 8:09 PM, syzbot
>  wrote:
> > Hello,
> >
> > syzbot found the following crash on:
> >
> > HEAD commit:5d1365940a68 Merge
> > git://git.kernel.org/pub/scm/linux/kerne...
> > git tree:   net-next
> > console output: https://syzkaller.appspot.com/x/log.txt?id=5667997129637888
> > kernel config:
> > https://syzkaller.appspot.com/x/.config?id=-5947642240294114534
> > dashboard link: https://syzkaller.appspot.com/bug?extid=fc78715ba3b3257caf6a
> > compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
> >
> > Unfortunately, I don't have any reproducer for this crash yet.
>
> This looks sctp-related, +sctp maintainers.
>
> > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > Reported-by: syzbot+fc78715ba3b3257ca...@syzkaller.appspotmail.com
> >
> > INFO: rcu_sched self-detected stall on CPU
> > 1-...!: (1 GPs behind) idle=a3e/1/4611686018427387908
> > softirq=71980/71983 fqs=33
> >  (t=125000 jiffies g=39438 c=39437 q=958)
> > rcu_sched kthread starved for 124829 jiffies! g39438 c39437 f0x0
> > RCU_GP_WAIT_FQS(3) ->state=0x0 ->cpu=0
> > RCU grace-period kthread stack dump:
> > rcu_sched   R  running task23768 9  2 0x8000
> > Call Trace:
> >  context_switch kernel/sched/core.c:2848 [inline]
> >  __schedule+0x801/0x1e30 kernel/sched/core.c:3490
> >  schedule+0xef/0x430 kernel/sched/core.c:3549
> >  schedule_timeout+0x138/0x240 kernel/time/timer.c:1801
> >  rcu_gp_kthread+0x6b5/0x1940 kernel/rcu/tree.c:2231
> >  kthread+0x345/0x410 kernel/kthread.c:238
> >  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:411
> > NMI backtrace for cpu 1
> > CPU: 1 PID: 20560 Comm: syz-executor4 Not tainted 4.16.0+ #1
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> > Google 01/01/2011
> > Call Trace:
> >  
> >  __dump_stack lib/dump_stack.c:77 [inline]
> >  dump_stack+0x1b9/0x294 lib/dump_stack.c:113
> >  nmi_cpu_backtrace.cold.4+0x19/0xce lib/nmi_backtrace.c:103
> >  nmi_trigger_cpumask_backtrace+0x151/0x192 lib/nmi_backtrace.c:62
> >  arch_trigger_cpumask_backtrace+0x14/0x20 arch/x86/kernel/apic/hw_nmi.c:38
> >  trigger_single_cpu_backtrace include/linux/nmi.h:156 [inline]
> >  rcu_dump_cpu_stacks+0x175/0x1c2 kernel/rcu/tree.c:1376
> >  print_cpu_stall kernel/rcu/tree.c:1525 [inline]
> >  check_cpu_stall.isra.61.cold.80+0x36c/0x59a kernel/rcu/tree.c:1593
> >  __rcu_pending kernel/rcu/tree.c:3356 [inline]
> >  rcu_pending kernel/rcu/tree.c:3401 [inline]
> >  rcu_check_callbacks+0x21b/0xad0 kernel/rcu/tree.c:2763
> >  update_process_times+0x2d/0x70 kernel/time/timer.c:1636
> >  tick_sched_handle+0x9f/0x180 kernel/time/tick-sched.c:173
> >  tick_sched_timer+0x45/0x130 kernel/time/tick-sched.c:1283
> >  __run_hrtimer kernel/time/hrtimer.c:1386 [inline]
> >  __hrtimer_run_queues+0x3e3/0x10a0 kernel/time/hrtimer.c:1448
> >  hrtimer_interrupt+0x286/0x650 kernel/time/hrtimer.c:1506
> >  local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1025 [inline]
> >  smp_apic_timer_interrupt+0x15d/0x710 arch/x86/kernel/apic/apic.c:1050
> >  apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:862
> > RIP: 0010:arch_local_irq_restore arch/x86/include/asm/paravirt.h:783
> > [inline]
> > RIP: 0010:kmem_cache_free+0xb3/0x2d0 mm/slab.c:3757
> > RSP: 0018:8801db105228 EFLAGS: 0282 ORIG_RAX: ff13
> > RAX: 0007 RBX: 8800b055c940 RCX: 11003b2345a5
> > RDX:  RSI: 8801d91a2d80 RDI: 0282
> > RBP: 8801db105248 R08: 8801d91a2cb8 R09: 0002
> > R10: 8801d91a2480 R11:  R12: 8801d9848e40
> > R13: 0282 R14: 85b7f27c R15: 
> >  kfree_skbmem+0x13c/0x210 net/core/skbuff.c:582
> >  __kfree_skb net/core/skbuff.c:642 [inline]
> >  kfree_skb+0x19d/0x560 net/core/skbuff.c:659
> >  enqueue_to_backlog+0x2fc/0xc90 net/core/dev.c:3968
> >  netif_rx_internal+0x14d/0xae0 net/core/dev.c:4181
> >  netif_rx+0xba/0x400 net/core/dev.c:4206
> >  loopback_xmit+0x283/0x741 drivers/net/loopback.c:91
> >  __netdev_start_xmit include/linux/netdevice.h:4087 [inline]
> >  netdev_start_xmit include/linux/netdevice.h:4096 [inline]
> >  xmit_one net/core/dev.c:3053 [inline]
> >  dev_hard_start_xmit+0x264/0xc10 net/core/dev.c:3069
> >  __dev_queue_xmit+0x2724/0x34c0 net/core/dev.c:3584
> >  dev_queue_xmit+0x17/0x20 net/core/dev.c:3617
> >  neigh_hh_output include/net/neighbour.h:472 [inline]
> >  neigh_output include/net/neighbour.h:480 [inline]
> >  ip6_finish_output2+0x134e/0x2810 net/ipv6/ip6_output.c:120
> >  ip6_finish_output+0x5fe/0xbc0 net/ipv6/ip6_output.c:154
> >  NF_HOOK_COND include/linux/netfilter.h:277 [inline]
> >  ip6_output+0x227/0x9b0 net/ipv6/ip6_output.c:171
> >  dst_output include/net/dst.h:444 [inline]
> >  NF_HOOK include/linux/netfilter.h:288 [inline]
> >  ip6_xmit+0xf51/0x23f0 

Re: [PATCH v6 1/6] net: phy: at803x: Export at803x_debug_reg_mask()

2018-05-11 Thread Paul Burton
On Fri, May 11, 2018 at 11:25:02AM -0700, Paul Burton wrote:
> Hi Andrew,
> 
> On Fri, May 11, 2018 at 02:26:19AM +0200, Andrew Lunn wrote:
> > On Thu, May 10, 2018 at 04:16:52PM -0700, Paul Burton wrote:
> > > From: Andrew Lunn 
> > > 
> > > On some boards, this PHY has a problem when it hibernates. Export this
> > > function so a board can register a PHY fixup to disable hibernation.
> > 
> > What do you know about the problem?
> > 
> > https://patchwork.ozlabs.org/patch/686371/
> > 
> > I don't remember how it was solved, but you should probably do the
> > same.
> > 
> > Andrew
> 
> I'm afraid I don't know much about the problem - this one is your patch
> entirely unchanged, and I don't have access to the hardware in question
> (my board uses a Realtek RTL8211E PHY).
> 
> I presume you did this because the pch_gbe driver as-is in mainline
> disables hibernation for the AR803X PHY found on the MinnowBoard, so
> this would be preserving the existing behaviour of the driver?
> 
> That behaviour was introduced by commit f1a26fdf5944f ("pch_gbe: Add
> MinnowBoard support"), so perhaps Darren as its author might know more?
> 
> My presumption would be that this is done to ensure that the PHY is
> always providing the RX clock, which the EG20T manual says is required
> for the MAC reset register RX_RST & ALL_RST bits to clear. We wait for
> those using the call to pch_gbe_wait_clr_bit() in
> pch_gbe_mac_reset_hw(), which happens before we initialize the PHY.
> 
> I could reorder the probe function a little to initialize the PHY before
> performing the MAC reset, drop this patch and the AR803X hibernation
> stuff from patch 2 if you like. But again, I can't actually test the
> result on the affected hardware.
> 
> Thanks,
> Paul

I got an undeliverable response using Darren's email address from the
commit referenced above, so updating to the latest address I see for him
in git history.

Thanks,
Paul


Re: [PATCH v6 1/6] net: phy: at803x: Export at803x_debug_reg_mask()

2018-05-11 Thread Paul Burton
Hi Andrew,

On Fri, May 11, 2018 at 02:26:19AM +0200, Andrew Lunn wrote:
> On Thu, May 10, 2018 at 04:16:52PM -0700, Paul Burton wrote:
> > From: Andrew Lunn 
> > 
> > On some boards, this PHY has a problem when it hibernates. Export this
> > function so a board can register a PHY fixup to disable hibernation.
> 
> What do you know about the problem?
> 
> https://patchwork.ozlabs.org/patch/686371/
> 
> I don't remember how it was solved, but you should probably do the
> same.
> 
>   Andrew

I'm afraid I don't know much about the problem - this one is your patch
entirely unchanged, and I don't have access to the hardware in question
(my board uses a Realtek RTL8211E PHY).

I presume you did this because the pch_gbe driver as-is in mainline
disables hibernation for the AR803X PHY found on the MinnowBoard, so
this would be preserving the existing behaviour of the driver?

That behaviour was introduced by commit f1a26fdf5944f ("pch_gbe: Add
MinnowBoard support"), so perhaps Darren as its author might know more?

My presumption would be that this is done to ensure that the PHY is
always providing the RX clock, which the EG20T manual says is required
for the MAC reset register RX_RST & ALL_RST bits to clear. We wait for
those using the call to pch_gbe_wait_clr_bit() in
pch_gbe_mac_reset_hw(), which happens before we initialize the PHY.

I could reorder the probe function a little to initialize the PHY before
performing the MAC reset, drop this patch and the AR803X hibernation
stuff from patch 2 if you like. But again, I can't actually test the
result on the affected hardware.

Thanks,
Paul


[PATCH net 1/1] net sched actions: fix refcnt leak in skbmod

2018-05-11 Thread Roman Mashak
When application fails to pass flags in netlink TLV when replacing
existing skbmod action, the kernel will leak refcnt:

$ tc actions get action skbmod index 1
total acts 0

action order 0: skbmod pipe set smac 00:11:22:33:44:55
 index 1 ref 1 bind 0

For example, at this point a buggy application replaces the action with
index 1 with new smac 00:aa:22:33:44:55, it fails because of zero flags,
however refcnt gets bumped:

$ tc actions get actions skbmod index 1
total acts 0

action order 0: skbmod pipe set smac 00:11:22:33:44:55
 index 1 ref 2 bind 0
$

The patch fixes this by calling tcf_idr_release() on existing actions.

Fixes: 86da71b57383d ("net_sched: Introduce skbmod action")
Signed-off-by: Roman Mashak 
---
 net/sched/act_skbmod.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/sched/act_skbmod.c b/net/sched/act_skbmod.c
index bbcbdce732cc..ad050d7d4b46 100644
--- a/net/sched/act_skbmod.c
+++ b/net/sched/act_skbmod.c
@@ -131,8 +131,11 @@ static int tcf_skbmod_init(struct net *net, struct nlattr 
*nla,
if (exists && bind)
return 0;
 
-   if (!lflags)
+   if (!lflags) {
+   if (exists)
+   tcf_idr_release(*a, bind);
return -EINVAL;
+   }
 
if (!exists) {
ret = tcf_idr_create(tn, parm->index, est, a,
-- 
2.7.4



Re: possible deadlock in sk_diag_fill

2018-05-11 Thread Andrei Vagin
On Sat, May 05, 2018 at 10:59:02AM -0700, syzbot wrote:
> Hello,
> 
> syzbot found the following crash on:
> 
> HEAD commit:c1c07416cdd4 Merge tag 'kbuild-fixes-v4.17' of git://git.k..
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=12164c9780
> kernel config:  https://syzkaller.appspot.com/x/.config?x=5a1dc06635c10d27
> dashboard link: https://syzkaller.appspot.com/bug?extid=c1872be62e587eae9669
> compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
> userspace arch: i386
> 
> Unfortunately, I don't have any reproducer for this crash yet.
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+c1872be62e587eae9...@syzkaller.appspotmail.com
> 
> 
> ==
> WARNING: possible circular locking dependency detected
> 4.17.0-rc3+ #59 Not tainted
> --
> syz-executor1/25282 is trying to acquire lock:
> 4fddf743 (&(>lock)->rlock/1){+.+.}, at: sk_diag_dump_icons
> net/unix/diag.c:82 [inline]
> 4fddf743 (&(>lock)->rlock/1){+.+.}, at:
> sk_diag_fill.isra.5+0xa43/0x10d0 net/unix/diag.c:144
> 
> but task is already holding lock:
> b6895645 (rlock-AF_UNIX){+.+.}, at: spin_lock
> include/linux/spinlock.h:310 [inline]
> b6895645 (rlock-AF_UNIX){+.+.}, at: sk_diag_dump_icons
> net/unix/diag.c:64 [inline]
> b6895645 (rlock-AF_UNIX){+.+.}, at: sk_diag_fill.isra.5+0x94e/0x10d0
> net/unix/diag.c:144
> 
> which lock already depends on the new lock.

In the code, we have a comment which explains why it is safe to take this lock

/*
 * The state lock is outer for the same sk's
 * queue lock. With the other's queue locked it's
 * OK to lock the state.
 */
unix_state_lock_nested(req);

The question is how to explain this to lockdep.

> 
> 
> the existing dependency chain (in reverse order) is:
> 
> -> #1 (rlock-AF_UNIX){+.+.}:
>__raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
>_raw_spin_lock_irqsave+0x96/0xc0 kernel/locking/spinlock.c:152
>skb_queue_tail+0x26/0x150 net/core/skbuff.c:2900
>unix_dgram_sendmsg+0xf77/0x1730 net/unix/af_unix.c:1797
>sock_sendmsg_nosec net/socket.c:629 [inline]
>sock_sendmsg+0xd5/0x120 net/socket.c:639
>___sys_sendmsg+0x525/0x940 net/socket.c:2117
>__sys_sendmmsg+0x3bb/0x6f0 net/socket.c:2205
>__compat_sys_sendmmsg net/compat.c:770 [inline]
>__do_compat_sys_sendmmsg net/compat.c:777 [inline]
>__se_compat_sys_sendmmsg net/compat.c:774 [inline]
>__ia32_compat_sys_sendmmsg+0x9f/0x100 net/compat.c:774
>do_syscall_32_irqs_on arch/x86/entry/common.c:323 [inline]
>do_fast_syscall_32+0x345/0xf9b arch/x86/entry/common.c:394
>entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139
> 
> -> #0 (&(>lock)->rlock/1){+.+.}:
>lock_acquire+0x1dc/0x520 kernel/locking/lockdep.c:3920
>_raw_spin_lock_nested+0x28/0x40 kernel/locking/spinlock.c:354
>sk_diag_dump_icons net/unix/diag.c:82 [inline]
>sk_diag_fill.isra.5+0xa43/0x10d0 net/unix/diag.c:144
>sk_diag_dump net/unix/diag.c:178 [inline]
>unix_diag_dump+0x35f/0x550 net/unix/diag.c:206
>netlink_dump+0x507/0xd20 net/netlink/af_netlink.c:2226
>__netlink_dump_start+0x51a/0x780 net/netlink/af_netlink.c:2323
>netlink_dump_start include/linux/netlink.h:214 [inline]
>unix_diag_handler_dump+0x3f4/0x7b0 net/unix/diag.c:307
>__sock_diag_cmd net/core/sock_diag.c:230 [inline]
>sock_diag_rcv_msg+0x2e0/0x3d0 net/core/sock_diag.c:261
>netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2448
>sock_diag_rcv+0x2a/0x40 net/core/sock_diag.c:272
>netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
>netlink_unicast+0x58b/0x740 net/netlink/af_netlink.c:1336
>netlink_sendmsg+0x9f0/0xfa0 net/netlink/af_netlink.c:1901
>sock_sendmsg_nosec net/socket.c:629 [inline]
>sock_sendmsg+0xd5/0x120 net/socket.c:639
>sock_write_iter+0x35a/0x5a0 net/socket.c:908
>call_write_iter include/linux/fs.h:1784 [inline]
>new_sync_write fs/read_write.c:474 [inline]
>__vfs_write+0x64d/0x960 fs/read_write.c:487
>vfs_write+0x1f8/0x560 fs/read_write.c:549
>ksys_write+0xf9/0x250 fs/read_write.c:598
>__do_sys_write fs/read_write.c:610 [inline]
>__se_sys_write fs/read_write.c:607 [inline]
>__ia32_sys_write+0x71/0xb0 fs/read_write.c:607
>do_syscall_32_irqs_on arch/x86/entry/common.c:323 [inline]
>do_fast_syscall_32+0x345/0xf9b arch/x86/entry/common.c:394
>entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139
> 
> other info that might help us debug this:
> 
>  Possible unsafe locking scenario:
> 
>CPU0CPU1
>
>   

Re: [PATCH v3] net: phy: DP83TC811: Introduce support for the DP83TC811 phy

2018-05-11 Thread Andrew Lunn
On Fri, May 11, 2018 at 01:08:19PM -0500, Dan Murphy wrote:
> Add support for the DP83811 phy.
> 
> The DP83811 supports both rgmii and sgmii interfaces.
> There are 2 part numbers for this the DP83TC811R does not
> reliably support the SGMII interface but the DP83TC811S will.
> 
> There is not a way to differentiate these parts from the
> hardware or register set.  So this is controlled via the DT
> to indicate which phy mode is required.  Or the part can be
> strapped to a certain interface.
> 
> Data sheet can be found here:
> http://www.ti.com/product/DP83TC811S-Q1/description
> http://www.ti.com/product/DP83TC811R-Q1/description
> 
> Signed-off-by: Dan Murphy 

Hi Dan

It is normal to add any Reviewed-by, or Tested-by: tags you received,
so long as you don't make major changes.

Andrew


Re: [PATCH net V2] tun: fix use after free for ptr_ring

2018-05-11 Thread Michael S. Tsirkin
On Fri, May 11, 2018 at 10:49:25AM +0800, Jason Wang wrote:
> We used to initialize the ptr_ring during TUNSETIFF because its
> size depends on the tx_queue_len of the netdevice, and we tried to clean
> it up when the socket was detached from the netdevice. A race was spotted
> where doing the uninit during a read leads to a use after free of the
> pointer ring. Solve this by always initializing a zero-size ptr_ring
> in open() and resizing it during TUNSETIFF; then we can safely do the
> cleanup during close(). With this, there's no need for the workaround
> that was introduced by commit 4df0bfc79904 ("tun: fix a memory leak
> for tfile->tx_array").
> 
> Reported-by: syzbot+e8b902c3c3fadf0a9...@syzkaller.appspotmail.com
> Cc: Eric Dumazet 
> Cc: Cong Wang 
> Cc: Michael S. Tsirkin 
> Fixes: 1576d9860599 ("tun: switch to use skb array for tx")
> Signed-off-by: Jason Wang 

Acked-by: Michael S. Tsirkin 

and will you send the revert pls then?


> ---
> Changes from v1:
> - free ptr_ring during close()
> - use tun_ptr_free() during resize for safety
> ---
>  drivers/net/tun.c | 27 ---
>  1 file changed, 12 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index ef33950..9fbbb32 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -681,15 +681,6 @@ static void tun_queue_purge(struct tun_file *tfile)
>   skb_queue_purge(>sk.sk_error_queue);
>  }
>  
> -static void tun_cleanup_tx_ring(struct tun_file *tfile)
> -{
> - if (tfile->tx_ring.queue) {
> - ptr_ring_cleanup(>tx_ring, tun_ptr_free);
> - xdp_rxq_info_unreg(>xdp_rxq);
> - memset(>tx_ring, 0, sizeof(tfile->tx_ring));
> - }
> -}
> -
>  static void __tun_detach(struct tun_file *tfile, bool clean)
>  {
>   struct tun_file *ntfile;
> @@ -736,7 +727,8 @@ static void __tun_detach(struct tun_file *tfile, bool 
> clean)
>   tun->dev->reg_state == NETREG_REGISTERED)
>   unregister_netdevice(tun->dev);
>   }
> - tun_cleanup_tx_ring(tfile);
> + if (tun)
> + xdp_rxq_info_unreg(>xdp_rxq);
>   sock_put(>sk);
>   }
>  }
> @@ -783,14 +775,14 @@ static void tun_detach_all(struct net_device *dev)
>   tun_napi_del(tun, tfile);
>   /* Drop read queue */
>   tun_queue_purge(tfile);
> + xdp_rxq_info_unreg(>xdp_rxq);
>   sock_put(>sk);
> - tun_cleanup_tx_ring(tfile);
>   }
>   list_for_each_entry_safe(tfile, tmp, >disabled, next) {
>   tun_enable_queue(tfile);
>   tun_queue_purge(tfile);
> + xdp_rxq_info_unreg(>xdp_rxq);
>   sock_put(>sk);
> - tun_cleanup_tx_ring(tfile);
>   }
>   BUG_ON(tun->numdisabled != 0);
>  
> @@ -834,7 +826,8 @@ static int tun_attach(struct tun_struct *tun, struct file 
> *file,
>   }
>  
>   if (!tfile->detached &&
> - ptr_ring_init(>tx_ring, dev->tx_queue_len, GFP_KERNEL)) {
> + ptr_ring_resize(>tx_ring, dev->tx_queue_len,
> + GFP_KERNEL, tun_ptr_free)) {
>   err = -ENOMEM;
>   goto out;
>   }
> @@ -3219,6 +3212,11 @@ static int tun_chr_open(struct inode *inode, struct 
> file * file)
>   _proto, 0);
>   if (!tfile)
>   return -ENOMEM;
> + if (ptr_ring_init(>tx_ring, 0, GFP_KERNEL)) {
> + sk_free(>sk);
> + return -ENOMEM;
> + }
> +
>   RCU_INIT_POINTER(tfile->tun, NULL);
>   tfile->flags = 0;
>   tfile->ifindex = 0;
> @@ -3239,8 +3237,6 @@ static int tun_chr_open(struct inode *inode, struct 
> file * file)
>  
>   sock_set_flag(&tfile->sk, SOCK_ZEROCOPY);
>  
> - memset(&tfile->tx_ring, 0, sizeof(tfile->tx_ring));
> -
>   return 0;
>  }
>  
> @@ -3249,6 +3245,7 @@ static int tun_chr_close(struct inode *inode, struct 
> file *file)
>   struct tun_file *tfile = file->private_data;
>  
>   tun_detach(tfile, true);
> + ptr_ring_cleanup(&tfile->tx_ring, tun_ptr_free);
>  
>   return 0;
>  }
> -- 
> 2.7.4
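The fix above boils down to a lifetime rule: the tx_ring must be created when the file is opened and destroyed only when it is closed, never at detach time, so no code path can touch a freed ring. A minimal user-space sketch of that rule follows; all names here (`demo_file`, `demo_open`, and friends) are invented for illustration and are not the kernel API:

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for tun's ptr_ring embedded in tun_file. */
struct demo_ring { void **q; int size; };

struct demo_file {
	struct demo_ring tx_ring;
	int attached;
};

/* open: allocate the ring once, so it is valid for the file's whole life */
static int demo_open(struct demo_file *f, int size)
{
	f->tx_ring.q = calloc(size ? size : 1, sizeof(void *));
	if (!f->tx_ring.q)
		return -1;
	f->tx_ring.size = size;
	f->attached = 1;
	return 0;
}

/* detach: stop using the ring, but do NOT free it; other paths may
 * still dereference it until the file is finally closed */
static void demo_detach(struct demo_file *f)
{
	f->attached = 0;
}

/* close: the only place the ring is freed */
static void demo_close(struct demo_file *f)
{
	free(f->tx_ring.q);
	f->tx_ring.q = NULL;
}
```

The detach step leaving the ring allocated is exactly the invariant the original code violated: cleaning the ring up in `__tun_detach` left later accesses pointing at freed memory.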


[bpf-next V2 PATCH 1/4] bpf: devmap introduce dev_map_enqueue

2018-05-11 Thread Jesper Dangaard Brouer
Functionality is the same, but the ndo_xdp_xmit call is now
simply invoked from inside the devmap.c code.

V2: Fix compile issue reported by kbuild test robot 

Signed-off-by: Jesper Dangaard Brouer 
---
 include/linux/bpf.h|   14 +++---
 include/trace/events/xdp.h |9 -
 kernel/bpf/devmap.c|   37 +++--
 net/core/filter.c  |   15 ++-
 4 files changed, 52 insertions(+), 23 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index a38e474bf7ee..8527964da402 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -485,14 +485,15 @@ int bpf_check(struct bpf_prog **fp, union bpf_attr *attr);
 void bpf_patch_call_args(struct bpf_insn *insn, u32 stack_depth);
 
 /* Map specifics */
-struct net_device  *__dev_map_lookup_elem(struct bpf_map *map, u32 key);
+struct xdp_buff;
+struct bpf_dtab_netdev *__dev_map_lookup_elem(struct bpf_map *map, u32 key);
 void __dev_map_insert_ctx(struct bpf_map *map, u32 index);
 void __dev_map_flush(struct bpf_map *map);
+int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp);
 
 struct bpf_cpu_map_entry *__cpu_map_lookup_elem(struct bpf_map *map, u32 key);
 void __cpu_map_insert_ctx(struct bpf_map *map, u32 index);
 void __cpu_map_flush(struct bpf_map *map);
-struct xdp_buff;
 int cpu_map_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_buff *xdp,
struct net_device *dev_rx);
 
@@ -571,6 +572,14 @@ static inline void __dev_map_flush(struct bpf_map *map)
 {
 }
 
+struct xdp_buff;
+struct bpf_dtab_netdev;
+static inline
+int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp)
+{
+   return 0;
+}
+
 static inline
 struct bpf_cpu_map_entry *__cpu_map_lookup_elem(struct bpf_map *map, u32 key)
 {
@@ -585,7 +594,6 @@ static inline void __cpu_map_flush(struct bpf_map *map)
 {
 }
 
-struct xdp_buff;
 static inline int cpu_map_enqueue(struct bpf_cpu_map_entry *rcpu,
  struct xdp_buff *xdp,
  struct net_device *dev_rx)
diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h
index 8989a92c571a..96104610d40e 100644
--- a/include/trace/events/xdp.h
+++ b/include/trace/events/xdp.h
@@ -138,11 +138,18 @@ DEFINE_EVENT_PRINT(xdp_redirect_template, 
xdp_redirect_map_err,
  __entry->map_id, __entry->map_index)
 );
 
+#ifndef __DEVMAP_OBJ_TYPE
+#define __DEVMAP_OBJ_TYPE
+struct _bpf_dtab_netdev {
+   struct net_device *dev;
+};
+#endif /* __DEVMAP_OBJ_TYPE */
+
 #define devmap_ifindex(fwd, map)   \
(!fwd ? 0 : \
 (!map ? 0 :\
  ((map->map_type == BPF_MAP_TYPE_DEVMAP) ? \
-  ((struct net_device *)fwd)->ifindex : 0)))
+  ((struct _bpf_dtab_netdev *)fwd)->dev->ifindex : 0)))
 
 #define _trace_xdp_redirect_map(dev, xdp, fwd, map, idx)   \
 trace_xdp_redirect_map(dev, xdp, devmap_ifindex(fwd, map), \
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index 565f9ece9115..808808bf2bf2 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -48,18 +48,21 @@
  * calls will fail at this point.
  */
 #include 
+#include 
 #include 
 
 #define DEV_CREATE_FLAG_MASK \
(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY)
 
+/* objects in the map */
 struct bpf_dtab_netdev {
-   struct net_device *dev;
+   struct net_device *dev; /* must be first member, due to tracepoint */
struct bpf_dtab *dtab;
unsigned int bit;
struct rcu_head rcu;
 };
 
+/* bpf map container */
 struct bpf_dtab {
struct bpf_map map;
struct bpf_dtab_netdev **netdev_map;
@@ -240,21 +243,43 @@ void __dev_map_flush(struct bpf_map *map)
  * update happens in parallel here a dev_put wont happen until after reading 
the
  * ifindex.
  */
-struct net_device  *__dev_map_lookup_elem(struct bpf_map *map, u32 key)
+struct bpf_dtab_netdev *__dev_map_lookup_elem(struct bpf_map *map, u32 key)
 {
struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map);
-   struct bpf_dtab_netdev *dev;
+   struct bpf_dtab_netdev *obj;
 
if (key >= map->max_entries)
return NULL;
 
-   dev = READ_ONCE(dtab->netdev_map[key]);
-   return dev ? dev->dev : NULL;
+   obj = READ_ONCE(dtab->netdev_map[key]);
+   return obj;
+}
+
+int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp)
+{
+   struct net_device *dev = dst->dev;
+   struct xdp_frame *xdpf;
+   int err;
+
+   if (!dev->netdev_ops->ndo_xdp_xmit)
+   return -EOPNOTSUPP;
+
+   xdpf = convert_to_xdp_frame(xdp);
+   if (unlikely(!xdpf))
+   return -EOVERFLOW;
+
+   /* TODO: implement a bulking/enqueue step later */
+   err = 

[bpf-next V2 PATCH 0/4] xdp: introduce bulking for ndo_xdp_xmit API

2018-05-11 Thread Jesper Dangaard Brouer
This patchset changes the ndo_xdp_xmit API to take a bulk of xdp frames.

When the kernel is compiled with CONFIG_RETPOLINE, every indirect function
pointer (branch) call hurts performance. For XDP this has a huge
negative performance impact.

This patchset reduces the needed (indirect) calls to ndo_xdp_xmit, but
also prepares for further optimizations.  The DMA API's use of indirect
function pointer calls is the primary source of the regression.  It is
left for a followup patchset to use bulking calls towards the DMA API
(via the scatter-gather calls).

The other advantage of this API change is that drivers can more easily
amortize the cost of any sync/locking scheme over the bulk of
packets.  The assumption of the current API is that the driver
implementing the NDO will also allocate a dedicated XDP TX queue for
every CPU in the system, which is not always possible or practical to
configure.  E.g. ixgbe cannot load an XDP program on a machine with
more than 96 CPUs, due to limited hardware TX queues; virtio_net
is hard to configure as it requires manually increasing the number of
queues; the tun driver chooses to use a per-XDP-frame producer lock,
taken modulo smp_processor_id over the available queues.
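The locking amortization argued for above can be sketched in plain C: with a bulked ndo_xdp_xmit-style call, a driver that must take a producer lock pays for it once per bulk instead of once per frame. The toy model below counts lock acquisitions; `fake_lock`, `xmit_one`, and `xmit_bulk` are invented names, not the kernel API:

```c
#include <assert.h>

static int lock_acquisitions;

static void fake_lock(void)   { lock_acquisitions++; }
static void fake_unlock(void) { }

/* old style: one call, and one lock round-trip, per frame */
static int xmit_one(void *frame)
{
	(void)frame;
	fake_lock();
	/* ... enqueue the single frame ... */
	fake_unlock();
	return 0;
}

/* new style: one call (and one lock round-trip) for a bulk of n frames */
static int xmit_bulk(int n, void **frames)
{
	(void)frames;
	fake_lock();
	for (int i = 0; i < n; i++)
		; /* ... enqueue frames[i] under the single lock hold ... */
	fake_unlock();
	return n; /* number of frames sent */
}
```

Sending 16 frames through `xmit_one` costs 16 lock acquisitions; through `xmit_bulk` it costs one, which is the whole point of the API change when each acquisition is expensive.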

---

Jesper Dangaard Brouer (4):
  bpf: devmap introduce dev_map_enqueue
  bpf: devmap prepare xdp frames for bulking
  xdp: add tracepoint for devmap like cpumap have
  xdp: change ndo_xdp_xmit API to support bulking


 drivers/net/ethernet/intel/i40e/i40e_txrx.c   |   26 -
 drivers/net/ethernet/intel/i40e/i40e_txrx.h   |2 
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   21 +++-
 drivers/net/tun.c |   37 ---
 drivers/net/virtio_net.c  |   66 +---
 include/linux/bpf.h   |   16 ++-
 include/linux/netdevice.h |   14 ++-
 include/net/page_pool.h   |5 +
 include/net/xdp.h |1 
 include/trace/events/xdp.h|   50 +
 kernel/bpf/devmap.c   |  134 -
 net/core/filter.c |   19 +---
 net/core/xdp.c|   20 +++-
 samples/bpf/xdp_monitor_kern.c|   49 +
 samples/bpf/xdp_monitor_user.c|   69 +
 15 files changed, 446 insertions(+), 83 deletions(-)

--


[bpf-next V2 PATCH 4/4] xdp: change ndo_xdp_xmit API to support bulking

2018-05-11 Thread Jesper Dangaard Brouer
This patch changes the API for ndo_xdp_xmit to support bulking
xdp_frames.

When the kernel is compiled with CONFIG_RETPOLINE, XDP sees a huge slowdown.
Most of the slowdown is caused by the DMA API's indirect function calls,
but also by the net_device->ndo_xdp_xmit() call.

Benchmarking this patch with CONFIG_RETPOLINE, using xdp_redirect_map with
a single flow/core test (CPU E5-1650 v4 @ 3.60GHz), showed the
performance improvement:
 for driver ixgbe: 6,042,682 pps -> 6,853,768 pps = +811,086 pps
 for driver i40e : 6,187,169 pps -> 6,724,519 pps = +537,350 pps

With frames available as a bulk inside the driver's ndo_xdp_xmit call,
further optimizations are possible, like bulk DMA-mapping for TX.

Testing without CONFIG_RETPOLINE shows the same performance for
physical NIC drivers.

The virtual NIC driver tun sees a huge performance boost, as it can
avoid per-frame producer locking and instead amortize the
locking cost over the bulk.

V2: Fix compile errors reported by kbuild test robot 

Signed-off-by: Jesper Dangaard Brouer 
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   |   26 +++---
 drivers/net/ethernet/intel/i40e/i40e_txrx.h   |2 -
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   21 ++--
 drivers/net/tun.c |   37 +-
 drivers/net/virtio_net.c  |   66 +++--
 include/linux/netdevice.h |   14 +++--
 include/net/page_pool.h   |5 +-
 include/net/xdp.h |1 
 include/trace/events/xdp.h|   10 ++--
 kernel/bpf/devmap.c   |   33 -
 net/core/filter.c |4 +-
 net/core/xdp.c|   20 ++--
 samples/bpf/xdp_monitor_kern.c|   10 
 samples/bpf/xdp_monitor_user.c|   35 +++--
 14 files changed, 206 insertions(+), 78 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c 
b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 5efa68de935b..9b698c5acd05 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -3664,14 +3664,19 @@ netdev_tx_t i40e_lan_xmit_frame(struct sk_buff *skb, 
struct net_device *netdev)
  * @dev: netdev
  * @xdp: XDP buffer
  *
- * Returns Zero if sent, else an error code
+ * Returns number of frames successfully sent. Frames that fail are
+ * free'ed via XDP return API.
+ *
+ * For error cases, a negative errno code is returned and no-frames
+ * are transmitted (caller must handle freeing frames).
  **/
-int i40e_xdp_xmit(struct net_device *dev, struct xdp_frame *xdpf)
+int i40e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames)
 {
struct i40e_netdev_priv *np = netdev_priv(dev);
unsigned int queue_index = smp_processor_id();
struct i40e_vsi *vsi = np->vsi;
-   int err;
+   int drops = 0;
+   int i;
 
if (test_bit(__I40E_VSI_DOWN, vsi->state))
return -ENETDOWN;
@@ -3679,11 +3684,18 @@ int i40e_xdp_xmit(struct net_device *dev, struct 
xdp_frame *xdpf)
if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
return -ENXIO;
 
-   err = i40e_xmit_xdp_ring(xdpf, vsi->xdp_rings[queue_index]);
-   if (err != I40E_XDP_TX)
-   return -ENOSPC;
+   for (i = 0; i < n; i++) {
+   struct xdp_frame *xdpf = frames[i];
+   int err;
 
-   return 0;
+   err = i40e_xmit_xdp_ring(xdpf, vsi->xdp_rings[queue_index]);
+   if (err != I40E_XDP_TX) {
+   xdp_return_frame_rx_napi(xdpf);
+   drops++;
+   }
+   }
+
+   return n - drops;
 }
 
 /**
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h 
b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index fdd2c55f03a6..eb8804b3d7b6 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -487,7 +487,7 @@ u32 i40e_get_tx_pending(struct i40e_ring *ring, bool in_sw);
 void i40e_detect_recover_hung(struct i40e_vsi *vsi);
 int __i40e_maybe_stop_tx(struct i40e_ring *tx_ring, int size);
 bool __i40e_chk_linearize(struct sk_buff *skb);
-int i40e_xdp_xmit(struct net_device *dev, struct xdp_frame *xdpf);
+int i40e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames);
 void i40e_xdp_flush(struct net_device *dev);
 
 /**
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 6652b201df5b..9645619f7729 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -10017,11 +10017,13 @@ static int ixgbe_xdp(struct net_device *dev, struct 
netdev_bpf *xdp)
}
 }
 
-static int ixgbe_xdp_xmit(struct net_device *dev, struct xdp_frame *xdpf)
+static int 

[bpf-next V2 PATCH 2/4] bpf: devmap prepare xdp frames for bulking

2018-05-11 Thread Jesper Dangaard Brouer
Like cpumap, create a queue for xdp frames that will be bulked.  For now,
this patch simply invokes ndo_xdp_xmit for each frame.  This happens
either when the map flush operation is invoked, or when the limit
DEV_MAP_BULK_SIZE is reached.

Signed-off-by: Jesper Dangaard Brouer 
---
 kernel/bpf/devmap.c |   77 ---
 1 file changed, 73 insertions(+), 4 deletions(-)

diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index 808808bf2bf2..cab72c100bb5 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -54,11 +54,18 @@
 #define DEV_CREATE_FLAG_MASK \
(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY)
 
+#define DEV_MAP_BULK_SIZE 16
+struct xdp_bulk_queue {
+   struct xdp_frame *q[DEV_MAP_BULK_SIZE];
+   unsigned int count;
+};
+
 /* objects in the map */
 struct bpf_dtab_netdev {
struct net_device *dev; /* must be first member, due to tracepoint */
struct bpf_dtab *dtab;
unsigned int bit;
+   struct xdp_bulk_queue __percpu *bulkq;
struct rcu_head rcu;
 };
 
@@ -209,6 +216,38 @@ void __dev_map_insert_ctx(struct bpf_map *map, u32 bit)
__set_bit(bit, bitmap);
 }
 
+static int bq_xmit_all(struct bpf_dtab_netdev *obj,
+struct xdp_bulk_queue *bq)
+{
+   unsigned int processed = 0, drops = 0;
+   struct net_device *dev = obj->dev;
+   int i;
+
+   if (unlikely(!bq->count))
+   return 0;
+
+   for (i = 0; i < bq->count; i++) {
+   struct xdp_frame *xdpf = bq->q[i];
+
+   prefetch(xdpf);
+   }
+
+   for (i = 0; i < bq->count; i++) {
+   struct xdp_frame *xdpf = bq->q[i];
+   int err;
+
+   err = dev->netdev_ops->ndo_xdp_xmit(dev, xdpf);
+   if (err) {
+   drops++;
+   xdp_return_frame(xdpf);
+   }
+   processed++;
+   }
+   bq->count = 0;
+
+   return 0;
+}
+
 /* __dev_map_flush is called from xdp_do_flush_map() which _must_ be signaled
  * from the driver before returning from its napi->poll() routine. The poll()
  * routine is called either from busy_poll context or net_rx_action signaled
@@ -224,6 +263,7 @@ void __dev_map_flush(struct bpf_map *map)
 
for_each_set_bit(bit, bitmap, map->max_entries) {
struct bpf_dtab_netdev *dev = READ_ONCE(dtab->netdev_map[bit]);
+   struct xdp_bulk_queue *bq;
struct net_device *netdev;
 
/* This is possible if the dev entry is removed by user space
@@ -233,6 +273,9 @@ void __dev_map_flush(struct bpf_map *map)
continue;
 
__clear_bit(bit, bitmap);
+
+   bq = this_cpu_ptr(dev->bulkq);
+   bq_xmit_all(dev, bq);
netdev = dev->dev;
if (likely(netdev->netdev_ops->ndo_xdp_flush))
netdev->netdev_ops->ndo_xdp_flush(netdev);
@@ -255,6 +298,20 @@ struct bpf_dtab_netdev *__dev_map_lookup_elem(struct 
bpf_map *map, u32 key)
return obj;
 }
 
+/* Runs under RCU-read-side, plus in softirq under NAPI protection.
+ * Thus, safe percpu variable access.
+ */
+static int bq_enqueue(struct bpf_dtab_netdev *obj, struct xdp_frame *xdpf)
+{
+   struct xdp_bulk_queue *bq = this_cpu_ptr(obj->bulkq);
+
+   if (unlikely(bq->count == DEV_MAP_BULK_SIZE))
+   bq_xmit_all(obj, bq);
+
+   bq->q[bq->count++] = xdpf;
+   return 0;
+}
+
 int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp)
 {
struct net_device *dev = dst->dev;
@@ -268,8 +325,7 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct 
xdp_buff *xdp)
if (unlikely(!xdpf))
return -EOVERFLOW;
 
-   /* TODO: implement a bulking/enqueue step later */
-   err = dev->netdev_ops->ndo_xdp_xmit(dev, xdpf);
+   err = bq_enqueue(dst, xdpf);
if (err)
return err;
 
@@ -288,13 +344,18 @@ static void dev_map_flush_old(struct bpf_dtab_netdev *dev)
 {
if (dev->dev->netdev_ops->ndo_xdp_flush) {
struct net_device *fl = dev->dev;
+   struct xdp_bulk_queue *bq;
unsigned long *bitmap;
+
int cpu;
 
for_each_online_cpu(cpu) {
bitmap = per_cpu_ptr(dev->dtab->flush_needed, cpu);
__clear_bit(dev->bit, bitmap);
 
+   bq = per_cpu_ptr(dev->bulkq, cpu);
+   bq_xmit_all(dev, bq);
+
fl->netdev_ops->ndo_xdp_flush(dev->dev);
}
}
@@ -306,6 +367,7 @@ static void __dev_map_entry_free(struct rcu_head *rcu)
 
dev = container_of(rcu, struct bpf_dtab_netdev, rcu);
dev_map_flush_old(dev);
+   free_percpu(dev->bulkq);
dev_put(dev->dev);
kfree(dev);
 }
@@ -338,6 +400,7 @@ static int 
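The bulking scheme in this patch can be modelled outside the kernel: frames accumulate in a fixed-size array and are pushed out in one batch when DEV_MAP_BULK_SIZE is hit or when an explicit flush runs. Below is a stand-alone sketch under those assumptions; plain globals stand in for the per-CPU state and for the driver's ndo_xdp_xmit call, and all names are invented:

```c
#include <assert.h>

#define BULK_SIZE 16	/* mirrors DEV_MAP_BULK_SIZE */

struct bulk_queue {
	void *q[BULK_SIZE];
	unsigned int count;
};

static int xmit_batches;	/* times we invoked the "driver" */
static int frames_sent;

/* flush: hand the whole batch to the driver in one call */
static void bq_flush(struct bulk_queue *bq)
{
	if (!bq->count)
		return;
	xmit_batches++;		/* one ndo_xdp_xmit-style call per batch */
	frames_sent += bq->count;
	bq->count = 0;
}

/* enqueue: buffer the frame, flushing first if the queue is full */
static void bq_enqueue(struct bulk_queue *bq, void *frame)
{
	if (bq->count == BULK_SIZE)
		bq_flush(bq);	/* full: push the pending batch out first */
	bq->q[bq->count++] = frame;
}
```

Enqueueing 40 frames and then flushing yields three driver calls (16 + 16 + 8) instead of 40, which is the call-count reduction the patch is preparing for.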

[bpf-next V2 PATCH 3/4] xdp: add tracepoint for devmap like cpumap have

2018-05-11 Thread Jesper Dangaard Brouer
Notice how this allows us to get XDP statistics without affecting XDP
performance, as the tracepoint is no longer activated on a per-packet basis.

The xdp_monitor sample/tool is updated to use this new tracepoint.
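Why the tracepoint stays cheap is visible in a toy model: the event fires once per bulk with aggregated sent/drops counters, rather than once per frame. In the sketch below a counter bump stands in for the real tracepoint, and every name is invented:

```c
#include <assert.h>

static int trace_hits, total_sent, total_drops;

/* stand-in for trace_xdp_devmap_xmit(): fires once per bulk */
static void trace_devmap_xmit(int sent, int drops)
{
	trace_hits++;
	total_sent  += sent;
	total_drops += drops;
}

/* pretend driver: every 5th frame in a bulk fails and is dropped */
static void xmit_bulk(int n)
{
	int drops = 0;

	for (int i = 0; i < n; i++)
		if (i % 5 == 4)
			drops++;
	trace_devmap_xmit(n - drops, drops);
}
```

Two bulks of 16 frames produce exactly two trace events covering all 32 frames, so enabling the tracepoint adds a fixed per-bulk cost instead of a per-packet one.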

Signed-off-by: Jesper Dangaard Brouer 
---
 include/linux/bpf.h|6 -
 include/trace/events/xdp.h |   39 +++
 kernel/bpf/devmap.c|   25 ++-
 net/core/filter.c  |2 +-
 samples/bpf/xdp_monitor_kern.c |   39 +++
 samples/bpf/xdp_monitor_user.c |   44 +++-
 6 files changed, 146 insertions(+), 9 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 8527964da402..3dda20a29cdb 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -489,7 +489,8 @@ struct xdp_buff;
 struct bpf_dtab_netdev *__dev_map_lookup_elem(struct bpf_map *map, u32 key);
 void __dev_map_insert_ctx(struct bpf_map *map, u32 index);
 void __dev_map_flush(struct bpf_map *map);
-int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp);
+int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
+   struct net_device *dev_rx);
 
 struct bpf_cpu_map_entry *__cpu_map_lookup_elem(struct bpf_map *map, u32 key);
 void __cpu_map_insert_ctx(struct bpf_map *map, u32 index);
@@ -575,7 +576,8 @@ static inline void __dev_map_flush(struct bpf_map *map)
 struct xdp_buff;
 struct bpf_dtab_netdev;
 static inline
-int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp)
+int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
+   struct net_device *dev_rx)
 {
return 0;
 }
diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h
index 96104610d40e..2e9ef0650144 100644
--- a/include/trace/events/xdp.h
+++ b/include/trace/events/xdp.h
@@ -229,6 +229,45 @@ TRACE_EVENT(xdp_cpumap_enqueue,
  __entry->to_cpu)
 );
 
+TRACE_EVENT(xdp_devmap_xmit,
+
+   TP_PROTO(const struct bpf_map *map, u32 map_index,
+int sent, int drops,
+const struct net_device *from_dev,
+const struct net_device *to_dev),
+
+   TP_ARGS(map, map_index, sent, drops, from_dev, to_dev),
+
+   TP_STRUCT__entry(
+   __field(int, map_id)
+   __field(u32, act)
+   __field(u32, map_index)
+   __field(int, drops)
+   __field(int, sent)
+   __field(int, from_ifindex)
+   __field(int, to_ifindex)
+   ),
+
+   TP_fast_assign(
+   __entry->map_id = map->id;
+   __entry->act= XDP_REDIRECT;
+   __entry->map_index  = map_index;
+   __entry->drops  = drops;
+   __entry->sent   = sent;
+   __entry->from_ifindex   = from_dev->ifindex;
+   __entry->to_ifindex = to_dev->ifindex;
+   ),
+
+   TP_printk("ndo_xdp_xmit"
+ " map_id=%d map_index=%d action=%s"
+ " sent=%d drops=%d"
+ " from_ifindex=%d to_ifindex=%d",
+ __entry->map_id, __entry->map_index,
+ __print_symbolic(__entry->act, __XDP_ACT_SYM_TAB),
+ __entry->sent, __entry->drops,
+ __entry->from_ifindex, __entry->to_ifindex)
+);
+
 #endif /* _TRACE_XDP_H */
 
 #include 
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index cab72c100bb5..6f84100723b0 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -50,6 +50,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define DEV_CREATE_FLAG_MASK \
(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY)
@@ -57,6 +58,7 @@
 #define DEV_MAP_BULK_SIZE 16
 struct xdp_bulk_queue {
struct xdp_frame *q[DEV_MAP_BULK_SIZE];
+   struct net_device *dev_rx;
unsigned int count;
 };
 
@@ -219,8 +221,8 @@ void __dev_map_insert_ctx(struct bpf_map *map, u32 bit)
 static int bq_xmit_all(struct bpf_dtab_netdev *obj,
 struct xdp_bulk_queue *bq)
 {
-   unsigned int processed = 0, drops = 0;
struct net_device *dev = obj->dev;
+   int sent = 0, drops = 0;
int i;
 
if (unlikely(!bq->count))
@@ -241,10 +243,13 @@ static int bq_xmit_all(struct bpf_dtab_netdev *obj,
drops++;
xdp_return_frame(xdpf);
}
-   processed++;
+   sent++;
}
bq->count = 0;
 
+   trace_xdp_devmap_xmit(&obj->dtab->map, obj->bit,
+ sent, drops, bq->dev_rx, dev);
+   bq->dev_rx = NULL;
return 0;
 }
 
@@ -301,18 +306,28 @@ struct bpf_dtab_netdev *__dev_map_lookup_elem(struct 
bpf_map *map, u32 key)
 /* Runs under RCU-read-side, plus in softirq under NAPI protection.
  * Thus, safe percpu variable access.
  */
-static int 

Re: [PATCH net-next v10 2/4] net: Introduce generic failover module

2018-05-11 Thread Michael S. Tsirkin
On Mon, May 07, 2018 at 03:39:19PM -0700, Randy Dunlap wrote:
> Hi,
> 
> On 05/07/2018 03:10 PM, Sridhar Samudrala wrote:
> > 
> > Signed-off-by: Sridhar Samudrala 
> > ---
> >  MAINTAINERS|7 +
> >  include/linux/netdevice.h  |   16 +
> >  include/net/net_failover.h |   52 +++
> >  net/Kconfig|   10 +
> >  net/core/Makefile  |1 +
> >  net/core/net_failover.c| 1044 
> > 
> >  6 files changed, 1130 insertions(+)
> >  create mode 100644 include/net/net_failover.h
> >  create mode 100644 net/core/net_failover.c
> 
> 
> > diff --git a/net/Kconfig b/net/Kconfig
> > index b62089fb1332..0540856676de 100644
> > --- a/net/Kconfig
> > +++ b/net/Kconfig
> > @@ -429,6 +429,16 @@ config MAY_USE_DEVLINK
> >  config PAGE_POOL
> > bool
> >  
> > +config NET_FAILOVER
> > +   tristate "Failover interface"
> > +   default m
> 
> Need some justification for default m (as opposed to n).

Or one can just leave the default line out.



[PATCH v3] net: phy: DP83TC811: Introduce support for the DP83TC811 phy

2018-05-11 Thread Dan Murphy
Add support for the DP83811 PHY.

The DP83811 supports both RGMII and SGMII interfaces.
There are two part numbers: the DP83TC811R does not
reliably support the SGMII interface, but the DP83TC811S does.

There is no way to differentiate these parts from the
hardware or register set, so the required PHY mode is
controlled via the DT.  Alternatively, the part can be
strapped to a certain interface.
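Since the mode cannot be probed from the registers, the driver has to trust the device tree phy-mode (or the strap). A minimal sketch of that selection logic follows; the enum and helper names here are invented for illustration, while the real driver gets a phy_interface_t from the PHY core:

```c
#include <assert.h>
#include <string.h>

enum demo_phy_mode {
	DEMO_MODE_RGMII,
	DEMO_MODE_SGMII,	/* only reliable on the DP83TC811S part */
	DEMO_MODE_INVALID,
};

/* map a DT "phy-mode"-style string to an interface mode */
static enum demo_phy_mode demo_mode_from_dt(const char *s)
{
	if (!strcmp(s, "rgmii"))
		return DEMO_MODE_RGMII;
	if (!strcmp(s, "sgmii"))
		return DEMO_MODE_SGMII;
	return DEMO_MODE_INVALID;
}
```

The point is simply that the decision is data-driven: nothing in the register set distinguishes the R and S variants, so a wrong DT value (asking an 811R for SGMII) cannot be caught by the driver.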

Data sheet can be found here:
http://www.ti.com/product/DP83TC811S-Q1/description
http://www.ti.com/product/DP83TC811R-Q1/description

Signed-off-by: Dan Murphy 
---

v3 - Variable length alignment - https://patchwork.kernel.org/patch/10389657/

v2 - Remove extra config_init in reset, update config_init call back function
fix a checkpatch alignment issue, add SGMII check in autoneg api - 
https://patchwork.kernel.org/patch/10389323/

 drivers/net/phy/Kconfig |   5 +
 drivers/net/phy/Makefile|   1 +
 drivers/net/phy/dp83tc811.c | 347 
 3 files changed, 353 insertions(+)
 create mode 100644 drivers/net/phy/dp83tc811.c

diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig
index bdfbabb86ee0..810140a9e114 100644
--- a/drivers/net/phy/Kconfig
+++ b/drivers/net/phy/Kconfig
@@ -285,6 +285,11 @@ config DP83822_PHY
---help---
  Supports the DP83822 PHY.
 
+config DP83TC811_PHY
+   tristate "Texas Instruments DP83TC811 PHY"
+   ---help---
+ Supports the DP83TC811 PHY.
+
 config DP83848_PHY
tristate "Texas Instruments DP83848 PHY"
---help---
diff --git a/drivers/net/phy/Makefile b/drivers/net/phy/Makefile
index 01acbcb2c798..00445b61a9a8 100644
--- a/drivers/net/phy/Makefile
+++ b/drivers/net/phy/Makefile
@@ -57,6 +57,7 @@ obj-$(CONFIG_CORTINA_PHY) += cortina.o
 obj-$(CONFIG_DAVICOM_PHY)  += davicom.o
 obj-$(CONFIG_DP83640_PHY)  += dp83640.o
 obj-$(CONFIG_DP83822_PHY)  += dp83822.o
+obj-$(CONFIG_DP83TC811_PHY)+= dp83tc811.o
 obj-$(CONFIG_DP83848_PHY)  += dp83848.o
 obj-$(CONFIG_DP83867_PHY)  += dp83867.o
 obj-$(CONFIG_FIXED_PHY)+= fixed_phy.o
diff --git a/drivers/net/phy/dp83tc811.c b/drivers/net/phy/dp83tc811.c
new file mode 100644
index ..081d99aa3985
--- /dev/null
+++ b/drivers/net/phy/dp83tc811.c
@@ -0,0 +1,347 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Driver for the Texas Instruments DP83TC811 PHY
+ *
+ * Copyright (C) 2018 Texas Instruments Incorporated - http://www.ti.com/
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define DP83TC811_PHY_ID   0x2000a253
+#define DP83811_DEVADDR0x1f
+
+#define MII_DP83811_SGMII_CTRL 0x09
+#define MII_DP83811_INT_STAT1  0x12
+#define MII_DP83811_INT_STAT2  0x13
+#define MII_DP83811_RESET_CTRL 0x1f
+
+#define DP83811_HW_RESET   BIT(15)
+#define DP83811_SW_RESET   BIT(14)
+
+/* INT_STAT1 bits */
+#define DP83811_RX_ERR_HF_INT_EN   BIT(0)
+#define DP83811_MS_TRAINING_INT_EN BIT(1)
+#define DP83811_ANEG_COMPLETE_INT_EN   BIT(2)
+#define DP83811_ESD_EVENT_INT_EN   BIT(3)
+#define DP83811_WOL_INT_EN BIT(4)
+#define DP83811_LINK_STAT_INT_EN   BIT(5)
+#define DP83811_ENERGY_DET_INT_EN  BIT(6)
+#define DP83811_LINK_QUAL_INT_EN   BIT(7)
+
+/* INT_STAT2 bits */
+#define DP83811_JABBER_DET_INT_EN  BIT(0)
+#define DP83811_POLARITY_INT_ENBIT(1)
+#define DP83811_SLEEP_MODE_INT_EN  BIT(2)
+#define DP83811_OVERTEMP_INT_ENBIT(3)
+#define DP83811_OVERVOLTAGE_INT_EN BIT(6)
+#define DP83811_UNDERVOLTAGE_INT_ENBIT(7)
+
+#define MII_DP83811_RXSOP1 0x04a5
+#define MII_DP83811_RXSOP2 0x04a6
+#define MII_DP83811_RXSOP3 0x04a7
+
+/* WoL Registers */
+#define MII_DP83811_WOL_CFG0x04a0
+#define MII_DP83811_WOL_STAT   0x04a1
+#define MII_DP83811_WOL_DA10x04a2
+#define MII_DP83811_WOL_DA20x04a3
+#define MII_DP83811_WOL_DA30x04a4
+
+/* WoL bits */
+#define DP83811_WOL_MAGIC_EN   BIT(0)
+#define DP83811_WOL_SECURE_ON  BIT(5)
+#define DP83811_WOL_EN BIT(7)
+#define DP83811_WOL_INDICATION_SEL BIT(8)
+#define DP83811_WOL_CLR_INDICATION BIT(11)
+
+/* SGMII CTRL bits */
+#define DP83811_TDR_AUTO   BIT(8)
+#define DP83811_SGMII_EN   BIT(12)
+#define DP83811_SGMII_AUTO_NEG_EN  BIT(13)
+#define DP83811_SGMII_TX_ERR_DIS   BIT(14)
+#define DP83811_SGMII_SOFT_RESET   BIT(15)
+
+static int dp83811_ack_interrupt(struct phy_device *phydev)
+{
+   int err;
+
+   err = phy_read(phydev, MII_DP83811_INT_STAT1);
+   if (err < 0)
+   return err;
+
+   err = phy_read(phydev, MII_DP83811_INT_STAT2);
+   if (err < 0)
+   return err;
+
+   return 0;
+}
+
+static int dp83811_set_wol(struct phy_device *phydev,
+  struct ethtool_wolinfo *wol)
+{
+   struct net_device *ndev = phydev->attached_dev;
+   const u8 *mac;
+   u16 
