date:20181128

[PATCH RFC bpf-next 1/3] bpf: Add BPF_F_ANY_ALIGNMENT.

2018-11-28 Thread David Miller



Often we want to write test cases that check things like bad context
offset accesses.  And one way to do this is to use an odd offset on,
for example, a 32-bit load.

This unfortunately triggers the alignment checks first on platforms
that do not set CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS.  So the test
case sees the alignment failure rather than what it was testing for.

It is often not completely possible to respect the original intention
of the test, or even test the same exact thing, while solving the
alignment issue.

Another option could have been to check the alignment after the
context and other validations are performed by the verifier, but
that is a non-trivial change to the verifier.

Signed-off-by: David S. Miller 
---
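For readers skimming the diff, the flag precedence that the verifier.c
hunk sets up can be modelled in userspace as below.  This is only a
sketch: resolve_strict_alignment() and its efficient_unaligned
parameter are made-up names, the latter standing in for
IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS).

```c
#include <stdbool.h>
#include <stdint.h>

#define BPF_F_STRICT_ALIGNMENT	(1U << 0)
#define BPF_F_ANY_ALIGNMENT	(1U << 1)

/* Hypothetical userspace model of the strict_alignment decision.
 * efficient_unaligned is an assumption standing in for the kernel's
 * IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) test.
 */
static bool resolve_strict_alignment(uint32_t prog_flags,
				     bool efficient_unaligned)
{
	/* Caller may request strict checking explicitly. */
	bool strict = !!(prog_flags & BPF_F_STRICT_ALIGNMENT);

	/* Platforms without efficient unaligned access force it on. */
	if (!efficient_unaligned)
		strict = true;

	/* The new flag is applied last, so it overrides both of the above. */
	if (prog_flags & BPF_F_ANY_ALIGNMENT)
		strict = false;

	return strict;
}
```

Because the new check comes last, passing both flags at once ends up
with alignment checking disabled in this model.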
 include/uapi/linux/bpf.h                    | 10 ++
 kernel/bpf/syscall.c                        |  2 +-
 kernel/bpf/verifier.c                       |  2 ++
 tools/include/uapi/linux/bpf.h              | 10 ++
 tools/lib/bpf/bpf.c                         | 12 +---
 tools/lib/bpf/bpf.h                         |  6 +++---
 tools/testing/selftests/bpf/test_align.c    |  2 +-
 tools/testing/selftests/bpf/test_verifier.c |  1 +
 8 files changed, 37 insertions(+), 8 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 1b7a3e6..c87b02b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -232,6 +232,16 @@ enum bpf_attach_type {
  */
 #define BPF_F_STRICT_ALIGNMENT (1U << 0)
 
+/* If BPF_F_ANY_ALIGNMENT is used in BPF_PROG_LOAD command, the
+ * verifier will allow any alignment whatsoever.  This bypasses
+ * what CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS would cause it to do.
+ * It is mostly used for testing when we want to validate the
+ * context and memory access aspects of the validator, but because
+ * of an unaligned access the alignment check would trigger before
+ * the one we are interested in.
+ */
+#define BPF_F_ANY_ALIGNMENT (1U << 1)
+
 /* when bpf_ldimm64->src_reg == BPF_PSEUDO_MAP_FD, bpf_ldimm64->imm == fd */
 #define BPF_PSEUDO_MAP_FD  1
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 85cbeec..531bd95 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1452,7 +1452,7 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr)
 	if (CHECK_ATTR(BPF_PROG_LOAD))
 		return -EINVAL;
 
-	if (attr->prog_flags & ~BPF_F_STRICT_ALIGNMENT)
+	if (attr->prog_flags & ~(BPF_F_STRICT_ALIGNMENT | BPF_F_ANY_ALIGNMENT))
 		return -EINVAL;
 
 	/* copy eBPF program license from user space */
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 9584438..71988337 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -6505,6 +6505,8 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr,
 	env->strict_alignment = !!(attr->prog_flags & BPF_F_STRICT_ALIGNMENT);
 	if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS))
 		env->strict_alignment = true;
+	if (attr->prog_flags & BPF_F_ANY_ALIGNMENT)
+		env->strict_alignment = false;
 
 	ret = replace_map_fd_with_map_ptr(env);
 	if (ret < 0)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 1b7a3e6..c87b02b 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -232,6 +232,16 @@ enum bpf_attach_type {
  */
 #define BPF_F_STRICT_ALIGNMENT (1U << 0)
 
+/* If BPF_F_ANY_ALIGNMENT is used in BPF_PROG_LOAD command, the
+ * verifier will allow any alignment whatsoever.  This bypasses
+ * what CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS would cause it to do.
+ * It is mostly used for testing when we want to validate the
+ * context and memory access aspects of the validator, but because
+ * of an unaligned access the alignment check would trigger before
+ * the one we are interested in.
+ */
+#define BPF_F_ANY_ALIGNMENT (1U << 1)
+
 /* when bpf_ldimm64->src_reg == BPF_PSEUDO_MAP_FD, bpf_ldimm64->imm == fd */
 #define BPF_PSEUDO_MAP_FD  1
 
diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index ce18221..7e34cb1 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -280,9 +280,11 @@ int bpf_load_program(enum bpf_prog_type type, const struct bpf_insn *insns,
 
 int bpf_verify_program(enum bpf_prog_type type, const struct bpf_insn *insns,
 		       size_t insns_cnt, int strict_alignment,
-		       const char *license, __u32 kern_version,
-		       char *log_buf, size_t log_buf_sz, int log_level)
+		       int any_alignment, const char *license,
+		       __u32 kern_version, char *log_buf, size_t log_buf_sz,
+		       int log_level)
 {
+	__u32 prog_flags = 0;
 	union bpf_attr attr;
 
 	bzero(&attr, sizeof(attr));
@@ -295,7 +297,11 @@ int bpf_verify_program(enum bpf_prog_type type, const struct bpf_insn *insns,
 	attr.log_level =

[PATCH RFC bpf-next 0/3] bpf: Improve verifier coverage on sparc64 et al.

2018-11-28 Thread David Miller



On sparc64 a ton of test cases in test_verifier.c fail because
the memory accesses in the test case are unaligned (or cannot
be proven to be aligned by the verifier).
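To make the failure mode concrete, here is a deliberately simplified
model of the alignment test for a constant-offset access.  It is a
sketch only: access_is_aligned() is a made-up name, and the real
verifier also has to reason about variable offsets, which this ignores.

```c
#include <stdbool.h>

/* Hypothetical, simplified alignment test for a fixed-offset access.
 * reg_off is the constant offset already folded into the pointer
 * register, insn_off the offset encoded in the load/store itself.
 */
static bool access_is_aligned(int reg_off, int insn_off, int size,
			      bool strict)
{
	/* Byte accesses, or non-strict mode, never fail on alignment. */
	if (!strict || size == 1)
		return true;

	/* Otherwise the total offset must be a multiple of the size. */
	return ((reg_off + insn_off) % size) == 0;
}
```

Under this model a 32-bit load at offset 1 is rejected when strict
checking is on, before any context or bounds check gets a chance to
report the error the test case was written for.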

Perhaps we can eventually try to (carefully) modify each one
with this problem to not use unaligned accesses but:

1) That is delicate work.

2) The changes might not fully respect the original
   intention of the testcase.

3) In some cases, such a transformation might not even
   be feasible at all.

So add an "any alignment" flag to tell the verifier to forcefully
disable its alignment checks completely.

test_verifier.c is then annotated to use this flag when necessary.

The presence of the flag in each test case is good documentation
to anyone who wants to actually tackle the job of eliminating
the unaligned memory accesses in the test cases.

I've also seen several weird things in test cases, like trying to
access __skb->mark in a packet buffer.

A couple of unresolved issues, in my opinion:

1) Maybe make this privileged, or at least prevent executing programs
   that have had the 'any alignment' flag passed to the verifier when
   we are on an inefficient unaligned access cpu.

2) Since we don't actually execute the "ACCEPT" cases (only enforced
   in the test_verifier.c code, see #1) maybe we should report them
   in some special way so that this is clear in the output.

This gets rid of 104 test_verifier.c failures on sparc64.

Signed-off-by: David S. Miller

[PATCH RFC bpf-next 2/3] bpf: Make use of 'any' alignment in selftests.

2018-11-28 Thread David Miller



Add F_LOAD_WITH_ANY_ALIGNMENT.

On architectures which don't have efficient unaligned accesses,
this provides a way to force test cases to ignore alignment issues and
instead test the actual problem the testcase is targeting.

Only use this initially for programs where the expected result is
REJECT.

Signed-off-by: David S. Miller 
---
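The part of this patch that is cut off in this archive has to turn the
per-test flag into a program load flag somewhere.  A minimal sketch of
such a mapping follows; test_flags_to_prog_flags() is a hypothetical
helper for illustration, not a function from test_verifier.c.

```c
#include <stdint.h>

/* Load flags from the uapi header in patch 1/3. */
#define BPF_F_STRICT_ALIGNMENT			(1U << 0)
#define BPF_F_ANY_ALIGNMENT			(1U << 1)

/* Per-test flags as defined in test_verifier.c by this patch. */
#define F_NEEDS_EFFICIENT_UNALIGNED_ACCESS	(1 << 0)
#define F_LOAD_WITH_STRICT_ALIGNMENT		(1 << 1)
#define F_LOAD_WITH_ANY_ALIGNMENT		(1 << 2)

/* Hypothetical helper: translate test flags into prog_flags. */
static uint32_t test_flags_to_prog_flags(unsigned int test_flags)
{
	uint32_t prog_flags = 0;

	if (test_flags & F_LOAD_WITH_STRICT_ALIGNMENT)
		prog_flags |= BPF_F_STRICT_ALIGNMENT;
	if (test_flags & F_LOAD_WITH_ANY_ALIGNMENT)
		prog_flags |= BPF_F_ANY_ALIGNMENT;

	/* F_NEEDS_EFFICIENT_UNALIGNED_ACCESS only affects how a
	 * failure is reported, so it maps to no load flag here.
	 */
	return prog_flags;
}
```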
 tools/testing/selftests/bpf/test_verifier.c | 47 -
 1 file changed, 46 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index a8f0a54..1a68863 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -54,6 +54,7 @@
 
 #define F_NEEDS_EFFICIENT_UNALIGNED_ACCESS (1 << 0)
 #define F_LOAD_WITH_STRICT_ALIGNMENT   (1 << 1)
+#define F_LOAD_WITH_ANY_ALIGNMENT  (1 << 2)
 
 #define UNPRIV_SYSCTL "kernel/unprivileged_bpf_disabled"
 static bool unpriv_disabled = false;
@@ -1823,6 +1824,7 @@ static struct bpf_test tests[] = {
.errstr = "invalid bpf_context access",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_SK_MSG,
+   .flags = F_LOAD_WITH_ANY_ALIGNMENT,
},
{
"direct packet read for SK_MSG",
@@ -2199,6 +2201,7 @@ static struct bpf_test tests[] = {
},
.errstr = "invalid bpf_context access",
.result = REJECT,
+   .flags = F_LOAD_WITH_ANY_ALIGNMENT,
},
{
"check skb->hash half load not permitted, unaligned 3",
@@ -2215,6 +2218,7 @@ static struct bpf_test tests[] = {
},
.errstr = "invalid bpf_context access",
.result = REJECT,
+   .flags = F_LOAD_WITH_ANY_ALIGNMENT,
},
{
"check cb access: half, wrong type",
@@ -3281,6 +3285,7 @@ static struct bpf_test tests[] = {
.result = REJECT,
.errstr = "R0 invalid mem access 'inv'",
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   .flags = F_LOAD_WITH_ANY_ALIGNMENT,
},
{
"raw_stack: skb_load_bytes, spilled regs corruption 2",
@@ -3311,6 +3316,7 @@ static struct bpf_test tests[] = {
.result = REJECT,
.errstr = "R3 invalid mem access 'inv'",
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   .flags = F_LOAD_WITH_ANY_ALIGNMENT,
},
{
"raw_stack: skb_load_bytes, spilled regs + data",
@@ -3810,6 +3816,7 @@ static struct bpf_test tests[] = {
.errstr = "R2 invalid mem access 'inv'",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   .flags = F_LOAD_WITH_ANY_ALIGNMENT,
},
{
"direct packet access: test16 (arith on data_end)",
@@ -3993,6 +4000,7 @@ static struct bpf_test tests[] = {
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
.result = REJECT,
.errstr = "invalid access to packet, off=0 size=8, R5(id=1,off=0,r=0)",
+   .flags = F_LOAD_WITH_ANY_ALIGNMENT,
},
{
"direct packet access: test24 (x += pkt_ptr, 5)",
@@ -5149,6 +5157,7 @@ static struct bpf_test tests[] = {
.result = REJECT,
.errstr = "invalid access to map value, value_size=64 off=-2 size=4",
.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+   .flags = F_LOAD_WITH_ANY_ALIGNMENT,
},
{
"invalid cgroup storage access 5",
@@ -5265,6 +5274,7 @@ static struct bpf_test tests[] = {
.result = REJECT,
.errstr = "invalid access to map value, value_size=64 off=-2 size=4",
.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+   .flags = F_LOAD_WITH_ANY_ALIGNMENT,
},
{
"invalid per-cpu cgroup storage access 5",
@@ -7206,6 +7216,7 @@ static struct bpf_test tests[] = {
.errstr = "invalid mem access 'inv'",
.result = REJECT,
.result_unpriv = REJECT,
+   .flags = F_LOAD_WITH_ANY_ALIGNMENT,
},
{
"map element value illegal alu op, 5",
@@ -7228,6 +7239,7 @@ static struct bpf_test tests[] = {
.fixup_map_hash_48b = { 3 },
.errstr = "R0 invalid mem access 'inv'",
.result = REJECT,
+   .flags = F_LOAD_WITH_ANY_ALIGNMENT,
},
{
"map element value is preserved across register spilling",
@@ -9720,6 +9732,7 @@ static struct bpf_test tests[] = {
.errstr = "R1 offset is outside of the packet",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_LOAD_WITH_ANY_ALIGNMENT,
},
{
"XDP

[PATCH RFC bpf-next 3/3] bpf: Apply F_LOAD_WITH_ANY_ALIGNMENT to ACCEPT test cases too.

2018-11-28 Thread David Miller



If a testcase has alignment problems but is expected to be ACCEPT,
verify it using F_LOAD_WITH_ANY_ALIGNMENT but do not try to
execute it.

In this way folks on inefficient unaligned access architectures
can more fully regression test the verifier.

Signed-off-by: David S. Miller 
---
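The "verify but do not execute" rule described above could be expressed
as a small predicate.  This is a sketch under assumptions:
should_execute_prog() and its efficient_unaligned parameter are
hypothetical names, and the hunk that implements the real decision is
not visible in this archive.

```c
#include <stdbool.h>

#define F_LOAD_WITH_ANY_ALIGNMENT	(1 << 2)

/* Hypothetical predicate: may the loaded program be test-run?
 * efficient_unaligned is an assumption standing in for a
 * CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS build-time check.
 */
static bool should_execute_prog(int fd_prog, unsigned int test_flags,
				bool efficient_unaligned)
{
	/* Nothing to run if the load itself failed. */
	if (fd_prog < 0)
		return false;

	/* Loaded with alignment checks bypassed on a CPU that cannot
	 * do unaligned accesses efficiently: verify only, never run.
	 */
	if ((test_flags & F_LOAD_WITH_ANY_ALIGNMENT) && !efficient_unaligned)
		return false;

	return true;
}
```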
 tools/testing/selftests/bpf/test_verifier.c | 25 +
 1 file changed, 25 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index 1a68863..8191f3e 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -3919,6 +3919,7 @@ static struct bpf_test tests[] = {
},
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
.result = ACCEPT,
+   .flags = F_LOAD_WITH_ANY_ALIGNMENT,
},
{
"direct packet access: test21 (x += pkt_ptr, 2)",
@@ -3944,6 +3945,7 @@ static struct bpf_test tests[] = {
},
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
.result = ACCEPT,
+   .flags = F_LOAD_WITH_ANY_ALIGNMENT,
},
{
"direct packet access: test22 (x += pkt_ptr, 3)",
@@ -3974,6 +3976,7 @@ static struct bpf_test tests[] = {
},
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
.result = ACCEPT,
+   .flags = F_LOAD_WITH_ANY_ALIGNMENT,
},
{
"direct packet access: test23 (x += pkt_ptr, 4)",
@@ -4026,6 +4029,7 @@ static struct bpf_test tests[] = {
},
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
.result = ACCEPT,
+   .flags = F_LOAD_WITH_ANY_ALIGNMENT,
},
{
"direct packet access: test25 (marking on <, good access)",
@@ -7733,6 +7737,7 @@ static struct bpf_test tests[] = {
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
.retval = 0 /* csum_diff of 64-byte packet */,
+   .flags = F_LOAD_WITH_ANY_ALIGNMENT,
},
{
"helper access to variable memory: size = 0 not allowed on NULL (!ARG_PTR_TO_MEM_OR_NULL)",
@@ -9695,6 +9700,7 @@ static struct bpf_test tests[] = {
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_LOAD_WITH_ANY_ALIGNMENT,
},
{
"XDP pkt read, pkt_data' > pkt_end, bad access 1",
@@ -9866,6 +9872,7 @@ static struct bpf_test tests[] = {
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_LOAD_WITH_ANY_ALIGNMENT,
},
{
"XDP pkt read, pkt_end < pkt_data', bad access 1",
@@ -9978,6 +9985,7 @@ static struct bpf_test tests[] = {
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_LOAD_WITH_ANY_ALIGNMENT,
},
{
"XDP pkt read, pkt_end >= pkt_data', bad access 1",
@@ -10035,6 +10043,7 @@ static struct bpf_test tests[] = {
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_LOAD_WITH_ANY_ALIGNMENT,
},
{
"XDP pkt read, pkt_data' <= pkt_end, bad access 1",
@@ -10147,6 +10156,7 @@ static struct bpf_test tests[] = {
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_LOAD_WITH_ANY_ALIGNMENT,
},
{
"XDP pkt read, pkt_meta' > pkt_data, bad access 1",
@@ -10318,6 +10328,7 @@ static struct bpf_test tests[] = {
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_LOAD_WITH_ANY_ALIGNMENT,
},
{
"XDP pkt read, pkt_data < pkt_meta', bad access 1",
@@ -10430,6 +10441,7 @@ static struct bpf_test tests[] = {
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_LOAD_WITH_ANY_ALIGNMENT,
},
{
"XDP pkt read, pkt_data >= pkt_meta', bad access 1",
@@ -10487,6 +10499,7 @@ static struct bpf_test tests[] = {
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_LOAD_WITH_ANY_ALIGNMENT,
},
{
"XDP pkt read, pkt_meta' <= pkt_data, bad access 1",
@@ -12406,6 +12419,7 @@ static struct bpf_test tests[] = {
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
.result = ACCEPT,
.retval = 1,
+   .flags = F_LOAD_WITH_ANY_ALIGNMENT,
},
{
"calls: pkt_ptr spill into caller stack 4",
@@ -12440,6 +12454,7 @@ static struct

Re: pull-request: can-next 2018-11-28

2018-11-28 Thread David Miller

From: Marc Kleine-Budde 
Date: Thu, 29 Nov 2018 08:14:36 +0100

> Done, here's the pull request with the git:// URL:

Pulled, thanks.

Re: pull-request: can-next 2018-11-28

2018-11-28 Thread Marc Kleine-Budde

On 11/28/18 8:21 PM, David Miller wrote:
> From: Marc Kleine-Budde 
> Date: Wed, 28 Nov 2018 17:01:13 +0100
> 
>>   
>> ssh://g...@gitolite.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next.git
>>  tags/linux-can-next-for-4.21-20181128
> 
> This doesn't work for me at all.
> 
> I'm not adding a custom .ssh config entry to point gitolite.kernel.org
> to the same SSH key that I use for gitol...@ra.kernel.org, no way.
> 
> I don't want to see these SSH based URLs any more, they are a pain and
> add overhead to my workflow.

Ok, I've adjusted my scripts.

> Please just use plain git:// URLs, thank you.

Done, here's the pull request with the git:// URL:

---

This is a pull request for net-next/master consisting of 18 patches.

The first patch is by Colin Ian King and fixes the spelling in the ucan
driver.

The next three patches target the xilinx driver. YueHaibing's patch
fixes the return type of ndo_start_xmit function. Two patches by
Shubhrajyoti Datta add support for the CAN FD 2.0 controllers.

Flavio Suligoi's patch for the sja1000 driver adds support for the ASEM
CAN raw hardware.

Wolfram Sang's and Kuninori Morimoto's patches switch the rcar driver to
use SPDX license identifiers.

The remaining 11 patches improve the flexcan driver. Pankaj Bansal's
patch enables the driver in Kconfig on all architectures with IOMEM
support. The next four patches by me fix indentation, add missing
parentheses and comments. Aisheng Dong's patches add self wake support
and document it in the DT bindings. The remaining patches by Pankaj
Bansal first fix the loopback support and prepare the driver for the
CAN-FD support needed for the LX2160A SoC. The actual CAN-FD support
will be added in a later patch series.

regards,
Marc

---

The following changes since commit 86d1d8b72caf648e5b14ac274f9afeaab58bbae1:

  net/ipv4: Fix missing raw_init when CONFIG_PROC_FS is disabled (2018-11-27 20:58:02 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next.git tags/linux-can-next-for-4.21-20181128

for you to fetch changes up to 6cbf76028dcac01129211828d62314285231f79e:

  can: flexcan: split the Message Buffer RAM area (2018-11-28 16:52:25 +0100)

----
linux-can-next-for-4.21-20181128


Aisheng Dong (2):
  dt-bindings: can: flexcan: add stop mode property to device tree
  can: flexcan: add self wakeup support

Colin Ian King (1):
  can: ucan: fix spelling mistake: "resumbmitting" -> "resubmitting"

Flavio Suligoi (1):
  can: sja1000: plx_pci: add support for ASEM CAN raw device

Kuninori Morimoto (1):
  can: rcar: add SPDX identifiers to Kconfig and Makefile

Marc Kleine-Budde (4):
  can: flexcan: flexcan_start_xmit(): fix indention
  can: flexcan: flexcan_irq(): fix indention
  can: flexcan: FLEXCAN_IFLAG_MB: add () around macro argument
  can: flexcan: flexcan_chip_start(): adjust comment to match the code

Pankaj Bansal (5):
  can: flexcan: enable flexcan for all architectures
  can: flexcan: flexcan_chip_start(): enable loopback mode in flexcan
  can: flexcan: move rx_offload_add() from flexcan_probe() to flexcan_open()
  can: flexcan: Add provision for variable payload size
  can: flexcan: split the Message Buffer RAM area

Shubhrajyoti Datta (2):
  dt-bindings: can: xilinx_can: add Xilinx CAN FD 2.0 bindings
  can: xilinx: add can 2.0 support

Wolfram Sang (1):
  can: rcar: use SPDX identifier for Renesas drivers

YueHaibing (1):
  can: xilinx: fix return type of ndo_start_xmit function

 .../devicetree/bindings/net/can/fsl-flexcan.txt |   8 +
 .../devicetree/bindings/net/can/xilinx_can.txt  |   1 +
 drivers/net/can/Kconfig                         |   2 +-
 drivers/net/can/flexcan.c                       | 365 -
 drivers/net/can/rcar/Kconfig                    |   1 +
 drivers/net/can/rcar/Makefile                   |   1 +
 drivers/net/can/rcar/rcar_can.c                 |   6 +-
 drivers/net/can/rcar/rcar_canfd.c               |   6 +-
 drivers/net/can/sja1000/Kconfig                 |   1 +
 drivers/net/can/sja1000/plx_pci.c               |  65 +++-
 drivers/net/can/usb/ucan.c                      |   2 +-
 drivers/net/can/xilinx_can.c                    |  36 +-
 12 files changed, 402 insertions(+), 92 deletions(-)

-- 
Pengutronix e.K.  | Marc Kleine-Budde   |
Industrial Linux Solutions| Phone: +49-231-2826-924 |
Vertretung West/Dortmund  | Fax:   +49-5121-206917- |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |




[PATCHv2 net] sctp: hold transport before accessing its asoc in sctp_hash_transport

2018-11-28 Thread Xin Long

sctp_hash_transport dereferences a transport's asoc while holding only
rcu_read_lock. Without holding the transport, its asoc could already
have been freed, which leads to a use-after-free panic.

A fix similar to commit bab1be79a516 ("sctp: hold transport before
accessing its asoc in sctp_transport_get_next") is needed to hold
the transport before accessing its asoc in sctp_hash_transport.

Note that as rhlist keeps the lists to a small size, this extra
atomic operation won't add noticeable latency when inserting
a transport. Besides, this code is not in the datapath.

v1->v2:
  - improve the changelog.

Fixes: cd2b70875058 ("sctp: check duplicate node before inserting a new transport")
Reported-by: syzbot+0b05d8aa7cb185107...@syzkaller.appspotmail.com
Signed-off-by: Xin Long 
---
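For anyone unfamiliar with the hold/put idiom used in the hunk below,
here is a userspace sketch with C11 atomics.  All names (struct
transport, transport_hold(), transport_put(), ep_already_hashed()) are
made up for illustration; the hold is modelled as an
increment-unless-zero, which is an assumption about how
sctp_transport_hold() behaves, and the sketch does not free anything
when the count drops.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a refcounted transport. */
struct transport {
	atomic_int refcnt;	/* 0 means the object is being torn down */
	int ep_id;		/* stands in for transport->asoc->ep */
};

/* Take a reference unless the count has already dropped to zero. */
static bool transport_hold(struct transport *t)
{
	int old = atomic_load(&t->refcnt);

	while (old != 0) {
		/* On failure, old is reloaded and the loop retries. */
		if (atomic_compare_exchange_weak(&t->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference (the sketch never frees the object). */
static void transport_put(struct transport *t)
{
	atomic_fetch_sub(&t->refcnt, 1);
}

/* Same shape as the patched loop: skip entries that cannot be held,
 * look at the protected field only while holding, then put.
 */
static bool ep_already_hashed(struct transport *list, size_t n, int ep_id)
{
	for (size_t i = 0; i < n; i++) {
		struct transport *t = &list[i];
		bool match;

		if (!transport_hold(t))
			continue;
		match = (t->ep_id == ep_id);
		transport_put(t);
		if (match)
			return true;
	}
	return false;
}
```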
 net/sctp/input.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/sctp/input.c b/net/sctp/input.c
index ce7351c..c2c0816 100644
--- a/net/sctp/input.c
+++ b/net/sctp/input.c
@@ -896,11 +896,16 @@ int sctp_hash_transport(struct sctp_transport *t)
 	list = rhltable_lookup(&sctp_transport_hashtable, &arg,
 			       sctp_hash_params);
 
-	rhl_for_each_entry_rcu(transport, tmp, list, node)
+	rhl_for_each_entry_rcu(transport, tmp, list, node) {
+		if (!sctp_transport_hold(transport))
+			continue;
 		if (transport->asoc->ep == t->asoc->ep) {
+			sctp_transport_put(transport);
 			rcu_read_unlock();
 			return -EEXIST;
 		}
+		sctp_transport_put(transport);
+	}
 	rcu_read_unlock();
 
 	err = rhltable_insert_key(&sctp_transport_hashtable, &arg,
-- 
2.1.0

[PATCHv2 net] sctp: hold transport before accessing its asoc in sctp_epaddr_lookup_transport

2018-11-28 Thread Xin Long

Dereferencing a transport's asoc without holding the transport can
cause a use-after-free panic in sctp_epaddr_lookup_transport. Note
that a sock lock can't protect these transports that belong to
other socks.

A fix similar to commit bab1be79a516 ("sctp: hold transport
before accessing its asoc in sctp_transport_get_next") is
needed to hold the transport before accessing its asoc in
sctp_epaddr_lookup_transport.

Note that this extra atomic operation is on the datapath,
but as rhlist keeps the lists to a small size, it shouldn't
cause a noticeable performance hit.

v1->v2:
  - improve the changelog.

Fixes: 7fda702f9315 ("sctp: use new rhlist interface on sctp transport rhashtable")
Reported-by: syzbot+aad231d51b1923158...@syzkaller.appspotmail.com
Signed-off-by: Xin Long 
---
 net/sctp/input.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/net/sctp/input.c b/net/sctp/input.c
index 5c36a99..ce7351c 100644
--- a/net/sctp/input.c
+++ b/net/sctp/input.c
@@ -967,9 +967,15 @@ struct sctp_transport *sctp_epaddr_lookup_transport(
 	list = rhltable_lookup(&sctp_transport_hashtable, &arg,
 			       sctp_hash_params);
 
-	rhl_for_each_entry_rcu(t, tmp, list, node)
-		if (ep == t->asoc->ep)
+	rhl_for_each_entry_rcu(t, tmp, list, node) {
+		if (!sctp_transport_hold(t))
+			continue;
+		if (ep == t->asoc->ep) {
+			sctp_transport_put(t);
 			return t;
+		}
+		sctp_transport_put(t);
+	}
 
 	return NULL;
 }
-- 
2.1.0

[PATCHv2 net] sctp: check and update stream->out_curr when allocating stream_out

2018-11-28 Thread Xin Long

Now when using stream reconfig to add out streams, stream->out
will get re-allocated, and all old streams' information will
be copied to the new ones and the old ones will be freed.

If stream->out_curr is not updated, it keeps pointing into the freed
array, and the next attempt to send from the stream->out_curr stream
causes a panic.

This patch checks and updates stream->out_curr when
allocating stream_out.

v1->v2:
  - define fa_index() to get elem index from stream->out_curr.

Fixes: 5e32a431 ("sctp: introduce stream scheduler foundations")
Reported-by: Ying Xu 
Reported-by: syzbot+e33a3a138267ca119...@syzkaller.appspotmail.com
Signed-off-by: Xin Long 
---
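The same "rebase the cursor onto the new array" idea, sketched with a
plain contiguous array instead of flex_array.  Everything here (struct
stream, stream_grow_out()) is a made-up userspace model.  With a
contiguous array the index falls out of pointer subtraction; as far as
I understand, flex_array storage is not contiguous, which would explain
why the patch searches for the element with fa_index() instead.  Unlike
the patch, the sketch clears the cursor rather than hitting BUG_ON()
when the index does not fit.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for sctp_stream_out / sctp_stream. */
struct stream_out {
	int state;
};

struct stream {
	struct stream_out *out;		/* array of out streams */
	struct stream_out *out_curr;	/* cursor into out[], or NULL */
	size_t outcnt;
};

/* Reallocate out[] to new_cnt entries, keeping out_curr valid. */
static int stream_grow_out(struct stream *s, size_t new_cnt)
{
	struct stream_out *out = calloc(new_cnt, sizeof(*out));

	if (!out)
		return -1;

	if (s->out) {
		size_t keep = s->outcnt < new_cnt ? s->outcnt : new_cnt;

		memcpy(out, s->out, keep * sizeof(*out));
		if (s->out_curr) {
			/* Compute the index while the old array is
			 * still alive, then point at the new copy.
			 */
			size_t index = (size_t)(s->out_curr - s->out);

			s->out_curr = index < new_cnt ? &out[index] : NULL;
		}
		free(s->out);
	}

	s->out = out;
	s->outcnt = new_cnt;
	return 0;
}
```

The ordering matters: the cursor is rebased before the old array is
freed, mirroring where the patch places its new block relative to
fa_free().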
 net/sctp/stream.c | 20 
 1 file changed, 20 insertions(+)

diff --git a/net/sctp/stream.c b/net/sctp/stream.c
index 3892e76..30e7809 100644
--- a/net/sctp/stream.c
+++ b/net/sctp/stream.c
@@ -84,6 +84,19 @@ static void fa_zero(struct flex_array *fa, size_t index, size_t count)
 	}
 }
 
+static size_t fa_index(struct flex_array *fa, void *elem, size_t count)
+{
+	size_t index = 0;
+
+	while (count--) {
+		if (elem == flex_array_get(fa, index))
+			break;
+		index++;
+	}
+
+	return index;
+}
+
 /* Migrates chunks from stream queues to new stream queues if needed,
  * but not across associations. Also, removes those chunks to streams
  * higher than the new max.
@@ -147,6 +160,13 @@ static int sctp_stream_alloc_out(struct sctp_stream *stream, __u16 outcnt,
 
 	if (stream->out) {
 		fa_copy(out, stream->out, 0, min(outcnt, stream->outcnt));
+		if (stream->out_curr) {
+			size_t index = fa_index(stream->out, stream->out_curr,
+						stream->outcnt);
+
+			BUG_ON(index == stream->outcnt);
+			stream->out_curr = flex_array_get(out, index);
+		}
 		fa_free(stream->out);
 	}
 
-- 
2.1.0

[PATCH bpf] bpf: Fix verifier log string check for bad alignment.

2018-11-28 Thread David Miller



The message got changed a long time ago.

This was responsible for 36 test case failures on sparc64.

Fixes: f1174f77b50c ("bpf/verifier: rework value tracking")
Signed-off-by: David S. Miller 
---
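The check being fixed is a plain substring match on the verifier log.
A standalone sketch of that predicate is below;
rejected_for_alignment() is a made-up name, and its parameters stand in
for fd_prog, the F_NEEDS_EFFICIENT_UNALIGNED_ACCESS test flag and
bpf_vlog from do_test_single().

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical standalone version of the reject_from_alignment
 * expression: the load failed, the test is known to need efficient
 * unaligned access, and the log mentions a misaligned access.
 */
static bool rejected_for_alignment(int fd_prog,
				   bool needs_efficient_unaligned,
				   const char *log)
{
	return fd_prog < 0 && needs_efficient_unaligned &&
	       strstr(log, "misaligned") != NULL;
}
```

With the old needle the match could never succeed against a log that
only says "misaligned", so such rejections were not recognised as
alignment related.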
 tools/testing/selftests/bpf/test_verifier.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index 550b7e46bf4a..5dd4410a716c 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -14230,7 +14230,7 @@ static void do_test_single(struct bpf_test *test, bool unpriv,
 
 	reject_from_alignment = fd_prog < 0 &&
 				(test->flags & F_NEEDS_EFFICIENT_UNALIGNED_ACCESS) &&
-				strstr(bpf_vlog, "Unknown alignment.");
+				strstr(bpf_vlog, "misaligned");
 #ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
 	if (reject_from_alignment) {
 		printf("FAIL\nFailed due to alignment despite having efficient unaligned access: '%s'!\n",
-- 
2.19.2

Re: [PATCH iproute2] ss: add support for delivered and delivered_ce fields

2018-11-28 Thread Eric Dumazet

On Wed, Nov 28, 2018 at 9:16 PM David Ahern  wrote:
>
> On 11/28/18 10:10 PM, Eric Dumazet wrote:
> >
> >
> > On 11/28/2018 04:01 PM, Stephen Hemminger wrote:
> >> On Mon, 26 Nov 2018 14:29:53 -0800
> >> Eric Dumazet  wrote:
> >>
> >>> Kernel support was added in linux-4.18 in commit feb5f2ec6464
> >>> ("tcp: export packets delivery info")
> >>>
> >>> Tested:
> >>>
> >>> ss -ti
> >>> ...
> >>> ESTAB   0 2270520  [2607:f8b0:8099:e16::]:47646   
> >>> [2607:f8b0:8099:e18::]:38953
> >>>  ts sack cubic wscale:8,8 rto:7 rtt:2.824/0.278 mss:1428
> >>>  pmtu:1500 rcvmss:536 advmss:1428 cwnd:89 ssthresh:62 
> >>> bytes_acked:2097871945
> >>> segs_out:1469144 segs_in:65221 data_segs_out:1469142 send 360.0Mbps 
> >>> lastsnd:2
> >>> lastrcv:99231 lastack:2 pacing_rate 431.9Mbps delivery_rate 246.4Mbps
> >>> (*) delivered:1469099 delivered_ce:424799
> >>> busy:99231ms unacked:44 rcv_space:14280 rcv_ssthresh:65535
> >>> notsent:2207688 minrtt:0.228
> >>>
> >>> Signed-off-by: Eric Dumazet 
> >>
> >> Applied
> >>
> >
> > Thanks Stephen
> >
> > BTW, it seems ss had some recent regression :
> >
> > Old ss command was happy with "ss dst 127.0.0.1 and not dst :12865"
> >
> > But current one is not.
> >
> > root@edumazet-glaptop:/usr/src/iproute2# ss -V
> > ss utility, iproute2-ss171113
> >
> > root@edumazet-glaptop:/usr/src/iproute2# ss dst 127.0.0.1 and not dst :12865
> > Netid  State  Recv-Q Send-QLocal Address:Port   
> >   Peer Address:Port
> > tcpESTAB  0  0 127.0.0.1:45368  
> >  127.0.0.1:999
> > tcpESTAB  0  0 127.0.0.1:999
> >  127.0.0.1:45368
> >
> >
> > root@edumazet-glaptop:/usr/src/iproute2# misc/ss -V
> > ss utility, iproute2-ss181023
> >
> > root@edumazet-glaptop:/usr/src/iproute2# misc/ss dst 127.0.0.1 and not dst 
> > :12865
> > ss: bison bellows (while parsing filter): "syntax error!" Sorry.
> > Usage: ss [ OPTIONS ]
> >ss [ OPTIONS ] [ FILTER ]
> > ...
> >
>
> Most logical commit to ssfilter.y is:
>
> commit 38d209ecf2ae966b9b25de4acb60cdffb0e06ced
> Author: Phil Sutter 
> Date:   Tue Aug 14 14:18:06 2018 +0200
>
> ss: Review ssfilter
>
> Can you revert it and see if it helps?

It does not help.

It seems the new way is to use: ss dst 127.0.0.1 and dport != 12865

Re: [PATCH bpf 2/2] bpf: test_verifier, test for lookup on queue/stack maps

2018-11-28 Thread Prashant Bhole





On 11/28/2018 10:06 PM, Mauricio Vasquez wrote:
>
> On 11/28/18 3:45 AM, Daniel Borkmann wrote:
>> On 11/28/2018 08:51 AM, Prashant Bhole wrote:
>>> This patch adds tests to check whether bpf verifier prevents lookup
>>> on queue/stack maps
>>>
>>> Signed-off-by: Prashant Bhole 
>>> ---
>>>   tools/testing/selftests/bpf/test_verifier.c | 52 +
>>>   1 file changed, 52 insertions(+)
>>>
>>> diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
>>> index 550b7e46bf4a..becd9f4f3980 100644
>>> --- a/tools/testing/selftests/bpf/test_verifier.c
>>> +++ b/tools/testing/selftests/bpf/test_verifier.c
>>> @@ -74,6 +74,8 @@ struct bpf_test {
>>>   int fixup_map_in_map[MAX_FIXUPS];
>>>   int fixup_cgroup_storage[MAX_FIXUPS];
>>>   int fixup_percpu_cgroup_storage[MAX_FIXUPS];
>>> +    int fixup_map_queue[MAX_FIXUPS];
>>> +    int fixup_map_stack[MAX_FIXUPS];
>>>   const char *errstr;
>>>   const char *errstr_unpriv;
>>>   uint32_t retval, retval_unpriv;
>>> @@ -4611,6 +4613,38 @@ static struct bpf_test tests[] = {
>>>   .errstr = "cannot pass map_type 7 into func bpf_map_lookup_elem",
>>>   .prog_type = BPF_PROG_TYPE_PERF_EVENT,
>>>   },
>>> +    {
>>> +    "prevent map lookup in queue map",
>>> +    .insns = {
>>> +    BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
>>> +    BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
>>> +    BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
>>> +    BPF_LD_MAP_FD(BPF_REG_1, 0),
>>> +    BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
>>> + BPF_FUNC_map_lookup_elem),
>>> +    BPF_EXIT_INSN(),
>>> +    },
>>> +    .fixup_map_queue = { 3 },
>>> +    .result = REJECT,
>>> +    .errstr = "invalid stack type R2 off=-8 access_size=0",
>>> +    .prog_type = BPF_PROG_TYPE_XDP,
>> Hmm, the approach in patch 1 is very fragile, and we're lucky in this case
>> that the verifier bailed out with 'invalid stack type R2 off=-8 access_size=0'
>> because of key size being zero. If this would have not been the case then
>> the added ERR_PTR(-EOPNOTSUPP) would basically be seen as a valid pointer and
>> the program could read/write into it. Instead, this needs to be prevented much
>> earlier like check_map_func_compatibility(),
>
> Actually it is prevented in check_map_func_compatibility(), but stack
> boundary check is done before in the verifier.
>
>> and I would like to have a split
>> on these approaches to make verifier more robust. While you want ERR_PTR(-EOPNOTSUPP)
>> for user space syscall side,
>
> In the case of QUEUE and STACK maps this is not relevant because the
> lookup syscall is mapped into peek operation.
>
> In fact queue_stack_map_lookup_elem() & queue_stack_map_update_elem()
> should be never called, I think we can remove them safely.

Got it. Shall we keep these verifier tests (patch 2)?

Thanks,
Prashant

Re: [PATCH iproute2] ss: add support for delivered and delivered_ce fields

2018-11-28 Thread David Ahern

On 11/28/18 10:10 PM, Eric Dumazet wrote:
> 
> 
> On 11/28/2018 04:01 PM, Stephen Hemminger wrote:
>> On Mon, 26 Nov 2018 14:29:53 -0800
>> Eric Dumazet  wrote:
>>
>>> Kernel support was added in linux-4.18 in commit feb5f2ec6464
>>> ("tcp: export packets delivery info")
>>>
>>> Tested:
>>>
>>> ss -ti
>>> ...
>>> ESTAB   0 2270520  [2607:f8b0:8099:e16::]:47646   
>>> [2607:f8b0:8099:e18::]:38953
>>>  ts sack cubic wscale:8,8 rto:7 rtt:2.824/0.278 mss:1428
>>>  pmtu:1500 rcvmss:536 advmss:1428 cwnd:89 ssthresh:62 
>>> bytes_acked:2097871945
>>> segs_out:1469144 segs_in:65221 data_segs_out:1469142 send 360.0Mbps 
>>> lastsnd:2
>>> lastrcv:99231 lastack:2 pacing_rate 431.9Mbps delivery_rate 246.4Mbps
>>> (*) delivered:1469099 delivered_ce:424799
>>> busy:99231ms unacked:44 rcv_space:14280 rcv_ssthresh:65535
>>> notsent:2207688 minrtt:0.228
>>>
>>> Signed-off-by: Eric Dumazet 
>>
>> Applied
>>
> 
> Thanks Stephen
> 
> BTW, it seems ss had some recent regression :
> 
> Old ss command was happy with "ss dst 127.0.0.1 and not dst :12865"
> 
> But current one is not.
> 
> root@edumazet-glaptop:/usr/src/iproute2# ss -V
> ss utility, iproute2-ss171113
> 
> root@edumazet-glaptop:/usr/src/iproute2# ss dst 127.0.0.1 and not dst :12865
> Netid  State  Recv-Q Send-QLocal Address:Port 
> Peer Address:Port
> tcpESTAB  0  0 127.0.0.1:45368
>127.0.0.1:999  
> tcpESTAB  0  0 127.0.0.1:999  
>127.0.0.1:45368   
> 
>  
> root@edumazet-glaptop:/usr/src/iproute2# misc/ss -V
> ss utility, iproute2-ss181023
> 
> root@edumazet-glaptop:/usr/src/iproute2# misc/ss dst 127.0.0.1 and not dst 
> :12865
> ss: bison bellows (while parsing filter): "syntax error!" Sorry.
> Usage: ss [ OPTIONS ]
>ss [ OPTIONS ] [ FILTER ]
> ...
> 

Most logical commit to ssfilter.y is:

commit 38d209ecf2ae966b9b25de4acb60cdffb0e06ced
Author: Phil Sutter 
Date:   Tue Aug 14 14:18:06 2018 +0200

ss: Review ssfilter

Can you revert it and see if it helps?

Re: [PATCH iproute2] ss: add support for delivered and delivered_ce fields

2018-11-28 Thread Eric Dumazet




On 11/28/2018 04:01 PM, Stephen Hemminger wrote:
> On Mon, 26 Nov 2018 14:29:53 -0800
> Eric Dumazet  wrote:
> 
>> Kernel support was added in linux-4.18 in commit feb5f2ec6464
>> ("tcp: export packets delivery info")
>>
>> Tested:
>>
>> ss -ti
>> ...
>> ESTAB   0 2270520  [2607:f8b0:8099:e16::]:47646   
>> [2607:f8b0:8099:e18::]:38953
>>   ts sack cubic wscale:8,8 rto:7 rtt:2.824/0.278 mss:1428
>>  pmtu:1500 rcvmss:536 advmss:1428 cwnd:89 ssthresh:62 
>> bytes_acked:2097871945
>> segs_out:1469144 segs_in:65221 data_segs_out:1469142 send 360.0Mbps 
>> lastsnd:2
>> lastrcv:99231 lastack:2 pacing_rate 431.9Mbps delivery_rate 246.4Mbps
>> (*) delivered:1469099 delivered_ce:424799
>> busy:99231ms unacked:44 rcv_space:14280 rcv_ssthresh:65535
>> notsent:2207688 minrtt:0.228
>>
>> Signed-off-by: Eric Dumazet 
> 
> Applied
> 

Thanks Stephen

BTW, it seems ss had some recent regression :

Old ss command was happy with "ss dst 127.0.0.1 and not dst :12865"

But current one is not.

root@edumazet-glaptop:/usr/src/iproute2# ss -V
ss utility, iproute2-ss171113

root@edumazet-glaptop:/usr/src/iproute2# ss dst 127.0.0.1 and not dst :12865
Netid  State  Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
tcp    ESTAB  0       0       127.0.0.1:45368     127.0.0.1:999
tcp    ESTAB  0       0       127.0.0.1:999       127.0.0.1:45368

 
root@edumazet-glaptop:/usr/src/iproute2# misc/ss -V
ss utility, iproute2-ss181023

root@edumazet-glaptop:/usr/src/iproute2# misc/ss dst 127.0.0.1 and not dst :12865
ss: bison bellows (while parsing filter): "syntax error!" Sorry.
Usage: ss [ OPTIONS ]
   ss [ OPTIONS ] [ FILTER ]
...

Re: [Patch net v2] mlx5: fixup checksum for short ethernet frame padding

2018-11-28 Thread Eric Dumazet

On Wed, Nov 28, 2018 at 7:53 PM Cong Wang  wrote:
>
> On Wed, Nov 28, 2018 at 7:50 PM Eric Dumazet  wrote:
> >
> > On Wed, Nov 28, 2018 at 7:40 PM Cong Wang  wrote:
> > >
> > > On Wed, Nov 28, 2018 at 4:07 PM Eric Dumazet  wrote:
> > > >
> > > > A NIC is supposed to deliver frames, even the ones that 'seem' bad.
> > >
> > > A quick test shows this is not the case for mlx5.
> > >
> > > With the trafgen script you gave to me, with tot_len==40, the dest host
> > > could receive all the packets. Changing tot_len to 80, tcpdump could no
> > > longer see any packet. (Both sender and receiver are mlx5.)
> > >
> > > > So packets with tot_len > skb->len are clearly dropped before tcpdump
> > > > can see them, most likely by the mlx5 hardware.
> >
> > Or a router, or a switch.
> >
> > Are your two hosts connected back to back ?
>
> Both should be plugged into the same switch. I fail to see why a
> switch would parse the IP header, since the packet is nothing of
> interest to it (unlike, say, IGMP for snooping).

Well, _something_ is dropping the frames.
It can be mlx5, or something else.

Does ethtool -S show any increasing counter ?

Re: [Patch net v2] mlx5: fixup checksum for short ethernet frame padding

2018-11-28 Thread Cong Wang

On Wed, Nov 28, 2018 at 7:50 PM Eric Dumazet  wrote:
>
> On Wed, Nov 28, 2018 at 7:40 PM Cong Wang  wrote:
> >
> > On Wed, Nov 28, 2018 at 4:07 PM Eric Dumazet  wrote:
> > >
> > > A NIC is supposed to deliver frames, even the ones that 'seem' bad.
> >
> > A quick test shows this is not the case for mlx5.
> >
> > With the trafgen script you gave to me, with tot_len==40, the dest host
> > could receive all the packets. Changing tot_len to 80, tcpdump could no
> > longer see any packet. (Both sender and receiver are mlx5.)
> >
> > > So packets with tot_len > skb->len are clearly dropped before tcpdump
> > > can see them, most likely by the mlx5 hardware.
>
> Or a router, or a switch.
>
> Are your two hosts connected back to back ?

Both should be plugged into the same switch. I fail to see why a
switch would parse the IP header, since the packet is nothing of
interest to it (unlike, say, IGMP for snooping).

Re: [Patch net v2] mlx5: fixup checksum for short ethernet frame padding

2018-11-28 Thread Eric Dumazet

On Wed, Nov 28, 2018 at 7:40 PM Cong Wang  wrote:
>
> On Wed, Nov 28, 2018 at 4:07 PM Eric Dumazet  wrote:
> >
> > A NIC is supposed to deliver frames, even the ones that 'seem' bad.
>
> A quick test shows this is not the case for mlx5.
>
> With the trafgen script you gave to me, with tot_len==40, the dest host
> could receive all the packets. Changing tot_len to 80, tcpdump could no
> longer see any packet. (Both sender and receiver are mlx5.)
>
> > So packets with tot_len > skb->len are clearly dropped before tcpdump
> > can see them, most likely by the mlx5 hardware.

Or a router, or a switch.

Are your two hosts connected back to back ?

Re: [Patch net v2] mlx5: fixup checksum for short ethernet frame padding

2018-11-28 Thread Cong Wang

On Wed, Nov 28, 2018 at 4:07 PM Eric Dumazet  wrote:
>
> A NIC is supposed to deliver frames, even the ones that 'seem' bad.

A quick test shows this is not the case for mlx5.

With the trafgen script you gave to me, with tot_len==40, the dest host
could receive all the packets. Changing tot_len to 80, tcpdump could no
longer see any packet. (Both sender and receiver are mlx5.)

So packets with tot_len > skb->len are clearly dropped before tcpdump
can see them, most likely by the mlx5 hardware.
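
The two cases in this test can be told apart by comparing the IPv4 tot_len field with the bytes actually on the wire. The following is a user-space sketch only, not mlx5 or kernel code: classify_ipv4_frame() and the enum names are hypothetical, and it assumes an untagged Ethernet II frame with a 14-byte header.

```c
#include <stddef.h>
#include <stdint.h>

#define ETH_HDR_LEN 14 /* Ethernet II header, no VLAN tag (assumption) */

enum frame_kind {
	FRAME_EXACT,     /* IP datagram fills the frame payload exactly */
	FRAME_PADDED,    /* tot_len < payload: trailing bytes are padding */
	FRAME_TRUNCATED, /* tot_len > payload: header claims more than arrived */
	FRAME_INVALID,   /* too short to hold a minimal IPv4 header */
};

/* Hypothetical helper: compare IPv4 tot_len with the received payload size. */
static enum frame_kind classify_ipv4_frame(const uint8_t *frame, size_t frame_len)
{
	size_t payload_len, tot_len;

	if (frame_len < ETH_HDR_LEN + 20)
		return FRAME_INVALID;

	payload_len = frame_len - ETH_HDR_LEN;
	/* tot_len is a big-endian 16-bit field at offset 2 of the IPv4 header */
	tot_len = ((size_t)frame[ETH_HDR_LEN + 2] << 8) | frame[ETH_HDR_LEN + 3];

	if (tot_len == payload_len)
		return FRAME_EXACT;
	return tot_len < payload_len ? FRAME_PADDED : FRAME_TRUNCATED;
}
```

If the test frames were minimum-size 60-byte frames (an assumption, the thread does not say), tot_len==40 would land in the padded case and tot_len==80 in the truncated one.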

Re: [PATCH] udp: Allow to defer reception until connect() happened

2018-11-28 Thread Eric Dumazet

On Wed, Nov 28, 2018 at 7:09 PM Jana Iyengar  wrote:
>
>
> On Wed, Nov 28, 2018 at 6:19 PM Eric Dumazet  wrote:
>>
>> On Wed, Nov 28, 2018 at 5:57 PM Christoph Paasch  wrote:
>> >
>> > There are use-cases where a host wants to use a UDP socket with a
>> > specific 4-tuple. The way to do this is to bind() and then connect() the
>> > socket. However, after the bind(), the socket starts receiving data even
>> > if it does not match the intended 4-tuple. That is because after the
>> > bind() UDP-socket will match in the lookup for all incoming UDP-traffic
>> > that has the specific IP/port.
>> >
>> > This patch prevents any incoming traffic until the connect() system-call
>> > is called whenever the app sets the UDP socket-option
>> > UDP_WAIT_FOR_CONNECT.
>>
>> Please do not add something that could mislead applications writers to
>> think UDP stack can scale.
>
>
>> UDP stack does not have a full hash on 4-tuples, it means that
>> incoming traffic on a 'shared port' has
>> to scan a list of XXX sockets to find the best match ...
>
>
>> Also you add another cache line miss in UDP lookup to access
>> udp_sk()->wait_for_connect.
>>
>> recvfrom() can be used to filter whatever frame that came before the 
>> connect()
>
>
> I don't think I understand that argument -- connect() is supported for UDP 
> sockets, and UDP sockets are being used for serving QUIC traffic. Are you 
> suggesting that connect() never be used?

If the source port is not shared, Christoph's patch is not needed.

If it is shared, then a whole can of worms is opened.

Trying to hack the UDP stack while it is not fully 4-tuple ready is not
going to fly.
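
To make the scaling concern concrete, here is a toy user-space model of a best-match scan over sockets that share one local port. It is not the kernel's UDP lookup: struct toy_sock, score() and lookup() are hypothetical names, and the scoring only mirrors the general idea that a connected socket (full 4-tuple) beats a merely bound one.

```c
#include <stddef.h>
#include <stdint.h>

struct toy_sock {
	uint32_t laddr; uint16_t lport; /* set by bind() */
	uint32_t raddr; uint16_t rport; /* set by connect(); 0 means unconnected */
};

/* Return -1 on mismatch, otherwise a score that grows with specificity. */
static int score(const struct toy_sock *sk, uint32_t saddr, uint16_t sport,
		 uint32_t daddr, uint16_t dport)
{
	int s = 0;

	if (sk->lport != dport || sk->laddr != daddr)
		return -1;
	if (sk->raddr) {
		if (sk->raddr != saddr)
			return -1;
		s++;
	}
	if (sk->rport) {
		if (sk->rport != sport)
			return -1;
		s++;
	}
	return s;
}

/* Linear scan: every socket on the shared port is visited for every packet. */
static const struct toy_sock *lookup(const struct toy_sock *socks, size_t n,
				     uint32_t saddr, uint16_t sport,
				     uint32_t daddr, uint16_t dport)
{
	const struct toy_sock *best = NULL;
	int best_score = -1;

	for (size_t i = 0; i < n; i++) {
		int s = score(&socks[i], saddr, sport, daddr, dport);

		if (s > best_score) {
			best_score = s;
			best = &socks[i];
		}
	}
	return best;
}
```

In this model the cost of each lookup is linear in the number of sockets on the port, and a bound but not yet connected socket still matches traffic from any peer, which is the behaviour the proposed socket option tries to suppress.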

Re: [PATCH net-next] net: qualcomm: rmnet: Remove set but not used variable 'cmd'

2018-11-28 Thread Subash Abhinov Kasiviswanathan


On 2018-11-28 19:36, YueHaibing wrote:

Fixes gcc '-Wunused-but-set-variable' warning:

drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c: In function
'rmnet_map_do_flow_control':
drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c:23:36: warning:
 variable 'cmd' set but not used [-Wunused-but-set-variable]
  struct rmnet_map_control_command *cmd;

'cmd' is no longer used, so it should be removed as well.

Signed-off-by: YueHaibing 
---
 drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c
index 8990307..f6cf59a 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c
@@ -20,14 +20,12 @@ static u8 rmnet_map_do_flow_control(struct sk_buff 
*skb,

struct rmnet_port *port,
int enable)
 {
-   struct rmnet_map_control_command *cmd;
struct rmnet_endpoint *ep;
struct net_device *vnd;
u8 mux_id;
int r;

mux_id = RMNET_MAP_GET_MUX_ID(skb);
-   cmd = RMNET_MAP_GET_CMD_START(skb);

if (mux_id >= RMNET_MAX_LOGICAL_EP) {
kfree_skb(skb);


Thanks for cleaning these up.

Acked-by: Subash Abhinov Kasiviswanathan 

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

[PATCH net-next] net: qualcomm: rmnet: Remove set but not used variable 'cmd'

2018-11-28 Thread YueHaibing

Fixes gcc '-Wunused-but-set-variable' warning:

drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c: In function 
'rmnet_map_do_flow_control':
drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c:23:36: warning:
 variable 'cmd' set but not used [-Wunused-but-set-variable]
  struct rmnet_map_control_command *cmd;

'cmd' is no longer used, so it should be removed as well.

Signed-off-by: YueHaibing 
---
 drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c
index 8990307..f6cf59a 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c
@@ -20,14 +20,12 @@ static u8 rmnet_map_do_flow_control(struct sk_buff *skb,
struct rmnet_port *port,
int enable)
 {
-   struct rmnet_map_control_command *cmd;
struct rmnet_endpoint *ep;
struct net_device *vnd;
u8 mux_id;
int r;
 
mux_id = RMNET_MAP_GET_MUX_ID(skb);
-   cmd = RMNET_MAP_GET_CMD_START(skb);
 
if (mux_id >= RMNET_MAX_LOGICAL_EP) {
kfree_skb(skb);

[PATCH net-next,v4 06/12] drivers: net: use flow action infrastructure

2018-11-28 Thread Pablo Neira Ayuso

This patch updates drivers to use the new flow action infrastructure.
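
The conversion pattern used in the driver hunks (a chain of is_tcf_*() tests replaced by a switch on act->id) can be sketched in user space as follows. All toy_* names and flag values are hypothetical stand-ins, not the kernel definitions; only the control flow is meant to match.

```c
#include <stddef.h>

enum toy_action_id {
	TOY_ACTION_DROP,
	TOY_ACTION_REDIRECT,
	TOY_ACTION_VLAN_POP,
	TOY_ACTION_MARK,
};

struct toy_action_entry { enum toy_action_id id; };

struct toy_flow_action {
	size_t num_entries;
	const struct toy_action_entry *entries;
};

#define TOY_FLAG_DROP     (1u << 0)
#define TOY_FLAG_FWD      (1u << 1)
#define TOY_FLAG_POP_VLAN (1u << 2)

/* Returns 0 and fills *flags, or -1 for an empty or unsupported action list. */
static int toy_parse_actions(const struct toy_flow_action *fa, unsigned int *flags)
{
	*flags = 0;
	if (fa->num_entries == 0)
		return -1;

	for (size_t i = 0; i < fa->num_entries; i++) {
		switch (fa->entries[i].id) {
		case TOY_ACTION_DROP:
			*flags = TOY_FLAG_DROP;
			return 0; /* drop wins; later actions are irrelevant */
		case TOY_ACTION_REDIRECT:
			*flags |= TOY_FLAG_FWD;
			break;
		case TOY_ACTION_VLAN_POP:
			*flags |= TOY_FLAG_POP_VLAN;
			break;
		default:
			return -1; /* this toy "hardware" cannot offload it */
		}
	}
	return 0;
}
```

Unlike the bnxt code, which ORs the drop flag into whatever was collected so far, this sketch resets the flags on drop to keep the result easy to check.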

Signed-off-by: Pablo Neira Ayuso 
---
v4: rebase.

 drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c   |  74 +++---
 .../net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c   | 250 +--
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c| 266 ++---
 drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c |   2 +-
 .../net/ethernet/mellanox/mlxsw/spectrum_flower.c  |  54 +++--
 drivers/net/ethernet/netronome/nfp/flower/action.c | 185 +++---
 drivers/net/ethernet/qlogic/qede/qede_filter.c |  12 +-
 7 files changed, 417 insertions(+), 426 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
index 09cd75f54eba..b7bd27edd80e 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
@@ -61,9 +61,9 @@ static u16 bnxt_flow_get_dst_fid(struct bnxt *pf_bp, struct 
net_device *dev)
 
 static int bnxt_tc_parse_redir(struct bnxt *bp,
   struct bnxt_tc_actions *actions,
-  const struct tc_action *tc_act)
+  const struct flow_action_entry *act)
 {
-   struct net_device *dev = tcf_mirred_dev(tc_act);
+   struct net_device *dev = act->dev;
 
if (!dev) {
netdev_info(bp->dev, "no dev in mirred action");
@@ -77,16 +77,16 @@ static int bnxt_tc_parse_redir(struct bnxt *bp,
 
 static int bnxt_tc_parse_vlan(struct bnxt *bp,
  struct bnxt_tc_actions *actions,
- const struct tc_action *tc_act)
+ const struct flow_action_entry *act)
 {
-   switch (tcf_vlan_action(tc_act)) {
-   case TCA_VLAN_ACT_POP:
+   switch (act->id) {
+   case FLOW_ACTION_VLAN_POP:
actions->flags |= BNXT_TC_ACTION_FLAG_POP_VLAN;
break;
-   case TCA_VLAN_ACT_PUSH:
+   case FLOW_ACTION_VLAN_PUSH:
actions->flags |= BNXT_TC_ACTION_FLAG_PUSH_VLAN;
-   actions->push_vlan_tci = htons(tcf_vlan_push_vid(tc_act));
-   actions->push_vlan_tpid = tcf_vlan_push_proto(tc_act);
+   actions->push_vlan_tci = htons(act->vlan.vid);
+   actions->push_vlan_tpid = act->vlan.proto;
break;
default:
return -EOPNOTSUPP;
@@ -96,10 +96,10 @@ static int bnxt_tc_parse_vlan(struct bnxt *bp,
 
 static int bnxt_tc_parse_tunnel_set(struct bnxt *bp,
struct bnxt_tc_actions *actions,
-   const struct tc_action *tc_act)
+   const struct flow_action_entry *act)
 {
-   struct ip_tunnel_info *tun_info = tcf_tunnel_info(tc_act);
-   struct ip_tunnel_key *tun_key = &tun_info->key;
+   const struct ip_tunnel_info *tun_info = act->tunnel;
+   const struct ip_tunnel_key *tun_key = &tun_info->key;
 
if (ip_tunnel_info_af(tun_info) != AF_INET) {
netdev_info(bp->dev, "only IPv4 tunnel-encap is supported");
@@ -113,51 +113,43 @@ static int bnxt_tc_parse_tunnel_set(struct bnxt *bp,
 
 static int bnxt_tc_parse_actions(struct bnxt *bp,
 struct bnxt_tc_actions *actions,
-struct tcf_exts *tc_exts)
+struct flow_action *flow_action)
 {
-   const struct tc_action *tc_act;
+   struct flow_action_entry *act;
int i, rc;
 
-   if (!tcf_exts_has_actions(tc_exts)) {
+   if (!flow_action_has_entries(flow_action)) {
netdev_info(bp->dev, "no actions");
return -EINVAL;
}
 
-   tcf_exts_for_each_action(i, tc_act, tc_exts) {
-   /* Drop action */
-   if (is_tcf_gact_shot(tc_act)) {
+   flow_action_for_each(i, act, flow_action) {
+   switch (act->id) {
+   case FLOW_ACTION_DROP:
actions->flags |= BNXT_TC_ACTION_FLAG_DROP;
return 0; /* don't bother with other actions */
-   }
-
-   /* Redirect action */
-   if (is_tcf_mirred_egress_redirect(tc_act)) {
-   rc = bnxt_tc_parse_redir(bp, actions, tc_act);
+   case FLOW_ACTION_REDIRECT:
+   rc = bnxt_tc_parse_redir(bp, actions, act);
if (rc)
return rc;
-   continue;
-   }
-
-   /* Push/pop VLAN */
-   if (is_tcf_vlan(tc_act)) {
-   rc = bnxt_tc_parse_vlan(bp, actions, tc_act);
+   break;
+   case FLOW_ACTION_VLAN_POP:
+   case FLOW_ACTION_VLAN_PUSH:
+   case FLOW_ACTION_VLAN_MANGLE:
+   rc = bnxt_tc_parse_vlan(bp, actions, act);

[PATCH net-next,v4 01/12] flow_offload: add flow_rule and flow_match structures and use them

2018-11-28 Thread Pablo Neira Ayuso

This patch wraps the dissector key and mask - that flower uses to
represent the matching side - around the flow_match structure.

To avoid a follow up patch that would edit the same LoCs in the drivers,
this patch also wraps this new flow match structure around the flow rule
object. This new structure will also contain the flow actions in follow
up patches.

This introduces two new interfaces:

bool flow_rule_match_key(rule, dissector_id)

that returns true if a given matching key is set, and:

flow_rule_match_XYZ(rule, &match);

which fetches the matching side XYZ into the match container structure,
retrieving the key and the mask with a single call.
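
The shape of these two interfaces can be illustrated with a small user-space model. The toy_* types and functions below are simplified, hypothetical stand-ins (the real structures live in include/net/flow_offload.h and carry more state); the sketch only shows the "test the key, then fetch key and mask together" calling convention.

```c
#include <stdbool.h>
#include <stdint.h>

enum toy_key_id { TOY_KEY_BASIC, TOY_KEY_PORTS };

struct toy_key_basic { uint16_t n_proto; uint8_t ip_proto; };

struct toy_rule {
	unsigned int used_keys; /* bitmap indexed by enum toy_key_id */
	struct toy_key_basic basic_key;
	struct toy_key_basic basic_mask;
};

/* Match container: key and mask pointers filled by one call. */
struct toy_match_basic {
	const struct toy_key_basic *key, *mask;
};

/* Analogue of flow_rule_match_key(): is this key part of the rule? */
static bool toy_rule_match_key(const struct toy_rule *rule, enum toy_key_id id)
{
	return rule->used_keys & (1u << id);
}

/* Analogue of flow_rule_match_XYZ() for the "basic" key. */
static void toy_rule_match_basic(const struct toy_rule *rule,
				 struct toy_match_basic *out)
{
	out->key = &rule->basic_key;
	out->mask = &rule->basic_mask;
}
```

A caller would first check toy_rule_match_key(&rule, TOY_KEY_BASIC) and only then call toy_rule_match_basic(&rule, &match), reading match.key and match.mask side by side as the converted drivers do.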

Signed-off-by: Pablo Neira Ayuso 
---
v4: rebase.

 drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c   | 174 -
 .../net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c   | 194 --
 drivers/net/ethernet/intel/i40e/i40e_main.c| 178 -
 drivers/net/ethernet/intel/iavf/iavf_main.c| 195 --
 drivers/net/ethernet/intel/igb/igb_main.c  |  64 ++--
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c| 420 +
 .../net/ethernet/mellanox/mlxsw/spectrum_flower.c  | 202 +-
 drivers/net/ethernet/netronome/nfp/flower/action.c |  11 +-
 drivers/net/ethernet/netronome/nfp/flower/match.c  | 417 ++--
 .../net/ethernet/netronome/nfp/flower/offload.c| 145 +++
 drivers/net/ethernet/qlogic/qede/qede_filter.c |  85 ++---
 include/net/flow_offload.h | 115 ++
 include/net/pkt_cls.h  |  11 +-
 net/core/Makefile  |   2 +-
 net/core/flow_offload.c| 143 +++
 net/sched/cls_flower.c |  45 ++-
 16 files changed, 1195 insertions(+), 1206 deletions(-)
 create mode 100644 include/net/flow_offload.h
 create mode 100644 net/core/flow_offload.c

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
index 749f63beddd8..b82143d6cdde 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
@@ -177,18 +177,12 @@ static int bnxt_tc_parse_actions(struct bnxt *bp,
return 0;
 }
 
-#define GET_KEY(flow_cmd, key_type)\
-   skb_flow_dissector_target((flow_cmd)->dissector, key_type,\
- (flow_cmd)->key)
-#define GET_MASK(flow_cmd, key_type)   \
-   skb_flow_dissector_target((flow_cmd)->dissector, key_type,\
- (flow_cmd)->mask)
-
 static int bnxt_tc_parse_flow(struct bnxt *bp,
  struct tc_cls_flower_offload *tc_flow_cmd,
  struct bnxt_tc_flow *flow)
 {
-   struct flow_dissector *dissector = tc_flow_cmd->dissector;
+   struct flow_rule *rule = tc_cls_flower_offload_flow_rule(tc_flow_cmd);
+   struct flow_dissector *dissector = rule->match.dissector;
 
/* KEY_CONTROL and KEY_BASIC are needed for forming a meaningful key */
if ((dissector->used_keys & BIT(FLOW_DISSECTOR_KEY_CONTROL)) == 0 ||
@@ -198,140 +192,120 @@ static int bnxt_tc_parse_flow(struct bnxt *bp,
return -EOPNOTSUPP;
}
 
-   if (dissector_uses_key(dissector, FLOW_DISSECTOR_KEY_BASIC)) {
-   struct flow_dissector_key_basic *key =
-   GET_KEY(tc_flow_cmd, FLOW_DISSECTOR_KEY_BASIC);
-   struct flow_dissector_key_basic *mask =
-   GET_MASK(tc_flow_cmd, FLOW_DISSECTOR_KEY_BASIC);
+   if (flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_BASIC)) {
+   struct flow_match_basic match;
 
-   flow->l2_key.ether_type = key->n_proto;
-   flow->l2_mask.ether_type = mask->n_proto;
+   flow_rule_match_basic(rule, &match);
+   flow->l2_key.ether_type = match.key->n_proto;
+   flow->l2_mask.ether_type = match.mask->n_proto;
 
-   if (key->n_proto == htons(ETH_P_IP) ||
-   key->n_proto == htons(ETH_P_IPV6)) {
-   flow->l4_key.ip_proto = key->ip_proto;
-   flow->l4_mask.ip_proto = mask->ip_proto;
+   if (match.key->n_proto == htons(ETH_P_IP) ||
+   match.key->n_proto == htons(ETH_P_IPV6)) {
+   flow->l4_key.ip_proto = match.key->ip_proto;
+   flow->l4_mask.ip_proto = match.mask->ip_proto;
}
}
 
-   if (dissector_uses_key(dissector, FLOW_DISSECTOR_KEY_ETH_ADDRS)) {
-   struct flow_dissector_key_eth_addrs *key =
-   GET_KEY(tc_flow_cmd, FLOW_DISSECTOR_KEY_ETH_ADDRS);
-   struct flow_dissector_key_eth_addrs *mask =
-   GET_MASK(tc_flow_cmd, FLOW_DISSECTOR_KEY_ETH_ADDRS);
+

[PATCH net-next,v4 10/12] dsa: bcm_sf2: use flow_rule infrastructure

2018-11-28 Thread Pablo Neira Ayuso

Update this driver to use the flow_rule infrastructure, hence we can use
the same code to populate hardware IR from ethtool_rx_flow and the
cls_flower interfaces.

Signed-off-by: Pablo Neira Ayuso 
---
v4: cosmetic changes requested by Florian Fainelli.

 drivers/net/dsa/bcm_sf2_cfp.c | 98 +++
 1 file changed, 63 insertions(+), 35 deletions(-)

diff --git a/drivers/net/dsa/bcm_sf2_cfp.c b/drivers/net/dsa/bcm_sf2_cfp.c
index e14663ab6dbc..06a7f5f022a0 100644
--- a/drivers/net/dsa/bcm_sf2_cfp.c
+++ b/drivers/net/dsa/bcm_sf2_cfp.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "bcm_sf2.h"
 #include "bcm_sf2_regs.h"
@@ -257,7 +258,8 @@ static int bcm_sf2_cfp_act_pol_set(struct bcm_sf2_priv 
*priv,
 }
 
 static void bcm_sf2_cfp_slice_ipv4(struct bcm_sf2_priv *priv,
-  struct ethtool_tcpip4_spec *v4_spec,
+  struct flow_dissector_key_ipv4_addrs *addrs,
+  struct flow_dissector_key_ports *ports,
   unsigned int slice_num,
   bool mask)
 {
@@ -278,7 +280,7 @@ static void bcm_sf2_cfp_slice_ipv4(struct bcm_sf2_priv 
*priv,
 * UDF_n_A6 [23:8]
 * UDF_n_A5 [7:0]
 */
-   reg = be16_to_cpu(v4_spec->pdst) >> 8;
+   reg = be16_to_cpu(ports->dst) >> 8;
if (mask)
offset = CORE_CFP_MASK_PORT(3);
else
@@ -289,9 +291,9 @@ static void bcm_sf2_cfp_slice_ipv4(struct bcm_sf2_priv 
*priv,
 * UDF_n_A4 [23:8]
 * UDF_n_A3 [7:0]
 */
-   reg = (be16_to_cpu(v4_spec->pdst) & 0xff) << 24 |
- (u32)be16_to_cpu(v4_spec->psrc) << 8 |
- (be32_to_cpu(v4_spec->ip4dst) & 0xff00) >> 8;
+   reg = (be16_to_cpu(ports->dst) & 0xff) << 24 |
+ (u32)be16_to_cpu(ports->src) << 8 |
+ (be32_to_cpu(addrs->dst) & 0xff00) >> 8;
if (mask)
offset = CORE_CFP_MASK_PORT(2);
else
@@ -302,9 +304,9 @@ static void bcm_sf2_cfp_slice_ipv4(struct bcm_sf2_priv 
*priv,
 * UDF_n_A2 [23:8]
 * UDF_n_A1 [7:0]
 */
-   reg = (u32)(be32_to_cpu(v4_spec->ip4dst) & 0xff) << 24 |
- (u32)(be32_to_cpu(v4_spec->ip4dst) >> 16) << 8 |
- (be32_to_cpu(v4_spec->ip4src) & 0xff00) >> 8;
+   reg = (u32)(be32_to_cpu(addrs->dst) & 0xff) << 24 |
+ (u32)(be32_to_cpu(addrs->dst) >> 16) << 8 |
+ (be32_to_cpu(addrs->src) & 0xff00) >> 8;
if (mask)
offset = CORE_CFP_MASK_PORT(1);
else
@@ -317,8 +319,8 @@ static void bcm_sf2_cfp_slice_ipv4(struct bcm_sf2_priv 
*priv,
 * Slice ID [3:2]
 * Slice valid  [1:0]
 */
-   reg = (u32)(be32_to_cpu(v4_spec->ip4src) & 0xff) << 24 |
- (u32)(be32_to_cpu(v4_spec->ip4src) >> 16) << 8 |
+   reg = (u32)(be32_to_cpu(addrs->src) & 0xff) << 24 |
+ (u32)(be32_to_cpu(addrs->src) >> 16) << 8 |
  SLICE_NUM(slice_num) | SLICE_VALID;
if (mask)
offset = CORE_CFP_MASK_PORT(0);
@@ -332,9 +334,12 @@ static int bcm_sf2_cfp_ipv4_rule_set(struct bcm_sf2_priv 
*priv, int port,
 unsigned int queue_num,
 struct ethtool_rx_flow_spec *fs)
 {
-   struct ethtool_tcpip4_spec *v4_spec, *v4_m_spec;
const struct cfp_udf_layout *layout;
unsigned int slice_num, rule_index;
+   struct ethtool_rx_flow_rule *flow;
+   struct flow_match_ipv4_addrs ipv4;
+   struct flow_match_ports ports;
+   struct flow_match_ip ip;
u8 ip_proto, ip_frag;
u8 num_udf;
u32 reg;
@@ -343,13 +348,9 @@ static int bcm_sf2_cfp_ipv4_rule_set(struct bcm_sf2_priv 
*priv, int port,
switch (fs->flow_type & ~FLOW_EXT) {
case TCP_V4_FLOW:
ip_proto = IPPROTO_TCP;
-   v4_spec = &fs->h_u.tcp_ip4_spec;
-   v4_m_spec = &fs->m_u.tcp_ip4_spec;
break;
case UDP_V4_FLOW:
ip_proto = IPPROTO_UDP;
-   v4_spec = &fs->h_u.udp_ip4_spec;
-   v4_m_spec = &fs->m_u.udp_ip4_spec;
break;
default:
return -EINVAL;
@@ -367,11 +368,21 @@ static int bcm_sf2_cfp_ipv4_rule_set(struct bcm_sf2_priv 
*priv, int port,
if (rule_index > bcm_sf2_cfp_rule_size(priv))
return -ENOSPC;
 
+   flow = ethtool_rx_flow_rule_create(fs);
+   if (IS_ERR(flow))
+   return PTR_ERR(flow);
+
+   flow_rule_match_ipv4_addrs(flow->rule, &ipv4);
+   flow_rule_match_ports(flow->rule, &ports);
+   flow_rule_match_ip(flow->rule, &ip);
+
layout = &udf_tcpip4_layout;
/* We only use one UDF slice for now */
slice_num = bcm_sf2_get_slice_number(layout, 0);
-   if

[PATCH net-next,v4 07/12] cls_flower: don't expose TC actions to drivers anymore

2018-11-28 Thread Pablo Neira Ayuso

Now that drivers have been converted to use the flow action
infrastructure, remove this field from the tc_cls_flower_offload
structure.

Signed-off-by: Pablo Neira Ayuso 
---
v4: rebase.

 include/net/pkt_cls.h  | 1 -
 net/sched/cls_flower.c | 5 -
 2 files changed, 6 deletions(-)

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index a08c06e383db..9bd724bfa860 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -768,7 +768,6 @@ struct tc_cls_flower_offload {
unsigned long cookie;
struct flow_rule *rule;
struct flow_stats stats;
-   struct tcf_exts *exts;
u32 classid;
 };
 
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index b88cf29aff7b..ea92228ddc12 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -392,7 +392,6 @@ static int fl_hw_replace_filter(struct tcf_proto *tp,
cls_flower.rule->match.dissector = &f->mask->dissector;
cls_flower.rule->match.mask = &f->mask->key;
cls_flower.rule->match.key = &f->mkey;
-   cls_flower.exts = &f->exts;
cls_flower.classid = f->res.classid;
 
err = tc_setup_flow_action(&cls_flower.rule->action, &f->exts);
@@ -427,7 +426,6 @@ static void fl_hw_update_stats(struct tcf_proto *tp, struct 
cls_fl_filter *f)
tc_cls_common_offload_init(&cls_flower.common, tp, f->flags, NULL);
cls_flower.command = TC_CLSFLOWER_STATS;
cls_flower.cookie = (unsigned long) f;
-   cls_flower.exts = &f->exts;
cls_flower.classid = f->res.classid;
 
tc_setup_cb_call(block, &f->exts, TC_SETUP_CLSFLOWER,
@@ -1490,7 +1488,6 @@ static int fl_reoffload(struct tcf_proto *tp, bool add, 
tc_setup_cb_t *cb,
cls_flower.rule->match.dissector = &mask->dissector;
cls_flower.rule->match.mask = &mask->key;
cls_flower.rule->match.key = &f->mkey;
-   cls_flower.exts = &f->exts;
 
err = tc_setup_flow_action(&cls_flower.rule->action,
   &f->exts);
@@ -1523,7 +1520,6 @@ static int fl_hw_create_tmplt(struct tcf_chain *chain,
 {
struct tc_cls_flower_offload cls_flower = {};
struct tcf_block *block = chain->block;
-   struct tcf_exts dummy_exts = { 0, };
 
cls_flower.rule = flow_rule_alloc(0);
if (!cls_flower.rule)
@@ -1535,7 +1531,6 @@ static int fl_hw_create_tmplt(struct tcf_chain *chain,
cls_flower.rule->match.dissector = &tmplt->dissector;
cls_flower.rule->match.mask = &tmplt->mask;
cls_flower.rule->match.key = &tmplt->dummy_key;
-   cls_flower.exts = &dummy_exts;
 
/* We don't care if driver (any of them) fails to handle this
 * call. It serves just as a hint for it.
-- 
2.11.0

[PATCH net-next,v4 05/12] flow_offload: add statistics retrieval infrastructure and use it

2018-11-28 Thread Pablo Neira Ayuso

This patch provides the flow_stats structure, which acts as a statistics
container in tc_cls_flower_offload and which we can then use to restore
the statistics on the existing TC actions. Hence, tcf_exts_stats_update()
is no longer used by drivers.
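
As a rough illustration of how such a container might be fed by a driver, here is a user-space sketch. struct toy_flow_stats mirrors the three fields added by this patch, but toy_flow_stats_update() and its semantics (counters accumulate, lastused keeps the newest timestamp) are an assumption, since the helper body is not shown in this message.

```c
#include <stdint.h>

struct toy_flow_stats {
	uint64_t pkts;
	uint64_t bytes;
	uint64_t lastused;
};

/* Assumed semantics: counters accumulate, lastused keeps the newest value. */
static void toy_flow_stats_update(struct toy_flow_stats *st, uint64_t bytes,
				  uint64_t pkts, uint64_t lastused)
{
	st->pkts += pkts;
	st->bytes += bytes;
	if (lastused > st->lastused)
		st->lastused = lastused;
}
```

The argument order (bytes, then packets, then lastused) follows the driver call sites in the diff below.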

Signed-off-by: Pablo Neira Ayuso 
---
v4: rebase.

 drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c  |  4 ++--
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c  |  6 +++---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c   |  2 +-
 drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c |  2 +-
 drivers/net/ethernet/netronome/nfp/flower/offload.c   |  5 ++---
 include/net/flow_offload.h| 14 ++
 include/net/pkt_cls.h |  1 +
 net/sched/cls_flower.c|  4 
 8 files changed, 28 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
index b82143d6cdde..09cd75f54eba 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
@@ -1366,8 +1366,8 @@ static int bnxt_tc_get_flow_stats(struct bnxt *bp,
lastused = flow->lastused;
spin_unlock(&flow->stats_lock);
 
-   tcf_exts_stats_update(tc_flow_cmd->exts, stats.bytes, stats.packets,
- lastused);
+   flow_stats_update(&tc_flow_cmd->stats, stats.bytes, stats.packets,
+ lastused);
return 0;
 }
 
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c
index 39c5af5dad3d..8a2d66ee1d7b 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c
@@ -807,9 +807,9 @@ int cxgb4_tc_flower_stats(struct net_device *dev,
if (ofld_stats->packet_count != packets) {
if (ofld_stats->prev_packet_count != packets)
ofld_stats->last_used = jiffies;
-   tcf_exts_stats_update(cls->exts, bytes - ofld_stats->byte_count,
- packets - ofld_stats->packet_count,
- ofld_stats->last_used);
+   flow_stats_update(&cls->stats, bytes - ofld_stats->byte_count,
+ packets - ofld_stats->packet_count,
+ ofld_stats->last_used);
 
ofld_stats->packet_count = packets;
ofld_stats->byte_count = bytes;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index afabd5e530f0..6bc5c57bda9d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -3225,7 +3225,7 @@ int mlx5e_stats_flower(struct mlx5e_priv *priv,
 
mlx5_fc_query_cached(counter, &bytes, &packets, &lastuse);
 
-   tcf_exts_stats_update(f->exts, bytes, packets, lastuse);
+   flow_stats_update(&f->stats, bytes, packets, lastuse);
 
return 0;
 }
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c
index e6c4c672b1ca..60900e53243b 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c
@@ -460,7 +460,7 @@ int mlxsw_sp_flower_stats(struct mlxsw_sp *mlxsw_sp,
if (err)
goto err_rule_get_stats;
 
-   tcf_exts_stats_update(f->exts, bytes, packets, lastuse);
+   flow_stats_update(&f->stats, bytes, packets, lastuse);
 
mlxsw_sp_acl_ruleset_put(mlxsw_sp, ruleset);
return 0;
diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c 
b/drivers/net/ethernet/netronome/nfp/flower/offload.c
index 708331234908..524b9ae1a639 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c
@@ -532,9 +532,8 @@ nfp_flower_get_stats(struct nfp_app *app, struct net_device 
*netdev,
ctx_id = be32_to_cpu(nfp_flow->meta.host_ctx_id);
 
spin_lock_bh(&priv->stats_lock);
-   tcf_exts_stats_update(flow->exts, priv->stats[ctx_id].bytes,
- priv->stats[ctx_id].pkts,
- priv->stats[ctx_id].used);
+   flow_stats_update(&flow->stats, priv->stats[ctx_id].bytes,
+ priv->stats[ctx_id].pkts, priv->stats[ctx_id].used);
 
priv->stats[ctx_id].pkts = 0;
priv->stats[ctx_id].bytes = 0;
diff --git a/include/net/flow_offload.h b/include/net/flow_offload.h
index dabc819b6cc9..040c092c000a 100644
--- a/include/net/flow_offload.h
+++ b/include/net/flow_offload.h
@@ -179,4 +179,18 @@ static inline bool flow_rule_match_key(const struct 
flow_rule *rule,
return dissector_uses_key(rule->match.dissector, key);
 }
 
+struct flow_stats {
+   u64 pkts;
+   u64 bytes;
+   u64 lastused;
+};
+
+static inline

[PATCH net-next,v4 11/12] qede: place ethtool_rx_flow_spec after code after TC flower codebase

2018-11-28 Thread Pablo Neira Ayuso

This is a preparation patch to reuse the existing TC flower codebase
from ethtool_rx_flow_spec.

This patch is merely moving the core ethtool_rx_flow_spec parser after
tc flower offload driver code so we can skip a few forward function
declarations in the follow up patch.

Signed-off-by: Pablo Neira Ayuso 
---
v4: no changes.

 drivers/net/ethernet/qlogic/qede/qede_filter.c | 264 -
 1 file changed, 132 insertions(+), 132 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qede/qede_filter.c 
b/drivers/net/ethernet/qlogic/qede/qede_filter.c
index 833c9ec58a6e..ed77950f6cf9 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_filter.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_filter.c
@@ -1791,72 +1791,6 @@ static int qede_flow_spec_to_tuple_udpv6(struct qede_dev 
*edev,
return 0;
 }
 
-static int qede_flow_spec_to_tuple(struct qede_dev *edev,
-  struct qede_arfs_tuple *t,
-  struct ethtool_rx_flow_spec *fs)
-{
-   memset(t, 0, sizeof(*t));
-
-   if (qede_flow_spec_validate_unused(edev, fs))
-   return -EOPNOTSUPP;
-
-   switch ((fs->flow_type & ~FLOW_EXT)) {
-   case TCP_V4_FLOW:
-   return qede_flow_spec_to_tuple_tcpv4(edev, t, fs);
-   case UDP_V4_FLOW:
-   return qede_flow_spec_to_tuple_udpv4(edev, t, fs);
-   case TCP_V6_FLOW:
-   return qede_flow_spec_to_tuple_tcpv6(edev, t, fs);
-   case UDP_V6_FLOW:
-   return qede_flow_spec_to_tuple_udpv6(edev, t, fs);
-   default:
-   DP_VERBOSE(edev, NETIF_MSG_IFUP,
-  "Can't support flow of type %08x\n", fs->flow_type);
-   return -EOPNOTSUPP;
-   }
-
-   return 0;
-}
-
-static int qede_flow_spec_validate(struct qede_dev *edev,
-  struct ethtool_rx_flow_spec *fs,
-  struct qede_arfs_tuple *t)
-{
-   if (fs->location >= QEDE_RFS_MAX_FLTR) {
-   DP_INFO(edev, "Location out-of-bounds\n");
-   return -EINVAL;
-   }
-
-   /* Check location isn't already in use */
-   if (test_bit(fs->location, edev->arfs->arfs_fltr_bmap)) {
-   DP_INFO(edev, "Location already in use\n");
-   return -EINVAL;
-   }
-
-   /* Check if the filtering-mode could support the filter */
-   if (edev->arfs->filter_count &&
-   edev->arfs->mode != t->mode) {
-   DP_INFO(edev,
-   "flow_spec would require filtering mode %08x, but %08x 
is configured\n",
-   t->mode, edev->arfs->filter_count);
-   return -EINVAL;
-   }
-
-   /* If drop requested then no need to validate other data */
-   if (fs->ring_cookie == RX_CLS_FLOW_DISC)
-   return 0;
-
-   if (ethtool_get_flow_spec_ring_vf(fs->ring_cookie))
-   return 0;
-
-   if (fs->ring_cookie >= QEDE_RSS_COUNT(edev)) {
-   DP_INFO(edev, "Queue out-of-bounds\n");
-   return -EINVAL;
-   }
-
-   return 0;
-}
-
 /* Must be called while qede lock is held */
 static struct qede_arfs_fltr_node *
 qede_flow_find_fltr(struct qede_dev *edev, struct qede_arfs_tuple *t)
@@ -1896,72 +1830,6 @@ static void qede_flow_set_destination(struct qede_dev 
*edev,
   "Configuring N-tuple for VF 0x%02x\n", n->vfid - 1);
 }
 
-int qede_add_cls_rule(struct qede_dev *edev, struct ethtool_rxnfc *info)
-{
-   struct ethtool_rx_flow_spec *fsp = &info->fs;
-   struct qede_arfs_fltr_node *n;
-   struct qede_arfs_tuple t;
-   int min_hlen, rc;
-
-   __qede_lock(edev);
-
-   if (!edev->arfs) {
-   rc = -EPERM;
-   goto unlock;
-   }
-
-   /* Translate the flow specification into something fittign our DB */
-   rc = qede_flow_spec_to_tuple(edev, &t, fsp);
-   if (rc)
-   goto unlock;
-
-   /* Make sure location is valid and filter isn't already set */
-   rc = qede_flow_spec_validate(edev, fsp, &t);
-   if (rc)
-   goto unlock;
-
-   if (qede_flow_find_fltr(edev, &t)) {
-   rc = -EINVAL;
-   goto unlock;
-   }
-
-   n = kzalloc(sizeof(*n), GFP_KERNEL);
-   if (!n) {
-   rc = -ENOMEM;
-   goto unlock;
-   }
-
-   min_hlen = qede_flow_get_min_header_size(&t);
-   n->data = kzalloc(min_hlen, GFP_KERNEL);
-   if (!n->data) {
-   kfree(n);
-   rc = -ENOMEM;
-   goto unlock;
-   }
-
-   n->sw_id = fsp->location;
-   set_bit(n->sw_id, edev->arfs->arfs_fltr_bmap);
-   n->buf_len = min_hlen;
-
-   memcpy(&n->tuple, &t, sizeof(n->tuple));
-
-   qede_flow_set_destination(edev, n, fsp);
-
-   /* Build a minimal header according to the flow */
-   n->tuple.build_hdr(&n->tuple, n->data);
-
-   rc =

[PATCH net-next,v4 08/12] flow_offload: add wake-up-on-lan and queue to flow_action

2018-11-28 Thread Pablo Neira Ayuso

These actions need to be added to support bcm sf2 features available
through the ethtool_rx_flow interface.

Reviewed-by: Florian Fainelli 
Signed-off-by: Pablo Neira Ayuso 
---
v4: rebase.

 include/net/flow_offload.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/net/flow_offload.h b/include/net/flow_offload.h
index 040c092c000a..35a9c933a166 100644
--- a/include/net/flow_offload.h
+++ b/include/net/flow_offload.h
@@ -116,6 +116,8 @@ enum flow_action_id {
FLOW_ACTION_ADD,
FLOW_ACTION_CSUM,
FLOW_ACTION_MARK,
+   FLOW_ACTION_WAKE,
+   FLOW_ACTION_QUEUE,
 };
 
 /* This is mirroring enum pedit_header_type definition for easy mapping between
@@ -150,6 +152,7 @@ struct flow_action_entry {
const struct ip_tunnel_info *tunnel;/* FLOW_ACTION_TUNNEL_ENCAP */
u32 csum_flags; /* FLOW_ACTION_CSUM */
u32 mark;   /* FLOW_ACTION_MARK */
+   u32 queue_index;/* FLOW_ACTION_QUEUE */
};
 };
 
-- 
2.11.0

[PATCH net-next,v4 12/12] qede: use ethtool_rx_flow_rule() to remove duplicated parser code

2018-11-28 Thread Pablo Neira Ayuso

The qede driver supports both ethtool_rx_flow_spec and flower, and the
two codebases look very similar.

This patch uses the ethtool_rx_flow_rule() infrastructure to remove the
duplicated ethtool_rx_flow_spec parser and consolidate ACL offload
support around the flow_rule infrastructure.

Furthermore, more code can be consolidated by merging
qede_add_cls_rule() and qede_add_tc_flower_fltr(), these two functions
also look very similar.

This driver currently provides simple ACL support, such as 5-tuple
matching, drop policy and queue to CPU.

Drivers that support more features can benefit from this infrastructure
to save even more redundant codebase.

Signed-off-by: Pablo Neira Ayuso 
---
v4: Rename qede_parse_flower_attr() to qede_parse_flow_attr() and remove
_tc_ infix in parser, requested by Marcelo Ricardo Leitner.

 drivers/net/ethernet/qlogic/qede/qede_filter.c | 277 +++--
 1 file changed, 74 insertions(+), 203 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qede/qede_filter.c b/drivers/net/ethernet/qlogic/qede/qede_filter.c
index ed77950f6cf9..5674a89ceab6 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_filter.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_filter.c
@@ -1665,132 +1665,6 @@ static int qede_set_v6_tuple_to_profile(struct qede_dev *edev,
return 0;
 }
 
-static int qede_flow_spec_to_tuple_ipv4_common(struct qede_dev *edev,
-  struct qede_arfs_tuple *t,
-  struct ethtool_rx_flow_spec *fs)
-{
-   if ((fs->h_u.tcp_ip4_spec.ip4src &
-fs->m_u.tcp_ip4_spec.ip4src) != fs->h_u.tcp_ip4_spec.ip4src) {
-   DP_INFO(edev, "Don't support IP-masks\n");
-   return -EOPNOTSUPP;
-   }
-
-   if ((fs->h_u.tcp_ip4_spec.ip4dst &
-fs->m_u.tcp_ip4_spec.ip4dst) != fs->h_u.tcp_ip4_spec.ip4dst) {
-   DP_INFO(edev, "Don't support IP-masks\n");
-   return -EOPNOTSUPP;
-   }
-
-   if ((fs->h_u.tcp_ip4_spec.psrc &
-fs->m_u.tcp_ip4_spec.psrc) != fs->h_u.tcp_ip4_spec.psrc) {
-   DP_INFO(edev, "Don't support port-masks\n");
-   return -EOPNOTSUPP;
-   }
-
-   if ((fs->h_u.tcp_ip4_spec.pdst &
-fs->m_u.tcp_ip4_spec.pdst) != fs->h_u.tcp_ip4_spec.pdst) {
-   DP_INFO(edev, "Don't support port-masks\n");
-   return -EOPNOTSUPP;
-   }
-
-   if (fs->h_u.tcp_ip4_spec.tos) {
-   DP_INFO(edev, "Don't support tos\n");
-   return -EOPNOTSUPP;
-   }
-
-   t->eth_proto = htons(ETH_P_IP);
-   t->src_ipv4 = fs->h_u.tcp_ip4_spec.ip4src;
-   t->dst_ipv4 = fs->h_u.tcp_ip4_spec.ip4dst;
-   t->src_port = fs->h_u.tcp_ip4_spec.psrc;
-   t->dst_port = fs->h_u.tcp_ip4_spec.pdst;
-
-   return qede_set_v4_tuple_to_profile(edev, t);
-}
-
-static int qede_flow_spec_to_tuple_tcpv4(struct qede_dev *edev,
-struct qede_arfs_tuple *t,
-struct ethtool_rx_flow_spec *fs)
-{
-   t->ip_proto = IPPROTO_TCP;
-
-   if (qede_flow_spec_to_tuple_ipv4_common(edev, t, fs))
-   return -EINVAL;
-
-   return 0;
-}
-
-static int qede_flow_spec_to_tuple_udpv4(struct qede_dev *edev,
-struct qede_arfs_tuple *t,
-struct ethtool_rx_flow_spec *fs)
-{
-   t->ip_proto = IPPROTO_UDP;
-
-   if (qede_flow_spec_to_tuple_ipv4_common(edev, t, fs))
-   return -EINVAL;
-
-   return 0;
-}
-
-static int qede_flow_spec_to_tuple_ipv6_common(struct qede_dev *edev,
-  struct qede_arfs_tuple *t,
-  struct ethtool_rx_flow_spec *fs)
-{
-   struct in6_addr zero_addr;
-
-   memset(&zero_addr, 0, sizeof(zero_addr));
-
-   if ((fs->h_u.tcp_ip6_spec.psrc &
-fs->m_u.tcp_ip6_spec.psrc) != fs->h_u.tcp_ip6_spec.psrc) {
-   DP_INFO(edev, "Don't support port-masks\n");
-   return -EOPNOTSUPP;
-   }
-
-   if ((fs->h_u.tcp_ip6_spec.pdst &
-fs->m_u.tcp_ip6_spec.pdst) != fs->h_u.tcp_ip6_spec.pdst) {
-   DP_INFO(edev, "Don't support port-masks\n");
-   return -EOPNOTSUPP;
-   }
-
-   if (fs->h_u.tcp_ip6_spec.tclass) {
-   DP_INFO(edev, "Don't support tclass\n");
-   return -EOPNOTSUPP;
-   }
-
-   t->eth_proto = htons(ETH_P_IPV6);
-   memcpy(&t->src_ipv6, &fs->h_u.tcp_ip6_spec.ip6src,
-  sizeof(struct in6_addr));
-   memcpy(&t->dst_ipv6, &fs->h_u.tcp_ip6_spec.ip6dst,
-  sizeof(struct in6_addr));
-   t->src_port = fs->h_u.tcp_ip6_spec.psrc;
-   t->dst_port = fs->h_u.tcp_ip6_spec.pdst;
-
-   return qede_set_v6_tuple_to_profile(edev, t, &zero_addr);
-}
-
-static int qede_flow_spec_to_tuple_tcpv6(struct qede_dev *edev,
-

[PATCH net-next,v4 00/12] add flow_rule infrastructure

2018-11-28 Thread Pablo Neira Ayuso

Hi,

This patchset is another iteration to introduce an in-kernel intermediate
representation (IR) to express ACL hardware offloads [1] [2] [3].

Changes from previous version, based on feedback from:

* Marcelo Ricardo Leitner:
- Fix bisect-ability due to update in flow_rule_alloc().
- s/key/entry variable name for struct flow_action_entry in
  tc_setup_flow_action().
- Rename qede_parse_flower_attr() to qede_parse_flow_attr().
  Remove _tc_ infix in qede parser too.

* Jiri Pirko:
- Rename to ethtool_rx_flow_rule_create() and
  ethtool_rx_flow_rule_destroy().

* Florian Fainelli:
- Use BIT() in ethtool_rx_flow_rule_create() to set .used_keys.
- Add support for FLOW_EXT and FLOW_MAC_EXT to
  ethtool_rx_flow_rule_create().
- ethtool inverts masks from userspace before passing the
  ethtool_rx_flow_rule blob to the kernel.
- Use post-increment in flow_action_for_each().

Other misc updates I made are:
- Use "flow_offload:" tag in patch subject whenever possible.
- Fix tos field in ethtool_rx_flow_rule_create().
- Remove unused flow_rule_match_basic() from bcm_sf2.

Thanks.

[1] https://lwn.net/Articles/766695/
[2] https://marc.info/?l=linux-netdev&m=154233253114506&w=2
[3] https://marc.info/?l=linux-netdev&m=154258780717036&w=2

Pablo Neira Ayuso (12):
  flow_offload: add flow_rule and flow_match structures and use them
  net/mlx5e: support for two independent packet edit actions
  flow_offload: add flow action infrastructure
  cls_api: add translator to flow_action representation
  flow_offload: add statistics retrieval infrastructure and use it
  drivers: net: use flow action infrastructure
  cls_flower: don't expose TC actions to drivers anymore
  flow_offload: add wake-up-on-lan and queue to flow_action
  ethtool: add basic ethtool_rx_flow_spec to flow_rule structure translator
  dsa: bcm_sf2: use flow_rule infrastructure
  qede: place ethtool_rx_flow_spec after code after TC flower codebase
  qede: use ethtool_rx_flow_rule() to remove duplicated parser code

 drivers/net/dsa/bcm_sf2_cfp.c  |  98 ++-
 drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c   | 252 +++
 .../net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c   | 450 ++---
 drivers/net/ethernet/intel/i40e/i40e_main.c| 178 ++---
 drivers/net/ethernet/intel/iavf/iavf_main.c| 195 +++---
 drivers/net/ethernet/intel/igb/igb_main.c  |  64 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c| 743 ++---
 drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c |   2 +-
 .../net/ethernet/mellanox/mlxsw/spectrum_flower.c  | 258 ---
 drivers/net/ethernet/netronome/nfp/flower/action.c | 196 +++---
 drivers/net/ethernet/netronome/nfp/flower/match.c  | 417 ++--
 .../net/ethernet/netronome/nfp/flower/offload.c| 150 ++---
 drivers/net/ethernet/qlogic/qede/qede_filter.c | 570 ++--
 include/linux/ethtool.h|  10 +
 include/net/flow_offload.h | 199 ++
 include/net/pkt_cls.h  |  18 +-
 net/core/Makefile  |   2 +-
 net/core/ethtool.c | 237 +++
 net/core/flow_offload.c| 153 +
 net/sched/cls_api.c| 116 
 net/sched/cls_flower.c |  69 +-
 21 files changed, 2384 insertions(+), 1993 deletions(-)
 create mode 100644 include/net/flow_offload.h
 create mode 100644 net/core/flow_offload.c

-- 
2.11.0

[PATCH net-next,v4 04/12] cls_api: add translator to flow_action representation

2018-11-28 Thread Pablo Neira Ayuso

This patch implements a new function to translate from native TC action
to the new flow_action representation. Moreover, this patch also updates
cls_flower to use this new function.
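
The translation step described above can be pictured with a small userspace sketch. Everything in it is a hypothetical stand-in (struct tc_act, struct ir_entry, setup_ir_actions() and the enum values are not kernel names); it only assumes that each native action maps to exactly one entry of the intermediate representation and that an unknown action aborts the translation.

```c
#include <stddef.h>

/* Hypothetical stand-ins for native TC actions (not kernel types). */
enum tc_kind { TC_GACT_OK, TC_GACT_SHOT, TC_MIRRED_REDIRECT, TC_UNKNOWN };

struct tc_act {
	enum tc_kind kind;
	int ifindex;		/* only meaningful for TC_MIRRED_REDIRECT */
};

/* Hypothetical stand-ins for the driver-facing representation. */
enum ir_id { IR_ACCEPT, IR_DROP, IR_REDIRECT };

struct ir_entry {
	enum ir_id id;
	int ifindex;		/* only meaningful for IR_REDIRECT */
};

/*
 * Translate n native actions into out[]. Returns 0 on success or -1 when
 * an action has no equivalent in the intermediate representation.
 */
int setup_ir_actions(struct ir_entry *out, const struct tc_act *acts, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		switch (acts[i].kind) {
		case TC_GACT_OK:
			out[i].id = IR_ACCEPT;
			break;
		case TC_GACT_SHOT:
			out[i].id = IR_DROP;
			break;
		case TC_MIRRED_REDIRECT:
			out[i].id = IR_REDIRECT;
			out[i].ifindex = acts[i].ifindex;
			break;
		default:
			return -1;	/* nothing to map this action to */
		}
	}
	return 0;
}
```

With this shape a driver only ever switches on the ir_id values and never needs to know which native action produced them.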

Signed-off-by: Pablo Neira Ayuso 
---
v4: s/key/entry variable name for struct flow_action_entry in
tc_setup_flow_action().

 include/net/pkt_cls.h  |  3 ++
 net/sched/cls_api.c| 99 ++
 net/sched/cls_flower.c | 14 +++
 3 files changed, 116 insertions(+)

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index 9ceac97e5eff..abb035f84321 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -622,6 +622,9 @@ tcf_match_indev(struct sk_buff *skb, int ifindex)
 
 unsigned int tcf_exts_num_actions(struct tcf_exts *exts);
 
+int tc_setup_flow_action(struct flow_action *flow_action,
+const struct tcf_exts *exts);
+
 int tc_setup_cb_call(struct tcf_block *block, struct tcf_exts *exts,
 enum tc_setup_type type, void *type_data, bool err_stop);
 
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 3a4d36072fd5..00b7b639f713 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -32,6 +32,13 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
 
 extern const struct nla_policy rtm_tca_policy[TCA_MAX + 1];
 
@@ -2568,6 +2575,98 @@ int tc_setup_cb_call(struct tcf_block *block, struct tcf_exts *exts,
 }
 EXPORT_SYMBOL(tc_setup_cb_call);
 
+int tc_setup_flow_action(struct flow_action *flow_action,
+const struct tcf_exts *exts)
+{
+   const struct tc_action *act;
+   int i, j, k;
+
+   if (!exts)
+   return 0;
+
+   j = 0;
+   tcf_exts_for_each_action(i, act, exts) {
+   struct flow_action_entry *entry;
+
+   entry = &flow_action->entries[j];
+   if (is_tcf_gact_ok(act)) {
+   entry->id = FLOW_ACTION_ACCEPT;
+   } else if (is_tcf_gact_shot(act)) {
+   entry->id = FLOW_ACTION_DROP;
+   } else if (is_tcf_gact_trap(act)) {
+   entry->id = FLOW_ACTION_TRAP;
+   } else if (is_tcf_gact_goto_chain(act)) {
+   entry->id = FLOW_ACTION_GOTO;
+   entry->chain_index = tcf_gact_goto_chain_index(act);
+   } else if (is_tcf_mirred_egress_redirect(act)) {
+   entry->id = FLOW_ACTION_REDIRECT;
+   entry->dev = tcf_mirred_dev(act);
+   } else if (is_tcf_mirred_egress_mirror(act)) {
+   entry->id = FLOW_ACTION_MIRRED;
+   entry->dev = tcf_mirred_dev(act);
+   } else if (is_tcf_vlan(act)) {
+   switch (tcf_vlan_action(act)) {
+   case TCA_VLAN_ACT_PUSH:
+   entry->id = FLOW_ACTION_VLAN_PUSH;
+   entry->vlan.vid = tcf_vlan_push_vid(act);
+   entry->vlan.proto = tcf_vlan_push_proto(act);
+   entry->vlan.prio = tcf_vlan_push_prio(act);
+   break;
+   case TCA_VLAN_ACT_POP:
+   entry->id = FLOW_ACTION_VLAN_POP;
+   break;
+   case TCA_VLAN_ACT_MODIFY:
+   entry->id = FLOW_ACTION_VLAN_MANGLE;
+   entry->vlan.vid = tcf_vlan_push_vid(act);
+   entry->vlan.proto = tcf_vlan_push_proto(act);
+   entry->vlan.prio = tcf_vlan_push_prio(act);
+   break;
+   default:
+   goto err_out;
+   }
+   } else if (is_tcf_tunnel_set(act)) {
+   entry->id = FLOW_ACTION_TUNNEL_ENCAP;
+   entry->tunnel = tcf_tunnel_info(act);
+   } else if (is_tcf_tunnel_release(act)) {
+   entry->id = FLOW_ACTION_TUNNEL_DECAP;
+   entry->tunnel = tcf_tunnel_info(act);
+   } else if (is_tcf_pedit(act)) {
+   for (k = 0; k < tcf_pedit_nkeys(act); k++) {
+   switch (tcf_pedit_cmd(act, k)) {
+   case TCA_PEDIT_KEY_EX_CMD_SET:
+   entry->id = FLOW_ACTION_MANGLE;
+   break;
+   case TCA_PEDIT_KEY_EX_CMD_ADD:
+   entry->id = FLOW_ACTION_ADD;
+   break;
+   default:
+   goto err_out;
+   }
+   entry->mangle.htype =

[PATCH net-next,v4 09/12] ethtool: add basic ethtool_rx_flow_spec to flow_rule structure translator

2018-11-28 Thread Pablo Neira Ayuso

This patch adds a function to translate the ethtool_rx_flow_spec
structure to the flow_rule representation.

This allows us to reuse code from the driver side given that both flower
and ethtool_rx_flow interfaces use the same representation.

This patch also includes support for FLOW_EXT and FLOW_MAC_EXT.
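
As a rough illustration of the translation, the following userspace sketch copies a value into the key only when its mask is non-zero and records which key groups are in use. The types and names (struct v4_spec, struct flow_match_sketch, spec_to_match(), the KEY_* ids) are hypothetical simplifications, and the handling of the port fields is an assumption that mirrors the address handling.

```c
#include <stdint.h>
#include <string.h>

#define BIT(n) (1UL << (n))

/* Hypothetical key ids, standing in for FLOW_DISSECTOR_KEY_* values. */
enum key_id { KEY_IPV4_ADDRS, KEY_PORTS };

/* Simplified stand-in for struct ethtool_tcpip4_spec. */
struct v4_spec {
	uint32_t ip4src, ip4dst;
	uint16_t psrc, pdst;
};

struct flow_key_sketch {
	uint32_t src, dst;
	uint16_t sport, dport;
};

struct flow_match_sketch {
	unsigned long used_keys;
	struct flow_key_sketch key, mask;
};

/* Fill m from a value/mask pair; fields with a zero mask are left unset. */
void spec_to_match(struct flow_match_sketch *m, const struct v4_spec *val,
		   const struct v4_spec *msk)
{
	memset(m, 0, sizeof(*m));

	if (msk->ip4src) {
		m->key.src = val->ip4src;
		m->mask.src = msk->ip4src;
	}
	if (msk->ip4dst) {
		m->key.dst = val->ip4dst;
		m->mask.dst = msk->ip4dst;
	}
	if (msk->ip4src || msk->ip4dst)
		m->used_keys |= BIT(KEY_IPV4_ADDRS);

	if (msk->psrc) {
		m->key.sport = val->psrc;
		m->mask.sport = msk->psrc;
	}
	if (msk->pdst) {
		m->key.dport = val->pdst;
		m->mask.dport = msk->pdst;
	}
	if (msk->psrc || msk->pdst)
		m->used_keys |= BIT(KEY_PORTS);
}
```

A consumer can then test used_keys before looking at a key group, which is the same idea as checking the dissector's used_keys in a driver.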

Signed-off-by: Pablo Neira Ayuso 
---
v4: Rename to ethtool_rx_flow_rule_create() and ethtool_rx_flow_rule_destroy(),
requested by Jiri Pirko.
Add support for FLOW_EXT and FLOW_MAC_EXT, by Florian Fainelli.
Use BIT() to set .use_keys, also by Florian Fainelli.
Double-check mask field is set when value field is present in
ethtool_rx_flow_rule, requested by Florian Fainelli.

 include/linux/ethtool.h |  10 ++
 net/core/ethtool.c  | 237 
 2 files changed, 247 insertions(+)

diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index afd9596ce636..c76d1b34c9a2 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -400,4 +400,14 @@ struct ethtool_ops {
void(*get_ethtool_phy_stats)(struct net_device *,
 struct ethtool_stats *, u64 *);
 };
+
+struct ethtool_rx_flow_rule {
+   struct flow_rule*rule;
+   unsigned long   priv[0];
+};
+
+struct ethtool_rx_flow_rule *
+ethtool_rx_flow_rule_create(const struct ethtool_rx_flow_spec *fs);
+void ethtool_rx_flow_rule_destroy(struct ethtool_rx_flow_rule *rule);
+
 #endif /* _LINUX_ETHTOOL_H */
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index d05402868575..d25f48acac57 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * Some useful ethtool_ops methods that're device independent.
@@ -2808,3 +2809,239 @@ int dev_ethtool(struct net *net, struct ifreq *ifr)
 
return rc;
 }
+
+struct ethtool_rx_flow_key {
+   struct flow_dissector_key_basic basic;
+   union {
+   struct flow_dissector_key_ipv4_addrsipv4;
+   struct flow_dissector_key_ipv6_addrsipv6;
+   };
+   struct flow_dissector_key_ports tp;
+   struct flow_dissector_key_ipip;
+   struct flow_dissector_key_vlan  vlan;
+   struct flow_dissector_key_eth_addrs eth_addrs;
+} __aligned(BITS_PER_LONG / 8); /* Ensure that we can do comparisons as longs. */
+
+struct ethtool_rx_flow_match {
+   struct flow_dissector   dissector;
+   struct ethtool_rx_flow_key  key;
+   struct ethtool_rx_flow_key  mask;
+};
+
+struct ethtool_rx_flow_rule *
+ethtool_rx_flow_rule_create(const struct ethtool_rx_flow_spec *fs)
+{
+   static struct in6_addr zero_addr = {};
+   struct ethtool_rx_flow_match *match;
+   struct ethtool_rx_flow_rule *flow;
+   struct flow_action_entry *act;
+
+   flow = kzalloc(sizeof(struct ethtool_rx_flow_rule) +
+  sizeof(struct ethtool_rx_flow_match), GFP_KERNEL);
+   if (!flow)
+   return ERR_PTR(-ENOMEM);
+
+   /* ethtool_rx supports only one single action per rule. */
+   flow->rule = flow_rule_alloc(1);
+   if (!flow->rule) {
+   kfree(flow);
+   return ERR_PTR(-ENOMEM);
+   }
+
+   match = (struct ethtool_rx_flow_match *)flow->priv;
+   flow->rule->match.dissector = &match->dissector;
+   flow->rule->match.mask  = &match->mask;
+   flow->rule->match.key   = &match->key;
+
+   match->mask.basic.n_proto = htons(0x);
+
+   switch (fs->flow_type & ~(FLOW_EXT | FLOW_MAC_EXT)) {
+   case TCP_V4_FLOW:
+   case UDP_V4_FLOW: {
+   const struct ethtool_tcpip4_spec *v4_spec, *v4_m_spec;
+
+   match->key.basic.n_proto = htons(ETH_P_IP);
+
+   v4_spec = &fs->h_u.tcp_ip4_spec;
+   v4_m_spec = &fs->m_u.tcp_ip4_spec;
+
+   if (v4_m_spec->ip4src) {
+   match->key.ipv4.src = v4_spec->ip4src;
+   match->mask.ipv4.src = v4_m_spec->ip4src;
+   }
+   if (v4_m_spec->ip4dst) {
+   match->key.ipv4.dst = v4_spec->ip4dst;
+   match->mask.ipv4.dst = v4_m_spec->ip4dst;
+   }
+   if (v4_m_spec->ip4src ||
+   v4_m_spec->ip4dst) {
+   match->dissector.used_keys |=
+   BIT(FLOW_DISSECTOR_KEY_IPV4_ADDRS);
+   match->dissector.offset[FLOW_DISSECTOR_KEY_IPV4_ADDRS] =
+   offsetof(struct ethtool_rx_flow_key, ipv4);
+   }
+   if (v4_m_spec->psrc) {
+   match->key.tp.src = v4_spec->psrc;
+   match->mask.tp.src = v4_m_spec->psrc;
+   }
+   if (v4_m_spec->pdst) {
+   match->key.tp.dst = v4_spec->pdst;
+

[PATCH net-next,v4 02/12] net/mlx5e: support for two independent packet edit actions

2018-11-28 Thread Pablo Neira Ayuso

This patch adds pedit_headers_action structure to store the result of
parsing tc pedit actions. Then, it calls alloc_tc_pedit_action() to
populate the mlx5e hardware intermediate representation once all actions
have been parsed.

This patch comes in preparation for the new flow_action infrastructure,
where each packet mangling comes in a separate action, i.e. not packed
as in tc pedit.
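
The accumulation idea can be sketched in userspace as follows. This is a simplified model and not the mlx5e code: headers are reduced to a small array of 32-bit words, and the names (struct hdrs_action, record_edit(), total_edits()) are hypothetical. Merging with OR and counting edits inside record_edit() are assumptions made for the sketch.

```c
#include <stdint.h>

enum { CMD_SET, CMD_ADD, CMD_MAX };	/* one accumulator per pedit command */
#define HDR_WORDS 4			/* toy header size, in 32-bit words */

struct hdrs_action {
	uint32_t vals[HDR_WORDS];
	uint32_t masks[HDR_WORDS];
	uint32_t pedits;		/* number of edits recorded so far */
};

/*
 * Record one edit. Returns -1 if the word is out of range or if the same
 * bits were already edited under this command, 0 otherwise.
 */
int record_edit(struct hdrs_action *h, unsigned int word,
		uint32_t mask, uint32_t val)
{
	if (word >= HDR_WORDS)
		return -1;
	if (h->masks[word] & mask)	/* disallow acting twice on the same bits */
		return -1;

	h->masks[word] |= mask;
	h->vals[word] |= val & mask;
	h->pedits++;
	return 0;
}

/* Total number of edits across both commands, used to size later buffers. */
uint32_t total_edits(const struct hdrs_action h[CMD_MAX])
{
	return h[CMD_SET].pedits + h[CMD_ADD].pedits;
}
```

Keeping the accumulators outside the per-action parser is what lets several separate mangle actions be collected first and turned into hardware actions once at the end.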

Signed-off-by: Pablo Neira Ayuso 
---
v4: rebase.

 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 81 ++---
 1 file changed, 59 insertions(+), 22 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index cfde8ae9759c..afabd5e530f0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -1749,6 +1749,12 @@ struct pedit_headers {
struct udphdr  udp;
 };
 
+struct pedit_headers_action {
+   struct pedit_headersvals;
+   struct pedit_headersmasks;
+   u32 pedits;
+};
+
 static int pedit_header_offsets[] = {
[TCA_PEDIT_KEY_EX_HDR_TYPE_ETH] = offsetof(struct pedit_headers, eth),
[TCA_PEDIT_KEY_EX_HDR_TYPE_IP4] = offsetof(struct pedit_headers, ip4),
@@ -1760,16 +1766,15 @@ static int pedit_header_offsets[] = {
 #define pedit_header(_ph, _htype) ((void *)(_ph) + pedit_header_offsets[_htype])
 
 static int set_pedit_val(u8 hdr_type, u32 mask, u32 val, u32 offset,
-struct pedit_headers *masks,
-struct pedit_headers *vals)
+struct pedit_headers_action *hdrs)
 {
u32 *curr_pmask, *curr_pval;
 
if (hdr_type >= __PEDIT_HDR_TYPE_MAX)
goto out_err;
 
-   curr_pmask = (u32 *)(pedit_header(masks, hdr_type) + offset);
-   curr_pval  = (u32 *)(pedit_header(vals, hdr_type) + offset);
+   curr_pmask = (u32 *)(pedit_header(&hdrs->masks, hdr_type) + offset);
+   curr_pval  = (u32 *)(pedit_header(&hdrs->vals, hdr_type) + offset);
 
if (*curr_pmask & mask)  /* disallow acting twice on the same location */
goto out_err;
@@ -1825,8 +1830,7 @@ static struct mlx5_fields fields[] = {
  * max from the SW pedit action. On success, it says how many HW actions were
  * actually parsed.
  */
-static int offload_pedit_fields(struct pedit_headers *masks,
-   struct pedit_headers *vals,
+static int offload_pedit_fields(struct pedit_headers_action *hdrs,
struct mlx5e_tc_flow_parse_attr *parse_attr,
struct netlink_ext_ack *extack)
 {
@@ -1841,10 +1845,10 @@ static int offload_pedit_fields(struct pedit_headers *masks,
__be16 mask_be16;
void *action;
 
-   set_masks = &masks[TCA_PEDIT_KEY_EX_CMD_SET];
-   add_masks = &masks[TCA_PEDIT_KEY_EX_CMD_ADD];
-   set_vals = &vals[TCA_PEDIT_KEY_EX_CMD_SET];
-   add_vals = &vals[TCA_PEDIT_KEY_EX_CMD_ADD];
+   set_masks = &hdrs[TCA_PEDIT_KEY_EX_CMD_SET].masks;
+   add_masks = &hdrs[TCA_PEDIT_KEY_EX_CMD_ADD].masks;
+   set_vals = &hdrs[TCA_PEDIT_KEY_EX_CMD_SET].vals;
+   add_vals = &hdrs[TCA_PEDIT_KEY_EX_CMD_ADD].vals;
 
action_size = MLX5_UN_SZ_BYTES(set_action_in_add_action_in_auto);
action = parse_attr->mod_hdr_actions;
@@ -1940,12 +1944,14 @@ static int offload_pedit_fields(struct pedit_headers *masks,
 }
 
 static int alloc_mod_hdr_actions(struct mlx5e_priv *priv,
-const struct tc_action *a, int namespace,
+struct pedit_headers_action *hdrs,
+int namespace,
 struct mlx5e_tc_flow_parse_attr *parse_attr)
 {
int nkeys, action_size, max_actions;
 
-   nkeys = tcf_pedit_nkeys(a);
+   nkeys = hdrs[TCA_PEDIT_KEY_EX_CMD_SET].pedits +
+   hdrs[TCA_PEDIT_KEY_EX_CMD_ADD].pedits;
action_size = MLX5_UN_SZ_BYTES(set_action_in_add_action_in_auto);
 
if (namespace == MLX5_FLOW_NAMESPACE_FDB) /* FDB offloading */
@@ -1969,18 +1975,15 @@ static const struct pedit_headers zero_masks = {};
 static int parse_tc_pedit_action(struct mlx5e_priv *priv,
 const struct tc_action *a, int namespace,
 struct mlx5e_tc_flow_parse_attr *parse_attr,
+struct pedit_headers_action *hdrs,
 struct netlink_ext_ack *extack)
 {
-   struct pedit_headers masks[__PEDIT_CMD_MAX], vals[__PEDIT_CMD_MAX], *cmd_masks;
int nkeys, i, err = -EOPNOTSUPP;
u32 mask, val, offset;
u8 cmd, htype;
 
nkeys = tcf_pedit_nkeys(a);
 
-   memset(masks, 0, sizeof(struct pedit_headers) * __PEDIT_CMD_MAX);
-   memset(vals,  0, sizeof(struct pedit_headers) * __PEDIT_CMD_MAX);
-
for (i = 0; i < nkeys; i++) {
htype = tcf_pedit_htype(a, i);
cmd =

[PATCH net-next,v4 03/12] flow_offload: add flow action infrastructure

2018-11-28 Thread Pablo Neira Ayuso

This new infrastructure defines the nic actions that you can perform
from existing network drivers. This infrastructure allows us to avoid a
direct dependency with the native software TC action representation.
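
A minimal userspace sketch of the allocation pattern follows. It assumes one allocation that holds the rule header plus a trailing array of action entries, as flow_rule_alloc(num_actions) does in this patch. The names (struct rule_sketch, rule_alloc(), rule_has_drop(), the ACT_* ids) are hypothetical, and the layout is flattened: the patch nests struct flow_action inside struct flow_rule using a zero-length array, while this sketch uses a plain C99 flexible array member.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical action ids; ACT_ACCEPT is 0 so zeroed entries are valid. */
enum action_id { ACT_ACCEPT = 0, ACT_DROP, ACT_MARK };

struct action_entry {
	enum action_id id;
	uint32_t arg;			/* e.g. the mark for ACT_MARK */
};

struct rule_sketch {
	int match_placeholder;		/* stands in for the match part */
	unsigned int num_entries;
	struct action_entry entries[];	/* flexible array, sized at alloc time */
};

/* Allocate a zeroed rule with room for num_actions entries. */
struct rule_sketch *rule_alloc(unsigned int num_actions)
{
	struct rule_sketch *r;

	r = calloc(1, sizeof(*r) + sizeof(struct action_entry) * num_actions);
	if (!r)
		return NULL;

	r->num_entries = num_actions;
	return r;
}

/* Returns 1 if any entry requests a drop, 0 otherwise. */
int rule_has_drop(const struct rule_sketch *r)
{
	unsigned int i;

	for (i = 0; i < r->num_entries; i++)
		if (r->entries[i].id == ACT_DROP)
			return 1;
	return 0;
}
```

A single allocation keeps the rule and its actions together, so one free() releases both.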

Signed-off-by: Pablo Neira Ayuso 
---
v4: Use post-increment in flow_action_for_each(), reported by Florian Fainelli.
Fix bisectability breakage, reported by Marcelo Ricardo Leitner.

 include/net/flow_offload.h | 69 +-
 include/net/pkt_cls.h  |  2 ++
 net/core/flow_offload.c| 14 --
 net/sched/cls_api.c| 17 
 net/sched/cls_flower.c |  7 +++--
 5 files changed, 103 insertions(+), 6 deletions(-)

diff --git a/include/net/flow_offload.h b/include/net/flow_offload.h
index 461c66595763..dabc819b6cc9 100644
--- a/include/net/flow_offload.h
+++ b/include/net/flow_offload.h
@@ -100,11 +100,78 @@ void flow_rule_match_enc_keyid(const struct flow_rule *rule,
 void flow_rule_match_enc_opts(const struct flow_rule *rule,
  struct flow_match_enc_opts *out);
 
+enum flow_action_id {
+   FLOW_ACTION_ACCEPT  = 0,
+   FLOW_ACTION_DROP,
+   FLOW_ACTION_TRAP,
+   FLOW_ACTION_GOTO,
+   FLOW_ACTION_REDIRECT,
+   FLOW_ACTION_MIRRED,
+   FLOW_ACTION_VLAN_PUSH,
+   FLOW_ACTION_VLAN_POP,
+   FLOW_ACTION_VLAN_MANGLE,
+   FLOW_ACTION_TUNNEL_ENCAP,
+   FLOW_ACTION_TUNNEL_DECAP,
+   FLOW_ACTION_MANGLE,
+   FLOW_ACTION_ADD,
+   FLOW_ACTION_CSUM,
+   FLOW_ACTION_MARK,
+};
+
+/* This is mirroring enum pedit_header_type definition for easy mapping between
+ * tc pedit action. Legacy TCA_PEDIT_KEY_EX_HDR_TYPE_NETWORK is mapped to
+ * FLOW_ACT_MANGLE_UNSPEC, which is supported by no driver.
+ */
+enum flow_action_mangle_base {
+   FLOW_ACT_MANGLE_UNSPEC  = 0,
+   FLOW_ACT_MANGLE_HDR_TYPE_ETH,
+   FLOW_ACT_MANGLE_HDR_TYPE_IP4,
+   FLOW_ACT_MANGLE_HDR_TYPE_IP6,
+   FLOW_ACT_MANGLE_HDR_TYPE_TCP,
+   FLOW_ACT_MANGLE_HDR_TYPE_UDP,
+};
+
+struct flow_action_entry {
+   enum flow_action_id id;
+   union {
+   u32 chain_index;/* FLOW_ACTION_GOTO */
+   struct net_device   *dev;   /* FLOW_ACTION_REDIRECT */
+   struct {/* FLOW_ACTION_VLAN */
+   u16 vid;
+   __be16  proto;
+   u8  prio;
+   } vlan;
+   struct {/* FLOW_ACTION_PACKET_EDIT */
+   enum flow_action_mangle_base htype;
+   u32 offset;
+   u32 mask;
+   u32 val;
+   } mangle;
+   const struct ip_tunnel_info *tunnel;/* FLOW_ACTION_TUNNEL_ENCAP */
+   u32 csum_flags; /* FLOW_ACTION_CSUM */
+   u32 mark;   /* FLOW_ACTION_MARK */
+   };
+};
+
+struct flow_action {
+   unsigned intnum_entries;
+   struct flow_action_entryentries[0];
+};
+
+static inline bool flow_action_has_entries(const struct flow_action *action)
+{
+   return action->num_entries;
+}
+
+#define flow_action_for_each(__i, __act, __actions)\
+for (__i = 0, __act = &(__actions)->entries[0]; __i < (__actions)->num_entries; __act = &(__actions)->entries[__i++])
+
 struct flow_rule {
struct flow_match   match;
+   struct flow_action  action;
 };
 
-struct flow_rule *flow_rule_alloc(void);
+struct flow_rule *flow_rule_alloc(unsigned int num_actions);
 
 static inline bool flow_rule_match_key(const struct flow_rule *rule,
   enum flow_dissector_key_id key)
diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index 359876ee32be..9ceac97e5eff 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -620,6 +620,8 @@ tcf_match_indev(struct sk_buff *skb, int ifindex)
 }
 #endif /* CONFIG_NET_CLS_IND */
 
+unsigned int tcf_exts_num_actions(struct tcf_exts *exts);
+
 int tc_setup_cb_call(struct tcf_block *block, struct tcf_exts *exts,
 enum tc_setup_type type, void *type_data, bool err_stop);
 
diff --git a/net/core/flow_offload.c b/net/core/flow_offload.c
index 2fbf6903d2f6..c3a00eac4804 100644
--- a/net/core/flow_offload.c
+++ b/net/core/flow_offload.c
@@ -3,9 +3,19 @@
 #include 
 #include 
 
-struct flow_rule *flow_rule_alloc(void)
+struct flow_rule *flow_rule_alloc(unsigned int num_actions)
 {
-   return kzalloc(sizeof(struct flow_rule), GFP_KERNEL);
+   struct flow_rule *rule;
+
+   rule = kzalloc(sizeof(struct flow_rule) +
+  sizeof(struct flow_action_entry) * num_actions,
+  GFP_KERNEL);
+   if (!rule)
+

Re: [PATCH] udp: Allow to defer reception until connect() happened

2018-11-28 Thread Eric Dumazet

On Wed, Nov 28, 2018 at 5:57 PM Christoph Paasch  wrote:
>
> There are use-cases where a host wants to use a UDP socket with a
> specific 4-tuple. The way to do this is to bind() and then connect() the
> socket. However, after the bind(), the socket starts receiving data even
> if it does not match the intended 4-tuple. That is because after the
> bind() UDP-socket will match in the lookup for all incoming UDP-traffic
> that has the specific IP/port.
>
> This patch prevents any incoming traffic until the connect() system-call
> is called whenever the app sets the UDP socket-option
> UDP_WAIT_FOR_CONNECT.

Please do not add something that could mislead application writers into
thinking the UDP stack can scale.

The UDP stack does not have a full hash on 4-tuples, which means that
incoming traffic on a 'shared port' has
to scan a list of XXX sockets to find the best match ...

Also, you add another cache line miss in the UDP lookup to access
udp_sk()->wait_for_connect.

recvfrom() can be used to filter out whatever frames arrived before the connect().
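
A minimal userspace sketch of that filtering idea, assuming IPv4 only and a blocking socket; same_peer() and recv_from_peer() are hypothetical helper names, not existing APIs:

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Returns 1 when both IPv4 endpoints have the same address and port. */
int same_peer(const struct sockaddr_in *a, const struct sockaddr_in *b)
{
	return a->sin_addr.s_addr == b->sin_addr.s_addr &&
	       a->sin_port == b->sin_port;
}

/*
 * Read datagrams from fd until one arrives from the expected peer.
 * Datagrams from any other source are silently discarded.
 * Returns the payload length, or -1 if recvfrom() fails.
 */
ssize_t recv_from_peer(int fd, const struct sockaddr_in *peer,
		       void *buf, size_t len)
{
	for (;;) {
		struct sockaddr_in src;
		socklen_t slen = sizeof(src);
		ssize_t n;

		n = recvfrom(fd, buf, len, 0, (struct sockaddr *)&src, &slen);
		if (n < 0)
			return -1;

		if (slen >= sizeof(src) && src.sin_family == AF_INET &&
		    same_peer(&src, peer))
			return n;
		/* Not from the intended 4-tuple: drop it and keep reading. */
	}
}
```

An application would call recv_from_peer() on the bound socket around the time it calls connect(), so that datagrams queued before the connect() are dropped in userspace rather than by a new socket option.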

[PATCH] udp: Allow to defer reception until connect() happened

2018-11-28 Thread Christoph Paasch

There are use-cases where a host wants to use a UDP socket with a
specific 4-tuple. The way to do this is to bind() and then connect() the
socket. However, after the bind(), the socket starts receiving data even
if it does not match the intended 4-tuple. That is because after the
bind() UDP-socket will match in the lookup for all incoming UDP-traffic
that has the specific IP/port.

This patch prevents any incoming traffic until the connect() system-call
is called whenever the app sets the UDP socket-option
UDP_WAIT_FOR_CONNECT.

Signed-off-by: Christoph Paasch 
---

Notes:
Changes compared to the original RFC-submission:

* Make it a UDP-specific socket-option
* Rename it to 'wait-for-connect'

Regarding the discussion on the original RFC submission
(cf. https://marc.info/?l=linux-netdev&m=154102843910587&w=2):

We believe that this patch is still useful to enable applications to use
different models of implementing a UDP server. For some frameworks it is
much easier to have a socket per connection and thus let the demultiplexing
happen in the kernel instead of the app.

 include/linux/udp.h  |  5 -
 include/net/udp.h|  1 +
 include/uapi/linux/udp.h |  1 +
 net/ipv4/udp.c   | 26 +-
 net/ipv4/udplite.c   |  2 +-
 net/ipv6/udp.c   | 15 ++-
 net/ipv6/udp_impl.h  |  1 +
 net/ipv6/udplite.c   |  2 +-
 8 files changed, 48 insertions(+), 5 deletions(-)

diff --git a/include/linux/udp.h b/include/linux/udp.h
index 2725c83395bf..9a715d25ce36 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -55,7 +55,10 @@ struct udp_sock {
   * different encapsulation layer set
   * this
   */
-gro_enabled:1; /* Can accept GRO packets */
+gro_enabled:1, /* Can accept GRO packets */
+wait_for_connect:1;/* Wait until app calls connect()
+* before accepting incoming data
+*/
/*
 * Following member retains the information to create a UDP header
 * when the socket is uncorked.
diff --git a/include/net/udp.h b/include/net/udp.h
index fd6d948755c8..b7467a4129b3 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -285,6 +285,7 @@ int udp_get_port(struct sock *sk, unsigned short snum,
  const struct sock *));
 int udp_err(struct sk_buff *, u32);
 int udp_abort(struct sock *sk, int err);
+int udp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len);
 int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len);
 int udp_push_pending_frames(struct sock *sk);
 void udp_flush_pending_frames(struct sock *sk);
diff --git a/include/uapi/linux/udp.h b/include/uapi/linux/udp.h
index 30baccb6c9c4..b61f8e8dd80b 100644
--- a/include/uapi/linux/udp.h
+++ b/include/uapi/linux/udp.h
@@ -34,6 +34,7 @@ struct udphdr {
 #define UDP_NO_CHECK6_RX 102   /* Disable accpeting checksum for UDP6 */
 #define UDP_SEGMENT103 /* Set GSO segmentation size */
 #define UDP_GRO104 /* This socket can receive UDP GRO packets */
+#define UDP_WAIT_FOR_CONNECT   105 /* Don't accept incoming data until the app calls connect() */
 
 /* UDP encapsulation types */
 #define UDP_ENCAP_ESPINUDP_NON_IKE 1 /* draft-ietf-ipsec-nat-t-ike-00/01 */
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index aff2a8e99e01..c5adafcfd52f 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -407,6 +407,9 @@ static int compute_score(struct sock *sk, struct net *net,
return -1;
score += 4;
 
+   if (udp_sk(sk)->wait_for_connect)
+   return -1;
+
if (sk->sk_incoming_cpu == raw_smp_processor_id())
score++;
return score;
@@ -2601,6 +2604,10 @@ int udp_lib_setsockopt(struct sock *sk, int level, int optname,
release_sock(sk);
break;
 
+   case UDP_WAIT_FOR_CONNECT:
+   up->wait_for_connect = valbool;
+   break;
+
/*
 *  UDP-Lite's partial checksum coverage (RFC 3828).
 */
@@ -2695,6 +2702,10 @@ int udp_lib_getsockopt(struct sock *sk, int level, int optname,
val = up->gso_size;
break;
 
+   case UDP_WAIT_FOR_CONNECT:
+   val = up->wait_for_connect;
+   break;
+
/* The following two cannot be changed on UDP sockets, the return is
 * always 0 (which corresponds to the full checksum coverage of UDP). */
case UDPLITE_SEND_CSCOV:
@@ -2779,12 +2790,25 @@ int udp_abort(struct sock *sk, int err)
 }
 EXPORT_SYMBOL_GPL(udp_abort);
 
+int udp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
+{
+   int ret;
+
+   ret =

Re: [PATCH bpf-next 1/2] bpf: add BPF_LWT_ENCAP_IP option to bpf_lwt_push_encap

2018-11-28 Thread Peter Oskolkov

On Wed, Nov 28, 2018 at 4:47 PM David Ahern  wrote:
>
> On 11/28/18 5:22 PM, Peter Oskolkov wrote:
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index bd0df75dc7b6..17f3c37218e5 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -4793,6 +4793,60 @@ static int bpf_push_seg6_encap(struct sk_buff *skb, 
> > u32 type, void *hdr, u32 len
> >  }
> >  #endif /* CONFIG_IPV6_SEG6_BPF */
> >
> > +static int bpf_push_ip_encap(struct sk_buff *skb, void *hdr, u32 len)
> > +{
> > + struct dst_entry *dst;
> > + struct rtable *rt;
> > + struct iphdr *iph;
> > + struct net *net;
> > + int err;
> > +
> > + if (skb->protocol != htons(ETH_P_IP))
> > + return -EINVAL;  /* ETH_P_IPV6 not yet supported */
> > +
> > + iph = (struct iphdr *)hdr;
> > +
> > + if (unlikely(len < sizeof(struct iphdr) || len > 
> > LWTUNNEL_MAX_ENCAP_HSIZE))
> > + return -EINVAL;
> > + if (unlikely(iph->version != 4 || iph->ihl * 4 > len))
> > + return -EINVAL;
> > +
> > + if (skb->sk)
> > + net = sock_net(skb->sk);
> > + else {
> > + net = dev_net(skb_dst(skb)->dev);
> > + }
> > + rt = ip_route_output(net, iph->daddr, 0, 0, 0);
>
> That is a very limited use case. e.g., oif = 0 means you are not
> considering any kind of policy routing (e.g., VRF).

Hi David! Could you be a bit more specific re: what you would like to
see here? Thanks!

Re: [PATCH bpf-next 1/2] bpf: add BPF_LWT_ENCAP_IP option to bpf_lwt_push_encap

2018-11-28 Thread David Ahern

On 11/28/18 5:22 PM, Peter Oskolkov wrote:
> diff --git a/net/core/filter.c b/net/core/filter.c
> index bd0df75dc7b6..17f3c37218e5 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -4793,6 +4793,60 @@ static int bpf_push_seg6_encap(struct sk_buff *skb, 
> u32 type, void *hdr, u32 len
>  }
>  #endif /* CONFIG_IPV6_SEG6_BPF */
>  
> +static int bpf_push_ip_encap(struct sk_buff *skb, void *hdr, u32 len)
> +{
> + struct dst_entry *dst;
> + struct rtable *rt;
> + struct iphdr *iph;
> + struct net *net;
> + int err;
> +
> + if (skb->protocol != htons(ETH_P_IP))
> + return -EINVAL;  /* ETH_P_IPV6 not yet supported */
> +
> + iph = (struct iphdr *)hdr;
> +
> + if (unlikely(len < sizeof(struct iphdr) || len > 
> LWTUNNEL_MAX_ENCAP_HSIZE))
> + return -EINVAL;
> + if (unlikely(iph->version != 4 || iph->ihl * 4 > len))
> + return -EINVAL;
> +
> + if (skb->sk)
> + net = sock_net(skb->sk);
> + else {
> + net = dev_net(skb_dst(skb)->dev);
> + }
> + rt = ip_route_output(net, iph->daddr, 0, 0, 0);

That is a very limited use case. e.g., oif = 0 means you are not
considering any kind of policy routing (e.g., VRF).

Re: [PATCH bpf] tools: bpftool: fix a bitfield pretty print issue

2018-11-28 Thread Alexei Starovoitov

On Wed, Nov 28, 2018 at 09:38:23AM -0800, Yonghong Song wrote:
> Commit b12d6ec09730 ("bpf: btf: add btf print functionality")
> added btf pretty print functionality to bpftool.
> There is a problem though in printing a bitfield whose type
> has modifiers.
> 
> For example, for a type like
>   typedef int ___int;
>   struct tmp_t {
>   int a:3;
>   ___int b:3;
>   };
> Suppose we have a map
>   struct bpf_map_def SEC("maps") tmpmap = {
>   .type = BPF_MAP_TYPE_HASH,
>   .key_size = sizeof(__u32),
>   .value_size = sizeof(struct tmp_t),
>   .max_entries = 1,
>   };
> and the hash table is populated with one element with
> key 0 and value (.a = 1 and .b = 2).
> 
> In BTF, the struct member "b" will have a type "typedef" which
> points to an int type. The current implementation does not
> pass the bit offset during transition from typedef to int type,
> hence incorrectly print the value as
>   $ bpftool m d id 79
>   [{
>   "key": 0,
>   "value": {
>   "a": 0x1,
>   "b": 0x1
>   }
>   }
>   ]
> 
> This patch fixed the issue by carrying bit_offset along the type
> chain during bit_field print. The correct result can be printed as
>   $ bpftool m d id 76
>   [{
>   "key": 0,
>   "value": {
>   "a": 0x1,
>   "b": 0x2
>   }
>   }
>   ]
> 
> The kernel pretty print is implemented correctly and does not
> have this issue.
> 
> Fixes: b12d6ec09730 ("bpf: btf: add btf print functionality")
> Signed-off-by: Yonghong Song 

Applied to bpf tree. Thanks
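
For readers following along, the effect described in the commit message can be reproduced with a minimal sketch. It is not the bpftool code: extract_bits() and pack_tmp() are hypothetical helpers, and the layout (a in bits 0-2, b in bits 3-5) is an assumption that matches what a little-endian compiler would typically produce for struct tmp_t.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: read nr_bits bits starting at bit_offset. */
static uint32_t extract_bits(uint64_t word, unsigned int bit_offset,
			     unsigned int nr_bits)
{
	assert(nr_bits >= 1 && nr_bits <= 32);
	assert(bit_offset + nr_bits <= 64);
	return (uint32_t)((word >> bit_offset) & ((1ULL << nr_bits) - 1));
}

/* Assumed layout of struct tmp_t { int a:3; ___int b:3; }:
 * a occupies bits 0-2, b occupies bits 3-5.
 */
static uint64_t pack_tmp(unsigned int a, unsigned int b)
{
	return (uint64_t)(a & 7) | ((uint64_t)(b & 7) << 3);
}
```

With .a = 1 and .b = 2, reading b at bit offset 3 yields 2, while reading it at offset 0 (the offset having been lost on the typedef hop) yields 1, which is the wrong "b": 0x1 seen in the first dump.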

[PATCH bpf-next 2/2] selftests/bpf: add test_lwt_ip_encap selftest

2018-11-28 Thread Peter Oskolkov

This patch adds a sample/selftest that covers BPF_LWT_ENCAP_IP option
added in the first patch in the series.

Signed-off-by: Peter Oskolkov 
---
 tools/testing/selftests/bpf/Makefile  |   5 +-
 .../testing/selftests/bpf/test_lwt_ip_encap.c |  65 ++
 .../selftests/bpf/test_lwt_ip_encap.sh| 114 ++
 3 files changed, 182 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_lwt_ip_encap.c
 create mode 100755 tools/testing/selftests/bpf/test_lwt_ip_encap.sh

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 73aa6d8f4a2f..044fcdbc9864 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -39,7 +39,7 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o 
test_tcp_estats.o test
get_cgroup_id_kern.o socket_cookie_prog.o test_select_reuseport_kern.o \
test_skb_cgroup_id_kern.o bpf_flow.o netcnt_prog.o \
test_sk_lookup_kern.o test_xdp_vlan.o test_queue_map.o test_stack_map.o 
\
-   xdp_dummy.o test_map_in_map.o
+   xdp_dummy.o test_map_in_map.o test_lwt_ip_encap.o
 
 # Order correspond to 'make run_tests' order
 TEST_PROGS := test_kmod.sh \
@@ -53,7 +53,8 @@ TEST_PROGS := test_kmod.sh \
test_lirc_mode2.sh \
test_skb_cgroup_id.sh \
test_flow_dissector.sh \
-   test_xdp_vlan.sh
+   test_xdp_vlan.sh \
+   test_lwt_ip_encap.sh
 
 TEST_PROGS_EXTENDED := with_addr.sh
 
diff --git a/tools/testing/selftests/bpf/test_lwt_ip_encap.c b/tools/testing/selftests/bpf/test_lwt_ip_encap.c
new file mode 100644
index ..967db922dcc6
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_lwt_ip_encap.c
@@ -0,0 +1,65 @@
+// SPDX-License-Identifier: GPL-2.0
+#include 
+#include 
+#include "bpf_helpers.h"
+#include "bpf_endian.h"
+
+#define BPF_LWT_ENCAP_IP 2
+
+struct iphdr {
+#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
+   __u8ihl:4,
+   version:4;
+#elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
+   __u8version:4,
+   ihl:4;
+#else
+#error "Fix your compiler's __BYTE_ORDER__?!"
+#endif
+   __u8tos;
+   __be16  tot_len;
+   __be16  id;
+   __be16  frag_off;
+   __u8ttl;
+   __u8protocol;
+   __sum16 check;
+   __be32  saddr;
+   __be32  daddr;
+};
+
+struct grehdr {
+   __be16 flags;
+   __be16 protocol;
+};
+
+SEC("encap_gre")
+int bpf_lwt_encap_gre(struct __sk_buff *skb)
+{
+   char encap_header[24];
+   int err;
+   struct iphdr *iphdr = (struct iphdr *)encap_header;
+   struct grehdr *greh = (struct grehdr *)(encap_header + sizeof(struct iphdr));
+
+   memset(encap_header, 0, sizeof(encap_header));
+
+   iphdr->ihl = 5;
+   iphdr->version = 4;
+   iphdr->tos = 0;
+   iphdr->ttl = 0x40;
+   iphdr->protocol = 47;  /* IPPROTO_GRE */
+   iphdr->saddr = 0x640110ac;  /* 172.16.1.100 */
+   iphdr->daddr = 0x640310ac;  /* 172.16.3.100 */
+   iphdr->check = 0;
+   iphdr->tot_len = bpf_htons(skb->len + sizeof(encap_header));
+
+   greh->protocol = bpf_htons(0x800);
+
+   err = bpf_lwt_push_encap(skb, BPF_LWT_ENCAP_IP, (void *)encap_header,
+sizeof(encap_header));
+   if (err)
+   return BPF_DROP;
+
+   return BPF_OK;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/test_lwt_ip_encap.sh b/tools/testing/selftests/bpf/test_lwt_ip_encap.sh
new file mode 100755
index ..4c32b754bf96
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_lwt_ip_encap.sh
@@ -0,0 +1,114 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Setup:
+# - create VETH1/VETH2 veth
+# - VETH1 gets IP_SRC
+# - create netns NS
+# - move VETH2 to NS, add IP_DST
+# - in NS, create gre tunnel GREDEV, add IP_GRE
+# - in NS, configure GREDEV to route to IP_DST from IP_SRC
+# - configure route to IP_GRE via VETH1
+#   (note: there is no route to IP_DST from root/init ns)
+#
+# Test:
+# - listen on IP_DST
+# - send a packet to IP_DST: the listener does not get it
+# - add LWT_XMIT bpf to IP_DST that gre-encaps all packets to IP_GRE
+# - send a packet to IP_DST: the listener gets it
+
+
+# set -x  # debug ON
+set +x  # debug OFF
+set -e  # exit on error
+
+if [[ $EUID -ne 0 ]]; then
+   echo "This script must be run as root"
+   echo "FAIL"
+   exit 1
+fi
+
+readonly NS="ns-ip-encap-$(mktemp -u XX)"
+readonly OUT=$(mktemp /tmp/test_lwt_ip_incap.XX)
+
+readonly NET_SRC="172.16.1.0"
+
+readonly IP_SRC="172.16.1.100"
+readonly IP_DST="172.16.2.100"
+readonly IP_GRE="172.16.3.100"
+
+readonly PORT=
+readonly MSG="foo_bar"
+
+PID1=0
+PID2=0
+
+setup() {
+   ip link add veth1 type veth peer name veth2
+
+   ip netns add "${NS}"
+   ip link set veth2 netns ${NS}
+
+   ip link set dev veth1 up
+   ip -netns ${NS} link set dev veth2 up
+
+   ip

[PATCH bpf-next 1/2] bpf: add BPF_LWT_ENCAP_IP option to bpf_lwt_push_encap

2018-11-28 Thread Peter Oskolkov

This patch enables BPF programs (specifically, of LWT_XMIT type)
to add IP encapsulation headers to packets (e.g. IP/GRE, GUE, IPIP).

This is useful when thousands of different short-lived flows should be
encapped, each with different and dynamically determined destination.
Although lwtunnels can be used in some of these scenarios, the ability
to dynamically generate encap headers adds more flexibility, e.g.
when routing depends on the state of the host (reflected in global bpf
maps).

A future patch will enable IPv6 encapping (and IPv4/IPv6 cross-routing).

Tested: see the second patch in the series.

Signed-off-by: Peter Oskolkov 
---
 include/net/lwtunnel.h   |  2 ++
 include/uapi/linux/bpf.h |  7 -
 net/core/filter.c| 58 
 3 files changed, 66 insertions(+), 1 deletion(-)

diff --git a/include/net/lwtunnel.h b/include/net/lwtunnel.h
index 33fd9ba7e0e5..6a1c5c2f16d5 100644
--- a/include/net/lwtunnel.h
+++ b/include/net/lwtunnel.h
@@ -16,6 +16,8 @@
 #define LWTUNNEL_STATE_INPUT_REDIRECT  BIT(1)
 #define LWTUNNEL_STATE_XMIT_REDIRECT   BIT(2)
 
+#define LWTUNNEL_MAX_ENCAP_HSIZE   80
+
 enum {
LWTUNNEL_XMIT_DONE,
LWTUNNEL_XMIT_CONTINUE,
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 597afdbc1ab9..6f2efe2dca9f 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1998,6 +1998,10 @@ union bpf_attr {
  * Only works if *skb* contains an IPv6 packet. Insert a
  * Segment Routing Header (**struct ipv6_sr_hdr**) inside
  * the IPv6 header.
+ * **BPF_LWT_ENCAP_IP**
+ * IP encapsulation (GRE/GUE/IPIP/etc). The outer header
+ * must be IPv4, followed by zero, one, or more additional
+ * headers.
  *
  * A call to this helper is susceptible to change the underlaying
  * packet buffer. Therefore, at load time, all checks on pointers
@@ -2444,7 +2448,8 @@ enum bpf_hdr_start_off {
 /* Encapsulation type for BPF_FUNC_lwt_push_encap helper. */
 enum bpf_lwt_encap_mode {
BPF_LWT_ENCAP_SEG6,
-   BPF_LWT_ENCAP_SEG6_INLINE
+   BPF_LWT_ENCAP_SEG6_INLINE,
+   BPF_LWT_ENCAP_IP,
 };
 
 /* user accessible mirror of in-kernel sk_buff.
diff --git a/net/core/filter.c b/net/core/filter.c
index bd0df75dc7b6..17f3c37218e5 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4793,6 +4793,60 @@ static int bpf_push_seg6_encap(struct sk_buff *skb, u32 type, void *hdr, u32 len
 }
 #endif /* CONFIG_IPV6_SEG6_BPF */
 
+static int bpf_push_ip_encap(struct sk_buff *skb, void *hdr, u32 len)
+{
+   struct dst_entry *dst;
+   struct rtable *rt;
+   struct iphdr *iph;
+   struct net *net;
+   int err;
+
+   if (skb->protocol != htons(ETH_P_IP))
+   return -EINVAL;  /* ETH_P_IPV6 not yet supported */
+
+   iph = (struct iphdr *)hdr;
+
+   if (unlikely(len < sizeof(struct iphdr) || len > LWTUNNEL_MAX_ENCAP_HSIZE))
+   return -EINVAL;
+   if (unlikely(iph->version != 4 || iph->ihl * 4 > len))
+   return -EINVAL;
+
+   if (skb->sk)
+   net = sock_net(skb->sk);
+   else {
+   net = dev_net(skb_dst(skb)->dev);
+   }
+   rt = ip_route_output(net, iph->daddr, 0, 0, 0);
+   if (IS_ERR(rt) || rt->dst.error)
+   return -EINVAL;
+   dst = &rt->dst;
+
+   skb_reset_inner_headers(skb);
+   skb->encapsulation = 1;
+
+   err = skb_cow_head(skb, len + LL_RESERVED_SPACE(dst->dev));
+   if (unlikely(err))
+   return err;
+
+   skb_push(skb, len);
+   skb_reset_network_header(skb);
+
+   iph = ip_hdr(skb);
+   memcpy(iph, hdr, len);
+
+   bpf_compute_data_pointers(skb);
+   if (iph->ihl * 4 < len)
+   skb_set_transport_header(skb, iph->ihl * 4);
+   skb->protocol = htons(ETH_P_IP);
+   if (!iph->check)
+   iph->check = ip_fast_csum((unsigned char *)iph, iph->ihl);
+
+   skb_dst_drop(skb);
+   dst_hold(dst);
+   skb_dst_set(skb, dst);
+   return 0;
+}
+
 BPF_CALL_4(bpf_lwt_push_encap, struct sk_buff *, skb, u32, type, void *, hdr,
   u32, len)
 {
@@ -4802,6 +4856,8 @@ BPF_CALL_4(bpf_lwt_push_encap, struct sk_buff *, skb, u32, type, void *, hdr,
case BPF_LWT_ENCAP_SEG6_INLINE:
return bpf_push_seg6_encap(skb, type, hdr, len);
 #endif
+   case BPF_LWT_ENCAP_IP:
+   return bpf_push_ip_encap(skb, hdr, len);
default:
return -EINVAL;
}
@@ -5687,6 +5743,8 @@ lwt_xmit_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
return &bpf_l4_csum_replace_proto;
case BPF_FUNC_set_hash_invalid:
return &bpf_set_hash_invalid_proto;
+   case BPF_FUNC_lwt_push_encap:
+   return &bpf_lwt_push_encap_proto;
default:

Re: [PATCH] bpf: Fix various lib and testsuite build failures on 32-bit.

2018-11-28 Thread Alexei Starovoitov

On Wed, Nov 28, 2018 at 12:56:10PM -0800, David Miller wrote:
> 
> Cannot cast a u64 to a pointer on 32-bit without an intervening (long)
> cast otherwise GCC warns.
> 
> Signed-off-by: David S. Miller 

I was contemplating applying that to the bpf tree, but the first hunk is
bpf-next only and the later hunks are in test_progs.c, hence applied to
the bpf-next tree.
Thanks!
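
As a side note on the warning itself, here is a minimal sketch of the round trip. The helper names are made up for illustration, and it uses uintptr_t where the patch uses (long); both narrow the 64-bit value to pointer width before the pointer cast, which is what keeps GCC quiet on 32-bit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: store a pointer in a u64 field, as bpf_*_info does. */
static uint64_t ptr_to_u64(const void *p)
{
	return (uint64_t)(uintptr_t)p;
}

/* Hypothetical helper: recover the pointer. A direct (void *)v cast from a
 * 64-bit integer triggers -Wint-to-pointer-cast when pointers are 32 bits
 * wide, so narrow to pointer width first.
 */
static void *u64_to_ptr(uint64_t v)
{
	assert(v <= UINTPTR_MAX);
	return (void *)(uintptr_t)v;
}
```

On a 64-bit build the intermediate cast changes nothing; on 32-bit it makes the truncation explicit instead of implicit.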

Re: [Patch net v2] mlx5: fixup checksum for short ethernet frame padding

2018-11-28 Thread Eric Dumazet

On 11/28/2018 04:05 PM, Cong Wang wrote:

> While we are on this page, mlx5e_lro_update_hdr() incorrectly assumes
> TCP header is located right after struct iphdr, which is wrong if we could
> have IP options on this path.
> 
> It could be the hardware which already verified this corner case though.
> 

GRO makes sure IPv4 header has no options, so I would not be surprised
that LRO on mlx5 has the same logic.

Re: [Patch net v2] mlx5: fixup checksum for short ethernet frame padding

2018-11-28 Thread Eric Dumazet

On Wed, Nov 28, 2018 at 3:57 PM Cong Wang  wrote:
> But again, I kinda feel the hardware already does the sanity check,
> otherwise we have much more serious trouble in mlx5e_lro_update_hdr()
> which parses into TCP header.
>

LRO is a different beast.

For packets that are not recognized as LRO candidates
(for example because their IP length is bigger than the frame length),
we exactly take the code path you want to change.

A NIC is supposed to deliver frames, even the ones that 'seem' bad.

[PATCH net 3/3] tcp: fix SNMP TCP timeout under-estimation

2018-11-28 Thread Yuchung Cheng

Previously the SNMP TCPTIMEOUTS counter had inconsistent accounting:
1. It counted all SYN and SYN-ACK timeouts
2. It counted timeouts in other states, except recurring timeouts and
   timeouts after fast recovery or disorder state.

Such selective accounting makes analysis difficult and complicated. For
example, the monitoring system needs to collect many other SNMP counters
to infer the total number of timeout events. This patch makes the
TCPTIMEOUTS counter simply count all retransmit timeouts (SYN, data, or FIN).

Signed-off-by: Yuchung Cheng 
Signed-off-by: Eric Dumazet 
Signed-off-by: Neal Cardwell 
---
 net/ipv4/tcp_timer.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 94d858c604f6..5cd02b7b62f6 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -482,11 +482,12 @@ void tcp_retransmit_timer(struct sock *sk)
goto out_reset_timer;
}
 
+   __NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPTIMEOUTS);
if (tcp_write_timeout(sk))
goto out;
 
if (icsk->icsk_retransmits == 0) {
-   int mib_idx;
+   int mib_idx = 0;
 
if (icsk->icsk_ca_state == TCP_CA_Recovery) {
if (tcp_is_sack(tp))
@@ -501,10 +502,9 @@ void tcp_retransmit_timer(struct sock *sk)
mib_idx = LINUX_MIB_TCPSACKFAILURES;
else
mib_idx = LINUX_MIB_TCPRENOFAILURES;
-   } else {
-   mib_idx = LINUX_MIB_TCPTIMEOUTS;
}
-   __NET_INC_STATS(sock_net(sk), mib_idx);
+   if (mib_idx)
+   __NET_INC_STATS(sock_net(sk), mib_idx);
}
 
tcp_enter_loss(sk);
-- 
2.20.0.rc0.387.gc7a69e6b6c-goog

[PATCH net 1/3] tcp: fix off-by-one bug on aborting window-probing socket

2018-11-28 Thread Yuchung Cheng

Previously there was an off-by-one bug in determining when to abort
a stalled window-probing socket. This patch fixes that so it is
consistent with tcp_write_timeout().

Signed-off-by: Yuchung Cheng 
Signed-off-by: Eric Dumazet 
Signed-off-by: Neal Cardwell 
---
 net/ipv4/tcp_timer.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 5f8b6d3cd855..94d858c604f6 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -376,7 +376,7 @@ static void tcp_probe_timer(struct sock *sk)
return;
}
 
-   if (icsk->icsk_probes_out > max_probes) {
+   if (icsk->icsk_probes_out >= max_probes) {
 abort: tcp_write_err(sk);
} else {
/* Only send another probe if we didn't close things up. */
-- 
2.20.0.rc0.387.gc7a69e6b6c-goog
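
The effect of the one-character change can be seen in a toy model. This is only a sketch, not the kernel timer logic: probes_before_abort() is a hypothetical function that counts how many unanswered probes go out before the socket is given up, under the assumption that the counter is checked before each new probe.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: return the number of unanswered probes sent before abort.
 * inclusive == false models the old '>' test, true models the new '>='.
 */
static int probes_before_abort(int max_probes, bool inclusive)
{
	int probes_out = 0;

	assert(max_probes >= 0);
	for (;;) {
		bool give_up = inclusive ? probes_out >= max_probes
					 : probes_out > max_probes;

		if (give_up)
			return probes_out;
		probes_out++;	/* send one more probe, no reply arrives */
	}
}
```

With max_probes = 3 the old comparison lets four probes out before aborting, the new one three.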

[PATCH net 2/3] tcp: fix SNMP under-estimation on failed retransmission

2018-11-28 Thread Yuchung Cheng

Previously the SNMP counter LINUX_MIB_TCPRETRANSFAIL did not count
TSO/GSO segments properly on failed retransmission. This patch fixes that.

Signed-off-by: Yuchung Cheng 
Signed-off-by: Eric Dumazet 
Signed-off-by: Neal Cardwell 
---
 net/ipv4/tcp_output.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index c5dc4c4fdadd..87bd1c61f4bf 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2929,7 +2929,7 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs)
TCP_SKB_CB(skb)->sacked |= TCPCB_EVER_RETRANS;
trace_tcp_retransmit_skb(sk, skb);
} else if (err != -EBUSY) {
-   NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPRETRANSFAIL);
+   NET_ADD_STATS(sock_net(sk), LINUX_MIB_TCPRETRANSFAIL, segs);
}
return err;
 }
-- 
2.20.0.rc0.387.gc7a69e6b6c-goog

[PATCH net 0/3] fixes in timeout and retransmission accounting

2018-11-28 Thread Yuchung Cheng

This patch set has assorted fixes of minor accounting issues in
timeout, window probe, and retransmission stats.

Yuchung Cheng (3):
  tcp: fix off-by-one bug on aborting window-probing socket
  tcp: fix SNMP under-estimation on failed retransmission
  tcp: fix SNMP TCP timeout under-estimation

 net/ipv4/tcp_output.c |  2 +-
 net/ipv4/tcp_timer.c  | 10 +-
 2 files changed, 6 insertions(+), 6 deletions(-)

-- 
2.20.0.rc0.387.gc7a69e6b6c-goog

Re: [PATCH bpf v2 0/4] bpf: btf: check name validity for various types

2018-11-28 Thread Alexei Starovoitov

On Tue, Nov 27, 2018 at 01:23:26PM -0800, Yonghong Song wrote:
> This patch set added name checking for PTR, ARRAY, VOLATILE, TYPEDEF,
> CONST, RESTRICT, STRUCT, UNION, ENUM and FWD types. Such a strict
> name checking makes BTF more sound in the kernel and future
> BTF-to-header-file conversion ([1]) less fragile.
> 
> Patch #1 implemented btf_name_valid_identifier() for name checking
> which will be used in Patch #2.
> Patch #2 checked name validity for the above mentioned types.
> Patch #3 fixed two existing test_btf unit tests exposed by the strict
> name checking.
> Patch #4 added additional test cases.
> 
> This patch set is against bpf tree.
> 
> Patch #1 has been implemented in bpf-next commit
> Commit 2667a2626f4d ("bpf: btf: Add BTF_KIND_FUNC
> and BTF_KIND_FUNC_PROTO"), so there is no need to apply this
> patch to bpf-next. In case this patch is applied to bpf-next,
> there will be a minor conflict like
>   diff --cc kernel/bpf/btf.c
>   index a09b2f94ab25,93c233ab2db6..
>   --- a/kernel/bpf/btf.c
>   +++ b/kernel/bpf/btf.c
>   @@@ -474,7 -451,7 +474,11 @@@ static bool btf_name_valid_identifier(c
>   return !*src;
> }
> 
>   ++<<< HEAD
>+const char *btf_name_by_offset(const struct btf *btf, u32 offset)
>   ++===
>   + static const char *btf_name_by_offset(const struct btf *btf, u32 offset)
>   ++>>> fa9566b0847d... bpf: btf: implement btf_name_valid_identifier()
> {
>   if (!offset)
>   return "(anon)";
> Just resolve the conflict by taking the "const char ..." line.
> 
> Patches #2, #3 and #4 can be applied to bpf-next without conflict.
> 
> [1]: http://vger.kernel.org/lpc-bpf2018.html#session-2

Applied to bpf tree. Thanks!

Re: [Patch net v2] mlx5: fixup checksum for short ethernet frame padding

2018-11-28 Thread Cong Wang

On Wed, Nov 28, 2018 at 3:57 PM Cong Wang  wrote:
> But again, I kinda feel the hardware already does the sanity check,
> otherwise we have much more serious trouble in mlx5e_lro_update_hdr()
> which parses into TCP header.
>

While we are on this page, mlx5e_lro_update_hdr() incorrectly assumes
TCP header is located right after struct iphdr, which is wrong if we could
have IP options on this path.

It could be the hardware which already verified this corner case though.

Re: [PATCH iproute2] ss: add support for delivered and delivered_ce fields

2018-11-28 Thread Stephen Hemminger

On Mon, 26 Nov 2018 14:29:53 -0800
Eric Dumazet  wrote:

> Kernel support was added in linux-4.18 in commit feb5f2ec6464
> ("tcp: export packets delivery info")
> 
> Tested:
> 
> ss -ti
> ...
> ESTAB   0 2270520  [2607:f8b0:8099:e16::]:47646   
> [2607:f8b0:8099:e18::]:38953
>ts sack cubic wscale:8,8 rto:7 rtt:2.824/0.278 mss:1428
>  pmtu:1500 rcvmss:536 advmss:1428 cwnd:89 ssthresh:62 
> bytes_acked:2097871945
> segs_out:1469144 segs_in:65221 data_segs_out:1469142 send 360.0Mbps 
> lastsnd:2
> lastrcv:99231 lastack:2 pacing_rate 431.9Mbps delivery_rate 246.4Mbps
> (*) delivered:1469099 delivered_ce:424799
> busy:99231ms unacked:44 rcv_space:14280 rcv_ssthresh:65535
> notsent:2207688 minrtt:0.228
> 
> Signed-off-by: Eric Dumazet 

Applied

Re: [PATCH iproute2] bridge: make -c match -compressvlans first instead of -color

2018-11-28 Thread Stephen Hemminger

On Tue, 27 Nov 2018 18:02:52 -0800
Roopa Prabhu  wrote:

> From: Roopa Prabhu 
> 
> commit c7c1a1ef51ae ("bridge: colorize output and use JSON print library")
> broke previous use of -c to represent compressvlans. This restores
> previous use of -c to represent compressvlans. Understand the original
> motivation to use -c to represent color consistently everywhere but
> there are apps and network interface managers out there that are already
> using -c to represent compressed vlans.
> 
> Fixes: c7c1a1ef51ae ("bridge: colorize output and use JSON print library")
> Signed-off-by: Roopa Prabhu 

Applied.

Re: [iproute PATCH] man: rdma: Add reference to rdma-resource.8

2018-11-28 Thread Stephen Hemminger

On Mon, 26 Nov 2018 18:58:31 +0100
Phil Sutter  wrote:

> All rdma-related man pages list each other in SEE ALSO section, only
> rdma-resource.8 is missing. Add it for the sake of consistency.
> 
> Signed-off-by: Phil Sutter 

Applied

Re: [Patch net v2] mlx5: fixup checksum for short ethernet frame padding

2018-11-28 Thread Cong Wang

On Wed, Nov 28, 2018 at 3:50 PM Eric Dumazet  wrote:
>
> On Wed, Nov 28, 2018 at 2:16 PM Cong Wang  wrote:
> >
> > On Wed, Nov 28, 2018 at 7:00 AM Eric Dumazet  wrote:
> > >
> > > Nice packet of death alert.
> > >
> > > pad_len can be 0xFF67  here, if frame_len is smaller than pad_offset.
> >
> > Unless IP header is malformed, how could it be?
>
> This is totally something an attacker can forge.

Of course, as in the email I sent to mellanox guys, __vlan_get_protocol()
could _literally_ exhaust all skb->len. If there is not sufficient skb tail
room, we could even possibly crash.

But again, I kinda feel the hardware already does the sanity check,
otherwise we have much more serious trouble in mlx5e_lro_update_hdr()
which parses into TCP header.

Thanks.

Re: [PATCH net-next v2 1/2] udp: msg_zerocopy

2018-11-28 Thread Willem de Bruijn

On Mon, Nov 26, 2018 at 2:49 PM Willem de Bruijn
 wrote:
>
> On Mon, Nov 26, 2018 at 1:19 PM Willem de Bruijn
>  wrote:
> >
> > On Mon, Nov 26, 2018 at 1:04 PM Paolo Abeni  wrote:
> > >
> > > On Mon, 2018-11-26 at 12:59 -0500, Willem de Bruijn wrote:
> > > > The callers of this function do flush the queue of the other skbs on
> > > > error, but only after the call to sock_zerocopy_put_abort.
> > > >
> > > > sock_zerocopy_put_abort depends on total rollback to revert the
> > > > sk_zckey increment and suppress the completion notification (which
> > > > must not happen on return with error).
> > > >
> > > > I don't immediately have a fix. Need to think about this some more..
> > >
> > > [still out of sheer ignorance] How about taking a refcnt for the whole
> > > ip_append_data() scope, like in the tcp case? that will add an atomic
> > > op per loop (likely, hitting the cache) but will remove some code hunk
> > > in sock_zerocopy_put_abort() and sock_zerocopy_alloc().
> >
> > The atomic op pair is indeed what I was trying to avoid. But I also need
> > to solve the problem that the final decrement will happen from the freeing
> > of the other skbs in __ip_flush_pending_frames, and will not suppress
> > the notification.
> >
> > Freeing the entire queue inside __ip_append_data, effectively making it
> > a true noop on error is one approach. But that is invasive, also to non
> > zerocopy codepaths, so I would rather avoid that.
> >
> > Perhaps I need to handle the abort logic in udp_sendmsg directly,
> > after both __ip_append_data and __ip_flush_pending_frames.
>
> Actually,
>
> (1) the notification suppression is handled correctly, as .._abort
> decrements uarg->len. If now zero, this suppresses the notification
> in sock_zerocopy_callback, regardless whether that callback is
> called right away or from a later kfree_skb.
>
> (2) if moving skb_zcopy_set below getfrag, then no kfree_skb
> will be called on a zerocopy skb inside __ip_append_data. So on
> abort the refcount is exactly the number of zerocopy skbs on the
> queue that will call sock_zerocopy_put later. Abort then only needs
> to handle special case zero, and call sock_zerocopy_put right away.

An additional issue is that refcount_t cannot be initialized to
zero, then incremented for each skb, unlike the original patch
based on atomic_t.

I did revert to the basic implementation using an extra ref
for the function call, similar to TCP, as you suggested.

On top of that as a separate optimization patch I have a
variant that uses refcnt zero by replacing refcount_inc with
refcount_set(.., refcount_read(..) + 1). Not very pretty.

An alternative to elide the cost of sock_zerocopy_put in the
fast path is to add a static branch on SO_ZEROCOPY. That
would also compile it out for TCP in the common case.

Anyway, I do intend to send a v3, but it's taking a bit longer
to try to find a clean solution.
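
To illustrate the refcount_t point above, here is a toy model of the checked behaviour only. The toy_* names are hypothetical, it is single-threaded, and it is not the kernel implementation: it just assumes that an increment from zero is treated as a use-after-free and refused.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_refcount {
	unsigned int refs;
};

/* Refuses to resurrect a zero count, like a checked refcount increment. */
static bool toy_refcount_inc(struct toy_refcount *r)
{
	assert(r);
	if (r->refs == 0)
		return false;
	r->refs++;
	return true;
}

static unsigned int toy_refcount_read(const struct toy_refcount *r)
{
	return r->refs;
}

static void toy_refcount_set(struct toy_refcount *r, unsigned int n)
{
	r->refs = n;
}
```

In this model the set(read + 1) form does move the count from 0 to 1 where inc would not, but it is a separate read and write, which is part of why it is "not very pretty".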

>
> Tentative fix on top of v2 (I'll squash into v3):
>
> ---
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 2179ef84bb44..4b21a58329d1 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1099,12 +1099,13 @@ void sock_zerocopy_put_abort(struct ubuf_info *uarg)
>
> -   if (sk->sk_type != SOCK_STREAM && !refcount_read(&uarg->refcnt))
> +   if (sk->sk_type != SOCK_STREAM && !refcount_read(&uarg->refcnt)) {
> refcount_set(&uarg->refcnt, 1);
> -
> -   sock_zerocopy_put(uarg);
> +   sock_zerocopy_put(uarg);
> +   }
> }
>  }
>
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index 7504da2f33d6..a19396e21b35 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -1014,13 +1014,6 @@ static int __ip_append_data(struct sock *sk,
> skb->csum = 0;
> skb_reserve(skb, hh_len);
>
> -   /* only the initial fragment is time stamped */
> -   skb_shinfo(skb)->tx_flags = cork->tx_flags;
> -   cork->tx_flags = 0;
> -   skb_shinfo(skb)->tskey = tskey;
> -   tskey = 0;
> -   skb_zcopy_set(skb, uarg);
> -
> /*
>  *  Find where to start putting bytes.
>  */
> @@ -1053,6 +1046,13 @@ static int __ip_append_data(struct sock *sk,
> exthdrlen = 0;
> csummode = CHECKSUM_NONE;
>
> +   /* only the initial fragment is time stamped */
> +   skb_shinfo(skb)->tx_flags = cork->tx_flags;
> +   cork->tx_flags = 0;
> +   skb_shinfo(skb)->tskey = tskey;
> +   tskey = 0;
> +   skb_zcopy_set(skb, uarg);
> +
> if ((flags & MSG_CONFIRM) && !skb_prev)
> skb_set_dst_pending_confirm(skb, 1);
>
> ---
>
> This

Re: [Patch net v2] mlx5: fixup checksum for short ethernet frame padding

2018-11-28 Thread Eric Dumazet

On Wed, Nov 28, 2018 at 2:16 PM Cong Wang  wrote:
>
> On Wed, Nov 28, 2018 at 7:00 AM Eric Dumazet  wrote:
> >
> > Nice packet of death alert.
> >
> > pad_len can be 0xFF67  here, if frame_len is smaller than pad_offset.
>
> Unless IP header is malformed, how could it be?

This is totally something an attacker can forge.

ip_rcv_core()
...
len = ntohs(iph->tot_len);
if (skb->len < len) {
   __IP_INC_STATS(net, IPSTATS_MIB_INTRUNCATEDPKTS);
   goto drop;

No crash, but we drop and increment appropriate SNMP counter.
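
For illustration, the wraparound mentioned above can be shown with a minimal sketch. It assumes pad_len is a 16-bit unsigned value computed as frame_len - pad_offset; the function names and the numbers are made up and this is not the mlx5 code.

```c
#include <assert.h>
#include <stdint.h>

/* Unchecked subtraction: wraps around when pad_offset > frame_len. */
static uint16_t pad_len_unchecked(uint16_t frame_len, uint16_t pad_offset)
{
	return (uint16_t)(frame_len - pad_offset);
}

/* Checked variant: returns -1 when the header claims more bytes than the
 * frame holds, 0 otherwise with the result stored in *pad_len.
 */
static int pad_len_checked(uint16_t frame_len, uint16_t pad_offset,
			   uint16_t *pad_len)
{
	assert(pad_len);
	if (pad_offset > frame_len)
		return -1;
	*pad_len = (uint16_t)(frame_len - pad_offset);
	return 0;
}
```

A 60-byte frame whose headers claim 213 bytes gives 60 - 213 = -153, which wraps to 0xFF67 in 16 bits; the checked variant rejects it instead, in the same spirit as the skb->len < len test in ip_rcv_core().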

Re: [PATCH] bpf: Fix various lib and testsuite build failures on 32-bit.

2018-11-28 Thread Song Liu

On Wed, Nov 28, 2018 at 12:59 PM David Miller  wrote:
>
>
> Cannot cast a u64 to a pointer on 32-bit without an intervening (long)
> cast otherwise GCC warns.
>
> Signed-off-by: David S. Miller 

Acked-by: Song Liu 


> --
>
> diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c
> index eadcf8d..c2d641f 100644
> --- a/tools/lib/bpf/btf.c
> +++ b/tools/lib/bpf/btf.c
> @@ -466,7 +466,7 @@ int btf__get_from_id(__u32 id, struct btf **btf)
> goto exit_free;
> }
>
> -   *btf = btf__new((__u8 *)btf_info.btf, btf_info.btf_size, NULL);
> +   *btf = btf__new((__u8 *)(long)btf_info.btf, btf_info.btf_size, NULL);
> if (IS_ERR(*btf)) {
> err = PTR_ERR(*btf);
> *btf = NULL;
> diff --git a/tools/testing/selftests/bpf/test_progs.c 
> b/tools/testing/selftests/bpf/test_progs.c
> index c1e688f6..1c57abb 100644
> --- a/tools/testing/selftests/bpf/test_progs.c
> +++ b/tools/testing/selftests/bpf/test_progs.c
> @@ -524,7 +524,7 @@ static void test_bpf_obj_id(void)
>   load_time < now - 60 || load_time > now + 60 ||
>   prog_infos[i].created_by_uid != my_uid ||
>   prog_infos[i].nr_map_ids != 1 ||
> - *(int *)prog_infos[i].map_ids != map_infos[i].id ||
> + *(int *)(long)prog_infos[i].map_ids != 
> map_infos[i].id ||
>   strcmp((char *)prog_infos[i].name, 
> expected_prog_name),
>   "get-prog-info(fd)",
>   "err %d errno %d i %d type %d(%d) info_len %u(%Zu) 
> jit_enabled %d jited_prog_len %u xlated_prog_len %u jited_prog %d xlated_prog 
> %d load_time %lu(%lu) uid %u(%u) nr_map_ids %u(%u) map_id %u(%u) name 
> %s(%s)\n",
> @@ -539,7 +539,7 @@ static void test_bpf_obj_id(void)
>   load_time, now,
>   prog_infos[i].created_by_uid, my_uid,
>   prog_infos[i].nr_map_ids, 1,
> - *(int *)prog_infos[i].map_ids, map_infos[i].id,
> + *(int *)(long)prog_infos[i].map_ids, 
> map_infos[i].id,
>   prog_infos[i].name, expected_prog_name))
> goto done;
> }
> @@ -585,7 +585,7 @@ static void test_bpf_obj_id(void)
> bzero(&prog_info, sizeof(prog_info));
> info_len = sizeof(prog_info);
>
> -   saved_map_id = *(int *)(prog_infos[i].map_ids);
> +   saved_map_id = *(int *)((long)prog_infos[i].map_ids);
> prog_info.map_ids = prog_infos[i].map_ids;
> prog_info.nr_map_ids = 2;
> err = bpf_obj_get_info_by_fd(prog_fd, &prog_info, &info_len);
> @@ -593,12 +593,12 @@ static void test_bpf_obj_id(void)
> prog_infos[i].xlated_prog_insns = 0;
> CHECK(err || info_len != sizeof(struct bpf_prog_info) ||
>   memcmp(&prog_info, &prog_infos[i], info_len) ||
> - *(int *)prog_info.map_ids != saved_map_id,
> + *(int *)(long)prog_info.map_ids != saved_map_id,
>   "get-prog-info(next_id->fd)",
>   "err %d errno %d info_len %u(%Zu) memcmp %d map_id 
> %u(%u)\n",
>   err, errno, info_len, sizeof(struct bpf_prog_info),
>   memcmp(&prog_info, &prog_infos[i], info_len),
> - *(int *)prog_info.map_ids, saved_map_id);
> + *(int *)(long)prog_info.map_ids, saved_map_id);
> close(prog_fd);
> }
> CHECK(nr_id_found != nr_iters,

Re: [PATCH bpf] tools: bpftool: fix a bitfield pretty print issue

2018-11-28 Thread Song Liu

On Wed, Nov 28, 2018 at 10:09 AM Yonghong Song  wrote:
>
> Commit b12d6ec09730 ("bpf: btf: add btf print functionality")
> added btf pretty print functionality to bpftool.
> There is a problem though in printing a bitfield whose type
> has modifiers.
>
> For example, for a type like
>   typedef int ___int;
>   struct tmp_t {
>   int a:3;
>   ___int b:3;
>   };
> Suppose we have a map
>   struct bpf_map_def SEC("maps") tmpmap = {
>   .type = BPF_MAP_TYPE_HASH,
>   .key_size = sizeof(__u32),
>   .value_size = sizeof(struct tmp_t),
>   .max_entries = 1,
>   };
> and the hash table is populated with one element with
> key 0 and value (.a = 1 and .b = 2).
>
> In BTF, the struct member "b" will have a type "typedef" which
> points to an int type. The current implementation does not
> pass the bit offset during transition from typedef to int type,
> hence incorrectly prints the value as
>   $ bpftool m d id 79
>   [{
>   "key": 0,
>   "value": {
>   "a": 0x1,
>   "b": 0x1
>   }
>   }
>   ]
>
> This patch fixes the issue by carrying bit_offset along the type
> chain during bit_field print. The correct result can be printed as
>   $ bpftool m d id 76
>   [{
>   "key": 0,
>   "value": {
>   "a": 0x1,
>   "b": 0x2
>   }
>   }
>   ]
>
> The kernel pretty print is implemented correctly and does not
> have this issue.
>
> Fixes: b12d6ec09730 ("bpf: btf: add btf print functionality")
> Signed-off-by: Yonghong Song 

Acked-by: Song Liu 

> ---
>  tools/bpf/bpftool/btf_dumper.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/tools/bpf/bpftool/btf_dumper.c b/tools/bpf/bpftool/btf_dumper.c
> index 55bc512a1831..e4e6e2b3fd84 100644
> --- a/tools/bpf/bpftool/btf_dumper.c
> +++ b/tools/bpf/bpftool/btf_dumper.c
> @@ -32,7 +32,7 @@ static void btf_dumper_ptr(const void *data, json_writer_t 
> *jw,
>  }
>
>  static int btf_dumper_modifier(const struct btf_dumper *d, __u32 type_id,
> -  const void *data)
> +  __u8 bit_offset, const void *data)
>  {
> int actual_type_id;
>
> @@ -40,7 +40,7 @@ static int btf_dumper_modifier(const struct btf_dumper *d, 
> __u32 type_id,
> if (actual_type_id < 0)
> return actual_type_id;
>
> -   return btf_dumper_do_type(d, actual_type_id, 0, data);
> +   return btf_dumper_do_type(d, actual_type_id, bit_offset, data);
>  }
>
>  static void btf_dumper_enum(const void *data, json_writer_t *jw)
> @@ -237,7 +237,7 @@ static int btf_dumper_do_type(const struct btf_dumper *d, 
> __u32 type_id,
> case BTF_KIND_VOLATILE:
> case BTF_KIND_CONST:
> case BTF_KIND_RESTRICT:
> -   return btf_dumper_modifier(d, type_id, data);
> +   return btf_dumper_modifier(d, type_id, bit_offset, data);
> default:
> jsonw_printf(d->jw, "(unsupported-kind");
> return -EINVAL;
> --
> 2.17.1
>
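
To make the failure mode in the commit message concrete, here is a small
sketch of bitfield extraction with and without the bit offset. It is not
the bpftool code: extract_bits() is a hypothetical helper, and it
assumes bitfields are allocated from the least significant bit of the
first byte (the usual little-endian layout). Under that assumption,
.a = 1 and .b = 2 pack into the byte 0x11.

```c
#include <stdint.h>

/* Hypothetical helper: read nr_bits (1..32) starting bit_offset bits
 * into data, assuming little-endian bitfield layout. */
static uint32_t extract_bits(const void *data, unsigned int bit_offset,
			     unsigned int nr_bits)
{
	const uint8_t *p = (const uint8_t *)data + bit_offset / 8;
	unsigned int shift = bit_offset % 8;
	unsigned int nbytes = (shift + nr_bits + 7) / 8;
	uint64_t v = 0;
	unsigned int i;

	/* Assemble the covering bytes, lowest address first. */
	for (i = 0; i < nbytes; i++)
		v |= (uint64_t)p[i] << (8 * i);

	/* Drop the bits below the field, then mask to the field width. */
	v >>= shift;
	return (uint32_t)(v & ((1ULL << nr_bits) - 1));
}
```

With the byte 0x11, reading 3 bits at offset 3 yields 2 (member b),
while reading 3 bits at offset 0 yields 1. Passing 0 instead of the real
offset when following the typedef therefore reads member a's bits, which
matches the wrong "b": 0x1 output shown in the report.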

[Patch net] mlx5: fix get_ip_proto()

2018-11-28 Thread Cong Wang

The IP header is not necessarily located right after struct ethhdr;
there could be multiple 802.1Q headers in between, which is why
we call __vlan_get_protocol().

Fixes: fe1dc069990c ("net/mlx5e: don't set CHECKSUM_COMPLETE on SCTP packets")
Cc: Alaa Hleihel 
Cc: Or Gerlitz 
Cc: Saeed Mahameed 
Signed-off-by: Cong Wang 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 9b6bd2b51556..f7c5dbb0ffcd 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -724,9 +724,9 @@ static u32 mlx5e_get_fcs(const struct sk_buff *skb)
return __get_unaligned_cpu32(fcs_bytes);
 }
 
-static u8 get_ip_proto(struct sk_buff *skb, __be16 proto)
+static u8 get_ip_proto(struct sk_buff *skb, int network_depth, __be16 proto)
 {
-   void *ip_p = skb->data + sizeof(struct ethhdr);
+   void *ip_p = skb->data + network_depth;
 
return (proto == htons(ETH_P_IP)) ? ((struct iphdr *)ip_p)->protocol :
((struct ipv6hdr *)ip_p)->nexthdr;
@@ -786,7 +786,7 @@ static inline void mlx5e_handle_csum(struct net_device 
*netdev,
goto csum_unnecessary;
 
if (likely(is_last_ethertype_ip(skb, &network_depth, &proto))) {
-   if (unlikely(get_ip_proto(skb, proto) == IPPROTO_SCTP))
+   if (unlikely(get_ip_proto(skb, network_depth, proto) == 
IPPROTO_SCTP))
goto csum_unnecessary;
 
skb->ip_summed = CHECKSUM_COMPLETE;
-- 
2.19.1
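
For readers who want to see why a fixed sizeof(struct ethhdr) offset is
wrong, here is a user-space sketch of walking past stacked VLAN tags in
a raw frame. It is only an illustration of the idea, not the kernel's
__vlan_get_protocol(); the function name and the *_SKETCH constants are
made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

#define ETH_HLEN_SKETCH		14	/* dst MAC + src MAC + EtherType */
#define VLAN_HLEN_SKETCH	4	/* one 802.1Q / 802.1ad tag */
#define ETHERTYPE_8021Q		0x8100
#define ETHERTYPE_8021AD	0x88A8

/* Returns the innermost EtherType (host order) and stores the offset of
 * the L3 header in *depth. Returns 0 if the frame is too short. */
static uint16_t frame_l3_offset(const uint8_t *frame, size_t len,
				size_t *depth)
{
	size_t off = ETH_HLEN_SKETCH;
	uint16_t type;

	if (len < off)
		return 0;

	type = (uint16_t)((frame[12] << 8) | frame[13]);
	while (type == ETHERTYPE_8021Q || type == ETHERTYPE_8021AD) {
		if (len < off + VLAN_HLEN_SKETCH)
			return 0;
		/* Each tag is 2 bytes of TCI followed by the next EtherType. */
		type = (uint16_t)((frame[off + 2] << 8) | frame[off + 3]);
		off += VLAN_HLEN_SKETCH;
	}

	*depth = off;
	return type;
}
```

An untagged IPv4 frame gives depth 14, one tag gives 18, and a
double-tagged frame gives 22, so reading the protocol field at offset 14
would look at tag bytes in the last two cases.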

Re: [PATCH RFC iproute2-next 1/2] rdma: add 'link add/delete' commands

2018-11-28 Thread Steve Wise




On 11/28/2018 4:25 PM, Jason Gunthorpe wrote:
> On Wed, Nov 28, 2018 at 04:21:48PM -0600, Steve Wise wrote:
>
>>>> It does make sense to not require type.  The name must be unique so that
>>>> should be enough.  I'll have to respin the kernel side though...
>>> The delete_link really should be an operation on the ib_device, not
>>> the link_ops thing. 
>>>
>>> That directly prevents mis-matching function callbacks..
>>>
>>> Jason
>> Looking at the rtnetlink newlink/dellink, I see they cache the link_ops
>> ptr in the net_device struct.  So when the link is deleted, then
>> appropriate driver-specific dellink function can be called after finding
>> the device to be deleted.  Should I do something along these lines?  IE
>> add a struct rdma_link_ops pointer to struct ib_device.
> I don't see a problem with that either..
>
> Jason

Ok, I'll respin the kernel and user patches tomorrow.  Thanks!

[PATCH net-next v2 0/3] vxlan: a few minor cleanups

2018-11-28 Thread Roopa Prabhu

From: Roopa Prabhu 

Roopa Prabhu (3):
  vxlan: support changelink for a few more attributes
  vxlan: extack support for some changelink cases
  vxlan: move flag sets to use a helper func

 drivers/net/vxlan.c | 199 +++-
 1 file changed, 102 insertions(+), 97 deletions(-)

-- 
2.1.4
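
As a rough illustration of the helper that patch 3/3 introduces, the
sketch below models "set or clear a flag based on an attribute" in plain
user-space C. The struct and function names are hypothetical stand-ins
(attr_sketch for struct nlattr, conf_sketch for struct vxlan_config),
and the u8 payload is modelled as a plain field rather than read with
nla_get_u8().

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for struct nlattr and struct vxlan_config. */
struct attr_sketch { uint8_t value; };
struct conf_sketch { unsigned long flags; };

#define F_PROXY_SKETCH	(1UL << 0)
#define F_RSC_SKETCH	(1UL << 1)

/* Absent attribute: leave flags untouched.
 * Present attribute: set mask on a nonzero value, clear it on zero. */
static void nl2flag_sketch(struct conf_sketch *conf,
			   const struct attr_sketch *tb[],
			   int attrtype, unsigned long mask)
{
	if (!tb[attrtype])
		return;

	if (tb[attrtype]->value)
		conf->flags |= mask;
	else
		conf->flags &= ~mask;
}
```

One thing the sketch makes visible: the open-coded blocks being replaced
for attributes such as IFLA_VXLAN_PROXY only ever set the flag, whereas
a helper of this shape also clears it when the attribute is present with
a zero value. That matters for changelink, where the flag may already be
set on the device.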

[PATCH net-next v2 3/3] vxlan: move flag sets to use a helper func

2018-11-28 Thread Roopa Prabhu

From: Roopa Prabhu 

Signed-off-by: Roopa Prabhu 
---
 drivers/net/vxlan.c | 95 -
 1 file changed, 43 insertions(+), 52 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 4cb6b50..47671fd 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -3374,6 +3374,23 @@ static int __vxlan_dev_create(struct net *net, struct 
net_device *dev,
return err;
 }
 
+/* Set/clear flags based on attribute */
+static void vxlan_nl2flag(struct vxlan_config *conf, struct nlattr *tb[],
+ int attrtype, unsigned long mask)
+{
+   unsigned long flags;
+
+   if (!tb[attrtype])
+   return;
+
+   if (nla_get_u8(tb[attrtype]))
+   flags = conf->flags | mask;
+   else
+   flags = conf->flags & ~mask;
+
+   conf->flags = flags;
+}
+
 static int vxlan_nl2conf(struct nlattr *tb[], struct nlattr *data[],
 struct net_device *dev, struct vxlan_config *conf,
 bool changelink, struct netlink_ext_ack *extack)
@@ -3458,45 +3475,27 @@ static int vxlan_nl2conf(struct nlattr *tb[], struct 
nlattr *data[],
if (data[IFLA_VXLAN_TTL])
conf->ttl = nla_get_u8(data[IFLA_VXLAN_TTL]);
 
-   if (data[IFLA_VXLAN_TTL_INHERIT])
-   conf->flags |= VXLAN_F_TTL_INHERIT;
+   vxlan_nl2flag(conf, data, IFLA_VXLAN_TTL_INHERIT,
+ VXLAN_F_TTL_INHERIT);
 
if (data[IFLA_VXLAN_LABEL])
conf->label = nla_get_be32(data[IFLA_VXLAN_LABEL]) &
 IPV6_FLOWLABEL_MASK;
 
-   if (data[IFLA_VXLAN_LEARNING]) {
-   if (nla_get_u8(data[IFLA_VXLAN_LEARNING]))
-   conf->flags |= VXLAN_F_LEARN;
-   else
-   conf->flags &= ~VXLAN_F_LEARN;
-   } else if (!changelink) {
+   if (data[IFLA_VXLAN_LEARNING])
+   vxlan_nl2flag(conf, data, IFLA_VXLAN_LEARNING,
+ VXLAN_F_LEARN);
+   else if (!changelink)
/* default to learn on a new device */
conf->flags |= VXLAN_F_LEARN;
-   }
 
if (data[IFLA_VXLAN_AGEING])
conf->age_interval = nla_get_u32(data[IFLA_VXLAN_AGEING]);
 
-   if (data[IFLA_VXLAN_PROXY]) {
-   if (nla_get_u8(data[IFLA_VXLAN_PROXY]))
-   conf->flags |= VXLAN_F_PROXY;
-   }
-
-   if (data[IFLA_VXLAN_RSC]) {
-   if (nla_get_u8(data[IFLA_VXLAN_RSC]))
-   conf->flags |= VXLAN_F_RSC;
-   }
-
-   if (data[IFLA_VXLAN_L2MISS]) {
-   if (nla_get_u8(data[IFLA_VXLAN_L2MISS]))
-   conf->flags |= VXLAN_F_L2MISS;
-   }
-
-   if (data[IFLA_VXLAN_L3MISS]) {
-   if (nla_get_u8(data[IFLA_VXLAN_L3MISS]))
-   conf->flags |= VXLAN_F_L3MISS;
-   }
+   vxlan_nl2flag(conf, data, IFLA_VXLAN_PROXY, VXLAN_F_PROXY);
+   vxlan_nl2flag(conf, data, IFLA_VXLAN_RSC, VXLAN_F_RSC);
+   vxlan_nl2flag(conf, data, IFLA_VXLAN_L2MISS, VXLAN_F_L2MISS);
+   vxlan_nl2flag(conf, data, IFLA_VXLAN_L3MISS, VXLAN_F_L3MISS);
 
if (data[IFLA_VXLAN_LIMIT]) {
if (changelink) {
@@ -3514,8 +3513,8 @@ static int vxlan_nl2conf(struct nlattr *tb[], struct 
nlattr *data[],
"Cannot change metadata flag");
return -EOPNOTSUPP;
}
-   if (nla_get_u8(data[IFLA_VXLAN_COLLECT_METADATA]))
-   conf->flags |= VXLAN_F_COLLECT_METADATA;
+   vxlan_nl2flag(conf, data, IFLA_VXLAN_COLLECT_METADATA,
+ VXLAN_F_COLLECT_METADATA);
}
 
if (data[IFLA_VXLAN_PORT_RANGE]) {
@@ -3553,34 +3552,26 @@ static int vxlan_nl2conf(struct nlattr *tb[], struct 
nlattr *data[],
conf->flags |= VXLAN_F_UDP_ZERO_CSUM_TX;
}
 
-   if (data[IFLA_VXLAN_UDP_ZERO_CSUM6_TX]) {
-   if (nla_get_u8(data[IFLA_VXLAN_UDP_ZERO_CSUM6_TX]))
-   conf->flags |= VXLAN_F_UDP_ZERO_CSUM6_TX;
-   }
+   vxlan_nl2flag(conf, data, IFLA_VXLAN_UDP_ZERO_CSUM6_TX,
+ VXLAN_F_UDP_ZERO_CSUM6_TX);
 
-   if (data[IFLA_VXLAN_UDP_ZERO_CSUM6_RX]) {
-   if (nla_get_u8(data[IFLA_VXLAN_UDP_ZERO_CSUM6_RX]))
-   conf->flags |= VXLAN_F_UDP_ZERO_CSUM6_RX;
-   }
+   vxlan_nl2flag(conf, data, IFLA_VXLAN_UDP_ZERO_CSUM6_RX,
+ VXLAN_F_UDP_ZERO_CSUM6_RX);
 
-   if (data[IFLA_VXLAN_REMCSUM_TX]) {
-   if (nla_get_u8(data[IFLA_VXLAN_REMCSUM_TX]))
-   conf->flags |= VXLAN_F_REMCSUM_TX;
-   }
+   vxlan_nl2flag(conf, data, IFLA_VXLAN_REMCSUM_TX,
+ VXLAN_F_REMCSUM_TX);
 
-   if (data[IFLA_VXLAN_REMCSUM_RX]) {
-   if

[PATCH net-next v2 2/3] vxlan: extack support for some changelink cases

2018-11-28 Thread Roopa Prabhu

From: Roopa Prabhu 

Signed-off-by: Roopa Prabhu 
---
 drivers/net/vxlan.c | 76 +
 1 file changed, 59 insertions(+), 17 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 73caa65..4cb6b50 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -3376,7 +3376,7 @@ static int __vxlan_dev_create(struct net *net, struct 
net_device *dev,
 
 static int vxlan_nl2conf(struct nlattr *tb[], struct nlattr *data[],
 struct net_device *dev, struct vxlan_config *conf,
-bool changelink)
+bool changelink, struct netlink_ext_ack *extack)
 {
struct vxlan_dev *vxlan = netdev_priv(dev);
 
@@ -3389,40 +3389,60 @@ static int vxlan_nl2conf(struct nlattr *tb[], struct 
nlattr *data[],
if (data[IFLA_VXLAN_ID]) {
__be32 vni = cpu_to_be32(nla_get_u32(data[IFLA_VXLAN_ID]));
 
-   if (changelink && (vni != conf->vni))
+   if (changelink && (vni != conf->vni)) {
+   NL_SET_ERR_MSG_ATTR(extack, tb[IFLA_VXLAN_ID],
+   "Cannot change vni");
return -EOPNOTSUPP;
+   }
conf->vni = cpu_to_be32(nla_get_u32(data[IFLA_VXLAN_ID]));
}
 
if (data[IFLA_VXLAN_GROUP]) {
-   if (changelink && (conf->remote_ip.sa.sa_family != AF_INET))
+   if (changelink && (conf->remote_ip.sa.sa_family != AF_INET)) {
+   NL_SET_ERR_MSG_ATTR(extack, tb[IFLA_VXLAN_GROUP],
+   "New group addr family does not 
match old group");
return -EOPNOTSUPP;
-
+   }
conf->remote_ip.sin.sin_addr.s_addr = 
nla_get_in_addr(data[IFLA_VXLAN_GROUP]);
conf->remote_ip.sa.sa_family = AF_INET;
} else if (data[IFLA_VXLAN_GROUP6]) {
-   if (!IS_ENABLED(CONFIG_IPV6))
+   if (!IS_ENABLED(CONFIG_IPV6)) {
+   NL_SET_ERR_MSG_ATTR(extack, tb[IFLA_VXLAN_GROUP6],
+   "IPv6 support not enabled in the 
kernel");
return -EPFNOSUPPORT;
+   }
 
-   if (changelink && (conf->remote_ip.sa.sa_family != AF_INET6))
+   if (changelink && (conf->remote_ip.sa.sa_family != AF_INET6)) {
+   NL_SET_ERR_MSG_ATTR(extack, tb[IFLA_VXLAN_GROUP6],
+   "New group addr family does not 
match old group");
return -EOPNOTSUPP;
+   }
 
conf->remote_ip.sin6.sin6_addr = 
nla_get_in6_addr(data[IFLA_VXLAN_GROUP6]);
conf->remote_ip.sa.sa_family = AF_INET6;
}
 
if (data[IFLA_VXLAN_LOCAL]) {
-   if (changelink && (conf->saddr.sa.sa_family != AF_INET))
+   if (changelink && (conf->saddr.sa.sa_family != AF_INET)) {
+   NL_SET_ERR_MSG_ATTR(extack, tb[IFLA_VXLAN_LOCAL],
+   "New local addr family does not 
match old");
return -EOPNOTSUPP;
+   }
 
conf->saddr.sin.sin_addr.s_addr = 
nla_get_in_addr(data[IFLA_VXLAN_LOCAL]);
conf->saddr.sa.sa_family = AF_INET;
} else if (data[IFLA_VXLAN_LOCAL6]) {
-   if (!IS_ENABLED(CONFIG_IPV6))
+   if (!IS_ENABLED(CONFIG_IPV6)) {
+   NL_SET_ERR_MSG_ATTR(extack, tb[IFLA_VXLAN_LOCAL6],
+   "IPv6 support not enabled in the 
kernel");
return -EPFNOSUPPORT;
+   }
 
-   if (changelink && (conf->saddr.sa.sa_family != AF_INET6))
+   if (changelink && (conf->saddr.sa.sa_family != AF_INET6)) {
+   NL_SET_ERR_MSG_ATTR(extack, tb[IFLA_VXLAN_LOCAL6],
+   "New local6 addr family does not 
match old");
return -EOPNOTSUPP;
+   }
 
/* TODO: respect scope id */
conf->saddr.sin6.sin6_addr = 
nla_get_in6_addr(data[IFLA_VXLAN_LOCAL6]);
@@ -3479,14 +3499,21 @@ static int vxlan_nl2conf(struct nlattr *tb[], struct 
nlattr *data[],
}
 
if (data[IFLA_VXLAN_LIMIT]) {
-   if (changelink)
+   if (changelink) {
+   NL_SET_ERR_MSG_ATTR(extack, tb[IFLA_VXLAN_LIMIT],
+   "Cannot change limit");
return -EOPNOTSUPP;
+   }
conf->addrmax = nla_get_u32(data[IFLA_VXLAN_LIMIT]);
}
 
if (data[IFLA_VXLAN_COLLECT_METADATA]) {
-   if (changelink)
+   if (changelink) {
+   NL_SET_ERR_MSG_ATTR(extack,
+

[PATCH net-next v2 1/3] vxlan: support changelink for a few more attributes

2018-11-28 Thread Roopa Prabhu

From: Roopa Prabhu 

We started out very conservative when supporting changelink,
especially because not all attribute changes could be
tested. This patch opens up a few more attributes for
changelink. This set of attributes was chosen based on
code references for these attributes. I have
tested TTL changes and did some changelink API testing
to sanity-test the others.

Signed-off-by: Roopa Prabhu 
---
 drivers/net/vxlan.c | 36 
 1 file changed, 4 insertions(+), 32 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 9110662..73caa65 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -3438,11 +3438,8 @@ static int vxlan_nl2conf(struct nlattr *tb[], struct 
nlattr *data[],
if (data[IFLA_VXLAN_TTL])
conf->ttl = nla_get_u8(data[IFLA_VXLAN_TTL]);
 
-   if (data[IFLA_VXLAN_TTL_INHERIT]) {
-   if (changelink)
-   return -EOPNOTSUPP;
+   if (data[IFLA_VXLAN_TTL_INHERIT])
conf->flags |= VXLAN_F_TTL_INHERIT;
-   }
 
if (data[IFLA_VXLAN_LABEL])
conf->label = nla_get_be32(data[IFLA_VXLAN_LABEL]) &
@@ -3462,29 +3459,21 @@ static int vxlan_nl2conf(struct nlattr *tb[], struct 
nlattr *data[],
conf->age_interval = nla_get_u32(data[IFLA_VXLAN_AGEING]);
 
if (data[IFLA_VXLAN_PROXY]) {
-   if (changelink)
-   return -EOPNOTSUPP;
if (nla_get_u8(data[IFLA_VXLAN_PROXY]))
conf->flags |= VXLAN_F_PROXY;
}
 
if (data[IFLA_VXLAN_RSC]) {
-   if (changelink)
-   return -EOPNOTSUPP;
if (nla_get_u8(data[IFLA_VXLAN_RSC]))
conf->flags |= VXLAN_F_RSC;
}
 
if (data[IFLA_VXLAN_L2MISS]) {
-   if (changelink)
-   return -EOPNOTSUPP;
if (nla_get_u8(data[IFLA_VXLAN_L2MISS]))
conf->flags |= VXLAN_F_L2MISS;
}
 
if (data[IFLA_VXLAN_L3MISS]) {
-   if (changelink)
-   return -EOPNOTSUPP;
if (nla_get_u8(data[IFLA_VXLAN_L3MISS]))
conf->flags |= VXLAN_F_L3MISS;
}
@@ -3527,50 +3516,33 @@ static int vxlan_nl2conf(struct nlattr *tb[], struct 
nlattr *data[],
}
 
if (data[IFLA_VXLAN_UDP_ZERO_CSUM6_TX]) {
-   if (changelink)
-   return -EOPNOTSUPP;
if (nla_get_u8(data[IFLA_VXLAN_UDP_ZERO_CSUM6_TX]))
conf->flags |= VXLAN_F_UDP_ZERO_CSUM6_TX;
}
 
if (data[IFLA_VXLAN_UDP_ZERO_CSUM6_RX]) {
-   if (changelink)
-   return -EOPNOTSUPP;
if (nla_get_u8(data[IFLA_VXLAN_UDP_ZERO_CSUM6_RX]))
conf->flags |= VXLAN_F_UDP_ZERO_CSUM6_RX;
}
 
if (data[IFLA_VXLAN_REMCSUM_TX]) {
-   if (changelink)
-   return -EOPNOTSUPP;
if (nla_get_u8(data[IFLA_VXLAN_REMCSUM_TX]))
conf->flags |= VXLAN_F_REMCSUM_TX;
}
 
if (data[IFLA_VXLAN_REMCSUM_RX]) {
-   if (changelink)
-   return -EOPNOTSUPP;
if (nla_get_u8(data[IFLA_VXLAN_REMCSUM_RX]))
conf->flags |= VXLAN_F_REMCSUM_RX;
}
 
-   if (data[IFLA_VXLAN_GBP]) {
-   if (changelink)
-   return -EOPNOTSUPP;
+   if (data[IFLA_VXLAN_GBP])
conf->flags |= VXLAN_F_GBP;
-   }
 
-   if (data[IFLA_VXLAN_GPE]) {
-   if (changelink)
-   return -EOPNOTSUPP;
+   if (data[IFLA_VXLAN_GPE])
conf->flags |= VXLAN_F_GPE;
-   }
 
-   if (data[IFLA_VXLAN_REMCSUM_NOPARTIAL]) {
-   if (changelink)
-   return -EOPNOTSUPP;
+   if (data[IFLA_VXLAN_REMCSUM_NOPARTIAL])
conf->flags |= VXLAN_F_REMCSUM_NOPARTIAL;
-   }
 
if (tb[IFLA_MTU]) {
if (changelink)
-- 
2.1.4

Re: [PATCH RFC iproute2-next 1/2] rdma: add 'link add/delete' commands

2018-11-28 Thread Jason Gunthorpe

On Wed, Nov 28, 2018 at 04:21:48PM -0600, Steve Wise wrote:

> >> It does make sense to not require type.  The name must be unique so that
> >> should be enough.  I'll have to respin the kernel side though...
> > The delete_link really should be an operation on the ib_device, not
> > the link_ops thing. 
> >
> > That directly prevents mis-matching function callbacks..
> >
> > Jason
> Looking at the rtnetlink newlink/dellink, I see they cache the link_ops
> ptr in the net_device struct.  So when the link is deleted, then
> appropriate driver-specific dellink function can be called after finding
> the device to be deleted.  Should I do something along these lines?  IE
> add a struct rdma_link_ops pointer to struct ib_device.

I don't see a problem with that either..

Jason

Re: [PATCH net-next 3/3] vxlan: move flag sets to use a helper func vxlan_nl2conf

2018-11-28 Thread Roopa Prabhu

On Wed, Nov 28, 2018 at 2:10 PM Roopa Prabhu  wrote:
>
> From: Roopa Prabhu 
>
> Signed-off-by: Roopa Prabhu 
> ---

just noticed a typo in the title, spinning v2.


>  drivers/net/vxlan.c | 95 
> -
>  1 file changed, 43 insertions(+), 52 deletions(-)
>
> diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
> index 4cb6b50..47671fd 100644
> --- a/drivers/net/vxlan.c
> +++ b/drivers/net/vxlan.c
> @@ -3374,6 +3374,23 @@ static int __vxlan_dev_create(struct net *net, struct 
> net_device *dev,
> return err;
>  }
>
> +/* Set/clear flags based on attribute */
> +static void vxlan_nl2flag(struct vxlan_config *conf, struct nlattr *tb[],
> + int attrtype, unsigned long mask)
> +{
> +   unsigned long flags;
> +
> +   if (!tb[attrtype])
> +   return;
> +
> +   if (nla_get_u8(tb[attrtype]))
> +   flags = conf->flags | mask;
> +   else
> +   flags = conf->flags & ~mask;
> +
> +   conf->flags = flags;
> +}
> +
>  static int vxlan_nl2conf(struct nlattr *tb[], struct nlattr *data[],
>  struct net_device *dev, struct vxlan_config *conf,
>  bool changelink, struct netlink_ext_ack *extack)
> @@ -3458,45 +3475,27 @@ static int vxlan_nl2conf(struct nlattr *tb[], struct 
> nlattr *data[],
> if (data[IFLA_VXLAN_TTL])
> conf->ttl = nla_get_u8(data[IFLA_VXLAN_TTL]);
>
> -   if (data[IFLA_VXLAN_TTL_INHERIT])
> -   conf->flags |= VXLAN_F_TTL_INHERIT;
> +   vxlan_nl2flag(conf, data, IFLA_VXLAN_TTL_INHERIT,
> + VXLAN_F_TTL_INHERIT);
>
> if (data[IFLA_VXLAN_LABEL])
> conf->label = nla_get_be32(data[IFLA_VXLAN_LABEL]) &
>  IPV6_FLOWLABEL_MASK;
>
> -   if (data[IFLA_VXLAN_LEARNING]) {
> -   if (nla_get_u8(data[IFLA_VXLAN_LEARNING]))
> -   conf->flags |= VXLAN_F_LEARN;
> -   else
> -   conf->flags &= ~VXLAN_F_LEARN;
> -   } else if (!changelink) {
> +   if (data[IFLA_VXLAN_LEARNING])
> +   vxlan_nl2flag(conf, data, IFLA_VXLAN_LEARNING,
> + VXLAN_F_LEARN);
> +   else if (!changelink)
> /* default to learn on a new device */
> conf->flags |= VXLAN_F_LEARN;
> -   }
>
> if (data[IFLA_VXLAN_AGEING])
> conf->age_interval = nla_get_u32(data[IFLA_VXLAN_AGEING]);
>
> -   if (data[IFLA_VXLAN_PROXY]) {
> -   if (nla_get_u8(data[IFLA_VXLAN_PROXY]))
> -   conf->flags |= VXLAN_F_PROXY;
> -   }
> -
> -   if (data[IFLA_VXLAN_RSC]) {
> -   if (nla_get_u8(data[IFLA_VXLAN_RSC]))
> -   conf->flags |= VXLAN_F_RSC;
> -   }
> -
> -   if (data[IFLA_VXLAN_L2MISS]) {
> -   if (nla_get_u8(data[IFLA_VXLAN_L2MISS]))
> -   conf->flags |= VXLAN_F_L2MISS;
> -   }
> -
> -   if (data[IFLA_VXLAN_L3MISS]) {
> -   if (nla_get_u8(data[IFLA_VXLAN_L3MISS]))
> -   conf->flags |= VXLAN_F_L3MISS;
> -   }
> +   vxlan_nl2flag(conf, data, IFLA_VXLAN_PROXY, VXLAN_F_PROXY);
> +   vxlan_nl2flag(conf, data, IFLA_VXLAN_RSC, VXLAN_F_RSC);
> +   vxlan_nl2flag(conf, data, IFLA_VXLAN_L2MISS, VXLAN_F_L2MISS);
> +   vxlan_nl2flag(conf, data, IFLA_VXLAN_L3MISS, VXLAN_F_L3MISS);
>
> if (data[IFLA_VXLAN_LIMIT]) {
> if (changelink) {
> @@ -3514,8 +3513,8 @@ static int vxlan_nl2conf(struct nlattr *tb[], struct 
> nlattr *data[],
> "Cannot change metadata flag");
> return -EOPNOTSUPP;
> }
> -   if (nla_get_u8(data[IFLA_VXLAN_COLLECT_METADATA]))
> -   conf->flags |= VXLAN_F_COLLECT_METADATA;
> +   vxlan_nl2flag(conf, data, IFLA_VXLAN_COLLECT_METADATA,
> + VXLAN_F_COLLECT_METADATA);
> }
>
> if (data[IFLA_VXLAN_PORT_RANGE]) {
> @@ -3553,34 +3552,26 @@ static int vxlan_nl2conf(struct nlattr *tb[], struct 
> nlattr *data[],
> conf->flags |= VXLAN_F_UDP_ZERO_CSUM_TX;
> }
>
> -   if (data[IFLA_VXLAN_UDP_ZERO_CSUM6_TX]) {
> -   if (nla_get_u8(data[IFLA_VXLAN_UDP_ZERO_CSUM6_TX]))
> -   conf->flags |= VXLAN_F_UDP_ZERO_CSUM6_TX;
> -   }
> +   vxlan_nl2flag(conf, data, IFLA_VXLAN_UDP_ZERO_CSUM6_TX,
> + VXLAN_F_UDP_ZERO_CSUM6_TX);
>
> -   if (data[IFLA_VXLAN_UDP_ZERO_CSUM6_RX]) {
> -   if (nla_get_u8(data[IFLA_VXLAN_UDP_ZERO_CSUM6_RX]))
> -   conf->flags |= VXLAN_F_UDP_ZERO_CSUM6_RX;
> -   }
> +   vxlan_nl2flag(conf, data, IFLA_VXLAN_UDP_ZERO_CSUM6_RX,
> + VXLAN_F_UDP_ZERO_CSUM6_RX);
>
> -   if

Re: [PATCH RFC iproute2-next 1/2] rdma: add 'link add/delete' commands

2018-11-28 Thread Steve Wise




On 11/28/2018 4:17 PM, Jason Gunthorpe wrote:
> On Wed, Nov 28, 2018 at 02:18:55PM -0600, Steve Wise wrote:
>>
>> On 11/28/2018 2:13 PM, Leon Romanovsky wrote:
>>> On Wed, Nov 28, 2018 at 02:07:29PM -0600, Steve Wise wrote:
>>>> On 11/28/2018 2:04 PM, Leon Romanovsky wrote:
>>>>> On Wed, Nov 28, 2018 at 01:08:05PM -0600, Steve Wise wrote:
>>>>>> On 11/28/2018 12:26 PM, Leon Romanovsky wrote:
>>>>>>> On Thu, Sep 13, 2018 at 10:19:21AM -0700, Steve Wise wrote:
>>>>>>>> Add new 'link' subcommand 'add' and 'delete' to allow binding a
>>>>>>>> soft-rdma
>>>>>>>> device to a netdev interface.
>>>>>>>>
>>>>>>>> EG:
>>>>>>>>
>>>>>>>> rdma link add rxe_eth0 type rxe dev eth0
>>>>>>>> rdma link delete rxe_eth0
>>>>>>>>
>>>>>>>> Signed-off-by: Steve Wise
>>>>>>>>  rdma/link.c  | 106
>>>>>>>> +++
>>>>>>>>  rdma/rdma.h  |   1 +
>>>>>>>>  rdma/utils.c |   2 +-
>>>>>>>>  3 files changed, 108 insertions(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/rdma/link.c b/rdma/link.c
>>>>>>>> index 7a6d4b7e356d..d4f76b0ce11f 100644
>>>>>>>> +++ b/rdma/link.c
>>>>>>>> @@ -14,6 +14,8 @@
>>>>>>>>  static int link_help(struct rd *rd)
>>>>>>>>  {
>>>>>>>>    pr_out("Usage: %s link show [DEV/PORT_INDEX]\n", rd->filename);
>>>>>>>> +  pr_out("Usage: %s link add NAME type TYPE dev DEV\n",
>>>>>>>> rd->filename);
>>>>>>> I suggest to rename "dev" to be "netdev", because we are using "dev" for
>>>>>>> ib devices.
>>>>>> Yea ok.
>>>>>>
>>>>>>>> +  pr_out("Usage: %s link delete NAME type TYPE\n", rd->filename);
>>>>>>> Why do you need "type" for "delete" command?
>>>>>> Because the type is used in the kernel to find the appropriate link
>>>>>> ops.  I could change the kernel side to search all types for the device
>>>>>> name to delete?
>>>>> I would say, yes.
>>>>> It makes "delete" operation more natural.
>>>>>
>>>>> Thanks
>>>> Perhaps.
>>>>
>>>> Note: 'ip link delete' takes a type as well...
>>> According to man section, yes.
>>> According to various guides, no.
>>> https://docs.fedoraproject.org/en-US/Fedora/20/html/Networking_Guide/sec-Configure_802_1Q_VLAN_Tagging_ip_Commands.html
>>>
>>> Thanks
>> It does make sense to not require type.  The name must be unique so that
>> should be enough.  I'll have to respin the kernel side though...
> The delete_link really should be an operation on the ib_device, not
> the link_ops thing. 
>
> That directly prevents mis-matching function callbacks..
>
> Jason
Looking at the rtnetlink newlink/dellink, I see they cache the link_ops
ptr in the net_device struct.  So when the link is deleted, then
appropriate driver-specific dellink function can be called after finding
the device to be deleted.  Should I do something along these lines?  IE
add a struct rdma_link_ops pointer to struct ib_device.

Steve.
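
A rough sketch of the idea proposed above, in plain user-space C: the
device remembers the ops table that created it, so delete can dispatch
by name alone and cannot call another driver's callback. All names here
(dev_sketch, link_ops_sketch, link_delete_sketch) are hypothetical and
only loosely modelled on net_device/rtnl_link_ops; this is not the
eventual kernel interface.

```c
#include <stddef.h>
#include <string.h>

struct dev_sketch;

/* Hypothetical per-driver ops table, in the spirit of rdma_link_ops. */
struct link_ops_sketch {
	const char *type;
	int (*dellink)(struct dev_sketch *dev);
};

/* Hypothetical device that caches the ops used to create it. */
struct dev_sketch {
	const char *name;
	const struct link_ops_sketch *ops;	/* NULL if not link-created */
	int deleted;
};

static int rxe_dellink_sketch(struct dev_sketch *dev)
{
	dev->deleted = 1;
	return 0;
}

static const struct link_ops_sketch rxe_ops_sketch = {
	.type = "rxe",
	.dellink = rxe_dellink_sketch,
};

/* Delete by name only; the cached ops pointer picks the callback. */
static int link_delete_sketch(struct dev_sketch *devs, size_t n,
			      const char *name)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (strcmp(devs[i].name, name) != 0)
			continue;
		if (!devs[i].ops || !devs[i].ops->dellink)
			return -1;	/* not created through 'link add' */
		return devs[i].ops->dellink(&devs[i]);
	}
	return -2;	/* no such device */
}
```

With this shape the user-visible command needs no "type" argument for
delete, since the unique device name is enough to find the callback.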

Re: [PATCH RFC iproute2-next 1/2] rdma: add 'link add/delete' commands

2018-11-28 Thread Jason Gunthorpe

On Wed, Nov 28, 2018 at 02:18:55PM -0600, Steve Wise wrote:
> 
> 
> On 11/28/2018 2:13 PM, Leon Romanovsky wrote:
> > On Wed, Nov 28, 2018 at 02:07:29PM -0600, Steve Wise wrote:
> >>
> >> On 11/28/2018 2:04 PM, Leon Romanovsky wrote:
> >>> On Wed, Nov 28, 2018 at 01:08:05PM -0600, Steve Wise wrote:
> >>>> On 11/28/2018 12:26 PM, Leon Romanovsky wrote:
> >>>>> On Thu, Sep 13, 2018 at 10:19:21AM -0700, Steve Wise wrote:
> >>>>>> Add new 'link' subcommand 'add' and 'delete' to allow binding a
> >>>>>> soft-rdma
> >>>>>> device to a netdev interface.
> >>>>>>
> >>>>>> EG:
> >>>>>>
> >>>>>> rdma link add rxe_eth0 type rxe dev eth0
> >>>>>> rdma link delete rxe_eth0
> >>>>>>
> >>>>>> Signed-off-by: Steve Wise
> >>>>>>  rdma/link.c  | 106
> >>>>>> +++
> >>>>>>  rdma/rdma.h  |   1 +
> >>>>>>  rdma/utils.c |   2 +-
> >>>>>>  3 files changed, 108 insertions(+), 1 deletion(-)
> >>>>>>
> >>>>>> diff --git a/rdma/link.c b/rdma/link.c
> >>>>>> index 7a6d4b7e356d..d4f76b0ce11f 100644
> >>>>>> +++ b/rdma/link.c
> >>>>>> @@ -14,6 +14,8 @@
> >>>>>>  static int link_help(struct rd *rd)
> >>>>>>  {
> >>>>>>    pr_out("Usage: %s link show [DEV/PORT_INDEX]\n", rd->filename);
> >>>>>> +  pr_out("Usage: %s link add NAME type TYPE dev DEV\n",
> >>>>>> rd->filename);
> >>>>> I suggest to rename "dev" to be "netdev", because we are using "dev" for
> >>>>> ib devices.
> >>>> Yea ok.
> >>>>
> >>>>>> +  pr_out("Usage: %s link delete NAME type TYPE\n", rd->filename);
> >>>>> Why do you need "type" for "delete" command?
> >>>> Because the type is used in the kernel to find the appropriate link
> >>>> ops.  I could change the kernel side to search all types for the device
> >>>> name to delete?
> >>> I would say, yes.
> >>> It makes "delete" operation more natural.
> >>>
> >>> Thanks
> >> Perhaps.
> >>
> >> Note: 'ip link delete' takes a type as well...
> > According to man section, yes.
> > According to various guides, no.
> > https://docs.fedoraproject.org/en-US/Fedora/20/html/Networking_Guide/sec-Configure_802_1Q_VLAN_Tagging_ip_Commands.html
> >
> > Thanks
> 
> It does make sense to not require type.  The name must be unique so that
> should be enough.  I'll have to respin the kernel side though...

The delete_link really should be an operation on the ib_device, not
the link_ops thing. 

That directly prevents mis-matching function callbacks..

Jason

Re: [Patch net v2] mlx5: fixup checksum for short ethernet frame padding

2018-11-28 Thread Cong Wang

On Wed, Nov 28, 2018 at 7:00 AM Eric Dumazet  wrote:
>
> Nice packet of death alert.
>
> pad_len can be 0xFF67  here, if frame_len is smaller than pad_offset.

Unless IP header is malformed, how could it be?

Speaking of IP header sanity, I am totally aware of it. I don't check it because
I know get_ip_proto() doesn't check either; it must be the hardware which verifies
the sanity.

>
> Really I suggest you set ip_summed to CHECKSUM_NONE, then remove the
> initial test ( if (likely(frame_len > ETH_ZLEN)) ...)
>
> Until the firmware is fixed.

Hmm, why would setting CHECKSUM_NONE get rid of the minimum ethernet
frame check? I am lost; there is no bug for packets > ETH_ZLEN _for me_, so what
needs fixing here?

Overall, you keep pushing me to fix a bug I don't observe. I don't understand
why. If you see it, please come up with your own patch? Why do I have to fix
the problem you see??

>
> Otherwise frames with a wrong checksum and some non zero padding could
> potentially
> be seen as correct frames. (Probability of 1/65536)
>
> Do not focus on your immediate problem (small packets being padded by
> a non malicious entity)

Again, why should _I_ fix a problem I never observe? Why is it not you who
fixes the problem you find during code review? Not to mention I have no environment
to test it even if I really wanted to fix it. I can't take such a risk.

Thanks.
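
For context on the 0xFF67 value quoted above: that is what an unsigned
16-bit subtraction produces when the offset is larger than the frame
length. The sketch below only illustrates the arithmetic. The function
names, the 16-bit types and the sample values are assumptions, since the
patch under discussion is not shown in this message.

```c
#include <stdint.h>

/* Hypothetical: pad length as a 16-bit unsigned difference with no
 * bounds check. Wraps around when pad_offset > frame_len. */
static uint16_t pad_len_unchecked(uint16_t frame_len, uint16_t pad_offset)
{
	return (uint16_t)(frame_len - pad_offset);
}

/* Guarded variant: reports no padding when the offset is at or past
 * the end of the frame. */
static uint16_t pad_len_checked(uint16_t frame_len, uint16_t pad_offset)
{
	return pad_offset < frame_len ? (uint16_t)(frame_len - pad_offset) : 0;
}
```

For example, a 60-byte frame with a claimed offset of 213 gives
60 - 213 = -153, which wraps to 65383 (0xFF67) in 16 bits, while the
guarded variant returns 0. Whether such an offset can reach this code
depends on what the hardware validates, which is the open question in
this thread.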

[PATCH net-next 1/3] vxlan: support changelink for a few more attributes

2018-11-28 Thread Roopa Prabhu

From: Roopa Prabhu 

We started out very conservative when supporting changelink,
especially because not all attribute changes could be
tested. This patch opens up a few more attributes for
changelink. This set of attributes was chosen based on
code references for these attributes. I have
tested TTL changes and did some changelink API testing
to sanity-test the others.

Signed-off-by: Roopa Prabhu 
---
 drivers/net/vxlan.c | 36 
 1 file changed, 4 insertions(+), 32 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 9110662..73caa65 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -3438,11 +3438,8 @@ static int vxlan_nl2conf(struct nlattr *tb[], struct 
nlattr *data[],
if (data[IFLA_VXLAN_TTL])
conf->ttl = nla_get_u8(data[IFLA_VXLAN_TTL]);
 
-   if (data[IFLA_VXLAN_TTL_INHERIT]) {
-   if (changelink)
-   return -EOPNOTSUPP;
+   if (data[IFLA_VXLAN_TTL_INHERIT])
conf->flags |= VXLAN_F_TTL_INHERIT;
-   }
 
if (data[IFLA_VXLAN_LABEL])
conf->label = nla_get_be32(data[IFLA_VXLAN_LABEL]) &
@@ -3462,29 +3459,21 @@ static int vxlan_nl2conf(struct nlattr *tb[], struct 
nlattr *data[],
conf->age_interval = nla_get_u32(data[IFLA_VXLAN_AGEING]);
 
if (data[IFLA_VXLAN_PROXY]) {
-   if (changelink)
-   return -EOPNOTSUPP;
if (nla_get_u8(data[IFLA_VXLAN_PROXY]))
conf->flags |= VXLAN_F_PROXY;
}
 
if (data[IFLA_VXLAN_RSC]) {
-   if (changelink)
-   return -EOPNOTSUPP;
if (nla_get_u8(data[IFLA_VXLAN_RSC]))
conf->flags |= VXLAN_F_RSC;
}
 
if (data[IFLA_VXLAN_L2MISS]) {
-   if (changelink)
-   return -EOPNOTSUPP;
if (nla_get_u8(data[IFLA_VXLAN_L2MISS]))
conf->flags |= VXLAN_F_L2MISS;
}
 
if (data[IFLA_VXLAN_L3MISS]) {
-   if (changelink)
-   return -EOPNOTSUPP;
if (nla_get_u8(data[IFLA_VXLAN_L3MISS]))
conf->flags |= VXLAN_F_L3MISS;
}
@@ -3527,50 +3516,33 @@ static int vxlan_nl2conf(struct nlattr *tb[], struct 
nlattr *data[],
}
 
if (data[IFLA_VXLAN_UDP_ZERO_CSUM6_TX]) {
-   if (changelink)
-   return -EOPNOTSUPP;
if (nla_get_u8(data[IFLA_VXLAN_UDP_ZERO_CSUM6_TX]))
conf->flags |= VXLAN_F_UDP_ZERO_CSUM6_TX;
}
 
if (data[IFLA_VXLAN_UDP_ZERO_CSUM6_RX]) {
-   if (changelink)
-   return -EOPNOTSUPP;
if (nla_get_u8(data[IFLA_VXLAN_UDP_ZERO_CSUM6_RX]))
conf->flags |= VXLAN_F_UDP_ZERO_CSUM6_RX;
}
 
if (data[IFLA_VXLAN_REMCSUM_TX]) {
-   if (changelink)
-   return -EOPNOTSUPP;
if (nla_get_u8(data[IFLA_VXLAN_REMCSUM_TX]))
conf->flags |= VXLAN_F_REMCSUM_TX;
}
 
if (data[IFLA_VXLAN_REMCSUM_RX]) {
-   if (changelink)
-   return -EOPNOTSUPP;
if (nla_get_u8(data[IFLA_VXLAN_REMCSUM_RX]))
conf->flags |= VXLAN_F_REMCSUM_RX;
}
 
-   if (data[IFLA_VXLAN_GBP]) {
-   if (changelink)
-   return -EOPNOTSUPP;
+   if (data[IFLA_VXLAN_GBP])
conf->flags |= VXLAN_F_GBP;
-   }
 
-   if (data[IFLA_VXLAN_GPE]) {
-   if (changelink)
-   return -EOPNOTSUPP;
+   if (data[IFLA_VXLAN_GPE])
conf->flags |= VXLAN_F_GPE;
-   }
 
-   if (data[IFLA_VXLAN_REMCSUM_NOPARTIAL]) {
-   if (changelink)
-   return -EOPNOTSUPP;
+   if (data[IFLA_VXLAN_REMCSUM_NOPARTIAL])
conf->flags |= VXLAN_F_REMCSUM_NOPARTIAL;
-   }
 
if (tb[IFLA_MTU]) {
if (changelink)
-- 
2.1.4

[PATCH net-next 3/3] vxlan: move flag sets to use a helper func vxlan_nl2conf

2018-11-28 Thread Roopa Prabhu

From: Roopa Prabhu 

Signed-off-by: Roopa Prabhu 
---
 drivers/net/vxlan.c | 95 -
 1 file changed, 43 insertions(+), 52 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 4cb6b50..47671fd 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -3374,6 +3374,23 @@ static int __vxlan_dev_create(struct net *net, struct 
net_device *dev,
return err;
 }
 
+/* Set/clear flags based on attribute */
+static void vxlan_nl2flag(struct vxlan_config *conf, struct nlattr *tb[],
+ int attrtype, unsigned long mask)
+{
+   unsigned long flags;
+
+   if (!tb[attrtype])
+   return;
+
+   if (nla_get_u8(tb[attrtype]))
+   flags = conf->flags | mask;
+   else
+   flags = conf->flags & ~mask;
+
+   conf->flags = flags;
+}
+
 static int vxlan_nl2conf(struct nlattr *tb[], struct nlattr *data[],
 struct net_device *dev, struct vxlan_config *conf,
 bool changelink, struct netlink_ext_ack *extack)
@@ -3458,45 +3475,27 @@ static int vxlan_nl2conf(struct nlattr *tb[], struct nlattr *data[],
if (data[IFLA_VXLAN_TTL])
conf->ttl = nla_get_u8(data[IFLA_VXLAN_TTL]);
 
-   if (data[IFLA_VXLAN_TTL_INHERIT])
-   conf->flags |= VXLAN_F_TTL_INHERIT;
+   vxlan_nl2flag(conf, data, IFLA_VXLAN_TTL_INHERIT,
+ VXLAN_F_TTL_INHERIT);
 
if (data[IFLA_VXLAN_LABEL])
conf->label = nla_get_be32(data[IFLA_VXLAN_LABEL]) &
 IPV6_FLOWLABEL_MASK;
 
-   if (data[IFLA_VXLAN_LEARNING]) {
-   if (nla_get_u8(data[IFLA_VXLAN_LEARNING]))
-   conf->flags |= VXLAN_F_LEARN;
-   else
-   conf->flags &= ~VXLAN_F_LEARN;
-   } else if (!changelink) {
+   if (data[IFLA_VXLAN_LEARNING])
+   vxlan_nl2flag(conf, data, IFLA_VXLAN_LEARNING,
+ VXLAN_F_LEARN);
+   else if (!changelink)
/* default to learn on a new device */
conf->flags |= VXLAN_F_LEARN;
-   }
 
if (data[IFLA_VXLAN_AGEING])
conf->age_interval = nla_get_u32(data[IFLA_VXLAN_AGEING]);
 
-   if (data[IFLA_VXLAN_PROXY]) {
-   if (nla_get_u8(data[IFLA_VXLAN_PROXY]))
-   conf->flags |= VXLAN_F_PROXY;
-   }
-
-   if (data[IFLA_VXLAN_RSC]) {
-   if (nla_get_u8(data[IFLA_VXLAN_RSC]))
-   conf->flags |= VXLAN_F_RSC;
-   }
-
-   if (data[IFLA_VXLAN_L2MISS]) {
-   if (nla_get_u8(data[IFLA_VXLAN_L2MISS]))
-   conf->flags |= VXLAN_F_L2MISS;
-   }
-
-   if (data[IFLA_VXLAN_L3MISS]) {
-   if (nla_get_u8(data[IFLA_VXLAN_L3MISS]))
-   conf->flags |= VXLAN_F_L3MISS;
-   }
+   vxlan_nl2flag(conf, data, IFLA_VXLAN_PROXY, VXLAN_F_PROXY);
+   vxlan_nl2flag(conf, data, IFLA_VXLAN_RSC, VXLAN_F_RSC);
+   vxlan_nl2flag(conf, data, IFLA_VXLAN_L2MISS, VXLAN_F_L2MISS);
+   vxlan_nl2flag(conf, data, IFLA_VXLAN_L3MISS, VXLAN_F_L3MISS);
 
if (data[IFLA_VXLAN_LIMIT]) {
if (changelink) {
@@ -3514,8 +3513,8 @@ static int vxlan_nl2conf(struct nlattr *tb[], struct nlattr *data[],
"Cannot change metadata flag");
return -EOPNOTSUPP;
}
-   if (nla_get_u8(data[IFLA_VXLAN_COLLECT_METADATA]))
-   conf->flags |= VXLAN_F_COLLECT_METADATA;
+   vxlan_nl2flag(conf, data, IFLA_VXLAN_COLLECT_METADATA,
+ VXLAN_F_COLLECT_METADATA);
}
 
if (data[IFLA_VXLAN_PORT_RANGE]) {
@@ -3553,34 +3552,26 @@ static int vxlan_nl2conf(struct nlattr *tb[], struct nlattr *data[],
conf->flags |= VXLAN_F_UDP_ZERO_CSUM_TX;
}
 
-   if (data[IFLA_VXLAN_UDP_ZERO_CSUM6_TX]) {
-   if (nla_get_u8(data[IFLA_VXLAN_UDP_ZERO_CSUM6_TX]))
-   conf->flags |= VXLAN_F_UDP_ZERO_CSUM6_TX;
-   }
+   vxlan_nl2flag(conf, data, IFLA_VXLAN_UDP_ZERO_CSUM6_TX,
+ VXLAN_F_UDP_ZERO_CSUM6_TX);
 
-   if (data[IFLA_VXLAN_UDP_ZERO_CSUM6_RX]) {
-   if (nla_get_u8(data[IFLA_VXLAN_UDP_ZERO_CSUM6_RX]))
-   conf->flags |= VXLAN_F_UDP_ZERO_CSUM6_RX;
-   }
+   vxlan_nl2flag(conf, data, IFLA_VXLAN_UDP_ZERO_CSUM6_RX,
+ VXLAN_F_UDP_ZERO_CSUM6_RX);
 
-   if (data[IFLA_VXLAN_REMCSUM_TX]) {
-   if (nla_get_u8(data[IFLA_VXLAN_REMCSUM_TX]))
-   conf->flags |= VXLAN_F_REMCSUM_TX;
-   }
+   vxlan_nl2flag(conf, data, IFLA_VXLAN_REMCSUM_TX,
+ VXLAN_F_REMCSUM_TX);
 
-   if (data[IFLA_VXLAN_REMCSUM_RX]) {
-   if
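
The set/clear pattern that the helper in this patch factors out can be tried outside the kernel. The following is only a minimal userspace sketch: struct demo_attr, struct demo_config and demo_nl2flag() are hypothetical stand-ins, not the kernel's nlattr API. Note that, as in the diff above, an attribute that is present with a zero payload clears the flag, whereas most of the removed open-coded blocks only ever set it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one parsed netlink attribute carrying a u8. */
struct demo_attr {
	uint8_t value;
};

/* Hypothetical stand-in for the device configuration. */
struct demo_config {
	unsigned long flags;
};

/*
 * Set 'mask' when the attribute is present and non-zero, clear it when
 * the attribute is present and zero, and leave the flags untouched when
 * the attribute was not supplied at all (NULL slot in the table).
 */
static void demo_nl2flag(struct demo_config *conf,
			 const struct demo_attr *tb[],
			 int attrtype, unsigned long mask)
{
	if (!tb[attrtype])
		return;

	if (tb[attrtype]->value)
		conf->flags |= mask;
	else
		conf->flags &= ~mask;
}
```

A caller would invoke demo_nl2flag() once per boolean attribute, which is what lets the patch replace each multi-line block with a single call.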

[PATCH net-next 2/3] vxlan: extack support for some changelink cases

2018-11-28 Thread Roopa Prabhu

From: Roopa Prabhu 

Signed-off-by: Roopa Prabhu 
---
 drivers/net/vxlan.c | 76 +
 1 file changed, 59 insertions(+), 17 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 73caa65..4cb6b50 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -3376,7 +3376,7 @@ static int __vxlan_dev_create(struct net *net, struct net_device *dev,
 
 static int vxlan_nl2conf(struct nlattr *tb[], struct nlattr *data[],
 struct net_device *dev, struct vxlan_config *conf,
-bool changelink)
+bool changelink, struct netlink_ext_ack *extack)
 {
struct vxlan_dev *vxlan = netdev_priv(dev);
 
@@ -3389,40 +3389,60 @@ static int vxlan_nl2conf(struct nlattr *tb[], struct nlattr *data[],
if (data[IFLA_VXLAN_ID]) {
__be32 vni = cpu_to_be32(nla_get_u32(data[IFLA_VXLAN_ID]));
 
-   if (changelink && (vni != conf->vni))
+   if (changelink && (vni != conf->vni)) {
+   NL_SET_ERR_MSG_ATTR(extack, tb[IFLA_VXLAN_ID],
+   "Cannot change vni");
return -EOPNOTSUPP;
+   }
conf->vni = cpu_to_be32(nla_get_u32(data[IFLA_VXLAN_ID]));
}
 
if (data[IFLA_VXLAN_GROUP]) {
-   if (changelink && (conf->remote_ip.sa.sa_family != AF_INET))
+   if (changelink && (conf->remote_ip.sa.sa_family != AF_INET)) {
+   NL_SET_ERR_MSG_ATTR(extack, tb[IFLA_VXLAN_GROUP],
+   "New group addr family does not match old group");
return -EOPNOTSUPP;
-
+   }
conf->remote_ip.sin.sin_addr.s_addr = nla_get_in_addr(data[IFLA_VXLAN_GROUP]);
conf->remote_ip.sa.sa_family = AF_INET;
} else if (data[IFLA_VXLAN_GROUP6]) {
-   if (!IS_ENABLED(CONFIG_IPV6))
+   if (!IS_ENABLED(CONFIG_IPV6)) {
+   NL_SET_ERR_MSG_ATTR(extack, tb[IFLA_VXLAN_GROUP6],
+   "IPv6 support not enabled in the kernel");
return -EPFNOSUPPORT;
+   }
 
-   if (changelink && (conf->remote_ip.sa.sa_family != AF_INET6))
+   if (changelink && (conf->remote_ip.sa.sa_family != AF_INET6)) {
+   NL_SET_ERR_MSG_ATTR(extack, tb[IFLA_VXLAN_GROUP6],
+   "New group addr family does not match old group");
return -EOPNOTSUPP;
+   }
 
conf->remote_ip.sin6.sin6_addr = nla_get_in6_addr(data[IFLA_VXLAN_GROUP6]);
conf->remote_ip.sa.sa_family = AF_INET6;
}
 
if (data[IFLA_VXLAN_LOCAL]) {
-   if (changelink && (conf->saddr.sa.sa_family != AF_INET))
+   if (changelink && (conf->saddr.sa.sa_family != AF_INET)) {
+   NL_SET_ERR_MSG_ATTR(extack, tb[IFLA_VXLAN_LOCAL],
+   "New local addr family does not match old");
return -EOPNOTSUPP;
+   }
 
conf->saddr.sin.sin_addr.s_addr = nla_get_in_addr(data[IFLA_VXLAN_LOCAL]);
conf->saddr.sa.sa_family = AF_INET;
} else if (data[IFLA_VXLAN_LOCAL6]) {
-   if (!IS_ENABLED(CONFIG_IPV6))
+   if (!IS_ENABLED(CONFIG_IPV6)) {
+   NL_SET_ERR_MSG_ATTR(extack, tb[IFLA_VXLAN_LOCAL6],
+   "IPv6 support not enabled in the kernel");
return -EPFNOSUPPORT;
+   }
 
-   if (changelink && (conf->saddr.sa.sa_family != AF_INET6))
+   if (changelink && (conf->saddr.sa.sa_family != AF_INET6)) {
+   NL_SET_ERR_MSG_ATTR(extack, tb[IFLA_VXLAN_LOCAL6],
+   "New local6 addr family does not match old");
return -EOPNOTSUPP;
+   }
 
/* TODO: respect scope id */
conf->saddr.sin6.sin6_addr = nla_get_in6_addr(data[IFLA_VXLAN_LOCAL6]);
@@ -3479,14 +3499,21 @@ static int vxlan_nl2conf(struct nlattr *tb[], struct nlattr *data[],
}
 
if (data[IFLA_VXLAN_LIMIT]) {
-   if (changelink)
+   if (changelink) {
+   NL_SET_ERR_MSG_ATTR(extack, tb[IFLA_VXLAN_LIMIT],
+   "Cannot change limit");
return -EOPNOTSUPP;
+   }
conf->addrmax = nla_get_u32(data[IFLA_VXLAN_LIMIT]);
}
 
if (data[IFLA_VXLAN_COLLECT_METADATA]) {
-   if (changelink)
+   if (changelink) {
+   NL_SET_ERR_MSG_ATTR(extack,
+
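
The idea behind the extack changes in this patch is to return a reason alongside the error code when a changelink request is refused. A rough userspace sketch of that pattern follows. It assumes a much simplified stand-in for the extended ack: struct demo_ext_ack, demo_set_err() and demo_check_vni() are hypothetical names, not the kernel's netlink_ext_ack API.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical, simplified stand-in for an extended ack: a
 * human-readable message plus the attribute that caused the failure.
 */
struct demo_ext_ack {
	const char *msg;
	int bad_attr;
};

/* Record the failure reason; a NULL extack means the caller does not care. */
static void demo_set_err(struct demo_ext_ack *extack, int attr,
			 const char *msg)
{
	if (!extack)
		return;
	extack->msg = msg;
	extack->bad_attr = attr;
}

/* Refuse a VNI change on an existing device and say why. */
static int demo_check_vni(unsigned int old_vni, unsigned int new_vni,
			  int changelink, int attr,
			  struct demo_ext_ack *extack)
{
	if (changelink && new_vni != old_vni) {
		demo_set_err(extack, attr, "Cannot change vni");
		return -EOPNOTSUPP;
	}
	return 0;
}
```

With this shape the caller still sees -EOPNOTSUPP, but userspace can also print the message and point at the offending attribute.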

[PATCH net-next 0/3] vxlan: a few minor cleanups

2018-11-28 Thread Roopa Prabhu

From: Roopa Prabhu 

Roopa Prabhu (3):
  vxlan: support changelink for a few more attributes
  vxlan: extack support for some changelink cases
  vxlan: move flag sets to use a helper func vxlan_nl2conf

 drivers/net/vxlan.c | 199 +++-
 1 file changed, 102 insertions(+), 97 deletions(-)

-- 
2.1.4

Re: [PATCH net-next] tun: implement carrier change

2018-11-28 Thread Andrew Lunn

On Wed, Nov 28, 2018 at 07:12:56PM +0100, Nicolas Dichtel wrote:
> The userspace may need to control the carrier state.

Hi Nicolas

Could you explain your use case a bit more?

Are you running a routing daemon on top of the interface, and want it
to reroute when the carrier goes down?

   Thanks
Andrew

Re: [PATCH] selftests/bpf: add config fragment CONFIG_FTRACE_SYSCALLS

2018-11-28 Thread Daniel Borkmann

On 11/27/2018 04:24 PM, Naresh Kamboju wrote:
> CONFIG_FTRACE_SYSCALLS=y is required for get_cgroup_id_user test case
> this test reads a file from debug trace path
> /sys/kernel/debug/tracing/events/syscalls/sys_enter_nanosleep/id
> 
> Signed-off-by: Naresh Kamboju 
> ---
>  tools/testing/selftests/bpf/config | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/tools/testing/selftests/bpf/config 
> b/tools/testing/selftests/bpf/config
> index 7f90d3645af8..37f947ec44ed 100644
> --- a/tools/testing/selftests/bpf/config
> +++ b/tools/testing/selftests/bpf/config
> @@ -22,3 +22,4 @@ CONFIG_NET_CLS_FLOWER=m
>  CONFIG_LWTUNNEL=y
>  CONFIG_BPF_STREAM_PARSER=y
>  CONFIG_XDP_SOCKETS=y
> +CONFIG_FTRACE_SYSCALLS=y
> 

Applied to bpf-next, thanks!

Re: [PATCH bpf-next v2 0/3] bpf: add sk_msg helper sk_msg_pop_data

2018-11-28 Thread Daniel Borkmann

On 11/26/2018 11:16 PM, John Fastabend wrote:
> After being able to add metadata to messages with sk_msg_push_data we
> have also found it useful to be able to "pop" this metadata off before
> sending it to applications in some cases. This series adds a new helper
> sk_msg_pop_data() and the associated patches to add tests and tools/lib
> support.
> 
> Thanks!
> 
> v2: Daniel caught that we missed adding sk_msg_pop_data to the changes
> data helper so that the verifier ensures BPF programs revalidate
> data after using this helper. Also improve documentation adding a
> return description and using RST syntax per Quentin's comment. And
> delta calculations for DROP with pop'd data (albeit a strange set
> of operations for a program to be doing) had potential to be
> incorrect possibly confusing user space applications, so fix it.
> 
> John Fastabend (3):
>   bpf: helper to pop data from messages
>   bpf: add msg_pop_data helper to tools
>   bpf: test_sockmap, add options for msg_pop_data() helper usage
> 
>  include/uapi/linux/bpf.h|  13 +-
>  net/core/filter.c   | 169 
> 
>  net/ipv4/tcp_bpf.c  |  14 +-
>  tools/include/uapi/linux/bpf.h  |  13 +-
>  tools/testing/selftests/bpf/bpf_helpers.h   |   2 +
>  tools/testing/selftests/bpf/test_sockmap.c  | 127 +-
>  tools/testing/selftests/bpf/test_sockmap_kern.h |  70 --
>  7 files changed, 386 insertions(+), 22 deletions(-)
> 

Applied to bpf-next, thanks!

[PATCH] bpf: Fix various lib and testsuite build failures on 32-bit.

2018-11-28 Thread David Miller



Cannot cast a u64 to a pointer on 32-bit without an intervening (long)
cast, otherwise GCC warns.
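
As an illustration of the cast issue only, and not part of the patch: a 64-bit integer has to be narrowed to pointer width before it becomes a pointer. The sketch below uses uintptr_t for the intermediate step, whereas the patch itself uses long; demo_u64_to_ptr() and demo_ptr_to_u64() are hypothetical helper names.

```c
#include <stdint.h>

/*
 * Convert a 64-bit integer field that carries a user pointer back into
 * a pointer.  The intermediate uintptr_t cast makes the integer the
 * same width as a pointer, which is what keeps GCC quiet on 32-bit.
 */
static inline void *demo_u64_to_ptr(uint64_t val)
{
	return (void *)(uintptr_t)val;
}

/* The reverse direction: widen a pointer into a 64-bit integer field. */
static inline uint64_t demo_ptr_to_u64(const void *ptr)
{
	return (uint64_t)(uintptr_t)ptr;
}
```

On 64-bit builds both casts are no-ops in practice, so the same source compiles cleanly on either word size.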

Signed-off-by: David S. Miller 
--

diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c
index eadcf8d..c2d641f 100644
--- a/tools/lib/bpf/btf.c
+++ b/tools/lib/bpf/btf.c
@@ -466,7 +466,7 @@ int btf__get_from_id(__u32 id, struct btf **btf)
goto exit_free;
}
 
-   *btf = btf__new((__u8 *)btf_info.btf, btf_info.btf_size, NULL);
+   *btf = btf__new((__u8 *)(long)btf_info.btf, btf_info.btf_size, NULL);
if (IS_ERR(*btf)) {
err = PTR_ERR(*btf);
*btf = NULL;
diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
index c1e688f6..1c57abb 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -524,7 +524,7 @@ static void test_bpf_obj_id(void)
  load_time < now - 60 || load_time > now + 60 ||
  prog_infos[i].created_by_uid != my_uid ||
  prog_infos[i].nr_map_ids != 1 ||
- *(int *)prog_infos[i].map_ids != map_infos[i].id ||
+ *(int *)(long)prog_infos[i].map_ids != map_infos[i].id ||
  strcmp((char *)prog_infos[i].name, expected_prog_name),
  "get-prog-info(fd)",
  "err %d errno %d i %d type %d(%d) info_len %u(%Zu) jit_enabled %d jited_prog_len %u xlated_prog_len %u jited_prog %d xlated_prog %d load_time %lu(%lu) uid %u(%u) nr_map_ids %u(%u) map_id %u(%u) name %s(%s)\n",
@@ -539,7 +539,7 @@ static void test_bpf_obj_id(void)
  load_time, now,
  prog_infos[i].created_by_uid, my_uid,
  prog_infos[i].nr_map_ids, 1,
- *(int *)prog_infos[i].map_ids, map_infos[i].id,
+ *(int *)(long)prog_infos[i].map_ids, map_infos[i].id,
  prog_infos[i].name, expected_prog_name))
goto done;
}
@@ -585,7 +585,7 @@ static void test_bpf_obj_id(void)
bzero(&prog_info, sizeof(prog_info));
info_len = sizeof(prog_info);
 
-   saved_map_id = *(int *)(prog_infos[i].map_ids);
+   saved_map_id = *(int *)((long)prog_infos[i].map_ids);
prog_info.map_ids = prog_infos[i].map_ids;
prog_info.nr_map_ids = 2;
err = bpf_obj_get_info_by_fd(prog_fd, &prog_info, &info_len);
@@ -593,12 +593,12 @@ static void test_bpf_obj_id(void)
prog_infos[i].xlated_prog_insns = 0;
CHECK(err || info_len != sizeof(struct bpf_prog_info) ||
  memcmp(&prog_info, &prog_infos[i], info_len) ||
- *(int *)prog_info.map_ids != saved_map_id,
+ *(int *)(long)prog_info.map_ids != saved_map_id,
  "get-prog-info(next_id->fd)",
  "err %d errno %d info_len %u(%Zu) memcmp %d map_id %u(%u)\n",
  err, errno, info_len, sizeof(struct bpf_prog_info),
  memcmp(&prog_info, &prog_infos[i], info_len),
- *(int *)prog_info.map_ids, saved_map_id);
+ *(int *)(long)prog_info.map_ids, saved_map_id);
close(prog_fd);
}
CHECK(nr_id_found != nr_iters,

Re: [PATCH RFC iproute2-next 1/2] rdma: add 'link add/delete' commands

2018-11-28 Thread Steve Wise




On 11/28/2018 2:13 PM, Leon Romanovsky wrote:
> On Wed, Nov 28, 2018 at 02:07:29PM -0600, Steve Wise wrote:
>>
>> On 11/28/2018 2:04 PM, Leon Romanovsky wrote:
>>> On Wed, Nov 28, 2018 at 01:08:05PM -0600, Steve Wise wrote:
>>>> On 11/28/2018 12:26 PM, Leon Romanovsky wrote:
>>>>> On Thu, Sep 13, 2018 at 10:19:21AM -0700, Steve Wise wrote:
>>>>>> Add new 'link' subcommand 'add' and 'delete' to allow binding a soft-rdma
>>>>>> device to a netdev interface.
>>>>>>
>>>>>> EG:
>>>>>>
>>>>>> rdma link add rxe_eth0 type rxe dev eth0
>>>>>> rdma link delete rxe_eth0
>>>>>>
>>>>>> Signed-off-by: Steve Wise 
>>>>>> ---
>>>>>>  rdma/link.c  | 106 +++
>>>>>>  rdma/rdma.h  |   1 +
>>>>>>  rdma/utils.c |   2 +-
>>>>>>  3 files changed, 108 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/rdma/link.c b/rdma/link.c
>>>>>> index 7a6d4b7e356d..d4f76b0ce11f 100644
>>>>>> --- a/rdma/link.c
>>>>>> +++ b/rdma/link.c
>>>>>> @@ -14,6 +14,8 @@
>>>>>>  static int link_help(struct rd *rd)
>>>>>>  {
>>>>>>  pr_out("Usage: %s link show [DEV/PORT_INDEX]\n", rd->filename);
>>>>>> +pr_out("Usage: %s link add NAME type TYPE dev DEV\n", rd->filename);
>>>>> I suggest to rename "dev" to be "netdev", because we are using "dev" for
>>>>> ib devices.
>>>> Yea ok.
>>>>
>>>>>> +pr_out("Usage: %s link delete NAME type TYPE\n", rd->filename);
>>>>> Why do you need "type" for "delete" command?
>>>> Because the type is used in the kernel to find the appropriate link
>>>> ops.  I could change the kernel side to search all types for the device
>>>> name to delete? 
>>> I would say, yes.
>>> It makes "delete" operation more natural.
>>>
>>> Thanks
>> Perhaps.
>>
>> Note: 'ip link delete' takes a type as well...
> According to man section, yes.
> According to various guides, no.
> https://docs.fedoraproject.org/en-US/Fedora/20/html/Networking_Guide/sec-Configure_802_1Q_VLAN_Tagging_ip_Commands.html
>
> Thanks

It does make sense to not require type.  The name must be unique so that
should be enough.  I'll have to respin the kernel side though...

Thanks,

Steve.

Re: [PATCH RFC iproute2-next 1/2] rdma: add 'link add/delete' commands

2018-11-28 Thread Steve Wise




On 11/28/2018 2:08 PM, Leon Romanovsky wrote:
> On Wed, Nov 28, 2018 at 10:02:04PM +0200, Leon Romanovsky wrote:
>> On Wed, Nov 28, 2018 at 01:34:14PM -0600, Steve Wise wrote:
>>> ...
>>>
>>>>>> +	rd_prepare_msg(rd, RDMA_NLDEV_CMD_NEWLINK, &seq,
>>>>>> +		       (NLM_F_REQUEST | NLM_F_ACK));
>>>>>> +	mnl_attr_put_strz(rd->nlh, RDMA_NLDEV_ATTR_DEV_NAME, name);
>>>>>> +	mnl_attr_put_strz(rd->nlh, RDMA_NLDEV_ATTR_LINK_TYPE, type);
>>>>>> +	mnl_attr_put_strz(rd->nlh, RDMA_NLDEV_ATTR_NDEV_NAME, dev);
>>>>>> +	ret = rd_send_msg(rd);
>>>>>> +	if (ret)
>>>>>> +		return ret;
>>>>>> +
>>>>>> +	ret = rd_recv_msg(rd, link_add_parse_cb, rd, seq);
>>>>>> +	if (ret)
>>>>>> +		perror(NULL);
>>>>> Why do you need rd_recv_msg()? I think that it is not needed, at least
>>>>> for rename, I didn't need it.
>>>>> https://git.kernel.org/pub/scm/network/iproute2/iproute2-next.git/tree/rdma/dev.c#n244
>>>> To get the response of if it was successfully added.  It provides the
>>>> errno value.
>>> If I don't do the rd_recv_msg, then adding the same name twice fails
>>> without any error notification.  Ditto for deleting a non-existent
>>> link.  So the rd_recv_msg() allows getting the failure reason (and
>>> detecting the failure). 
>>>
>> Shouldn't extack provide such information as part of NLM_F_ACK flag?
>>
>> just shooting into the air, will take more close look tomorrow.
> OK, it was easier than I thought.
>
> You are right, need both send and receive to get the reason.
>
> Can you prepare general function and update rename part too?
> Something like send_receive(...) with dummy callback for receive path.
>
> Thanks

Sure, I'll whip something up for the next version of the patch series...
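
For illustration, the "send_receive(...) with dummy callback" idea suggested above could look roughly like the sketch below. Everything in it is an assumption: struct demo_link and its function pointers are a hypothetical transport interface, and the two stubs exist only so the helper can be exercised. The real tool's rd_send_msg()/rd_recv_msg() have their own signatures.

```c
#include <stddef.h>

/* Callback type invoked for each received message. */
typedef int (*demo_cb_t)(const void *msg, void *data);

/* Hypothetical transport: how to send the prepared request and how to
 * wait for the reply. */
struct demo_link {
	int (*send)(void *ctx);
	int (*recv)(void *ctx, demo_cb_t cb, void *data);
	void *ctx;
};

/* Dummy callback for commands that expect only an ACK, no payload. */
static int demo_dummy_cb(const void *msg, void *data)
{
	(void)msg;
	(void)data;
	return 0;
}

/*
 * Send the request, then wait for the reply so that an error reported
 * by the other side (duplicate name, missing link, ...) reaches the
 * caller instead of being silently dropped.
 */
static int demo_send_receive(struct demo_link *link)
{
	int ret = link->send(link->ctx);

	if (ret)
		return ret;
	return link->recv(link->ctx, demo_dummy_cb, NULL);
}

/* Stub transport, for demonstration only: send always succeeds and
 * recv reports the status stored in ctx. */
static int demo_send_ok(void *ctx)
{
	(void)ctx;
	return 0;
}

static int demo_recv_status(void *ctx, demo_cb_t cb, void *data)
{
	(void)cb;
	(void)data;
	return *(int *)ctx;
}
```

Both the add/delete and the rename paths could then share the one helper rather than each open-coding the send and receive steps.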

Re: [PATCH RFC iproute2-next 1/2] rdma: add 'link add/delete' commands

2018-11-28 Thread Leon Romanovsky

On Wed, Nov 28, 2018 at 02:07:29PM -0600, Steve Wise wrote:
>
>
> On 11/28/2018 2:04 PM, Leon Romanovsky wrote:
> > On Wed, Nov 28, 2018 at 01:08:05PM -0600, Steve Wise wrote:
> >>
> >> On 11/28/2018 12:26 PM, Leon Romanovsky wrote:
> >>> On Thu, Sep 13, 2018 at 10:19:21AM -0700, Steve Wise wrote:
> >>>> Add new 'link' subcommand 'add' and 'delete' to allow binding a soft-rdma
> >>>> device to a netdev interface.
> >>>>
> >>>> EG:
> >>>>
> >>>> rdma link add rxe_eth0 type rxe dev eth0
> >>>> rdma link delete rxe_eth0
> >>>>
> >>>> Signed-off-by: Steve Wise 
> >>>> ---
> >>>>  rdma/link.c  | 106 +++
> >>>>  rdma/rdma.h  |   1 +
> >>>>  rdma/utils.c |   2 +-
> >>>>  3 files changed, 108 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/rdma/link.c b/rdma/link.c
> >>>> index 7a6d4b7e356d..d4f76b0ce11f 100644
> >>>> --- a/rdma/link.c
> >>>> +++ b/rdma/link.c
> >>>> @@ -14,6 +14,8 @@
> >>>>  static int link_help(struct rd *rd)
> >>>>  {
> >>>>  pr_out("Usage: %s link show [DEV/PORT_INDEX]\n", rd->filename);
> >>>> +pr_out("Usage: %s link add NAME type TYPE dev DEV\n", rd->filename);
> >>> I suggest to rename "dev" to be "netdev", because we are using "dev" for
> >>> ib devices.
> >> Yea ok.
> >>
> >>>> +pr_out("Usage: %s link delete NAME type TYPE\n", rd->filename);
> >>> Why do you need "type" for "delete" command?
> >> Because the type is used in the kernel to find the appropriate link
> >> ops.  I could change the kernel side to search all types for the device
> >> name to delete? 
> > I would say, yes.
> > It makes "delete" operation more natural.
> >
> > Thanks
>
> Perhaps.
>
> Note: 'ip link delete' takes a type as well...

According to man section, yes.
According to various guides, no.
https://docs.fedoraproject.org/en-US/Fedora/20/html/Networking_Guide/sec-Configure_802_1Q_VLAN_Tagging_ip_Commands.html

Thanks

>
>



Re: [PATCH RFC iproute2-next 1/2] rdma: add 'link add/delete' commands

2018-11-28 Thread Leon Romanovsky

On Wed, Nov 28, 2018 at 10:02:04PM +0200, Leon Romanovsky wrote:
> On Wed, Nov 28, 2018 at 01:34:14PM -0600, Steve Wise wrote:
> > ...
> >
> > >>> +   rd_prepare_msg(rd, RDMA_NLDEV_CMD_NEWLINK, &seq,
> > >>> +  (NLM_F_REQUEST | NLM_F_ACK));
> > >>> +   mnl_attr_put_strz(rd->nlh, RDMA_NLDEV_ATTR_DEV_NAME, name);
> > >>> +   mnl_attr_put_strz(rd->nlh, RDMA_NLDEV_ATTR_LINK_TYPE, type);
> > >>> +   mnl_attr_put_strz(rd->nlh, RDMA_NLDEV_ATTR_NDEV_NAME, dev);
> > >>> +   ret = rd_send_msg(rd);
> > >>> +   if (ret)
> > >>> +   return ret;
> > >>> +
> > >>> +   ret = rd_recv_msg(rd, link_add_parse_cb, rd, seq);
> > >>> +   if (ret)
> > >>> +   perror(NULL);
> > >> Why do you need rd_recv_msg()? I think that it is not needed, at least
> > >> for rename, I didn't need it.
> > >> https://git.kernel.org/pub/scm/network/iproute2/iproute2-next.git/tree/rdma/dev.c#n244
> > > To get the response of if it was successfully added.  It provides the
> > > errno value.
> > If I don't do the rd_recv_msg, then adding the same name twice fails
> > without any error notification.  Ditto for deleting a non-existent
> > link.  So the rd_recv_msg() allows getting the failure reason (and
> > detecting the failure). 
> >
>
> Shouldn't extack provide such information as part of NLM_F_ACK flag?
>
> just shooting into the air, will take more close look tomorrow.

OK, it was easier than I thought.

You are right, need both send and receive to get the reason.

Can you prepare general function and update rename part too?
Something like send_receive(...) with dummy callback for receive path.

Thanks

>
> Thanks
>
> >





Re: [PATCH RFC iproute2-next 1/2] rdma: add 'link add/delete' commands

2018-11-28 Thread Steve Wise




On 11/28/2018 2:04 PM, Leon Romanovsky wrote:
> On Wed, Nov 28, 2018 at 01:08:05PM -0600, Steve Wise wrote:
>>
>> On 11/28/2018 12:26 PM, Leon Romanovsky wrote:
>>> On Thu, Sep 13, 2018 at 10:19:21AM -0700, Steve Wise wrote:
>>>> Add new 'link' subcommand 'add' and 'delete' to allow binding a soft-rdma
>>>> device to a netdev interface.
>>>>
>>>> EG:
>>>>
>>>> rdma link add rxe_eth0 type rxe dev eth0
>>>> rdma link delete rxe_eth0
>>>>
>>>> Signed-off-by: Steve Wise 
>>>> ---
>>>>  rdma/link.c  | 106 +++
>>>>  rdma/rdma.h  |   1 +
>>>>  rdma/utils.c |   2 +-
>>>>  3 files changed, 108 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/rdma/link.c b/rdma/link.c
>>>> index 7a6d4b7e356d..d4f76b0ce11f 100644
>>>> --- a/rdma/link.c
>>>> +++ b/rdma/link.c
>>>> @@ -14,6 +14,8 @@
>>>>  static int link_help(struct rd *rd)
>>>>  {
>>>>  pr_out("Usage: %s link show [DEV/PORT_INDEX]\n", rd->filename);
>>>> +pr_out("Usage: %s link add NAME type TYPE dev DEV\n", rd->filename);
>>> I suggest to rename "dev" to be "netdev", because we are using "dev" for
>>> ib devices.
>> Yea ok.
>>
>>>> +pr_out("Usage: %s link delete NAME type TYPE\n", rd->filename);
>>> Why do you need "type" for "delete" command?
>> Because the type is used in the kernel to find the appropriate link
>> ops.  I could change the kernel side to search all types for the device
>> name to delete? 
> I would say, yes.
> It makes "delete" operation more natural.
>
> Thanks

Perhaps.

Note: 'ip link delete' takes a type as well...

Re: [PATCH RFC iproute2-next 1/2] rdma: add 'link add/delete' commands

2018-11-28 Thread Leon Romanovsky

On Wed, Nov 28, 2018 at 01:08:05PM -0600, Steve Wise wrote:
>
>
> On 11/28/2018 12:26 PM, Leon Romanovsky wrote:
> > On Thu, Sep 13, 2018 at 10:19:21AM -0700, Steve Wise wrote:
> >> Add new 'link' subcommand 'add' and 'delete' to allow binding a soft-rdma
> >> device to a netdev interface.
> >>
> >> EG:
> >>
> >> rdma link add rxe_eth0 type rxe dev eth0
> >> rdma link delete rxe_eth0
> >>
> >> Signed-off-by: Steve Wise 
> >> ---
> >>  rdma/link.c  | 106 
> >> +++
> >>  rdma/rdma.h  |   1 +
> >>  rdma/utils.c |   2 +-
> >>  3 files changed, 108 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/rdma/link.c b/rdma/link.c
> >> index 7a6d4b7e356d..d4f76b0ce11f 100644
> >> --- a/rdma/link.c
> >> +++ b/rdma/link.c
> >> @@ -14,6 +14,8 @@
> >>  static int link_help(struct rd *rd)
> >>  {
> >>pr_out("Usage: %s link show [DEV/PORT_INDEX]\n", rd->filename);
> >> +  pr_out("Usage: %s link add NAME type TYPE dev DEV\n", rd->filename);
> > I suggest to rename "dev" to be "netdev", because we are using "dev" for
> > ib devices.
>
> Yea ok.
>
> >> +  pr_out("Usage: %s link delete NAME type TYPE\n", rd->filename);
> > Why do you need "type" for "delete" command?
>
> Because the type is used in the kernel to find the appropriate link
> ops.  I could change the kernel side to search all types for the device
> name to delete? 

I would say, yes.
It makes "delete" operation more natural.

Thanks



Re: [PATCH RFC iproute2-next 1/2] rdma: add 'link add/delete' commands

2018-11-28 Thread Leon Romanovsky

On Wed, Nov 28, 2018 at 01:34:14PM -0600, Steve Wise wrote:
> ...
>
> >>> + rd_prepare_msg(rd, RDMA_NLDEV_CMD_NEWLINK, &seq,
> >>> +(NLM_F_REQUEST | NLM_F_ACK));
> >>> + mnl_attr_put_strz(rd->nlh, RDMA_NLDEV_ATTR_DEV_NAME, name);
> >>> + mnl_attr_put_strz(rd->nlh, RDMA_NLDEV_ATTR_LINK_TYPE, type);
> >>> + mnl_attr_put_strz(rd->nlh, RDMA_NLDEV_ATTR_NDEV_NAME, dev);
> >>> + ret = rd_send_msg(rd);
> >>> + if (ret)
> >>> + return ret;
> >>> +
> >>> + ret = rd_recv_msg(rd, link_add_parse_cb, rd, seq);
> >>> + if (ret)
> >>> + perror(NULL);
> >> Why do you need rd_recv_msg()? I think that it is not needed, at least
> >> for rename, I didn't need it.
> >> https://git.kernel.org/pub/scm/network/iproute2/iproute2-next.git/tree/rdma/dev.c#n244
> > To get the response of if it was successfully added.  It provides the
> > errno value.
> If I don't do the rd_recv_msg, then adding the same name twice fails
> without any error notification.  Ditto for deleting a non-existent
> link.  So the rd_recv_msg() allows getting the failure reason (and
> detecting the failure). 
>

Shouldn't extack provide such information as part of NLM_F_ACK flag?

just shooting into the air, will take more close look tomorrow.

Thanks

>



Re: [PATCH] selinux: add support for RTM_NEWCHAIN, RTM_DELCHAIN, and RTM_GETCHAIN

2018-11-28 Thread David Miller

From: Paul Moore 
Date: Wed, 28 Nov 2018 13:47:25 -0500

> On Wed, Nov 28, 2018 at 1:44 PM Paul Moore  wrote:
>> Commit 32a4f5ecd738 ("net: sched: introduce chain object to uapi")
>> added new RTM_* definitions without properly updating SELinux, this
>> patch adds the necessary SELinux support.
>>
>> While there was a BUILD_BUG_ON() in the SELinux code to protect from
>> exactly this case, it was bypassed in the broken commit.  In order to
>> hopefully prevent this from happening in the future, add additional
>> comments which provide some instructions on how to resolve the
>> BUILD_BUG_ON() failures.
>>
>> Fixes: 32a4f5ecd738 ("net: sched: introduce chain object to uapi")
>> Cc:  # 4.19
>> Signed-off-by: Paul Moore 
>> ---
>>  security/selinux/nlmsgtab.c |   13 -
>>  1 file changed, 12 insertions(+), 1 deletion(-)
> 
> I'm building a test kernel right now, assuming all goes well I'm going
> to send this up to Linus for v4.20.

Acked-by: David S. Miller

Re: [PATCH RFC iproute2-next 1/2] rdma: add 'link add/delete' commands

2018-11-28 Thread Steve Wise

...

>>> +   rd_prepare_msg(rd, RDMA_NLDEV_CMD_NEWLINK, &seq,
>>> +  (NLM_F_REQUEST | NLM_F_ACK));
>>> +   mnl_attr_put_strz(rd->nlh, RDMA_NLDEV_ATTR_DEV_NAME, name);
>>> +   mnl_attr_put_strz(rd->nlh, RDMA_NLDEV_ATTR_LINK_TYPE, type);
>>> +   mnl_attr_put_strz(rd->nlh, RDMA_NLDEV_ATTR_NDEV_NAME, dev);
>>> +   ret = rd_send_msg(rd);
>>> +   if (ret)
>>> +   return ret;
>>> +
>>> +   ret = rd_recv_msg(rd, link_add_parse_cb, rd, seq);
>>> +   if (ret)
>>> +   perror(NULL);
>> Why do you need rd_recv_msg()? I think that it is not needed, at least
>> for rename, I didn't need it.
>> https://git.kernel.org/pub/scm/network/iproute2/iproute2-next.git/tree/rdma/dev.c#n244
> To get the response of if it was successfully added.  It provides the
> errno value.
If I don't do the rd_recv_msg, then adding the same name twice fails
without any error notification.  Ditto for deleting a non-existent
link.  So the rd_recv_msg() allows getting the failure reason (and
detecting the failure).

Re: [net 0/4][pull request] Intel Wired LAN Driver Fixes 2018-11-28

2018-11-28 Thread David Miller

From: Jeff Kirsher 
Date: Wed, 28 Nov 2018 09:10:18 -0800

> This series contains fixes to igb, ixgbe and i40e.
> 
> Yunjian Wang from Huawei resolves a variable that could potentially be
> NULL before it is used.
> 
> Lihong fixes an i40e issue which goes back to 4.17 kernels, where
> deleting any of the MAC filters was causing the incorrect syncing for
> the PF.
> 
> Josh Elsasser caught that there were missing enum values in the link
> capabilities for x550 devices, which was preventing link for 1000BaseLX
> SFP modules.
> 
> Jan fixes the function header comments for XSK methods.
> 
> The following are changes since commit 
> 4df5ce9bc03e47d05f400e64aa32a82ec4cef419:
>   lan743x: Enable driver to work with LAN7431
> and are available in the git repository at:
>   git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue 1GbE

Pulled, thanks Jeff.

Re: [PATCH net] be2net: Fix NULL pointer dereference in be_tx_timeout()

2018-11-28 Thread David Miller

From: Ivan Vecera 
Date: Wed, 28 Nov 2018 20:29:44 +0100

> And the right way? netif_device_detach() that does not fire
> linkwatch events?

Allocate all the TX queue resources first.

Do not modify the TX queue configuration in any way whatsoever
meanwhile.

Only after successfully allocating the resources, should you
commit them to the hardware configuration.

That way there is no problem and you don't have to worry about
the OOPS in question.
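
The ordering described above (allocate everything first, commit only on success) can be sketched in plain userspace C. This is only an illustration under stated assumptions: struct demo_queues, demo_free_queues() and demo_update_queues() are hypothetical and say nothing about how be2net organises its rings.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical queue set; not the be2net data structures. */
struct demo_queues {
	void **rings;
	size_t count;
};

/* Release every ring plus the ring array and reset the set to empty. */
static void demo_free_queues(struct demo_queues *q)
{
	size_t i;

	for (i = 0; i < q->count; i++)
		free(q->rings[i]);
	free(q->rings);
	q->rings = NULL;
	q->count = 0;
}

/*
 * Phase 1 builds a complete replacement set on the side.
 * Phase 2 swaps it in.  Any failure in phase 1 leaves 'active'
 * exactly as it was, so the running configuration never sees a
 * half-built state.
 */
static int demo_update_queues(struct demo_queues *active,
			      size_t new_count, size_t ring_size)
{
	struct demo_queues next = { NULL, 0 };

	if (new_count == 0 || ring_size == 0)
		return -EINVAL;

	next.rings = calloc(new_count, sizeof(*next.rings));
	if (!next.rings)
		return -ENOMEM;

	for (next.count = 0; next.count < new_count; next.count++) {
		next.rings[next.count] = malloc(ring_size);
		if (!next.rings[next.count]) {
			/* Frees only the rings allocated so far. */
			demo_free_queues(&next);
			return -ENOMEM;
		}
	}

	/* Commit point: only now is the old configuration released. */
	demo_free_queues(active);
	*active = next;
	return 0;
}
```

Because the old set stays live until the commit point, nothing in this scheme needs to signal a carrier change while the new resources are being prepared.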

Re: [PATCH net] be2net: Fix NULL pointer dereference in be_tx_timeout()

2018-11-28 Thread Ivan Vecera

On 28. 11. 18 20:00, David Miller wrote:

> From: Ivan Vecera 
> Date: Wed, 28 Nov 2018 11:12:22 +0100
>
>> On 27. 11. 18 23:51, David Miller wrote:
>>> From: Petr Oros 
>>> Date: Thu, 22 Nov 2018 15:24:07 +0100
>>>
>>>> @@ -4700,8 +4700,11 @@ int be_update_queues(struct be_adapter *adapter)
>>>>  	struct net_device *netdev = adapter->netdev;
>>>>  	int status;
>>>>  
>>>> -	if (netif_running(netdev))
>>>> +	if (netif_running(netdev)) {
>>>> +		/* prevent netdev watchdog during tx queue destroy */
>>>> +		netif_carrier_off(netdev);
>>>>  		be_close(netdev);
>>>> +	}
>>>
>>> This will make userspace networking daemons think that the link
>>> went down.
>>> This absolutely should not be a side effect of making a simple
>>> TX queue configuration change via ethtool.
>>
>> Yes, you're right Dave. But the same thing (netif_carrier_off()) is
>> done by a number of other drivers (igb, tg3, ixgbe...) during the
>> .set_channels() callback. The patch that Petr sent does
>> practically the same thing as this one:
>
> The bug exists in other drivers, thanks for reporting that.
>
> It doesn't mean you should add the same bug here as well.

OK.
And the right way? netif_device_detach() that does not fire linkwatch events?

Thx,
Ivan

Re: [PATCH bpf] bpf: Support sk lookup in netns with id 0

2018-11-28 Thread Joe Stringer

On Tue, 27 Nov 2018 at 13:12, Alexei Starovoitov
 wrote:
>
> On Tue, Nov 27, 2018 at 10:01:40AM -0800, Joe Stringer wrote:
> > On Tue, 27 Nov 2018 at 06:49, Nicolas Dichtel  
> > wrote:
> > >
> > > On 26/11/2018 at 23:08, David Ahern wrote:
> > > > On 11/26/18 2:27 PM, Joe Stringer wrote:
> > > >> @@ -2405,6 +2407,9 @@ enum bpf_func_id {
> > > >>  /* BPF_FUNC_perf_event_output for sk_buff input context. */
> > > >>  #define BPF_F_CTXLEN_MASK   (0xfULL << 32)
> > > >>
> > > >> +/* BPF_FUNC_sk_lookup_tcp and BPF_FUNC_sk_lookup_udp flags. */
> > > >> +#define BPF_F_SK_CURRENT_NS 0x8000 /* For netns field */
> > > >> +
> > > >
> > > > I went down the nsid road because it will be needed for other use cases
> > > > (e.g., device lookups), and we should have a general API for network
> > > > namespaces. Given that, I think the _SK should be dropped from the name.
> >
> > Fair point, I'll drop _SK from the name
> >
> > > >
> > > Would it not be possible to have a s32 instead of an u32 for the coming 
> > > APIs?
> > > It would be better to match the current netlink and kernel APIs.
> >
> > Sure, I'll look into this.
> >
> > I had earlier considered whether it's worth attempting to leave the
> > upper 32 bits of this parameter open for potential future expansion,
> > but at this point I'm not taking that into consideration. If anyone
> > has preferences or thoughts on that I'd be interested to hear them.
>
> Can we keep u64 as an argument type and do
> if ((s32)netns_id < 0) {
>   net = caller_net;
> } else {
>   if (netns_id > S32_MAX)
> goto err;
>   net = get_net_ns_by_id(caller_net, netns_id);
> }
>
> No need for extra macro in such case and passing -1 would match the rest of 
> the kernel.
> Upper 32-bit would still be open for future expansion.

Sounds good.
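
To make the proposed interpretation of the u64 argument concrete, here is a small userspace sketch of the classification step only. The enum and demo_classify_netns_id() are hypothetical names. The actual namespace lookup is left out. The narrowing cast to int32_t is implementation-defined in ISO C, and the sketch assumes the usual two's-complement wrap that GCC and Clang provide.

```c
#include <stdint.h>

/* What the caller should do with a given netns_id argument. */
enum demo_ns_choice {
	DEMO_NS_CALLER,   /* low 32 bits negative: use the caller's netns */
	DEMO_NS_BY_ID,    /* valid non-negative id: look the netns up */
	DEMO_NS_INVALID,  /* upper bits set: reserved for future use */
};

/* Mirror of the check order in the snippet quoted above. */
static enum demo_ns_choice demo_classify_netns_id(uint64_t netns_id)
{
	if ((int32_t)netns_id < 0)
		return DEMO_NS_CALLER;
	if (netns_id > INT32_MAX)
		return DEMO_NS_INVALID;
	return DEMO_NS_BY_ID;
}
```

One property worth noting: because the sign test looks only at the low 32 bits, a value such as 0x80000000 selects the caller's namespace, just as -1 does.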

Re: pull-request: can-next 2018-11-28

2018-11-28 Thread David Miller

From: Marc Kleine-Budde 
Date: Wed, 28 Nov 2018 17:01:13 +0100

>   
> ssh://g...@gitolite.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next.git
>  tags/linux-can-next-for-4.21-20181128

This doesn't work for me at all.

I'm not adding a custom .ssh config entry to point gitolite.kernel.org
to the same SSH key that I use for gitol...@ra.kernel.org, no way.

I don't want to see these SSH based URLs any more, they are a pain and
add overhead to my workflow.

Please just use plain git:// URLs, thank you.

Re: [PATCH net,stable] s390/qeth: fix length check in SNMP processing

2018-11-28 Thread David Miller

From: Julian Wiedmann 
Date: Wed, 28 Nov 2018 16:20:50 +0100

> The response for a SNMP request can consist of multiple parts, which
> the cmd callback stages into a kernel buffer until all parts have been
> received. If the callback detects that the staging buffer provides
> insufficient space, it bails out with error.
> This processing is buggy for the first part of the response - while it
> initially checks for a length of 'data_len', it later copies an
> additional amount of 'offsetof(struct qeth_snmp_cmd, data)' bytes.
> 
> Fix the calculation of 'data_len' for the first part of the response.
> This also nicely cleans up the memcpy code.
> 
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Signed-off-by: Julian Wiedmann 
> Reviewed-by: Ursula Braun 

Applied and queued up for -stable.
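
The class of bug described in the commit message (checking one length, then copying a larger one) can be illustrated in plain C. This is a generic sketch and not the qeth code: struct cmd_sketch, struct staging_sketch and stage_first_part() are made-up names, and the sizes are arbitrary.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical command with a header in front of its payload. */
struct cmd_sketch {
	unsigned int token;
	unsigned int seq;
	unsigned char data[8];
};

/* Hypothetical staging buffer that collects a multi-part response. */
struct staging_sketch {
	unsigned char buf[32];
	size_t used;
};

/*
 * Stage the first part of a response: the header bytes in front of 'data'
 * plus 'payload_len' bytes of payload. The length that is checked is the
 * same length that is copied.
 */
static int stage_first_part(struct staging_sketch *s,
			    const struct cmd_sketch *cmd, size_t payload_len)
{
	size_t copy_len = offsetof(struct cmd_sketch, data) + payload_len;

	if (payload_len > sizeof(cmd->data))
		return -1;	/* source does not hold that much payload */
	if (copy_len > sizeof(s->buf) - s->used)
		return -1;	/* staging buffer is too small */
	memcpy(s->buf + s->used, cmd, copy_len);
	s->used += copy_len;
	return 0;
}

/* Usage example: 12 bytes are free, the payload alone would fit. */
void stage_first_part_example(void)
{
	struct staging_sketch s = { .used = 20 };
	struct cmd_sketch c = { 0 };

	/* header (8) + payload (8) = 16 bytes do not fit into 12 */
	assert(stage_first_part(&s, &c, 8) == -1);
}
```

A check against payload_len alone would have accepted the call in the example and then copied past the end of the buffer.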

Re: [PATCH RFC iproute2-next 1/2] rdma: add 'link add/delete' commands

2018-11-28 Thread Steve Wise




On 11/28/2018 12:26 PM, Leon Romanovsky wrote:
> On Thu, Sep 13, 2018 at 10:19:21AM -0700, Steve Wise wrote:
>> Add new 'link' subcommand 'add' and 'delete' to allow binding a soft-rdma
>> device to a netdev interface.
>>
>> EG:
>>
>> rdma link add rxe_eth0 type rxe dev eth0
>> rdma link delete rxe_eth0
>>
>> Signed-off-by: Steve Wise 
>> ---
>>  rdma/link.c  | 106 +++
>>  rdma/rdma.h  |   1 +
>>  rdma/utils.c |   2 +-
>>  3 files changed, 108 insertions(+), 1 deletion(-)
>>
>> diff --git a/rdma/link.c b/rdma/link.c
>> index 7a6d4b7e356d..d4f76b0ce11f 100644
>> --- a/rdma/link.c
>> +++ b/rdma/link.c
>> @@ -14,6 +14,8 @@
>>  static int link_help(struct rd *rd)
>>  {
>>  pr_out("Usage: %s link show [DEV/PORT_INDEX]\n", rd->filename);
>> +pr_out("Usage: %s link add NAME type TYPE dev DEV\n", rd->filename);
> I suggest to rename "dev" to be "netdev", because we are using "dev" for
> ib devices.

Yea ok.

>> +pr_out("Usage: %s link delete NAME type TYPE\n", rd->filename);
> Why do you need "type" for "delete" command?

Because the type is used in the kernel to find the appropriate link
ops.  I could change the kernel side to search all types for the device
name to delete? 

>>  return 0;
>>  }
>>
>> @@ -315,10 +317,114 @@ static int link_show(struct rd *rd)
>>  return rd_exec_link(rd, link_one_show, true);
>>  }
>>
>> +static int link_add_parse_cb(const struct nlmsghdr *nlh, void *data)
>> +{
>> +return MNL_CB_OK;
>> +}
>> +
>> +static int link_add(struct rd *rd)
>> +{
>> +char *name;
>> +char *type = NULL;
>> +char *dev = NULL;
>> +uint32_t seq;
>> +int ret;
>> +
>> +if (rd_no_arg(rd)) {
>> +pr_err("No link name was supplied\n");
>> +return -EINVAL;
>> +}
>> +name = rd_argv(rd);
>> +rd_arg_inc(rd);
>> +while (!rd_no_arg(rd)) {
>> +if (rd_argv_match(rd, "type")) {
>> +rd_arg_inc(rd);
>> +type = rd_argv(rd);
>> +} else if (rd_argv_match(rd, "dev")) {
>> +rd_arg_inc(rd);
>> +dev = rd_argv(rd);
>> +} else {
>> +pr_err("Invalid parameter %s\n", rd_argv(rd));
>> +return -EINVAL;
>> +}
>> +rd_arg_inc(rd);
>> +}
>> +if (!type) {
>> +pr_err("No type was supplied\n");
>> +return -EINVAL;
>> +}
>> +if (!dev) {
>> +pr_err("No net device was supplied\n");
>> +return -EINVAL;
>> +}
>> +
>> +rd_prepare_msg(rd, RDMA_NLDEV_CMD_NEWLINK, &seq,
>> +   (NLM_F_REQUEST | NLM_F_ACK));
>> +mnl_attr_put_strz(rd->nlh, RDMA_NLDEV_ATTR_DEV_NAME, name);
>> +mnl_attr_put_strz(rd->nlh, RDMA_NLDEV_ATTR_LINK_TYPE, type);
>> +mnl_attr_put_strz(rd->nlh, RDMA_NLDEV_ATTR_NDEV_NAME, dev);
>> +ret = rd_send_msg(rd);
>> +if (ret)
>> +return ret;
>> +
>> +ret = rd_recv_msg(rd, link_add_parse_cb, rd, seq);
>> +if (ret)
>> +perror(NULL);
> Why do you need rd_recv_msg()? I think that it is not needed, at least
> for rename, I didn't need it.
> https://git.kernel.org/pub/scm/network/iproute2/iproute2-next.git/tree/rdma/dev.c#n244

To get the response of if it was successfully added.  It provides the
errno value.
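
As background on where that errno comes from: a request sent with NLM_F_ACK is answered by an NLMSG_ERROR message whose error field is 0 on success or a negative errno on failure. The sketch below decodes such a message with the definitions from <linux/netlink.h> only. The function names are hypothetical, and the message is built by hand here instead of being received from a socket.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <linux/netlink.h>

/*
 * Return 0 if the message acknowledges success, a negative errno if it
 * reports a failure, and 1 if it is not an ACK/error message at all.
 */
static int ack_status(const struct nlmsghdr *nlh)
{
	const struct nlmsgerr *err;

	if (nlh->nlmsg_type != NLMSG_ERROR)
		return 1;
	if (nlh->nlmsg_len < NLMSG_LENGTH(sizeof(*err)))
		return -EBADMSG;	/* truncated message */
	err = NLMSG_DATA(nlh);
	return err->error;
}

/* Build an ACK by hand (no socket involved) and decode it. */
static int fake_ack_status(int error)
{
	struct {
		struct nlmsghdr nlh;
		struct nlmsgerr err;
	} msg;

	memset(&msg, 0, sizeof(msg));
	msg.nlh.nlmsg_len = NLMSG_LENGTH(sizeof(msg.err));
	msg.nlh.nlmsg_type = NLMSG_ERROR;
	msg.err.error = error;
	return ack_status(&msg.nlh);
}

/* Usage examples: a successful add and a name that already exists. */
void ack_status_examples(void)
{
	assert(fake_ack_status(0) == 0);
	assert(fake_ack_status(-EEXIST) == -EEXIST);
}
```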

>> +return ret;
>> +}
>> +
>> +static int link_del_parse_cb(const struct nlmsghdr *nlh, void *data)
>> +{
>> +return MNL_CB_OK;
>> +}
>> +
>> +static int link_del(struct rd *rd)
>> +{
>> +char *name;
>> +char *type = NULL;
>> +uint32_t seq;
>> +int ret;
>> +
>> +if (rd_no_arg(rd)) {
>> +pr_err("No link type was supplied\n");
>> +return -EINVAL;
>> +}
>> +name = rd_argv(rd);
>> +rd_arg_inc(rd);
>> +while (!rd_no_arg(rd)) {
>> +if (rd_argv_match(rd, "type")) {
>> +rd_arg_inc(rd);
>> +type = rd_argv(rd);
>> +} else {
>> +pr_err("Invalid parameter %s\n", rd_argv(rd));
>> +return -EINVAL;
>> +}
>> +rd_arg_inc(rd);
>> +}
>> +if (!type) {
>> +pr_err("No type was supplied\n");
>> +return -EINVAL;
>> +}
>> +rd_prepare_msg(rd, RDMA_NLDEV_CMD_DELLINK, &seq,
>> +   (NLM_F_REQUEST | NLM_F_ACK));
>> +mnl_attr_put_strz(rd->nlh, RDMA_NLDEV_ATTR_DEV_NAME, name);
>> +mnl_attr_put_strz(rd->nlh, RDMA_NLDEV_ATTR_LINK_TYPE, type);
>> +ret = rd_send_msg(rd);
>> +if (ret)
>> +return ret;
>> +
>> +ret = rd_recv_msg(rd, link_del_parse_cb, rd, seq);
>> +if (ret)
>> +perror(NULL);
>> +return ret;
> The same question as above.
>
>> +}
>> +
>>  int cmd_link(struct rd *rd)
>>  {
>>  const struct rd_cmd cmds[] = {
>>  { NULL, link_show },
>> +{ "add",link_add },
>> +{

Re: BPF uapi structures and 32-bit

2018-11-28 Thread David Miller

From: Daniel Borkmann 
Date: Wed, 28 Nov 2018 11:34:55 +0100

> Yeah fully agree. Thinking diff below should address it, do you
> have a chance to give this a spin for sparc / 32 bit to check if
> test_verifier still explodes?

Great, let me play with this.

I did something simpler yesterday, just changing the data pointers to
"u64" and that made at least one test pass that didn't before :-)

I'll get back to you with results.
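
To illustrate why plain pointers in a shared structure are a problem on 32-bit, here is a small sketch. The struct names are hypothetical and do not correspond to a specific BPF uapi structure. The point is only that a fixed 64-bit field has the same size on every ABI, while a pointer does not.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: pointer-sized fields change size with the ABI. */
struct ctx_native {
	void *data;
	void *data_end;
};

/* The same information carried in fixed 64-bit fields. */
struct ctx_fixed {
	uint64_t data;
	uint64_t data_end;
};

/* Convert via uintptr_t so the cast is clean on 32-bit and 64-bit. */
static uint64_t ptr_to_u64(const void *p)
{
	return (uint64_t)(uintptr_t)p;
}

static void *u64_to_ptr(uint64_t v)
{
	return (void *)(uintptr_t)v;
}

/* Usage example: the fixed layout is 16 bytes everywhere. */
void ctx_fixed_example(void)
{
	int value = 0;
	struct ctx_fixed ctx = { .data = ptr_to_u64(&value) };

	assert(sizeof(struct ctx_fixed) == 16);
	assert(u64_to_ptr(ctx.data) == (void *)&value);
}
```

sizeof(struct ctx_native) is 8 on a 32-bit ABI and 16 on a 64-bit one, so both sides only agree on field offsets with the fixed layout.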

Re: [PATCH net] be2net: Fix NULL pointer dereference in be_tx_timeout()

2018-11-28 Thread David Miller

From: Ivan Vecera 
Date: Wed, 28 Nov 2018 11:12:22 +0100

> On 27. 11. 18 23:51, David Miller wrote:
>> From: Petr Oros 
>> Date: Thu, 22 Nov 2018 15:24:07 +0100
>> 
>>> @@ -4700,8 +4700,11 @@ int be_update_queues(struct be_adapter *adapter)
>>> struct net_device *netdev = adapter->netdev;
>>> int status;
>>>   - if (netif_running(netdev))
>>> +   if (netif_running(netdev)) {
>>> +   /* prevent netdev watchdog during tx queue destroy */
>>> +   netif_carrier_off(netdev);
>>> be_close(netdev);
>>> +   }
>> This will make userspace networking daemons think that the link
>> went down.
>> This absolutely should not be a side effect of making a simple
>> TX queue configuration change via ethtool.
>> 
> 
> Yes, you're right Dave. But the same thing (netif_carrier_off()) is
> done by a number of other drivers (igb, tg3, ixgbe...) during the
> .set_channels() callback. The patch that Petr sent does
> practically the same thing as this one:

The bug exists in other drivers, thanks for reporting that.

It doesn't mean you should add the same bug here as well.

Re: [PATCH v2 net-next 0/8] dpaa2-eth: Introduce XDP support

2018-11-28 Thread David Miller

From: Ioana Ciocoi Radulescu 
Date: Wed, 28 Nov 2018 09:18:28 +

> They apply cleanly for me.

I figured out what happened.

The patches were mis-ordered (specifically patches #3 and #4) when I added
them to the patchwork bundle, and that is what causes them to fail.

Series applied, thanks!

[PATCH mlx5-next 06/11] RDMA/mlx5: Remove SRQ signature global flag

2018-11-28 Thread Leon Romanovsky

From: Leon Romanovsky 

SRQ signature is not supported, hence there is no need for a special
static global variable to announce it.

Reviewed-by: Mark Bloch 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/hw/mlx5/srq.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/srq.c b/drivers/infiniband/hw/mlx5/srq.c
index a86d9f153805..154fd0d3718e 100644
--- a/drivers/infiniband/hw/mlx5/srq.c
+++ b/drivers/infiniband/hw/mlx5/srq.c
@@ -12,9 +12,6 @@
 #include "mlx5_ib.h"
 #include "srq.h"

-/* not supported currently */
-static int srq_signature;
-
 static void *get_wqe(struct mlx5_ib_srq *srq, int n)
 {
return mlx5_buf_offset(&srq->buf, n << srq->msrq.wqe_shift);
@@ -175,7 +172,7 @@ static int create_srq_kernel(struct mlx5_ib_dev *dev, struct mlx5_ib_srq *srq,
err = -ENOMEM;
goto err_in;
}
-   srq->wq_sig = !!srq_signature;
+   srq->wq_sig = 0;

in->log_page_size = srq->buf.page_shift - MLX5_ADAPTER_PAGE_SHIFT;
if (MLX5_CAP_GEN(dev->mdev, cqe_version) == MLX5_CQE_VERSION_V1 &&
--
2.19.1

[PATCH rdma-next 11/11] RDMA/mlx5: Unfold modify RMP function

2018-11-28 Thread Leon Romanovsky

From: Leon Romanovsky 

There is no need to perform modify_rmp in two separate functions,
where one uses the stack as a placeholder for data while the other
allocates it dynamically. Combine those two functions into a single
call.

Reviewed-by: Mark Bloch 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/hw/mlx5/srq_cmd.c | 62 +++-
 1 file changed, 34 insertions(+), 28 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/srq_cmd.c b/drivers/infiniband/hw/mlx5/srq_cmd.c
index 955df638b416..7aaaffbd4afa 100644
--- a/drivers/infiniband/hw/mlx5/srq_cmd.c
+++ b/drivers/infiniband/hw/mlx5/srq_cmd.c
@@ -298,24 +298,6 @@ static int query_xrc_srq_cmd(struct mlx5_ib_dev *dev,
return err;
 }
 
-static int mlx5_core_modify_rmp(struct mlx5_ib_dev *dev, u32 *in, int inlen)
-{
-   u32 out[MLX5_ST_SZ_DW(modify_rmp_out)] = {0};
-
-   MLX5_SET(modify_rmp_in, in, opcode, MLX5_CMD_OP_MODIFY_RMP);
-   return mlx5_cmd_exec(dev->mdev, in, inlen, out, sizeof(out));
-}
-
-static int mlx5_core_query_rmp(struct mlx5_ib_dev *dev, u32 rmpn, u32 *out)
-{
-   u32 in[MLX5_ST_SZ_DW(query_rmp_in)] = {0};
-   int outlen = MLX5_ST_SZ_BYTES(query_rmp_out);
-
-   MLX5_SET(query_rmp_in, in, opcode, MLX5_CMD_OP_QUERY_RMP);
-   MLX5_SET(query_rmp_in, in, rmpn,   rmpn);
-   return mlx5_cmd_exec(dev->mdev, in, sizeof(in), out, outlen);
-}
-
 static int create_rmp_cmd(struct mlx5_ib_dev *dev, struct mlx5_core_srq *srq,
  struct mlx5_srq_attr *in)
 {
@@ -373,15 +355,24 @@ static int destroy_rmp_cmd(struct mlx5_ib_dev *dev, struct mlx5_core_srq *srq)
 static int arm_rmp_cmd(struct mlx5_ib_dev *dev, struct mlx5_core_srq *srq,
   u16 lwm)
 {
-   void *in;
+   void *out = NULL;
+   void *in = NULL;
void *rmpc;
void *wq;
void *bitmask;
+   int outlen;
+   int inlen;
int err;
 
-   in = kvzalloc(MLX5_ST_SZ_BYTES(modify_rmp_in), GFP_KERNEL);
-   if (!in)
-   return -ENOMEM;
+   inlen = MLX5_ST_SZ_BYTES(modify_rmp_in);
+   outlen = MLX5_ST_SZ_BYTES(modify_rmp_out);
+
+   in = kvzalloc(inlen, GFP_KERNEL);
+   out = kvzalloc(outlen, GFP_KERNEL);
+   if (!in || !out) {
+   err = -ENOMEM;
+   goto out;
+   }
 
rmpc =MLX5_ADDR_OF(modify_rmp_in,   in,   ctx);
bitmask = MLX5_ADDR_OF(modify_rmp_in,   in,   bitmask);
@@ -393,25 +384,39 @@ static int arm_rmp_cmd(struct mlx5_ib_dev *dev, struct mlx5_core_srq *srq,
MLX5_SET(wq,wq,  lwm,   lwm);
MLX5_SET(rmp_bitmask,   bitmask, lwm,   1);
MLX5_SET(rmpc, rmpc, state, MLX5_RMPC_STATE_RDY);
+   MLX5_SET(modify_rmp_in, in, opcode, MLX5_CMD_OP_MODIFY_RMP);
 
-   err = mlx5_core_modify_rmp(dev, in, MLX5_ST_SZ_BYTES(modify_rmp_in));
+   err = mlx5_cmd_exec(dev->mdev, in, inlen, out, outlen);
 
+out:
kvfree(in);
+   kvfree(out);
return err;
 }
 
 static int query_rmp_cmd(struct mlx5_ib_dev *dev, struct mlx5_core_srq *srq,
 struct mlx5_srq_attr *out)
 {
-   u32 *rmp_out;
+   u32 *rmp_out = NULL;
+   u32 *rmp_in = NULL;
void *rmpc;
+   int outlen;
+   int inlen;
int err;
 
-   rmp_out =  kvzalloc(MLX5_ST_SZ_BYTES(query_rmp_out), GFP_KERNEL);
-   if (!rmp_out)
-   return -ENOMEM;
+   outlen = MLX5_ST_SZ_BYTES(query_rmp_out);
+   inlen = MLX5_ST_SZ_BYTES(query_rmp_in);
+
+   rmp_out = kvzalloc(outlen, GFP_KERNEL);
+   rmp_in = kvzalloc(inlen, GFP_KERNEL);
+   if (!rmp_out || !rmp_in) {
+   err = -ENOMEM;
+   goto out;
+   }
 
-   err = mlx5_core_query_rmp(dev, srq->srqn, rmp_out);
+   MLX5_SET(query_rmp_in, rmp_in, opcode, MLX5_CMD_OP_QUERY_RMP);
+   MLX5_SET(query_rmp_in, rmp_in, rmpn,   srq->srqn);
+   err = mlx5_cmd_exec(dev->mdev, rmp_in, inlen, rmp_out, outlen);
if (err)
goto out;
 
@@ -422,6 +427,7 @@ static int query_rmp_cmd(struct mlx5_ib_dev *dev, struct mlx5_core_srq *srq,
 
 out:
kvfree(rmp_out);
+   kvfree(rmp_in);
return err;
 }
 
-- 
2.19.1
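
The allocation pattern the patch above switches to (allocate both buffers, test them together, free both on a single exit path) can be sketched in userspace C. calloc()/free() stand in for kvzalloc()/kvfree(), and exec_cmd_sketch() is a made-up placeholder for the firmware command. None of the names below are from the driver.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical command executor; it just echoes the input bytes. */
static int exec_cmd_sketch(const void *in, size_t inlen,
			   void *out, size_t outlen)
{
	memcpy(out, in, inlen < outlen ? inlen : outlen);
	return 0;
}

/* Assumes inlen and outlen are both non-zero. */
static int run_cmd_sketch(size_t inlen, size_t outlen,
			  unsigned char opcode, unsigned char *result)
{
	unsigned char *out = NULL;
	unsigned char *in = NULL;
	int err;

	in = calloc(1, inlen);
	out = calloc(1, outlen);
	if (!in || !out) {
		err = -ENOMEM;
		goto free_bufs;
	}

	in[0] = opcode;
	err = exec_cmd_sketch(in, inlen, out, outlen);
	if (!err)
		*result = out[0];

free_bufs:
	/* free(NULL) is a no-op, so one exit path covers every case. */
	free(in);
	free(out);
	return err;
}

/* Usage example for the sketch above. */
void run_cmd_sketch_example(void)
{
	unsigned char result = 0;

	assert(run_cmd_sketch(4, 4, 0x2a, &result) == 0);
	assert(result == 0x2a);
}
```

Initializing both pointers to NULL is what makes the shared exit path safe when only one of the two allocations succeeds.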

[PATCH mlx5-next 10/11] RDMA/mlx5: Unfold create RMP function

2018-11-28 Thread Leon Romanovsky

From: Leon Romanovsky 

There is no need to perform create_rmp in two separate functions, where
one uses the stack as a placeholder for data while the other allocates
it dynamically. Combine those two functions into one.

Reviewed-by: Mark Bloch 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/hw/mlx5/srq_cmd.c | 35 +---
 1 file changed, 16 insertions(+), 19 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/srq_cmd.c b/drivers/infiniband/hw/mlx5/srq_cmd.c
index 6be89c6be40f..955df638b416 100644
--- a/drivers/infiniband/hw/mlx5/srq_cmd.c
+++ b/drivers/infiniband/hw/mlx5/srq_cmd.c
@@ -298,20 +298,6 @@ static int query_xrc_srq_cmd(struct mlx5_ib_dev *dev,
return err;
 }

-static int mlx5_core_create_rmp(struct mlx5_ib_dev *dev, u32 *in, int inlen,
-   u32 *rmpn)
-{
-   u32 out[MLX5_ST_SZ_DW(create_rmp_out)] = { 0 };
-   int err;
-
-   MLX5_SET(create_rmp_in, in, opcode, MLX5_CMD_OP_CREATE_RMP);
-   err = mlx5_cmd_exec(dev->mdev, in, inlen, out, sizeof(out));
-   if (!err)
-   *rmpn = MLX5_GET(create_rmp_out, out, rmpn);
-
-   return err;
-}
-
 static int mlx5_core_modify_rmp(struct mlx5_ib_dev *dev, u32 *in, int inlen)
 {
u32 out[MLX5_ST_SZ_DW(modify_rmp_out)] = {0};
@@ -333,18 +319,24 @@ static int mlx5_core_query_rmp(struct mlx5_ib_dev *dev, u32 rmpn, u32 *out)
 static int create_rmp_cmd(struct mlx5_ib_dev *dev, struct mlx5_core_srq *srq,
  struct mlx5_srq_attr *in)
 {
-   void *create_in;
+   void *create_out = NULL;
+   void *create_in = NULL;
void *rmpc;
void *wq;
int pas_size;
+   int outlen;
int inlen;
int err;

pas_size = get_pas_size(in);
inlen = MLX5_ST_SZ_BYTES(create_rmp_in) + pas_size;
+   outlen = MLX5_ST_SZ_BYTES(create_rmp_out);
create_in = kvzalloc(inlen, GFP_KERNEL);
-   if (!create_in)
-   return -ENOMEM;
+   create_out = kvzalloc(outlen, GFP_KERNEL);
+   if (!create_in || !create_out) {
+   err = -ENOMEM;
+   goto out;
+   }

rmpc = MLX5_ADDR_OF(create_rmp_in, create_in, ctx);
wq = MLX5_ADDR_OF(rmpc, rmpc, wq);
@@ -354,11 +346,16 @@ static int create_rmp_cmd(struct mlx5_ib_dev *dev, struct mlx5_core_srq *srq,
set_wq(wq, in);
memcpy(MLX5_ADDR_OF(rmpc, rmpc, wq.pas), in->pas, pas_size);

-   err = mlx5_core_create_rmp(dev, create_in, inlen, &srq->srqn);
-   if (!err)
+   MLX5_SET(create_rmp_in, create_in, opcode, MLX5_CMD_OP_CREATE_RMP);
+   err = mlx5_cmd_exec(dev->mdev, create_in, inlen, create_out, outlen);
+   if (!err) {
+   srq->srqn = MLX5_GET(create_rmp_out, create_out, rmpn);
srq->uid = in->uid;
+   }

+out:
kvfree(create_in);
+   kvfree(create_out);
return err;
 }

--
2.19.1

[PATCH mlx5-next 09/11] RDMA/mlx5: Initialize SRQ tables on mlx5_ib

2018-11-28 Thread Leon Romanovsky

From: Leon Romanovsky 

Transfer initialization and cleanup from mlx5_priv struct of
mlx5_core_dev to be part of mlx5_ib_dev. This completes removal
of SRQ from mlx5_core.

Reviewed-by: Mark Bloch 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/hw/mlx5/ib_rep.c   |  4 ++
 drivers/infiniband/hw/mlx5/main.c |  7 ++
 drivers/infiniband/hw/mlx5/mlx5_ib.h  |  5 +-
 drivers/infiniband/hw/mlx5/srq.c  |  1 -
 drivers/infiniband/hw/mlx5/srq.h  | 25 +++
 drivers/infiniband/hw/mlx5/srq_cmd.c  | 72 +++
 .../net/ethernet/mellanox/mlx5/core/Makefile  |  2 +-
 .../net/ethernet/mellanox/mlx5/core/main.c|  5 --
 drivers/net/ethernet/mellanox/mlx5/core/srq.c | 63 
 include/linux/mlx5/driver.h   | 25 ---
 include/linux/mlx5/srq.h  | 14 
 11 files changed, 101 insertions(+), 122 deletions(-)
 delete mode 100644 drivers/net/ethernet/mellanox/mlx5/core/srq.c
 delete mode 100644 include/linux/mlx5/srq.h

diff --git a/drivers/infiniband/hw/mlx5/ib_rep.c b/drivers/infiniband/hw/mlx5/ib_rep.c
index 584ff2ea7810..8a682d86d634 100644
--- a/drivers/infiniband/hw/mlx5/ib_rep.c
+++ b/drivers/infiniband/hw/mlx5/ib_rep.c
@@ -4,6 +4,7 @@
  */
 
 #include "ib_rep.h"
+#include "srq.h"
 
 static const struct mlx5_ib_profile rep_profile = {
STAGE_CREATE(MLX5_IB_STAGE_INIT,
@@ -21,6 +22,9 @@ static const struct mlx5_ib_profile rep_profile = {
STAGE_CREATE(MLX5_IB_STAGE_ROCE,
 mlx5_ib_stage_rep_roce_init,
 mlx5_ib_stage_rep_roce_cleanup),
+   STAGE_CREATE(MLX5_IB_STAGE_SRQ,
+mlx5_init_srq_table,
+mlx5_cleanup_srq_table),
STAGE_CREATE(MLX5_IB_STAGE_DEVICE_RESOURCES,
 mlx5_ib_stage_dev_res_init,
 mlx5_ib_stage_dev_res_cleanup),
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index b15f1a70d33b..24a0957ae630 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -60,6 +60,7 @@
 #include "mlx5_ib.h"
 #include "ib_rep.h"
 #include "cmd.h"
+#include "srq.h"
 #include 
 #include 
 #include 
@@ -6289,6 +6290,9 @@ static const struct mlx5_ib_profile pf_profile = {
STAGE_CREATE(MLX5_IB_STAGE_ROCE,
 mlx5_ib_stage_roce_init,
 mlx5_ib_stage_roce_cleanup),
+   STAGE_CREATE(MLX5_IB_STAGE_SRQ,
+mlx5_init_srq_table,
+mlx5_cleanup_srq_table),
STAGE_CREATE(MLX5_IB_STAGE_DEVICE_RESOURCES,
 mlx5_ib_stage_dev_res_init,
 mlx5_ib_stage_dev_res_cleanup),
@@ -6343,6 +6347,9 @@ static const struct mlx5_ib_profile nic_rep_profile = {
STAGE_CREATE(MLX5_IB_STAGE_ROCE,
 mlx5_ib_stage_rep_roce_init,
 mlx5_ib_stage_rep_roce_cleanup),
+   STAGE_CREATE(MLX5_IB_STAGE_SRQ,
+mlx5_init_srq_table,
+mlx5_cleanup_srq_table),
STAGE_CREATE(MLX5_IB_STAGE_DEVICE_RESOURCES,
 mlx5_ib_stage_dev_res_init,
 mlx5_ib_stage_dev_res_cleanup),
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index fe34129bb780..4ed9f76b49de 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -41,7 +41,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -50,6 +49,8 @@
 #include 
 #include 
 
+#include "srq.h"
+
 #define mlx5_ib_dbg(_dev, format, arg...)  
\
dev_dbg(&(_dev)->ib_dev.dev, "%s:%d:(pid %d): " format, __func__,  \
__LINE__, current->pid, ##arg)
@@ -774,6 +775,7 @@ enum mlx5_ib_stages {
MLX5_IB_STAGE_CAPS,
MLX5_IB_STAGE_NON_DEFAULT_CB,
MLX5_IB_STAGE_ROCE,
+   MLX5_IB_STAGE_SRQ,
MLX5_IB_STAGE_DEVICE_RESOURCES,
MLX5_IB_STAGE_DEVICE_NOTIFIER,
MLX5_IB_STAGE_ODP,
@@ -940,6 +942,7 @@ struct mlx5_ib_dev {
u64 sys_image_guid;
struct mlx5_memic   memic;
u16 devx_whitelist_uid;
+   struct mlx5_srq_table   srq_table;
 };
 
 static inline struct mlx5_ib_cq *to_mibcq(struct mlx5_core_cq *mcq)
diff --git a/drivers/infiniband/hw/mlx5/srq.c b/drivers/infiniband/hw/mlx5/srq.c
index 2b184c7f531a..91dcd3918d96 100644
--- a/drivers/infiniband/hw/mlx5/srq.c
+++ b/drivers/infiniband/hw/mlx5/srq.c
@@ -5,7 +5,6 @@
 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
diff --git a/drivers/infiniband/hw/mlx5/srq.h b/drivers/infiniband/hw/mlx5/srq.h
index 1110aeaa775e..75eb5839ae95 100644
--- a/drivers/infiniband/hw/mlx5/srq.h
+++ b/drivers/infiniband/hw/mlx5/srq.h
@@ -37,6 +37,28 @@ struct mlx5_srq_attr {
 
 struct mlx5_ib_dev;
 
+struct mlx5_core_srq {
+   struct
