Re: [PATCH net-next,v4 01/12] flow_offload: add flow_rule and flow_match structures and use them (fwd)

2018-11-30 Thread Julia Lawall
Hello,

It looks like the kfree on line 1573 should be moved down a few lines.

julia

-- Forwarded message --
Date: Sat, 1 Dec 2018 11:11:55 +0800
From: kbuild test robot 
To: kbu...@01.org
Cc: Julia Lawall 
Subject: Re: [PATCH net-next,v4 01/12] flow_offload: add flow_rule and flow_match structures and use them

CC: kbuild-...@01.org
In-Reply-To: <20181129022231.2740-2-pa...@netfilter.org>
References: <20181129022231.2740-2-pa...@netfilter.org>
TO: Pablo Neira Ayuso 
CC: netdev@vger.kernel.org

Hi Pablo,

I love your patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]
[also build test WARNING on next-20181130]
[cannot apply to v4.20-rc4]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:
https://github.com/0day-ci/linux/commits/Pablo-Neira-Ayuso/add-flow_rule-infrastructure/20181130-204709
:: branch date: 14 hours ago
:: commit date: 14 hours ago

>> net/sched/cls_flower.c:1586:7-9: ERROR: reference preceded by free on line 1573

# https://github.com/0day-ci/linux/commit/012ba7afb48db31be34a84d5fe82c4ceb409fd2d
git remote add linux-review https://github.com/0day-ci/linux
git remote update linux-review
git checkout 012ba7afb48db31be34a84d5fe82c4ceb409fd2d
vim +1586 net/sched/cls_flower.c

34738452 Jiri Pirko         2018-07-23  1544
b95ec7eb Jiri Pirko         2018-07-23  1545  static void *fl_tmplt_create(struct net *net, struct tcf_chain *chain,
b95ec7eb Jiri Pirko         2018-07-23  1546                               struct nlattr **tca,
b95ec7eb Jiri Pirko         2018-07-23  1547                               struct netlink_ext_ack *extack)
b95ec7eb Jiri Pirko         2018-07-23  1548  {
b95ec7eb Jiri Pirko         2018-07-23  1549          struct fl_flow_tmplt *tmplt;
b95ec7eb Jiri Pirko         2018-07-23  1550          struct nlattr **tb;
b95ec7eb Jiri Pirko         2018-07-23  1551          int err;
b95ec7eb Jiri Pirko         2018-07-23  1552
b95ec7eb Jiri Pirko         2018-07-23  1553          if (!tca[TCA_OPTIONS])
b95ec7eb Jiri Pirko         2018-07-23  1554                  return ERR_PTR(-EINVAL);
b95ec7eb Jiri Pirko         2018-07-23  1555
b95ec7eb Jiri Pirko         2018-07-23  1556          tb = kcalloc(TCA_FLOWER_MAX + 1, sizeof(struct nlattr *), GFP_KERNEL);
b95ec7eb Jiri Pirko         2018-07-23  1557          if (!tb)
b95ec7eb Jiri Pirko         2018-07-23  1558                  return ERR_PTR(-ENOBUFS);
b95ec7eb Jiri Pirko         2018-07-23  1559          err = nla_parse_nested(tb, TCA_FLOWER_MAX, tca[TCA_OPTIONS],
b95ec7eb Jiri Pirko         2018-07-23  1560                                 fl_policy, NULL);
b95ec7eb Jiri Pirko         2018-07-23  1561          if (err)
b95ec7eb Jiri Pirko         2018-07-23  1562                  goto errout_tb;
b95ec7eb Jiri Pirko         2018-07-23  1563
b95ec7eb Jiri Pirko         2018-07-23  1564          tmplt = kzalloc(sizeof(*tmplt), GFP_KERNEL);
1cbc36a5 Dan Carpenter      2018-08-03  1565          if (!tmplt) {
1cbc36a5 Dan Carpenter      2018-08-03  1566                  err = -ENOMEM;
b95ec7eb Jiri Pirko         2018-07-23  1567                  goto errout_tb;
1cbc36a5 Dan Carpenter      2018-08-03  1568          }
b95ec7eb Jiri Pirko         2018-07-23  1569          tmplt->chain = chain;
b95ec7eb Jiri Pirko         2018-07-23  1570          err = fl_set_key(net, tb, &tmplt->dummy_key, &tmplt->mask, extack);
b95ec7eb Jiri Pirko         2018-07-23  1571          if (err)
b95ec7eb Jiri Pirko         2018-07-23  1572                  goto errout_tmplt;
b95ec7eb Jiri Pirko         2018-07-23 @1573          kfree(tb);
b95ec7eb Jiri Pirko         2018-07-23  1574
b95ec7eb Jiri Pirko         2018-07-23  1575          fl_init_dissector(&tmplt->dissector, &tmplt->mask);
b95ec7eb Jiri Pirko         2018-07-23  1576
012ba7af Pablo Neira Ayuso  2018-11-29  1577          err = fl_hw_create_tmplt(chain, tmplt);
012ba7af Pablo Neira Ayuso  2018-11-29  1578          if (err)
012ba7af Pablo Neira Ayuso  2018-11-29  1579                  goto errout_tmplt;
34738452 Jiri Pirko         2018-07-23  1580
b95ec7eb Jiri Pirko         2018-07-23  1581          return tmplt;
b95ec7eb Jiri Pirko         2018-07-23  1582
b95ec7eb Jiri Pirko         2018-07-23  1583  errout_tmplt:
b95ec7eb Jiri Pirko         2018-07-23  1584          kfree(tmplt);
b95ec7eb Jiri Pirko         2018-07-23  1585  errout_tb:
b95ec7eb Jiri Pirko         2018-07-23 @1586          kfree(tb);
b95ec7eb Jiri Pirko         2018-07-23  1587          return ERR_PTR(err);
b95ec7eb Jiri Pirko         2018-07-23  1588  }
b95ec7eb Jiri Pirko         2018-07-23  1589
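
For illustration, a minimal sketch of the reordering Julia suggests: defer the
kfree(tb) at 1573 until after the last branch that can still jump to the error
labels, so errout_tb never frees an already-freed tb. This is a sketch of the
shape of the fix, not necessarily the patch that was actually applied:

	err = fl_hw_create_tmplt(chain, tmplt);
	if (err)
		goto errout_tmplt;

	kfree(tb);		/* moved below the last goto errout_tmplt */
	return tmplt;

errout_tmplt:
	kfree(tmplt);
errout_tb:
	kfree(tb);		/* reached only while tb is still allocated */
	return ERR_PTR(err);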

:: The code at line 1586 was first introduced by commit
:: b95ec7eb3b4d2f158dd15c912cf670b546f09571 net: sched: cls_flower: implement chain templates

:: TO: Jiri Pirko 
:: CC: David S. Miller 

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation


Re: [PATCH bpf-next v2 0/2] sample: xdp1 improvements

2018-11-30 Thread Alexei Starovoitov
On Sat, Dec 01, 2018 at 01:23:04AM +0100, Matteo Croce wrote:
> Small improvements to the readability and ease of use of the
> xdp1 sample.

Applied to bpf-next.

I think that sample code could be more useful if it's wrapped with a bash
script like the selftests/test_xdp* tests do.
At that point it can move to selftests to get 0-day bot coverage.
Would you be interested in doing that?



Re: [PATCH v2 bpf-next 0/4] bpf: Improve verifier test coverage on sparc64 et al.

2018-11-30 Thread Alexei Starovoitov
On Fri, Nov 30, 2018 at 09:07:54PM -0800, David Miller wrote:
> 
> On sparc64 a ton of test cases in test_verifier.c fail because
> the memory accesses in the test case are unaligned (or cannot
> be proven to be aligned by the verifier).
> 
> Perhaps we can eventually try to (carefully) modify each test case
> which has this problem to not use unaligned accesses but:
> 
> 1) That is delicate work.
> 
> 2) The changes might not fully respect the original
>intention of the testcase.
> 
> 3) In some cases, such a transformation might not even
>be feasible at all.
> 
> So add an "any alignment" flag to tell the verifier to forcefully
> disable its alignment checks completely.
> 
> test_verifier.c is then annotated to use this flag when necessary.
> 
> The presence of the flag in each test case is good documentation to
> anyone who wants to actually tackle the job of eliminating the
> unaligned memory accesses in the test cases.
> 
> I've also seen several weird things in test cases, like trying to
> access __skb->mark in a packet buffer.
> 
> This gets rid of 104 test_verifier.c failures on sparc64.
> 
> Changes since v1:
> 
> 1) Explain the new BPF_PROG_LOAD flag in easier to understand terms.
>Suggested by Alexei.
> 
> 2) Make bpf_verify_program() just take a __u32 prog_flags instead of
>just accumulating boolean arguments over and over.  Also suggested
>by Alexei.
> 
> Changes since RFC:
> 
> 1) Only the admin can allow the relaxation of alignment restrictions
>on inefficient unaligned access architectures.
> 
> 2) Use F_NEEDS_EFFICIENT_UNALIGNED_ACCESS instead of making a new
>flag.
> 
> 3) Annotate in the output, when we have a test case that the verifier
>accepted but we did not try to execute because we are on an
>inefficient unaligned access platform.  Maybe with some arch
>machinery we can avoid this in the future.
> 
> Signed-off-by: David S. Miller 

Patch 2 didn't apply as-is to bpf-next, since I applied your earlier fix
"bpf: Fix verifier log string check for bad alignment" to the bpf tree.
So I applied that fix to bpf-next as well and then pushed your series on top.
I think git should do the right thing when the bpf and bpf-next trees converge.
But... let me know if I should drop that fix from the bpf tree...
just to be on the safe side.



Re: [PATCH v2 bpf-next 0/4] bpf: Improve verifier test coverage on sparc64 et al.

2018-11-30 Thread David Miller
From: Alexei Starovoitov 
Date: Fri, 30 Nov 2018 21:55:02 -0800

> The patch 2 didn't apply as-is to bpf-next, since I applied your
> earlier fix "bpf: Fix verifier log string check for bad alignment"
> to bpf tree.  So I applied that fix to bpf-next as well and then
> pushed your series on top.

That seems best, thanks for resolving this.


Re: [PATCH bpf-next 1/4] bpf: Add BPF_F_ANY_ALIGNMENT.

2018-11-30 Thread Alexei Starovoitov
On Fri, Nov 30, 2018 at 08:44:20PM -0800, David Miller wrote:
> From: Alexei Starovoitov 
> Date: Fri, 30 Nov 2018 13:58:20 -0800
> 
> > On Thu, Nov 29, 2018 at 07:32:41PM -0800, David Miller wrote:
> >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> >> index 426b5c8..c9647ea 100644
> >> --- a/include/uapi/linux/bpf.h
> >> +++ b/include/uapi/linux/bpf.h
> >> @@ -232,6 +232,16 @@ enum bpf_attach_type {
> >>   */
> >>  #define BPF_F_STRICT_ALIGNMENT(1U << 0)
> >>  
> >> +/* If BPF_F_ANY_ALIGNMENT is used in BPF_PROG_LOAD command, the
> >> + * verifier will allow any alignment whatsoever.  This bypasses
> >> + * what CONFIG_EFFICIENT_UNALIGNED_ACCESS would cause it to do.
> > 
> > I think majority of user space folks who read uapi/bpf.h have no idea
> > what that kernel config does.
> > Could you reword the comment here to say that this flag is only
> effective on architectures like sparc and mips that don't
> have efficient unaligned access, and is ignored on x86/arm64?
> 
> I just want to point out in passing that your feedback applies also to
> the comment above BPF_F_STRICT_ALIGNMENT, which I used as a model for
> my comment.

Good point. Missed that earlier.
NET_IP_ALIGN is even more cryptic and it's not the same as
HAVE_EFFICIENT_UNALIGNED_ACCESS. Example: s390.
We need to reword it.



[PATCH v2 bpf-next 4/4] bpf: Apply F_NEEDS_EFFICIENT_UNALIGNED_ACCESS to more ACCEPT test cases.

2018-11-30 Thread David Miller


If a testcase has alignment problems but is expected to be ACCEPT,
verify it using F_NEEDS_EFFICIENT_UNALIGNED_ACCESS too.

Maybe in the future if we add some architecture specific code to elide
the unaligned memory access warnings during the test, we can execute
these as well.

Signed-off-by: David S. Miller 
---
 tools/testing/selftests/bpf/test_verifier.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index 5d97fb8..beec10b 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -3886,6 +3886,7 @@ static struct bpf_test tests[] = {
},
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
.result = ACCEPT,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"direct packet access: test21 (x += pkt_ptr, 2)",
@@ -3911,6 +3912,7 @@ static struct bpf_test tests[] = {
},
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
.result = ACCEPT,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"direct packet access: test22 (x += pkt_ptr, 3)",
@@ -3941,6 +3943,7 @@ static struct bpf_test tests[] = {
},
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
.result = ACCEPT,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"direct packet access: test23 (x += pkt_ptr, 4)",
@@ -3993,6 +3996,7 @@ static struct bpf_test tests[] = {
},
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
.result = ACCEPT,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"direct packet access: test25 (marking on <, good access)",
@@ -7675,6 +7679,7 @@ static struct bpf_test tests[] = {
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
.retval = 0 /* csum_diff of 64-byte packet */,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"helper access to variable memory: size = 0 not allowed on NULL (!ARG_PTR_TO_MEM_OR_NULL)",
@@ -9637,6 +9642,7 @@ static struct bpf_test tests[] = {
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"XDP pkt read, pkt_data' > pkt_end, bad access 1",
@@ -9808,6 +9814,7 @@ static struct bpf_test tests[] = {
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"XDP pkt read, pkt_end < pkt_data', bad access 1",
@@ -9920,6 +9927,7 @@ static struct bpf_test tests[] = {
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"XDP pkt read, pkt_end >= pkt_data', bad access 1",
@@ -9977,6 +9985,7 @@ static struct bpf_test tests[] = {
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"XDP pkt read, pkt_data' <= pkt_end, bad access 1",
@@ -10089,6 +10098,7 @@ static struct bpf_test tests[] = {
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"XDP pkt read, pkt_meta' > pkt_data, bad access 1",
@@ -10260,6 +10270,7 @@ static struct bpf_test tests[] = {
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"XDP pkt read, pkt_data < pkt_meta', bad access 1",
@@ -10372,6 +10383,7 @@ static struct bpf_test tests[] = {
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"XDP pkt read, pkt_data >= pkt_meta', bad access 1",
@@ -10429,6 +10441,7 @@ static struct bpf_test tests[] = {
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"XDP pkt read, pkt_meta' <= pkt_data, bad access 1",
@@ -12348,6 +12361,7 @@ static struct bpf_test tests[] = {
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
.result = ACCEPT,
.retval = 1,
+   .flags = 

[PATCH v2 bpf-next 3/4] bpf: Make more use of 'any' alignment in test_verifier.c

2018-11-30 Thread David Miller


Use F_NEEDS_EFFICIENT_UNALIGNED_ACCESS in more tests where the
expected result is REJECT.

Signed-off-by: David S. Miller 
---
 tools/testing/selftests/bpf/test_verifier.c | 44 +
 1 file changed, 44 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index 428a84d..5d97fb8 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -1823,6 +1823,7 @@ static struct bpf_test tests[] = {
.errstr = "invalid bpf_context access",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_SK_MSG,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"direct packet read for SK_MSG",
@@ -2187,6 +2188,8 @@ static struct bpf_test tests[] = {
},
.errstr = "invalid bpf_context access",
.result = REJECT,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"check cb access: half, wrong type",
@@ -3249,6 +3252,7 @@ static struct bpf_test tests[] = {
.result = REJECT,
.errstr = "R0 invalid mem access 'inv'",
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"raw_stack: skb_load_bytes, spilled regs corruption 2",
@@ -3279,6 +3283,7 @@ static struct bpf_test tests[] = {
.result = REJECT,
.errstr = "R3 invalid mem access 'inv'",
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"raw_stack: skb_load_bytes, spilled regs + data",
@@ -3778,6 +3783,7 @@ static struct bpf_test tests[] = {
.errstr = "R2 invalid mem access 'inv'",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"direct packet access: test16 (arith on data_end)",
@@ -3961,6 +3967,7 @@ static struct bpf_test tests[] = {
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
.result = REJECT,
.errstr = "invalid access to packet, off=0 size=8, R5(id=1,off=0,r=0)",
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"direct packet access: test24 (x += pkt_ptr, 5)",
@@ -5117,6 +5124,7 @@ static struct bpf_test tests[] = {
.result = REJECT,
.errstr = "invalid access to map value, value_size=64 off=-2 size=4",
.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"invalid cgroup storage access 5",
@@ -5233,6 +5241,7 @@ static struct bpf_test tests[] = {
.result = REJECT,
.errstr = "invalid access to map value, value_size=64 off=-2 size=4",
.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"invalid per-cpu cgroup storage access 5",
@@ -7149,6 +7158,7 @@ static struct bpf_test tests[] = {
.errstr = "invalid mem access 'inv'",
.result = REJECT,
.result_unpriv = REJECT,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"map element value illegal alu op, 5",
@@ -7171,6 +7181,7 @@ static struct bpf_test tests[] = {
.fixup_map_hash_48b = { 3 },
.errstr = "R0 invalid mem access 'inv'",
.result = REJECT,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"map element value is preserved across register spilling",
@@ -9663,6 +9674,7 @@ static struct bpf_test tests[] = {
.errstr = "R1 offset is outside of the packet",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"XDP pkt read, pkt_end > pkt_data', good access",
@@ -9701,6 +9713,7 @@ static struct bpf_test tests[] = {
.errstr = "R1 offset is outside of the packet",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"XDP pkt read, pkt_end > pkt_data', bad access 2",
@@ -9719,6 +9732,7 @@ static struct bpf_test tests[] = {
.errstr = "R1 offset is outside of the packet",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,

[PATCH v2 bpf-next 2/4] bpf: Adjust F_NEEDS_EFFICIENT_UNALIGNED_ACCESS handling in test_verifier.c

2018-11-30 Thread David Miller


Make it set the flag argument to bpf_verify_program() which will relax
the alignment restrictions.

Now all such test cases will go properly through the verifier even on
inefficient unaligned access architectures.

On inefficient unaligned access architectures do not try to run such
programs, instead mark the test case as passing but annotate the
result similarly to how it is done now in the presence of this flag.

So, we get complete full coverage for all REJECT test cases, and at
least verifier level coverage for ACCEPT test cases.

Signed-off-by: David S. Miller 
---
 tools/testing/selftests/bpf/test_verifier.c | 42 -
 1 file changed, 24 insertions(+), 18 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index 4224133..428a84d 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -14200,13 +14200,14 @@ static int set_admin(bool admin)
 static void do_test_single(struct bpf_test *test, bool unpriv,
   int *passes, int *errors)
 {
-   int fd_prog, expected_ret, reject_from_alignment;
+   int fd_prog, expected_ret, alignment_prevented_execution;
int prog_len, prog_type = test->prog_type;
struct bpf_insn *prog = test->insns;
int map_fds[MAX_NR_MAPS];
const char *expected_err;
uint32_t expected_val;
uint32_t retval;
+   __u32 pflags;
int i, err;
 
for (i = 0; i < MAX_NR_MAPS; i++)
@@ -14217,9 +14218,12 @@ static void do_test_single(struct bpf_test *test, bool unpriv,
do_test_fixup(test, prog_type, prog, map_fds);
prog_len = probe_filter_length(prog);
 
-   fd_prog = bpf_verify_program(prog_type, prog, prog_len,
-test->flags & F_LOAD_WITH_STRICT_ALIGNMENT ?
-BPF_F_STRICT_ALIGNMENT : 0,
+   pflags = 0;
+   if (test->flags & F_LOAD_WITH_STRICT_ALIGNMENT)
+   pflags |= BPF_F_STRICT_ALIGNMENT;
+   if (test->flags & F_NEEDS_EFFICIENT_UNALIGNED_ACCESS)
+   pflags |= BPF_F_ANY_ALIGNMENT;
+   fd_prog = bpf_verify_program(prog_type, prog, prog_len, pflags,
 "GPL", 0, bpf_vlog, sizeof(bpf_vlog), 1);
 
expected_ret = unpriv && test->result_unpriv != UNDEF ?
@@ -14229,28 +14233,27 @@ static void do_test_single(struct bpf_test *test, bool unpriv,
expected_val = unpriv && test->retval_unpriv ?
   test->retval_unpriv : test->retval;
 
-   reject_from_alignment = fd_prog < 0 &&
-   (test->flags & F_NEEDS_EFFICIENT_UNALIGNED_ACCESS) &&
-   strstr(bpf_vlog, "misaligned");
-#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
-   if (reject_from_alignment) {
-   printf("FAIL\nFailed due to alignment despite having efficient unaligned access: '%s'!\n",
-  strerror(errno));
-   goto fail_log;
-   }
-#endif
+   alignment_prevented_execution = 0;
+
if (expected_ret == ACCEPT) {
-   if (fd_prog < 0 && !reject_from_alignment) {
+   if (fd_prog < 0) {
printf("FAIL\nFailed to load prog '%s'!\n",
   strerror(errno));
goto fail_log;
}
+#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+   if (fd_prog >= 0 &&
+   (test->flags & F_NEEDS_EFFICIENT_UNALIGNED_ACCESS)) {
+   alignment_prevented_execution = 1;
+   goto test_ok;
+   }
+#endif
} else {
if (fd_prog >= 0) {
printf("FAIL\nUnexpected success to load!\n");
goto fail_log;
}
-   if (!strstr(bpf_vlog, expected_err) && !reject_from_alignment) {
+   if (!strstr(bpf_vlog, expected_err)) {
printf("FAIL\nUnexpected error message!\n\tEXP: %s\n\tRES: %s\n",
  expected_err, bpf_vlog);
goto fail_log;
@@ -14278,9 +14281,12 @@ static void do_test_single(struct bpf_test *test, bool unpriv,
goto fail_log;
}
}
+#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+test_ok:
+#endif
(*passes)++;
-   printf("OK%s\n", reject_from_alignment ?
-  " (NOTE: reject due to unknown alignment)" : "");
+   printf("OK%s\n", alignment_prevented_execution ?
+  " (NOTE: not executed due to unknown alignment)" : "");
 close_fds:
close(fd_prog);
for (i = 0; i < MAX_NR_MAPS; i++)
-- 
2.1.2.532.g19b5d50



[PATCH v2 bpf-next 0/4] bpf: Improve verifier test coverage on sparc64 et al.

2018-11-30 Thread David Miller


On sparc64 a ton of test cases in test_verifier.c fail because
the memory accesses in the test case are unaligned (or cannot
be proven to be aligned by the verifier).

Perhaps we can eventually try to (carefully) modify each test case
which has this problem to not use unaligned accesses but:

1) That is delicate work.

2) The changes might not fully respect the original
   intention of the testcase.

3) In some cases, such a transformation might not even
   be feasible at all.

So add an "any alignment" flag to tell the verifier to forcefully
disable its alignment checks completely.

test_verifier.c is then annotated to use this flag when necessary.

The presence of the flag in each test case is good documentation to
anyone who wants to actually tackle the job of eliminating the
unaligned memory accesses in the test cases.

I've also seen several weird things in test cases, like trying to
access __skb->mark in a packet buffer.

This gets rid of 104 test_verifier.c failures on sparc64.

Changes since v1:

1) Explain the new BPF_PROG_LOAD flag in easier to understand terms.
   Suggested by Alexei.

2) Make bpf_verify_program() just take a __u32 prog_flags instead of
   just accumulating boolean arguments over and over.  Also suggested
   by Alexei.

Changes since RFC:

1) Only the admin can allow the relaxation of alignment restrictions
   on inefficient unaligned access architectures.

2) Use F_NEEDS_EFFICIENT_UNALIGNED_ACCESS instead of making a new
   flag.

3) Annotate in the output, when we have a test case that the verifier
   accepted but we did not try to execute because we are on an
   inefficient unaligned access platform.  Maybe with some arch
   machinery we can avoid this in the future.

Signed-off-by: David S. Miller 


[PATCH v2 bpf-next 1/4] bpf: Add BPF_F_ANY_ALIGNMENT.

2018-11-30 Thread David Miller


Often we want to write test cases that check things like bad context
offset accesses.  And one way to do this is to use an odd offset on,
for example, a 32-bit load.

This unfortunately triggers the alignment checks first on platforms
that do not set CONFIG_EFFICIENT_UNALIGNED_ACCESS.  So the test
case sees the alignment failure rather than what it was testing for.

It is often not completely possible to respect the original intention
of the test, or even test the same exact thing, while solving the
alignment issue.

Another option could have been to check the alignment after the
context and other validations are performed by the verifier, but
that is a non-trivial change to the verifier.

Signed-off-by: David S. Miller 
---
 include/uapi/linux/bpf.h| 14 ++
 kernel/bpf/syscall.c|  7 ++-
 kernel/bpf/verifier.c   |  2 ++
 tools/include/uapi/linux/bpf.h  | 14 ++
 tools/lib/bpf/bpf.c |  8 
 tools/lib/bpf/bpf.h |  2 +-
 tools/testing/selftests/bpf/test_align.c|  4 ++--
 tools/testing/selftests/bpf/test_verifier.c |  3 ++-
 8 files changed, 45 insertions(+), 9 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 426b5c8..6620c37 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -232,6 +232,20 @@ enum bpf_attach_type {
  */
 #define BPF_F_STRICT_ALIGNMENT (1U << 0)
 
+/* If BPF_F_ANY_ALIGNMENT is used in BPF_PROG_LOAD command, the
+ * verifier will allow any alignment whatsoever.  On platforms
+ * with strict alignment requirements for loads and stores (such
+ * as sparc and mips) the verifier validates that all loads and
+ * stores provably follow this requirement.  This flag turns that
+ * checking and enforcement off.
+ *
+ * It is mostly used for testing when we want to validate the
+ * context and memory access aspects of the verifier, but because
+ * of an unaligned access the alignment check would trigger before
+ * the one we are interested in.
+ */
+#define BPF_F_ANY_ALIGNMENT    (1U << 1)
+
 /* when bpf_ldimm64->src_reg == BPF_PSEUDO_MAP_FD, bpf_ldimm64->imm == fd */
 #define BPF_PSEUDO_MAP_FD  1
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index cf5040f..cae65bb 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1450,9 +1450,14 @@ static int bpf_prog_load(union bpf_attr *attr)
if (CHECK_ATTR(BPF_PROG_LOAD))
return -EINVAL;
 
-   if (attr->prog_flags & ~BPF_F_STRICT_ALIGNMENT)
+   if (attr->prog_flags & ~(BPF_F_STRICT_ALIGNMENT | BPF_F_ANY_ALIGNMENT))
return -EINVAL;
 
+   if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) &&
+   (attr->prog_flags & BPF_F_ANY_ALIGNMENT) &&
+   !capable(CAP_SYS_ADMIN))
+   return -EPERM;
+
/* copy eBPF program license from user space */
if (strncpy_from_user(license, u64_to_user_ptr(attr->license),
  sizeof(license) - 1) < 0)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 6dd4195..2fefeae 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -6350,6 +6350,8 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr 
*attr)
env->strict_alignment = !!(attr->prog_flags & BPF_F_STRICT_ALIGNMENT);
if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS))
env->strict_alignment = true;
+   if (attr->prog_flags & BPF_F_ANY_ALIGNMENT)
+   env->strict_alignment = false;
 
ret = replace_map_fd_with_map_ptr(env);
if (ret < 0)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 426b5c8..6620c37 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -232,6 +232,20 @@ enum bpf_attach_type {
  */
 #define BPF_F_STRICT_ALIGNMENT (1U << 0)
 
+/* If BPF_F_ANY_ALIGNMENT is used in BPF_PROG_LOAD command, the
+ * verifier will allow any alignment whatsoever.  On platforms
+ * with strict alignment requirements for loads and stores (such
+ * as sparc and mips) the verifier validates that all loads and
+ * stores provably follow this requirement.  This flag turns that
+ * checking and enforcement off.
+ *
+ * It is mostly used for testing when we want to validate the
+ * context and memory access aspects of the verifier, but because
+ * of an unaligned access the alignment check would trigger before
+ * the one we are interested in.
+ */
+#define BPF_F_ANY_ALIGNMENT    (1U << 1)
+
 /* when bpf_ldimm64->src_reg == BPF_PSEUDO_MAP_FD, bpf_ldimm64->imm == fd */
 #define BPF_PSEUDO_MAP_FD  1
 
diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 03f9bcc..b9b8c5a 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -231,9 +231,9 @@ int bpf_load_program(enum bpf_prog_type type, const struct bpf_insn *insns,
 }
 
 int bpf_verify_program(enum bpf_prog_type 
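
To make the intended usage concrete, here is a hedged sketch of loading a test
program with the new flag via the selftests helper, assuming the post-series
bpf_verify_program() signature that takes a __u32 prog_flags (per the cover
letter); the insns array and BPF_EXIT_INSN() insn macro are illustrative, as
used in the selftests:

	struct bpf_insn insns[] = { BPF_EXIT_INSN() };	/* trivial program */
	char log[16384];
	int fd;

	/* BPF_F_ANY_ALIGNMENT switches the verifier's alignment
	 * enforcement off; on platforms without
	 * CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS this additionally
	 * requires CAP_SYS_ADMIN. */
	fd = bpf_verify_program(BPF_PROG_TYPE_SCHED_CLS, insns,
				sizeof(insns) / sizeof(insns[0]),
				BPF_F_ANY_ALIGNMENT, "GPL", 0,
				log, sizeof(log), 1);
	if (fd < 0)
		fprintf(stderr, "load failed: %s\n", log);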

Re: [PATCH bpf-next 1/4] bpf: Add BPF_F_ANY_ALIGNMENT.

2018-11-30 Thread David Miller
From: Alexei Starovoitov 
Date: Fri, 30 Nov 2018 13:58:20 -0800

> On Thu, Nov 29, 2018 at 07:32:41PM -0800, David Miller wrote:
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 426b5c8..c9647ea 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -232,6 +232,16 @@ enum bpf_attach_type {
>>   */
>>  #define BPF_F_STRICT_ALIGNMENT  (1U << 0)
>>  
> >> +/* If BPF_F_ANY_ALIGNMENT is used in BPF_PROG_LOAD command, the
>> + * verifier will allow any alignment whatsoever.  This bypasses
>> + * what CONFIG_EFFICIENT_UNALIGNED_ACCESS would cause it to do.
> 
> I think majority of user space folks who read uapi/bpf.h have no idea
> what that kernel config does.
> Could you reword the comment here to say that this flag is only
> effective on architectures like sparc and mips that don't
> have efficient unaligned access, and is ignored on x86/arm64?

I just want to point out in passing that your feedback applies also to
the comment above BPF_F_STRICT_ALIGNMENT, which I used as a model for
my comment.


Re: consistency for statistics with XDP mode

2018-11-30 Thread Jakub Kicinski
On Fri, 30 Nov 2018 13:35:53 -0700, David Ahern wrote:
> On 11/30/18 1:30 PM, Michael S. Tsirkin wrote:
> > > > I would like to see basic packets, bytes, and dropped counters
> > > > tracked for Rx and Tx via the standard netdev counters for all devices.
> >>
> >> The problem of reporting XDP_DROP in the netdev drop counter is that
> >> they don't fit this counter description: "no space in linux buffers"
> >> and it will be hard for the user to determine whether these drops are
> >> coming from XDP or because no buffer is available, which will make it
> >> impossible to estimate packet rate performance without looking at
> >> ethtool stats.
> >> And reporting XDP_DROP in the netdev rx packets counter is somehow
> >> misleading.. since those packets never made it out of this driver..
> > 
> > I think I agree. XDP needs minimal overhead - if user wants to do
> > counters then user can via maps. And in a sense XDP dropping packet
> > is much like e.g. TCP dropping packet - it is not counted
> > against the driver since it's not driver's fault.
> >   
> 
> XDP dropping a packet is completely different.
> 
> stats are important. packets disappearing with no counters -- standard
> counters visible by standard tools -- is a user nightmare. If the
> agreement is for XDP drops to be in driver level (e.g., xdp_drop) that
> is fine since it is still retrievable by ethtool -S (existing APIs and
> existing tools).

I don't think that's completely fair.  Disappearing packets are a
nightmare, but if the user installed a program which silently drops
packets without incrementing any counter it's their own fault.  If
cls_bpf returns STOLEN or TRAP, I don't think that's gonna get counted
anywhere.

I don't think DPDK drivers maintain "just in case" statistics, either..


Re: [PATCHv3 bpf 1/2] bpf: Support sk lookup in netns with id 0

2018-11-30 Thread Alexei Starovoitov
On Fri, Nov 30, 2018 at 03:32:20PM -0800, Joe Stringer wrote:
> David Ahern and Nicolas Dichtel report that the handling of the netns id
> 0 is incorrect for the BPF socket lookup helpers: rather than finding
> the netns with id 0, it is resolving to the current netns. This renders
> the netns_id 0 inaccessible.
> 
> To fix this, adjust the API for the netns to treat all negative s32
> values as a lookup in the current netns (including u64 values which when
> truncated to s32 become negative), while any values with a positive
> value in the signed 32-bit integer space would result in a lookup for a
> socket in the netns corresponding to that id. As before, if the netns
> with that ID does not exist, no socket will be found. Any netns outside
> of these ranges will fail to find a corresponding socket, as those
> values are reserved for future usage.
> 
> Signed-off-by: Joe Stringer 
> Acked-by: Nicolas Dichtel 

Applied both. Thanks everyone.

Joe, please provide a cover letter 0/N next time for the series
or if they're really separate patches submit them one by one.
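
To illustrate the adjusted semantics from the BPF program side, a sketch
(tuple contents elided as illustrative, error handling trimmed):

	struct bpf_sock_tuple tuple = {};	/* fill in addresses/ports */
	struct bpf_sock *sk;

	/* any negative netns value selects the netns of the ctx */
	sk = bpf_sk_lookup_tcp(skb, &tuple, sizeof(tuple.ipv4), (__u64)-1, 0);
	if (sk)
		bpf_sk_release(sk);

	/* netns id 0 is now a real lookup in the netns with that id */
	sk = bpf_sk_lookup_tcp(skb, &tuple, sizeof(tuple.ipv4), 0, 0);
	if (sk)
		bpf_sk_release(sk);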



Re: [PATCH net] tun: forbid iface creation with rtnl ops

2018-11-30 Thread David Miller
From: Nicolas Dichtel 
Date: Thu, 29 Nov 2018 14:45:39 +0100

> It's not supported right now (the goal of the initial patch was to support
> 'ip link del' only).
> 
> Before the patch:
> $ ip link add foo type tun
> [  239.632660] BUG: unable to handle kernel NULL pointer dereference at 
> 
> [snip]
> [  239.636410] RIP: 0010:register_netdevice+0x8e/0x3a0
> 
> This panic occurs because dev->netdev_ops is not set by tun_setup(). But to
> have something usable, it will require more than just setting
> netdev_ops.
> 
> Fixes: f019a7a594d9 ("tun: Implement ip link del tunXXX")
> CC: Eric W. Biederman 
> Signed-off-by: Nicolas Dichtel 

Super old bug, scary.

Applied, thanks.
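
The shape of such a guard, as a hedged sketch (following the usual
rtnl_link_ops .validate convention; not necessarily the applied patch):

static int tun_validate(struct nlattr *tb[], struct nlattr *data[],
			struct netlink_ext_ack *extack)
{
	/* Reject 'ip link add ... type tun' up front; only deletion of
	 * existing devices is wired up through rtnl ops. */
	NL_SET_ERR_MSG(extack, "tun/tap creation via rtnetlink is not supported.");
	return -EINVAL;
}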


Re: [PATCH net-next] net: qualcomm: rmnet: Remove set but not used variable 'cmd'

2018-11-30 Thread David Miller
From: YueHaibing 
Date: Thu, 29 Nov 2018 02:36:32 +

> Fixes gcc '-Wunused-but-set-variable' warning:
> 
> drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c: In function 
> 'rmnet_map_do_flow_control':
> drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c:23:36: warning:
>  variable 'cmd' set but not used [-Wunused-but-set-variable]
>   struct rmnet_map_control_command *cmd;
> 
> 'cmd' not used anymore now, should also be removed.
> 
> Signed-off-by: YueHaibing 

Applied.


Re: [PATCH net 0/3] fixes in timeout and retransmission accounting

2018-11-30 Thread David Miller
From: Yuchung Cheng 
Date: Wed, 28 Nov 2018 16:06:42 -0800

> This patch set has assorted fixes of minor accounting issues in
> timeout, window probe, and retransmission stats.

Looks good, series applied.


Re: [PATCHv3 bpf 1/2] bpf: Support sk lookup in netns with id 0

2018-11-30 Thread Joey Pabalinas
On Fri, Nov 30, 2018 at 02:36:57PM -1000, Joey Pabalinas wrote:
> On Fri, Nov 30, 2018 at 03:32:20PM -0800, Joe Stringer wrote:
> > + * the netns associated with the *ctx*. *netns* values beyond the
> > + * range of 32-bit integers are reserved for future use.
> 
> Would adding a word or two before "*netns*" here be helpful to placate
> the english pedants among us (such as myself)? Starting a sentence with
> a lowercase word, even if it's a variable name, just never really sits
> right with me.
> 
>   Any *netns* values ...
> 
> Doesn't that kind of flow a bit better anyway?

Hah,

> Anyway, now with that unimportant nitpick if off my chest, the rest
                                            ^
                                            |
I make an absolutely terrible pedant. ------

-- 
Cheers,
Joey Pabalinas




Re: [Patch net] mlx5: fix get_ip_proto()

2018-11-30 Thread David Miller
From: Cong Wang 
Date: Wed, 28 Nov 2018 15:04:05 -0800

> IP header is not necessarily located right after struct ethhdr,
> there could be multiple 802.1Q headers in between, this is why
> we call __vlan_get_protocol().
> 
> Fixes: fe1dc069990c ("net/mlx5e: don't set CHECKSUM_COMPLETE on SCTP packets")
> Cc: Alaa Hleihel 
> Cc: Or Gerlitz 
> Cc: Saeed Mahameed 
> Signed-off-by: Cong Wang 

Applied.
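
The underlying idiom, sketched (illustrative, not the applied patch): resolve
the real network protocol and header offset past any stacked VLAN tags before
dereferencing the IP header. Here eth stands for the frame's Ethernet header:

	__be16 proto;
	int network_depth = 0;
	u8 ipproto = 0;

	/* __vlan_get_protocol() walks past 802.1Q/802.1ad tags and
	 * reports how deep the network header actually starts */
	proto = __vlan_get_protocol(skb, eth->h_proto, &network_depth);
	if (proto == htons(ETH_P_IP))
		ipproto = ((struct iphdr *)(skb->data + network_depth))->protocol;
	else if (proto == htons(ETH_P_IPV6))
		ipproto = ((struct ipv6hdr *)(skb->data + network_depth))->nexthdr;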


Re: [PATCH net-next] tun: implement carrier change

2018-11-30 Thread David Miller
From: Nicolas Dichtel 
Date: Wed, 28 Nov 2018 19:12:56 +0100

> The userspace may need to control the carrier state.
> 
> Signed-off-by: Nicolas Dichtel 
> Signed-off-by: Didier Pallard 

Applied.


Re: [PATCH bpf] bpf: fix pointer offsets in context for 32 bit

2018-11-30 Thread Alexei Starovoitov
On Sat, Dec 01, 2018 at 01:18:53AM +0100, Daniel Borkmann wrote:
> Currently, pointer offsets in three BPF context structures are
> broken in two scenarios: i) 32 bit compiled applications running
> on 64 bit kernels, and ii) LLVM compiled BPF programs running
> on 32 bit kernels. The latter is due to BPF target machine being
> strictly 64 bit. So in each of the cases the offsets will mismatch
> in verifier when checking / rewriting context access. Fix this by
> providing a helper macro __bpf_md_ptr() that will enforce padding
> up to 64 bit and proper alignment, and for context access a macro
> bpf_ctx_range_ptr() which will cover full 64 bit member range on
> 32 bit archs. For flow_keys, we additionally need to force the
> size check to sizeof(__u64) as with other pointer types.
> 
> Fixes: d58e468b1112 ("flow_dissector: implements flow dissector BPF hook")
> Fixes: 4f738adba30a ("bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data")
> Fixes: 2dbb9b9e6df6 ("bpf: Introduce BPF_PROG_TYPE_SK_REUSEPORT")
> Reported-by: David S. Miller 
> Signed-off-by: Daniel Borkmann 
> Acked-by: David S. Miller 

Applied. Thanks everyone



Re: [PATCH net] net/sched: act_police: fix memory leak in case of invalid control action

2018-11-30 Thread David Miller
From: Davide Caratti 
Date: Wed, 28 Nov 2018 18:43:42 +0100

> when users set an invalid control action, kmemleak complains as follows:
> 
>  # echo clear >/sys/kernel/debug/kmemleak
>  # ./tdc.py -e b48b
>  Test b48b: Add police action with exceed goto chain control action
>  All test results:
> 
>  1..1
>  ok 1 - b48b # Add police action with exceed goto chain control action
>  about to flush the tap output if tests need to be skipped
>  done flushing skipped test tap output
>  # echo scan >/sys/kernel/debug/kmemleak
>  # cat /sys/kernel/debug/kmemleak
>  unreferenced object 0xa0fafbc3dde0 (size 96):
>   comm "tc", pid 2358, jiffies 4294922738 (age 17.022s)
>   hex dump (first 32 bytes):
> 2a 00 00 20 00 00 00 00 00 00 7d 00 00 00 00 00  *.. ..}.
> f8 07 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
>   backtrace:
> [<648803d2>] tcf_action_init_1+0x384/0x4c0
> [] tcf_action_init+0x12b/0x1a0
> [<847ef0d4>] tcf_action_add+0x73/0x170
> [<93656e14>] tc_ctl_action+0x122/0x160
> [<23c98e32>] rtnetlink_rcv_msg+0x263/0x2d0
> [<3493ae9c>] netlink_rcv_skb+0x4d/0x130
> [] netlink_unicast+0x209/0x2d0
> [] netlink_sendmsg+0x2c1/0x3c0
> [<7a9e0753>] sock_sendmsg+0x33/0x40
> [<457c6d2e>] ___sys_sendmsg+0x2a0/0x2f0
> [] __sys_sendmsg+0x5e/0xa0
> [<446eafce>] do_syscall_64+0x5b/0x180
> [<4aa871f2>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [<450c38ef>] 0x
> 
> change tcf_police_init() to avoid leaking 'new' in case TCA_POLICE_RESULT
> contains TC_ACT_GOTO_CHAIN extended action.
> 
> Fixes: c08f5ed5d625 ("net/sched: act_police: disallow 'goto chain' on fallback control action")
> Reported-by: Dan Carpenter 
> Signed-off-by: Davide Caratti 

Applied.
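
The shape of the fix, as a rough sketch (variable and label names are
illustrative, not the applied patch): on the rejected 'goto chain' path,
branch to the cleanup that frees the freshly allocated parameters instead of
returning directly:

	if (TC_ACT_EXT_CMP(p->tcfp_result, TC_ACT_GOTO_CHAIN)) {
		NL_SET_ERR_MSG(extack, "goto chain not allowed on fallback");
		err = -EINVAL;
		goto failure;	/* frees 'new' rather than leaking it */
	}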


Re: [PATCH net-next] net: reorder flowi_common fields to avoid holes

2018-11-30 Thread David Miller
From: Paolo Abeni 
Date: Wed, 28 Nov 2018 17:37:53 +0100

> the flowi* structures are used and memset by several functions
> in the critical path. Currently flowi_common has a couple of holes that
> we can eliminate by reordering the struct fields. As a side effect,
> both flowi4 and flowi6 shrink by 8 bytes.
 ...
> Signed-off-by: Paolo Abeni 

Applied, thanks.
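
For background, a toy example of the kind of padding hole such a reordering
removes (illustrative structs, not the actual flowi_common layout; sizes
assume a typical LP64 ABI):

struct before {
	__u32 a;	/* 4 bytes, then a 4-byte hole ... */
	__u64 b;	/* ... because this needs 8-byte alignment */
	__u32 c;	/* 4 bytes plus 4 bytes of tail padding */
};			/* sizeof == 24 */

struct after {
	__u64 b;	/* widest member first */
	__u32 a;	/* the two 4-byte members now pack together */
	__u32 c;
};			/* sizeof == 16 */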


Re: BPF uapi structures and 32-bit

2018-11-30 Thread David Miller
From: Alexei Starovoitov 
Date: Fri, 30 Nov 2018 15:33:08 -0800

> On Wed, Nov 28, 2018 at 11:02:00AM -0800, David Miller wrote:
>> From: Daniel Borkmann 
>> Date: Wed, 28 Nov 2018 11:34:55 +0100
>> 
>> > Yeah fully agree. Thinking diff below should address it, do you
>> > have a chance to give this a spin for sparc / 32 bit to check if
>> > test_verifier still explodes?
>> 
>> Great, let me play with this.
>> 
>> I did something simpler yesterday, just changing the data pointers to
>> "u64" and that made at least one test pass that didn't before :-)
>> 
>> I'll get back to you with results.
> 
> Did you have a chance to test it ?
> 
> I'd like to add a tested-by before I apply Daniel's patch
> which looks good to me. btw.

Tested-by: David S. Miller 

Yes, it does the trick.


Re: [PATCH bpf-next] bpf: allow BPF read access to qdisc pkt_len

2018-11-30 Thread Daniel Borkmann
On 12/01/2018 12:42 AM, Willem de Bruijn wrote:
> On Fri, Nov 30, 2018 at 5:48 PM Song Liu  wrote:
>>
>> On Fri, Nov 30, 2018 at 12:09 PM Willem de Bruijn
>>  wrote:
>>>
>>> From: Petar Penkov 
>>>
>>> The pkt_len field in qdisc_skb_cb stores the skb length as it will
>>> appear on the wire after segmentation. For byte accounting, this value
>>> is more accurate than skb->len. It is computed on entry to the TC
>>> layer, so only valid there.
>>>
>>> Allow read access to this field from BPF tc classifier and action
>>> programs. The implementation is analogous to tc_classid, aside from
>>> restricting to read access.
>>>
>>> Signed-off-by: Petar Penkov 
>>> Signed-off-by: Vlad Dumitrescu 
>>> Signed-off-by: Willem de Bruijn 
>>> ---
>>>  include/uapi/linux/bpf.h|  1 +
>>>  net/core/filter.c   | 16 +++
>>>  tools/include/uapi/linux/bpf.h  |  1 +
>>>  tools/testing/selftests/bpf/test_verifier.c | 32 +
>>>  4 files changed, 50 insertions(+)
>>
>> Please split this into 3 patches:
>> 1 for include/uapi/linux/bpf.h and filter.c
>> 1 for tools/include/uapi/linux/bpf.h
>> 1 for tools/testing/selftests/bpf/test_verifier.c
>>
>> Other than this
>> Acked-by: Song Liu 
> 
> Thanks for the fast review.
> 
> I'm happy to resend in three parts, of course, but am curious: what is
> the rationale for splitting this up?
> 
> This patch follows the process for commit  f11216b24219 ("bpf: add
> skb->tstamp r/w access from tc clsact and cg skb progs"), which went
> in as a single patch just last week.

Yeah, I think it's fine as is, one small thing I'm wondering though is
given that we now would have both 'skb->len' and 'skb->pkt_len', would
it be more intuitive for a BPF program developer to distinguish the two
by having the latter named e.g. 'skb->wire_len' so it's slightly more
obvious that it includes the header size post-segmentation? If not,
probably some comment in the uapi header, similar to the one in
qdisc_pkt_len_init(), might be helpful in any case.

Thanks,
Daniel
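
A minimal sketch of reading the field from a tc classifier, using the name as
proposed in this patch (bpf_helpers.h is the usual selftests helper header;
whether the field stays pkt_len or becomes wire_len is exactly the naming
question above):

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include "bpf_helpers.h"

SEC("classifier")
int measure(struct __sk_buff *skb)
{
	/* pkt_len is the post-segmentation on-wire length, computed on
	 * entry to the TC layer; for GSO skbs it can exceed skb->len */
	if (skb->pkt_len > skb->len)
		bpf_printk("gso skb: %u wire bytes\n", skb->pkt_len);
	return TC_ACT_OK;
}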


Re: [PATCHv3 bpf 1/2] bpf: Support sk lookup in netns with id 0

2018-11-30 Thread Joey Pabalinas
On Fri, Nov 30, 2018 at 03:32:20PM -0800, Joe Stringer wrote:
> + *   the netns associated with the *ctx*. *netns* values beyond the
> + *   range of 32-bit integers are reserved for future use.

Would adding a word or two before "*netns*" here be helpful to placate
the english pedants among us (such as myself)? Starting a sentence with
a lowercase word, even if it's a variable name, just never really sits
right with me.

Any *netns* values ...

Doesn't that kind of flow a bit better anyway?

Anyway, now with that unimportant nitpick if off my chest, the rest
looks fine to me. Ack with or without any changes to that comment.

Acked-by: Joey Pabalinas 

On Fri, Nov 30, 2018 at 03:32:20PM -0800, Joe Stringer wrote:
> David Ahern and Nicolas Dichtel report that the handling of the netns id
> 0 is incorrect for the BPF socket lookup helpers: rather than finding
> the netns with id 0, it is resolving to the current netns. This renders
> the netns_id 0 inaccessible.
> 
> To fix this, adjust the API for the netns to treat all negative s32
> values as a lookup in the current netns (including u64 values which when
> truncated to s32 become negative), while any values with a positive
> value in the signed 32-bit integer space would result in a lookup for a
> socket in the netns corresponding to that id. As before, if the netns
> with that ID does not exist, no socket will be found. Any netns outside
> of these ranges will fail to find a corresponding socket, as those
> values are reserved for future usage.
> 
> Signed-off-by: Joe Stringer 
> Acked-by: Nicolas Dichtel 
> ---
>  include/uapi/linux/bpf.h  | 35 ++---
>  net/core/filter.c | 11 +++---
>  tools/include/uapi/linux/bpf.h| 39 ---
>  tools/testing/selftests/bpf/bpf_helpers.h |  4 +-
>  .../selftests/bpf/test_sk_lookup_kern.c   | 18 -
>  5 files changed, 63 insertions(+), 44 deletions(-)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 852dc17ab47a..ad68b472dad2 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -2170,7 +2170,7 @@ union bpf_attr {
>   *   Return
>   *   0 on success, or a negative error in case of failure.
>   *
> - * struct bpf_sock *bpf_sk_lookup_tcp(void *ctx, struct bpf_sock_tuple *tuple, u32 tuple_size, u32 netns, u64 flags)
> + * struct bpf_sock *bpf_sk_lookup_tcp(void *ctx, struct bpf_sock_tuple *tuple, u32 tuple_size, u64 netns, u64 flags)
>   *   Description
>   *   Look for TCP socket matching *tuple*, optionally in a child
>   *   network namespace *netns*. The return value must be checked,
> @@ -2187,12 +2187,14 @@ union bpf_attr {
>   *   **sizeof**\ (*tuple*\ **->ipv6**)
>   *   Look for an IPv6 socket.
>   *
> - *   If the *netns* is zero, then the socket lookup table in the
> - *   netns associated with the *ctx* will be used. For the TC hooks,
> - *   this in the netns of the device in the skb. For socket hooks,
> - *   this in the netns of the socket. If *netns* is non-zero, then
> - *   it specifies the ID of the netns relative to the netns
> - *   associated with the *ctx*.
> + *   If the *netns* is a negative signed 32-bit integer, then the
> + *   socket lookup table in the netns associated with the *ctx*
> + *   will be used. For the TC hooks, this is the netns of the device
> + *   in the skb. For socket hooks, this is the netns of the socket.
> + *   If *netns* is any other signed 32-bit value greater than or
> + *   equal to zero then it specifies the ID of the netns relative to
> + *   the netns associated with the *ctx*. *netns* values beyond the
> + *   range of 32-bit integers are reserved for future use.
>   *
>   *   All values for *flags* are reserved for future usage, and must
>   *   be left at zero.
> @@ -2202,7 +2204,7 @@ union bpf_attr {
>   *   Return
>   *   Pointer to *struct bpf_sock*, or NULL in case of failure.
>   *
> - * struct bpf_sock *bpf_sk_lookup_udp(void *ctx, struct bpf_sock_tuple *tuple, u32 tuple_size, u32 netns, u64 flags)
> + * struct bpf_sock *bpf_sk_lookup_udp(void *ctx, struct bpf_sock_tuple *tuple, u32 tuple_size, u64 netns, u64 flags)
>   *   Description
>   *   Look for UDP socket matching *tuple*, optionally in a child
>   *   network namespace *netns*. The return value must be checked,
> @@ -2219,12 +2221,14 @@ union bpf_attr {
>   *   **sizeof**\ (*tuple*\ **->ipv6**)
>   *   Look for an IPv6 socket.
>   *
> - *   If the *netns* is zero, then the socket lookup table in the
> - *   netns associated with the *ctx* will be used. For the TC hooks,
> - *   this in the netns of the device in the skb. For socket hooks,
> - *   this 

Re: [PATCH net-next v2 2/3] vxlan: extack support for some changelink cases

2018-11-30 Thread kbuild test robot
Hi Roopa,

I love your patch! Yet something to improve:

[auto build test ERROR on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Roopa-Prabhu/vxlan-a-few-minor-cleanups/20181130-184658
config: i386-allmodconfig (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All errors (new ones prefixed by >>):

   drivers/net/vxlan.c: In function 'vxlan_nl2conf':
>> drivers/net/vxlan.c:3527:3: error: expected '}' before 'else'
  else {
  ^~~~

vim +3527 drivers/net/vxlan.c

  3376  
  3377  static int vxlan_nl2conf(struct nlattr *tb[], struct nlattr *data[],
  3378                           struct net_device *dev, struct vxlan_config *conf,
  3379                           bool changelink, struct netlink_ext_ack *extack)
  3380  {
  3381          struct vxlan_dev *vxlan = netdev_priv(dev);
  3382  
  3383          memset(conf, 0, sizeof(*conf));
  3384  
  3385          /* if changelink operation, start with old existing cfg */
  3386          if (changelink)
  3387                  memcpy(conf, &vxlan->cfg, sizeof(*conf));
  3388  
  3389          if (data[IFLA_VXLAN_ID]) {
  3390                  __be32 vni = cpu_to_be32(nla_get_u32(data[IFLA_VXLAN_ID]));
  3391  
  3392                  if (changelink && (vni != conf->vni)) {
  3393                          NL_SET_ERR_MSG_ATTR(extack, tb[IFLA_VXLAN_ID],
  3394                                              "Cannot change vni");
  3395                          return -EOPNOTSUPP;
  3396                  }
  3397                  conf->vni = cpu_to_be32(nla_get_u32(data[IFLA_VXLAN_ID]));
  3398          }
  3399  
  3400          if (data[IFLA_VXLAN_GROUP]) {
  3401                  if (changelink && (conf->remote_ip.sa.sa_family != AF_INET)) {
  3402                          NL_SET_ERR_MSG_ATTR(extack, tb[IFLA_VXLAN_GROUP],
  3403                                              "New group addr family does not match old group");
  3404                          return -EOPNOTSUPP;
  3405                  }
  3406                  conf->remote_ip.sin.sin_addr.s_addr = nla_get_in_addr(data[IFLA_VXLAN_GROUP]);
  3407                  conf->remote_ip.sa.sa_family = AF_INET;
  3408          } else if (data[IFLA_VXLAN_GROUP6]) {
  3409                  if (!IS_ENABLED(CONFIG_IPV6)) {
  3410                          NL_SET_ERR_MSG_ATTR(extack, tb[IFLA_VXLAN_GROUP6],
  3411                                              "IPv6 support not enabled in the kernel");
  3412                          return -EPFNOSUPPORT;
  3413                  }
  3414  
  3415                  if (changelink && (conf->remote_ip.sa.sa_family != AF_INET6)) {
  3416                          NL_SET_ERR_MSG_ATTR(extack, tb[IFLA_VXLAN_GROUP6],
  3417                                              "New group addr family does not match old group");
  3418                          return -EOPNOTSUPP;
  3419                  }
  3420  
  3421                  conf->remote_ip.sin6.sin6_addr = nla_get_in6_addr(data[IFLA_VXLAN_GROUP6]);
  3422                  conf->remote_ip.sa.sa_family = AF_INET6;
  3423          }
  3424  
  3425          if (data[IFLA_VXLAN_LOCAL]) {
  3426                  if (changelink && (conf->saddr.sa.sa_family != AF_INET)) {
  3427                          NL_SET_ERR_MSG_ATTR(extack, tb[IFLA_VXLAN_LOCAL],
  3428                                              "New local addr family does not match old");
  3429                          return -EOPNOTSUPP;
  3430                  }
  3431  
  3432                  conf->saddr.sin.sin_addr.s_addr = nla_get_in_addr(data[IFLA_VXLAN_LOCAL]);
  3433                  conf->saddr.sa.sa_family = AF_INET;
  3434          } else if (data[IFLA_VXLAN_LOCAL6]) {
  3435                  if (!IS_ENABLED(CONFIG_IPV6)) {
  3436                          NL_SET_ERR_MSG_ATTR(extack, tb[IFLA_VXLAN_LOCAL6],
  3437                                              "IPv6 support not enabled in the kernel");
  3438                          return -EPFNOSUPPORT;
  3439                  }
  3440  
  3441                  if (changelink && (conf->saddr.sa.sa_family != AF_INET6)) {
  3442                          NL_SET_ERR_MSG_ATTR(extack, tb[IFLA_VXLAN_LOCAL6],
  3443                                              "New local6 addr family does not match old");
  3444                          return -EOPNOTSUPP;
  3445                  }
  3446  
  3447                  /* TODO: respect scope id */
  3448                  conf->saddr.sin6.sin6_addr = nla_get_in6_addr(data[IFLA_VXLAN_LOCAL6]);
  3449                  conf->saddr.sa.sa_family = AF_INET6;
  3450          }
  3451  
  3452          if (data[IFLA_VXLAN_LINK])
  3453                  conf->remote_ifind

[PATCH bpf-next v2 0/2] sample: xdp1 improvements

2018-11-30 Thread Matteo Croce
Small improvements to the readability and ease of use of the
xdp1 sample.

Matteo Croce (2):
  samples: bpf: improve xdp1 example
  samples: bpf: get ifindex from ifname

 samples/bpf/xdp1_user.c | 27 +++
 1 file changed, 15 insertions(+), 12 deletions(-)

-- 
2.19.1



[PATCH bpf-next v2 1/2] samples: bpf: improve xdp1 example

2018-11-30 Thread Matteo Croce
Store only the total packet count for every protocol, instead of the
whole per-cpu array.
Use bpf_map_get_next_key() to iterate the map, instead of looking up
all the protocols.

Signed-off-by: Matteo Croce 
---
 samples/bpf/xdp1_user.c | 18 --
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/samples/bpf/xdp1_user.c b/samples/bpf/xdp1_user.c
index b02c531510ed..4f3d824fc044 100644
--- a/samples/bpf/xdp1_user.c
+++ b/samples/bpf/xdp1_user.c
@@ -34,26 +34,24 @@ static void int_exit(int sig)
 static void poll_stats(int map_fd, int interval)
 {
unsigned int nr_cpus = bpf_num_possible_cpus();
-   const unsigned int nr_keys = 256;
-   __u64 values[nr_cpus], prev[nr_keys][nr_cpus];
-   __u32 key;
+   __u64 values[nr_cpus], prev[UINT8_MAX] = { 0 };
int i;
 
-   memset(prev, 0, sizeof(prev));
-
while (1) {
+   __u32 key = UINT32_MAX;
+
sleep(interval);
 
-   for (key = 0; key < nr_keys; key++) {
+   while (bpf_map_get_next_key(map_fd, &key, &key) != -1) {
__u64 sum = 0;
 
assert(bpf_map_lookup_elem(map_fd, &key, values) == 0);
for (i = 0; i < nr_cpus; i++)
-   sum += (values[i] - prev[key][i]);
-   if (sum)
+   sum += values[i];
+   if (sum > prev[key])
printf("proto %u: %10llu pkt/s\n",
-  key, sum / interval);
-   memcpy(prev[key], values, sizeof(values));
+  key, (sum - prev[key]) / interval);
+   prev[key] = sum;
}
}
 }
-- 
2.19.1
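
For reference, the general iteration idiom the patch relies on (a sketch;
map_fd is any open BPF map fd): bpf_map_get_next_key() fills the next key and
returns 0 on success, and fails once the last key has been passed, so starting
from a key that is absent from the map visits every entry:

	__u32 key = UINT32_MAX;	/* assumed absent from the map */
	__u32 next_key;

	while (bpf_map_get_next_key(map_fd, &key, &next_key) == 0) {
		/* look up / handle next_key here */
		key = next_key;
	}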



[PATCH bpf-next v2 2/2] samples: bpf: get ifindex from ifname

2018-11-30 Thread Matteo Croce
Find the ifindex with if_nametoindex() instead of requiring the
numeric ifindex.

Signed-off-by: Matteo Croce 
---
v2: use if_nametoindex() instead of ioctl()

 samples/bpf/xdp1_user.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/samples/bpf/xdp1_user.c b/samples/bpf/xdp1_user.c
index 4f3d824fc044..0a197f86ac43 100644
--- a/samples/bpf/xdp1_user.c
+++ b/samples/bpf/xdp1_user.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include <net/if.h>
 
 #include "bpf_util.h"
 #include "bpf/bpf.h"
@@ -59,7 +60,7 @@ static void poll_stats(int map_fd, int interval)
 static void usage(const char *prog)
 {
fprintf(stderr,
-   "usage: %s [OPTS] IFINDEX\n\n"
+   "usage: %s [OPTS] IFACE\n\n"
"OPTS:\n"
"-S    use skb-mode\n"
"-N    enforce native mode\n",
@@ -102,7 +103,11 @@ int main(int argc, char **argv)
return 1;
}
 
-   ifindex = strtoul(argv[optind], NULL, 0);
+   ifindex = if_nametoindex(argv[1]);
+   if (!ifindex) {
+   perror("if_nametoindex");
+   return 1;
+   }
 
snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
prog_load_attr.file = filename;
-- 
2.19.1



[PATCH bpf] bpf: fix pointer offsets in context for 32 bit

2018-11-30 Thread Daniel Borkmann
Currently, pointer offsets in three BPF context structures are
broken in two scenarios: i) 32 bit compiled applications running
on 64 bit kernels, and ii) LLVM compiled BPF programs running
on 32 bit kernels. The latter is due to BPF target machine being
strictly 64 bit. So in each of the cases the offsets will mismatch
in verifier when checking / rewriting context access. Fix this by
providing a helper macro __bpf_md_ptr() that will enforce padding
up to 64 bit and proper alignment, and for context access a macro
bpf_ctx_range_ptr() which will cover full 64 bit member range on
32 bit archs. For flow_keys, we additionally need to force the
size check to sizeof(__u64) as with other pointer types.

Fixes: d58e468b1112 ("flow_dissector: implements flow dissector BPF hook")
Fixes: 4f738adba30a ("bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data")
Fixes: 2dbb9b9e6df6 ("bpf: Introduce BPF_PROG_TYPE_SK_REUSEPORT")
Reported-by: David S. Miller 
Signed-off-by: Daniel Borkmann 
Acked-by: David S. Miller 
---
 include/linux/filter.h |  7 +++
 include/uapi/linux/bpf.h   | 17 -
 net/core/filter.c  | 16 
 tools/include/uapi/linux/bpf.h | 17 -
 4 files changed, 39 insertions(+), 18 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 448dcc4..795ff0b 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -449,6 +449,13 @@ struct sock_reuseport;
offsetof(TYPE, MEMBER) ... offsetofend(TYPE, MEMBER) - 1
 #define bpf_ctx_range_till(TYPE, MEMBER1, MEMBER2)			\
	offsetof(TYPE, MEMBER1) ... offsetofend(TYPE, MEMBER2) - 1
+#if BITS_PER_LONG == 64
+# define bpf_ctx_range_ptr(TYPE, MEMBER)				\
+	offsetof(TYPE, MEMBER) ... offsetofend(TYPE, MEMBER) - 1
+#else
+# define bpf_ctx_range_ptr(TYPE, MEMBER)				\
+	offsetof(TYPE, MEMBER) ... offsetof(TYPE, MEMBER) + 8 - 1
+#endif /* BITS_PER_LONG == 64 */
 
 #define bpf_target_off(TYPE, MEMBER, SIZE, PTR_SIZE)			\
	({								\
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 852dc17..426b5c8 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2422,6 +2422,12 @@ enum bpf_lwt_encap_mode {
BPF_LWT_ENCAP_SEG6_INLINE
 };
 
+#define __bpf_md_ptr(type, name)   \
+union {\
+   type name;  \
+   __u64 :64;  \
+} __attribute__((aligned(8)))
+
 /* user accessible mirror of in-kernel sk_buff.
  * new fields can only be added to the end of this structure
  */
@@ -2456,7 +2462,7 @@ struct __sk_buff {
/* ... here. */
 
__u32 data_meta;
-   struct bpf_flow_keys *flow_keys;
+   __bpf_md_ptr(struct bpf_flow_keys *, flow_keys);
 };
 
 struct bpf_tunnel_key {
@@ -2572,8 +2578,8 @@ enum sk_action {
  * be added to the end of this structure
  */
 struct sk_msg_md {
-   void *data;
-   void *data_end;
+   __bpf_md_ptr(void *, data);
+   __bpf_md_ptr(void *, data_end);
 
__u32 family;
__u32 remote_ip4;   /* Stored in network byte order */
@@ -2589,8 +2595,9 @@ struct sk_reuseport_md {
 * Start of directly accessible data. It begins from
 * the tcp/udp header.
 */
-   void *data;
-   void *data_end; /* End of directly accessible data */
+   __bpf_md_ptr(void *, data);
+   /* End of directly accessible data */
+   __bpf_md_ptr(void *, data_end);
/*
 * Total length of packet (starting from the tcp/udp header).
 * Note that the directly accessible bytes (data_end - data)
diff --git a/net/core/filter.c b/net/core/filter.c
index 9a1327e..6ee605d 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5435,8 +5435,8 @@ static bool bpf_skb_is_valid_access(int off, int size, enum bpf_access_type type
if (size != size_default)
return false;
break;
-   case bpf_ctx_range(struct __sk_buff, flow_keys):
-   if (size != sizeof(struct bpf_flow_keys *))
+   case bpf_ctx_range_ptr(struct __sk_buff, flow_keys):
+   if (size != sizeof(__u64))
return false;
break;
default:
@@ -5464,7 +5464,7 @@ static bool sk_filter_is_valid_access(int off, int size,
case bpf_ctx_range(struct __sk_buff, data):
case bpf_ctx_range(struct __sk_buff, data_meta):
case bpf_ctx_range(struct __sk_buff, data_end):
-   case bpf_ctx_range(struct __sk_buff, flow_keys):
+   case bpf_ctx_range_ptr(struct __sk_buff, flow_keys):
case bpf_ctx_range_till(struct __sk_buff, family, local_port):
return false;
}
@@ -5489,7 +5489,7 @@ static bool 

Re: [PATCH bpf-next 1/2] bpf: add BPF_LWT_ENCAP_IP option to bpf_lwt_push_encap

2018-11-30 Thread Peter Oskolkov
On Fri, Nov 30, 2018 at 3:52 PM David Ahern  wrote:
>
> On 11/30/18 4:35 PM, Peter Oskolkov wrote:
> > Thanks, David! This is for egress only, so I'll add an appropriate
> > check. I'll also address your other comments/concerns in a v2 version
> > of this patchset.
>
> Why are you limiting this to egress only?

I'm focusing on egress because this is a use case that we have and
understand well, and would like to have a solution for sooner rather
than later.

Without understanding why anybody would want to do lwt-bpf encap on
ingress, I don't know, for example, what a good test would look like;
but I'd be happy to look into a specific ingress use case if you have
one.


Re: consistency for statistics with XDP mode

2018-11-30 Thread Saeed Mahameed
On Fri, 2018-11-30 at 15:30 -0500, Michael S. Tsirkin wrote:
> On Fri, Nov 30, 2018 at 08:10:58PM +, Saeed Mahameed wrote:
> > On Thu, 2018-11-22 at 18:00 +0100, Toke Høiland-Jørgensen wrote:
> > > David Ahern  writes:
> > > 
> > > > On 11/22/18 1:26 AM, Toke Høiland-Jørgensen wrote:
> > > > > Saeed Mahameed  writes:
> > > > > 
> > > > > > > > I'd say it sounds reasonable to include XDP in the
> > > > > > > > normal
> > > > > > > > traffic
> > > > > > > > counters, but having the detailed XDP-specific counters
> > > > > > > > is
> > > > > > > > quite
> > > > > > > > useful
> > > > > > > > as well... So can't we do both (for all drivers)?
> > > > > > > > 
> > > > > > 
> > > > > > What are you thinking ? 
> > > > > > reporting XDP_DROP in interface dropped counter ?
> > > > > > and XDP_TX/REDIRECT in the TX counter ?
> > > > > > XDP_ABORTED in the  err/drop counter ?
> > > > > > 
> > > > > > how about having a special XDP command in the .ndo_bpf that
> > > > > > would query
> > > > > > the standardized XDP stats ?
> > > > > the XDP-specific stats are useful to have separately as well
> > > > > :)
> > > > > 
> > > > 
> > > > I would like to see basic packets, bytes, and dropped counters
> > > > tracked
> > > > for Rx and Tx via the standard netdev counters for all
> > > > devices. 
> > 
> > The problem of reporting XDP_DROP in the netdev drop counter is
> > that they don't fit this counter description: "no space in linux
> > buffers", and it will be hard for the user to determine whether
> > these drops are coming from XDP or because no buffer is available,
> > which will make it impossible to estimate packet rate performance
> > without looking at ethtool stats.
> > And reporting XDP_DROP in the netdev rx packets counter is somehow
> > misleading.. since those packets never made it out of this
> > driver..
> > 
> 
> I think I agree. XDP needs minimal overhead - if user wants to do
> counters then user can via maps. And in a sense XDP dropping packet
> is much like e.g. TCP dropping packet - it is not counted
> against the driver since it's not driver's fault.

So we should count all XDP RX packets as successful rx packets, i.e.
netdev->stats.rx_packets++, regardless of the XDP program decision?

This implies that XDP_TX packets will be counted twice: once in
netdev->stats.rx_packets and once in netdev->stats.tx_packets.

I think this is the only valid option if we are going to use standard
netdev stats for XDP use cases.
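
Concretely, the accounting would look roughly like this in a driver RX
loop (a sketch only; the stats struct and the drv_xdp_tx() helper are
made up):

	u32 act = bpf_prog_run_xdp(xdp_prog, &xdp);

	stats->rx_packets++;			/* every frame counts as RX */
	switch (act) {
	case XDP_TX:
		drv_xdp_tx(ring, &xdp);		/* made-up helper */
		stats->tx_packets++;		/* same frame counted again as TX */
		break;
	case XDP_DROP:
	default:
		break;				/* program decision, not rx_dropped */
	}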





Re: BPF uapi structures and 32-bit

2018-11-30 Thread Daniel Borkmann
On 12/01/2018 12:33 AM, Alexei Starovoitov wrote:
> On Wed, Nov 28, 2018 at 11:02:00AM -0800, David Miller wrote:
>> From: Daniel Borkmann 
>> Date: Wed, 28 Nov 2018 11:34:55 +0100
>>
>>> Yeah fully agree. Thinking diff below should address it, do you
>>> have a chance to give this a spin for sparc / 32 bit to check if
>>> test_verifier still explodes?
>>
>> Great, let me play with this.
>>
>> I did something simpler yesterday, just changing the data pointers to
>> "u64" and that made at least one test pass that didn't before :-)
>>
>> I'll get back to you with results.
> 
> Did you have a chance to test it ?

David got back to me and mentioned it worked fine on sparc.

> I'd like to add a tested-by before I apply Daniel's patch
> which looks good to me. btw.

Yeah, wanted to cook an official patch today, let me do that now.


Re: [PATCH bpf-next 1/2] bpf: add BPF_LWT_ENCAP_IP option to bpf_lwt_push_encap

2018-11-30 Thread David Ahern
On 11/30/18 4:35 PM, Peter Oskolkov wrote:
> Thanks, David! This is for egress only, so I'll add an appropriate
> check. I'll also address your other comments/concerns in a v2 version
> of this patchset.

Why are you limiting this to egress only?


Re: [PATCH bpf-next] bpf: allow BPF read access to qdisc pkt_len

2018-11-30 Thread Willem de Bruijn
On Fri, Nov 30, 2018 at 5:48 PM Song Liu  wrote:
>
> On Fri, Nov 30, 2018 at 12:09 PM Willem de Bruijn
>  wrote:
> >
> > From: Petar Penkov 
> >
> > The pkt_len field in qdisc_skb_cb stores the skb length as it will
> > appear on the wire after segmentation. For byte accounting, this value
> > is more accurate than skb->len. It is computed on entry to the TC
> > layer, so only valid there.
> >
> > Allow read access to this field from BPF tc classifier and action
> > programs. The implementation is analogous to tc_classid, aside from
> > restricting to read access.
> >
> > Signed-off-by: Petar Penkov 
> > Signed-off-by: Vlad Dumitrescu 
> > Signed-off-by: Willem de Bruijn 
> > ---
> >  include/uapi/linux/bpf.h|  1 +
> >  net/core/filter.c   | 16 +++
> >  tools/include/uapi/linux/bpf.h  |  1 +
> >  tools/testing/selftests/bpf/test_verifier.c | 32 +
> >  4 files changed, 50 insertions(+)
>
> Please split this into 3 patches:
> 1 for include/uapi/linux/bpf.h and filter.c
> 1 for tools/include/uapi/linux/bpf.h
> 1 for tools/testing/selftests/bpf/test_verifier.c
>
> Other than this
> Acked-by: Song Liu 

Thanks for the fast review.

I'm happy to resend in three parts, of course, but am curious: what is
the rationale for splitting this up?

This patch follows the process for commit  f11216b24219 ("bpf: add
skb->tstamp r/w access from tc clsact and cg skb progs"), which went
in as a single patch just last week.


Re: [PATCH bpf-next 1/2] bpf: add BPF_LWT_ENCAP_IP option to bpf_lwt_push_encap

2018-11-30 Thread Peter Oskolkov
Thanks, David! This is for egress only, so I'll add an appropriate
check. I'll also address your other comments/concerns in a v2 version
of this patchset.
On Fri, Nov 30, 2018 at 12:08 PM David Ahern  wrote:
>
> On 11/28/18 6:34 PM, Peter Oskolkov wrote:
> > On Wed, Nov 28, 2018 at 4:47 PM David Ahern  wrote:
> >>
> >> On 11/28/18 5:22 PM, Peter Oskolkov wrote:
> >>> diff --git a/net/core/filter.c b/net/core/filter.c
> >>> index bd0df75dc7b6..17f3c37218e5 100644
> >>> --- a/net/core/filter.c
> >>> +++ b/net/core/filter.c
> >>> @@ -4793,6 +4793,60 @@ static int bpf_push_seg6_encap(struct sk_buff *skb, u32 type, void *hdr, u32 len
> >>>  }
> >>>  #endif /* CONFIG_IPV6_SEG6_BPF */
> >>>
> >>> +static int bpf_push_ip_encap(struct sk_buff *skb, void *hdr, u32 len)
> >>> +{
> >>> + struct dst_entry *dst;
> >>> + struct rtable *rt;
> >>> + struct iphdr *iph;
> >>> + struct net *net;
> >>> + int err;
> >>> +
> >>> + if (skb->protocol != htons(ETH_P_IP))
> >>> + return -EINVAL;  /* ETH_P_IPV6 not yet supported */
> >>> +
> >>> + iph = (struct iphdr *)hdr;
> >>> +
> >>> + if (unlikely(len < sizeof(struct iphdr) || len > LWTUNNEL_MAX_ENCAP_HSIZE))
> >>> + return -EINVAL;
> >>> + if (unlikely(iph->version != 4 || iph->ihl * 4 > len))
> >>> + return -EINVAL;
> >>> +
> >>> + if (skb->sk)
> >>> + net = sock_net(skb->sk);
> >>> + else {
> >>> + net = dev_net(skb_dst(skb)->dev);
> >>> + }
> >>> + rt = ip_route_output(net, iph->daddr, 0, 0, 0);
> >>
> >> That is a very limited use case. e.g., oif = 0 means you are not
> >> considering any kind of policy routing (e.g., VRF).
> >
> > Hi David! Could you be a bit more specific re: what you would like to
> > see here? Thanks!
> >
>
> Is the encap happening on ingress or egress? Seems like the current code
> does not assume either direction for lwt (BPF_PROG_TYPE_LWT_IN vs
> BPF_PROG_TYPE_LWT_OUT), yet your change does - output only. Basically,
> you should be filling in a flow struct and doing a proper lookup.
>
> When the potentially custom encap header is pushed on, seems to me skb
> marks should still be honored for the route lookup. If not, you should
> handle that in the API.
>
> From there skb->dev at a minimum should be used as either iif (ingress)
> or oif (egress).
>
> The iph is already set so you have quick access to the tos.
>
> Also, this should implement IPv6 as well before going in.
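
Concretely, the fuller lookup being asked for might look like this for
the egress case (a sketch, reusing the variables from the quoted hunk):

	struct flowi4 fl4 = {
		.flowi4_oif	= skb_dst(skb)->dev->ifindex,
		.flowi4_mark	= skb->mark,	/* honor skb marks */
		.flowi4_tos	= RT_TOS(iph->tos),
		.daddr		= iph->daddr,
	};

	rt = ip_route_output_key(net, &fl4);
	if (IS_ERR(rt))
		return PTR_ERR(rt);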


Re: No address after suspend/resume with 4.18

2018-11-30 Thread Jeff Kirsher
On Fri, 2018-11-30 at 15:26 -0800, Stephen Hemminger wrote:
> On my box with Debian testing, I see a new problem with
> suspend/resume of wired network device.
> Using stock Debian kernel 4.18.0-2-amd64
> 
> After suspend/resume cycle, IP address is lost.
> 
> Device Info:
> $ /sbin/ethtool -i enp12s0
> driver: igb
> version: 5.4.0-k
> firmware-version:  0. 6-1
> expansion-rom-version: 
> bus-info: 0000:0c:00.0
> supports-statistics: yes
> supports-test: yes
> supports-eeprom-access: yes
> supports-register-dump: yes
> supports-priv-flags: yes
> 
> 
> $ lspci -v -s 0000:0c:00.0
> 0c:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network
> Connection (rev 03)
>   Subsystem: Gigabyte Technology Co., Ltd I211 Gigabit Network
> Connection
>   Flags: bus master, fast devsel, latency 0, IRQ 17, NUMA node 0
>   Memory at dfb00000 (32-bit, non-prefetchable) [size=128K]
>   I/O ports at c000 [size=32]
>   Memory at dfb20000 (32-bit, non-prefetchable) [size=16K]
>   Capabilities: <access denied>
>   Kernel driver in use: igb
>   Kernel modules: igb
> 
> State before suspend:
> $ ip addr show dev enp12s0
> 4: enp12s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state
> UP group default qlen 1000
> link/ether 1c:1b:0d:0a:4b:0e brd ff:ff:ff:ff:ff:ff
> inet 192.168.1.18/24 brd 192.168.1.255 scope global enp12s0
>valid_lft forever preferred_lft forever
> 
> State after resume:
> 
> $ ip addr show dev enp12s0
> 4: enp12s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state
> UP group default qlen 1000
> link/ether 1c:1b:0d:0a:4b:0e brd ff:ff:ff:ff:ff:ff
> 
> Doing ifdown/ifup which restarts the DHCP client does restore the
> address.
> 
> Not sure if this is a kernel issue with carrier handling, Intel
> driver issue, or DHCP client issue.

Thanks Stephen, I will have someone look into it.


signature.asc
Description: This is a digitally signed message part


Re: BPF uapi structures and 32-bit

2018-11-30 Thread Alexei Starovoitov
On Wed, Nov 28, 2018 at 11:02:00AM -0800, David Miller wrote:
> From: Daniel Borkmann 
> Date: Wed, 28 Nov 2018 11:34:55 +0100
> 
> > Yeah fully agree. Thinking diff below should address it, do you
> > have a chance to give this a spin for sparc / 32 bit to check if
> > test_verifier still explodes?
> 
> Great, let me play with this.
> 
> I did something simpler yesterday, just changing the data pointers to
> "u64" and that made at least one test pass that didn't before :-)
> 
> I'll get back to you with results.

Did you have a chance to test it ?

I'd like to add a tested-by before I apply Daniel's patch
which looks good to me. btw.



[PATCHv3 bpf 1/2] bpf: Support sk lookup in netns with id 0

2018-11-30 Thread Joe Stringer
David Ahern and Nicolas Dichtel report that the handling of the netns id
0 is incorrect for the BPF socket lookup helpers: rather than finding
the netns with id 0, it is resolving to the current netns. This renders
the netns_id 0 inaccessible.

To fix this, adjust the API to treat all negative netns values as a
lookup in the current netns (including u64 values which become negative
when truncated to s32), while any non-negative value in the signed
32-bit integer space results in a lookup for a socket in the netns
corresponding to that id. As before, if the netns with that ID does not
exist, no socket will be found. Any netns value outside of these ranges
will fail to find a corresponding socket, as those values are reserved
for future usage.
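
The resulting range handling can be sketched as follows (illustration
only; lookup_in() is a placeholder for the actual socket lookup):

static struct sock *sk_lookup_sketch(struct net *caller_net, u64 netns_id)
{
	if ((s32)netns_id < 0)			/* negative after truncation */
		return lookup_in(caller_net);	/* current netns */
	if (netns_id <= S32_MAX)		/* valid relative netns id */
		return lookup_in(get_net_ns_by_id(caller_net, netns_id));
	return NULL;				/* reserved range: no socket */
}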

Signed-off-by: Joe Stringer 
Acked-by: Nicolas Dichtel 
---
 include/uapi/linux/bpf.h  | 35 ++---
 net/core/filter.c | 11 +++---
 tools/include/uapi/linux/bpf.h| 39 ---
 tools/testing/selftests/bpf/bpf_helpers.h |  4 +-
 .../selftests/bpf/test_sk_lookup_kern.c   | 18 -
 5 files changed, 63 insertions(+), 44 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 852dc17ab47a..ad68b472dad2 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2170,7 +2170,7 @@ union bpf_attr {
  * Return
  * 0 on success, or a negative error in case of failure.
  *
- * struct bpf_sock *bpf_sk_lookup_tcp(void *ctx, struct bpf_sock_tuple *tuple, u32 tuple_size, u32 netns, u64 flags)
+ * struct bpf_sock *bpf_sk_lookup_tcp(void *ctx, struct bpf_sock_tuple *tuple, u32 tuple_size, u64 netns, u64 flags)
  * Description
  * Look for TCP socket matching *tuple*, optionally in a child
  * network namespace *netns*. The return value must be checked,
@@ -2187,12 +2187,14 @@ union bpf_attr {
  * **sizeof**\ (*tuple*\ **->ipv6**)
  * Look for an IPv6 socket.
  *
- * If the *netns* is zero, then the socket lookup table in the
- * netns associated with the *ctx* will be used. For the TC hooks,
- * this in the netns of the device in the skb. For socket hooks,
- * this in the netns of the socket. If *netns* is non-zero, then
- * it specifies the ID of the netns relative to the netns
- * associated with the *ctx*.
+ * If the *netns* is a negative signed 32-bit integer, then the
+ * socket lookup table in the netns associated with the *ctx*
+ * will be used. For the TC hooks, this is the netns of the device
+ * in the skb. For socket hooks, this is the netns of the socket.
+ * If *netns* is any other signed 32-bit value greater than or
+ * equal to zero then it specifies the ID of the netns relative to
+ * the netns associated with the *ctx*. *netns* values beyond the
+ * range of 32-bit integers are reserved for future use.
  *
  * All values for *flags* are reserved for future usage, and must
  * be left at zero.
@@ -2202,7 +2204,7 @@ union bpf_attr {
  * Return
  * Pointer to *struct bpf_sock*, or NULL in case of failure.
  *
- * struct bpf_sock *bpf_sk_lookup_udp(void *ctx, struct bpf_sock_tuple *tuple, u32 tuple_size, u32 netns, u64 flags)
+ * struct bpf_sock *bpf_sk_lookup_udp(void *ctx, struct bpf_sock_tuple *tuple, u32 tuple_size, u64 netns, u64 flags)
  * Description
  * Look for UDP socket matching *tuple*, optionally in a child
  * network namespace *netns*. The return value must be checked,
@@ -2219,12 +2221,14 @@ union bpf_attr {
  * **sizeof**\ (*tuple*\ **->ipv6**)
  * Look for an IPv6 socket.
  *
- * If the *netns* is zero, then the socket lookup table in the
- * netns associated with the *ctx* will be used. For the TC hooks,
- * this in the netns of the device in the skb. For socket hooks,
- * this in the netns of the socket. If *netns* is non-zero, then
- * it specifies the ID of the netns relative to the netns
- * associated with the *ctx*.
+ * If the *netns* is a negative signed 32-bit integer, then the
+ * socket lookup table in the netns associated with the *ctx*
+ * will be used. For the TC hooks, this is the netns of the device
+ * in the skb. For socket hooks, this is the netns of the socket.
+ * If *netns* is any other signed 32-bit value greater than or
+ * equal to zero then it specifies the ID of the netns relative to
+ * the netns associated with the *ctx*. *netns* values beyond the
+ * range of 32-bit integers are reserved for future use.
  *
  * All values for *flags* are 

[PATCHv3 bpf 2/2] bpf: Improve socket lookup reuseport documentation

2018-11-30 Thread Joe Stringer
Improve the wording around socket lookup for reuseport sockets, and
ensure that both bpf.h headers are in sync.

Signed-off-by: Joe Stringer 
---
 include/uapi/linux/bpf.h   | 4 
 tools/include/uapi/linux/bpf.h | 8 
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index ad68b472dad2..47f620b5cc5c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2203,6 +2203,8 @@ union bpf_attr {
  * **CONFIG_NET** configuration option.
  * Return
  * Pointer to *struct bpf_sock*, or NULL in case of failure.
+ * For sockets with reuseport option, the *struct bpf_sock*
+ * result is from reuse->socks[] using the hash of the tuple.
  *
 * struct bpf_sock *bpf_sk_lookup_udp(void *ctx, struct bpf_sock_tuple *tuple, u32 tuple_size, u64 netns, u64 flags)
  * Description
@@ -2237,6 +2239,8 @@ union bpf_attr {
  * **CONFIG_NET** configuration option.
  * Return
  * Pointer to *struct bpf_sock*, or NULL in case of failure.
+ * For sockets with reuseport option, the *struct bpf_sock*
+ * result is from reuse->socks[] using the hash of the tuple.
  *
  * int bpf_sk_release(struct bpf_sock *sk)
  * Description
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index de2072ef475b..47f620b5cc5c 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2203,8 +2203,8 @@ union bpf_attr {
  * **CONFIG_NET** configuration option.
  * Return
  * Pointer to *struct bpf_sock*, or NULL in case of failure.
- * For sockets with reuseport option, *struct bpf_sock*
- * return is from reuse->socks[] using hash of the packet.
+ * For sockets with reuseport option, the *struct bpf_sock*
+ * result is from reuse->socks[] using the hash of the tuple.
  *
 * struct bpf_sock *bpf_sk_lookup_udp(void *ctx, struct bpf_sock_tuple *tuple, u32 tuple_size, u64 netns, u64 flags)
  * Description
@@ -2239,8 +2239,8 @@ union bpf_attr {
  * **CONFIG_NET** configuration option.
  * Return
  * Pointer to *struct bpf_sock*, or NULL in case of failure.
- * For sockets with reuseport option, *struct bpf_sock*
- * return is from reuse->socks[] using hash of the packet.
+ * For sockets with reuseport option, the *struct bpf_sock*
+ * result is from reuse->socks[] using the hash of the tuple.
  *
  * int bpf_sk_release(struct bpf_sock *sk)
  * Description
-- 
2.19.1



Re: [PATCHv2 bpf 1/2] bpf: Support sk lookup in netns with id 0

2018-11-30 Thread Joe Stringer
On Fri, 30 Nov 2018 at 15:27, Alexei Starovoitov
 wrote:
>
> On Fri, Nov 30, 2018 at 03:18:25PM -0800, Joe Stringer wrote:
> > On Fri, 30 Nov 2018 at 14:42, Alexei Starovoitov
> >  wrote:
> > >
> > > On Thu, Nov 29, 2018 at 04:29:33PM -0800, Joe Stringer wrote:
> > > > David Ahern and Nicolas Dichtel report that the handling of the netns id
> > > > 0 is incorrect for the BPF socket lookup helpers: rather than finding
> > > > the netns with id 0, it is resolving to the current netns. This renders
> > > > the netns_id 0 inaccessible.
> > > >
> > > > To fix this, adjust the API for the netns to treat all negative s32
> > > > values as a lookup in the current netns, while any values with a
> > > > positive value in the signed 32-bit integer space would result in a
> > > > lookup for a socket in the netns corresponding to that id. As before, if
> > > > the netns with that ID does not exist, no socket will be found.
> > > > Furthermore, if any bits are set in the upper 32-bits, then no socket
> > > > will be found.
> > > >
> > > > Signed-off-by: Joe Stringer 
> > > ..
> > > > +/* Current network namespace */
> > > > +#define BPF_CURRENT_NETNS	(-1L)
> > >
> > > I was about to apply it, but then noticed that the name doesn't match
> > > the rest of the names.
> > > Could you rename it to BPF_F_CURRENT_NETNS ?
> >
> > I skipped the F_ part since it's not really a flag, it's a value. I
> > can put it back though.
>
> BPF_F_ prefix has smaller chance of conflicts.
> I wish we did that sooner.
> In retrospect BPF_ANY, BPF_EXIST were poorly picked names.

OK, I'll send out a v3 shortly.


Re: [PATCHv2 bpf 1/2] bpf: Support sk lookup in netns with id 0

2018-11-30 Thread Alexei Starovoitov
On Fri, Nov 30, 2018 at 03:18:25PM -0800, Joe Stringer wrote:
> On Fri, 30 Nov 2018 at 14:42, Alexei Starovoitov
>  wrote:
> >
> > On Thu, Nov 29, 2018 at 04:29:33PM -0800, Joe Stringer wrote:
> > > David Ahern and Nicolas Dichtel report that the handling of the netns id
> > > 0 is incorrect for the BPF socket lookup helpers: rather than finding
> > > the netns with id 0, it is resolving to the current netns. This renders
> > > the netns_id 0 inaccessible.
> > >
> > > To fix this, adjust the API for the netns to treat all negative s32
> > > values as a lookup in the current netns, while any values with a
> > > positive value in the signed 32-bit integer space would result in a
> > > lookup for a socket in the netns corresponding to that id. As before, if
> > > the netns with that ID does not exist, no socket will be found.
> > > Furthermore, if any bits are set in the upper 32-bits, then no socket
> > > will be found.
> > >
> > > Signed-off-by: Joe Stringer 
> > ..
> > > +/* Current network namespace */
> > > +#define BPF_CURRENT_NETNS	(-1L)
> >
> > I was about to apply it, but then noticed that the name doesn't match
> > the rest of the names.
> > Could you rename it to BPF_F_CURRENT_NETNS ?
> 
> I skipped the F_ part since it's not really a flag, it's a value. I
> can put it back though.

BPF_F_ prefix has smaller chance of conflicts.
I wish we did that sooner.
In retrospect BPF_ANY, BPF_EXIST were poorly picked names.



[PATCH net] macvlan: return correct error value

2018-11-30 Thread Matteo Croce
A MAC address must be unique among all the macvlan devices with the same
lower device. The only exception is the passthru [sic] mode,
which shares the lower device address.

When duplicate addresses are detected, EBUSY is returned when bringing
the interface up:

# ip link add macvlan0 link eth0 type macvlan
# read addr 
---
 drivers/net/macvlan.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index fc8d5f1ee1ad..0da3d36b283b 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -608,7 +608,7 @@ static int macvlan_open(struct net_device *dev)
goto hash_add;
}
 
-   err = -EBUSY;
+   err = -EADDRINUSE;
if (macvlan_addr_busy(vlan->port, dev->dev_addr))
goto out;
 
@@ -706,7 +706,7 @@ static int macvlan_sync_address(struct net_device *dev, 
unsigned char *addr)
} else {
/* Rehash and update the device filters */
if (macvlan_addr_busy(vlan->port, addr))
-   return -EBUSY;
+   return -EADDRINUSE;
 
if (!macvlan_passthru(port)) {
err = dev_uc_add(lowerdev, addr);
@@ -747,6 +747,9 @@ static int macvlan_set_mac_address(struct net_device *dev, 
void *p)
return dev_set_mac_address(vlan->lowerdev, addr);
}
 
+   if (macvlan_addr_busy(vlan->port, addr->sa_data))
+   return -EADDRINUSE;
+
return macvlan_sync_address(dev, addr->sa_data);
 }
 
-- 
2.19.1



No address after suspend/resume with 4.18

2018-11-30 Thread Stephen Hemminger
On my box with Debian testing, I see a new problem with suspend/resume of wired 
network device.
Using stock Debian kernel 4.18.0-2-amd64

After suspend/resume cycle, IP address is lost.

Device Info:
$ /sbin/ethtool -i enp12s0
driver: igb
version: 5.4.0-k
firmware-version:  0. 6-1
expansion-rom-version: 
bus-info: 0000:0c:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes


$ lspci -v -s 0000:0c:00.0
0c:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
	Subsystem: Gigabyte Technology Co., Ltd I211 Gigabit Network Connection
	Flags: bus master, fast devsel, latency 0, IRQ 17, NUMA node 0
	Memory at dfb00000 (32-bit, non-prefetchable) [size=128K]
	I/O ports at c000 [size=32]
	Memory at dfb20000 (32-bit, non-prefetchable) [size=16K]
	Capabilities: <access denied>
Kernel driver in use: igb
Kernel modules: igb

State before suspend:
$ ip addr show dev enp12s0
4: enp12s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group
default qlen 1000
link/ether 1c:1b:0d:0a:4b:0e brd ff:ff:ff:ff:ff:ff
inet 192.168.1.18/24 brd 192.168.1.255 scope global enp12s0
   valid_lft forever preferred_lft forever

State after resume:

$ ip addr show dev enp12s0
4: enp12s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group
default qlen 1000
link/ether 1c:1b:0d:0a:4b:0e brd ff:ff:ff:ff:ff:ff

Doing ifdown/ifup which restarts the DHCP client does restore the address.

Not sure if this is a kernel issue with carrier handling, Intel driver issue, 
or DHCP client issue.


Re: [PATCHv2 bpf 1/2] bpf: Support sk lookup in netns with id 0

2018-11-30 Thread Joe Stringer
On Fri, 30 Nov 2018 at 14:42, Alexei Starovoitov
 wrote:
>
> On Thu, Nov 29, 2018 at 04:29:33PM -0800, Joe Stringer wrote:
> > David Ahern and Nicolas Dichtel report that the handling of the netns id
> > 0 is incorrect for the BPF socket lookup helpers: rather than finding
> > the netns with id 0, it is resolving to the current netns. This renders
> > the netns_id 0 inaccessible.
> >
> > To fix this, adjust the API for the netns to treat all negative s32
> > values as a lookup in the current netns, while any values with a
> > positive value in the signed 32-bit integer space would result in a
> > lookup for a socket in the netns corresponding to that id. As before, if
> > the netns with that ID does not exist, no socket will be found.
> > Furthermore, if any bits are set in the upper 32-bits, then no socket
> > will be found.
> >
> > Signed-off-by: Joe Stringer 
> ..
> > +/* Current network namespace */
> > +#define BPF_CURRENT_NETNS	(-1L)
>
> I was about to apply it, but then noticed that the name doesn't match
> the rest of the names.
> Could you rename it to BPF_F_CURRENT_NETNS ?

I skipped the F_ part since it's not really a flag, it's a value. I
can put it back though.

> Also reword the commit log so it's less misleading.

Can do.

Cheers,
Joe


Re: [PATCH net] bpf: uninitialized variables in test code

2018-11-30 Thread Alexei Starovoitov
On Thu, Nov 29, 2018 at 01:27:03PM +0300, Dan Carpenter wrote:
> Smatch complains that if bpf_test_run() fails with -ENOMEM at the
> beginning then the "duration" is uninitialized.  We then copy the
> uninitialized variables to the user inside the bpf_test_finish()
> function.  The functions require CAP_SYS_ADMIN so it's not really an
> information leak.
> 
> Fixes: 1cf1cae963c2 ("bpf: introduce BPF_PROG_TEST_RUN command")
> Signed-off-by: Dan Carpenter 

That is incorrect fixes tag.
It should be pointing to commit f42ee093be29 ("bpf/test_run: support cgroup local storage")

bpf_test_run() can only return the value that bpf program returned.
It cannot return -ENOMEM.
That code needs to be refactored.
I think the proper way for bpf_test_run() would be to return 0 or -ENOMEM
and store bpf's retval into extra pointer.
Proper checks need to be added in the callers (bpf_prog_test_run_skb, etc).

Dan, can you do such refactoring or you want to punt back to Roman ?
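
That is, roughly (a sketch of the suggested shape, not the final code):

	/* separate the run status from the program's return value */
	static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat,
				u32 *retval, u32 *duration);

	/* callers would then do: */
	ret = bpf_test_run(prog, skb, repeat, &retval, &duration);
	if (ret < 0)
		goto out;	/* -ENOMEM etc., nothing copied to user */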



Re: [PATCH net] bpf: uninitialized variables in test code

2018-11-30 Thread Song Liu
On Thu, Nov 29, 2018 at 2:28 AM Dan Carpenter  wrote:
>
> Smatch complains that if bpf_test_run() fails with -ENOMEM at the
> beginning then the "duration" is uninitialized.  We then copy the
> uninitialized variables to the user inside the bpf_test_finish()
> function.  The functions require CAP_SYS_ADMIN so it's not really an
> information leak.
>
> Fixes: 1cf1cae963c2 ("bpf: introduce BPF_PROG_TEST_RUN command")
> Signed-off-by: Dan Carpenter 
Acked-by: Song Liu 

> ---
>  net/bpf/test_run.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
> index c89c22c49015..49304192a031 100644
> --- a/net/bpf/test_run.c
> +++ b/net/bpf/test_run.c
> @@ -114,7 +114,7 @@ int bpf_prog_test_run_skb(struct bpf_prog *prog, const union bpf_attr *kattr,
> bool is_l2 = false, is_direct_pkt_access = false;
> u32 size = kattr->test.data_size_in;
> u32 repeat = kattr->test.repeat;
> -   u32 retval, duration;
> +   u32 retval, duration = 0;
> int hh_len = ETH_HLEN;
> struct sk_buff *skb;
> struct sock *sk;
> @@ -196,7 +196,7 @@ int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr,
> u32 repeat = kattr->test.repeat;
> struct netdev_rx_queue *rxqueue;
> struct xdp_buff xdp = {};
> -   u32 retval, duration;
> +   u32 retval, duration = 0;
> void *data;
> int ret;
>
> --
> 2.11.0
>


Re: [PATCH bpf-next] bpf: allow BPF read access to qdisc pkt_len

2018-11-30 Thread Song Liu
On Fri, Nov 30, 2018 at 12:09 PM Willem de Bruijn
 wrote:
>
> From: Petar Penkov 
>
> The pkt_len field in qdisc_skb_cb stores the skb length as it will
> appear on the wire after segmentation. For byte accounting, this value
> is more accurate than skb->len. It is computed on entry to the TC
> layer, so only valid there.
>
> Allow read access to this field from BPF tc classifier and action
> programs. The implementation is analogous to tc_classid, aside from
> restricting to read access.
>
> Signed-off-by: Petar Penkov 
> Signed-off-by: Vlad Dumitrescu 
> Signed-off-by: Willem de Bruijn 
> ---
>  include/uapi/linux/bpf.h|  1 +
>  net/core/filter.c   | 16 +++
>  tools/include/uapi/linux/bpf.h  |  1 +
>  tools/testing/selftests/bpf/test_verifier.c | 32 +
>  4 files changed, 50 insertions(+)

Please split this into 3 patches:
1 for include/uapi/linux/bpf.h and filter.c
1 for tools/include/uapi/linux/bpf.h
1 for tools/testing/selftests/bpf/test_verifier.c

Other than this
Acked-by: Song Liu 


>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 597afdbc1ab9..ce028482d423 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -2483,6 +2483,7 @@ struct __sk_buff {
> __u32 data_meta;
> struct bpf_flow_keys *flow_keys;
> __u64 tstamp;
> +   __u32 pkt_len;
>  };
>
>  struct bpf_tunnel_key {
> diff --git a/net/core/filter.c b/net/core/filter.c
> index bd0df75dc7b6..af071ef22c0a 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -5773,6 +5773,7 @@ static bool sk_filter_is_valid_access(int off, int size,
> case bpf_ctx_range(struct __sk_buff, flow_keys):
> case bpf_ctx_range_till(struct __sk_buff, family, local_port):
> case bpf_ctx_range(struct __sk_buff, tstamp):
> +   case bpf_ctx_range(struct __sk_buff, pkt_len):
> return false;
> }
>
> @@ -5797,6 +5798,7 @@ static bool cg_skb_is_valid_access(int off, int size,
> case bpf_ctx_range(struct __sk_buff, tc_classid):
> case bpf_ctx_range(struct __sk_buff, data_meta):
> case bpf_ctx_range(struct __sk_buff, flow_keys):
> +   case bpf_ctx_range(struct __sk_buff, pkt_len):
> return false;
> case bpf_ctx_range(struct __sk_buff, data):
> case bpf_ctx_range(struct __sk_buff, data_end):
> @@ -5843,6 +5845,7 @@ static bool lwt_is_valid_access(int off, int size,
> case bpf_ctx_range(struct __sk_buff, data_meta):
> case bpf_ctx_range(struct __sk_buff, flow_keys):
> case bpf_ctx_range(struct __sk_buff, tstamp):
> +   case bpf_ctx_range(struct __sk_buff, pkt_len):
> return false;
> }
>
> @@ -6273,6 +6276,7 @@ static bool sk_skb_is_valid_access(int off, int size,
> case bpf_ctx_range(struct __sk_buff, data_meta):
> case bpf_ctx_range(struct __sk_buff, flow_keys):
> case bpf_ctx_range(struct __sk_buff, tstamp):
> +   case bpf_ctx_range(struct __sk_buff, pkt_len):
> return false;
> }
>
> @@ -6360,6 +6364,7 @@ static bool flow_dissector_is_valid_access(int off, int size,
> case bpf_ctx_range(struct __sk_buff, data_meta):
> case bpf_ctx_range_till(struct __sk_buff, family, local_port):
> case bpf_ctx_range(struct __sk_buff, tstamp):
> +   case bpf_ctx_range(struct __sk_buff, pkt_len):
> return false;
> }
>
> @@ -6685,6 +6690,17 @@ static u32 bpf_convert_ctx_access(enum bpf_access_type type,
>   bpf_target_off(struct sk_buff,
>  tstamp, 8,
>  target_size));
> +   break;
> +
> +   case offsetof(struct __sk_buff, pkt_len):
> +   BUILD_BUG_ON(FIELD_SIZEOF(struct qdisc_skb_cb, pkt_len) != 4);
> +
> +   off = si->off;
> +   off -= offsetof(struct __sk_buff, pkt_len);
> +   off += offsetof(struct sk_buff, cb);
> +   off += offsetof(struct qdisc_skb_cb, pkt_len);
> +   *target_size = 4;
> +   *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg, off);
> }
>
> return insn - insn_buf;
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 597afdbc1ab9..ce028482d423 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -2483,6 +2483,7 @@ struct __sk_buff {
> __u32 data_meta;
> struct bpf_flow_keys *flow_keys;
> __u64 tstamp;
> +   __u32 pkt_len;
>  };
>
>  struct bpf_tunnel_key {
> diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
> index 17021d2b6bfe..0ab37fb4afad 100644
> --- a/tools/testing/selftests/bpf/test_verifier.c
> +++ 
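
For reference, a minimal tc classifier using the field proposed above
might look like this (a sketch against the proposed uapi, assuming the
patched bpf.h header that adds pkt_len; not part of the patch itself):

#include <linux/bpf.h>
#include <linux/pkt_cls.h>

__attribute__((section("classifier"), used))
int cls_pkt_len(struct __sk_buff *skb)
{
	/* pkt_len is the post-segmentation wire length, read-only */
	if (skb->pkt_len > 1500)
		return TC_ACT_SHOT;
	return TC_ACT_OK;
}

char _license[] __attribute__((section("license"), used)) = "GPL";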

Re: [PATCH bpf] bpf: Fix verifier log string check for bad alignment.

2018-11-30 Thread Alexei Starovoitov
On Wed, Nov 28, 2018 at 10:33:53PM -0800, David Miller wrote:
> 
> The message got changed a lot time ago.
> 
> This was responsible for 36 test case failures on sparc64.
> 
> Fixes: f1174f77b50c ("bpf/verifier: rework value tracking")
> Signed-off-by: David S. Miller 

Applied to bpf tree. Thanks!



Re: [PATCHv2 bpf 1/2] bpf: Support sk lookup in netns with id 0

2018-11-30 Thread Alexei Starovoitov
On Thu, Nov 29, 2018 at 04:29:33PM -0800, Joe Stringer wrote:
> David Ahern and Nicolas Dichtel report that the handling of the netns id
> 0 is incorrect for the BPF socket lookup helpers: rather than finding
> the netns with id 0, it is resolving to the current netns. This renders
> the netns_id 0 inaccessible.
> 
> To fix this, adjust the API for the netns to treat all negative s32
> values as a lookup in the current netns, while any values with a
> positive value in the signed 32-bit integer space would result in a
> lookup for a socket in the netns corresponding to that id. As before, if
> the netns with that ID does not exist, no socket will be found.
> Furthermore, if any bits are set in the upper 32-bits, then no socket
> will be found.
> 
> Signed-off-by: Joe Stringer 
..
> +/* Current network namespace */
> +#define BPF_CURRENT_NETNS	(-1L)

I was about to apply it, but then noticed that the name doesn't match
the rest of the names.
Could you rename it to BPF_F_CURRENT_NETNS ?

Also reword the commit log so it's less misleading.

thx!



Re: [PATCH bpf-next 1/4] bpf: Add BPF_F_ANY_ALIGNMENT.

2018-11-30 Thread David Miller
From: Alexei Starovoitov 
Date: Fri, 30 Nov 2018 13:58:20 -0800

> On Thu, Nov 29, 2018 at 07:32:41PM -0800, David Miller wrote:
>> +/* If BPF_F_ANY_ALIGNMENT is used in BPF_PROG_LOAD command, the
>> + * verifier will allow any alignment whatsoever.  This bypasses
>> + * what CONFIG_EFFICIENT_UNALIGNED_ACCESS would cause it to do.
> 
> I think majority of user space folks who read uapi/bpf.h have no idea
> what that kernel config does.
> Could you reword the comment here to say that this flag is only
> effective on architectures like sparc and mips that don't
> have efficient unaligned access, and is ignored on x86/arm64?

Ok.

>> +if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) &&
>> +(attr->prog_flags & BPF_F_ANY_ALIGNMENT) &&
>> +!capable(CAP_SYS_ADMIN))
>> +return -EPERM;
> 
> I guess we don't want to add:
> if (IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) &&
> (attr->prog_flags & BPF_F_ANY_ALIGNMENT))
> return -EINVAL;
> 
> so that test_verifier.c can just unconditionally pass this flag
> on all archs ?

Right.

>> @@ -247,7 +249,11 @@ int bpf_verify_program(enum bpf_prog_type type, const struct bpf_insn *insns,
>>  attr.log_level = log_level;
>>  log_buf[0] = 0;
>>  attr.kern_version = kern_version;
>> -attr.prog_flags = strict_alignment ? BPF_F_STRICT_ALIGNMENT : 0;
>> +if (strict_alignment)
>> +prog_flags |= BPF_F_STRICT_ALIGNMENT;
>> +if (any_alignment)
>> +prog_flags |= BPF_F_ANY_ALIGNMENT;
>> +attr.prog_flags = prog_flags;
> 
> instead of adding another argument may be replace 'int strict_alignment'
> with '__u32 prog_flags' ?
> and future flags won't need tweaks to bpf_verify_program() api.

Yeah the number of arguments are getting out of control, I'll do that.


Re: [PATCH 2/2] net: phy: ensure autoneg is configured when resuming a phydev

2018-11-30 Thread Heiner Kallweit
On 30.11.2018 19:45, Anssi Hannula wrote:
> When a PHY_HALTED phydev is resumed by phy_start(), it is set to
> PHY_RESUMING to wait for the link to come up.
> 
> However, if the phydev was put to PHY_HALTED (by e.g. phy_stop()) before
> autonegotiation was ever started by phy_state_machine(), autonegotiation
> remains unconfigured, i.e. phy_config_aneg() is never called.
> 
phy_stop() should only be called once the PHY was started. But it's
right that we don't have full control over it because misbehaving
drivers may call phy_stop() even if the PHY hasn't been started yet.

I think phy_error() and phy_stop() should be extended to set the state
to PHY_HALTED only if the PHY is in a started state, means:
don't touch the state if it's DOWN, READY, or HALTED.
In case of phy_error() it's more a precaution, because it's used within
phylib only and AFAICS it can be triggered from a started state only.

So yes, there is a theoretical issue. But as you wrote already, I think
there's a better way to deal with it.

For checking whether PHY is in a started state you could use the helper
which is being discussed here:
https://marc.info/?l=linux-netdev=154360368104946=2
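
Roughly, the guard would look like this (a sketch, assuming the
phy_is_started() helper from that thread):

	static void phy_stop_sketch(struct phy_device *phydev)
	{
		if (!phy_is_started(phydev))	/* DOWN, READY or HALTED */
			return;			/* nothing to stop */

		phydev->state = PHY_HALTED;	/* rest as in phy_stop() today */
	}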


> Fix that by going to PHY_UP instead of PHY_RESUMING if autonegotiation
> has never been configured.
> 
> Signed-off-by: Anssi Hannula 
> ---
> 
> This doesn't feel as clean as I'd like, though. Maybe there is a
> better way?
> 
> 
>  drivers/net/phy/phy.c | 11 ++-
>  include/linux/phy.h   |  2 ++
>  2 files changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
> index d46858694db9..7a650cce7599 100644
> --- a/drivers/net/phy/phy.c
> +++ b/drivers/net/phy/phy.c
> @@ -553,6 +553,8 @@ int phy_start_aneg(struct phy_device *phydev)
>   if (err < 0)
>   goto out_unlock;
>  
> + phydev->autoneg_configured = 1;
> +
>   if (phydev->state != PHY_HALTED) {
>   if (AUTONEG_ENABLE == phydev->autoneg) {
>   err = phy_check_link_status(phydev);
> @@ -893,7 +895,14 @@ void phy_start(struct phy_device *phydev)
>   break;
>   }
>  
> - phydev->state = PHY_RESUMING;
> + /* if autoneg/forcing was never configured, go back to PHY_UP
> +  * to make that happen
> +  */
> + if (!phydev->autoneg_configured)
> + phydev->state = PHY_UP;
> + else
> + phydev->state = PHY_RESUMING;
> +
>   break;
>   default:
>   break;
> diff --git a/include/linux/phy.h b/include/linux/phy.h
> index 8f927246acdb..b306d5ed9d09 100644
> --- a/include/linux/phy.h
> +++ b/include/linux/phy.h
> @@ -389,6 +389,8 @@ struct phy_device {
>   unsigned loopback_enabled:1;
>  
>   unsigned autoneg:1;
> + /* autoneg has been configured at least once */
> + unsigned autoneg_configured:1;
>   /* The most recently read link state */
>   unsigned link:1;
>  
> 



Re: [PATCH bpf-next 0/5] tools: bpftool: fixes and small improvements

2018-11-30 Thread Alexei Starovoitov
On Fri, Nov 30, 2018 at 04:25:43PM +, Quentin Monnet wrote:
> Hi,
> Several items for bpftool are included in this set: the first three patches
> are fixes for bpftool itself and bash completion, while the last two
> slightly improve the information obtained when dumping programs or maps, on
> Daniel's suggestion. Please refer to individual commit logs for more
> details.

Applied, Thanks



Re: [PATCH bpf-next 1/4] bpf: Add BPF_F_ANY_ALIGNMENT.

2018-11-30 Thread Alexei Starovoitov
On Thu, Nov 29, 2018 at 07:32:41PM -0800, David Miller wrote:
> 
> Often we want to write test cases that check things like bad context
> offset accesses.  And one way to do this is to use an odd offset on,
> for example, a 32-bit load.
> 
> This unfortunately triggers the alignment checks first on platforms
> that do not set CONFIG_EFFICIENT_UNALIGNED_ACCESS.  So the test
> case sees the alignment failure rather than what it was testing for.
> 
> It is often not completely possible to respect the original intention
> of the test, or even test the same exact thing, while solving the
> alignment issue.
> 
> Another option could have been to check the alignment after the
> context and other validations are performed by the verifier, but
> that is a non-trivial change to the verifier.
> 
> Signed-off-by: David S. Miller 
> ---
>  include/uapi/linux/bpf.h| 10 ++
>  kernel/bpf/syscall.c|  7 ++-
>  kernel/bpf/verifier.c   |  2 ++
>  tools/include/uapi/linux/bpf.h  | 10 ++
>  tools/lib/bpf/bpf.c | 12 +---
>  tools/lib/bpf/bpf.h |  6 +++---
>  tools/testing/selftests/bpf/test_align.c|  2 +-
>  tools/testing/selftests/bpf/test_verifier.c |  1 +
>  8 files changed, 42 insertions(+), 8 deletions(-)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 426b5c8..c9647ea 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -232,6 +232,16 @@ enum bpf_attach_type {
>   */
>  #define BPF_F_STRICT_ALIGNMENT   (1U << 0)
>  
> +/* If BPF_F_ANY_ALIGNMENT is used in BPF_PROG_LOAD command, the
> + * verifier will allow any alignment whatsoever.  This bypasses
> + * what CONFIG_EFFICIENT_UNALIGNED_ACCESS would cause it to do.

I think majority of user space folks who read uapi/bpf.h have no idea
what that kernel config does.
Could you reword the comment here to say that this flag is only
effective on architectures like sparc and mips that don't
have efficient unaligned access, and is ignored on x86/arm64?

> + * It is mostly used for testing when we want to validate the
> + * context and memory access aspects of the validator, but because
> + * of an unaligned access the alignment check would trigger before
> + * the one we are interested in.
> + */
> +#define BPF_F_ANY_ALIGNMENT  (1U << 1)
> +
>  /* when bpf_ldimm64->src_reg == BPF_PSEUDO_MAP_FD, bpf_ldimm64->imm == fd */
>  #define BPF_PSEUDO_MAP_FD1
>  
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index cf5040f..cae65bb 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -1450,9 +1450,14 @@ static int bpf_prog_load(union bpf_attr *attr)
>   if (CHECK_ATTR(BPF_PROG_LOAD))
>   return -EINVAL;
>  
> - if (attr->prog_flags & ~BPF_F_STRICT_ALIGNMENT)
> + if (attr->prog_flags & ~(BPF_F_STRICT_ALIGNMENT | BPF_F_ANY_ALIGNMENT))
>   return -EINVAL;
>  
> + if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) &&
> + (attr->prog_flags & BPF_F_ANY_ALIGNMENT) &&
> + !capable(CAP_SYS_ADMIN))
> + return -EPERM;

I guess we don't want to add:
if (IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) &&
(attr->prog_flags & BPF_F_ANY_ALIGNMENT))
return -EINVAL;

so that test_verifier.c can just unconditionally pass this flag
on all archs ?
Kinda weird that this knob doesn't do anything on x86.
But I guess it's fine.

> diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
> index 03f9bcc..d4f76d2 100644
> --- a/tools/lib/bpf/bpf.c
> +++ b/tools/lib/bpf/bpf.c
> @@ -232,9 +232,11 @@ int bpf_load_program(enum bpf_prog_type type, const struct bpf_insn *insns,
>  
>  int bpf_verify_program(enum bpf_prog_type type, const struct bpf_insn *insns,
>  size_t insns_cnt, int strict_alignment,
> -const char *license, __u32 kern_version,
> -char *log_buf, size_t log_buf_sz, int log_level)
> +int any_alignment, const char *license,
> +__u32 kern_version, char *log_buf, size_t log_buf_sz,
> +int log_level)
>  {
> + __u32 prog_flags = 0;
>   union bpf_attr attr;
>  
>   bzero(, sizeof(attr));
> @@ -247,7 +249,11 @@ int bpf_verify_program(enum bpf_prog_type type, const struct bpf_insn *insns,
>   attr.log_level = log_level;
>   log_buf[0] = 0;
>   attr.kern_version = kern_version;
> - attr.prog_flags = strict_alignment ? BPF_F_STRICT_ALIGNMENT : 0;
> + if (strict_alignment)
> + prog_flags |= BPF_F_STRICT_ALIGNMENT;
> + if (any_alignment)
> + prog_flags |= BPF_F_ANY_ALIGNMENT;
> + attr.prog_flags = prog_flags;

instead of adding another argument may be replace 'int strict_alignment'
with '__u32 prog_flags' ?
and future flags won't need tweaks to bpf_verify_program() api.
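
Sketching the suggested shape (an assumption about the follow-up, not
the final API):

	int bpf_verify_program(enum bpf_prog_type type, const struct bpf_insn *insns,
			       size_t insns_cnt, __u32 prog_flags, const char *license,
			       __u32 kern_version, char *log_buf, size_t log_buf_sz,
			       int log_level);

	/* a test wanting unaligned context accesses would then pass: */
	fd = bpf_verify_program(BPF_PROG_TYPE_SOCKET_FILTER, insns, insns_cnt,
				BPF_F_ANY_ALIGNMENT, "GPL", 0,
				log_buf, sizeof(log_buf), 1);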



Re: [PATCH net-next 0/6 v3] qed*: Doorbell overflow recovery

2018-11-30 Thread David Miller
From: Ariel Elior 
Date: Wed, 28 Nov 2018 18:16:01 +0200

> Doorbell Overflow
> If enough CPU cores send doorbells at a sufficiently high rate, they
> can cause an overflow in the doorbell queue block message fifo. When
> the fill level reaches maximum, the device stops accepting all
> doorbells from that PF until a
> recovery procedure has taken place.
> 
> Doorbell Overflow Recovery
> The recovery procedure basically means resending the last doorbell for every
> doorbelling entity. A doorbelling entity is anything which may send doorbells:
> L2 tx ring, rdma sq/rq/cq, light l2, vf l2 tx ring, spq, etc. This relies on
> the design assumption that all doorbells are aggregative, so last doorbell
> carries the information of all previous doorbells.
> 
> APIs
> All doorbelling entities need to register with the mechanism before sending
> doorbells. The registration entails providing the doorbell address the entity
> would be using, and a virtual address where last doorbell data can be found.
> Typically fastpath structures already have this construct.
> 
> Executing the recovery procedure
> Handling the attentions, iterating over all the registered entities and
> resending their doorbells, is all handled within qed core module.
> 
> Relevance
> All doorbelling entities in all protocols need to register with the mechanism,
> via the new APIs. Technically this is quite simple (just call the API). Some
> protocol fastpath implementations may not have the doorbell data stored
> anywhere (they compute it from scratch every time) and will have to add
> such a place.
> This is rare and is also better practice (save some cycles on the fastpath).
> 
> Performance Penalty
> No performance penalty should be incurred as a result of this feature. If
> anything, performance can improve by avoiding recalculation of doorbell
> data every time a doorbell is sent (in some flows).
> 
> Add the database used to register doorbelling entities, and APIs for adding
> and deleting entries, and logic for traversing the database and doorbelling
> once on behalf of all entities.
> 
> Please consider applying to net-next.

Series applied, thanks.
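
For reference, the registration API described in the cover letter has
roughly this shape (a sketch; exact parameter names and enums assumed):

	/* register a doorbelling entity before it sends its first doorbell */
	int qed_db_recovery_add(struct qed_dev *cdev,
				void __iomem *db_addr,	/* doorbell address */
				void *db_data,		/* last doorbell data */
				enum qed_db_rec_width db_width,
				enum qed_db_rec_space db_space);

	/* and unregister it before teardown */
	int qed_db_recovery_del(struct qed_dev *cdev,
				void __iomem *db_addr, void *db_data);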


Re: [PATCH net-next 0/2] rtnetlink: avoid a warning in rtnl_newlink()

2018-11-30 Thread David Miller
From: Jakub Kicinski 
Date: Tue, 27 Nov 2018 22:32:29 -0800

> I've been hoping for some time that someone more competent would fix
> the stack frame size warning in rtnl_newlink(), but looks like I'll
> have to take a stab at it myself :)  That's the only warning I see
> in most of my builds.
> 
> First patch refactors away a somewhat surprising if (1) code block.
> Reindentation will most likely cause cherry-pick problems but OTOH
> rtnl_newlink() doesn't seem to be changed often, so perhaps we can
> risk it in the name of cleaner code?
> 
> Second patch fixes the warning in simplest possible way.  I was
> pondering if there is any more clever solution, but I can't see it..
> rtnl_newlink() is quite long with a lot of possible execution paths
> so doing memory allocations half way through leads to very ugly
> results.

Series applied, thanks for tackling this Jakub.

That whole "if (1) {" was probably just a construct used in order
to create an inner basic block for local variables, nothing more.


Re: [next] bonding: pass link-local packets to bonding master also.

2018-11-30 Thread Vincent Bernat
 ❦ 15 July 2018 19:12 -0700, Mahesh Bandewar :

> Commit b89f04c61efe ("bonding: deliver link-local packets with
> skb->dev set to link that packets arrived on") changed the behavior
> of how link-local-multicast packets are processed. The change in
> the behavior broke some legacy use cases where these packets are
> expected to arrive on bonding master device also.

Unfortunately, this doesn't completely restore the previous
functionality as PACKET_ORIGDEV is broken for the copy: the original
interface is lost through the call to netif_rx(). A LLDP daemon
listening to the master interface won't get the original interface like
it was able to before 4.12.

I am a bit lost as to what the original patch was trying to achieve. I am
using the following test program:

#v+
#!/usr/bin/env python3

import sys
import socket
import datetime

socket.SOL_PACKET = 263
socket.ETH_P_ALL = 3
socket.PACKET_ORIGDEV = 9

interface = sys.argv[1] if len(sys.argv) > 1 else 'lag1'

s = socket.socket(socket.AF_PACKET,
  socket.SOCK_RAW,
  socket.htons(socket.ETH_P_ALL))
s.bind((interface, 0))
s.setsockopt(socket.SOL_PACKET, socket.PACKET_ORIGDEV, 1)
while True:
data, addrinfo = s.recvfrom(1500)
if addrinfo[2] == socket.PACKET_OUTGOING:
continue
print(f"{datetime.datetime.now().isoformat()}: "
  f"Received {len(data)} bytes from {addrinfo}")
#v-

If I run it with a kernel compiled with the commit before b89f04c61efe
(plus a few more cherry-pick to make it work like ea8ffc0818d8 and
72ccc471e13b), I get:

#v+
2018-11-30T22:20:40.193378: Received 221 bytes from ('eth1', 35020, 2, 1, b'RT3\x00\x00\x02')
2018-11-30T22:20:40.194504: Received 221 bytes from ('eth0', 35020, 2, 1, b'RT3\x00\x00\x01')
#v-

If I send non link-local packets, I get:

#v+
2018-11-30T22:25:57.300965: Received 98 bytes from ('eth0', 2048, 0, 1, b'PT3\x00\x00\x02')
#v-

I am also able to correctly receive link-local packets directly on each
interface. So, it seems everything was working as expected before
b89f04c61efe.
-- 
AWAKE! FEAR! FIRE! FOES! AWAKE!
FEAR! FIRE! FOES!
AWAKE! AWAKE!
-- J. R. R. Tolkien


Re: [PATCH net-next 00/11] nfp: update TX path to enable repr offloads

2018-11-30 Thread David Miller
From: Jakub Kicinski 
Date: Tue, 27 Nov 2018 22:24:47 -0800

> This set starts with three micro optimizations to the TX path.
> The improvement is measurable, but below 1% of CPU utilization.
> 
> Patches 4 - 9 add basic TX offloads to representor devices, like
> checksum offload or TSO, and remove the unnecessary TX lock and
> Qdisc (our representors are software constructs on top of the PF).
> 
> The last 2 patches add more info to error messages - id of command
> which failed and exact location of incorrect TLVs, very useful for
> debugging.

Series applied, thanks Jakub.


Re: [PATCH net-next] net: phy: Also request modules for C45 IDs

2018-11-30 Thread Andrew Lunn
> @@ -606,6 +606,18 @@ struct phy_device *phy_device_create(struct mii_bus *bus, int addr, int phy_id,
>* there's no driver _already_ loaded.
>*/
>   request_module(MDIO_MODULE_PREFIX MDIO_ID_FMT, MDIO_ID_ARGS(phy_id));

This line above is for C22. If you look at phy_bus_match(), it will
perform a match on the c45_ids->device_ids[] if it is a c45 PHY, or
the phy_id if it is c22. So i think we should also have this one or
the other here, not both.

Thanks
Andrew

> + if (c45_ids) {
> + const int num_ids = ARRAY_SIZE(c45_ids->device_ids);
> + int i;
> +
> + for (i = 1; i < num_ids; i++) {
> + if (!(c45_ids->devices_in_package & (1 << i)))
> + continue;
> +
> + request_module(MDIO_MODULE_PREFIX MDIO_ID_FMT,
> +MDIO_ID_ARGS(c45_ids->device_ids[i]));
> + }
> + }
>  
>   device_initialize(>dev);
>  
> -- 
> 2.7.4
> 
> 
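
Put differently, the create path would pick one path or the other,
roughly (a sketch; request_c45_modules() is a hypothetical helper
wrapping the loop from the patch):

	if (c45_ids)
		request_c45_modules(c45_ids);
	else
		request_module(MDIO_MODULE_PREFIX MDIO_ID_FMT,
			       MDIO_ID_ARGS(phy_id));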


Re: [PATCH net-next] tcp: md5: add tcp_md5_needed jump label

2018-11-30 Thread David Miller
From: Eric Dumazet 
Date: Tue, 27 Nov 2018 15:03:21 -0800

> Most linux hosts never setup TCP MD5 keys. We can avoid a
> cache line miss (accessing tp->md5ig_info) on RX and TX
> using a jump label.
> 
> Signed-off-by: Eric Dumazet 

Also applied, thanks.


Re: [PATCH v3 net-next 0/4] tcp: take a bit more care of backlog stress

2018-11-30 Thread David Miller
From: Eric Dumazet 
Date: Tue, 27 Nov 2018 14:41:59 -0800

> While working on the SACK compression issue Jean-Louis Dupond
> reported, we found that his linux box was suffering very hard
> from tail drops on the socket backlog queue.
> 
> First patch hints the compiler about sack flows being the norm.
> 
> Second patch changes non-sack code in preparation of the ack
> compression.
> 
> Third patch fixes tcp_space() to take backlog into account.
> 
> Fourth patch is attempting coalescing when a new packet must
> be added to the backlog queue. Cooking bigger skbs helps
> to keep backlog list smaller and speeds its handling when
> user thread finally releases the socket lock.
> 
> v3: Neal/Yuchung feedback addressed :
>  Do not aggregate if any skb has URG bit set.
>  Do not aggregate if the skbs have different ECE/CWR bits
> 
> v2: added feedback from Neal : tcp: take care of compressed acks in 
> tcp_add_reno_sack() 
> added : tcp: hint compiler about sack flows
>   added : tcp: make tcp_space() aware of socket backlog

Series applied, thanks Eric.

I'll push this out after the build check finishes.


Re: [PATCH net 2/2] nfp: flower: prevent offload if rhashtable insert fails

2018-11-30 Thread David Miller
From: Jakub Kicinski 
Date: Tue, 27 Nov 2018 14:04:12 -0800

> From: John Hurley 
> 
> For flow offload adds, if the rhash insert code fails, the flow will still
> have been offloaded but the reference to it in the driver freed.
> 
> Re-order the offload setup calls to ensure that a flow will only be written
> to FW if a kernel reference is held and stored in the rhashtable. Remove
> this hashtable entry if the offload fails.
> 
> Fixes: c01d0efa5136 ("nfp: flower: use rhashtable for flow caching")
> Signed-off-by: John Hurley 
> Reviewed-by: Pieter Jansen van Vuuren 
> Reviewed-by: Jakub Kicinski 
> ---
> Merge note: there will be a slight merge conflict with net-next
> here:
>  - the first argument of nfp_flower_xmit_flow() changed from 'netdev'
>to 'app';
>  - we only bump the port offload cnt if (port).
> 
> FWIW the net-next version of the patch can be found at:
> 
>  git://git.kernel.org/pub/scm/linux/kernel/git/kuba/linux.git
> 
>  17ed95873d51 nfp: flower: prevent offload if rhashtable insert fails
>  d857fc8f472b nfp: flower: release metadata on offload failure
> 
> CC: Stephen Rothwell  # for linux-next

Applied.


Re: [PATCH net 1/2] nfp: flower: release metadata on offload failure

2018-11-30 Thread David Miller
From: Jakub Kicinski 
Date: Tue, 27 Nov 2018 14:04:11 -0800

> From: John Hurley 
> 
> Calling nfp_compile_flow_metadata both assigns a stats context and
> increments a ref counter on (or allocates) a mask id table entry. These
> are released by the nfp_modify_flow_metadata call on flow deletion,
> however, if a flow add fails after metadata is set then the flow entry
> will be deleted but the metadata assignments leaked.
> 
> Add an error path to the flow add offload function to ensure allocated
> metadata is released in the event of an offload fail.
> 
> Fixes: 81f3ddf2547d ("nfp: add control message passing capabilities to flower offloads")
> Signed-off-by: John Hurley 
> Reviewed-by: Pieter Jansen van Vuuren 
> Reviewed-by: Jakub Kicinski 

Applied.


Re: [PATCH v4] net: Add trace events for all receive exit points

2018-11-30 Thread David Miller
From: Geneviève Bastien 
Date: Tue, 27 Nov 2018 12:52:39 -0500

> Trace events are already present for the receive entry points, to indicate
> how the reception entered the stack.
> 
> This patch adds the corresponding exit trace events that will bound the
> reception such that all events occurring between the entry and the exit
> can be considered as part of the reception context. This greatly helps
> for dependency and root cause analyses.
> 
> Without this, it is not possible with tracepoint instrumentation to
> determine whether a sched_wakeup event following a netif_receive_skb
> event is the result of the packet reception or a simple coincidence after
> further processing by the thread. It is possible using other mechanisms
> like kretprobes, but considering the "entry" points are already present,
> it would be good to add the matching exit events.
> 
> In addition to linking packets with wakeups, the entry/exit event pair
> can also be used to perform network stack latency analyses.
> 
> Signed-off-by: Geneviève Bastien 
> CC: Mathieu Desnoyers 
> CC: Steven Rostedt 
> CC: Ingo Molnar 
> CC: David S. Miller 
> Reviewed-by: Steven Rostedt (VMware)  (tracing
> side)

Applied to net-next, thanks for sticking with this.


Re: [PATCH net-next] net/flow_dissector: correct comments on enum flow_dissector_key_id

2018-11-30 Thread David Miller
From: Edward Cree 
Date: Tue, 27 Nov 2018 15:40:59 +

> There are no such structs flow_dissector_key_flow_vlan or
>  flow_dissector_key_flow_tags, the actual structs used are struct
>  flow_dissector_key_vlan and struct flow_dissector_key_tags.  So correct the
>  comments against FLOW_DISSECTOR_KEY_VLAN, FLOW_DISSECTOR_KEY_FLOW_LABEL and
>  FLOW_DISSECTOR_KEY_CVLAN to refer to those.
> 
> Signed-off-by: Edward Cree 

Applied, thanks.


Re: [PATCH v2 net] bonding: fix 802.3ad state sent to partner when unbinding slave

2018-11-30 Thread David Miller
From: Toni Peltonen 
Date: Tue, 27 Nov 2018 16:56:57 +0200

> Previously when unbinding a slave the 802.3ad implementation only told
> partner that the port is not suitable for aggregation by setting the port
> aggregation state from aggregatable to individual. This is not enough. If the
> physical layer still stays up and we only unbound this port from the bond
> there is nothing in the aggregation status alone to prevent the partner
> from sending
> traffic towards us. To ensure that the partner doesn't consider this
> port at all anymore we should also disable collecting and distributing to
> signal that this actor is going away. Also clear AD_STATE_SYNCHRONIZATION to
> ensure partner exits collecting + distributing state.
> 
> I have tested this behaviour against Arista EOS switches with mlx5 cards
> (physical link stays up even when interface is down) and simulated
> the same situation virtually Linux <-> Linux with two network namespaces
> running two veth device pairs. In both cases setting aggregation to
> individual doesn't alone prevent traffic from being sent towards this
> port given that the link stays up at the partner's end. The partner still keeps
> its end in collecting + distributing state and continues until timeout is
> reached. In most cases this means we are losing the traffic partner sends
> towards our port while we wait for timeout. This is most visible with slow
> periodic time (LACP rate slow).
> 
> Other open source implementations like Open vSwitch and libreswitch, and
> vendor implementations like Arista EOS, seem to disable collecting +
> distributing too when doing similar port disabling/detaching/removing changes.
> With this patch the kernel implementation behaves the same way and ensures
> the partner doesn't consider our actor viable anymore.
> 
> Signed-off-by: Toni Peltonen 
> ---
> v2 changes:
> * Fix typo in commit message
> * Also clear AD_STATE_SYNCHRONIZATION

Applied, thank you.
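
Concretely, the unbind path now clears the relevant actor state bits before
the final LACPDU goes out to the partner, along these lines (a simplified
sketch of the bond_3ad_unbind_slave() change; the AD_STATE_* flag names are
as found in drivers/net/bonding/bond_3ad.h):

        /* tell the partner we are going away: not aggregatable, out of
         * sync, and no longer collecting or distributing */
        port->actor_oper_port_state &= ~AD_STATE_AGGREGATION;
        port->actor_oper_port_state &= ~AD_STATE_SYNCHRONIZATION;
        port->actor_oper_port_state &= ~AD_STATE_COLLECTING;
        port->actor_oper_port_state &= ~AD_STATE_DISTRIBUTING;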


Re: [PATCH v4 3/4] libbpf: add bpf_prog_test_run_xattr

2018-11-30 Thread Alexei Starovoitov
On Wed, Nov 28, 2018 at 04:53:11PM +, Lorenz Bauer wrote:
> Add a new function, which encourages safe usage of the test interface.
> bpf_prog_test_run continues to work as before, but should be considered
> unsafe.
> 
> Signed-off-by: Lorenz Bauer 
..
> +
> +LIBBPF_API int bpf_prog_test_run_xattr(struct bpf_prog_test_run_attr 
> *test_attr);

it fails to build due to:
  LINK tools/testing/selftests/bpf/test_libbpf
(117) does NOT match with num of versioned symbols in 
testing/selftests/bpf/libbpf.so (116).
Please make sure all LIBBPF_API symbols are versioned in libbpf.map.

Please fixup and respin.
The rest looks good.



Re: [PATCH net] net: aquantia: fix rx checksum offload bits

2018-11-30 Thread David Miller
From: Igor Russkikh 
Date: Tue, 27 Nov 2018 14:51:17 +

> From: Dmitry Bogdanov 
> 
> The last set of csum offload fixes had a leak:
> 
> Checksum enabled status bits from rx descriptor were incorrectly
> interpreted. Consequently all the other valid logic worked on zero bits.
> That caused rx checksum offloads never to trigger.
> 
> Tested by dumping rx descriptors and validating resulting csum_level.
> 
> Reported-by: Igor Russkikh 
> Signed-off-by: Dmitry Bogdanov 
> Signed-off-by: Igor Russkikh 
> Fixes: ad703c2b9127f ("net: aquantia: invalid checksumm offload 
> implementation")

Applied.

This shows that the feature added was not actually tested.

Thanks.


Re: [PATCH v6 1/4] udp_tunnel: add config option to bind to a device

2018-11-30 Thread David Miller
From: Alexis Bauvin 
Date: Tue, 27 Nov 2018 14:05:42 +0100

> UDP tunnel sockets are always opened unbound to a specific device. This
> patch allows the socket to be bound to a custom device, which
> incidentally makes UDP tunnels VRF-aware if binding to an l3mdev.
> 
> Signed-off-by: Alexis Bauvin 
> Reviewed-by: Amine Kherbouche 
> Reviewed-by: David Ahern 
> Tested-by: Amine Kherbouche 

Please address Sabrina's feedback and respin this series.

Thanks.
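
At the socket level, binding to a device conceptually comes down to setting
sk_bound_dev_if on the tunnel socket before it is used, roughly (a hedged
sketch; bind_ifindex as a udp_port_cfg field is what this series proposes,
and the real patch may use a different helper):

        if (cfg->bind_ifindex)
                sock->sk->sk_bound_dev_if = cfg->bind_ifindex;

Route lookups then consult sk_bound_dev_if, which is what makes the socket
follow an l3mdev/VRF when bound to one.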


Re: [PATCHv2 net] sctp: update frag_point when stream_interleave is set

2018-11-30 Thread David Miller
From: Xin Long 
Date: Tue, 27 Nov 2018 19:11:50 +0800

> sctp_assoc_update_frag_point() should be called whenever asoc->pathmtu
> changes, but we missed one place in sctp_association_init(). It would
> cause frag_point to be zero when sending data.
> 
> As shown in Jakub's reproducer, if sp->pathmtu is set via sockopt, the
> new asoc->pathmtu inherits it in sctp_association_init(). Later when
> transports are added and their pmtu >= asoc->pathmtu, it will never
> call sctp_assoc_update_frag_point() to set frag_point.
> 
> This patch is to fix it by updating frag_point after asoc->pathmtu is
> set as sp->pathmtu in sctp_association_init(). Note that it moved them
> after sctp_stream_init(), as stream->si needs to be set first.
> 
> Frag_point's calculation is also related to the datachunk type, so it
> needs to update frag_point when stream->si may be changed in
> sctp_process_init().
> 
> v1->v2:
>   - call sctp_assoc_update_frag_point() separately in sctp_process_init
> and sctp_association_init, per Marcelo's suggestion.
> 
> Fixes: 2f5e3c9df693 ("sctp: introduce sctp_assoc_update_frag_point")
> Reported-by: Jakub Audykowicz 
> Signed-off-by: Xin Long 

Applied and queued up for -stable back to v4.18
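
The ordering fix is easiest to see as a sketch of the tail of
sctp_association_init() (hedged; the argument lists here are simplified
stand-ins, only the call ordering matters):

        /* stream->si must be set before frag_point can be computed */
        err = sctp_stream_init(&asoc->stream, outcnt, 0, gfp);
        if (err)
                goto fail;

        /* only now inherit the socket pathmtu and derive frag_point */
        asoc->pathmtu = sp->pathmtu;
        sctp_assoc_update_frag_point(asoc);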


Re: [PATCH v3] ipv4: make DSCP values works with ip rules

2018-11-30 Thread David Miller
From: Pavel Balaev 
Date: Tue, 27 Nov 2018 13:07:10 +0300

> -#define IPTOS_RT_MASK	(IPTOS_TOS_MASK & ~3)
> +#define IPTOS_RT_MASK	(IPTOS_DSCP_MASK & ~3)

I was hoping my original feedback would have you actually go
investigate why IPTOS_RT_MASK is defined the way that it is.

I will try again.

Please grep around for RTO_ONLINK and see how that specifically
interferes with what you are trying to do here.

I don't think your goal is achievable with how things are defined
currently.

Sorry.
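
For reference, the definitions David is pointing at (as they stand in the
kernel headers of this era):

        #define RTO_ONLINK      0x01                    /* include/net/route.h */
        #define IPTOS_TOS_MASK  0x1E                    /* include/uapi/linux/ip.h */
        #define IPTOS_RT_MASK   (IPTOS_TOS_MASK & ~3)   /* include/net/route.h */

The low two bits of the tos value are masked out on purpose: bit 0 is
overloaded as the RTO_ONLINK routing flag, so widening the mask to the full
DSCP field collides with that existing use.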


Re: [PATCH v2 net-next] cxgb4: number of VFs supported is not always 16

2018-11-30 Thread David Miller
From: Ganesh Goudar 
Date: Tue, 27 Nov 2018 14:59:06 +0530

> The total number of VFs supported by the PF is used to determine the last
> byte of a VF's MAC address. The number of VFs supported is not always
> 16, so use the variable nvfs to get the number of VFs supported
> rather than hard coding it to 16.
> 
> Signed-off-by: Casey Leedom 
> Signed-off-by: Ganesh Goudar 
> ---
> V2: Fixes typo in commit message

Applied.


Re: consistency for statistics with XDP mode

2018-11-30 Thread David Ahern
On 11/30/18 1:30 PM, Michael S. Tsirkin wrote:
 I would like to see basic packets, bytes, and dropped counters
 tracked
 for Rx and Tx via the standard netdev counters for all devices. 
>>
>> The problem of reporting XDP_DROP in the netdev drop counter is that
>> they don't fit this counter description : "no space in linux buffers"
>> and it will be hard for the user to determine whether these drops are
>> coming from XDP or because no buffer is available, which will make it
>> impossible to estimate packet rate performance without looking at
>> ethtool stats.
>> And reporting XDP_DROP in the netdev rx packets counter is somehow
>> misleading.. since those packets never made it out of this driver.. 
> 
> I think I agree. XDP needs minimal overhead - if user wants to do
> counters then user can via maps. And in a sense XDP dropping packet
> is much like e.g. TCP dropping packet - it is not counted
> against the driver since it's not driver's fault.
> 

XDP dropping a packet is completely different.

stats are important. packets disappearing with no counters -- standard
counters visible by standard tools -- is a user nightmare. If the
agreement is for XDP drops to be in driver level (e.g., xdp_drop) that
is fine since it is still retrievable by ethtool -S (existing APIs and
existing tools).
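
As an illustration of the "count via maps" option mentioned above, a minimal
XDP drop counter looks like this (a hedged sketch in the selftests style of
the time; the bpf_helpers.h include and map-def convention are assumptions,
not something from this thread):

#include <linux/bpf.h>
#include "bpf_helpers.h"

struct bpf_map_def SEC("maps") drop_cnt = {
        .type        = BPF_MAP_TYPE_PERCPU_ARRAY,
        .key_size    = sizeof(__u32),
        .value_size  = sizeof(__u64),
        .max_entries = 1,
};

SEC("xdp")
int xdp_count_drops(struct xdp_md *ctx)
{
        __u32 key = 0;
        __u64 *val = bpf_map_lookup_elem(&drop_cnt, &key);

        if (val)
                (*val)++;       /* per-CPU map, no atomics needed */
        return XDP_DROP;
}

char _license[] SEC("license") = "GPL";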


[PATCH net-next v4 2/3] udp: elide zerocopy operation in hot path

2018-11-30 Thread Willem de Bruijn
From: Willem de Bruijn 

With MSG_ZEROCOPY, each skb holds a reference to a struct ubuf_info.
Release of its last reference triggers a completion notification.

The TCP stack in tcp_sendmsg_locked holds an extra ref independent of
the skbs, because it can build, send and free skbs within its loop,
possibly reaching refcount zero and freeing the ubuf_info too soon.

The UDP stack currently also takes this extra ref, but does not need
it as all skbs are sent after return from __ip(6)_append_data.

Avoid the extra refcount_inc and refcount_dec_and_test, and generally
the sock_zerocopy_put in the common path, by passing the initial
reference to the first skb.

This approach is taken instead of initializing the refcount to 0, as
that would generate error "refcount_t: increment on 0" on the
next skb_zcopy_set.

Changes
  v3 -> v4
    - Move skb_zcopy_set below the only kfree_skb that might cause
      a premature uarg destroy before skb_zerocopy_put_abort
    - Move the entire skb_shinfo assignment block, to keep that
      cacheline access in one place

Signed-off-by: Willem de Bruijn 
---
 include/linux/skbuff.h | 12 
 net/core/skbuff.c  |  9 +
 net/ipv4/ip_output.c   | 22 +++---
 net/ipv4/tcp.c |  2 +-
 net/ipv6/ip6_output.c  | 22 +++---
 5 files changed, 36 insertions(+), 31 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 04f52e719571..75d50ab7997c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -481,7 +481,7 @@ static inline void sock_zerocopy_get(struct ubuf_info *uarg)
 }
 
 void sock_zerocopy_put(struct ubuf_info *uarg);
-void sock_zerocopy_put_abort(struct ubuf_info *uarg);
+void sock_zerocopy_put_abort(struct ubuf_info *uarg, bool have_uref);
 
 void sock_zerocopy_callback(struct ubuf_info *uarg, bool success);
 
@@ -1326,10 +1326,14 @@ static inline struct ubuf_info *skb_zcopy(struct 
sk_buff *skb)
return is_zcopy ? skb_uarg(skb) : NULL;
 }
 
-static inline void skb_zcopy_set(struct sk_buff *skb, struct ubuf_info *uarg)
+static inline void skb_zcopy_set(struct sk_buff *skb, struct ubuf_info *uarg,
+bool *have_ref)
 {
if (skb && uarg && !skb_zcopy(skb)) {
-   sock_zerocopy_get(uarg);
+   if (unlikely(have_ref && *have_ref))
+   *have_ref = false;
+   else
+   sock_zerocopy_get(uarg);
skb_shinfo(skb)->destructor_arg = uarg;
skb_shinfo(skb)->tx_flags |= SKBTX_ZEROCOPY_FRAG;
}
@@ -1374,7 +1378,7 @@ static inline void skb_zcopy_abort(struct sk_buff *skb)
struct ubuf_info *uarg = skb_zcopy(skb);
 
if (uarg) {
-   sock_zerocopy_put_abort(uarg);
+   sock_zerocopy_put_abort(uarg, false);
skb_shinfo(skb)->tx_flags &= ~SKBTX_ZEROCOPY_FRAG;
}
 }
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 1350901c5cb8..c78ce114537e 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1089,7 +1089,7 @@ void sock_zerocopy_put(struct ubuf_info *uarg)
 }
 EXPORT_SYMBOL_GPL(sock_zerocopy_put);
 
-void sock_zerocopy_put_abort(struct ubuf_info *uarg)
+void sock_zerocopy_put_abort(struct ubuf_info *uarg, bool have_uref)
 {
if (uarg) {
struct sock *sk = skb_from_uarg(uarg)->sk;
@@ -1097,7 +1097,8 @@ void sock_zerocopy_put_abort(struct ubuf_info *uarg)
atomic_dec(&sk->sk_zckey);
uarg->len--;
 
-   sock_zerocopy_put(uarg);
+   if (have_uref)
+   sock_zerocopy_put(uarg);
}
 }
 EXPORT_SYMBOL_GPL(sock_zerocopy_put_abort);
@@ -1137,7 +1138,7 @@ int skb_zerocopy_iter_stream(struct sock *sk, struct 
sk_buff *skb,
return err;
}
 
-   skb_zcopy_set(skb, uarg);
+   skb_zcopy_set(skb, uarg, NULL);
return skb->len - orig_len;
 }
 EXPORT_SYMBOL_GPL(skb_zerocopy_iter_stream);
@@ -1157,7 +1158,7 @@ static int skb_zerocopy_clone(struct sk_buff *nskb, 
struct sk_buff *orig,
if (skb_copy_ubufs(nskb, GFP_ATOMIC))
return -EIO;
}
-   skb_zcopy_set(nskb, skb_uarg(orig));
+   skb_zcopy_set(nskb, skb_uarg(orig), NULL);
}
return 0;
 }
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 6f843aff628c..78f028bdad30 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -881,8 +881,8 @@ static int __ip_append_data(struct sock *sk,
int csummode = CHECKSUM_NONE;
struct rtable *rt = (struct rtable *)cork->dst;
unsigned int wmem_alloc_delta = 0;
+   bool paged, extra_uref;
u32 tskey = 0;
-   bool paged;
 
skb = skb_peek_tail(queue);
 
@@ -921,12 +921,13 @@ static int __ip_append_data(struct sock *sk,
uarg = sock_zerocopy_realloc(sk, length, skb_zcopy(skb));
if 

[PATCH net-next v4 0/3] udp msg_zerocopy

2018-11-30 Thread Willem de Bruijn
From: Willem de Bruijn 

Enable MSG_ZEROCOPY for udp sockets

Patch 1/3 is the main patch, a rework of RFC patch
  http://patchwork.ozlabs.org/patch/899630/
  more details in the patch commit message

Patch 2/3 is an optimization to remove a branch from the UDP hot path
  and refcount_inc/refcount_dec_and_test pair when zerocopy is used.
  This used to be included in the first patch in v2.

Patch 3/3 runs the already existing udp zerocopy tests
  as part of kselftest

See also recent Linux Plumbers presentation
  
https://linuxplumbersconf.org/event/2/contributions/106/attachments/104/128/willemdebruijn-lpc2018-udpgso-presentation-20181113.pdf

Changes:
  v1 -> v2
    - Fixup reverse christmas tree violation
  v2 -> v3
    - Split refcount avoidance optimization into separate patch
    - Fix refcount leak on error in fragmented case
      (thanks to Paolo Abeni for pointing this one out!)
    - Fix refcount inc on zero
  v3 -> v4
    - Move skb_zcopy_set below the only kfree_skb that might cause
      a premature uarg destroy before skb_zerocopy_put_abort
    - Move the entire skb_shinfo assignment block, to keep that
      cacheline access in one place

Willem de Bruijn (3):
  udp: msg_zerocopy
  udp: elide zerocopy operation in hot path
  selftests: extend zerocopy tests to udp

 include/linux/skbuff.h  | 13 +---
 net/core/skbuff.c   | 15 ++---
 net/core/sock.c |  5 ++-
 net/ipv4/ip_output.c| 37 -
 net/ipv4/tcp.c  |  2 +-
 net/ipv6/ip6_output.c   | 37 -
 tools/testing/selftests/net/msg_zerocopy.c  |  3 +-
 tools/testing/selftests/net/msg_zerocopy.sh |  2 ++
 tools/testing/selftests/net/udpgso_bench.sh |  3 ++
 9 files changed, 90 insertions(+), 27 deletions(-)

-- 
2.20.0.rc1.387.gf8505762e3-goog



[PATCH net-next v4 1/3] udp: msg_zerocopy

2018-11-30 Thread Willem de Bruijn
From: Willem de Bruijn 

Extend zerocopy to udp sockets. Allow setting sockopt SO_ZEROCOPY and
interpret flag MSG_ZEROCOPY.

This patch was previously part of the zerocopy RFC patchsets. Zerocopy
is not effective at small MTU. With segmentation offload building
larger datagrams, the benefit of page flipping outweighs the cost of
generating a completion notification.

tools/testing/selftests/net/msg_zerocopy.sh after applying follow-on
test patch and making skb_orphan_frags_rx same as skb_orphan_frags:

ipv4 udp -t 1
tx=191312 (11938 MB) txc=0 zc=n
rx=191312 (11938 MB)
ipv4 udp -z -t 1
tx=304507 (19002 MB) txc=304507 zc=y
rx=304507 (19002 MB)
ok
ipv6 udp -t 1
tx=174485 (10888 MB) txc=0 zc=n
rx=174485 (10888 MB)
ipv6 udp -z -t 1
tx=294801 (18396 MB) txc=294801 zc=y
rx=294801 (18396 MB)
ok
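
From the application side the flow being enabled here is the documented
MSG_ZEROCOPY sequence (a hedged sketch, error handling elided):

        int one = 1;

        setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
        send(fd, buf, len, MSG_ZEROCOPY);

        /* completion notifications arrive on the socket error queue once
         * the kernel no longer needs the pinned pages */
        recvmsg(fd, &msg, MSG_ERRQUEUE);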

Changes
  v1 -> v2
    - Fixup reverse christmas tree violation
  v2 -> v3
    - Split refcount avoidance optimization into separate patch
    - Fix refcount leak on error in fragmented case
      (thanks to Paolo Abeni for pointing this one out!)
    - Fix refcount inc on zero
    - Test sock_flag SOCK_ZEROCOPY directly in __ip_append_data.
      This is needed since commit 5cf4a8532c99 ("tcp: really ignore
      MSG_ZEROCOPY if no SO_ZEROCOPY") did the same for tcp.

Signed-off-by: Willem de Bruijn 
---
 include/linux/skbuff.h |  1 +
 net/core/skbuff.c  |  6 ++
 net/core/sock.c|  5 -
 net/ipv4/ip_output.c   | 23 ++-
 net/ipv6/ip6_output.c  | 23 ++-
 5 files changed, 55 insertions(+), 3 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 73902acf2b71..04f52e719571 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -485,6 +485,7 @@ void sock_zerocopy_put_abort(struct ubuf_info *uarg);
 
 void sock_zerocopy_callback(struct ubuf_info *uarg, bool success);
 
+int skb_zerocopy_iter_dgram(struct sk_buff *skb, struct msghdr *msg, int len);
 int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
 struct msghdr *msg, int len,
 struct ubuf_info *uarg);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3c814565ed7c..1350901c5cb8 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1105,6 +1105,12 @@ EXPORT_SYMBOL_GPL(sock_zerocopy_put_abort);
 extern int __zerocopy_sg_from_iter(struct sock *sk, struct sk_buff *skb,
   struct iov_iter *from, size_t length);
 
+int skb_zerocopy_iter_dgram(struct sk_buff *skb, struct msghdr *msg, int len)
+{
+   return __zerocopy_sg_from_iter(skb->sk, skb, &msg->msg_iter, len);
+}
+EXPORT_SYMBOL_GPL(skb_zerocopy_iter_dgram);
+
 int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
 struct msghdr *msg, int len,
 struct ubuf_info *uarg)
diff --git a/net/core/sock.c b/net/core/sock.c
index 6d7e189e3cd9..f5bb89785e47 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1018,7 +1018,10 @@ int sock_setsockopt(struct socket *sock, int level, int 
optname,
 
case SO_ZEROCOPY:
if (sk->sk_family == PF_INET || sk->sk_family == PF_INET6) {
-   if (sk->sk_protocol != IPPROTO_TCP)
+   if (!((sk->sk_type == SOCK_STREAM &&
+  sk->sk_protocol == IPPROTO_TCP) ||
+ (sk->sk_type == SOCK_DGRAM &&
+  sk->sk_protocol == IPPROTO_UDP)))
ret = -ENOTSUPP;
} else if (sk->sk_family != PF_RDS) {
ret = -ENOTSUPP;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 5dbec21856f4..6f843aff628c 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -867,6 +867,7 @@ static int __ip_append_data(struct sock *sk,
unsigned int flags)
 {
struct inet_sock *inet = inet_sk(sk);
+   struct ubuf_info *uarg = NULL;
struct sk_buff *skb;
 
struct ip_options *opt = cork->opt;
@@ -916,6 +917,19 @@ static int __ip_append_data(struct sock *sk,
(!exthdrlen || (rt->dst.dev->features & NETIF_F_HW_ESP_TX_CSUM)))
csummode = CHECKSUM_PARTIAL;
 
+   if (flags & MSG_ZEROCOPY && length && sock_flag(sk, SOCK_ZEROCOPY)) {
+   uarg = sock_zerocopy_realloc(sk, length, skb_zcopy(skb));
+   if (!uarg)
+   return -ENOBUFS;
+   if (rt->dst.dev->features & NETIF_F_SG &&
+   csummode == CHECKSUM_PARTIAL) {
+   paged = true;
+   } else {
+   uarg->zerocopy = 0;
+   skb_zcopy_set(skb, uarg);
+   }
+   }
+
cork->length += length;
 
/* So, what's going on in the loop below?
@@ -1006,6 +1020,7 @@ static int __ip_append_data(struct 

[PATCH net-next v4 3/3] selftests: extend zerocopy tests to udp

2018-11-30 Thread Willem de Bruijn
From: Willem de Bruijn 

Both msg_zerocopy and udpgso_bench have udp zerocopy variants.
Exercise these as part of the standard kselftest run.

With udp, msg_zerocopy has no control channel. Ensure that the
receiver exits after the sender by accounting for the initial
delay in starting them (in msg_zerocopy.sh).

Signed-off-by: Willem de Bruijn 
---
 tools/testing/selftests/net/msg_zerocopy.c  | 3 ++-
 tools/testing/selftests/net/msg_zerocopy.sh | 2 ++
 tools/testing/selftests/net/udpgso_bench.sh | 3 +++
 3 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/net/msg_zerocopy.c 
b/tools/testing/selftests/net/msg_zerocopy.c
index 406cc70c571d..4b02933cab8a 100644
--- a/tools/testing/selftests/net/msg_zerocopy.c
+++ b/tools/testing/selftests/net/msg_zerocopy.c
@@ -651,12 +651,13 @@ static void do_flush_datagram(int fd, int type)
 
 static void do_rx(int domain, int type, int protocol)
 {
+   const int cfg_receiver_wait_ms = 400;
uint64_t tstop;
int fd;
 
fd = do_setup_rx(domain, type, protocol);
 
-   tstop = gettimeofday_ms() + cfg_runtime_ms;
+   tstop = gettimeofday_ms() + cfg_runtime_ms + cfg_receiver_wait_ms;
do {
if (type == SOCK_STREAM)
do_flush_tcp(fd);
diff --git a/tools/testing/selftests/net/msg_zerocopy.sh 
b/tools/testing/selftests/net/msg_zerocopy.sh
index c43c6debda06..825ffec85cea 100755
--- a/tools/testing/selftests/net/msg_zerocopy.sh
+++ b/tools/testing/selftests/net/msg_zerocopy.sh
@@ -25,6 +25,8 @@ readonly path_sysctl_mem="net.core.optmem_max"
 if [[ "$#" -eq "0" ]]; then
$0 4 tcp -t 1
$0 6 tcp -t 1
+   $0 4 udp -t 1
+   $0 6 udp -t 1
echo "OK. All tests passed"
exit 0
 fi
diff --git a/tools/testing/selftests/net/udpgso_bench.sh 
b/tools/testing/selftests/net/udpgso_bench.sh
index 0f0628613f81..5670a9ffd8eb 100755
--- a/tools/testing/selftests/net/udpgso_bench.sh
+++ b/tools/testing/selftests/net/udpgso_bench.sh
@@ -35,6 +35,9 @@ run_udp() {
 
echo "udp gso"
run_in_netns ${args} -S 0
+
+   echo "udp gso zerocopy"
+   run_in_netns ${args} -S 0 -z
 }
 
 run_tcp() {
-- 
2.20.0.rc1.387.gf8505762e3-goog



Re: consistency for statistics with XDP mode

2018-11-30 Thread Michael S. Tsirkin
On Fri, Nov 30, 2018 at 08:10:58PM +, Saeed Mahameed wrote:
> On Thu, 2018-11-22 at 18:00 +0100, Toke Høiland-Jørgensen wrote:
> > David Ahern  writes:
> > 
> > > On 11/22/18 1:26 AM, Toke Høiland-Jørgensen wrote:
> > > > Saeed Mahameed  writes:
> > > > 
> > > > > > > I'd say it sounds reasonable to include XDP in the normal
> > > > > > > traffic
> > > > > > > counters, but having the detailed XDP-specific counters is
> > > > > > > quite
> > > > > > > useful
> > > > > > > as well... So can't we do both (for all drivers)?
> > > > > > > 
> > > > > 
> > > > > What are you thinking ? 
> > > > > reporting XDP_DROP in interface dropped counter ?
> > > > > and XDP_TX/REDIRECT in the TX counter ?
> > > > > XDP_ABORTED in the  err/drop counter ?
> > > > > 
> > > > > how about having a special XDP command in the .ndo_bpf that
> > > > > would query
> > > > > the standardized XDP stats ?
> > > > the XDP-specific stats are useful to have separately as well :)
> > > > 
> > > 
> > > I would like to see basic packets, bytes, and dropped counters
> > > tracked
> > > for Rx and Tx via the standard netdev counters for all devices. 
> 
> The problem of reporting XDP_DROP in the netdev drop counter is that
> they don't fit this counter description : "no space in linux buffers"
> and it will be hard for the user to determine whether these drops are
> coming from XDP or because no buffer is available, which will make it
> impossible to estimate packet rate performance without looking at
> ethtool stats.
> And reporting XDP_DROP in the netdev rx packets counter is somehow
> misleading.. since those packets never made it out of this driver.. 

I think I agree. XDP needs minimal overhead - if user wants to do
counters then user can via maps. And in a sense XDP dropping packet
is much like e.g. TCP dropping packet - it is not counted
against the driver since it's not driver's fault.


> > > for ease in accounting as well as speed and simplicity for bumping
> > > counters for virtual devices from bpf helpers.
> > > 
> > > From there, the XDP ones can be in the driver private stats as they
> > > are
> > > currently but with some consistency across drivers for redirects,
> > > drops,
> > > any thing else.
> > > 
> > > So not a radical departure from where we are today, just getting
> > > the
> > > agreement for consistency and driver owners to make the changes.
> > 
> > Sounds good to me :)
> > 
> > -Toke


Re: [Patch net-next 2/2] net: dump whole skb data in netdev_rx_csum_fault()

2018-11-30 Thread Saeed Mahameed
On Thu, 2018-11-22 at 17:45 -0800, Cong Wang wrote:
> On Wed, Nov 21, 2018 at 11:33 AM Saeed Mahameed 
> wrote:
> > On Wed, 2018-11-21 at 10:26 -0800, Eric Dumazet wrote:
> > > On Wed, Nov 21, 2018 at 10:17 AM Cong Wang <
> > > xiyou.wangc...@gmail.com>
> > > wrote:
> > > > On Wed, Nov 21, 2018 at 5:05 AM Eric Dumazet <
> > > > eric.duma...@gmail.com> wrote:
> > > > > 
> > > > > On 11/20/2018 06:13 PM, Cong Wang wrote:
> > > > > > Currently, we only dump a few selected skb fields in
> > > > > > > netdev_rx_csum_fault(). It is not sufficient for debugging
> > > > > > checksum
> > > > > > fault. This patch introduces skb_dump() which dumps skb mac
> > > > > > header,
> > > > > > network header and its whole skb->data too.
> > > > > > 
> > > > > > Cc: Herbert Xu 
> > > > > > Cc: Eric Dumazet 
> > > > > > Cc: David Miller 
> > > > > > Signed-off-by: Cong Wang 
> > > > > > ---
> > > > > > + print_hex_dump(level, "skb data: ",
> > > > > > DUMP_PREFIX_OFFSET,
> > > > > > 16, 1,
> > > > > > +skb->data, skb->len, false);
> > > > > 
> > > > > As I mentioned to David, we want all the bytes that were
> > > > > maybe
> > > > > already pulled
> > > > > 
> > > > > (skb->head starting point, not skb->data)
> > > > 
> > > > Hmm, with mac header and network header, it is effectively from
> > > > skb->head, no?
> > > > Is there anything between skb->head and mac header?
> > > 
> > > Oh, I guess we wanted a single hex dump, or we need some user
> > > program
> > > to be able to
> > > rebuild from different memory zones the original
> > > CHECKSUM_COMPLETE
> > > value.
> > > 
> > 
> > Normally the driver keeps some headroom @skb->head, so the actual
> > mac
> > header starts @ skb->head + driver_specific_headroom
> 
> Good to know, but this headroom isn't covered by skb->csum, so
> not useful here, right? The skb->csum for mlx5 only covers network
> header and its payload.

correct
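
Following Eric's point, the dump would start at skb->head and cover headroom
plus linear data, something like this hedged variant of the hunk quoted
above (note skb->len also counts paged frags, so the linear length is the
safe bound here):

        print_hex_dump(level, "skb data: ", DUMP_PREFIX_OFFSET, 16, 1,
                       skb->head,
                       skb_headroom(skb) + skb_headlen(skb), false);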



Re: consistency for statistics with XDP mode

2018-11-30 Thread Saeed Mahameed
On Thu, 2018-11-22 at 18:00 +0100, Toke Høiland-Jørgensen wrote:
> David Ahern  writes:
> 
> > On 11/22/18 1:26 AM, Toke Høiland-Jørgensen wrote:
> > > Saeed Mahameed  writes:
> > > 
> > > > > > I'd say it sounds reasonable to include XDP in the normal
> > > > > > traffic
> > > > > > counters, but having the detailed XDP-specific counters is
> > > > > > quite
> > > > > > useful
> > > > > > as well... So can't we do both (for all drivers)?
> > > > > > 
> > > > 
> > > > What are you thinking ? 
> > > > reporting XDP_DROP in interface dropped counter ?
> > > > and XDP_TX/REDIRECT in the TX counter ?
> > > > XDP_ABORTED in the  err/drop counter ?
> > > > 
> > > > how about having a special XDP command in the .ndo_bpf that
> > > > would query
> > > > the standardized XDP stats ?
> > > the XDP-specific stats are useful to have separately as well :)
> > > 
> > 
> > I would like to see basic packets, bytes, and dropped counters
> > tracked
> > for Rx and Tx via the standard netdev counters for all devices. 

The problem of reporting XDP_DROP in the netdev drop counter is that
they don't fit this counter description : "no space in linux buffers"
and it will be hard for the user to determine whether these drops are
coming from XDP or because no buffer is available, which will make it
impossible to estimate packet rate performance without looking at
ethtool stats.
And reporting XDP_DROP in the netdev rx packets counter is somehow
misleading.. since those packets never made it out of this driver.. 


> > for ease in accounting as well as speed and simplicity for bumping
> > counters for virtual devices from bpf helpers.
> > 
> > From there, the XDP ones can be in the driver private stats as they
> > are
> > currently but with some consistency across drivers for redirects,
> > drops,
> > any thing else.
> > 
> > So not a radical departure from where we are today, just getting
> > the
> > agreement for consistency and driver owners to make the changes.
> 
> Sounds good to me :)
> 
> -Toke


[PATCH bpf-next] bpf: allow BPF read access to qdisc pkt_len

2018-11-30 Thread Willem de Bruijn
From: Petar Penkov 

The pkt_len field in qdisc_skb_cb stores the skb length as it will
appear on the wire after segmentation. For byte accounting, this value
is more accurate than skb->len. It is computed on entry to the TC
layer, so only valid there.

Allow read access to this field from BPF tc classifier and action
programs. The implementation is analogous to tc_classid, aside from
being restricted to read access.

Signed-off-by: Petar Penkov 
Signed-off-by: Vlad Dumitrescu 
Signed-off-by: Willem de Bruijn 
---
 include/uapi/linux/bpf.h|  1 +
 net/core/filter.c   | 16 +++
 tools/include/uapi/linux/bpf.h  |  1 +
 tools/testing/selftests/bpf/test_verifier.c | 32 +
 4 files changed, 50 insertions(+)
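
A tc classifier would then read the field like any other __sk_buff member,
e.g. (a hedged sketch; SEC() comes from the selftests helpers and TC_ACT_*
from linux/pkt_cls.h):

SEC("classifier")
int drop_oversized(struct __sk_buff *skb)
{
        /* post-GSO on-wire length, not skb->len */
        if (skb->pkt_len > 1514)
                return TC_ACT_SHOT;
        return TC_ACT_OK;
}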

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 597afdbc1ab9..ce028482d423 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2483,6 +2483,7 @@ struct __sk_buff {
__u32 data_meta;
struct bpf_flow_keys *flow_keys;
__u64 tstamp;
+   __u32 pkt_len;
 };
 
 struct bpf_tunnel_key {
diff --git a/net/core/filter.c b/net/core/filter.c
index bd0df75dc7b6..af071ef22c0a 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5773,6 +5773,7 @@ static bool sk_filter_is_valid_access(int off, int size,
case bpf_ctx_range(struct __sk_buff, flow_keys):
case bpf_ctx_range_till(struct __sk_buff, family, local_port):
case bpf_ctx_range(struct __sk_buff, tstamp):
+   case bpf_ctx_range(struct __sk_buff, pkt_len):
return false;
}
 
@@ -5797,6 +5798,7 @@ static bool cg_skb_is_valid_access(int off, int size,
case bpf_ctx_range(struct __sk_buff, tc_classid):
case bpf_ctx_range(struct __sk_buff, data_meta):
case bpf_ctx_range(struct __sk_buff, flow_keys):
+   case bpf_ctx_range(struct __sk_buff, pkt_len):
return false;
case bpf_ctx_range(struct __sk_buff, data):
case bpf_ctx_range(struct __sk_buff, data_end):
@@ -5843,6 +5845,7 @@ static bool lwt_is_valid_access(int off, int size,
case bpf_ctx_range(struct __sk_buff, data_meta):
case bpf_ctx_range(struct __sk_buff, flow_keys):
case bpf_ctx_range(struct __sk_buff, tstamp):
+   case bpf_ctx_range(struct __sk_buff, pkt_len):
return false;
}
 
@@ -6273,6 +6276,7 @@ static bool sk_skb_is_valid_access(int off, int size,
case bpf_ctx_range(struct __sk_buff, data_meta):
case bpf_ctx_range(struct __sk_buff, flow_keys):
case bpf_ctx_range(struct __sk_buff, tstamp):
+   case bpf_ctx_range(struct __sk_buff, pkt_len):
return false;
}
 
@@ -6360,6 +6364,7 @@ static bool flow_dissector_is_valid_access(int off, int 
size,
case bpf_ctx_range(struct __sk_buff, data_meta):
case bpf_ctx_range_till(struct __sk_buff, family, local_port):
case bpf_ctx_range(struct __sk_buff, tstamp):
+   case bpf_ctx_range(struct __sk_buff, pkt_len):
return false;
}
 
@@ -6685,6 +6690,17 @@ static u32 bpf_convert_ctx_access(enum bpf_access_type 
type,
  bpf_target_off(struct sk_buff,
 tstamp, 8,
 target_size));
+   break;
+
+   case offsetof(struct __sk_buff, pkt_len):
+   BUILD_BUG_ON(FIELD_SIZEOF(struct qdisc_skb_cb, pkt_len) != 4);
+
+   off = si->off;
+   off -= offsetof(struct __sk_buff, pkt_len);
+   off += offsetof(struct sk_buff, cb);
+   off += offsetof(struct qdisc_skb_cb, pkt_len);
+   *target_size = 4;
+   *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg, off);
}
 
return insn - insn_buf;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 597afdbc1ab9..ce028482d423 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2483,6 +2483,7 @@ struct __sk_buff {
__u32 data_meta;
struct bpf_flow_keys *flow_keys;
__u64 tstamp;
+   __u32 pkt_len;
 };
 
 struct bpf_tunnel_key {
diff --git a/tools/testing/selftests/bpf/test_verifier.c 
b/tools/testing/selftests/bpf/test_verifier.c
index 17021d2b6bfe..0ab37fb4afad 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -13972,6 +13972,38 @@ static struct bpf_test tests[] = {
.result_unpriv = REJECT,
.result = ACCEPT,
},
+   {
+   "check pkt_len is not readable by sockets",
+   .insns = {
+   BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+   offsetof(struct __sk_buff, pkt_len)),
+   BPF_EXIT_INSN(),
+ 

Re: [PATCH bpf-next 1/2] bpf: add BPF_LWT_ENCAP_IP option to bpf_lwt_push_encap

2018-11-30 Thread David Ahern
On 11/28/18 6:34 PM, Peter Oskolkov wrote:
> On Wed, Nov 28, 2018 at 4:47 PM David Ahern  wrote:
>>
>> On 11/28/18 5:22 PM, Peter Oskolkov wrote:
>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>> index bd0df75dc7b6..17f3c37218e5 100644
>>> --- a/net/core/filter.c
>>> +++ b/net/core/filter.c
>>> @@ -4793,6 +4793,60 @@ static int bpf_push_seg6_encap(struct sk_buff *skb, 
>>> u32 type, void *hdr, u32 len
>>>  }
>>>  #endif /* CONFIG_IPV6_SEG6_BPF */
>>>
>>> +static int bpf_push_ip_encap(struct sk_buff *skb, void *hdr, u32 len)
>>> +{
>>> + struct dst_entry *dst;
>>> + struct rtable *rt;
>>> + struct iphdr *iph;
>>> + struct net *net;
>>> + int err;
>>> +
>>> + if (skb->protocol != htons(ETH_P_IP))
>>> + return -EINVAL;  /* ETH_P_IPV6 not yet supported */
>>> +
>>> + iph = (struct iphdr *)hdr;
>>> +
>>> + if (unlikely(len < sizeof(struct iphdr) || len > 
>>> LWTUNNEL_MAX_ENCAP_HSIZE))
>>> + return -EINVAL;
>>> + if (unlikely(iph->version != 4 || iph->ihl * 4 > len))
>>> + return -EINVAL;
>>> +
>>> + if (skb->sk)
>>> + net = sock_net(skb->sk);
>>> + else {
>>> + net = dev_net(skb_dst(skb)->dev);
>>> + }
>>> + rt = ip_route_output(net, iph->daddr, 0, 0, 0);
>>
>> That is a very limited use case. e.g., oif = 0 means you are not
>> considering any kind of policy routing (e.g., VRF).
> 
> Hi David! Could you be a bit more specific re: what you would like to
> see here? Thanks!
> 

Is the encap happening on ingress or egress? Seems like the current code
does not assume either direction for lwt (BPF_PROG_TYPE_LWT_IN vs
BPF_PROG_TYPE_LWT_OUT), yet your change does - output only. Basically,
you should be filling in a flow struct and doing a proper lookup.

When the potentially custom encap header is pushed on, seems to me skb
marks should still be honored for the route lookup. If not, you should
handle that in the API.

From there skb->dev at a minimum should be used as either iif (ingress)
or oif (egress).

The iph is already set so you have quick access to the tos.

Also, this should implement IPv6 as well before going in.
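
For the IPv4 side, "filling in a flow struct" would go roughly like this
(a hedged sketch of the direction being suggested, not code from the
patch):

        struct flowi4 fl4 = {};

        fl4.flowi4_mark = skb->mark;            /* honor skb marks */
        fl4.flowi4_oif  = skb_dst(skb) ? skb_dst(skb)->dev->ifindex : 0;
        fl4.flowi4_tos  = RT_TOS(iph->tos);
        fl4.daddr       = iph->daddr;

        rt = ip_route_output_key(net, &fl4);
        if (IS_ERR(rt))
                return PTR_ERR(rt);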


Re: [PATCH iproute2] stats output

2018-11-30 Thread Stephen Hemminger
On Fri, 30 Nov 2018 20:22:35 +0100
Alexis Vachette  wrote:

> Hi Stephen,
> 
> Thanks for your kind reply.
> 
> It's sad to hear that, I am aware of your concern too.
> 
> But it's not the best behavior for a network engineer.
> 
> Is it possible to add a new option or is everything set in stone?

You can add new options at will as long as they don't conflict with existing
usage (and you update the man page). JSON and colorized output were added
in recent years.

Note: the iproute options are definitely a hodge-podge collection, the code 
doesn't
follow the typical Linux style of using getopt and getopt_long, and has a bit of
network OS model as well.

What are you thinking of?


Re: [PATCHv2 net] sctp: check and update stream->out_curr when allocating stream_out

2018-11-30 Thread Neil Horman
On Sat, Dec 01, 2018 at 03:53:26AM +0900, Xin Long wrote:
> On Sat, Dec 1, 2018 at 12:23 AM Neil Horman  wrote:
> >
> > On Fri, Nov 30, 2018 at 10:48:10PM +0900, Xin Long wrote:
> > > On Fri, Nov 30, 2018 at 9:21 PM Neil Horman  wrote:
> > > >
> > > > On Fri, Nov 30, 2018 at 03:22:39PM +0900, Xin Long wrote:
> > > > > On Thu, Nov 29, 2018 at 11:39 PM Neil Horman  
> > > > > wrote:
> > > > > >
> > > > > > On Thu, Nov 29, 2018 at 02:42:56PM +0800, Xin Long wrote:
> > > > > > > Now when using stream reconfig to add out streams, stream->out
> > > > > > > will get re-allocated, and all old streams' information will
> > > > > > > be copied to the new ones and the old ones will be freed.
> > > > > > >
> > > > > > > So without stream->out_curr updated, next time when trying to
> > > > > > > send from stream->out_curr stream, a panic would be caused.
> > > > > > >
> > > > > > > This patch is to check and update stream->out_curr when
> > > > > > > allocating stream_out.
> > > > > > >
> > > > > > > v1->v2:
> > > > > > >   - define fa_index() to get elem index from stream->out_curr.
> > > > > > >
> > > > > > > Fixes: 5e32a431 ("sctp: introduce stream scheduler 
> > > > > > > foundations")
> > > > > > > Reported-by: Ying Xu 
> > > > > > > Reported-by: syzbot+e33a3a138267ca119...@syzkaller.appspotmail.com
> > > > > > > Signed-off-by: Xin Long 
> > > > > > > ---
> > > > > > >  net/sctp/stream.c | 20 
> > > > > > >  1 file changed, 20 insertions(+)
> > > > > > >
> > > > > > > diff --git a/net/sctp/stream.c b/net/sctp/stream.c
> > > > > > > index 3892e76..30e7809 100644
> > > > > > > --- a/net/sctp/stream.c
> > > > > > > +++ b/net/sctp/stream.c
> > > > > > > @@ -84,6 +84,19 @@ static void fa_zero(struct flex_array *fa, 
> > > > > > > size_t index, size_t count)
> > > > > > >   }
> > > > > > >  }
> > > > > > >
> > > > > > > +static size_t fa_index(struct flex_array *fa, void *elem, size_t 
> > > > > > > count)
> > > > > > > +{
> > > > > > > + size_t index = 0;
> > > > > > > +
> > > > > > > + while (count--) {
> > > > > > > + if (elem == flex_array_get(fa, index))
> > > > > > > + break;
> > > > > > > + index++;
> > > > > > > + }
> > > > > > > +
> > > > > > > + return index;
> > > > > > > +}
> > > > > > > +
> > > > > > >  /* Migrates chunks from stream queues to new stream queues if 
> > > > > > > needed,
> > > > > > >   * but not across associations. Also, removes those chunks to 
> > > > > > > streams
> > > > > > >   * higher than the new max.
> > > > > > > @@ -147,6 +160,13 @@ static int sctp_stream_alloc_out(struct 
> > > > > > > sctp_stream *stream, __u16 outcnt,
> > > > > > >
> > > > > > >   if (stream->out) {
> > > > > > >   fa_copy(out, stream->out, 0, min(outcnt, 
> > > > > > > stream->outcnt));
> > > > > > > + if (stream->out_curr) {
> > > > > > > + size_t index = fa_index(stream->out, 
> > > > > > > stream->out_curr,
> > > > > > > + stream->outcnt);
> > > > > > > +
> > > > > > > + BUG_ON(index == stream->outcnt);
> > > > > > > + stream->out_curr = flex_array_get(out, 
> > > > > > > index);
> > > > > > > + }
> > > > > > >   fa_free(stream->out);
> > > > > > >   }
> > > > > > >
> > > > > > > --
> > > > > > > 2.1.0
> > > > > > >
> > > > > > >
> > > > > >
> > > > > > This is the sort of thing I'm talking about. Its a little more 
> > > > > > code, but if you
> > > > > > augment the flex_array api like this, you can preform a resize 
> > > > > > operation on your
> > > > > > existing flex array, and you can avoid all the copying, and need to 
> > > > > > update
> > > > > > pointers maintained outside the array.  Note this code isn't tested 
> > > > > > at all, but
> > > > > > its close to what I think should work.
> > > > > >
> > > > > >
> > > > > > diff --git a/include/linux/flex_array.h b/include/linux/flex_array.h
> > > > > > index b94fa61b51fb..7fa1f27a91b5 100644
> > > > > > --- a/include/linux/flex_array.h
> > > > > > +++ b/include/linux/flex_array.h
> > > > > > @@ -73,6 +73,8 @@ struct flex_array {
> > > > > >  struct flex_array *flex_array_alloc(int element_size, unsigned int 
> > > > > > total,
> > > > > > gfp_t flags);
> > > > > >
> > > > > > +struct flex_array *flex_array_resize(struct flex_array *fa, 
> > > > > > unsigned int total, gfp_t flags);
> > > > > > +
> > > > > >  /**
> > > > > >   * flex_array_prealloc() - Ensures that memory for the elements 
> > > > > > indexed in the
> > > > > >   * range defined by start and nr_elements has been allocated.
> > > > > > diff --git a/lib/flex_array.c b/lib/flex_array.c
> > > > > > index 2eed22fa507c..f8d54af3891b 100644
> > > > > > --- a/lib/flex_array.c
> > > > > > +++ b/lib/flex_array.c
> > > > > > @@ -109,6 +109,7 @@ struct flex_array *flex_array_alloc(int 
> > > > > > element_size, unsigned int 

Re: [PATCH iproute2] ss: add support for bytes_sent, bytes_retrans, dsack_dups and reord_seen

2018-11-30 Thread Stefano Brivio
Hi David,

On Thu, 29 Nov 2018 11:51:48 -0700
David Ahern  wrote:

> On 11/29/18 11:50 AM, Stephen Hemminger wrote:
> > PS: ss still doesn't support JSON output, given the volume of output it 
> > would be good.  
> 
> I thought Stefano was investigating it as an alternative to the 'display
> selected columns' patches.

Thanks for Cc'ing me, I missed this.

I'm not really investigating it at the moment, I'm convinced it's not
complicated, it's on my to-do list, but I'm not working on it right now
(and maybe also not in a very near future).

I would still say the "display selected columns" patches could be a
stopgap for now (even though it's not related to this one patch).

-- 
Stefano


Re: [PATCH iproute2] stats output

2018-11-30 Thread Alexis Vachette
Hi Stephen,

Thanks for your kind reply.

It's sad to hear that, I am aware of your concern too.

But it's not the best behavior for a network engineer.

Is it possible to add a new option or is everything set in stone?

If yes I will fix my e-mail client to submit a correct PR.
On Fri, 30 Nov 2018 at 19:12, Stephen Hemminger
 wrote:
>
> On Fri, 30 Nov 2018 14:33:49 +0100
> Alexis Vachette  wrote:
>
> > When using:
> >
> > - ip -s link
> >
> > I think it would be better to print error stats without adding the -s
> > option twice.
> >
> > This option prints stats for each network interface and we want to see
> > if something is off, especially timers with errors.
> >
> > Man page of ip command is updated accordingly.
> >
> > Here is a patchset:
> >
> > Signed-off-by: Alexis Vachette 
> > ---
> > diff --git a/ip/ipaddress.c b/ip/ipaddress.c
> > index 85f05a2..c70394d 100644
> > --- a/ip/ipaddress.c
> > +++ b/ip/ipaddress.c
> > @@ -323,7 +323,7 @@ int print_linkinfo(const struct sockaddr_nl *who,
> >   fprintf(fp,"\nalias %s",
> >   (const char *) RTA_DATA(tb[IFLA_IFALIAS]));
> >
> > - if (do_link && tb[IFLA_STATS64] && show_stats) {
> > + if (do_link && tb[IFLA_STATS64]) {
> >   struct rtnl_link_stats64 slocal;
> >   struct rtnl_link_stats64 *s = RTA_DATA(tb[IFLA_STATS64]);
> >   if (((unsigned long)s) & (sizeof(unsigned long)-1)) {
> > @@ -343,16 +343,14 @@ int print_linkinfo(const struct sockaddr_nl *who,
> >   if (s->rx_compressed)
> >   fprintf(fp, " %-7llu",
> >   (unsigned long long)s->rx_compressed);
> > - if (show_stats > 1) {
> > - fprintf(fp, "%s", _SL_);
> > - fprintf(fp, "RX errors: length  crc frame   fifomissed%s", 
> > _SL_);
> > - fprintf(fp, "   %-7llu  %-7llu %-7llu %-7llu %-7llu",
> > - (unsigned long long)s->rx_length_errors,
> > - (unsigned long long)s->rx_crc_errors,
> > - (unsigned long long)s->rx_frame_errors,
> > - (unsigned long long)s->rx_fifo_errors,
> > - (unsigned long long)s->rx_missed_errors);
> > - }
> > + fprintf(fp, "%s", _SL_);
> > + fprintf(fp, "RX errors: length  crc frame   fifomissed%s", 
> > _SL_);
> > + fprintf(fp, "   %-7llu  %-7llu %-7llu %-7llu %-7llu",
> > + (unsigned long long)s->rx_length_errors,
> > + (unsigned long long)s->rx_crc_errors,
> > + (unsigned long long)s->rx_frame_errors,
> > + (unsigned long long)s->rx_fifo_errors,
> > + (unsigned long long)s->rx_missed_errors);
> >   fprintf(fp, "%s", _SL_);
> >   fprintf(fp, "TX: bytes  packets  errors  dropped carrier collsns 
> > %s%s",
> >   s->tx_compressed ? "compressed" : "", _SL_);
> > @@ -366,17 +364,15 @@ int print_linkinfo(const struct sockaddr_nl *who,
> >   if (s->tx_compressed)
> >   fprintf(fp, " %-7llu",
> >   (unsigned long long)s->tx_compressed);
> > - if (show_stats > 1) {
> > - fprintf(fp, "%s", _SL_);
> > - fprintf(fp, "TX errors: aborted fifowindow  heartbeat%s", _SL_);
> > - fprintf(fp, "   %-7llu  %-7llu %-7llu %-7llu",
> > - (unsigned long long)s->tx_aborted_errors,
> > - (unsigned long long)s->tx_fifo_errors,
> > - (unsigned long long)s->tx_window_errors,
> > - (unsigned long long)s->tx_heartbeat_errors);
> > - }
> > + fprintf(fp, "%s", _SL_);
> > + fprintf(fp, "TX errors: aborted fifowindow  heartbeat%s", _SL_);
> > + fprintf(fp, "   %-7llu  %-7llu %-7llu %-7llu",
> > + (unsigned long long)s->tx_aborted_errors,
> > + (unsigned long long)s->tx_fifo_errors,
> > + (unsigned long long)s->tx_window_errors,
> > + (unsigned long long)s->tx_heartbeat_errors);
> >   }
> > - if (do_link && !tb[IFLA_STATS64] && tb[IFLA_STATS] && show_stats) {
> > + if (do_link && !tb[IFLA_STATS64] && tb[IFLA_STATS]) {
> >   struct rtnl_link_stats slocal;
> >   struct rtnl_link_stats *s = RTA_DATA(tb[IFLA_STATS]);
> >   if (((unsigned long)s) & (sizeof(unsigned long)-1)) {
> > @@ -393,17 +389,15 @@ int print_linkinfo(const struct sockaddr_nl *who,
> >   );
> >   if (s->rx_compressed)
> >   fprintf(fp, " %-7u", s->rx_compressed);
> > - if (show_stats > 1) {
> > - fprintf(fp, "%s", _SL_);
> > - fprintf(fp, "RX errors: length  crc frame   fifomissed%s", 
> > _SL_);
> > - fprintf(fp, "   %-7u  %-7u %-7u %-7u %-7u",
> > - s->rx_length_errors,
> > - s->rx_crc_errors,
> > - s->rx_frame_errors,
> > - s->rx_fifo_errors,
> > - s->rx_missed_errors
> > - );
> > - }
> > + fprintf(fp, "%s", _SL_);
> > + fprintf(fp, "RX errors: length  crc frame   fifomissed%s", 
> > _SL_);
> > + fprintf(fp, "   %-7u  %-7u %-7u %-7u %-7u",
> > + s->rx_length_errors,
> > + s->rx_crc_errors,
> > + s->rx_frame_errors,
> > + s->rx_fifo_errors,
> > + s->rx_missed_errors
> > + );
> >   fprintf(fp, "%s", _SL_);
> >   fprintf(fp, "TX: bytes  packets  errors  dropped carrier collsns 
> > %s%s",
> >   s->tx_compressed ? "compressed" : "", _SL_);
> > @@ -412,16 +406,14 @@ int print_linkinfo(const struct sockaddr_nl *who,
> >   s->tx_dropped, s->tx_carrier_errors, s->collisions);
> >   if (s->tx_compressed)
> 

Re: [PATCH net-next 1/2] ixgbe: register a mdiobus

2018-11-30 Thread Andrew Lunn
> 'cards_found' doesn't exist for the ixgbe driver.

Agh, sorry, I was looking at ixgb, not ixgbe.

 Andrew


Re: [PATCHv2 net] sctp: check and update stream->out_curr when allocating stream_out

2018-11-30 Thread Xin Long
On Sat, Dec 1, 2018 at 12:23 AM Neil Horman  wrote:
>
> On Fri, Nov 30, 2018 at 10:48:10PM +0900, Xin Long wrote:
> > On Fri, Nov 30, 2018 at 9:21 PM Neil Horman  wrote:
> > >
> > > On Fri, Nov 30, 2018 at 03:22:39PM +0900, Xin Long wrote:
> > > > On Thu, Nov 29, 2018 at 11:39 PM Neil Horman  
> > > > wrote:
> > > > >
> > > > > On Thu, Nov 29, 2018 at 02:42:56PM +0800, Xin Long wrote:
> > > > > > Now when using stream reconfig to add out streams, stream->out
> > > > > > will get re-allocated, and all old streams' information will
> > > > > > be copied to the new ones and the old ones will be freed.
> > > > > >
> > > > > > So without stream->out_curr updated, next time when trying to
> > > > > > send from stream->out_curr stream, a panic would be caused.
> > > > > >
> > > > > > This patch is to check and update stream->out_curr when
> > > > > > allocating stream_out.
> > > > > >
> > > > > > v1->v2:
> > > > > >   - define fa_index() to get elem index from stream->out_curr.
> > > > > >
> > > > > > Fixes: 5e32a431 ("sctp: introduce stream scheduler foundations")
> > > > > > Reported-by: Ying Xu 
> > > > > > Reported-by: syzbot+e33a3a138267ca119...@syzkaller.appspotmail.com
> > > > > > Signed-off-by: Xin Long 
> > > > > > ---
> > > > > >  net/sctp/stream.c | 20 
> > > > > >  1 file changed, 20 insertions(+)
> > > > > >
> > > > > > diff --git a/net/sctp/stream.c b/net/sctp/stream.c
> > > > > > index 3892e76..30e7809 100644
> > > > > > --- a/net/sctp/stream.c
> > > > > > +++ b/net/sctp/stream.c
> > > > > > @@ -84,6 +84,19 @@ static void fa_zero(struct flex_array *fa, 
> > > > > > size_t index, size_t count)
> > > > > >   }
> > > > > >  }
> > > > > >
> > > > > > +static size_t fa_index(struct flex_array *fa, void *elem, size_t 
> > > > > > count)
> > > > > > +{
> > > > > > + size_t index = 0;
> > > > > > +
> > > > > > + while (count--) {
> > > > > > + if (elem == flex_array_get(fa, index))
> > > > > > + break;
> > > > > > + index++;
> > > > > > + }
> > > > > > +
> > > > > > + return index;
> > > > > > +}
> > > > > > +
> > > > > >  /* Migrates chunks from stream queues to new stream queues if 
> > > > > > needed,
> > > > > >   * but not across associations. Also, removes those chunks to 
> > > > > > streams
> > > > > >   * higher than the new max.
> > > > > > @@ -147,6 +160,13 @@ static int sctp_stream_alloc_out(struct 
> > > > > > sctp_stream *stream, __u16 outcnt,
> > > > > >
> > > > > >   if (stream->out) {
> > > > > >   fa_copy(out, stream->out, 0, min(outcnt, 
> > > > > > stream->outcnt));
> > > > > > + if (stream->out_curr) {
> > > > > > + size_t index = fa_index(stream->out, 
> > > > > > stream->out_curr,
> > > > > > + stream->outcnt);
> > > > > > +
> > > > > > + BUG_ON(index == stream->outcnt);
> > > > > > + stream->out_curr = flex_array_get(out, index);
> > > > > > + }
> > > > > >   fa_free(stream->out);
> > > > > >   }
> > > > > >
> > > > > > --
> > > > > > 2.1.0
> > > > > >
> > > > > >
> > > > >
> > > > > This is the sort of thing I'm talking about. Its a little more code, 
> > > > > but if you
> > > > > augment the flex_array api like this, you can preform a resize 
> > > > > operation on your
> > > > > existing flex array, and you can avoid all the copying, and need to 
> > > > > update
> > > > > pointers maintained outside the array.  Note this code isn't tested 
> > > > > at all, but
> > > > > its close to what I think should work.
> > > > >
> > > > >
> > > > > diff --git a/include/linux/flex_array.h b/include/linux/flex_array.h
> > > > > index b94fa61b51fb..7fa1f27a91b5 100644
> > > > > --- a/include/linux/flex_array.h
> > > > > +++ b/include/linux/flex_array.h
> > > > > @@ -73,6 +73,8 @@ struct flex_array {
> > > > >  struct flex_array *flex_array_alloc(int element_size, unsigned int 
> > > > > total,
> > > > > gfp_t flags);
> > > > >
> > > > > +struct flex_array *flex_array_resize(struct flex_array *fa, unsigned 
> > > > > int total, gfp_t flags);
> > > > > +
> > > > >  /**
> > > > >   * flex_array_prealloc() - Ensures that memory for the elements 
> > > > > indexed in the
> > > > >   * range defined by start and nr_elements has been allocated.
> > > > > diff --git a/lib/flex_array.c b/lib/flex_array.c
> > > > > index 2eed22fa507c..f8d54af3891b 100644
> > > > > --- a/lib/flex_array.c
> > > > > +++ b/lib/flex_array.c
> > > > > @@ -109,6 +109,7 @@ struct flex_array *flex_array_alloc(int 
> > > > > element_size, unsigned int total,
> > > > > ret->total_nr_elements = total;
> > > > > ret->elems_per_part = elems_per_part;
> > > > > ret->reciprocal_elems = reciprocal_elems;
> > > > > +   ret->elements_used = 0;
> > > > > if (elements_fit_in_base(ret) && !(flags & __GFP_ZERO))
> > > > > 

[PATCH 2/2] net: phy: ensure autoneg is configured when resuming a phydev

2018-11-30 Thread Anssi Hannula
When a PHY_HALTED phydev is resumed by phy_start(), it is set to
PHY_RESUMING to wait for the link to come up.

However, if the phydev was put to PHY_HALTED (by e.g. phy_stop()) before
autonegotiation was ever started by phy_state_machine(), autonegotiation
remains unconfigured, i.e. phy_config_aneg() is never called.

Fix that by going to PHY_UP instead of PHY_RESUMING if autonegotiation
has never been configured.

Signed-off-by: Anssi Hannula 
---

This doesn't feel as clean as I'd like, though. Maybe there is a
better way?


 drivers/net/phy/phy.c | 11 ++-
 include/linux/phy.h   |  2 ++
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index d46858694db9..7a650cce7599 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -553,6 +553,8 @@ int phy_start_aneg(struct phy_device *phydev)
if (err < 0)
goto out_unlock;
 
+   phydev->autoneg_configured = 1;
+
if (phydev->state != PHY_HALTED) {
if (AUTONEG_ENABLE == phydev->autoneg) {
err = phy_check_link_status(phydev);
@@ -893,7 +895,14 @@ void phy_start(struct phy_device *phydev)
break;
}
 
-   phydev->state = PHY_RESUMING;
+   /* if autoneg/forcing was never configured, go back to PHY_UP
+ * to make that happen
+ */
+   if (!phydev->autoneg_configured)
+   phydev->state = PHY_UP;
+   else
+   phydev->state = PHY_RESUMING;
+
break;
default:
break;
diff --git a/include/linux/phy.h b/include/linux/phy.h
index 8f927246acdb..b306d5ed9d09 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -389,6 +389,8 @@ struct phy_device {
unsigned loopback_enabled:1;
 
unsigned autoneg:1;
+   /* autoneg has been configured at least once */
+   unsigned autoneg_configured:1;
/* The most recently read link state */
unsigned link:1;
 
-- 
2.17.2



[PATCH 1/2] net: phy: suspend phydev on PHY_HALTED even if there is no link

2018-11-30 Thread Anssi Hannula
When the phydev is put to PHY_HALTED (by e.g. phy_stop()) the state
machine suspends the phydev to save power.

However, this is wrongly inside a phydev->link check, causing the phydev
not to be suspended if there was no link at the time of stopping it.

Fix that by setting do_suspend regardless of whether there was a link or
not.

Signed-off-by: Anssi Hannula 
Fixes: be9dad1f9f26 ("net: phy: suspend phydev when going to HALTED")
Cc: Sebastian Hesselbarth 
---
 drivers/net/phy/phy.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index 376a0d8a2b61..d46858694db9 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -956,8 +956,8 @@ void phy_state_machine(struct work_struct *work)
if (phydev->link) {
phydev->link = 0;
phy_link_down(phydev, true);
-   do_suspend = true;
}
+   do_suspend = true;
break;
}
 
-- 
2.17.2



[PATCH 0/2] net: phy: fixes to PHY_HALTED handling

2018-11-30 Thread Anssi Hannula
Hi all,

Here are a couple of fixes for PHY_HALTED/PHY_RESUMING handling.


On a related note, it feels a bit strange that AFAICS phydevs will only
be put to powerdown state (via PHY_HALTED) after the network interface
has been brought up and down once. If the ethernet interface is never
brought up, the phydev remains powered on (in PHY_READY).

This is because the phydev is only put to PHY_HALTED from phy_stop() and
phy_error(), and phy_stop() seems to be normally called only from
.ndo_stop().

But I didn't touch that now as I wasn't sure what the intent is there.
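
For context, the usual MAC driver pattern looks roughly like this (a
generic sketch with hypothetical foo_* names, not any specific driver):
phy_start() is called only from .ndo_open and phy_stop() only from
.ndo_stop, so a phydev that was merely connected stays powered in
PHY_READY until the interface has been cycled once.

static int foo_ndo_open(struct net_device *ndev)
{
	/* PHY_READY/PHY_HALTED -> PHY_UP/PHY_RESUMING */
	phy_start(ndev->phydev);
	return 0;
}

static int foo_ndo_stop(struct net_device *ndev)
{
	/* -> PHY_HALTED; the state machine then suspends the PHY */
	phy_stop(ndev->phydev);
	return 0;
}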


Anssi Hannula (2):
  net: phy: suspend phydev on PHY_HALTED even if there is no link
  net: phy: ensure autoneg is configured when resuming a phydev

 drivers/net/phy/phy.c | 13 +++--
 include/linux/phy.h   |  2 ++
 2 files changed, 13 insertions(+), 2 deletions(-)

-- 
Anssi Hannula / Bitwise Oy



Re: [PATCH net-next 1/2] ixgbe: register a mdiobus

2018-11-30 Thread Steve Douthit
On 11/30/18 12:43 PM, Florian Fainelli wrote:
> 
> 
> On 11/30/2018 9:34 AM, Steve Douthit wrote:
>> On 11/30/18 11:34 AM, Andrew Lunn wrote:
 Yep, registering multiple interfaces is wrong.  The first board I tested
 against only had a single MAC enabled (they can be disabled/hidden via
 straps) so it just happened to work.
>>>
>>> Hi Steve
>>>
>>> Can you hide any/all via straps, or is 00.0 always guaranteed to
>>> exist?
>>
>> You can hide all the devices, but if function 1 is enabled then function
>> 0 must also be enabled, so not all combinations are valid.
>>
 The Intel C3xxx family of SoCs have up to four ixgbe MACs.  These are
 structured as two devices of two functions each on fixed internal root
 ports.

 from lspci:
 
   +-16.0-[05]--+-00.0
   |            \-00.1
   +-17.0-[06]--+-00.0
   |            \-00.1
 
>>>
>>> Is there any other hardware resource which is shared between the MAC
>>> interfaces? I'm just wondering if the driver has already solved this
>>> once. Is there an EEPROM per interface for the MAC address, or one
>>> shared EEPROM?
>>
>> NVM is shared, it's actually the same SPI flash that holds the BIOS for
>> this SoC.  Access is serialized by the swfw_sync semaphore.  I think the
>> device firmware is automagically handling offset translation.
>>
>> I don't think that helps for this case.
>>
>> There might be a better match for shared resources, but nothing springs
>> to mind.
>>
>>> Ah, how about using the 'cards_found' variable. It is not
>>> perfect, in that it is not decremented in ixgb_remove(), and I wonder
>>> about race conditions since there does not appear to be any lock when
>>> it is incremented. But if cards_found == 0, register the MDIO bus.
>>
>> 'cards_found' doesn't exist for the ixgbe driver.  I could add it and
>> fix the race/decrement issues you mention, but it'd have to be a device
>> type specific count.  It's still possible there are other non-x550em_a
>> ixgbe devices in external PCIe slots that have different resource
>> sharing setups.
>>
>> It's still a lighter-weight solution than poking around the parent bus,
>> so I'll add an 'x550em_a_devs_found' counter to v2.
>>
> 
> If the MDIO bus block is hardwired to function 0, would it be acceptable
> to walk the PCI bus hierarchy from any driver's probe routine where PCI
> function != 0 and see if it is claimed/enabled already, and if so, skip
> registering the MDIO bus entirely?

It's not really hardwired to any function at the hardware level AFAICT.
Unless by 'hardwired' you mean we always register it to the same devfn?
Walking the bus to find the lowest numbered devfn was my original
suggestion.  I'm good with a PCI walk or a counter to choose which MAC
registers the MDIO bus.  Let me know what the consensus is.
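
For illustration, the bus-walk variant could look roughly like this
(hypothetical helper, not ixgbe code; it assumes all candidate
functions sit on the same PCI bus, as in the lspci output above, and
that strap-disabled functions never enumerate and so never appear on
bus->devices):

#include <linux/pci.h>

/* Only the lowest enabled devfn on the bus owns the shared MDIO bus
 * and registers it; every other function skips registration.
 */
static bool foo_owns_mdio_bus(struct pci_dev *pdev)
{
	struct pci_dev *other;

	list_for_each_entry(other, &pdev->bus->devices, bus_list) {
		if (other->devfn < pdev->devfn)
			return false;
	}
	return true;
}

A probe routine would call this once and simply skip mdiobus_register()
when it returns false.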

> There is also possibly a problem if you have a shared MDIO device, and
> you turn off/power off the PCI function that does provide it. How can we
> make sure it remains alive for other functions to use it?

I'm not sure this is an issue.  I wouldn't expect the other MACs to lose
the ability to talk to their PHYs.  Those callbacks weren't changed to
use the registered MDIO bus.


Re: [PATCH] net: tcp: add correct check for tcp_retransmit_skb()

2018-11-30 Thread Yuchung Cheng
On Fri, Nov 30, 2018 at 10:28 AM Sharath Chandra Vurukala
 wrote:
>
> When tcp_retransmit_timer() expires and tcp_retransmit_skb() is
> called, if the retransmission fails due to local congestion,
> backoff should not be incremented.
>
> tcp_retransmit_skb() returns a negative value in some failure cases,
> but the caller tcp_retransmit_timer() only checks whether the return
> value is greater than zero. Correct the check to test for any
> non-zero value.
>
> Signed-off-by: Sharath Chandra Vurukala 
Perhaps my previous comment was not clear: your bug-fix patch is incorrect.

On local congestion, tcp_retransmit_skb() returns positive values
*only*; negative values do not indicate local congestion.
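
To illustrate the semantics being described (a sketch of the caller's
intent only, not the actual kernel source):

	int err = tcp_retransmit_skb(sk, tcp_rtx_queue_head(sk), 1);

	if (err > 0) {
		/* local congestion: retry soon, do not grow the backoff */
	} else if (err < 0) {
		/* hard failure: the normal backoff path still applies */
	}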

> ---
>  net/ipv4/tcp_timer.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index 091c5392..c19f371 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -511,7 +511,7 @@ void tcp_retransmit_timer(struct sock *sk)
>
> tcp_enter_loss(sk);
>
> -   if (tcp_retransmit_skb(sk, tcp_rtx_queue_head(sk), 1) > 0) {
> +   if (tcp_retransmit_skb(sk, tcp_rtx_queue_head(sk), 1) != 0) {
> /* Retransmission failed because of local congestion,
>  * do not backoff.
>  */
> --
> 1.9.1
>


Re: [PATCH] net: tcp: add correct check for tcp_retransmit_skb()

2018-11-30 Thread Eric Dumazet



On 11/30/2018 10:28 AM, Sharath Chandra Vurukala wrote:
> When tcp_retransmit_timer() expires and tcp_retransmit_skb() is
> called, if the retransmission fails due to local congestion,
> backoff should not be incremented.
> 
> tcp_retransmit_skb() returns a negative value in some failure cases,
> but the caller tcp_retransmit_timer() only checks whether the return
> value is greater than zero. Correct the check to test for any
> non-zero value.
> 
> Signed-off-by: Sharath Chandra Vurukala 
>

This looks wrong.

Yuchung has cooked a patch series that really addresses these issues;
please wait for it.

We are waiting for David Miller to merge a prior patch series into the
net tree, and then into net-next.

Thanks.




Re: [PATCH iproute2] stats output

2018-11-30 Thread David Ahern
On 11/30/18 11:12 AM, Stephen Hemminger wrote:
> I can understand why you would want this, but it is changing the
> behavior of an existing command that might be used in scripts.

+1
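
For reference, the behavior change under discussion (illustrative
invocations; the exact output layout varies by version): today the
error breakdown only appears when -s is repeated, and the patch would
fold it into a single -s, which is exactly what existing scripts may
not expect.

# current behavior: the "RX errors:" / "TX errors:" rows need -s twice
ip -s -s link show dev eth0

# with the proposed patch, a single -s would print them as well
ip -s link show dev eth0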


[PATCH] net: tcp: add correct check for tcp_retransmit_skb()

2018-11-30 Thread Sharath Chandra Vurukala
When tcp_retransmit_timer() expires and tcp_retransmit_skb() is
called, if the retransmission fails due to local congestion,
backoff should not be incremented.

tcp_retransmit_skb() returns a negative value in some failure cases,
but the caller tcp_retransmit_timer() only checks whether the return
value is greater than zero. Correct the check to test for any
non-zero value.

Signed-off-by: Sharath Chandra Vurukala 
---
 net/ipv4/tcp_timer.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 091c5392..c19f371 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -511,7 +511,7 @@ void tcp_retransmit_timer(struct sock *sk)
 
tcp_enter_loss(sk);
 
-   if (tcp_retransmit_skb(sk, tcp_rtx_queue_head(sk), 1) > 0) {
+   if (tcp_retransmit_skb(sk, tcp_rtx_queue_head(sk), 1) != 0) {
/* Retransmission failed because of local congestion,
 * do not backoff.
 */
-- 
1.9.1



Re: [PATCH iproute2] stats output

2018-11-30 Thread Roopa Prabhu
On Fri, Nov 30, 2018 at 10:12 AM Stephen Hemminger
 wrote:
>
> On Fri, 30 Nov 2018 14:33:49 +0100
> Alexis Vachette  wrote:
>
> > When using:
> >
> > - ip -s link
> >
> > I think it would be better to print error stats without adding the -s
> > option twice.
> >
> > This option prints stats for each network interface, and we want to see
> > whether something is off, especially the error counters.
> >
> > The man page of the ip command is updated accordingly.
> >
> > Here is the patch:
> >
> > Signed-off-by: Alexis Vachette 
> > ---
> > diff --git a/ip/ipaddress.c b/ip/ipaddress.c
> > index 85f05a2..c70394d 100644
> > --- a/ip/ipaddress.c
> > +++ b/ip/ipaddress.c
> > @@ -323,7 +323,7 @@ int print_linkinfo(const struct sockaddr_nl *who,
> >   fprintf(fp,"\nalias %s",
> >   (const char *) RTA_DATA(tb[IFLA_IFALIAS]));
> >
> > - if (do_link && tb[IFLA_STATS64] && show_stats) {
> > + if (do_link && tb[IFLA_STATS64]) {
> >   struct rtnl_link_stats64 slocal;
> >   struct rtnl_link_stats64 *s = RTA_DATA(tb[IFLA_STATS64]);
> >   if (((unsigned long)s) & (sizeof(unsigned long)-1)) {
> > @@ -343,16 +343,14 @@ int print_linkinfo(const struct sockaddr_nl *who,
> >   if (s->rx_compressed)
> >   fprintf(fp, " %-7llu",
> >   (unsigned long long)s->rx_compressed);
> > - if (show_stats > 1) {
> > - fprintf(fp, "%s", _SL_);
> > - fprintf(fp, "RX errors: length  crc frame   fifomissed%s", 
> > _SL_);
> > - fprintf(fp, "   %-7llu  %-7llu %-7llu %-7llu %-7llu",
> > - (unsigned long long)s->rx_length_errors,
> > - (unsigned long long)s->rx_crc_errors,
> > - (unsigned long long)s->rx_frame_errors,
> > - (unsigned long long)s->rx_fifo_errors,
> > - (unsigned long long)s->rx_missed_errors);
> > - }
> > + fprintf(fp, "%s", _SL_);
> > + fprintf(fp, "RX errors: length  crc frame   fifomissed%s", 
> > _SL_);
> > + fprintf(fp, "   %-7llu  %-7llu %-7llu %-7llu %-7llu",
> > + (unsigned long long)s->rx_length_errors,
> > + (unsigned long long)s->rx_crc_errors,
> > + (unsigned long long)s->rx_frame_errors,
> > + (unsigned long long)s->rx_fifo_errors,
> > + (unsigned long long)s->rx_missed_errors);
> >   fprintf(fp, "%s", _SL_);
> >   fprintf(fp, "TX: bytes  packets  errors  dropped carrier collsns 
> > %s%s",
> >   s->tx_compressed ? "compressed" : "", _SL_);
> > @@ -366,17 +364,15 @@ int print_linkinfo(const struct sockaddr_nl *who,
> >   if (s->tx_compressed)
> >   fprintf(fp, " %-7llu",
> >   (unsigned long long)s->tx_compressed);
> > - if (show_stats > 1) {
> > - fprintf(fp, "%s", _SL_);
> > - fprintf(fp, "TX errors: aborted fifowindow  heartbeat%s", _SL_);
> > - fprintf(fp, "   %-7llu  %-7llu %-7llu %-7llu",
> > - (unsigned long long)s->tx_aborted_errors,
> > - (unsigned long long)s->tx_fifo_errors,
> > - (unsigned long long)s->tx_window_errors,
> > - (unsigned long long)s->tx_heartbeat_errors);
> > - }
> > + fprintf(fp, "%s", _SL_);
> > + fprintf(fp, "TX errors: aborted fifowindow  heartbeat%s", _SL_);
> > + fprintf(fp, "   %-7llu  %-7llu %-7llu %-7llu",
> > + (unsigned long long)s->tx_aborted_errors,
> > + (unsigned long long)s->tx_fifo_errors,
> > + (unsigned long long)s->tx_window_errors,
> > + (unsigned long long)s->tx_heartbeat_errors);
> >   }
> > - if (do_link && !tb[IFLA_STATS64] && tb[IFLA_STATS] && show_stats) {
> > + if (do_link && !tb[IFLA_STATS64] && tb[IFLA_STATS]) {
> >   struct rtnl_link_stats slocal;
> >   struct rtnl_link_stats *s = RTA_DATA(tb[IFLA_STATS]);
> >   if (((unsigned long)s) & (sizeof(unsigned long)-1)) {
> > @@ -393,17 +389,15 @@ int print_linkinfo(const struct sockaddr_nl *who,
> >   );
> >   if (s->rx_compressed)
> >   fprintf(fp, " %-7u", s->rx_compressed);
> > - if (show_stats > 1) {
> > - fprintf(fp, "%s", _SL_);
> > - fprintf(fp, "RX errors: length  crc frame   fifomissed%s", 
> > _SL_);
> > - fprintf(fp, "   %-7u  %-7u %-7u %-7u %-7u",
> > - s->rx_length_errors,
> > - s->rx_crc_errors,
> > - s->rx_frame_errors,
> > - s->rx_fifo_errors,
> > - s->rx_missed_errors
> > - );
> > - }
> > + fprintf(fp, "%s", _SL_);
> > + fprintf(fp, "RX errors: length  crc frame   fifomissed%s", 
> > _SL_);
> > + fprintf(fp, "   %-7u  %-7u %-7u %-7u %-7u",
> > + s->rx_length_errors,
> > + s->rx_crc_errors,
> > + s->rx_frame_errors,
> > + s->rx_fifo_errors,
> > + s->rx_missed_errors
> > + );
> >   fprintf(fp, "%s", _SL_);
> >   fprintf(fp, "TX: bytes  packets  errors  dropped carrier collsns 
> > %s%s",
> >   s->tx_compressed ? "compressed" : "", _SL_);
> > @@ -412,16 +406,14 @@ int print_linkinfo(const struct sockaddr_nl *who,
> >   s->tx_dropped, s->tx_carrier_errors, s->collisions);
> >   if (s->tx_compressed)
> >   fprintf(fp, " %-7u", s->tx_compressed);
> > - if (show_stats > 1) {
> > - fprintf(fp, "%s", _SL_);
> > - fprintf(fp, "TX errors: aborted fifowindow  heartbeat%s", _SL_);
> > - fprintf(fp, "   %-7u  %-7u %-7u %-7u",
> > - s->tx_aborted_errors,
> > - 

Re: [iproute PATCH] ssfilter: Fix for inverted last expression

2018-11-30 Thread Eric Dumazet



On 11/29/2018 04:20 AM, Phil Sutter wrote:
> When fixing for shift/reduce conflicts, possibility to invert the last
> expression by prefixing with '!' or 'not' was accidentally removed.
> 
> Fix this by allowing for expr to be an inverted expr so that any
> reference to it in exprlist accepts the inverted prefix.
> 
> Reported-by: Eric Dumazet 
> Fixes: b2038cc0b2403 ("ssfilter: Eliminate shift/reduce conflicts")
> Signed-off-by: Phil Sutter 
> ---


Thanks Phil !

Acked-by: Eric Dumazet 


