Re: [PATCH v2 net-next 0/8] Continue towards using linkmode in phylib

2018-10-01 Thread David Miller
From: Andrew Lunn 
Date: Sat, 29 Sep 2018 23:04:08 +0200

> These patches contain some further cleanup and helpers, and the first
> real patch towards using linkmode bitmaps in phylink.
> 
> The macro magic in the RFC version has been replaced with run time
> initialisation.

Series applied, thanks Andrew.


Re: [PATCH net-next] nfp: warn on experimental TLV types

2018-10-01 Thread David Miller
From: Jakub Kicinski 
Date: Wed, 26 Sep 2018 15:35:31 -0700

> Reserve two TLV types for feature development, and warn in the driver
> if they ever leak into production.
> 
> Signed-off-by: Jakub Kicinski 
> Reviewed-by: Simon Horman 

Applied.


[PATCH v2 net] inet: frags: rework rhashtable dismantle

2018-10-01 Thread Eric Dumazet
syzbot found an interesting use-after-free [1] happening
while the IPv4 fragment rhashtable was being destroyed at netns dismantle.

While no insertions can possibly happen at the time a dismantling
netns is destroying this rhashtable, timers can still fire and
attempt to remove elements from this rhashtable.

This is forbidden, since rhashtable_free_and_destroy() has
no synchronization against concurrent inserts and deletes.

Add a new nf->dead flag so that timers do not attempt
a rhashtable_remove_fast() operation.

[1]
BUG: KASAN: use-after-free in __read_once_size include/linux/compiler.h:188 [inline]
BUG: KASAN: use-after-free in rhashtable_last_table+0x216/0x240 lib/rhashtable.c:217
Read of size 8 at addr 88019a4c8840 by task kworker/0:4/8279

CPU: 0 PID: 8279 Comm: kworker/0:4 Not tainted 4.19.0-rc5+ #61
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: events rht_deferred_worker
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1c4/0x2b4 lib/dump_stack.c:113
 print_address_description.cold.8+0x9/0x1ff mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report.cold.9+0x242/0x309 mm/kasan/report.c:412
 __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433
 __read_once_size include/linux/compiler.h:188 [inline]
 rhashtable_last_table+0x216/0x240 lib/rhashtable.c:217
 rht_deferred_worker+0x157/0x1de0 lib/rhashtable.c:410
 process_one_work+0xc90/0x1b90 kernel/workqueue.c:2153
 worker_thread+0x17f/0x1390 kernel/workqueue.c:2296
 kthread+0x35a/0x420 kernel/kthread.c:246
 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:413

Allocated by task 5:
 save_stack+0x43/0xd0 mm/kasan/kasan.c:448
 set_track mm/kasan/kasan.c:460 [inline]
 kasan_kmalloc+0xc7/0xe0 mm/kasan/kasan.c:553
 __do_kmalloc_node mm/slab.c:3682 [inline]
 __kmalloc_node+0x47/0x70 mm/slab.c:3689
 kmalloc_node include/linux/slab.h:555 [inline]
 kvmalloc_node+0xb9/0xf0 mm/util.c:423
 kvmalloc include/linux/mm.h:577 [inline]
 kvzalloc include/linux/mm.h:585 [inline]
 bucket_table_alloc+0x9a/0x4e0 lib/rhashtable.c:176
 rhashtable_rehash_alloc+0x73/0x100 lib/rhashtable.c:353
 rht_deferred_worker+0x278/0x1de0 lib/rhashtable.c:413
 process_one_work+0xc90/0x1b90 kernel/workqueue.c:2153
 worker_thread+0x17f/0x1390 kernel/workqueue.c:2296
 kthread+0x35a/0x420 kernel/kthread.c:246
 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:413

Freed by task 8283:
 save_stack+0x43/0xd0 mm/kasan/kasan.c:448
 set_track mm/kasan/kasan.c:460 [inline]
 __kasan_slab_free+0x102/0x150 mm/kasan/kasan.c:521
 kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
 __cache_free mm/slab.c:3498 [inline]
 kfree+0xcf/0x230 mm/slab.c:3813
 kvfree+0x61/0x70 mm/util.c:452
 bucket_table_free+0xda/0x250 lib/rhashtable.c:108
 rhashtable_free_and_destroy+0x152/0x900 lib/rhashtable.c:1163
 inet_frags_exit_net+0x3d/0x50 net/ipv4/inet_fragment.c:96
 ipv4_frags_exit_net+0x73/0x90 net/ipv4/ip_fragment.c:914
 ops_exit_list.isra.7+0xb0/0x160 net/core/net_namespace.c:153
 cleanup_net+0x555/0xb10 net/core/net_namespace.c:551
 process_one_work+0xc90/0x1b90 kernel/workqueue.c:2153
 worker_thread+0x17f/0x1390 kernel/workqueue.c:2296
 kthread+0x35a/0x420 kernel/kthread.c:246
 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:413

The buggy address belongs to the object at 88019a4c8800
 which belongs to the cache kmalloc-16384 of size 16384
The buggy address is located 64 bytes inside of
 16384-byte region [88019a4c8800, 88019a4cc800)
The buggy address belongs to the page:
page:ea0006693200 count:1 mapcount:0 mapping:8801da802200 index:0x0 compound_mapcount: 0
flags: 0x2fffc008100(slab|head)
raw: 02fffc008100 ea0006685608 ea0006617c08 8801da802200
raw:  88019a4c8800 00010001 
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 88019a4c8700: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 88019a4c8780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>88019a4c8800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   ^
 88019a4c8880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 88019a4c8900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

Fixes: 648700f76b03 ("inet: frags: use rhashtables for reassembly units")
Signed-off-by: Eric Dumazet 
Reported-by: syzbot 
Cc: Thomas Graf 
Cc: Herbert Xu 
---
 include/net/inet_frag.h  |  4 +++-
 net/ipv4/inet_fragment.c | 31 +--
 2 files changed, 24 insertions(+), 11 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 
1662cbc0b46b45296a367ecbdaf03c68854fdce7..ffe5e1be40212fa63e360f3e29a56c1b2ce897ee
 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -11,7 +11,7 @@ struct netns_frags {
int timeout;
int max_dist;
struct inet_frags   *f;
-
+   

Re: pull request: bluetooth 2018-09-27

2018-10-01 Thread David Miller
From: Johan Hedberg 
Date: Thu, 27 Sep 2018 21:28:40 +0300

> Here's one more Bluetooth fix for 4.19, fixing the handling of an
> attempt to unpair a device while pairing is in progress.
> 
> Let me know if there are any issues pulling. Thanks.

Pulled, thanks Johan.


Re: [PATCH net-next 2/2] tcp: adjust rcv zerocopy hints based on frag sizes

2018-10-01 Thread David Miller
From: Soheil Hassas Yeganeh 
Date: Wed, 26 Sep 2018 16:57:04 -0400

> From: Soheil Hassas Yeganeh 
> 
> When SKBs are coalesced, we can have SKBs with different
> frag sizes. Some with PAGE_SIZE and some not with PAGE_SIZE.
> Since recv_skip_hint is always set to the full SKB size,
> it can overestimate the amount that should be read using
> normal read for coalesced packets.
> 
> Change the recv_skip_hint so that it only includes the first
> frags that are not of PAGE_SIZE.
> 
> Signed-off-by: Soheil Hassas Yeganeh 
> Signed-off-by: Eric Dumazet 

Applied.


Re: [PATCH net-next 1/2] tcp: set recv_skip_hint when tcp_inq is less than PAGE_SIZE

2018-10-01 Thread David Miller
From: Soheil Hassas Yeganeh 
Date: Wed, 26 Sep 2018 16:57:03 -0400

> From: Soheil Hassas Yeganeh 
> 
> When we have less than PAGE_SIZE of data on receive queue,
> we set recv_skip_hint to 0. Instead, set it to the actual
> number of bytes available.
> 
> Signed-off-by: Soheil Hassas Yeganeh 
> Signed-off-by: Eric Dumazet 

Applied.


[RFC v2 bpf-next 2/5] bpf: return EOPNOTSUPP when map lookup isn't supported

2018-10-01 Thread Prashant Bhole
Return ERR_PTR(-EOPNOTSUPP) from the map_lookup_elem() methods of the
below map types:
- BPF_MAP_TYPE_PROG_ARRAY
- BPF_MAP_TYPE_STACK_TRACE
- BPF_MAP_TYPE_XSKMAP
- BPF_MAP_TYPE_SOCKMAP/BPF_MAP_TYPE_SOCKHASH

Signed-off-by: Prashant Bhole 
---
 kernel/bpf/arraymap.c | 2 +-
 kernel/bpf/sockmap.c  | 2 +-
 kernel/bpf/stackmap.c | 2 +-
 kernel/bpf/xskmap.c   | 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index dded84cbe814..24583da9ffd1 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -449,7 +449,7 @@ static void fd_array_map_free(struct bpf_map *map)
 
 static void *fd_array_map_lookup_elem(struct bpf_map *map, void *key)
 {
-   return NULL;
+   return ERR_PTR(-EOPNOTSUPP);
 }
 
 /* only called from syscall */
diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index d37a1a0a6e1e..5d0677d808ae 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -2096,7 +2096,7 @@ int sockmap_get_from_fd(const union bpf_attr *attr, int 
type,
 
 static void *sock_map_lookup(struct bpf_map *map, void *key)
 {
-   return NULL;
+   return ERR_PTR(-EOPNOTSUPP);
 }
 
 static int sock_map_update_elem(struct bpf_map *map,
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 8061a439ef18..b2ade10f7ec3 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -505,7 +505,7 @@ const struct bpf_func_proto bpf_get_stack_proto = {
 /* Called from eBPF program */
 static void *stack_map_lookup_elem(struct bpf_map *map, void *key)
 {
-   return NULL;
+   return ERR_PTR(-EOPNOTSUPP);
 }
 
 /* Called from syscall */
diff --git a/kernel/bpf/xskmap.c b/kernel/bpf/xskmap.c
index 9f8463afda9c..ef0b7b6ef8a5 100644
--- a/kernel/bpf/xskmap.c
+++ b/kernel/bpf/xskmap.c
@@ -154,7 +154,7 @@ void __xsk_map_flush(struct bpf_map *map)
 
 static void *xsk_map_lookup_elem(struct bpf_map *map, void *key)
 {
-   return NULL;
+   return ERR_PTR(-EOPNOTSUPP);
 }
 
 static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
-- 
2.17.1




Re: [net 1/1] tipc: ignore STATE_MSG on wrong link session

2018-10-01 Thread David Miller
From: Jon Maloy 
Date: Wed, 26 Sep 2018 22:28:52 +0200

> From: LUU Duc Canh 
> 
> The initial session number when a link is created is based on a random
> value, taken from struct tipc_net->random. It is then incremented for
> each link reset to avoid mixing protocol messages from different link
> sessions.
> 
> However, when a bearer is reset all its links are deleted, and will
> later be re-created using the same random value as the first time.
> This means that if the link never went down between creation and
> deletion we will still sometimes have two subsequent sessions with
> the same session number. In virtual environments with potentially
> long transmission times this has turned out to be a real problem.
> 
> We now fix this by randomizing the session number each time a link
> is created.
> 
> With a session number size of 16 bits this gives a risk of session
> collision of 1/64k. To reduce this further, we also introduce a sanity
> check on the very first STATE message arriving at a link. If this has
> an acknowledge value differing from 0, which is logically impossible,
> we ignore the message. The final risk for session collision is hence
> reduced to 1/4G, which should be sufficient.
> 
> Signed-off-by: LUU Duc Canh 
> Signed-off-by: Jon Maloy 

Applied.


[RFC v2 bpf-next 5/5] selftests/bpf: verifier, check bpf_map_lookup_elem access in bpf prog

2018-10-01 Thread Prashant Bhole
map_lookup_elem isn't supported by certain map types like:
- BPF_MAP_TYPE_PROG_ARRAY
- BPF_MAP_TYPE_STACK_TRACE
- BPF_MAP_TYPE_XSKMAP
- BPF_MAP_TYPE_SOCKMAP/BPF_MAP_TYPE_SOCKHASH
Let's add verifier tests to check that the verifier prevents
bpf_map_lookup_elem calls on the above map types from a bpf program.

Signed-off-by: Prashant Bhole 
---
 tools/testing/selftests/bpf/test_verifier.c | 121 +++-
 1 file changed, 120 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/test_verifier.c 
b/tools/testing/selftests/bpf/test_verifier.c
index c7d25f23baf9..afa7e67f66e4 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -47,7 +47,7 @@
 
 #define MAX_INSNS  BPF_MAXINSNS
 #define MAX_FIXUPS 8
-#define MAX_NR_MAPS	8
+#define MAX_NR_MAPS	13
 #define POINTER_VALUE  0xcafe4all
 #define TEST_DATA_LEN  64
 
@@ -64,6 +64,10 @@ struct bpf_test {
int fixup_map2[MAX_FIXUPS];
int fixup_map3[MAX_FIXUPS];
int fixup_map4[MAX_FIXUPS];
+   int fixup_map5[MAX_FIXUPS];
+   int fixup_map6[MAX_FIXUPS];
+   int fixup_map7[MAX_FIXUPS];
+   int fixup_map8[MAX_FIXUPS];
int fixup_prog1[MAX_FIXUPS];
int fixup_prog2[MAX_FIXUPS];
int fixup_map_in_map[MAX_FIXUPS];
@@ -4391,6 +4395,85 @@ static struct bpf_test tests[] = {
.errstr = "invalid access to packet",
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
},
+   {
+   "prevent map lookup in sockmap",
+   .insns = {
+   BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+   BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+   BPF_LD_MAP_FD(BPF_REG_1, 0),
+   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
+BPF_FUNC_map_lookup_elem),
+   BPF_EXIT_INSN(),
+   },
+   .fixup_map5 = { 3 },
+   .result = REJECT,
+   .errstr = "cannot pass map_type 15 into func bpf_map_lookup_elem",
+   .prog_type = BPF_PROG_TYPE_SOCK_OPS,
+   },
+   {
+   "prevent map lookup in sockhash",
+   .insns = {
+   BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+   BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+   BPF_LD_MAP_FD(BPF_REG_1, 0),
+   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
+BPF_FUNC_map_lookup_elem),
+   BPF_EXIT_INSN(),
+   },
+   .fixup_map6 = { 3 },
+   .result = REJECT,
+   .errstr = "cannot pass map_type 18 into func bpf_map_lookup_elem",
+   .prog_type = BPF_PROG_TYPE_SOCK_OPS,
+   },
+   {
+   "prevent map lookup in xskmap",
+   .insns = {
+   BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+   BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+   BPF_LD_MAP_FD(BPF_REG_1, 0),
+   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
+BPF_FUNC_map_lookup_elem),
+   BPF_EXIT_INSN(),
+   },
+   .fixup_map7 = { 3 },
+   .result = REJECT,
+   .errstr = "cannot pass map_type 17 into func bpf_map_lookup_elem",
+   .prog_type = BPF_PROG_TYPE_XDP,
+   },
+   {
+   "prevent map lookup in stack trace",
+   .insns = {
+   BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+   BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+   BPF_LD_MAP_FD(BPF_REG_1, 0),
+   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
+BPF_FUNC_map_lookup_elem),
+   BPF_EXIT_INSN(),
+   },
+   .fixup_map8 = { 3 },
+   .result = REJECT,
+   .errstr = "cannot pass map_type 7 into func bpf_map_lookup_elem",
+   .prog_type = BPF_PROG_TYPE_PERF_EVENT,
+   },
+   {
+   "prevent map lookup in prog array",
+   .insns = {
+   BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+   BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+   BPF_LD_MAP_FD(BPF_REG_1, 0),
+   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
+BPF_FUNC_map_lookup_elem),
+   BPF_EXIT_INSN(),
+   },
+   .fixup_prog2 = { 3 },
+

[RFC v2 bpf-next 4/5] tools/bpf: bpftool, print strerror when map lookup error occurs

2018-10-01 Thread Prashant Bhole
Since map lookup error can be ENOENT or EOPNOTSUPP, let's print
strerror() as error message in normal and JSON output.

This patch adds a helper function print_entry_error() to print the
entry when a lookup error occurs.

Example: Following example dumps a map which does not support lookup.

Output before:
root# bpftool map -jp dump id 40
[
"key": ["0x0a","0x00","0x00","0x00"
],
"value": {
"error": "can\'t lookup element"
},
"key": ["0x0b","0x00","0x00","0x00"
],
"value": {
"error": "can\'t lookup element"
}
]

root# bpftool map dump id 40
can't lookup element with key:
0a 00 00 00
can't lookup element with key:
0b 00 00 00
Found 0 elements

Output after changes:
root# bpftool map dump -jp  id 45
[
"key": ["0x0a","0x00","0x00","0x00"
],
"value": {
"error": "Operation not supported"
},
"key": ["0x0b","0x00","0x00","0x00"
],
"value": {
"error": "Operation not supported"
}
]

root# bpftool map dump id 45
key:
0a 00 00 00
value:
Operation not supported
key:
0b 00 00 00
value:
Operation not supported
Found 0 elements

Signed-off-by: Prashant Bhole 
---
 tools/bpf/bpftool/map.c | 29 -
 1 file changed, 24 insertions(+), 5 deletions(-)

diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
index 28d365435fea..9f5de48f8a99 100644
--- a/tools/bpf/bpftool/map.c
+++ b/tools/bpf/bpftool/map.c
@@ -336,6 +336,25 @@ static void print_entry_json(struct bpf_map_info *info, 
unsigned char *key,
jsonw_end_object(json_wtr);
 }
 
+static void print_entry_error(struct bpf_map_info *info, unsigned char *key,
+ const char *value)
+{
+   int value_size = strlen(value);
+   bool single_line, break_names;
+
+   break_names = info->key_size > 16 || value_size > 16;
+   single_line = info->key_size + value_size <= 24 && !break_names;
+
+   printf("key:%c", break_names ? '\n' : ' ');
+   fprint_hex(stdout, key, info->key_size, " ");
+
+   printf(single_line ? "  " : "\n");
+
+   printf("value:%c%s", break_names ? '\n' : ' ', value);
+
+   printf("\n");
+}
+
 static void print_entry_plain(struct bpf_map_info *info, unsigned char *key,
  unsigned char *value)
 {
@@ -663,6 +682,7 @@ static int dump_map_elem(int fd, void *key, void *value,
 json_writer_t *btf_wtr)
 {
int num_elems = 0;
+   int lookup_errno;
 
if (!bpf_map_lookup_elem(fd, key, value)) {
if (json_output) {
@@ -685,6 +705,8 @@ static int dump_map_elem(int fd, void *key, void *value,
}
 
/* lookup error handling */
+   lookup_errno = errno;
+
if (map_is_map_of_maps(map_info->type) ||
map_is_map_of_progs(map_info->type))
return 0;
@@ -694,13 +716,10 @@ static int dump_map_elem(int fd, void *key, void *value,
print_hex_data_json(key, map_info->key_size);
jsonw_name(json_wtr, "value");
jsonw_start_object(json_wtr);
-   jsonw_string_field(json_wtr, "error",
-  "can't lookup element");
+   jsonw_string_field(json_wtr, "error", strerror(lookup_errno));
jsonw_end_object(json_wtr);
} else {
-   p_info("can't lookup element with key: ");
-   fprint_hex(stderr, key, map_info->key_size, " ");
-   fprintf(stderr, "\n");
+   print_entry_error(map_info, key, strerror(lookup_errno));
}
 
return 0;
-- 
2.17.1




[RFC v2 bpf-next 3/5] tools/bpf: bpftool, split the function do_dump()

2018-10-01 Thread Prashant Bhole
The do_dump() function in bpftool/map.c is deeply indented. To reduce
the indentation, let's move the element-printing code out of do_dump()
into a new dump_map_elem() function.

Signed-off-by: Prashant Bhole 
---
 tools/bpf/bpftool/map.c | 83 -
 1 file changed, 49 insertions(+), 34 deletions(-)

diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
index 6003e9598973..28d365435fea 100644
--- a/tools/bpf/bpftool/map.c
+++ b/tools/bpf/bpftool/map.c
@@ -658,6 +658,54 @@ static int do_show(int argc, char **argv)
return errno == ENOENT ? 0 : -1;
 }
 
+static int dump_map_elem(int fd, void *key, void *value,
+struct bpf_map_info *map_info, struct btf *btf,
+json_writer_t *btf_wtr)
+{
+   int num_elems = 0;
+
+   if (!bpf_map_lookup_elem(fd, key, value)) {
+   if (json_output) {
+   print_entry_json(map_info, key, value, btf);
+   } else {
+   if (btf) {
+   struct btf_dumper d = {
+   .btf = btf,
+   .jw = btf_wtr,
+   .is_plain_text = true,
+   };
+
+   do_dump_btf(&d, map_info, key, value);
+   } else {
+   print_entry_plain(map_info, key, value);
+   }
+   num_elems++;
+   }
+   return num_elems;
+   }
+
+   /* lookup error handling */
+   if (map_is_map_of_maps(map_info->type) ||
+   map_is_map_of_progs(map_info->type))
+   return 0;
+
+   if (json_output) {
+   jsonw_name(json_wtr, "key");
+   print_hex_data_json(key, map_info->key_size);
+   jsonw_name(json_wtr, "value");
+   jsonw_start_object(json_wtr);
+   jsonw_string_field(json_wtr, "error",
+  "can't lookup element");
+   jsonw_end_object(json_wtr);
+   } else {
+   p_info("can't lookup element with key: ");
+   fprint_hex(stderr, key, map_info->key_size, " ");
+   fprintf(stderr, "\n");
+   }
+
+   return 0;
+}
+
 static int do_dump(int argc, char **argv)
 {
struct bpf_map_info info = {};
@@ -713,40 +761,7 @@ static int do_dump(int argc, char **argv)
err = 0;
break;
}
-
-   if (!bpf_map_lookup_elem(fd, key, value)) {
-   if (json_output)
-   print_entry_json(&info, key, value, btf);
-   else
-   if (btf) {
-   struct btf_dumper d = {
-   .btf = btf,
-   .jw = btf_wtr,
-   .is_plain_text = true,
-   };
-
-   do_dump_btf(&d, &info, key, value);
-   } else {
-   print_entry_plain(&info, key, value);
-   }
-   num_elems++;
-   } else if (!map_is_map_of_maps(info.type) &&
-  !map_is_map_of_progs(info.type)) {
-   if (json_output) {
-   jsonw_name(json_wtr, "key");
-   print_hex_data_json(key, info.key_size);
-   jsonw_name(json_wtr, "value");
-   jsonw_start_object(json_wtr);
-   jsonw_string_field(json_wtr, "error",
-  "can't lookup element");
-   jsonw_end_object(json_wtr);
-   } else {
-   p_info("can't lookup element with key: ");
-   fprint_hex(stderr, key, info.key_size, " ");
-   fprintf(stderr, "\n");
-   }
-   }
-
+   num_elems += dump_map_elem(fd, key, value, &info, btf, btf_wtr);
prev_key = key;
}
 
-- 
2.17.1




[RFC v2 bpf-next 1/5] bpf: error handling when map_lookup_elem isn't supported

2018-10-01 Thread Prashant Bhole
The error value returned by map_lookup_elem doesn't differentiate
whether the lookup failed because of an invalid key or because lookup
is not supported.

Let's add handling for the -EOPNOTSUPP return value of a map's
map_lookup_elem() method, with the expectation that a map's
implementation returns -EOPNOTSUPP if lookup is not supported.

The errno for bpf syscall for BPF_MAP_LOOKUP_ELEM command will be set
to EOPNOTSUPP if map lookup is not supported.

Signed-off-by: Prashant Bhole 
---
 kernel/bpf/syscall.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 5742df21598c..4f416234251f 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -719,10 +719,15 @@ static int map_lookup_elem(union bpf_attr *attr)
} else {
rcu_read_lock();
ptr = map->ops->map_lookup_elem(map, key);
-   if (ptr)
+   if (IS_ERR(ptr)) {
+   err = PTR_ERR(ptr);
+   } else if (!ptr) {
+   err = -ENOENT;
+   } else {
+   err = 0;
memcpy(value, ptr, value_size);
+   }
rcu_read_unlock();
-   err = ptr ? 0 : -ENOENT;
}
 
if (err)
-- 
2.17.1




[RFC v2 bpf-next 0/5] Error handling when map lookup isn't supported

2018-10-01 Thread Prashant Bhole
Currently, when a map lookup fails, the user space API cannot
distinguish whether the given key was not found or lookup is not
supported by the particular map.

In this series we modify the return value of maps which do not support
lookup. Lookup on such maps will return -EOPNOTSUPP. The bpf() syscall
with the BPF_MAP_LOOKUP_ELEM command will set the EOPNOTSUPP errno. We
also handle this error in bpftool to print an appropriate message.

Patch 1: adds handling of the BPF_MAP_LOOKUP_ELEM command of the bpf
syscall such that errno will be set to EOPNOTSUPP when the map doesn't
support lookup

Patch 2: Modifies the return value of map_lookup_elem() to
ERR_PTR(-EOPNOTSUPP) for maps which do not support lookup

Patch 3: Splits do_dump() in bpftool/map.c. Element printing code is
moved out into a new function dump_map_elem(). This was done in order
to reduce deep indentation and accommodate further changes.

Patch 4: Changes in bpftool to print a strerror() message when a
lookup error occurs. This results in an appropriate message like
"Operation not supported" when the map doesn't support lookup.

Patch 5: Added verifier tests to check whether the verifier rejects
calls to bpf_map_lookup_elem from a bpf program, for all map types
that do not support map lookup.

v2: 
- bpftool: all nit-pick fixes pointed out by Jakub
- bpftool: removed usage of error strings. Now using strerror(),
  suggested by Jakub
- added tests in verifier_tests, suggested by Alexei


Prashant Bhole (5):
  bpf: error handling when map_lookup_elem isn't supported
  bpf: return EOPNOTSUPP when map lookup isn't supported
  tools/bpf: bpftool, split the function do_dump()
  tools/bpf: bpftool, print strerror when map lookup error occurs
  selftests/bpf: verifier, check bpf_map_lookup_elem access in bpf prog

 kernel/bpf/arraymap.c   |   2 +-
 kernel/bpf/sockmap.c|   2 +-
 kernel/bpf/stackmap.c   |   2 +-
 kernel/bpf/syscall.c|   9 +-
 kernel/bpf/xskmap.c |   2 +-
 tools/bpf/bpftool/map.c | 102 +++--
 tools/testing/selftests/bpf/test_verifier.c | 121 +++-
 7 files changed, 199 insertions(+), 41 deletions(-)

-- 
2.17.1




Re: [PATCH net] net: sched: act_ipt: check for underflow in __tcf_ipt_init()

2018-10-01 Thread David Miller
From: Dan Carpenter 
Date: Sat, 22 Sep 2018 16:46:48 +0300

> If "td->u.target_size" is larger than sizeof(struct xt_entry_target) we
> return -EINVAL.  But we don't check whether it's smaller than
> sizeof(struct xt_entry_target) and that could lead to an out of bounds
> read.
> 
> Fixes: 7ba699c604ab ("[NET_SCHED]: Convert actions from rtnetlink to new 
> netlink API")
> Signed-off-by: Dan Carpenter 

Applied.


Re: [RFC bpf-next 4/4] tools/bpf: handle EOPNOTSUPP when map lookup is failed

2018-10-01 Thread Prashant Bhole




On 9/21/2018 12:59 AM, Jakub Kicinski wrote:

On Thu, 20 Sep 2018 14:04:19 +0900, Prashant Bhole wrote:

On 9/20/2018 12:29 AM, Jakub Kicinski wrote:

On Wed, 19 Sep 2018 16:51:43 +0900, Prashant Bhole wrote:

Let's add a check for EOPNOTSUPP error when map lookup is failed.
Also in case map doesn't support lookup, the output of map dump is
changed from "can't lookup element" to "lookup not supported for
this map".

Patch adds a print_entry_error() function to print the error
value.

Following example dumps a map which does not support lookup.

Output before:
root# bpftool map -jp dump id 40
[
  "key": ["0x0a","0x00","0x00","0x00"
  ],
  "value": {
  "error": "can\'t lookup element"
  },
  "key": ["0x0b","0x00","0x00","0x00"
  ],
  "value": {
  "error": "can\'t lookup element"
  }
]

root# bpftool map dump id 40
can't lookup element with key:
0a 00 00 00
can't lookup element with key:
0b 00 00 00
Found 0 elements

Output after changes:
root# bpftool map dump -jp  id 45
[
  "key": ["0x0a","0x00","0x00","0x00"
  ],
  "value": {
  "error": "lookup not supported for this map"
  },
  "key": ["0x0b","0x00","0x00","0x00"
  ],
  "value": {
  "error": "lookup not supported for this map"
  }
]

root# bpftool map dump id 45
key:
0a 00 00 00
value:
lookup not supported for this map
key:
0b 00 00 00
value:
lookup not supported for this map
Found 0 elements


Nice improvement, thanks for the changes!  I wonder what your thoughts
would be on just printing some form of "lookup not supported for this
map" only once?  It seems slightly like repeated information - if
lookup is not supported for one key it likely won't be for other keys
too, so we could shorten the output.  Would that make sense?
   

Signed-off-by: Prashant Bhole 
---
   tools/bpf/bpftool/main.h |  5 +
   tools/bpf/bpftool/map.c  | 35 ++-
   2 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/tools/bpf/bpftool/main.h b/tools/bpf/bpftool/main.h
index 40492cdc4e53..1a8c683f949b 100644
--- a/tools/bpf/bpftool/main.h
+++ b/tools/bpf/bpftool/main.h
@@ -46,6 +46,11 @@
   
   #include "json_writer.h"
   
+#define ERR_CANNOT_LOOKUP \

+   "can't lookup element"
+#define ERR_LOOKUP_NOT_SUPPORTED \
+   "lookup not supported for this map"


Do we need these?  Are we going to reused them in more parts of the
code?


These are used only once. These can be used in do_lookup(). Currently
do_lookup() prints strerror(errno) when lookup is failed. Shall I change
that do_lookup() output?


I actually prefer to stick to strerror(), the standard errors more
clearly correlate with what happened in my mind (i.e. "Operation not
supported" == kernel sent EOPNOTSUPP).   strerror() may also print in
local language if translation/localization matters.

We could even use strerr() in dump_map_elem() but up to you.  The one
in do_lookup() I'd prefer to leave be ;)


Sorry for the late reply.
In v2 I have removed the error strings altogether. As you suggested,
the output will be strerror(). Also added verifier tests as Alexei
suggested. I am sending the RFC-v2 series soon.

Thanks.

-Prashant



Re: pull request (net-next): ipsec-next 2018-10-01

2018-10-01 Thread David Miller
From: Steffen Klassert 
Date: Mon, 1 Oct 2018 11:16:06 +0200

> 1) Make xfrmi_get_link_net() static to silence a sparse warning.
> From Wei Yongjun.
> 
> 2) Remove a unused esph pointer definition in esp_input().
> From Haishuang Yan.
> 
> 3) Allow the NIC driver to quietly refuse xfrm offload
>in case it does not support it, the SA is created
>without offload in this case.
> From Shannon Nelson.
> 
> Please pull or let me know if there are problems.

Also pulled, thank you!


Re: pull request (net): ipsec 2018-10-01

2018-10-01 Thread David Miller
From: Steffen Klassert 
Date: Mon, 1 Oct 2018 10:58:49 +0200

> 1) Validate address prefix lengths in the xfrm selector,
>otherwise we may hit undefined behaviour in the
>address matching functions if the prefix is too
>big for the given address family.
> 
> 2) Fix skb leak on local message size errors.
> From Thadeu Lima de Souza Cascardo.
> 
> 3) We currently reset the transport header back to the network
>header after a transport mode transformation is applied. This
>leads to an incorrect transport header when multiple transport
>mode transformations are applied. Reset the transport header
>only after all transformations are already applied to fix this.
> From Sowmini Varadhan.
> 
> 4) We only support one offloaded xfrm, so reset crypto_done after
>the first transformation in xfrm_input(). Otherwise we may call
>the wrong input method for subsequent transformations.
> From Sowmini Varadhan.
> 
> 5) Fix NULL pointer dereference when skb_dst_force clears the dst_entry.
>skb_dst_force does not really force a dst refcount anymore, it might
>clear it instead. xfrm code did not expect this, add a check to not
>dereference skb_dst() if it was cleared by skb_dst_force.
> 
> 6) Validate xfrm template mode, otherwise we can get a stack-out-of-bounds
>read in xfrm_state_find. From Sean Tranchetti.
> 
> Please pull or let me know if there are problems.

Pulled, thanks!


Re: [PATCH net v2] net/ncsi: Extend NC-SI Netlink interface to allow user space to send NC-SI command

2018-10-01 Thread Samuel Mendoza-Jonas
On Fri, 2018-09-28 at 18:15 +, justin.l...@dell.com wrote:
> The new command (NCSI_CMD_SEND_CMD) is added to allow a user space
> application to send NC-SI commands to the network card.
> Also, add a new attribute (NCSI_ATTR_DATA) for transferring the request
> and response.
> 
> The work flow is as below.
> 
> Request:
> User space application -> Netlink interface (msg)
>   -> new Netlink handler - ncsi_send_cmd_nl()
>   -> ncsi_xmit_cmd()
> 
> Response:
> Response received - ncsi_rcv_rsp()
>   -> internal response handler - ncsi_rsp_handler_xxx()
>   -> ncsi_rsp_handler_netlink()
>   -> ncsi_send_netlink_rsp()
>   -> Netlink interface (msg)
>   -> user space application
> 
> Command timeout - ncsi_request_timeout()
>   -> ncsi_send_netlink_timeout()
>   -> Netlink interface (msg with zero data length)
>   -> user space application
> 
> Error:
> Error detected -> ncsi_send_netlink_err()
>   -> Netlink interface (err msg)
>   -> user space application
> 
> 
> Signed-off-by: Justin Lee 

Hi Justin,

This is looking pretty good, combined with Vijay's base patch the two
approaches should fit together nicely (
http://patchwork.ozlabs.org/patch/976510/).

A good merge order would probably be the above patch first, then this
patch and Vijay's further OEM patches based on top of that to reduce
conflicts.

Cheers,
Sam

> 
> ---
>  include/uapi/linux/ncsi.h |   3 +
>  net/ncsi/internal.h   |  12 ++-
>  net/ncsi/ncsi-cmd.c   |  47 ++-
>  net/ncsi/ncsi-manage.c|  22 +
>  net/ncsi/ncsi-netlink.c   | 205 
> ++
>  net/ncsi/ncsi-netlink.h   |  12 +++
>  net/ncsi/ncsi-rsp.c   |  71 ++--
>  7 files changed, 363 insertions(+), 9 deletions(-)
> 
> diff --git a/include/uapi/linux/ncsi.h b/include/uapi/linux/ncsi.h
> index 4c292ec..4992bfc 100644
> --- a/include/uapi/linux/ncsi.h
> +++ b/include/uapi/linux/ncsi.h
> @@ -30,6 +30,7 @@ enum ncsi_nl_commands {
>   NCSI_CMD_PKG_INFO,
>   NCSI_CMD_SET_INTERFACE,
>   NCSI_CMD_CLEAR_INTERFACE,
> + NCSI_CMD_SEND_CMD,
>  
>   __NCSI_CMD_AFTER_LAST,
>   NCSI_CMD_MAX = __NCSI_CMD_AFTER_LAST - 1
> @@ -43,6 +44,7 @@ enum ncsi_nl_commands {
>   * @NCSI_ATTR_PACKAGE_LIST: nested array of NCSI_PKG_ATTR attributes
>   * @NCSI_ATTR_PACKAGE_ID: package ID
>   * @NCSI_ATTR_CHANNEL_ID: channel ID
> + * @NCSI_ATTR_DATA: command payload
>   * @NCSI_ATTR_MAX: highest attribute number
>   */
>  enum ncsi_nl_attrs {
> @@ -51,6 +53,7 @@ enum ncsi_nl_attrs {
>   NCSI_ATTR_PACKAGE_LIST,
>   NCSI_ATTR_PACKAGE_ID,
>   NCSI_ATTR_CHANNEL_ID,
> + NCSI_ATTR_DATA,
>  
>   __NCSI_ATTR_AFTER_LAST,
>   NCSI_ATTR_MAX = __NCSI_ATTR_AFTER_LAST - 1
> diff --git a/net/ncsi/internal.h b/net/ncsi/internal.h
> index 8055e39..1a3ef9e 100644
> --- a/net/ncsi/internal.h
> +++ b/net/ncsi/internal.h
> @@ -171,6 +171,8 @@ struct ncsi_package;
>  #define NCSI_RESERVED_CHANNEL0x1f
>  #define NCSI_CHANNEL_INDEX(c)((c) & ((1 << NCSI_PACKAGE_SHIFT) - 1))
>  #define NCSI_TO_CHANNEL(p, c)(((p) << NCSI_PACKAGE_SHIFT) | (c))
> +#define NCSI_MAX_PACKAGE 8
> +#define NCSI_MAX_CHANNEL 32
>  
>  struct ncsi_channel {
>   unsigned char   id;
> @@ -215,12 +217,17 @@ struct ncsi_request {
>   unsigned charid;  /* Request ID - 0 to 255   */
>   bool used;/* Request that has been assigned  */
>   unsigned int flags;   /* NCSI request property   */
> -#define NCSI_REQ_FLAG_EVENT_DRIVEN   1
> +#define NCSI_REQ_FLAG_EVENT_DRIVEN   1
> +#define NCSI_REQ_FLAG_NETLINK_DRIVEN 2
>   struct ncsi_dev_priv *ndp;/* Associated NCSI device  */
>   struct sk_buff   *cmd;/* Associated NCSI command packet  */
>   struct sk_buff   *rsp;/* Associated NCSI response packet */
>   struct timer_listtimer;   /* Timer on waiting for response   */
>   bool enabled; /* Time has been enabled or not*/
> +
> + u32  snd_seq; /* netlink sending sequence number */
> + u32  snd_portid;  /* netlink portid of sender*/
> + struct nlmsghdr  nlhdr;   /* netlink message header  */
>  };
>  
>  enum {
> @@ -305,6 +312,9 @@ struct ncsi_cmd_arg {
>   

Re: [PATCH net] inet: frags: rework rhashtable dismantle

2018-10-01 Thread Eric Dumazet
On Mon, Oct 1, 2018 at 5:58 PM Herbert Xu  wrote:

> The walk interface was designed to handle read-only iteration
> through the hash table.  While this probably works since the
> actual freeing is delayed by RCU, it seems to be rather fragile.
>
> How about using the dead flag but instead of putting it in the
> rhashtable put it in netns_frags and have the timers check on that
> before calling rhashtable_remove?

Sure, I will send a new version, thanks.


Re: [PATCH net-next] rtnetlink: fix rtnl_fdb_dump() for shorter family headers

2018-10-01 Thread Mauricio Faria de Oliveira
On Mon, Oct 1, 2018 at 12:38 PM Mauricio Faria de Oliveira
 wrote:
> Ok, thanks for your suggestions.
> I'll do some research/learning on them, and give it a try for a v2.

FYI, that is "[PATCH v2 net-next] rtnetlink: fix rtnl_fdb_dump() for
ndmsg header".

BTW, could you please advise whether this should be net or net-next? It's a bug fix,
but it's late in the cycle, and this is not urgent (the problem has been around
since v4.12), so not sure it's really needed for v4.19.

Thanks,

-- 
Mauricio Faria de Oliveira


[PATCH v2 net-next] rtnetlink: fix rtnl_fdb_dump() for ndmsg header

2018-10-01 Thread Mauricio Faria de Oliveira
Currently, rtnl_fdb_dump() assumes the family header is 'struct ifinfomsg',
which is not always true -- 'struct ndmsg' is used by iproute2 ('ip neigh').

The problem is, the function bails out early if nlmsg_parse() fails, which
does occur for iproute2 usage of 'struct ndmsg' because the payload length
is shorter than the family header alone (as 'struct ifinfomsg' is assumed).

This breaks backward compatibility with userspace -- nothing is sent back.

Some examples with iproute2 and netlink library for go [1]:

 1) $ bridge fdb show
33:33:00:00:00:01 dev ens3 self permanent
01:00:5e:00:00:01 dev ens3 self permanent
33:33:ff:15:98:30 dev ens3 self permanent

  This one works, as it uses 'struct ifinfomsg'.

  fdb_show() @ iproute2/bridge/fdb.c
"""
.n.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg)),
...
if (rtnl_dump_request(&rth, RTM_GETNEIGH, [...]
"""

 2) $ ip --family bridge neigh
RTNETLINK answers: Invalid argument
Dump terminated

  This one fails, as it uses 'struct ndmsg'.

  do_show_or_flush() @ iproute2/ip/ipneigh.c
"""
.n.nlmsg_type = RTM_GETNEIGH,
.n.nlmsg_len = NLMSG_LENGTH(sizeof(struct ndmsg)),
"""

 3) $ ./neighlist
< no output >

  This one fails, as it is 'struct ndmsg'-based.

  neighList() @ netlink/neigh_linux.go
"""
req := h.newNetlinkRequest(unix.RTM_GETNEIGH, [...]
msg := Ndmsg{
"""

The actual breakage was introduced by commit 0ff50e83b512 ("net: rtnetlink:
bail out from rtnl_fdb_dump() on parse error"), because nlmsg_parse() fails
if the payload length (with the _actual_ family header) is less than the
family header length alone (which is assumed, in parameter 'hdrlen').
This is true in the examples above with struct ndmsg, with size and payload
length shorter than struct ifinfomsg.

However, that commit just intends to fix something under the assumption the
family header is indeed a 'struct ifinfomsg' - by preventing access to the
payload as such (via 'ifm' pointer) if the payload length is not sufficient
to actually contain it.

The assumption was introduced by commit 5e6d24358799 ("bridge: netlink dump
interface at par with brctl"), to support iproute2's 'bridge fdb' command
(not 'ip neigh') which indeed uses 'struct ifinfomsg', thus is not broken.

So, in order to unbreak the 'struct ndmsg' family headers and still allow
'struct ifinfomsg' to continue to work, check for the known message sizes
used with 'struct ndmsg' in iproute2 (with zero or one attribute, which is
not used in this function anyway), and then do not parse the data as ifinfomsg.
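The size-based disambiguation described above can be sketched roughly as follows. The struct layouts below are simplified stand-ins for the uapi headers, and `payload_is_ndmsg()` is an illustrative helper name, not the code this patch adds:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the uapi structs; only the sizes matter
 * for this check (ndmsg is 12 bytes, ifinfomsg is 16 bytes). */
struct ndmsg_s {
	uint8_t  ndm_family;
	uint8_t  ndm_pad1;
	uint16_t ndm_pad2;
	int32_t  ndm_ifindex;
	uint16_t ndm_state;
	uint8_t  ndm_flags;
	uint8_t  ndm_type;
};

struct ifinfomsg_s {
	uint8_t  ifi_family;
	uint8_t  ifi_pad;
	uint16_t ifi_type;
	int32_t  ifi_index;
	uint32_t ifi_flags;
	uint32_t ifi_change;
};

#define NLA_HDRLEN_S 4	/* aligned netlink attribute header */

/* True when the dump payload matches one of the known iproute2
 * 'struct ndmsg' request sizes: the bare header, or the header plus
 * one (ignored) attribute header.  Note the second case is the same
 * size as a bare ifinfomsg -- exactly the ambiguity the real patch
 * has to arbitrate. */
static bool payload_is_ndmsg(size_t payload_len)
{
	return payload_len == sizeof(struct ndmsg_s) ||
	       payload_len == sizeof(struct ndmsg_s) + NLA_HDRLEN_S;
}
```

With these sizes, a bare ndmsg request (12 bytes) is shorter than ifinfomsg (16 bytes), which is why nlmsg_parse() bailed out before this fix.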

Same examples with this patch applied (or revert/before the original fix):

$ bridge fdb show
33:33:00:00:00:01 dev ens3 self permanent
01:00:5e:00:00:01 dev ens3 self permanent
33:33:ff:15:98:30 dev ens3 self permanent

$ ip --family bridge neigh
dev ens3 lladdr 33:33:00:00:00:01 PERMANENT
dev ens3 lladdr 01:00:5e:00:00:01 PERMANENT
dev ens3 lladdr 33:33:ff:15:98:30 PERMANENT

$ ./neighlist
netlink.Neigh{LinkIndex:2, Family:7, State:128, Type:0, Flags:2, IP:net.IP(nil), HardwareAddr:net.HardwareAddr{0x33, 0x33, 0x0, 0x0, 0x0, 0x1}, LLIPAddr:net.IP(nil), Vlan:0, VNI:0}
netlink.Neigh{LinkIndex:2, Family:7, State:128, Type:0, Flags:2, IP:net.IP(nil), HardwareAddr:net.HardwareAddr{0x1, 0x0, 0x5e, 0x0, 0x0, 0x1}, LLIPAddr:net.IP(nil), Vlan:0, VNI:0}
netlink.Neigh{LinkIndex:2, Family:7, State:128, Type:0, Flags:2, IP:net.IP(nil), HardwareAddr:net.HardwareAddr{0x33, 0x33, 0xff, 0x15, 0x98, 0x30}, LLIPAddr:net.IP(nil), Vlan:0, VNI:0}

Tested on mainline (v4.19-rc6) and net-next (3bd09b05b068).

References:

[1] netlink library for go (test-case)
https://github.com/vishvananda/netlink

$ cat ~/go/src/neighlist/main.go
package main
import ("fmt"; "syscall"; "github.com/vishvananda/netlink")
func main() {
neighs, _ := netlink.NeighList(0, syscall.AF_BRIDGE)
for _, neigh := range neighs { fmt.Printf("%#v\n", neigh) }
}

$ export GOPATH=~/go
$ go get github.com/vishvananda/netlink
$ go build neighlist
$ ~/go/src/neighlist/neighlist

Thanks to David Ahern for suggestions to improve this patch.

Fixes: 0ff50e83b512 ("net: rtnetlink: bail out from rtnl_fdb_dump() on parse error")
Fixes: 5e6d24358799 ("bridge: netlink dump interface at par with brctl")
Reported-by: Aidan Obley 
Signed-off-by: Mauricio Faria de Oliveira 

---
 v2: Change logic to check msg size for ndmsg with optional attribute.
 Thanks: David Ahern 

 net/core/rtnetlink.c | 29 -
 1 file changed, 20 insertions(+), 9 deletions(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 60c928894a78..6633f245fce5 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -3744,16 +3744,27 @@ static int rtnl_fdb_dump(struct sk_buff *skb, struct netlink_callback *cb)
int err = 0;
  

[PATCH bpf-next 0/3] nfp: bpf: support big map entries

2018-10-01 Thread Jakub Kicinski
Hi!

This series makes the control message parsing for interacting
with BPF maps more flexible.  Up until now we had a hard limit
in the ABI for key and value size to be 64B at most.  Using
TLV capability allows us to support large map entries.
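As a rough illustration of the TLV walk such a capability area implies (the field names, widths, and host-endian assumption here are mine for demonstration, not the NFP control-message ABI):

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed TLV layout for illustration: 4-byte type, 4-byte length,
 * then 'length' bytes of value.  The real firmware ABI differs. */
struct cap_tlv {
	uint32_t type;
	uint32_t length;	/* bytes of value[] */
	uint8_t  value[];
};

/* Return the value length for capability 'want' and point *value at
 * its payload, or -1 if absent or truncated.  A real parser would
 * also convert from the wire endianness. */
static int find_cap(const uint8_t *area, size_t area_len,
		    uint32_t want, const uint8_t **value)
{
	size_t off = 0;

	while (off + sizeof(struct cap_tlv) <= area_len) {
		const struct cap_tlv *tlv =
			(const struct cap_tlv *)(area + off);

		if (off + sizeof(*tlv) + tlv->length > area_len)
			return -1;	/* truncated TLV */
		if (tlv->type == want) {
			*value = tlv->value;
			return (int)tlv->length;
		}
		off += sizeof(*tlv) + tlv->length;
	}
	return -1;
}

/* Tiny self-check: one TLV of type 7 carrying a 4-byte value. */
static int find_cap_demo(void)
{
	uint32_t words[3] = { 7, 4, 2 };	/* type, length, value */
	const uint8_t *p = NULL;
	int len = find_cap((const uint8_t *)words, sizeof(words), 7, &p);

	return (len == 4 && p != NULL) ? len : -1;
}
```

The point of a walk like this, as opposed to a fixed struct, is that new capability types (and larger values) can be appended without breaking older parsers, which simply skip unknown types.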

Jakub Kicinski (3):
  nfp: bpf: parse global BPF ABI version capability
  nfp: allow apps to request larger MTU on control vNIC
  nfp: bpf: allow control message sizing for map ops

 drivers/net/ethernet/netronome/nfp/bpf/cmsg.c | 70 ---
 drivers/net/ethernet/netronome/nfp/bpf/fw.h   | 11 ++-
 drivers/net/ethernet/netronome/nfp/bpf/main.c | 52 --
 drivers/net/ethernet/netronome/nfp/bpf/main.h | 11 +++
 drivers/net/ethernet/netronome/nfp/nfp_app.h  |  4 ++
 .../ethernet/netronome/nfp/nfp_net_common.c   | 14 +++-
 .../net/ethernet/netronome/nfp/nfp_net_ctrl.h |  2 +-
 7 files changed, 142 insertions(+), 22 deletions(-)

-- 
2.17.1



[PATCH bpf-next 2/3] nfp: allow apps to request larger MTU on control vNIC

2018-10-01 Thread Jakub Kicinski
Some apps may want to have higher MTU on the control vNIC/queue.
Allow them to set the requested MTU at init time.

Signed-off-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/nfp_app.h   |  4 
 .../net/ethernet/netronome/nfp/nfp_net_common.c| 14 --
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_app.h b/drivers/net/ethernet/netronome/nfp/nfp_app.h
index 4e1eb3395648..c896eb8f87a1 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_app.h
+++ b/drivers/net/ethernet/netronome/nfp/nfp_app.h
@@ -40,6 +40,8 @@
 
 #include "nfp_net_repr.h"
 
+#define NFP_APP_CTRL_MTU_MAX   U32_MAX
+
 struct bpf_prog;
 struct net_device;
 struct netdev_bpf;
@@ -178,6 +180,7 @@ struct nfp_app_type {
  * @ctrl:  pointer to ctrl vNIC struct
  * @reprs: array of pointers to representors
  * @type:  pointer to const application ops and info
+ * @ctrl_mtu:  MTU to set on the control vNIC (set in .init())
  * @priv:  app-specific priv data
  */
 struct nfp_app {
@@ -189,6 +192,7 @@ struct nfp_app {
struct nfp_reprs __rcu *reprs[NFP_REPR_TYPE_MAX + 1];
 
const struct nfp_app_type *type;
+   unsigned int ctrl_mtu;
void *priv;
 };
 
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index d05e37fcc1b2..8e8dc0db2493 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -3877,10 +3877,20 @@ int nfp_net_init(struct nfp_net *nn)
return err;
 
/* Set default MTU and Freelist buffer size */
-   if (nn->max_mtu < NFP_NET_DEFAULT_MTU)
+   if (!nfp_net_is_data_vnic(nn) && nn->app->ctrl_mtu) {
+   if (nn->app->ctrl_mtu <= nn->max_mtu) {
+   nn->dp.mtu = nn->app->ctrl_mtu;
+   } else {
+   if (nn->app->ctrl_mtu != NFP_APP_CTRL_MTU_MAX)
+   nn_warn(nn, "app requested MTU above max supported %u > %u\n",
+   nn->app->ctrl_mtu, nn->max_mtu);
+   nn->dp.mtu = nn->max_mtu;
+   }
+   } else if (nn->max_mtu < NFP_NET_DEFAULT_MTU) {
nn->dp.mtu = nn->max_mtu;
-   else
+   } else {
nn->dp.mtu = NFP_NET_DEFAULT_MTU;
+   }
nn->dp.fl_bufsz = nfp_net_calc_fl_bufsz(&nn->dp);
 
if (nfp_app_ctrl_uses_data_vnics(nn->app))
-- 
2.17.1



[PATCH bpf-next 1/3] nfp: bpf: parse global BPF ABI version capability

2018-10-01 Thread Jakub Kicinski
Up until now we only had per-vNIC BPF ABI version capabilities,
which are slightly awkward to use because bulk of the resources
and configuration does not relate to any particular vNIC.  Add
a new capability for global ABI version and check the per-vNIC
version are equal to it.  Assume the ABI version 2 if no explicit
version capability is present.

Signed-off-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/bpf/fw.h   |  1 +
 drivers/net/ethernet/netronome/nfp/bpf/main.c | 43 +--
 drivers/net/ethernet/netronome/nfp/bpf/main.h |  4 ++
 3 files changed, 44 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/fw.h b/drivers/net/ethernet/netronome/nfp/bpf/fw.h
index e4f9b7ec8528..58bad868bb6f 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/fw.h
+++ b/drivers/net/ethernet/netronome/nfp/bpf/fw.h
@@ -52,6 +52,7 @@ enum bpf_cap_tlv_type {
NFP_BPF_CAP_TYPE_RANDOM = 4,
NFP_BPF_CAP_TYPE_QUEUE_SELECT   = 5,
NFP_BPF_CAP_TYPE_ADJUST_TAIL= 6,
+   NFP_BPF_CAP_TYPE_ABI_VERSION= 7,
 };
 
 struct nfp_bpf_cap_tlv_func {
diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.c b/drivers/net/ethernet/netronome/nfp/bpf/main.c
index 970af07f4656..1f79246765d1 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.c
@@ -54,11 +54,14 @@ const struct rhashtable_params nfp_bpf_maps_neutral_params = {
 static bool nfp_net_ebpf_capable(struct nfp_net *nn)
 {
 #ifdef __LITTLE_ENDIAN
-   if (nn->cap & NFP_NET_CFG_CTRL_BPF &&
-   nn_readb(nn, NFP_NET_CFG_BPF_ABI) == NFP_NET_BPF_ABI)
-   return true;
-#endif
+   struct nfp_app_bpf *bpf = nn->app->priv;
+
+   return nn->cap & NFP_NET_CFG_CTRL_BPF &&
+  bpf->abi_version &&
+  nn_readb(nn, NFP_NET_CFG_BPF_ABI) == bpf->abi_version;
+#else
return false;
+#endif
 }
 
 static int
@@ -342,6 +345,26 @@ nfp_bpf_parse_cap_adjust_tail(struct nfp_app_bpf *bpf, void __iomem *value,
return 0;
 }
 
+static int
+nfp_bpf_parse_cap_abi_version(struct nfp_app_bpf *bpf, void __iomem *value,
+ u32 length)
+{
+   if (length < 4) {
+   nfp_err(bpf->app->cpp, "truncated ABI version TLV: %d\n",
+   length);
+   return -EINVAL;
+   }
+
+   bpf->abi_version = readl(value);
+   if (bpf->abi_version != 2) {
+   nfp_warn(bpf->app->cpp, "unsupported BPF ABI version: %d\n",
+bpf->abi_version);
+   bpf->abi_version = 0;
+   }
+
+   return 0;
+}
+
 static int nfp_bpf_parse_capabilities(struct nfp_app *app)
 {
struct nfp_cpp *cpp = app->pf->cpp;
@@ -393,6 +416,11 @@ static int nfp_bpf_parse_capabilities(struct nfp_app *app)
  length))
goto err_release_free;
break;
+   case NFP_BPF_CAP_TYPE_ABI_VERSION:
+   if (nfp_bpf_parse_cap_abi_version(app->priv, value,
+ length))
+   goto err_release_free;
+   break;
default:
nfp_dbg(cpp, "unknown BPF capability: %d\n", type);
break;
@@ -414,6 +442,11 @@ static int nfp_bpf_parse_capabilities(struct nfp_app *app)
return -EINVAL;
 }
 
+static void nfp_bpf_init_capabilities(struct nfp_app_bpf *bpf)
+{
+   bpf->abi_version = 2; /* Original BPF ABI version */
+}
+
 static int nfp_bpf_ndo_init(struct nfp_app *app, struct net_device *netdev)
 {
struct nfp_app_bpf *bpf = app->priv;
@@ -447,6 +480,8 @@ static int nfp_bpf_init(struct nfp_app *app)
if (err)
goto err_free_bpf;
 
+   nfp_bpf_init_capabilities(bpf);
+
err = nfp_bpf_parse_capabilities(app);
if (err)
goto err_free_neutral_maps;
diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.h b/drivers/net/ethernet/netronome/nfp/bpf/main.h
index dbd00982fd2b..62cdb183efdb 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.h
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.h
@@ -127,6 +127,8 @@ enum pkt_vec {
  *
  * @maps_neutral:  hash table of offload-neutral maps (on pointer)
  *
+ * @abi_version:   global BPF ABI version
+ *
  * @adjust_head:   adjust head capability
  * @adjust_head.flags: extra flags for adjust head
  * @adjust_head.off_min:   minimal packet offset within buffer required
@@ -170,6 +172,8 @@ struct nfp_app_bpf {
 
struct rhashtable maps_neutral;
 
+   u32 abi_version;
+
struct nfp_bpf_cap_adjust_head {
u32 flags;
int off_min;
-- 
2.17.1



[PATCH bpf-next 3/3] nfp: bpf: allow control message sizing for map ops

2018-10-01 Thread Jakub Kicinski
In current ABI the size of the messages carrying map elements was
statically defined to at most 16 words of key and 16 words of value
(NFP word is 4 bytes).  We should not make this assumption and use
the max key and value sizes from the BPF capability instead.

To make sure old kernels don't get surprised with larger (or smaller)
messages, bump the FW ABI version to 3 when the key/value size is different
from 16 words.
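The version rule described above amounts to something like the following sketch; the constant and function names are illustrative, not taken from the driver:

```c
#include <stdint.h>

#define NFP_WORD_BYTES	4u
#define LEGACY_WORDS	16u	/* original fixed ABI: 16 words each */

/* ABI 2 is the original fixed 16x4-byte key/value layout; any other
 * key/value size advertised by the capability implies the new
 * sized-message ABI version 3, so old kernels refuse to attach. */
static uint32_t expected_abi_version(uint32_t key_bytes, uint32_t val_bytes)
{
	if (key_bytes == LEGACY_WORDS * NFP_WORD_BYTES &&
	    val_bytes == LEGACY_WORDS * NFP_WORD_BYTES)
		return 2;
	return 3;
}
```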

Signed-off-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/bpf/cmsg.c | 71 ---
 drivers/net/ethernet/netronome/nfp/bpf/fw.h   | 10 +--
 drivers/net/ethernet/netronome/nfp/bpf/main.c | 11 ++-
 drivers/net/ethernet/netronome/nfp/bpf/main.h |  7 ++
 .../net/ethernet/netronome/nfp/nfp_net_ctrl.h |  1 -
 5 files changed, 83 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/cmsg.c b/drivers/net/ethernet/netronome/nfp/bpf/cmsg.c
index 2572a4b91c7c..fdcd2bc98916 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/cmsg.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/cmsg.c
@@ -89,15 +89,32 @@ nfp_bpf_cmsg_alloc(struct nfp_app_bpf *bpf, unsigned int size)
return skb;
 }
 
+static unsigned int
+nfp_bpf_cmsg_map_req_size(struct nfp_app_bpf *bpf, unsigned int n)
+{
+   unsigned int size;
+
+   size = sizeof(struct cmsg_req_map_op);
+   size += (bpf->cmsg_key_sz + bpf->cmsg_val_sz) * n;
+
+   return size;
+}
+
 static struct sk_buff *
 nfp_bpf_cmsg_map_req_alloc(struct nfp_app_bpf *bpf, unsigned int n)
+{
+   return nfp_bpf_cmsg_alloc(bpf, nfp_bpf_cmsg_map_req_size(bpf, n));
+}
+
+static unsigned int
+nfp_bpf_cmsg_map_reply_size(struct nfp_app_bpf *bpf, unsigned int n)
 {
unsigned int size;
 
-   size = sizeof(struct cmsg_req_map_op);
-   size += sizeof(struct cmsg_key_value_pair) * n;
+   size = sizeof(struct cmsg_reply_map_op);
+   size += (bpf->cmsg_key_sz + bpf->cmsg_val_sz) * n;
 
-   return nfp_bpf_cmsg_alloc(bpf, size);
+   return size;
 }
 
 static u8 nfp_bpf_cmsg_get_type(struct sk_buff *skb)
@@ -338,6 +355,34 @@ void nfp_bpf_ctrl_free_map(struct nfp_app_bpf *bpf, struct nfp_bpf_map *nfp_map)
dev_consume_skb_any(skb);
 }
 
+static void *
+nfp_bpf_ctrl_req_key(struct nfp_app_bpf *bpf, struct cmsg_req_map_op *req,
+unsigned int n)
+{
+   return &req->data[bpf->cmsg_key_sz * n + bpf->cmsg_val_sz * n];
+}
+
+static void *
+nfp_bpf_ctrl_req_val(struct nfp_app_bpf *bpf, struct cmsg_req_map_op *req,
+unsigned int n)
+{
+   return &req->data[bpf->cmsg_key_sz * (n + 1) + bpf->cmsg_val_sz * n];
+}
+
+static void *
+nfp_bpf_ctrl_reply_key(struct nfp_app_bpf *bpf, struct cmsg_reply_map_op *reply,
+  unsigned int n)
+{
+   return &reply->data[bpf->cmsg_key_sz * n + bpf->cmsg_val_sz * n];
+}
+
+static void *
+nfp_bpf_ctrl_reply_val(struct nfp_app_bpf *bpf, struct cmsg_reply_map_op *reply,
+  unsigned int n)
+{
+   return &reply->data[bpf->cmsg_key_sz * (n + 1) + bpf->cmsg_val_sz * n];
+}
+
 static int
 nfp_bpf_ctrl_entry_op(struct bpf_offloaded_map *offmap,
  enum nfp_bpf_cmsg_type op,
@@ -366,12 +411,13 @@ nfp_bpf_ctrl_entry_op(struct bpf_offloaded_map *offmap,
 
/* Copy inputs */
if (key)
-   memcpy(&req->elem[0].key, key, map->key_size);
+   memcpy(nfp_bpf_ctrl_req_key(bpf, req, 0), key, map->key_size);
if (value)
-   memcpy(&req->elem[0].value, value, map->value_size);
+   memcpy(nfp_bpf_ctrl_req_val(bpf, req, 0), value,
+  map->value_size);
 
skb = nfp_bpf_cmsg_communicate(bpf, skb, op,
-  sizeof(*reply) + sizeof(*reply->elem));
+  nfp_bpf_cmsg_map_reply_size(bpf, 1));
if (IS_ERR(skb))
return PTR_ERR(skb);
 
@@ -382,9 +428,11 @@ nfp_bpf_ctrl_entry_op(struct bpf_offloaded_map *offmap,
 
/* Copy outputs */
if (out_key)
-   memcpy(out_key, &reply->elem[0].key, map->key_size);
+   memcpy(out_key, nfp_bpf_ctrl_reply_key(bpf, reply, 0),
+  map->key_size);
if (out_value)
-   memcpy(out_value, &reply->elem[0].value, map->value_size);
+   memcpy(out_value, nfp_bpf_ctrl_reply_val(bpf, reply, 0),
+  map->value_size);
 
dev_consume_skb_any(skb);
 
@@ -428,6 +476,13 @@ int nfp_bpf_ctrl_getnext_entry(struct bpf_offloaded_map *offmap,
 key, NULL, 0, next_key, NULL);
 }
 
+unsigned int nfp_bpf_ctrl_cmsg_mtu(struct nfp_app_bpf *bpf)
+{
+   return max3((unsigned int)NFP_NET_DEFAULT_MTU,
+   nfp_bpf_cmsg_map_req_size(bpf, 1),
+   nfp_bpf_cmsg_map_reply_size(bpf, 1));
+}
+
 void nfp_bpf_ctrl_msg_rx(struct nfp_app *app, struct sk_buff *skb)
 {
struct nfp_app_bpf *bpf = app->priv;
diff --git a/drivers/net/ethernet/netronome/nfp/bpf/fw.h 

Re: [PATCH v2] net/ncsi: Add NCSI OEM command support

2018-10-01 Thread Samuel Mendoza-Jonas
On Fri, 2018-09-28 at 18:06 -0700, Vijay Khemka wrote:
> This patch adds OEM commands and response handling. It also defines OEM
> command and response structure as per NCSI specification along with its
> handlers.
> 
> ncsi_cmd_handler_oem: This is a generic command request handler for OEM
> commands
> ncsi_rsp_handler_oem: This is a generic response handler for OEM commands
> 
> Signed-off-by: Vijay Khemka 

Hi Vijay - looks good to me, and should be a good common base for your
and Justin's changes.

Reviewed-by: Samuel Mendoza-Jonas 

> ---
>  net/ncsi/internal.h |  4 
>  net/ncsi/ncsi-cmd.c | 31 ---
>  net/ncsi/ncsi-pkt.h | 16 
>  net/ncsi/ncsi-rsp.c | 44 +++-
>  4 files changed, 91 insertions(+), 4 deletions(-)
> 
> diff --git a/net/ncsi/internal.h b/net/ncsi/internal.h
> index 8055e3965cef..c16cb7223064 100644
> --- a/net/ncsi/internal.h
> +++ b/net/ncsi/internal.h
> @@ -68,6 +68,10 @@ enum {
>   NCSI_MODE_MAX
>  };
>  
> +/* OEM Vendor Manufacture ID */
> +#define NCSI_OEM_MFR_MLX_ID 0x8119
> +#define NCSI_OEM_MFR_BCM_ID 0x113d
> +
>  struct ncsi_channel_version {
>   u32 version;/* Supported BCD encoded NCSI version */
>   u32 alpha2; /* Supported BCD encoded NCSI version */
> diff --git a/net/ncsi/ncsi-cmd.c b/net/ncsi/ncsi-cmd.c
> index 7567ca63aae2..2f98533eba46 100644
> --- a/net/ncsi/ncsi-cmd.c
> +++ b/net/ncsi/ncsi-cmd.c
> @@ -211,6 +211,26 @@ static int ncsi_cmd_handler_snfc(struct sk_buff *skb,
>   return 0;
>  }
>  
> +static int ncsi_cmd_handler_oem(struct sk_buff *skb,
> + struct ncsi_cmd_arg *nca)
> +{
> + struct ncsi_cmd_oem_pkt *cmd;
> + unsigned int len;
> +
> + len = sizeof(struct ncsi_cmd_pkt_hdr) + 4;
> + if (nca->payload < 26)
> + len += 26;
> + else
> + len += nca->payload;
> +
> + cmd = skb_put_zero(skb, len);
> + cmd->mfr_id = nca->dwords[0];
> + memcpy(cmd->data, &nca->dwords[1], nca->payload - 4);
> + ncsi_cmd_build_header(&cmd->cmd.common, nca);
> +
> + return 0;
> +}
> +
>  static struct ncsi_cmd_handler {
>   unsigned char type;
>   int   payload;
> @@ -244,7 +264,7 @@ static struct ncsi_cmd_handler {
>   { NCSI_PKT_CMD_GNS,0, ncsi_cmd_handler_default },
>   { NCSI_PKT_CMD_GNPTS,  0, ncsi_cmd_handler_default },
>   { NCSI_PKT_CMD_GPS,0, ncsi_cmd_handler_default },
> - { NCSI_PKT_CMD_OEM,0, NULL },
> + { NCSI_PKT_CMD_OEM,   -1, ncsi_cmd_handler_oem },
>   { NCSI_PKT_CMD_PLDM,   0, NULL },
>   { NCSI_PKT_CMD_GPUUID, 0, ncsi_cmd_handler_default }
>  };
> @@ -316,8 +336,13 @@ int ncsi_xmit_cmd(struct ncsi_cmd_arg *nca)
>   return -ENOENT;
>   }
>  
> - /* Get packet payload length and allocate the request */
> - nca->payload = nch->payload;
> + /* Get packet payload length and allocate the request
> +  * It is expected that if length set as negative in
> +  * handler structure means caller is initializing it
> +  * and setting length in nca before calling xmit function
> +  */
> + if (nch->payload >= 0)
> + nca->payload = nch->payload;
>   nr = ncsi_alloc_command(nca);
>   if (!nr)
>   return -ENOMEM;
> diff --git a/net/ncsi/ncsi-pkt.h b/net/ncsi/ncsi-pkt.h
> index 91b4b66438df..1f338386810d 100644
> --- a/net/ncsi/ncsi-pkt.h
> +++ b/net/ncsi/ncsi-pkt.h
> @@ -151,6 +151,22 @@ struct ncsi_cmd_snfc_pkt {
>   unsigned char   pad[22];
>  };
>  
> +/* OEM Request Command as per NCSI Specification */
> +struct ncsi_cmd_oem_pkt {
> + struct ncsi_cmd_pkt_hdr cmd; /* Command header*/
> + __be32  mfr_id;  /* Manufacture ID*/
> + unsigned char   data[64];/* OEM Payload Data  */
> + __be32  checksum;/* Checksum  */
> +};
> +
> +/* OEM Response Packet as per NCSI Specification */
> +struct ncsi_rsp_oem_pkt {
> + struct ncsi_rsp_pkt_hdr rsp; /* Command header*/
> + __be32  mfr_id;  /* Manufacture ID*/
> + unsigned char   data[64];/* Payload data  */
> + __be32  checksum;/* Checksum  */
> +};
> +
>  /* Get Link Status */
>  struct ncsi_rsp_gls_pkt {
>   struct ncsi_rsp_pkt_hdr rsp;/* Response header   */
> diff --git a/net/ncsi/ncsi-rsp.c b/net/ncsi/ncsi-rsp.c
> index 930c1d3796f0..22664ebdc93a 100644
> --- a/net/ncsi/ncsi-rsp.c
> +++ b/net/ncsi/ncsi-rsp.c
> @@ -596,6 +596,48 @@ static int ncsi_rsp_handler_snfc(struct ncsi_request *nr)
>   return 0;
>  }
>  
> +static struct ncsi_rsp_oem_handler {
> + unsigned intmfr_id;
> + int (*handler)(struct ncsi_request *nr);
> +} ncsi_rsp_oem_handlers[] = {
> + { NCSI_OEM_MFR_MLX_ID, NULL },
> + { 

Re: [PATCH net] inet: frags: rework rhashtable dismantle

2018-10-01 Thread Herbert Xu
On Mon, Oct 01, 2018 at 10:58:21AM -0700, Eric Dumazet wrote:
>
>  void inet_frags_exit_net(struct netns_frags *nf)
>  {
> + struct rhashtable_iter hti;
> + struct inet_frag_queue *fq;
> +
> + /* Since we want to cleanup the hashtable, make sure that
> +  * we wont trigger an automatic shrinking while in our
> +  * rhashtable_walk_next() loop.
> +  * Also make sure that no resize is in progress.
> +  */
>   nf->high_thresh = 0; /* prevent creation of new frags */
> + nf->rhashtable.p.automatic_shrinking = false;
> + cancel_work_sync(&nf->rhashtable.run_work);
>  
> - rhashtable_free_and_destroy(&nf->rhashtable, inet_frags_free_cb, NULL);
> + rhashtable_walk_enter(&nf->rhashtable, &hti);
> + rhashtable_walk_start(&hti);
> + while ((fq = rhashtable_walk_next(&hti)) != NULL) {
> + if (IS_ERR(fq)) /* should not happen */
> + break;
> + if (!del_timer_sync(&fq->timer))
> + continue;
> +
> + spin_lock_bh(&fq->lock);
> + inet_frag_kill(fq);
> + spin_unlock_bh(&fq->lock);
> +
> + inet_frag_put(fq);
> + if (need_resched()) {
> + rhashtable_walk_stop(&hti);
> + cond_resched();
> + rhashtable_walk_start(&hti);
> + }
> + }

The walk interface was designed to handle read-only iteration
through the hash table.  While this probably works since the
actual freeing is delayed by RCU, it seems to be rather fragile.

How about using the dead flag but instead of putting it in the
rhashtable put it in netns_frags and have the timers check on that
before calling rhashtable_remove?
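A user-space model of that suggestion, with a plain flag standing in for the kernel's synchronization (real kernel code would need proper memory ordering against the timers), might look like:

```c
#include <stdbool.h>

/* Toy model of the suggested scheme: dismantle sets nf->dead once,
 * and each expiry timer checks it before removing from the table. */
struct netns_frags_model {
	bool dead;	/* set at netns dismantle, before the table free */
	int  removed;	/* stand-in for rhashtable_remove_fast() calls */
};

/* What a frag-queue timer does on expiry in this model. */
static bool frag_expire_model(struct netns_frags_model *nf)
{
	if (nf->dead)
		return false;	/* teardown in progress: skip removal */
	nf->removed++;		/* safe to touch the hashtable */
	return true;
}

/* Dismantle path: flag first, then free with no concurrent removes. */
static void frags_exit_model(struct netns_frags_model *nf)
{
	nf->dead = true;
	/* ...rhashtable_free_and_destroy() would run here... */
}
```

The key property is that once `dead` is observed, no timer ever calls into the rhashtable again, so the free-and-destroy path needs no synchronization with deletions.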

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH iproute2 net-next] ipneigh: update man page and help for router

2018-10-01 Thread David Ahern
On 9/29/18 8:48 PM, Roopa Prabhu wrote:
> From: Roopa Prabhu 
> 
> While at it also add missing text for proxy in the man page.
> 
> Signed-off-by: Roopa Prabhu 
> ---
>  ip/ipneigh.c|  1 +
>  man/man8/ip-neighbour.8 | 11 ++-
>  2 files changed, 11 insertions(+), 1 deletion(-)
> 

applied to iproute2-next. Thanks



Re: [PATCH net-next] ipv6: add vrf table handling code for ipv6 mcast

2018-10-01 Thread David Ahern
On 10/1/18 2:41 AM, Mike Manning wrote:
> From: Patrick Ruddy 
> 
> The code to obtain the correct table for the incoming interface was
> missing for IPv6. This has been added along with the table creation
> notification to fib rules for the RTNL_FAMILY_IP6MR address family.
> 
> Signed-off-by: Patrick Ruddy 
> Signed-off-by: Mike Manning 
> ---
>  drivers/net/vrf.c | 11 +++
>  net/ipv6/ip6mr.c  | 48 
>  2 files changed, 47 insertions(+), 12 deletions(-)
> 

Reviewed-by: David Ahern 



Re: [PATCH net-next] ipv4: Allow sending multicast packets on specific i/f using VRF socket

2018-10-01 Thread David Ahern
On 10/1/18 2:40 AM, Mike Manning wrote:
> From: Robert Shearman 
> 
> It is useful to be able to use the same socket for listening in a
> specific VRF, as for sending multicast packets out of a specific
> interface. However, the bound device on the socket currently takes
> precedence and results in the packets not being sent.
> 
> Relax the condition on overriding the output interface to use for
> sending packets out of UDP, raw and ping sockets to allow multicast
> packets to be sent using the specified multicast interface.
> 
> Signed-off-by: Robert Shearman 
> Signed-off-by: Mike Manning 
> ---
>  net/ipv4/datagram.c | 2 +-
>  net/ipv4/ping.c | 2 +-
>  net/ipv4/raw.c  | 2 +-
>  net/ipv4/udp.c  | 2 +-
>  4 files changed, 4 insertions(+), 4 deletions(-)
> 

Reviewed-by: David Ahern 



[PATCH RFC v2 net-next 00/25] rtnetlink: Add support for rigid checking of data in dump request

2018-10-01 Thread David Ahern
From: David Ahern 

There are many use cases where a user wants to influence what is
returned in a dump for some rtnetlink command: one is wanting data
for a different namespace than the one in which the request is received, and
another is limiting the amount of data returned in the dump to a
specific set of interest to userspace, reducing the cpu overhead of
both kernel and userspace. Unfortunately, the kernel has historically
not been strict with checking for the proper header or checking the
values passed in the header. This lenient implementation has allowed
iproute2 and other packages to pass any struct or data in the dump
request as long as the family is the first byte. For example, ifinfomsg
struct is used by iproute2 for all generic dump requests - links,
addresses, routes and rules when it is really only valid for link
requests.

There is one example where the kernel deals with the wrong struct: link
dumps after VF support was added. Older iproute2 was sending rtgenmsg as
the header instead of ifinfomsg so a patch was added to try and detect
old userspace vs new:
e5eca6d41f53 ("rtnetlink: fix userspace API breakage for iproute2 < v3.9.0")

The latest example is Christian's patch set wanting to return addresses for
a target namespace. It guesses the header struct is an ifaddrmsg and if it
guesses wrong a netlink warning is generated in the kernel log on every
address dump which is unacceptable.

Another example where the kernel is a bit lenient is route dumps: iproute2
can send either a request with either ifinfomsg or a rtmsg as the header
struct, yet the kernel always treats the header as an rtmsg (see
inet_dump_fib and rtm_flags check). The header inconsistency impacts the
ability to add kernel side filters for route dumps - a necessary feature
for scale setups with 100k+ routes.

How to resolve the problem of not breaking old userspace yet be able to
move forward with new features such as kernel side filtering which are
crucial for efficient operation at high scale?

This patch set addresses the problem by adding a new netlink flag,
NLM_F_DUMP_PROPER_HDR, that userspace can set to say "I have a clue, and
I am sending the right header struct" and that the struct fields and any
attributes after it should be used for filtering the data returned in the
dump.

Kernel side, the dump handlers are updated to verify the message contains
at least the expected header struct:
RTM_GETLINK:   ifinfomsg
RTM_GETADDR:   ifaddrmsg
RTM_GETMULTICAST:  ifaddrmsg
RTM_GETANYCAST:ifaddrmsg
RTM_GETADDRLABEL:  ifaddrlblmsg
RTM_GETROUTE:  rtmsg
RTM_GETSTATS:  if_stats_msg
RTM_GETNEIGH:  ndmsg
RTM_GETNEIGHTBL:   ndtmsg
RTM_GETNSID:   rtnl_net_dumpid
RTM_GETRULE:   fib_rule_hdr
RTM_GETNETCONF:netconfmsg
RTM_GETMDB:br_port_msg

And then every field in the header struct should be 0 with the exception
of the family. There are a few exceptions to this rule where the kernel
already influences the data returned by values in the struct. Next the
message should not contain attributes unless the kernel implements
filtering for it. Any unexpected data causes the dump to fail with EINVAL.
If the new flag is honored by the kernel and the dump contents adjusted
by any data passed in the request, the dump handler can set the
NLM_F_DUMP_FILTERED flag in the netlink message header.

As an example of how this new NLM_F_DUMP_PROPER_HDR can be leveraged,
the last 6 patch add filtering of route dumps based on table id, protocol,
tos, flags, scope, and egress device.

David Ahern (25):
  net/netlink: Pass extack to dump callbacks
  net/ipv6: Refactor address dump to push inet6_fill_args to
in6_dump_addrs
  netlink: introduce NLM_F_DUMP_PROPER_HDR flag
  net/ipv4: Update inet_dump_ifaddr to support NLM_F_DUMP_PROPER_HDR
  net/ipv6: Update inet6_dump_addr to support NLM_F_DUMP_PROPER_HDR
  rtnetlink: Update rtnl_dump_ifinfo to support NLM_F_DUMP_PROPER_HDR
  rtnetlink: Update rtnl_bridge_getlink to support NLM_F_DUMP_PROPER_HDR
  rtnetlink: Update rtnl_stats_dump to support NLM_F_DUMP_PROPER_HDR
  rtnetlink: Update inet6_dump_ifinfo to support NLM_F_DUMP_PROPER_HDR
  rtnetlink: Update ipmr_rtm_dumplink to support NLM_F_DUMP_PROPER_HDR
  rtnetlink: Update fib dumps to support NLM_F_DUMP_PROPER_HDR
  net/neigh: Refactor dump filter handling
  net/neighbor: Update neigh_dump_info to support NLM_F_DUMP_PROPER_HDR
  net/neighbor: Update neightbl_dump_info to support
NLM_F_DUMP_PROPER_HDR
  net/namespace: Update rtnl_net_dumpid to support NLM_F_DUMP_PROPER_HDR
  net/fib_rules: Update fib_nl_dumprule to support NLM_F_DUMP_PROPER_HDR
  net/ipv6: Update ip6addrlbl_dump to support NLM_F_DUMP_PROPER_HDR
  net: Update netconf dump handlers to support NLM_F_DUMP_PROPER_HDR
  net/bridge: Update br_mdb_dump to support NLM_F_DUMP_PROPER_HDR
  net: Add struct for fib dump filter
  net/ipv4: Plumb support for filtering route dumps
  net/ipv6: Plumb support for 

[PATCH RFC v2 net-next 19/25] net/bridge: Update br_mdb_dump to support NLM_F_DUMP_PROPER_HDR

2018-10-01 Thread David Ahern
From: David Ahern 

Update br_mdb_dump to check for NLM_F_DUMP_PROPER_HDR in the netlink
message header. If the flag is set, the dump request is expected to have
a br_port_msg struct as the header. All elements of the struct are
expected to be 0 and no attributes can be appended.

Signed-off-by: David Ahern 
---
 net/bridge/br_mdb.c | 20 ++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c
index a4a848bf827b..57c43c1b1e71 100644
--- a/net/bridge/br_mdb.c
+++ b/net/bridge/br_mdb.c
@@ -167,8 +167,27 @@ static int br_mdb_dump(struct sk_buff *skb, struct netlink_callback *cb)
	struct net_device *dev;
	struct net *net = sock_net(skb->sk);
	struct nlmsghdr *nlh = NULL;
+	struct br_port_msg *bpm;
	int idx = 0, s_idx;
 
+	if (cb->nlh->nlmsg_flags & NLM_F_DUMP_PROPER_HDR) {
+		struct netlink_ext_ack *extack = cb->extack;
+
+		if (cb->nlh->nlmsg_len < nlmsg_msg_size(sizeof(*bpm))) {
+			NL_SET_ERR_MSG(extack, "Invalid header");
+			return -EINVAL;
+		}
+		bpm = nlmsg_data(cb->nlh);
+		if (bpm->ifindex) {
+			NL_SET_ERR_MSG(extack, "Filtering by device index is not supported");
+			return -EINVAL;
+		}
+		if (cb->nlh->nlmsg_len != nlmsg_msg_size(sizeof(*bpm))) {
+			NL_SET_ERR_MSG(extack, "Invalid data after header");
+			return -EINVAL;
+		}
+	}
+
s_idx = cb->args[0];
 
rcu_read_lock();
@@ -178,8 +196,6 @@ static int br_mdb_dump(struct sk_buff *skb, struct netlink_callback *cb)
 
for_each_netdev_rcu(net, dev) {
if (dev->priv_flags & IFF_EBRIDGE) {
-   struct br_port_msg *bpm;
-
if (idx < s_idx)
goto skip;
 
-- 
2.11.0



[PATCH RFC v2 net-next 17/25] net/ipv6: Update ip6addrlbl_dump to support NLM_F_DUMP_PROPER_HDR

2018-10-01 Thread David Ahern
From: David Ahern 

Update ip6addrlbl_dump to check for NLM_F_DUMP_PROPER_HDR in the netlink
message header. If the flag is set, the dump request is expected to have
an ifaddrlblmsg struct as the header. All elements of the struct are
expected to be 0 and no attributes can be appended.

Signed-off-by: David Ahern 
---
 net/ipv6/addrlabel.c | 25 -
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/addrlabel.c b/net/ipv6/addrlabel.c
index 1d6ced37ad71..89e15ed78c60 100644
--- a/net/ipv6/addrlabel.c
+++ b/net/ipv6/addrlabel.c
@@ -460,18 +460,41 @@ static int ip6addrlbl_fill(struct sk_buff *skb,
 
 static int ip6addrlbl_dump(struct sk_buff *skb, struct netlink_callback *cb)
 {
+   const struct nlmsghdr *nlh = cb->nlh;
struct net *net = sock_net(skb->sk);
struct ip6addrlbl_entry *p;
int idx = 0, s_idx = cb->args[0];
int err;
 
+   if (nlh->nlmsg_flags & NLM_F_DUMP_PROPER_HDR) {
+   struct netlink_ext_ack *extack = cb->extack;
+   struct ifaddrlblmsg *ifal;
+
+   if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*ifal))) {
+   NL_SET_ERR_MSG(extack, "Invalid header");
+   return -EINVAL;
+   }
+
+   ifal = nlmsg_data(nlh);
+   if (ifal->__ifal_reserved || ifal->ifal_prefixlen ||
+   ifal->ifal_flags || ifal->ifal_index || ifal->ifal_seq) {
+   NL_SET_ERR_MSG(extack, "Invalid values in header for dump request");
+   return -EINVAL;
+   }
+
+   if (nlh->nlmsg_len != nlmsg_msg_size(sizeof(*ifal))) {
+   NL_SET_ERR_MSG(extack, "Invalid data after header");
+   return -EINVAL;
+   }
+   }
+
rcu_read_lock();
	hlist_for_each_entry_rcu(p, &net->ipv6.ip6addrlbl_table.head, list) {
if (idx >= s_idx) {
err = ip6addrlbl_fill(skb, p,
  net->ipv6.ip6addrlbl_table.seq,
  NETLINK_CB(cb->skb).portid,
- cb->nlh->nlmsg_seq,
+ nlh->nlmsg_seq,
  RTM_NEWADDRLABEL,
  NLM_F_MULTI);
if (err < 0)
-- 
2.11.0



[PATCH RFC v2 net-next 07/25] rtnetlink: Update rtnl_bridge_getlink to support NLM_F_DUMP_PROPER_HDR

2018-10-01 Thread David Ahern
From: David Ahern 

Update rtnl_bridge_getlink to check for NLM_F_DUMP_PROPER_HDR in the netlink
message header. If the flag is set, the dump request is expected to have
an ifinfomsg struct as the header potentially followed by one or more
attributes. Any data passed in the header or as an attribute is taken as
a request to influence the data returned. Only values supported by the
dump handler are allowed to be non-0 or set in the request. At the moment
only the IFLA_EXT_MASK attribute is supported.

Signed-off-by: David Ahern 
---
 net/core/rtnetlink.c | 56 ++--
 1 file changed, 46 insertions(+), 10 deletions(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 2bf4b9916ca2..51a653b810be 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -3999,27 +3999,63 @@ EXPORT_SYMBOL_GPL(ndo_dflt_bridge_getlink);
 
static int rtnl_bridge_getlink(struct sk_buff *skb, struct netlink_callback *cb)
 {
+   struct netlink_ext_ack *extack = cb->extack;
+   const struct nlmsghdr *nlh = cb->nlh;
+   bool proper_hdr = !!(nlh->nlmsg_flags & NLM_F_DUMP_PROPER_HDR);
struct net *net = sock_net(skb->sk);
+   struct nlattr *tb[IFLA_MAX+1];
struct net_device *dev;
int idx = 0;
u32 portid = NETLINK_CB(cb->skb).portid;
-   u32 seq = cb->nlh->nlmsg_seq;
+   u32 seq = nlh->nlmsg_seq;
u32 filter_mask = 0;
-   int err;
+   int err, i;
 
-   if (nlmsg_len(cb->nlh) > sizeof(struct ifinfomsg)) {
-   struct nlattr *extfilt;
+   if (proper_hdr) {
+   struct ifinfomsg *ifm;
 
-   extfilt = nlmsg_find_attr(cb->nlh, sizeof(struct ifinfomsg),
- IFLA_EXT_MASK);
-   if (extfilt) {
-   if (nla_len(extfilt) < sizeof(filter_mask))
-   return -EINVAL;
+   if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*ifm))) {
+   NL_SET_ERR_MSG(extack, "Invalid header");
+   return -EINVAL;
+   }
 
-   filter_mask = nla_get_u32(extfilt);
+   ifm = nlmsg_data(nlh);
+   if (ifm->__ifi_pad || ifm->ifi_type || ifm->ifi_flags ||
+   ifm->ifi_change) {
+   NL_SET_ERR_MSG(extack, "Invalid values in header for dump request");
+   return -EINVAL;
+   }
+   if (ifm->ifi_index) {
+   NL_SET_ERR_MSG(extack, "Filter by device index not supported");
+   return -EINVAL;
}
}
 
+   err = nlmsg_parse(nlh, sizeof(struct ifinfomsg), tb, IFLA_MAX,
+ ifla_policy, extack);
+   if (err < 0) {
+   if (proper_hdr) {
+   NL_SET_ERR_MSG(extack, "Failed to parse link attributes");
+   return -EINVAL;
+   }
+   goto walk_entries;
+   }
+
+   for (i = 0; i <= IFLA_MAX; ++i) {
+   switch (i) {
+   case IFLA_EXT_MASK:
+   if (tb[i])
+   filter_mask = nla_get_u32(tb[i]);
+   break;
+   default:
+   if (proper_hdr && tb[i]) {
+   NL_SET_ERR_MSG(extack, "Unsupported attribute in dump request");
+   return -EINVAL;
+   }
+   }
+   }
+
+walk_entries:
rcu_read_lock();
for_each_netdev_rcu(net, dev) {
const struct net_device_ops *ops = dev->netdev_ops;
-- 
2.11.0



[PATCH RFC v2 net-next 09/25] rtnetlink: Update inet6_dump_ifinfo to support NLM_F_DUMP_PROPER_HDR

2018-10-01 Thread David Ahern
From: David Ahern 

Update inet6_dump_ifinfo to check for NLM_F_DUMP_PROPER_HDR in the netlink
message header. If the flag is set, the dump request is expected to have
an ifinfomsg struct as the header. All elements of the struct are
expected to be 0 and no attributes can be appended.

Signed-off-by: David Ahern 
---
 net/ipv6/addrconf.c | 38 +-
 1 file changed, 37 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 3382737df2a8..eb6fd5fbac80 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -5626,8 +5626,34 @@ static int inet6_fill_ifinfo(struct sk_buff *skb, struct inet6_dev *idev,
return -EMSGSIZE;
 }
 
+static int inet6_valid_dump_ifinfo(const struct nlmsghdr *nlh,
+  struct netlink_ext_ack *extack)
+{
+   struct ifinfomsg *ifm;
+
+   if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*ifm))) {
+   NL_SET_ERR_MSG(extack, "Invalid header");
+   return -EINVAL;
+   }
+
+   if (nlh->nlmsg_len > nlmsg_msg_size(sizeof(*ifm))) {
+   NL_SET_ERR_MSG(extack, "Invalid data after header");
+   return -EINVAL;
+   }
+
+   ifm = nlmsg_data(nlh);
+   if (ifm->__ifi_pad || ifm->ifi_type || ifm->ifi_flags ||
+   ifm->ifi_change || ifm->ifi_index) {
+   NL_SET_ERR_MSG(extack, "Invalid values in header for dump request");
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
 static int inet6_dump_ifinfo(struct sk_buff *skb, struct netlink_callback *cb)
 {
+   const struct nlmsghdr *nlh = cb->nlh;
struct net *net = sock_net(skb->sk);
int h, s_h;
int idx = 0, s_idx;
@@ -5635,6 +5661,16 @@ static int inet6_dump_ifinfo(struct sk_buff *skb, struct netlink_callback *cb)
struct inet6_dev *idev;
struct hlist_head *head;
 
+   /* only requests using NLM_F_DUMP_PROPER_HDR can pass data to
+* influence the dump
+*/
+   if (nlh->nlmsg_flags & NLM_F_DUMP_PROPER_HDR) {
+   int err = inet6_valid_dump_ifinfo(nlh, cb->extack);
+
+   if (err)
+   return err;
+   }
+
s_h = cb->args[0];
s_idx = cb->args[1];
 
@@ -5650,7 +5686,7 @@ static int inet6_dump_ifinfo(struct sk_buff *skb, struct netlink_callback *cb)
goto cont;
if (inet6_fill_ifinfo(skb, idev,
  NETLINK_CB(cb->skb).portid,
- cb->nlh->nlmsg_seq,
+ nlh->nlmsg_seq,
  RTM_NEWLINK, NLM_F_MULTI) < 0)
goto out;
 cont:
-- 
2.11.0



[PATCH RFC v2 net-next 13/25] net/neighbor: Update neigh_dump_info to support NLM_F_DUMP_PROPER_HDR

2018-10-01 Thread David Ahern
From: David Ahern 

Update neigh_dump_info to check for NLM_F_DUMP_PROPER_HDR in the netlink
message header. If the flag is set, the dump request is expected to have
an ndmsg struct as the header potentially followed by one or more
attributes. Any data passed in the header or as an attribute is taken as
a request to influence the data returned. Only values supported by the
dump handler are allowed to be non-0 or set in the request. At the moment
only the NDA_IFINDEX and NDA_MASTER attributes are supported.

Signed-off-by: David Ahern 
---
 net/core/neighbour.c | 61 +++-
 1 file changed, 51 insertions(+), 10 deletions(-)

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 9bab9ae9c98e..aaf2526e5da4 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -2425,13 +2425,15 @@ static int pneigh_dump_table(struct neigh_table *tbl, struct sk_buff *skb,
 
 static int neigh_dump_info(struct sk_buff *skb, struct netlink_callback *cb)
 {
+   struct netlink_ext_ack *extack = cb->extack;
const struct nlmsghdr *nlh = cb->nlh;
+   bool proper_hdr = !!(nlh->nlmsg_flags & NLM_F_DUMP_PROPER_HDR);
struct neigh_dump_filter filter = {};
struct nlattr *tb[NDA_MAX + 1];
struct neigh_table *tbl;
int t, family, s_t;
int proxy = 0;
-   int err;
+   int err, i;
 
family = ((struct rtgenmsg *)nlmsg_data(nlh))->rtgen_family;
 
@@ -2442,19 +2444,58 @@ static int neigh_dump_info(struct sk_buff *skb, struct netlink_callback *cb)
((struct ndmsg *)nlmsg_data(nlh))->ndm_flags == NTF_PROXY)
proxy = 1;
 
-   err = nlmsg_parse(nlh, sizeof(struct ndmsg), tb, NDA_MAX, NULL, NULL);
-   if (!err) {
-   if (tb[NDA_IFINDEX]) {
-   if (nla_len(tb[NDA_IFINDEX]) != sizeof(u32))
-   return -EINVAL;
-   filter.dev_idx = nla_get_u32(tb[NDA_IFINDEX]);
+   if (proper_hdr) {
+   struct ndmsg *ndm;
+
+   if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*ndm))) {
+   NL_SET_ERR_MSG(extack, "Invalid header");
+   return -EINVAL;
+   }
+
+   ndm = nlmsg_data(nlh);
+   if (ndm->ndm_pad1  || ndm->ndm_pad2  || ndm->ndm_ifindex ||
+   ndm->ndm_state || ndm->ndm_flags || ndm->ndm_type) {
+   NL_SET_ERR_MSG(extack, "Invalid values in header for dump request");
+   return -EINVAL;
}
-   if (tb[NDA_MASTER]) {
-   if (nla_len(tb[NDA_MASTER]) != sizeof(u32))
+   }
+
+   err = nlmsg_parse(nlh, sizeof(struct ndmsg), tb, NDA_MAX, NULL, extack);
+   if (err < 0) {
+   if (proper_hdr) {
+   NL_SET_ERR_MSG(extack, "Failed to parse link attributes");
+   return -EINVAL;
+   }
+   goto walk_entries;
+   }
+
+   for (i = 0; i <= NDA_MAX; ++i) {
+   if (!tb[i])
+   continue;
+   switch (i) {
+   case NDA_IFINDEX:
+   if (nla_len(tb[i]) != sizeof(u32)) {
+   NL_SET_ERR_MSG(extack, "Invalid IFINDEX attribute");
+   return -EINVAL;
+   }
+   filter.dev_idx = nla_get_u32(tb[i]);
+   break;
+   case NDA_MASTER:
+   if (nla_len(tb[i]) != sizeof(u32)) {
+   NL_SET_ERR_MSG(extack, "Invalid MASTER attribute");
return -EINVAL;
-   filter.master_idx = nla_get_u32(tb[NDA_MASTER]);
+   }
+   filter.master_idx = nla_get_u32(tb[i]);
+   break;
+   default:
+   if (proper_hdr) {
+   NL_SET_ERR_MSG(extack, "Unsupported attribute in dump request");
+   return -EINVAL;
+   }
}
}
+
+walk_entries:
s_t = cb->args[0];
 
for (t = 0; t < NEIGH_NR_TABLES; t++) {
-- 
2.11.0
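
For a request that uses one of the attributes this patch accepts, a strict
neighbor dump looks like the sketch below (again not part of the series;
the helper names are mine): a zeroed ndmsg header plus a single NDA_MASTER
attribute. The flag value mirrors patch 3.

```c
#include <string.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <linux/neighbour.h>

#ifndef NLM_F_DUMP_PROPER_HDR
#define NLM_F_DUMP_PROPER_HDR 0x40
#endif

/* Append a u32 attribute to a message under construction. */
static void addattr32(struct nlmsghdr *nlh, unsigned short type, __u32 val)
{
	struct rtattr *rta;

	rta = (struct rtattr *)((char *)nlh + NLMSG_ALIGN(nlh->nlmsg_len));
	rta->rta_type = type;
	rta->rta_len = RTA_LENGTH(sizeof(val));
	memcpy(RTA_DATA(rta), &val, sizeof(val));
	nlh->nlmsg_len = NLMSG_ALIGN(nlh->nlmsg_len) + RTA_ALIGN(rta->rta_len);
}

/* Strict RTM_GETNEIGH dump limited to neighbors behind one master
 * device: zeroed ndmsg header plus NDA_MASTER, one of the two
 * attributes this patch supports. Caller supplies a buffer large
 * enough for header plus one u32 attribute. */
static int build_neigh_dump_req(char *buf, __u32 master_ifindex)
{
	struct nlmsghdr *nlh = (struct nlmsghdr *)buf;

	memset(buf, 0,
	       NLMSG_SPACE(sizeof(struct ndmsg)) + RTA_SPACE(sizeof(__u32)));
	nlh->nlmsg_len = NLMSG_LENGTH(sizeof(struct ndmsg));
	nlh->nlmsg_type = RTM_GETNEIGH;
	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP | NLM_F_DUMP_PROPER_HDR;
	addattr32(nlh, NDA_MASTER, master_ifindex);
	return nlh->nlmsg_len;
}
```

Any attribute other than NDA_IFINDEX or NDA_MASTER in such a request would
be rejected with EINVAL under the checks added above.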



[PATCH RFC v2 net-next 03/25] netlink: introduce NLM_F_DUMP_PROPER_HDR flag

2018-10-01 Thread David Ahern
From: David Ahern 

Add a new flag, NLM_F_DUMP_PROPER_HDR, for userspace to indicate to the
kernel that it believes it is sending the right header struct for the
dump message type (ifinfomsg, ifaddrmsg, rtmsg, fib_rule_hdr, ...).

Setting the flag in the netlink message header indicates to the kernel
it should do rigid checking on all data passed in the dump request and
filter the data returned based on data passed in.

Signed-off-by: David Ahern 
---
 include/uapi/linux/netlink.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/netlink.h b/include/uapi/linux/netlink.h
index 776bc92e9118..e722bed88dee 100644
--- a/include/uapi/linux/netlink.h
+++ b/include/uapi/linux/netlink.h
@@ -57,6 +57,7 @@ struct nlmsghdr {
 #define NLM_F_ECHO		0x08	/* Echo this request		*/
 #define NLM_F_DUMP_INTR		0x10	/* Dump was inconsistent due to sequence change */
 #define NLM_F_DUMP_FILTERED	0x20	/* Dump was filtered as requested */
+#define NLM_F_DUMP_PROPER_HDR	0x40	/* Dump request has the proper header for type */
 
 /* Modifiers to GET request */
 #define NLM_F_ROOT 0x100   /* specify tree root*/
-- 
2.11.0
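
The strict-header contract the later patches implement can be modeled in
plain C as below. This is an illustration only (the function name is mine,
and the real kernel checks live in each dump handler, not in one helper);
it mirrors the pattern used for RTM_GETLINK-style requests with no
supported attributes.

```c
#include <string.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

#ifndef NLM_F_DUMP_PROPER_HDR
#define NLM_F_DUMP_PROPER_HDR 0x40
#endif

/* Userspace model of the strict check the dump handlers in this series
 * perform: the header must be exactly sizeof(struct ifinfomsg) and all
 * fields other than the family must be zero. Returns 0 if the request
 * is acceptable, -1 otherwise. */
static int valid_dump_ifinfo_req(const struct nlmsghdr *nlh)
{
	const struct ifinfomsg *ifm;

	if (!(nlh->nlmsg_flags & NLM_F_DUMP_PROPER_HDR))
		return 0;	/* legacy request: no strict checking */
	if (nlh->nlmsg_len < NLMSG_LENGTH(sizeof(*ifm)))
		return -1;	/* header too short */
	if (nlh->nlmsg_len > NLMSG_LENGTH(sizeof(*ifm)))
		return -1;	/* unexpected data after header */
	ifm = NLMSG_DATA(nlh);
	if (ifm->__ifi_pad || ifm->ifi_type || ifm->ifi_flags ||
	    ifm->ifi_change || ifm->ifi_index)
		return -1;	/* only the family may be set */
	return 0;
}
```

Handlers that do support filtering relax one or more of these checks, as
the per-handler patches in this series show.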



[PATCH RFC v2 net-next 02/25] net/ipv6: Refactor address dump to push inet6_fill_args to in6_dump_addrs

2018-10-01 Thread David Ahern
From: David Ahern 

Pull the inet6_fill_args arg up to in6_dump_addrs and move netnsid
into it. Since IFA_TARGET_NETNSID is a kernel side filter add the
NLM_F_DUMP_FILTERED flag so userspace knows the request was honored.

Signed-off-by: David Ahern 
Acked-by: Christian Brauner 
---
 net/ipv6/addrconf.c | 59 +
 1 file changed, 32 insertions(+), 27 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index a9a317322388..375ea9d9869b 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -4793,12 +4793,19 @@ static inline int inet6_ifaddr_msgsize(void)
   + nla_total_size(4)  /* IFA_RT_PRIORITY */;
 }
 
+enum addr_type_t {
+   UNICAST_ADDR,
+   MULTICAST_ADDR,
+   ANYCAST_ADDR,
+};
+
 struct inet6_fill_args {
u32 portid;
u32 seq;
int event;
unsigned int flags;
int netnsid;
+   enum addr_type_t type;
 };
 
 static int inet6_fill_ifaddr(struct sk_buff *skb, struct inet6_ifaddr *ifa,
@@ -4930,39 +4937,28 @@ static int inet6_fill_ifacaddr(struct sk_buff *skb, struct ifacaddr6 *ifaca,
return 0;
 }
 
-enum addr_type_t {
-   UNICAST_ADDR,
-   MULTICAST_ADDR,
-   ANYCAST_ADDR,
-};
-
 /* called with rcu_read_lock() */
 static int in6_dump_addrs(struct inet6_dev *idev, struct sk_buff *skb,
- struct netlink_callback *cb, enum addr_type_t type,
- int s_ip_idx, int *p_ip_idx, int netnsid)
+ struct netlink_callback *cb,
+ int s_ip_idx, int *p_ip_idx,
+ struct inet6_fill_args *fillargs)
 {
-   struct inet6_fill_args fillargs = {
-   .portid = NETLINK_CB(cb->skb).portid,
-   .seq = cb->nlh->nlmsg_seq,
-   .flags = NLM_F_MULTI,
-   .netnsid = netnsid,
-   };
struct ifmcaddr6 *ifmca;
struct ifacaddr6 *ifaca;
int err = 1;
int ip_idx = *p_ip_idx;
 
	read_lock_bh(&idev->lock);
-   switch (type) {
+   switch (fillargs->type) {
case UNICAST_ADDR: {
struct inet6_ifaddr *ifa;
-   fillargs.event = RTM_NEWADDR;
+   fillargs->event = RTM_NEWADDR;
 
/* unicast address incl. temp addr */
		list_for_each_entry(ifa, &idev->addr_list, if_list) {
if (++ip_idx < s_ip_idx)
continue;
-   err = inet6_fill_ifaddr(skb, ifa, &fillargs);
+   err = inet6_fill_ifaddr(skb, ifa, fillargs);
if (err < 0)
break;
nl_dump_check_consistent(cb, nlmsg_hdr(skb));
@@ -4970,26 +4966,26 @@ static int in6_dump_addrs(struct inet6_dev *idev, struct sk_buff *skb,
break;
}
case MULTICAST_ADDR:
-   fillargs.event = RTM_GETMULTICAST;
+   fillargs->event = RTM_GETMULTICAST;
 
/* multicast address */
for (ifmca = idev->mc_list; ifmca;
 ifmca = ifmca->next, ip_idx++) {
if (ip_idx < s_ip_idx)
continue;
-   err = inet6_fill_ifmcaddr(skb, ifmca, &fillargs);
+   err = inet6_fill_ifmcaddr(skb, ifmca, fillargs);
if (err < 0)
break;
}
break;
case ANYCAST_ADDR:
-   fillargs.event = RTM_GETANYCAST;
+   fillargs->event = RTM_GETANYCAST;
/* anycast address */
for (ifaca = idev->ac_list; ifaca;
 ifaca = ifaca->aca_next, ip_idx++) {
if (ip_idx < s_ip_idx)
continue;
-   err = inet6_fill_ifacaddr(skb, ifaca, &fillargs);
+   err = inet6_fill_ifacaddr(skb, ifaca, fillargs);
if (err < 0)
break;
}
@@ -5005,10 +5001,16 @@ static int in6_dump_addrs(struct inet6_dev *idev, struct sk_buff *skb,
 static int inet6_dump_addr(struct sk_buff *skb, struct netlink_callback *cb,
   enum addr_type_t type)
 {
+   struct inet6_fill_args fillargs = {
+   .portid = NETLINK_CB(cb->skb).portid,
+   .seq = cb->nlh->nlmsg_seq,
+   .flags = NLM_F_MULTI,
+   .netnsid = -1,
+   .type = type,
+   };
struct net *net = sock_net(skb->sk);
struct nlattr *tb[IFA_MAX+1];
struct net *tgt_net = net;
-   int netnsid = -1;
int h, s_h;
int idx, ip_idx;
int s_idx, s_ip_idx;
@@ -5023,11 +5025,14 @@ static int inet6_dump_addr(struct sk_buff *skb, struct netlink_callback *cb,
if (nlmsg_parse(cb->nlh, sizeof(struct ifaddrmsg), tb, IFA_MAX,
  

[PATCH RFC v2 net-next 14/25] net/neighbor: Update neightbl_dump_info to support NLM_F_DUMP_PROPER_HDR

2018-10-01 Thread David Ahern
From: David Ahern 

Update neightbl_dump_info to check for NLM_F_DUMP_PROPER_HDR in the netlink
message header. If the flag is set, the dump request is expected to have
an ndtmsg struct as the header. All elements of the struct are expected to
be 0 and no attributes can be appended.

Signed-off-by: David Ahern 
---
 net/core/neighbour.c | 28 +---
 1 file changed, 25 insertions(+), 3 deletions(-)

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index aaf2526e5da4..8488f2e2c865 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -2166,13 +2166,35 @@ static int neightbl_set(struct sk_buff *skb, struct nlmsghdr *nlh,
 
 static int neightbl_dump_info(struct sk_buff *skb, struct netlink_callback *cb)
 {
+   const struct nlmsghdr *nlh = cb->nlh;
struct net *net = sock_net(skb->sk);
int family, tidx, nidx = 0;
int tbl_skip = cb->args[0];
int neigh_skip = cb->args[1];
struct neigh_table *tbl;
 
-   family = ((struct rtgenmsg *) nlmsg_data(cb->nlh))->rtgen_family;
+   if (nlh->nlmsg_flags & NLM_F_DUMP_PROPER_HDR) {
+   struct netlink_ext_ack *extack = cb->extack;
+   struct ndtmsg *ndtm;
+
+   if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*ndtm))) {
+   NL_SET_ERR_MSG(extack, "Invalid header");
+   return -EINVAL;
+   }
+
+   ndtm = nlmsg_data(nlh);
+   if (ndtm->ndtm_pad1  || ndtm->ndtm_pad2) {
+   NL_SET_ERR_MSG(extack, "Invalid values in header for dump request");
+   return -EINVAL;
+   }
+
+   if (nlh->nlmsg_len != nlmsg_msg_size(sizeof(*ndtm))) {
+   NL_SET_ERR_MSG(extack, "Invalid data after header");
+   return -EINVAL;
+   }
+   }
+
+   family = ((struct rtgenmsg *)nlmsg_data(nlh))->rtgen_family;
 
for (tidx = 0; tidx < NEIGH_NR_TABLES; tidx++) {
struct neigh_parms *p;
@@ -2185,7 +2207,7 @@ static int neightbl_dump_info(struct sk_buff *skb, struct netlink_callback *cb)
continue;
 
if (neightbl_fill_info(skb, tbl, NETLINK_CB(cb->skb).portid,
-  cb->nlh->nlmsg_seq, RTM_NEWNEIGHTBL,
+  nlh->nlmsg_seq, RTM_NEWNEIGHTBL,
   NLM_F_MULTI) < 0)
break;
 
@@ -2200,7 +2222,7 @@ static int neightbl_dump_info(struct sk_buff *skb, struct netlink_callback *cb)
 
if (neightbl_fill_param_info(skb, tbl, p,
 NETLINK_CB(cb->skb).portid,
-cb->nlh->nlmsg_seq,
+nlh->nlmsg_seq,
 RTM_NEWNEIGHTBL,
 NLM_F_MULTI) < 0)
goto out;
-- 
2.11.0



[PATCH RFC v2 net-next 08/25] rtnetlink: Update rtnl_stats_dump to support NLM_F_DUMP_PROPER_HDR

2018-10-01 Thread David Ahern
From: David Ahern 

Update rtnl_stats_dump to check for NLM_F_DUMP_PROPER_HDR in the netlink
message header. If the flag is set, the dump request is expected to have
an if_stats_msg struct as the header. All elements of the struct are
expected to be 0 except filter_mask which must be non-0 (legacy behavior).
No attributes are supported.

Signed-off-by: David Ahern 
---
 net/core/rtnetlink.c | 22 +-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 51a653b810be..1751baf0c823 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -4648,6 +4648,9 @@ static int rtnl_stats_get(struct sk_buff *skb, struct nlmsghdr *nlh,
 
 static int rtnl_stats_dump(struct sk_buff *skb, struct netlink_callback *cb)
 {
+   struct netlink_ext_ack *extack = cb->extack;
+   const struct nlmsghdr *nlh = cb->nlh;
+   bool proper_hdr = !!(nlh->nlmsg_flags & NLM_F_DUMP_PROPER_HDR);
int h, s_h, err, s_idx, s_idxattr, s_prividx;
struct net *net = sock_net(skb->sk);
unsigned int flags = NLM_F_MULTI;
@@ -4668,9 +4671,26 @@ static int rtnl_stats_dump(struct sk_buff *skb, struct netlink_callback *cb)
return -EINVAL;
 
ifsm = nlmsg_data(cb->nlh);
+
+   /* only requests using NLM_F_DUMP_PROPER_HDR can pass data to
+* influence the dump. The legacy exception is filter_mask.
+*/
+   if (proper_hdr) {
+   if (ifsm->pad1 || ifsm->pad2 || ifsm->ifindex) {
+   NL_SET_ERR_MSG(extack, "Invalid values in header for dump request");
+   return -EINVAL;
+   }
+   if (nlmsg_len(cb->nlh) != nlmsg_msg_size(sizeof(*ifsm))) {
+   NL_SET_ERR_MSG(extack, "Invalid attributes after header");
+   return -EINVAL;
+   }
+   }
+
filter_mask = ifsm->filter_mask;
-   if (!filter_mask)
+   if (!filter_mask) {
+   NL_SET_ERR_MSG(extack, "Invalid filter_mask in header");
return -EINVAL;
+   }
 
for (h = s_h; h < NETDEV_HASHENTRIES; h++, s_idx = 0) {
idx = 0;
-- 
2.11.0



[PATCH RFC v2 net-next 05/25] net/ipv6: Update inet6_dump_addr to support NLM_F_DUMP_PROPER_HDR

2018-10-01 Thread David Ahern
From: David Ahern 

Update inet6_dump_addr to check for NLM_F_DUMP_PROPER_HDR in the netlink
message header. If the flag is set, the dump request is expected to have
an ifaddrmsg struct as the header potentially followed by one or more
attributes. Any data passed in the header or as an attribute is taken as
a request to influence the data returned. Only values supported by the
dump handler are allowed to be non-0 or set in the request. At the moment
only the IFA_TARGET_NETNSID attribute is supported. Follow-on patches
will add support for other fields (e.g., honor ifa_index and only return
data for the given device index).

Signed-off-by: David Ahern 
---
 net/ipv6/addrconf.c | 49 +++--
 1 file changed, 39 insertions(+), 10 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 375ea9d9869b..3382737df2a8 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -5001,6 +5001,8 @@ static int in6_dump_addrs(struct inet6_dev *idev, struct sk_buff *skb,
 static int inet6_dump_addr(struct sk_buff *skb, struct netlink_callback *cb,
   enum addr_type_t type)
 {
+   struct netlink_ext_ack *extack = cb->extack;
+   const struct nlmsghdr *nlh = cb->nlh;
struct inet6_fill_args fillargs = {
.portid = NETLINK_CB(cb->skb).portid,
.seq = cb->nlh->nlmsg_seq,
@@ -5009,7 +5011,6 @@ static int inet6_dump_addr(struct sk_buff *skb, struct netlink_callback *cb,
.type = type,
};
struct net *net = sock_net(skb->sk);
-   struct nlattr *tb[IFA_MAX+1];
struct net *tgt_net = net;
int h, s_h;
int idx, ip_idx;
@@ -5022,17 +5023,45 @@ static int inet6_dump_addr(struct sk_buff *skb, struct netlink_callback *cb,
s_idx = idx = cb->args[1];
s_ip_idx = ip_idx = cb->args[2];
 
-   if (nlmsg_parse(cb->nlh, sizeof(struct ifaddrmsg), tb, IFA_MAX,
-   ifa_ipv6_policy, NULL) >= 0) {
-   if (tb[IFA_TARGET_NETNSID]) {
-   fillargs.netnsid = nla_get_s32(tb[IFA_TARGET_NETNSID]);
+   if (nlh->nlmsg_flags & NLM_F_DUMP_PROPER_HDR) {
+   struct nlattr *tb[IFA_MAX+1];
+   struct ifaddrmsg *ifm;
+   int err, i;
+
+   if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*ifm))) {
+   NL_SET_ERR_MSG(extack, "Invalid header");
+   return -EINVAL;
+   }
+
+   ifm = nlmsg_data(nlh);
+   if (ifm->ifa_prefixlen || ifm->ifa_flags || ifm->ifa_scope) {
+   NL_SET_ERR_MSG(extack, "Invalid values in header for dump request");
+   return -EINVAL;
+   }
+   if (ifm->ifa_index) {
+   NL_SET_ERR_MSG(extack, "Filter by device index not supported");
+   return -EINVAL;
+   }
+
+   err = nlmsg_parse(nlh, sizeof(struct ifaddrmsg), tb, IFA_MAX,
+ ifa_ipv6_policy, extack);
+   if (err < 0)
+   return err;
+
+   for (i = 0; i <= IFA_MAX; ++i) {
+   if (tb[i] && i == IFA_TARGET_NETNSID) {
+   fillargs.netnsid = nla_get_s32(tb[i]);
 
-   tgt_net = rtnl_get_net_ns_capable(skb->sk,
- fillargs.netnsid);
-   if (IS_ERR(tgt_net))
-   return PTR_ERR(tgt_net);
+   tgt_net = rtnl_get_net_ns_capable(skb->sk,
+ fillargs.netnsid);
+   if (IS_ERR(tgt_net))
+   return PTR_ERR(tgt_net);
 
-   fillargs.flags |= NLM_F_DUMP_FILTERED;
+   fillargs.flags |= NLM_F_DUMP_FILTERED;
+   } else if (tb[i]) {
+   NL_SET_ERR_MSG(extack, "Unsupported attribute in dump request");
+   return -EINVAL;
+   }
}
}
 
-- 
2.11.0



[PATCH RFC v2 net-next 04/25] net/ipv4: Update inet_dump_ifaddr to support NLM_F_DUMP_PROPER_HDR

2018-10-01 Thread David Ahern
From: David Ahern 

Update inet_dump_ifaddr to check for NLM_F_DUMP_PROPER_HDR in the netlink
message header. If the flag is set, the dump request is expected to have
an ifaddrmsg struct as the header potentially followed by one or more
attributes. Any data passed in the header or as an attribute is taken as
a request to influence the data returned. Only values supported by the
dump handler are allowed to be non-0 or set in the request. At the moment
only the IFA_TARGET_NETNSID attribute is supported. Follow-on patches
will add support for other fields (e.g., honor ifa_index and only return
data for the given device index).

Signed-off-by: David Ahern 
---
 net/ipv4/devinet.c | 52 +---
 1 file changed, 41 insertions(+), 11 deletions(-)

diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 44d931a3cd50..c27537f568f0 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -1661,15 +1661,15 @@ static int inet_fill_ifaddr(struct sk_buff *skb, struct in_ifaddr *ifa,
 
 static int inet_dump_ifaddr(struct sk_buff *skb, struct netlink_callback *cb)
 {
+   struct netlink_ext_ack *extack = cb->extack;
+   const struct nlmsghdr *nlh = cb->nlh;
struct inet_fill_args fillargs = {
.portid = NETLINK_CB(cb->skb).portid,
-   .seq = cb->nlh->nlmsg_seq,
+   .seq = nlh->nlmsg_seq,
.event = RTM_NEWADDR,
-   .flags = NLM_F_MULTI,
.netnsid = -1,
};
struct net *net = sock_net(skb->sk);
-   struct nlattr *tb[IFA_MAX+1];
struct net *tgt_net = net;
int h, s_h;
int idx, s_idx;
@@ -1683,15 +1683,45 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct netlink_callback *cb)
s_idx = idx = cb->args[1];
s_ip_idx = ip_idx = cb->args[2];
 
-   if (nlmsg_parse(cb->nlh, sizeof(struct ifaddrmsg), tb, IFA_MAX,
-   ifa_ipv4_policy, NULL) >= 0) {
-   if (tb[IFA_TARGET_NETNSID]) {
-   fillargs.netnsid = nla_get_s32(tb[IFA_TARGET_NETNSID]);
+   if (nlh->nlmsg_flags & NLM_F_DUMP_PROPER_HDR) {
+   struct nlattr *tb[IFA_MAX+1];
+   struct ifaddrmsg *ifm;
+   int err, i;
+
+   if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*ifm))) {
+   NL_SET_ERR_MSG(extack, "Invalid header");
+   return -EINVAL;
+   }
+
+   ifm = nlmsg_data(nlh);
+   if (ifm->ifa_prefixlen || ifm->ifa_flags || ifm->ifa_scope) {
+   NL_SET_ERR_MSG(extack, "Invalid values in header for dump request");
+   return -EINVAL;
+   }
+   if (ifm->ifa_index) {
+   NL_SET_ERR_MSG(extack, "Filter by device index not supported");
+   return -EINVAL;
+   }
 
-   tgt_net = rtnl_get_net_ns_capable(skb->sk,
- fillargs.netnsid);
-   if (IS_ERR(tgt_net))
-   return PTR_ERR(tgt_net);
+   err = nlmsg_parse(nlh, sizeof(struct ifaddrmsg), tb, IFA_MAX,
+ ifa_ipv4_policy, extack);
+   if (err < 0)
+   return err;
+
+   for (i = 0; i <= IFA_MAX; ++i) {
+   if (tb[i] && i == IFA_TARGET_NETNSID) {
+   fillargs.netnsid = nla_get_s32(tb[i]);
+
+   tgt_net = rtnl_get_net_ns_capable(skb->sk,
+ fillargs.netnsid);
+   if (IS_ERR(tgt_net))
+   return PTR_ERR(tgt_net);
+
+   fillargs.flags |= NLM_F_DUMP_FILTERED;
+   } else if (tb[i]) {
+   NL_SET_ERR_MSG(extack, "Unsupported attribute in dump request");
+   return -EINVAL;
+   }
}
}
 
-- 
2.11.0



[PATCH RFC v2 net-next 21/25] net/ipv4: Plumb support for filtering route dumps

2018-10-01 Thread David Ahern
From: David Ahern 

Implement kernel side filtering of routes by table id, egress device index,
protocol, tos, scope, and route type.

Signed-off-by: David Ahern 
---
 include/net/ip_fib.h|  2 +-
 net/ipv4/fib_frontend.c | 13 -
 net/ipv4/fib_trie.c | 33 ++---
 3 files changed, 35 insertions(+), 13 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index d0cd838ca00c..e064c37a2a9f 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -240,7 +240,7 @@ int fib_table_insert(struct net *, struct fib_table *, 
struct fib_config *,
 int fib_table_delete(struct net *, struct fib_table *, struct fib_config *,
 struct netlink_ext_ack *extack);
 int fib_table_dump(struct fib_table *table, struct sk_buff *skb,
-  struct netlink_callback *cb);
+  struct netlink_callback *cb, struct fib_dump_filter *filter);
 int fib_table_flush(struct net *net, struct fib_table *table);
 struct fib_table *fib_trie_unmerge(struct fib_table *main_tb);
 void fib_table_flush_external(struct fib_table *table);
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 9d872a4900cd..a3f4073e509a 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -861,16 +861,27 @@ static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
 
rcu_read_lock();
 
+   if (filter.ifindex) {
+   filter.dev = dev_get_by_index_rcu(net, filter.ifindex);
+   if (!filter.dev) {
+   err = -ENODEV;
+   goto out_err;
+   }
+   }
+
for (h = s_h; h < FIB_TABLE_HASHSZ; h++, s_e = 0) {
e = 0;
		head = &net->ipv4.fib_table_hash[h];
hlist_for_each_entry_rcu(tb, head, tb_hlist) {
if (e < s_e)
goto next;
+   if (filter.table_id && filter.table_id != tb->tb_id)
+   goto next;
+
if (dumped)
				memset(&cb->args[2], 0, sizeof(cb->args) -
 2 * sizeof(cb->args[0]));
-   err = fib_table_dump(tb, skb, cb);
+   err = fib_table_dump(tb, skb, cb, &filter);
if (err < 0) {
if (likely(skb->len))
goto out;
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 5bc0c89e81e4..0e7b4233851a 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -2003,7 +2003,8 @@ void fib_free_table(struct fib_table *tb)
 }
 
 static int fn_trie_dump_leaf(struct key_vector *l, struct fib_table *tb,
-struct sk_buff *skb, struct netlink_callback *cb)
+struct sk_buff *skb, struct netlink_callback *cb,
+struct fib_dump_filter *filter)
 {
__be32 xkey = htonl(l->key);
struct fib_alias *fa;
@@ -2016,15 +2017,24 @@ static int fn_trie_dump_leaf(struct key_vector *l, struct fib_table *tb,
	hlist_for_each_entry_rcu(fa, &l->leaf, fa_list) {
int err;
 
-   if (i < s_i) {
-   i++;
-   continue;
-   }
+   if (i < s_i)
+   goto next;
 
-   if (tb->tb_id != fa->tb_id) {
-   i++;
-   continue;
-   }
+   if (tb->tb_id != fa->tb_id)
+   goto next;
+
+   if ((filter->tos && fa->fa_tos != filter->tos) ||
+   (filter->rt_type && fa->fa_type != filter->rt_type))
+   goto next;
+
+   if ((filter->protocol &&
+fa->fa_info->fib_protocol != filter->protocol) ||
+   (filter->scope && fa->fa_info->fib_scope != filter->scope))
+   goto next;
+
+   if (filter->dev &&
+   !fib_info_nh_uses_dev(fa->fa_info, filter->dev))
+   goto next;
 
err = fib_dump_info(skb, NETLINK_CB(cb->skb).portid,
cb->nlh->nlmsg_seq, RTM_NEWROUTE,
@@ -2035,6 +2045,7 @@ static int fn_trie_dump_leaf(struct key_vector *l, struct 
fib_table *tb,
cb->args[4] = i;
return err;
}
+next:
i++;
}
 
@@ -2044,7 +2055,7 @@ static int fn_trie_dump_leaf(struct key_vector *l, struct 
fib_table *tb,
 
 /* rcu_read_lock needs to be hold by caller from readside */
 int fib_table_dump(struct fib_table *tb, struct sk_buff *skb,
-  struct netlink_callback *cb)
+  struct netlink_callback *cb, struct fib_dump_filter *filter)
 {
struct trie *t = (struct trie *)tb->tb_data;
struct key_vector *l, *tp 

[PATCH RFC v2 net-next 01/25] net/netlink: Pass extack to dump callbacks

2018-10-01 Thread David Ahern
From: David Ahern 

Pass extack to dump callbacks by adding it to netlink_dump_control,
transferring it to netlink_callback, and threading it through
netlink_dump. Update rtnetlink as the first user, and have netlink_dump
append any extack message after the dump_done_errno.
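For illustration, here is a self-contained userspace sketch of the plumbing this patch adds: a dump control carries an extack pointer, the dump callback records a message on it, and the DONE path can then pick the message up. The struct and function names (`ext_ack`, `dump_ctx`, `run_dump`) are invented for the sketch and are not kernel API.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: a dump control carries an extack pointer, the callback
 * records a message on it, and the caller reads it back afterwards.
 * All names here are illustrative only, not the kernel's types. */
struct ext_ack { const char *msg; };

struct dump_ctx {
	struct ext_ack *extack;          /* from netlink_dump_control */
	int (*dump)(struct dump_ctx *cb);
};

static int my_dump(struct dump_ctx *cb)
{
	if (cb->extack)
		cb->extack->msg = "Unsupported attribute in dump request";
	return -22;                      /* -EINVAL stand-in */
}

/* Mirrors the shape of netlink_dump(): attach extack around the
 * callback invocation, then return any message so it can be appended
 * after the dump_done_errno. */
static const char *run_dump(struct dump_ctx *cb, struct ext_ack *extack)
{
	cb->extack = extack;
	cb->dump(cb);
	cb->extack = NULL;
	return extack->msg;
}
```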

Signed-off-by: David Ahern 
---
 include/linux/netlink.h  |  2 ++
 net/core/rtnetlink.c |  1 +
 net/netlink/af_netlink.c | 20 +++-
 3 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/include/linux/netlink.h b/include/linux/netlink.h
index 71f121b66ca8..8fc90308a653 100644
--- a/include/linux/netlink.h
+++ b/include/linux/netlink.h
@@ -176,6 +176,7 @@ struct netlink_callback {
void*data;
/* the module that dump function belong to */
struct module   *module;
+   struct netlink_ext_ack  *extack;
u16 family;
u16 min_dump_alloc;
unsigned intprev_seq, seq;
@@ -197,6 +198,7 @@ struct netlink_dump_control {
int (*done)(struct netlink_callback *);
void *data;
struct module *module;
+   struct netlink_ext_ack *extack;
u16 min_dump_alloc;
 };
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 35162e1b06ad..da91b38297d3 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -4689,6 +4689,7 @@ static int rtnetlink_rcv_msg(struct sk_buff *skb, struct 
nlmsghdr *nlh,
.dump   = dumpit,
.min_dump_alloc = min_dump_alloc,
.module = owner,
+   .extack = extack
};
		err = netlink_dump_start(rtnl, skb, nlh, &c);
/* netlink_dump_start() will keep a reference on
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index e3a0538ec0be..7094156c94f0 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -129,7 +129,7 @@ static const char *const nlk_cb_mutex_key_strings[MAX_LINKS 
+ 1] = {
"nlk_cb_mutex-MAX_LINKS"
 };
 
-static int netlink_dump(struct sock *sk);
+static int netlink_dump(struct sock *sk, struct netlink_ext_ack *extack);
 
 /* nl_table locking explained:
  * Lookup and traversal are protected with an RCU read-side lock. Insertion
@@ -1981,7 +1981,7 @@ static int netlink_recvmsg(struct socket *sock, struct 
msghdr *msg, size_t len,
 
if (nlk->cb_running &&
	    atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf / 2) {
-   ret = netlink_dump(sk);
+   ret = netlink_dump(sk, NULL);
if (ret) {
sk->sk_err = -ret;
sk->sk_error_report(sk);
@@ -2168,7 +2168,7 @@ EXPORT_SYMBOL(__nlmsg_put);
  * It would be better to create kernel thread.
  */
 
-static int netlink_dump(struct sock *sk)
+static int netlink_dump(struct sock *sk, struct netlink_ext_ack *extack)
 {
struct netlink_sock *nlk = nlk_sk(sk);
struct netlink_callback *cb;
@@ -,8 +,11 @@ static int netlink_dump(struct sock *sk)
skb_reserve(skb, skb_tailroom(skb) - alloc_size);
netlink_skb_set_owner_r(skb, sk);
 
-   if (nlk->dump_done_errno > 0)
+   if (nlk->dump_done_errno > 0) {
+   cb->extack = extack;
nlk->dump_done_errno = cb->dump(skb, cb);
+   cb->extack = NULL;
+   }
 
if (nlk->dump_done_errno > 0 ||
	    skb_tailroom(skb) < nlmsg_total_size(sizeof(nlk->dump_done_errno))) {
@@ -2246,6 +2249,12 @@ static int netlink_dump(struct sock *sk)
		memcpy(nlmsg_data(nlh), &nlk->dump_done_errno,
   sizeof(nlk->dump_done_errno));
 
+   if (extack && extack->_msg && nlk->flags & NETLINK_F_EXT_ACK) {
+   nlh->nlmsg_flags |= NLM_F_ACK_TLVS;
+   if (!nla_put_string(skb, NLMSGERR_ATTR_MSG, extack->_msg))
+   nlmsg_end(skb, nlh);
+   }
+
if (sk_filter(sk, skb))
kfree_skb(skb);
else
@@ -2307,6 +2316,7 @@ int __netlink_dump_start(struct sock *ssk, struct sk_buff 
*skb,
cb->module = control->module;
cb->min_dump_alloc = control->min_dump_alloc;
cb->skb = skb;
+   cb->extack = control->extack;
 
if (control->start) {
ret = control->start(cb);
@@ -2319,7 +2329,7 @@ int __netlink_dump_start(struct sock *ssk, struct sk_buff 
*skb,
 
mutex_unlock(nlk->cb_mutex);
 
-   ret = netlink_dump(sk);
+   ret = netlink_dump(sk, cb->extack);
 
sock_put(sk);
 
-- 
2.11.0



[PATCH RFC v2 net-next 18/25] net: Update netconf dump handlers to support NLM_F_DUMP_PROPER_HDR

2018-10-01 Thread David Ahern
From: David Ahern 

Update inet_netconf_dump_devconf, inet6_netconf_dump_devconf, and
mpls_netconf_dump_devconf to check for NLM_F_DUMP_PROPER_HDR in the
netlink message header. If the flag is set, the dump request is
expected to have a netconfmsg struct as the header. The struct has
only the family member and no attributes can be appended, so the
request should consist of the header alone.

Signed-off-by: David Ahern 
---
 net/ipv4/devinet.c  | 22 +++---
 net/ipv6/addrconf.c | 22 +++---
 net/mpls/af_mpls.c  | 18 +-
 3 files changed, 55 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index c27537f568f0..d7859b358cc6 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -2065,6 +2065,7 @@ static int inet_netconf_get_devconf(struct sk_buff 
*in_skb,
 static int inet_netconf_dump_devconf(struct sk_buff *skb,
 struct netlink_callback *cb)
 {
+   const struct nlmsghdr *nlh = cb->nlh;
struct net *net = sock_net(skb->sk);
int h, s_h;
int idx, s_idx;
@@ -2072,6 +2073,21 @@ static int inet_netconf_dump_devconf(struct sk_buff *skb,
struct in_device *in_dev;
struct hlist_head *head;
 
+   if (nlh->nlmsg_flags & NLM_F_DUMP_PROPER_HDR) {
+   struct netlink_ext_ack *extack = cb->extack;
+   struct netconfmsg *ncm;
+
+   if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*ncm))) {
+   NL_SET_ERR_MSG(extack, "Invalid header");
+   return -EINVAL;
+   }
+
+   if (nlh->nlmsg_len != nlmsg_msg_size(sizeof(*ncm))) {
+   NL_SET_ERR_MSG(extack, "Invalid data after header");
+   return -EINVAL;
+   }
+   }
+
s_h = cb->args[0];
s_idx = idx = cb->args[1];
 
@@ -2091,7 +2107,7 @@ static int inet_netconf_dump_devconf(struct sk_buff *skb,
if (inet_netconf_fill_devconf(skb, dev->ifindex,
  _dev->cnf,
  
NETLINK_CB(cb->skb).portid,
- cb->nlh->nlmsg_seq,
+ nlh->nlmsg_seq,
  RTM_NEWNETCONF,
  NLM_F_MULTI,
  NETCONFA_ALL) < 0) {
@@ -2108,7 +2124,7 @@ static int inet_netconf_dump_devconf(struct sk_buff *skb,
if (inet_netconf_fill_devconf(skb, NETCONFA_IFINDEX_ALL,
  net->ipv4.devconf_all,
  NETLINK_CB(cb->skb).portid,
- cb->nlh->nlmsg_seq,
+ nlh->nlmsg_seq,
  RTM_NEWNETCONF, NLM_F_MULTI,
  NETCONFA_ALL) < 0)
goto done;
@@ -2119,7 +2135,7 @@ static int inet_netconf_dump_devconf(struct sk_buff *skb,
if (inet_netconf_fill_devconf(skb, NETCONFA_IFINDEX_DEFAULT,
  net->ipv4.devconf_dflt,
  NETLINK_CB(cb->skb).portid,
- cb->nlh->nlmsg_seq,
+ nlh->nlmsg_seq,
  RTM_NEWNETCONF, NLM_F_MULTI,
  NETCONFA_ALL) < 0)
goto done;
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index eb6fd5fbac80..34b5daa9e977 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -666,6 +666,7 @@ static int inet6_netconf_get_devconf(struct sk_buff *in_skb,
 static int inet6_netconf_dump_devconf(struct sk_buff *skb,
  struct netlink_callback *cb)
 {
+   const struct nlmsghdr *nlh = cb->nlh;
struct net *net = sock_net(skb->sk);
int h, s_h;
int idx, s_idx;
@@ -673,6 +674,21 @@ static int inet6_netconf_dump_devconf(struct sk_buff *skb,
struct inet6_dev *idev;
struct hlist_head *head;
 
+   if (nlh->nlmsg_flags & NLM_F_DUMP_PROPER_HDR) {
+   struct netlink_ext_ack *extack = cb->extack;
+   struct netconfmsg *ncm;
+
+   if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*ncm))) {
+   NL_SET_ERR_MSG(extack, "Invalid header");
+   return -EINVAL;
+   }
+
+   if (nlh->nlmsg_len != nlmsg_msg_size(sizeof(*ncm))) {
+   NL_SET_ERR_MSG(extack, "Invalid data after header");
+   return -EINVAL;
+   }
+   }
+
s_h = cb->args[0];

[PATCH RFC v2 net-next 20/25] net: Add struct for fib dump filter

2018-10-01 Thread David Ahern
From: David Ahern 

Add struct fib_dump_filter for options on limiting which routes are
dumped. The current list is table id, tos, protocol, scope, route type,
flags and nexthop device index.

This patch adds the struct and argument to ip_valid_fib_dump_req so
that per-protocol patches can be done followed by actually parsing any
data from userspace.
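As a rough sketch of how such a filter struct is typically applied per route entry (the `route_s` struct and `route_matches` helper below are invented for illustration; the series applies the same zero-means-unfiltered checks inside the per-table dump loops):

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for the kernel types; only the fields the
 * filter compares against are modeled here. */
struct fib_dump_filter_s {
	unsigned int table_id;
	unsigned char tos, protocol, scope, rt_type;
	int ifindex;
};

struct route_s {
	unsigned int table_id;
	unsigned char tos, protocol, scope, rt_type;
	int oif;
};

/* A zero field means "no filtering on this key" -- the same
 * convention the series uses: filters default to an all-zero struct. */
static bool route_matches(const struct route_s *rt,
			  const struct fib_dump_filter_s *f)
{
	if (f->table_id && f->table_id != rt->table_id)
		return false;
	if (f->tos && f->tos != rt->tos)
		return false;
	if (f->protocol && f->protocol != rt->protocol)
		return false;
	if (f->scope && f->scope != rt->scope)
		return false;
	if (f->rt_type && f->rt_type != rt->rt_type)
		return false;
	if (f->ifindex && f->ifindex != rt->oif)
		return false;
	return true;
}
```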

Signed-off-by: David Ahern 
---
 include/net/ip6_route.h |  1 +
 include/net/ip_fib.h| 12 
 net/ipv4/fib_frontend.c |  4 +++-
 net/ipv4/ipmr.c |  3 ++-
 net/ipv6/ip6_fib.c  |  4 ++--
 net/ipv6/ip6mr.c|  3 ++-
 net/mpls/af_mpls.c  |  3 ++-
 7 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 7b9c82de11cc..ecaba26b3399 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -175,6 +175,7 @@ struct rt6_rtnl_dump_arg {
struct sk_buff *skb;
struct netlink_callback *cb;
struct net *net;
+   struct fib_dump_filter filter;
 };
 
 int rt6_dump_route(struct fib6_info *f6i, void *p_arg);
diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 9846b79c9ee1..d0cd838ca00c 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -222,6 +222,17 @@ struct fib_table {
unsigned long   __data[0];
 };
 
+struct fib_dump_filter {
+   u32 table_id;
+   unsigned char   tos;
+   unsigned char   protocol;
+   unsigned char   scope;
+   unsigned char   rt_type;
+   unsigned intflags;
+   int ifindex;
+   struct net_device   *dev;
+};
+
 int fib_table_lookup(struct fib_table *tb, const struct flowi4 *flp,
 struct fib_result *res, int fib_flags);
 int fib_table_insert(struct net *, struct fib_table *, struct fib_config *,
@@ -453,5 +464,6 @@ static inline void fib_proc_exit(struct net *net)
 u32 ip_mtu_from_fib_result(struct fib_result *res, __be32 daddr);
 
 int ip_valid_fib_dump_req(const struct nlmsghdr *nlh,
+ struct fib_dump_filter *filter,
  struct netlink_ext_ack *extack);
 #endif  /* _NET_FIB_H */
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index c608b393ae49..9d872a4900cd 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -803,6 +803,7 @@ static int inet_rtm_newroute(struct sk_buff *skb, struct 
nlmsghdr *nlh,
 }
 
 int ip_valid_fib_dump_req(const struct nlmsghdr *nlh,
+ struct fib_dump_filter *filter,
  struct netlink_ext_ack *extack)
 {
struct rtmsg *rtm;
@@ -838,6 +839,7 @@ static int inet_dump_fib(struct sk_buff *skb, struct 
netlink_callback *cb)
 {
const struct nlmsghdr *nlh = cb->nlh;
struct net *net = sock_net(skb->sk);
+   struct fib_dump_filter filter = {};
unsigned int h, s_h;
unsigned int e = 0, s_e;
struct fib_table *tb;
@@ -845,7 +847,7 @@ static int inet_dump_fib(struct sk_buff *skb, struct 
netlink_callback *cb)
int dumped = 0, err;
 
if (nlh->nlmsg_flags & NLM_F_DUMP_PROPER_HDR) {
-   err = ip_valid_fib_dump_req(nlh, cb->extack);
+		err = ip_valid_fib_dump_req(nlh, &filter, cb->extack);
if (err)
return err;
}
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 91b5991ed536..9e9ad60dff6b 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -2528,9 +2528,10 @@ static int ipmr_rtm_getroute(struct sk_buff *in_skb, 
struct nlmsghdr *nlh,
 static int ipmr_rtm_dumproute(struct sk_buff *skb, struct netlink_callback *cb)
 {
const struct nlmsghdr *nlh = cb->nlh;
+   struct fib_dump_filter filter = {};
 
if (nlh->nlmsg_flags & NLM_F_DUMP_PROPER_HDR) {
-   int err = ip_valid_fib_dump_req(nlh, cb->extack);
+		int err = ip_valid_fib_dump_req(nlh, &filter, cb->extack);
 
if (err)
return err;
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index fc14733fbad8..e0362a21737f 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -570,16 +570,16 @@ static int inet6_dump_fib(struct sk_buff *skb, struct 
netlink_callback *cb)
 {
const struct nlmsghdr *nlh = cb->nlh;
struct net *net = sock_net(skb->sk);
+   struct rt6_rtnl_dump_arg arg = {};
unsigned int h, s_h;
unsigned int e = 0, s_e;
-   struct rt6_rtnl_dump_arg arg;
struct fib6_walker *w;
struct fib6_table *tb;
struct hlist_head *head;
int res = 0;
 
if (nlh->nlmsg_flags & NLM_F_DUMP_PROPER_HDR) {
-   int err = ip_valid_fib_dump_req(nlh, cb->extack);
+		int err = ip_valid_fib_dump_req(nlh, &arg.filter, cb->extack);
 
if (err)
return err;
diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
index aa668214edc2..b3084b2c8f88 100644
--- 

[PATCH RFC v2 net-next 25/25] net: Enable kernel side filtering of route dumps

2018-10-01 Thread David Ahern
From: David Ahern 

Update parsing of route dump request to enable kernel side of filtering.
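The attribute loop added below accepts only the attributes the kernel knows how to filter on and rejects everything else. A self-contained sketch of that pattern (`tb[]` stands in for the nlattr table filled by nlmsg_parse(); RTA_TABLE and RTA_OIF are the real rtnetlink values, but RTA_MAX here is an illustrative bound):

```c
#include <assert.h>

#define RTA_MAX   30   /* illustrative bound, not the real uapi value */
#define RTA_OIF    4
#define RTA_TABLE 15

/* Every present attribute must be one we know how to filter on;
 * anything else rejects the dump with -EINVAL, matching the
 * "Unsupported attribute in dump request" error in the patch. */
static int parse_dump_filter(const unsigned int *tb[],
			     unsigned int *table_id, int *ifindex)
{
	int i;

	for (i = 0; i <= RTA_MAX; i++) {
		if (!tb[i])
			continue;
		switch (i) {
		case RTA_TABLE:
			*table_id = *tb[i];
			break;
		case RTA_OIF:
			*ifindex = (int)*tb[i];
			break;
		default:
			return -22;  /* -EINVAL stand-in */
		}
	}
	return 0;
}
```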

Signed-off-by: David Ahern 
---
 net/ipv4/fib_frontend.c | 42 ++
 1 file changed, 30 insertions(+), 12 deletions(-)

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index a3f4073e509a..d1ef1cb98139 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -806,7 +806,9 @@ int ip_valid_fib_dump_req(const struct nlmsghdr *nlh,
  struct fib_dump_filter *filter,
  struct netlink_ext_ack *extack)
 {
+   struct nlattr *tb[RTA_MAX + 1];
struct rtmsg *rtm;
+   int err, i;
 
if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*rtm))) {
NL_SET_ERR_MSG(extack, "Invalid header");
@@ -814,21 +816,37 @@ int ip_valid_fib_dump_req(const struct nlmsghdr *nlh,
}
 
rtm = nlmsg_data(nlh);
-   if (rtm->rtm_dst_len || rtm->rtm_src_len  || rtm->rtm_tos   ||
-   rtm->rtm_table   || rtm->rtm_protocol || rtm->rtm_scope ||
-   rtm->rtm_type) {
-   NL_SET_ERR_MSG(extack,
-  "Invalid values in header for dump request");
+   if (rtm->rtm_dst_len || rtm->rtm_src_len) {
+		NL_SET_ERR_MSG(extack, "Invalid values in header for dump request");
return -EINVAL;
}
 
-   if (rtm->rtm_flags & ~(RTM_F_CLONED | RTM_F_PREFIX)) {
-   NL_SET_ERR_MSG(extack, "Invalid flags for dump request");
-   return -EINVAL;
-   }
-   if (nlh->nlmsg_len != nlmsg_msg_size(sizeof(*rtm))) {
-   NL_SET_ERR_MSG(extack, "Invalid data after header");
-   return -EINVAL;
+   filter->flags= rtm->rtm_flags;
+   filter->tos  = rtm->rtm_tos;
+   filter->protocol = rtm->rtm_protocol;
+   filter->scope= rtm->rtm_scope;
+   filter->rt_type  = rtm->rtm_type;
+   filter->table_id = rtm->rtm_table;
+
+   err = nlmsg_parse(nlh, sizeof(*rtm), tb, RTA_MAX,
+ rtm_ipv4_policy, extack);
+   if (err < 0)
+   return err;
+
+   for (i = 0; i <= RTA_MAX; ++i) {
+   if (!tb[i])
+   continue;
+   switch (i) {
+   case RTA_TABLE:
+   filter->table_id = nla_get_u32(tb[i]);
+   break;
+   case RTA_OIF:
+   filter->ifindex = nla_get_u32(tb[i]);
+   break;
+   default:
+			NL_SET_ERR_MSG(extack, "Unsupported attribute in dump request");
+   return -EINVAL;
+   }
}
 
return 0;
-- 
2.11.0



[PATCH RFC v2 net-next 11/25] rtnetlink: Update fib dumps to support NLM_F_DUMP_PROPER_HDR

2018-10-01 Thread David Ahern
From: David Ahern 

Add a helper to check the netlink message for route dumps. The dump
request is expected to have an rtmsg struct as the header. All elements
of the struct are expected to be 0, with the exception of rtm_flags
(which is used by both ipv4 and ipv6 dumps), and no attributes can be
appended.

Update inet_dump_fib, inet6_dump_fib, mpls_dump_routes, ipmr_rtm_dumproute,
and ip6mr_rtm_dumproute to call this helper if NLM_F_DUMP_PROPER_HDR is
set in the netlink message header.
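The validation order matters: header long enough, then fields zero, then only whitelisted flag bits, then no trailing data. A self-contained userspace sketch of the same checks (struct layouts are simplified models, not the kernel ABI, and the length math here ignores netlink alignment):

```c
#include <assert.h>

/* Simplified models of the structs ip_valid_fib_dump_req() inspects. */
struct nlmsghdr_s {
	unsigned int nlmsg_len;
	unsigned short nlmsg_flags;
};

struct rtmsg_s {
	unsigned char rtm_family, rtm_dst_len, rtm_src_len, rtm_tos;
	unsigned char rtm_table, rtm_protocol, rtm_scope, rtm_type;
	unsigned int rtm_flags;
};

#define RTM_F_CLONED 0x200
#define RTM_F_PREFIX 0x800

/* Mirrors the helper's checks: exactly one rtmsg, all fields zero,
 * only RTM_F_CLONED/RTM_F_PREFIX allowed in rtm_flags, nothing
 * appended. Returns 0 on success, -1 (-EINVAL stand-in) otherwise. */
static int valid_fib_dump_req(const struct nlmsghdr_s *nlh,
			      const struct rtmsg_s *rtm)
{
	unsigned int hdrlen = sizeof(struct nlmsghdr_s) +
			      sizeof(struct rtmsg_s);

	if (nlh->nlmsg_len < hdrlen)
		return -1;	/* "Invalid header" */
	if (rtm->rtm_dst_len || rtm->rtm_src_len || rtm->rtm_tos ||
	    rtm->rtm_table || rtm->rtm_protocol || rtm->rtm_scope ||
	    rtm->rtm_type)
		return -1;	/* "Invalid values in header" */
	if (rtm->rtm_flags & ~(RTM_F_CLONED | RTM_F_PREFIX))
		return -1;	/* "Invalid flags for dump request" */
	if (nlh->nlmsg_len != hdrlen)
		return -1;	/* "Invalid data after header" */
	return 0;
}
```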

Signed-off-by: David Ahern 
---
 include/net/ip_fib.h|  2 ++
 net/ipv4/fib_frontend.c | 43 +--
 net/ipv4/ipmr.c |  9 +
 net/ipv6/ip6_fib.c  |  8 
 net/ipv6/ip6mr.c|  9 +
 net/mpls/af_mpls.c  |  8 
 6 files changed, 77 insertions(+), 2 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index f7c109e37298..9846b79c9ee1 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -452,4 +452,6 @@ static inline void fib_proc_exit(struct net *net)
 
 u32 ip_mtu_from_fib_result(struct fib_result *res, __be32 daddr);
 
+int ip_valid_fib_dump_req(const struct nlmsghdr *nlh,
+ struct netlink_ext_ack *extack);
 #endif  /* _NET_FIB_H */
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 30e2bcc3ef2a..c608b393ae49 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -802,8 +802,41 @@ static int inet_rtm_newroute(struct sk_buff *skb, struct 
nlmsghdr *nlh,
return err;
 }
 
+int ip_valid_fib_dump_req(const struct nlmsghdr *nlh,
+ struct netlink_ext_ack *extack)
+{
+   struct rtmsg *rtm;
+
+   if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*rtm))) {
+   NL_SET_ERR_MSG(extack, "Invalid header");
+   return -EINVAL;
+   }
+
+   rtm = nlmsg_data(nlh);
+   if (rtm->rtm_dst_len || rtm->rtm_src_len  || rtm->rtm_tos   ||
+   rtm->rtm_table   || rtm->rtm_protocol || rtm->rtm_scope ||
+   rtm->rtm_type) {
+   NL_SET_ERR_MSG(extack,
+  "Invalid values in header for dump request");
+   return -EINVAL;
+   }
+
+   if (rtm->rtm_flags & ~(RTM_F_CLONED | RTM_F_PREFIX)) {
+   NL_SET_ERR_MSG(extack, "Invalid flags for dump request");
+   return -EINVAL;
+   }
+   if (nlh->nlmsg_len != nlmsg_msg_size(sizeof(*rtm))) {
+   NL_SET_ERR_MSG(extack, "Invalid data after header");
+   return -EINVAL;
+   }
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(ip_valid_fib_dump_req);
+
 static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
 {
+   const struct nlmsghdr *nlh = cb->nlh;
struct net *net = sock_net(skb->sk);
unsigned int h, s_h;
unsigned int e = 0, s_e;
@@ -811,8 +844,14 @@ static int inet_dump_fib(struct sk_buff *skb, struct 
netlink_callback *cb)
struct hlist_head *head;
int dumped = 0, err;
 
-   if (nlmsg_len(cb->nlh) >= sizeof(struct rtmsg) &&
-   ((struct rtmsg *) nlmsg_data(cb->nlh))->rtm_flags & RTM_F_CLONED)
+   if (nlh->nlmsg_flags & NLM_F_DUMP_PROPER_HDR) {
+   err = ip_valid_fib_dump_req(nlh, cb->extack);
+   if (err)
+   return err;
+   }
+
+   if (nlmsg_len(nlh) >= sizeof(struct rtmsg) &&
+   ((struct rtmsg *)nlmsg_data(nlh))->rtm_flags & RTM_F_CLONED)
return skb->len;
 
s_h = cb->args[0];
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index a706e9269e8c..91b5991ed536 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -2527,6 +2527,15 @@ static int ipmr_rtm_getroute(struct sk_buff *in_skb, 
struct nlmsghdr *nlh,
 
 static int ipmr_rtm_dumproute(struct sk_buff *skb, struct netlink_callback *cb)
 {
+   const struct nlmsghdr *nlh = cb->nlh;
+
+   if (nlh->nlmsg_flags & NLM_F_DUMP_PROPER_HDR) {
+   int err = ip_valid_fib_dump_req(nlh, cb->extack);
+
+   if (err)
+   return err;
+   }
+
return mr_rtm_dumproute(skb, cb, ipmr_mr_table_iter,
				_ipmr_fill_mroute, &mfc_unres_lock);
 }
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 5516f55e214b..fc14733fbad8 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -568,6 +568,7 @@ static int fib6_dump_table(struct fib6_table *table, struct 
sk_buff *skb,
 
 static int inet6_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
 {
+   const struct nlmsghdr *nlh = cb->nlh;
struct net *net = sock_net(skb->sk);
unsigned int h, s_h;
unsigned int e = 0, s_e;
@@ -577,6 +578,13 @@ static int inet6_dump_fib(struct sk_buff *skb, struct 
netlink_callback *cb)
struct hlist_head *head;
int res = 0;
 
+   if (nlh->nlmsg_flags & NLM_F_DUMP_PROPER_HDR) {
+   int err = ip_valid_fib_dump_req(nlh, cb->extack);
+
+ 

[PATCH RFC v2 net-next 12/25] net/neigh: Refactor dump filter handling

2018-10-01 Thread David Ahern
From: David Ahern 

Move the attribute parsing from neigh_dump_table to neigh_dump_info, and
pass the filter arguments down to neigh_dump_table in a new struct. Add
the filter option to proxy neigh dumps as well to make the dumps consistent.

Signed-off-by: David Ahern 
---
 net/core/neighbour.c | 65 ++--
 1 file changed, 37 insertions(+), 28 deletions(-)

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 20e0d3308148..9bab9ae9c98e 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -2329,35 +2329,24 @@ static bool neigh_ifindex_filtered(struct net_device 
*dev, int filter_idx)
return false;
 }
 
+struct neigh_dump_filter {
+   int master_idx;
+   int dev_idx;
+};
+
 static int neigh_dump_table(struct neigh_table *tbl, struct sk_buff *skb,
-   struct netlink_callback *cb)
+   struct netlink_callback *cb,
+   struct neigh_dump_filter *filter)
 {
struct net *net = sock_net(skb->sk);
-   const struct nlmsghdr *nlh = cb->nlh;
-   struct nlattr *tb[NDA_MAX + 1];
struct neighbour *n;
int rc, h, s_h = cb->args[1];
int idx, s_idx = idx = cb->args[2];
struct neigh_hash_table *nht;
-   int filter_master_idx = 0, filter_idx = 0;
unsigned int flags = NLM_F_MULTI;
-   int err;
 
-   err = nlmsg_parse(nlh, sizeof(struct ndmsg), tb, NDA_MAX, NULL, NULL);
-   if (!err) {
-   if (tb[NDA_IFINDEX]) {
-   if (nla_len(tb[NDA_IFINDEX]) != sizeof(u32))
-   return -EINVAL;
-   filter_idx = nla_get_u32(tb[NDA_IFINDEX]);
-   }
-   if (tb[NDA_MASTER]) {
-   if (nla_len(tb[NDA_MASTER]) != sizeof(u32))
-   return -EINVAL;
-   filter_master_idx = nla_get_u32(tb[NDA_MASTER]);
-   }
-   if (filter_idx || filter_master_idx)
-   flags |= NLM_F_DUMP_FILTERED;
-   }
+   if (filter->dev_idx || filter->master_idx)
+   flags |= NLM_F_DUMP_FILTERED;
 
rcu_read_lock_bh();
nht = rcu_dereference_bh(tbl->nht);
@@ -2370,8 +2359,8 @@ static int neigh_dump_table(struct neigh_table *tbl, 
struct sk_buff *skb,
 n = rcu_dereference_bh(n->next)) {
if (idx < s_idx || !net_eq(dev_net(n->dev), net))
goto next;
-   if (neigh_ifindex_filtered(n->dev, filter_idx) ||
-   neigh_master_filtered(n->dev, filter_master_idx))
+   if (neigh_ifindex_filtered(n->dev, filter->dev_idx) ||
+   neigh_master_filtered(n->dev, filter->master_idx))
goto next;
if (neigh_fill_info(skb, n, NETLINK_CB(cb->skb).portid,
cb->nlh->nlmsg_seq,
@@ -2393,7 +2382,8 @@ static int neigh_dump_table(struct neigh_table *tbl, 
struct sk_buff *skb,
 }
 
 static int pneigh_dump_table(struct neigh_table *tbl, struct sk_buff *skb,
-struct netlink_callback *cb)
+struct netlink_callback *cb,
+struct neigh_dump_filter *filter)
 {
struct pneigh_entry *n;
struct net *net = sock_net(skb->sk);
@@ -2408,6 +2398,9 @@ static int pneigh_dump_table(struct neigh_table *tbl, 
struct sk_buff *skb,
for (n = tbl->phash_buckets[h], idx = 0; n; n = n->next) {
if (idx < s_idx || pneigh_net(n) != net)
goto next;
+   if (neigh_ifindex_filtered(n->dev, filter->dev_idx) ||
+   neigh_master_filtered(n->dev, filter->master_idx))
+   goto next;
if (pneigh_fill_info(skb, n, NETLINK_CB(cb->skb).portid,
cb->nlh->nlmsg_seq,
RTM_NEWNEIGH,
@@ -2432,20 +2425,36 @@ static int pneigh_dump_table(struct neigh_table *tbl, 
struct sk_buff *skb,
 
 static int neigh_dump_info(struct sk_buff *skb, struct netlink_callback *cb)
 {
+   const struct nlmsghdr *nlh = cb->nlh;
+   struct neigh_dump_filter filter = {};
+   struct nlattr *tb[NDA_MAX + 1];
struct neigh_table *tbl;
int t, family, s_t;
int proxy = 0;
int err;
 
-   family = ((struct rtgenmsg *) nlmsg_data(cb->nlh))->rtgen_family;
+   family = ((struct rtgenmsg *)nlmsg_data(nlh))->rtgen_family;
 
/* check for full ndmsg structure presence, family member is
 * the same for both structures
 */
-   if (nlmsg_len(cb->nlh) >= sizeof(struct ndmsg) &&
-   ((struct ndmsg *) nlmsg_data(cb->nlh))->ndm_flags == NTF_PROXY)
+   

[PATCH RFC v2 net-next 15/25] net/namespace: Update rtnl_net_dumpid to support NLM_F_DUMP_PROPER_HDR

2018-10-01 Thread David Ahern
From: David Ahern 

Update rtnl_net_dumpid to check for NLM_F_DUMP_PROPER_HDR in the netlink
message header. The dump request is expected to have an rtgenmsg struct
which has the family as the only element. No data may be appended.

Signed-off-by: David Ahern 
---
 net/core/net_namespace.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 670c84b1bfc2..0c3a978efc57 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -844,6 +844,7 @@ static int rtnl_net_dumpid_one(int id, void *peer, void 
*data)
 
 static int rtnl_net_dumpid(struct sk_buff *skb, struct netlink_callback *cb)
 {
+   const struct nlmsghdr *nlh = cb->nlh;
struct net *net = sock_net(skb->sk);
struct rtnl_net_dump_cb net_cb = {
.net = net,
@@ -853,6 +854,13 @@ static int rtnl_net_dumpid(struct sk_buff *skb, struct 
netlink_callback *cb)
.s_idx = cb->args[0],
};
 
+   if (nlh->nlmsg_flags & NLM_F_DUMP_PROPER_HDR) {
+   if (nlh->nlmsg_len != nlmsg_msg_size(sizeof(struct rtgenmsg))) {
+			NL_SET_ERR_MSG(cb->extack, "Unknown data in dump request");
+   return -EINVAL;
+   }
+   }
+
	spin_lock_bh(&net->nsid_lock);
	idr_for_each(&net->netns_ids, rtnl_net_dumpid_one, &net_cb);
	spin_unlock_bh(&net->nsid_lock);
-- 
2.11.0



[PATCH RFC v2 net-next 16/25] net/fib_rules: Update fib_nl_dumprule to support NLM_F_DUMP_PROPER_HDR

2018-10-01 Thread David Ahern
From: David Ahern 

Update fib_nl_dumprule to check for NLM_F_DUMP_PROPER_HDR in the netlink
message header. If the flag is set, the dump request is expected to have
a fib_rule_hdr struct as the header. All elements of the struct are
expected to be 0 and no attributes can be appended.

Signed-off-by: David Ahern 
---
 net/core/fib_rules.c | 25 -
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index 0ff3953f64aa..e9f9dc501b2e 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -1065,11 +1065,34 @@ static int dump_rules(struct sk_buff *skb, struct 
netlink_callback *cb,
 
 static int fib_nl_dumprule(struct sk_buff *skb, struct netlink_callback *cb)
 {
+   const struct nlmsghdr *nlh = cb->nlh;
struct net *net = sock_net(skb->sk);
struct fib_rules_ops *ops;
int idx = 0, family;
 
-   family = rtnl_msg_family(cb->nlh);
+   if (nlh->nlmsg_flags & NLM_F_DUMP_PROPER_HDR) {
+   struct netlink_ext_ack *extack = cb->extack;
+   struct fib_rule_hdr *frh;
+
+   if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*frh))) {
+   NL_SET_ERR_MSG(extack, "Invalid header");
+   return -EINVAL;
+   }
+
+   frh = nlmsg_data(nlh);
+   if (frh->dst_len || frh->src_len || frh->tos || frh->table ||
+   frh->res1 || frh->res2 || frh->action || frh->flags) {
+			NL_SET_ERR_MSG(extack, "Invalid values in header for dump request");
+   return -EINVAL;
+   }
+
+   if (nlh->nlmsg_len != nlmsg_msg_size(sizeof(*frh))) {
+   NL_SET_ERR_MSG(extack, "Invalid data after header");
+   return -EINVAL;
+   }
+   }
+
+   family = rtnl_msg_family(nlh);
if (family != AF_UNSPEC) {
/* Protocol specific dump request */
ops = lookup_rules_ops(net, family);
-- 
2.11.0



[PATCH RFC v2 net-next 10/25] rtnetlink: Update ipmr_rtm_dumplink to support NLM_F_DUMP_PROPER_HDR

2018-10-01 Thread David Ahern
From: David Ahern 

Update ipmr_rtm_dumplink to check for NLM_F_DUMP_PROPER_HDR in the netlink
message header. If the flag is set, the dump request is expected to have
an ifinfomsg struct as the header. All elements of the struct are
expected to be 0 and no attributes can be appended.

Signed-off-by: David Ahern 
---
 net/ipv4/ipmr.c | 32 
 1 file changed, 32 insertions(+)

diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 5660adcf7a04..a706e9269e8c 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -2710,6 +2710,31 @@ static bool ipmr_fill_vif(struct mr_table *mrt, u32 
vifid, struct sk_buff *skb)
return true;
 }
 
+static int ipmr_valid_dumplink(const struct nlmsghdr *nlh,
+  struct netlink_ext_ack *extack)
+{
+   struct ifinfomsg *ifm;
+
+   if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*ifm))) {
+   NL_SET_ERR_MSG(extack, "Invalid header");
+   return -EINVAL;
+   }
+
+   if (nlh->nlmsg_len > nlmsg_msg_size(sizeof(*ifm))) {
+   NL_SET_ERR_MSG(extack, "Invalid data after header");
+   return -EINVAL;
+   }
+
+   ifm = nlmsg_data(nlh);
+   if (ifm->__ifi_pad || ifm->ifi_type || ifm->ifi_flags ||
+   ifm->ifi_change || ifm->ifi_index) {
+		NL_SET_ERR_MSG(extack, "Invalid values in header for dump request");
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
 static int ipmr_rtm_dumplink(struct sk_buff *skb, struct netlink_callback *cb)
 {
struct net *net = sock_net(skb->sk);
@@ -2718,6 +2743,13 @@ static int ipmr_rtm_dumplink(struct sk_buff *skb, struct 
netlink_callback *cb)
unsigned int e = 0, s_e;
struct mr_table *mrt;
 
+   if (cb->nlh->nlmsg_flags & NLM_F_DUMP_PROPER_HDR) {
+   int err = ipmr_valid_dumplink(cb->nlh, cb->extack);
+
+   if (err)
+   return err;
+   }
+
s_t = cb->args[0];
s_e = cb->args[1];
 
-- 
2.11.0



[PATCH RFC v2 net-next 22/25] net/ipv6: Plumb support for filtering route dumps

2018-10-01 Thread David Ahern
From: David Ahern 

Implement kernel side filtering of routes by table id, egress device index,
protocol, and route type. Move the existing route flags check
for prefix only routes to the new filter.
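The device check added in this patch has to look past the main nexthop, because an ECMP route forwards through sibling nexthops too. A self-contained sketch of that check (siblings are modeled as an array instead of the kernel's linked list; struct names are invented for the sketch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified model of the fib6_info_uses_dev() logic: a route "uses"
 * a device if its main nexthop or any ECMP sibling nexthop egresses
 * through it. */
struct dev_s { int ifindex; };

struct fib6_info_s {
	const struct dev_s *nh_dev;           /* main nexthop device */
	const struct dev_s *const *siblings;  /* sibling nexthop devices */
	size_t nsiblings;
};

static bool fib6_info_uses_dev(const struct fib6_info_s *f6i,
			       const struct dev_s *dev)
{
	size_t i;

	if (f6i->nh_dev == dev)
		return true;
	for (i = 0; i < f6i->nsiblings; i++)
		if (f6i->siblings[i] == dev)
			return true;
	return false;
}
```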

Signed-off-by: David Ahern 
---
 net/ipv6/ip6_fib.c | 13 +
 net/ipv6/route.c   | 36 +++-
 2 files changed, 40 insertions(+), 9 deletions(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index e0362a21737f..15b9806270c1 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -613,12 +613,25 @@ static int inet6_dump_fib(struct sk_buff *skb, struct 
netlink_callback *cb)
	w->args = &arg;
 
rcu_read_lock();
+
+   if (arg.filter.ifindex) {
+   arg.filter.dev = dev_get_by_index_rcu(net, arg.filter.ifindex);
+   if (!arg.filter.dev) {
+   res = -ENODEV;
+   goto out;
+   }
+   }
+
for (h = s_h; h < FIB6_TABLE_HASHSZ; h++, s_e = 0) {
e = 0;
		head = &net->ipv6.fib_table_hash[h];
hlist_for_each_entry_rcu(tb, head, tb6_hlist) {
if (e < s_e)
goto next;
+   if (arg.filter.table_id &&
+   arg.filter.table_id != tb->tb6_id)
+   goto next;
+
res = fib6_dump_table(tb, skb, cb);
if (res != 0)
goto out;
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index d28f83e01593..99ba2313c380 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -4792,24 +4792,42 @@ static int rt6_fill_node(struct net *net, struct 
sk_buff *skb,
return -EMSGSIZE;
 }
 
+static bool fib6_info_uses_dev(const struct fib6_info *f6i,
+  const struct net_device *dev)
+{
+   if (f6i->fib6_nh.nh_dev == dev)
+   return true;
+
+   if (f6i->fib6_nsiblings) {
+   struct fib6_info *sibling, *next_sibling;
+
+   list_for_each_entry_safe(sibling, next_sibling,
+					 &f6i->fib6_siblings, fib6_siblings) {
+   if (sibling->fib6_nh.nh_dev == dev)
+   return true;
+   }
+   }
+   return false;
+}
+
 int rt6_dump_route(struct fib6_info *rt, void *p_arg)
 {
struct rt6_rtnl_dump_arg *arg = (struct rt6_rtnl_dump_arg *) p_arg;
	struct fib_dump_filter *filter = &arg->filter;
struct net *net = arg->net;
 
if (rt == net->ipv6.fib6_null_entry)
return 0;
 
-   if (nlmsg_len(arg->cb->nlh) >= sizeof(struct rtmsg)) {
-   struct rtmsg *rtm = nlmsg_data(arg->cb->nlh);
-
-   /* user wants prefix routes only */
-   if (rtm->rtm_flags & RTM_F_PREFIX &&
-   !(rt->fib6_flags & RTF_PREFIX_RT)) {
-   /* success since this is not a prefix route */
-   return 1;
-   }
+   if ((filter->flags & RTM_F_PREFIX) &&
+   !(rt->fib6_flags & RTF_PREFIX_RT)) {
+   /* success since this is not a prefix route */
+   return 1;
}
+   if ((filter->protocol && rt->fib6_protocol != filter->protocol) ||
+   (filter->rt_type && rt->fib6_type != filter->rt_type) ||
+   (filter->dev && !fib6_info_uses_dev(rt, filter->dev)))
+   return 1;
 
return rt6_fill_node(net, arg->skb, rt, NULL, NULL, NULL, 0,
 RTM_NEWROUTE, NETLINK_CB(arg->cb->skb).portid,
-- 
2.11.0



[PATCH RFC v2 net-next 24/25] net: Plumb support for filtering ipv4 and ipv6 multicast route dumps

2018-10-01 Thread David Ahern
From: David Ahern 

Implement kernel side filtering of routes by egress device index and
table id.
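The device match for a multicast route walks the VIF range of the cache entry: a route forwards out a device if some VIF in [minvif, maxvif) maps to that device and has a forwarding TTL (< 255 means "forward"). A self-contained sketch of that check, with simplified stand-ins for mr_table/mr_mfc:

```c
#include <assert.h>
#include <stdbool.h>

#define MAXVIFS 32

/* Simplified models of the multicast routing structures; only the
 * fields the device check needs are represented. */
struct mvif_s { int dev_ifindex; bool exists; };

struct mr_table_s { struct mvif_s vif_table[MAXVIFS]; };

struct mr_mfc_s {
	unsigned char minvif, maxvif;
	unsigned char ttls[MAXVIFS];
};

/* Sketch of the mr_mfc_uses_dev() logic added by this patch: scan the
 * entry's VIF range for an existing VIF with a forwarding TTL that
 * maps to the requested device. */
static bool mfc_uses_dev(const struct mr_table_s *mrt,
			 const struct mr_mfc_s *c, int ifindex)
{
	int ct;

	for (ct = c->minvif; ct < c->maxvif; ct++) {
		if (mrt->vif_table[ct].exists && c->ttls[ct] < 255 &&
		    mrt->vif_table[ct].dev_ifindex == ifindex)
			return true;
	}
	return false;
}
```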

Signed-off-by: David Ahern 
---
 include/linux/mroute_base.h |  5 +++--
 net/ipv4/ipmr.c |  2 +-
 net/ipv4/ipmr_base.c| 42 +-
 net/ipv6/ip6mr.c|  2 +-
 4 files changed, 46 insertions(+), 5 deletions(-)

diff --git a/include/linux/mroute_base.h b/include/linux/mroute_base.h
index 6675b9f81979..8fc516c47a64 100644
--- a/include/linux/mroute_base.h
+++ b/include/linux/mroute_base.h
@@ -7,6 +7,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /**
  * struct vif_device - interface representor for multicast routing
@@ -290,7 +291,7 @@ int mr_rtm_dumproute(struct sk_buff *skb, struct 
netlink_callback *cb,
 struct sk_buff *skb,
 u32 portid, u32 seq, struct mr_mfc *c,
 int cmd, int flags),
-spinlock_t *lock);
+spinlock_t *lock, struct fib_dump_filter *filter);
 
 int mr_dump(struct net *net, struct notifier_block *nb, unsigned short family,
int (*rules_dump)(struct net *net,
@@ -340,7 +341,7 @@ mr_rtm_dumproute(struct sk_buff *skb, struct 
netlink_callback *cb,
 struct sk_buff *skb,
 u32 portid, u32 seq, struct mr_mfc *c,
 int cmd, int flags),
-spinlock_t *lock)
+spinlock_t *lock, struct fib_dump_filter *filter)
 {
return -EINVAL;
 }
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 9e9ad60dff6b..2fe24009439a 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -2538,7 +2538,7 @@ static int ipmr_rtm_dumproute(struct sk_buff *skb, struct 
netlink_callback *cb)
}
 
return mr_rtm_dumproute(skb, cb, ipmr_mr_table_iter,
-   _ipmr_fill_mroute, &mfc_unres_lock);
+   _ipmr_fill_mroute, &mfc_unres_lock, &filter);
 }
 
 static const struct nla_policy rtm_ipmr_policy[RTA_MAX + 1] = {
diff --git a/net/ipv4/ipmr_base.c b/net/ipv4/ipmr_base.c
index 1ad9aa62a97b..a4f83cbf033d 100644
--- a/net/ipv4/ipmr_base.c
+++ b/net/ipv4/ipmr_base.c
@@ -268,6 +268,24 @@ int mr_fill_mroute(struct mr_table *mrt, struct sk_buff 
*skb,
 }
 EXPORT_SYMBOL(mr_fill_mroute);
 
+static bool mr_mfc_uses_dev(const struct mr_table *mrt,
+   const struct mr_mfc *c,
+   const struct net_device *dev)
+{
+   int ct;
+
+   for (ct = c->mfc_un.res.minvif; ct < c->mfc_un.res.maxvif; ct++) {
+   if (VIF_EXISTS(mrt, ct) && c->mfc_un.res.ttls[ct] < 255) {
+   const struct vif_device *vif;
+
+   vif = &mrt->vif_table[ct];
+   if (vif->dev == dev)
+   return true;
+   }
+   }
+   return false;
+}
+
 int mr_rtm_dumproute(struct sk_buff *skb, struct netlink_callback *cb,
 struct mr_table *(*iter)(struct net *net,
  struct mr_table *mrt),
@@ -275,17 +293,35 @@ int mr_rtm_dumproute(struct sk_buff *skb, struct 
netlink_callback *cb,
 struct sk_buff *skb,
 u32 portid, u32 seq, struct mr_mfc *c,
 int cmd, int flags),
-spinlock_t *lock)
+spinlock_t *lock, struct fib_dump_filter *filter)
 {
unsigned int t = 0, e = 0, s_t = cb->args[0], s_e = cb->args[1];
struct net *net = sock_net(skb->sk);
struct mr_table *mrt;
struct mr_mfc *mfc;
 
+   /* multicast does not use tos or scope, track protocol or have
+* route type other than RTN_MULTICAST
+*/
+   if (filter->tos || filter->protocol || filter->scope || filter->flags ||
+   (filter->rt_type && filter->rt_type != RTN_MULTICAST))
+   return 0;
+
rcu_read_lock();
+
+   if (filter->ifindex) {
+   filter->dev = dev_get_by_index_rcu(net, filter->ifindex);
+   if (!filter->dev) {
+   rcu_read_unlock();
+   return -ENODEV;
+   }
+   }
+
for (mrt = iter(net, NULL); mrt; mrt = iter(net, mrt)) {
if (t < s_t)
goto next_table;
+   if (filter->table_id && filter->table_id != mrt->id)
+   goto next_table;
list_for_each_entry_rcu(mfc, &mrt->mfc_cache_list, list) {
if (e < s_e)
goto next_entry;
@@ -303,6 +339,10 @@ int mr_rtm_dumproute(struct sk_buff *skb, struct 
netlink_callback *cb,
list_for_each_entry(mfc, &mrt->mfc_unres_queue, list) {
if (e < s_e)
goto next_entry2;
+   if 

[PATCH RFC v2 net-next 06/25] rtnetlink: Update rtnl_dump_ifinfo to support NLM_F_DUMP_PROPER_HDR

2018-10-01 Thread David Ahern
From: David Ahern 

Update rtnl_dump_ifinfo to check for NLM_F_DUMP_PROPER_HDR in the netlink
message header. If the flag is set, the dump request is expected to have
an ifinfomsg struct as the header potentially followed by one or more
attributes. Any data passed in the header or as an attribute is taken as
a request to influence the data returned. Only values supported by the
dump handler are allowed to be non-0 or set in the request. At the moment
only the IFA_TARGET_NETNSID, IFLA_EXT_MASK, IFLA_MASTER, and IFLA_LINKINFO
attributes are supported.

Signed-off-by: David Ahern 
---
 net/core/rtnetlink.c | 106 ---
 1 file changed, 75 insertions(+), 31 deletions(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index da91b38297d3..2bf4b9916ca2 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1880,6 +1880,9 @@ EXPORT_SYMBOL_GPL(rtnl_get_net_ns_capable);
 
 static int rtnl_dump_ifinfo(struct sk_buff *skb, struct netlink_callback *cb)
 {
+   struct netlink_ext_ack *extack = cb->extack;
+   const struct nlmsghdr *nlh = cb->nlh;
+   bool proper_hdr = !!(nlh->nlmsg_flags & NLM_F_DUMP_PROPER_HDR);
struct net *net = sock_net(skb->sk);
struct net *tgt_net = net;
int h, s_h;
@@ -1892,46 +1895,88 @@ static int rtnl_dump_ifinfo(struct sk_buff *skb, struct 
netlink_callback *cb)
unsigned int flags = NLM_F_MULTI;
int master_idx = 0;
int netnsid = -1;
-   int err;
+   int err, i;
int hdrlen;
 
s_h = cb->args[0];
s_idx = cb->args[1];
 
-   /* A hack to preserve kernel<->userspace interface.
-* The correct header is ifinfomsg. It is consistent with rtnl_getlink.
-* However, before Linux v3.9 the code here assumed rtgenmsg and that's
-* what iproute2 < v3.9.0 used.
-* We can detect the old iproute2. Even including the IFLA_EXT_MASK
-* attribute, its netlink message is shorter than struct ifinfomsg.
-*/
-   hdrlen = nlmsg_len(cb->nlh) < sizeof(struct ifinfomsg) ?
-sizeof(struct rtgenmsg) : sizeof(struct ifinfomsg);
+   if (proper_hdr) {
+   struct ifinfomsg *ifm;
 
-   if (nlmsg_parse(cb->nlh, hdrlen, tb, IFLA_MAX,
-   ifla_policy, NULL) >= 0) {
-   if (tb[IFLA_TARGET_NETNSID]) {
-   netnsid = nla_get_s32(tb[IFLA_TARGET_NETNSID]);
-   tgt_net = rtnl_get_net_ns_capable(skb->sk, netnsid);
-   if (IS_ERR(tgt_net)) {
-   tgt_net = net;
-   netnsid = -1;
-   }
+   hdrlen = sizeof(*ifm);
+   if (nlh->nlmsg_len < nlmsg_msg_size(hdrlen)) {
+   NL_SET_ERR_MSG(extack, "Invalid header");
+   return -EINVAL;
}
 
-   if (tb[IFLA_EXT_MASK])
-   ext_filter_mask = nla_get_u32(tb[IFLA_EXT_MASK]);
-
-   if (tb[IFLA_MASTER])
-   master_idx = nla_get_u32(tb[IFLA_MASTER]);
-
-   if (tb[IFLA_LINKINFO])
-   kind_ops = linkinfo_to_kind_ops(tb[IFLA_LINKINFO]);
+   ifm = nlmsg_data(nlh);
+   if (ifm->__ifi_pad || ifm->ifi_type || ifm->ifi_flags ||
+   ifm->ifi_change) {
+   NL_SET_ERR_MSG(extack, "Invalid values in header for 
dump request");
+   return -EINVAL;
+   }
+   if (ifm->ifi_index) {
+   NL_SET_ERR_MSG(extack, "Filter by device index not 
supported");
+   return -EINVAL;
+   }
+   } else {
+   /* A hack to preserve kernel<->userspace interface.
+* The correct header is ifinfomsg. It is consistent with
+* rtnl_getlink. However, before Linux v3.9 the code here
+* assumed rtgenmsg and that's what iproute2 < v3.9.0 used.
+* We can detect the old iproute2. Even including the
+* IFLA_EXT_MASK attribute, its netlink message is shorter
+* than struct ifinfomsg.
+*/
+   hdrlen = nlmsg_len(nlh) < sizeof(struct ifinfomsg) ?
+sizeof(struct rtgenmsg) : sizeof(struct ifinfomsg);
+   }
 
-   if (master_idx || kind_ops)
-   flags |= NLM_F_DUMP_FILTERED;
+   err = nlmsg_parse(nlh, hdrlen, tb, IFLA_MAX, ifla_policy, extack);
+   if (err < 0) {
+   if (proper_hdr) {
+   NL_SET_ERR_MSG(extack, "Failed to parse link 
attributes");
+   return -EINVAL;
+   }
+   goto walk_entries;
+   }
+
+   for (i = 0; i <= IFLA_MAX; ++i) {
+   switch (i) {
+   case IFLA_TARGET_NETNSID:
+   if (tb[i]) {
+ 

[PATCH RFC v2 net-next 23/25] net/mpls: Plumb support for filtering route dumps

2018-10-01 Thread David Ahern
From: David Ahern 

Implement kernel side filtering of routes by egress device index and
protocol. MPLS uses only a single table and route type.

Signed-off-by: David Ahern 
---
 net/mpls/af_mpls.c | 55 +-
 1 file changed, 54 insertions(+), 1 deletion(-)

diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index f94d1db63eb5..4dd8a2a026e7 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -2031,6 +2031,28 @@ static int mpls_dump_route(struct sk_buff *skb, u32 
portid, u32 seq, int event,
return -EMSGSIZE;
 }
 
+static bool mpls_rt_uses_dev(struct mpls_route *rt,
+const struct net_device *dev)
+{
+   struct net_device *nh_dev;
+
+   if (rt->rt_nhn == 1) {
+   struct mpls_nh *nh = rt->rt_nh;
+
+   nh_dev = rtnl_dereference(nh->nh_dev);
+   if (dev == nh_dev)
+   return true;
+   } else {
+   for_nexthops(rt) {
+   nh_dev = rtnl_dereference(nh->nh_dev);
+   if (nh_dev == dev)
+   return true;
+   } endfor_nexthops(rt);
+   }
+
+   return false;
+}
+
 static int mpls_dump_routes(struct sk_buff *skb, struct netlink_callback *cb)
 {
const struct nlmsghdr *nlh = cb->nlh;
@@ -2039,6 +2061,7 @@ static int mpls_dump_routes(struct sk_buff *skb, struct 
netlink_callback *cb)
struct fib_dump_filter filter = {};
size_t platform_labels;
unsigned int index;
+   int err;
 
ASSERT_RTNL();
 
@@ -2047,6 +2070,15 @@ static int mpls_dump_routes(struct sk_buff *skb, struct 
netlink_callback *cb)
 
if (err)
return err;
+
+   /* for MPLS, there is only 1 table with fixed type, scope
+* tos and flags. If any of these are set in the filter then
+* return nothing
+*/
+   if ((filter.table_id && filter.table_id != RT_TABLE_MAIN) ||
+   (filter.rt_type && filter.rt_type != RTN_UNICAST) ||
+filter.scope || filter.tos || filter.flags)
+   return 0;
}
 
index = cb->args[0];
@@ -2055,20 +2087,41 @@ static int mpls_dump_routes(struct sk_buff *skb, struct 
netlink_callback *cb)
 
platform_label = rtnl_dereference(net->mpls.platform_label);
platform_labels = net->mpls.platform_labels;
+
+   rcu_read_lock();
+
+   if (filter.ifindex) {
+   filter.dev = dev_get_by_index_rcu(net, filter.ifindex);
+   if (!filter.dev) {
+   err = -ENODEV;
+   goto out_err;
+   }
+   }
+
for (; index < platform_labels; index++) {
struct mpls_route *rt;
+
rt = rtnl_dereference(platform_label[index]);
if (!rt)
continue;
 
+   if (filter.protocol && rt->rt_protocol != filter.protocol)
+   continue;
+
+   if (filter.dev && !mpls_rt_uses_dev(rt, filter.dev))
+   continue;
+
if (mpls_dump_route(skb, NETLINK_CB(cb->skb).portid,
cb->nlh->nlmsg_seq, RTM_NEWROUTE,
index, rt, NLM_F_MULTI) < 0)
break;
}
cb->args[0] = index;
+   err = skb->len;
 
-   return skb->len;
+out_err:
+   rcu_read_unlock();
+   return err;
 }
 
 static inline size_t lfib_nlmsg_size(struct mpls_route *rt)
-- 
2.11.0



Re: [RFC PATCH bpf-next v3 4/7] bpf: add bpf queue and stack maps

2018-10-01 Thread Alexei Starovoitov
On Mon, Oct 01, 2018 at 08:11:43AM -0500, Mauricio Vasquez wrote:
> > > > +BPF_CALL_3(bpf_map_pop_elem, struct bpf_map *, map, void *,
> > > > value, u32, size)
> > > > +{
> > > > +    void *ptr;
> > > > +
> > > > +    if (map->value_size != size)
> > > > +    return -EINVAL;
> > > > +
> > > > +    ptr = map->ops->map_lookup_and_delete_elem(map, NULL);
> > > > +    if (!ptr)
> > > > +    return -ENOENT;
> > > > +
> > > > +    switch (size) {
> > > > +    case 1:
> > > > +    *(u8 *) value = *(u8 *) ptr;
> > > > +    break;
> > > > +    case 2:
> > > > +    *(u16 *) value = *(u16 *) ptr;
> > > > +    break;
> > > > +    case 4:
> > > > +    *(u32 *) value = *(u32 *) ptr;
> > > > +    break;
> > > > +    case 8:
> > > > +    *(u64 *) value = *(u64 *) ptr;
> > > > +    break;
> > > this is inefficient. can we pass value ptr into ops and let it
> > > populate it?
> > 
> > I don't think so, doing that implies that lookup_and_delete will be a
> > per-value op, while other ops in maps are per-reference.
> > For instance, how to change it in the case of peek helper that is using
> > the lookup operation?, we cannot change the signature of the lookup
> > operation.
> > 
> > This is something that worries me a little bit, we are creating new
> > per-value helpers based on already existing per-reference operations,
> > this is not probably the cleanest way.  Here we are at the beginning of
> > the discussion once again, how should we map helpers and syscalls to
> > ops.
> > 
> > What about creating pop/peek/push ops, mapping helpers one to one and
> > adding some logic into syscall.c to call the correct operation in case
> > the map is stack/queue?
> > Syscall mapping would be:
> > bpf_map_lookup_elem() -> peek
> > bpf_map_lookup_and_delete_elem() -> pop
> > bpf_map_update_elem() -> push
> > 
> > Does it make sense?
> 
> Hello Alexei,
> 
> Do you have any feedback on this specific part?

Indeed. It seems push/pop ops will be cleaner.
I still think that peek() is useless due to races.
So BPF_MAP_UPDATE_ELEM syscall cmd will map to 'push' ops
and new BPF_MAP_LOOKUP_AND_DELETE_ELEM will map to 'pop' ops.
right?



Re: [PATCH v2 net-next 7/8] net: ethernet: xgbe: expand PHY_GBIT_FEATURES

2018-10-01 Thread Andrew Lunn
On Sun, Sep 30, 2018 at 11:41:00AM +0300, Sergei Shtylyov wrote:
> Hello!
> 
> On 9/30/2018 12:04 AM, Andrew Lunn wrote:
> 
> >The macro PHY_GBIT_FEATURES needs to change into a bitmap in order to
> >support link_modes. Remove its use from xgbe by replacing it with its
> >definition.
> >
> >Probably, the current behavior is wrong. It probably should be
> >ANDing not assigning.
> 
>ORing, maybe?

Hi Sergei

It is hard to know what was intended here.

By assigning these speeds, if the PHY does not actually support 1Gbps,
that information is going to be overwritten. So it should really be
ANDing with what the MAC supports. ORing would have the same problem.
This assignment is also clearing out any TP, AUI, BNC bits which might
be set.

Since I don't really know what the intention is here, I'm just going
to leave it alone.

> 
> >Signed-off-by: Andrew Lunn 
> >---
> >v2
> >Remove unneeded ()
> 
>Really? :-)

I did not say all unneeded :-)

I will remove some more.

  Andrew


Re: [PATCH net-next] tcp: start receiver buffer autotuning sooner

2018-10-01 Thread Yuchung Cheng
On Mon, Oct 1, 2018 at 3:46 PM, David Miller  wrote:
> From: Yuchung Cheng 
> Date: Mon,  1 Oct 2018 15:42:32 -0700
>
>> Previously receiver buffer auto-tuning starts after receiving
>> one advertised window amount of data. After the initial receiver
>> buffer was raised by patch a337531b942b ("tcp: up initial rmem to
>> 128KB and SYN rwin to around 64KB"), the receiver buffer may take
>> too long to start raising. To address this issue, this patch lowers
>> the initial bytes expected to receive to roughly the expected sender's
>>
>> Fixes: a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 
>> 64KB")
>> Signed-off-by: Yuchung Cheng 
>> Signed-off-by: Wei Wang 
>> Signed-off-by: Neal Cardwell 
>> Signed-off-by: Eric Dumazet 
>> Reviewed-by: Soheil Hassas Yeganeh 
>
> Applied, sorry for applying v1 instead of v2 of the rmem increasing patch.
> :-/
No problem thanks for the fast response!


Re: [pull request][net-next 00/13] Mellanox, mlx5e updates 2018-10-01

2018-10-01 Thread David Miller
From: Saeed Mahameed 
Date: Mon,  1 Oct 2018 11:56:48 -0700

> The following pull request includes updates to mlx5e ethernet netdevice
> driver, for more information please see tag log below.
> 
> Please pull and let me know if there's any problem.

Pulled, thank you.


Re: [PATCH net-next] tcp: start receiver buffer autotuning sooner

2018-10-01 Thread David Miller
From: Yuchung Cheng 
Date: Mon,  1 Oct 2018 15:42:32 -0700

> Previously receiver buffer auto-tuning starts after receiving
> one advertised window amount of data. After the initial receiver
> buffer was raised by patch a337531b942b ("tcp: up initial rmem to
> 128KB and SYN rwin to around 64KB"), the receiver buffer may take
> too long to start raising. To address this issue, this patch lowers
> the initial bytes expected to receive to roughly the expected sender's
> initial window.
> 
> Fixes: a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 
> 64KB")
> Signed-off-by: Yuchung Cheng 
> Signed-off-by: Wei Wang 
> Signed-off-by: Neal Cardwell 
> Signed-off-by: Eric Dumazet 
> Reviewed-by: Soheil Hassas Yeganeh 

Applied, sorry for applying v1 instead of v2 of the rmem increasing patch.
:-/


Re: [Potential Spoof] Re: [PATCH v2] net/ncsi: Add NCSI OEM command support

2018-10-01 Thread Vijay Khemka


On 9/28/18, 6:21 PM, "Linux-aspeed on behalf of Vijay Khemka" 
 wrote:



> On 9/28/18, 6:07 PM, "Vijay Khemka"  wrote:

 >   This patch adds OEM commands and response handling. It also defines OEM
 >   command and response structures as per the NCSI specification along with
 >   its handlers.
 >
 >   ncsi_cmd_handler_oem: This is a generic command request handler for OEM
 >   commands
 >   ncsi_rsp_handler_oem: This is a generic response handler for OEM commands

  This is a generic patch for OEM command handling. There will be another patch
  following this to handle specific OEM commands for each vendor. Currently I
  have defined 2 vendor/manufacturer ids below in internal.h; more can be added
  here for other vendors. I have not defined ncsi_rsp_oem_handler in this patch
  as the handlers are NULL, but there will be defined handlers for each vendor
  in the next patch.

Sam, Joel, Justin, please review this patch. I have 2 more patches coming for
Broadcom and Mellanox.



Re: [net-next 0/8][pull request] 100GbE Intel Wired LAN Driver Updates 2018-10-01

2018-10-01 Thread David Miller
From: Jeff Kirsher 
Date: Mon,  1 Oct 2018 14:14:23 -0700

> This series contains updates to ice driver only.
> 
> Anirudh provides several changes to "prep" the driver for upcoming
> features.  Specifically, the functions that are used for PF VSI/netdev
> setup will also be used in SR-IOV support and to allow the reuse of
> these functions, code needs to move.
> 
> Dave provides the only other change in the series, updates the driver to
> protect the reset patch in its entirety.  This is done by adding the
> various bit checks to determine if a reset is scheduled/initiated and
> whether it came from the software or firmware.
> 
> The following are changes since commit 
> 804fe108fc92e591ddfe9447e7fb4691ed16daee:
>   openvswitch: Use correct reply values in datapath and vport ops
> and are available in the git repository at:
>   git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue 100GbE

Pulled, thanks Jeff.


Re: [PATCH net] tcp/dccp: fix lockdep issue when SYN is backlogged

2018-10-01 Thread David Miller
From: Eric Dumazet 
Date: Mon,  1 Oct 2018 15:02:26 -0700

> In normal SYN processing, packets are handled without listener
> lock and in RCU protected ingress path.
> 
> But syzkaller is known to be able to trick us and SYN
> packets might be processed in process context, after being
> queued into socket backlog.
> 
> In commit 06f877d613be ("tcp/dccp: fix other lockdep splats
> accessing ireq_opt") I made a very stupid fix, that happened
> to work mostly because of the regular path being RCU protected.
> 
> Really the thing protecting ireq->ireq_opt is RCU read lock,
> and the pseudo request refcnt is not relevant.
> 
> This patch extends what I did in commit 449809a66c1d ("tcp/dccp:
> block BH for SYN processing") by adding an extra rcu_read_{lock|unlock}
> pair in the paths that might be taken when processing SYN from
> socket backlog (thus possibly in process context)
> 
> Fixes: 06f877d613be ("tcp/dccp: fix other lockdep splats accessing ireq_opt")
> Signed-off-by: Eric Dumazet 
> Reported-by: syzbot 

Applied and queued up for -stable, thanks Eric.


[PATCH net-next] tcp: start receiver buffer autotuning sooner

2018-10-01 Thread Yuchung Cheng
Previously receiver buffer auto-tuning starts after receiving
one advertised window amount of data. After the initial receiver
buffer was raised by patch a337531b942b ("tcp: up initial rmem to
128KB and SYN rwin to around 64KB"), the receiver buffer may take
too long to start raising. To address this issue, this patch lowers
the initial bytes expected to receive to roughly the expected sender's
initial window.

Fixes: a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 
64KB")
Signed-off-by: Yuchung Cheng 
Signed-off-by: Wei Wang 
Signed-off-by: Neal Cardwell 
Signed-off-by: Eric Dumazet 
Reviewed-by: Soheil Hassas Yeganeh 
---
 net/ipv4/tcp_input.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 7a59f6a96212..bf1aac315490 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -438,7 +438,7 @@ void tcp_init_buffer_space(struct sock *sk)
if (!(sk->sk_userlocks & SOCK_SNDBUF_LOCK))
tcp_sndbuf_expand(sk);
 
-   tp->rcvq_space.space = tp->rcv_wnd;
+   tp->rcvq_space.space = min_t(u32, tp->rcv_wnd, TCP_INIT_CWND * 
tp->advmss);
tcp_mstamp_refresh(tp);
tp->rcvq_space.time = tp->tcp_mstamp;
tp->rcvq_space.seq = tp->copied_seq;
-- 
2.19.0.605.g01d371f741-goog



Re: [PATCH net-next v2] tcp: up initial rmem to 128KB and SYN rwin to around 64KB

2018-10-01 Thread Eric Dumazet
On Mon, Oct 1, 2018 at 3:18 PM Yuchung Cheng  wrote:

> Hi David: thanks for taking this patch - I didn't notice this earlier
> but it seems patch v1 was applied instead of v2? should I submit a
> v2-v1-diff patch?

Yes, please, send an additional patch.


Re: [PATCH net-next v2] tcp: up initial rmem to 128KB and SYN rwin to around 64KB

2018-10-01 Thread Yuchung Cheng
On Sat, Sep 29, 2018 at 11:23 AM, David Miller  wrote:
>
> From: Yuchung Cheng 
> Date: Fri, 28 Sep 2018 13:09:02 -0700
>
> > Previously TCP initial receive buffer is ~87KB by default and
> > the initial receive window is ~29KB (20 MSS). This patch changes
> > the two numbers to 128KB and ~64KB (rounding down to the multiples
> > of MSS) respectively. The patch also simplifies the calculations s.t.
> > the two numbers are directly controlled by sysctl tcp_rmem[1]:
> >
> >   1) Initial receiver buffer budget (sk_rcvbuf): while this should
> >  be configured via sysctl tcp_rmem[1], previously tcp_fixup_rcvbuf()
> >  always override and set a larger size when a new connection
> >  establishes.
> >
> >   2) Initial receive window in SYN: previously it is set to 20
> >  packets if MSS <= 1460. The number 20 was based on the initial
> >  congestion window of 10: the receiver needs twice amount to
> >  avoid being limited by the receive window upon out-of-order
> >  delivery in the first window burst. But since this only
> >  applies if the receiving MSS <= 1460, connection using large MTU
> >  (e.g. to utilize receiver zero-copy) may be limited by the
> >  receive window.
> >
> > This patch also lowers the initial bytes expected to receive in
> > the receiver buffer autotuning algorithm - otherwise the receiver
> > may take two to three rounds to increase the buffer to the
> > appropriate level (2x sender congestion window).
> >
> > With this patch TCP memory configuration is more straight-forward and
> > more properly sized to modern high-speed networks by default. Several
> > popular stacks have been announcing 64KB rwin in SYNs as well.
> >
> > Signed-off-by: Yuchung Cheng 
> > Signed-off-by: Wei Wang 
> > Signed-off-by: Neal Cardwell 
> > Signed-off-by: Eric Dumazet 
> > Reviewed-by: Soheil Hassas Yeganeh 
>
> Applied, thanks.

Hi David: thanks for taking this patch - I didn't notice this earlier
but it seems patch v1 was applied instead of v2? should I submit a
v2-v1-diff patch?


[PATCH net] tcp/dccp: fix lockdep issue when SYN is backlogged

2018-10-01 Thread Eric Dumazet
In normal SYN processing, packets are handled without listener
lock and in RCU protected ingress path.

But syzkaller is known to be able to trick us and SYN
packets might be processed in process context, after being
queued into socket backlog.

In commit 06f877d613be ("tcp/dccp: fix other lockdep splats
accessing ireq_opt") I made a very stupid fix, that happened
to work mostly because of the regular path being RCU protected.

Really the thing protecting ireq->ireq_opt is RCU read lock,
and the pseudo request refcnt is not relevant.

This patch extends what I did in commit 449809a66c1d ("tcp/dccp:
block BH for SYN processing") by adding an extra rcu_read_{lock|unlock}
pair in the paths that might be taken when processing SYN from
socket backlog (thus possibly in process context)

Fixes: 06f877d613be ("tcp/dccp: fix other lockdep splats accessing ireq_opt")
Signed-off-by: Eric Dumazet 
Reported-by: syzbot 
---
 include/net/inet_sock.h | 3 +--
 net/dccp/input.c| 4 +++-
 net/ipv4/tcp_input.c| 4 +++-
 3 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index 
e03b93360f332b3e3232873ac1cbd0ee7478fabb..a8cd5cf9ff5b6ddc50bd2e70d3f9103afa32a3b5
 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -132,8 +132,7 @@ static inline int inet_request_bound_dev_if(const struct 
sock *sk,
 
 static inline struct ip_options_rcu *ireq_opt_deref(const struct 
inet_request_sock *ireq)
 {
-   return rcu_dereference_check(ireq->ireq_opt,
-refcount_read(&ireq->req.rsk_refcnt) > 0);
+   return rcu_dereference(ireq->ireq_opt);
 }
 
 struct inet_cork {
diff --git a/net/dccp/input.c b/net/dccp/input.c
index 
d28d46bff6ab43441f34284ec975c1e052a774d0..85d6c879383da8994c6b20cd1e49e0f667a07482
 100644
--- a/net/dccp/input.c
+++ b/net/dccp/input.c
@@ -606,11 +606,13 @@ int dccp_rcv_state_process(struct sock *sk, struct 
sk_buff *skb,
if (sk->sk_state == DCCP_LISTEN) {
if (dh->dccph_type == DCCP_PKT_REQUEST) {
/* It is possible that we process SYN packets from 
backlog,
-* so we need to make sure to disable BH right there.
+* so we need to make sure to disable BH and RCU right 
there.
 */
+   rcu_read_lock();
local_bh_disable();
acceptable = 
inet_csk(sk)->icsk_af_ops->conn_request(sk, skb) >= 0;
local_bh_enable();
+   rcu_read_unlock();
if (!acceptable)
return 1;
consume_skb(skb);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 
4cf2f7bb2802ad4ae968b5a6dfb9d005ed619c76..47e08c1b5bc3e14e6ae2851b7ec8de91a3eb4a35
 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6009,11 +6009,13 @@ int tcp_rcv_state_process(struct sock *sk, struct 
sk_buff *skb)
if (th->fin)
goto discard;
/* It is possible that we process SYN packets from 
backlog,
-* so we need to make sure to disable BH right there.
+* so we need to make sure to disable BH and RCU right 
there.
 */
+   rcu_read_lock();
local_bh_disable();
acceptable = icsk->icsk_af_ops->conn_request(sk, skb) 
>= 0;
local_bh_enable();
+   rcu_read_unlock();
 
if (!acceptable)
return 1;
-- 
2.19.0.605.g01d371f741-goog



[net-next 2/8] ice: Move common functions out of ice_main.c part 2/7

2018-10-01 Thread Jeff Kirsher
From: Anirudh Venkataramanan 

This patch continues the code move out of ice_main.c

The following top level functions (and related dependency functions) were
moved to ice_lib.c:
ice_vsi_start_rx_rings
ice_vsi_stop_rx_rings
ice_vsi_stop_tx_rings
ice_vsi_cfg_rxqs
ice_vsi_cfg_txqs
ice_vsi_cfg_msix

Signed-off-by: Anirudh Venkataramanan 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ice/ice_lib.c  | 491 
 drivers/net/ethernet/intel/ice/ice_lib.h  |  13 +
 drivers/net/ethernet/intel/ice/ice_main.c | 541 +-
 3 files changed, 526 insertions(+), 519 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c 
b/drivers/net/ethernet/intel/ice/ice_lib.c
index 1cf4dca12495..06a54d79fba8 100644
--- a/drivers/net/ethernet/intel/ice/ice_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_lib.c
@@ -4,6 +4,227 @@
 #include "ice.h"
 #include "ice_lib.h"
 
+/**
+ * ice_setup_rx_ctx - Configure a receive ring context
+ * @ring: The Rx ring to configure
+ *
+ * Configure the Rx descriptor ring in RLAN context.
+ */
+static int ice_setup_rx_ctx(struct ice_ring *ring)
+{
+   struct ice_vsi *vsi = ring->vsi;
+   struct ice_hw *hw = &vsi->back->hw;
+   u32 rxdid = ICE_RXDID_FLEX_NIC;
+   struct ice_rlan_ctx rlan_ctx;
+   u32 regval;
+   u16 pf_q;
+   int err;
+
+   /* what is RX queue number in global space of 2K Rx queues */
+   pf_q = vsi->rxq_map[ring->q_index];
+
+   /* clear the context structure first */
+   memset(&rlan_ctx, 0, sizeof(rlan_ctx));
+
+   rlan_ctx.base = ring->dma >> 7;
+
+   rlan_ctx.qlen = ring->count;
+
+   /* Receive Packet Data Buffer Size.
+* The Packet Data Buffer Size is defined in 128 byte units.
+*/
+   rlan_ctx.dbuf = vsi->rx_buf_len >> ICE_RLAN_CTX_DBUF_S;
+
+   /* use 32 byte descriptors */
+   rlan_ctx.dsize = 1;
+
+   /* Strip the Ethernet CRC bytes before the packet is posted to host
+* memory.
+*/
+   rlan_ctx.crcstrip = 1;
+
+   /* L2TSEL flag defines the reported L2 Tags in the receive descriptor */
+   rlan_ctx.l2tsel = 1;
+
+   rlan_ctx.dtype = ICE_RX_DTYPE_NO_SPLIT;
+   rlan_ctx.hsplit_0 = ICE_RLAN_RX_HSPLIT_0_NO_SPLIT;
+   rlan_ctx.hsplit_1 = ICE_RLAN_RX_HSPLIT_1_NO_SPLIT;
+
+   /* This controls whether VLAN is stripped from inner headers
+* The VLAN in the inner L2 header is stripped to the receive
+* descriptor if enabled by this flag.
+*/
+   rlan_ctx.showiv = 0;
+
+   /* Max packet size for this queue - must not be set to a larger value
+* than 5 x DBUF
+*/
+   rlan_ctx.rxmax = min_t(u16, vsi->max_frame,
+  ICE_MAX_CHAINED_RX_BUFS * vsi->rx_buf_len);
+
+   /* Rx queue threshold in units of 64 */
+   rlan_ctx.lrxqthresh = 1;
+
+/* Enable Flexible Descriptors in the queue context which
+ * allows this driver to select a specific receive descriptor format
+ */
+   regval = rd32(hw, QRXFLXP_CNTXT(pf_q));
+   regval |= (rxdid << QRXFLXP_CNTXT_RXDID_IDX_S) &
+   QRXFLXP_CNTXT_RXDID_IDX_M;
+
+   /* increasing context priority to pick up profile id;
+* default is 0x01; setting to 0x03 to ensure profile
+* is programming if prev context is of same priority
+*/
+   regval |= (0x03 << QRXFLXP_CNTXT_RXDID_PRIO_S) &
+   QRXFLXP_CNTXT_RXDID_PRIO_M;
+
+   wr32(hw, QRXFLXP_CNTXT(pf_q), regval);
+
+   /* Absolute queue number out of 2K needs to be passed */
+   err = ice_write_rxq_ctx(hw, &rlan_ctx, pf_q);
+   if (err) {
+   dev_err(&vsi->back->pdev->dev,
+   "Failed to set LAN Rx queue context for absolute Rx 
queue %d error: %d\n",
+   pf_q, err);
+   return -EIO;
+   }
+
+   /* init queue specific tail register */
+   ring->tail = hw->hw_addr + QRX_TAIL(pf_q);
+   writel(0, ring->tail);
+   ice_alloc_rx_bufs(ring, ICE_DESC_UNUSED(ring));
+
+   return 0;
+}
+
+/**
+ * ice_setup_tx_ctx - setup a struct ice_tlan_ctx instance
+ * @ring: The Tx ring to configure
+ * @tlan_ctx: Pointer to the Tx LAN queue context structure to be initialized
+ * @pf_q: queue index in the PF space
+ *
+ * Configure the Tx descriptor ring in TLAN context.
+ */
+static void
+ice_setup_tx_ctx(struct ice_ring *ring, struct ice_tlan_ctx *tlan_ctx, u16 
pf_q)
+{
+   struct ice_vsi *vsi = ring->vsi;
+   struct ice_hw *hw = &vsi->back->hw;
+
+   tlan_ctx->base = ring->dma >> ICE_TLAN_CTX_BASE_S;
+
+   tlan_ctx->port_num = vsi->port_info->lport;
+
+   /* Transmit Queue Length */
+   tlan_ctx->qlen = ring->count;
+
+   /* PF number */
+   tlan_ctx->pf_num = hw->pf_id;
+
+   /* queue belongs to a specific VSI type
+* VF / VM index should be programmed per vmvf_type setting:
+* for vmvf_type = VF, it is VF number 

[net-next 7/8] ice: Move common functions out of ice_main.c part 7/7

2018-10-01 Thread Jeff Kirsher
From: Anirudh Venkataramanan 

This patch completes the code move out of ice_main.c

The following top level functions and related dependency functions) were
moved to ice_lib.c:
ice_vsi_setup
ice_vsi_cfg_tc

The following functions were made static again:
ice_vsi_setup_vector_base
ice_vsi_alloc_q_vectors
ice_vsi_get_qs
ice_vsi_map_rings_to_vectors
ice_vsi_alloc_rings
ice_vsi_set_rss_params
ice_vsi_set_num_qs
ice_get_free_slot
ice_vsi_init
ice_vsi_alloc_arrays

Signed-off-by: Anirudh Venkataramanan 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ice/ice_lib.c  | 253 +-
 drivers/net/ethernet/intel/ice/ice_lib.h  |  26 +--
 drivers/net/ethernet/intel/ice/ice_main.c | 252 -
 3 files changed, 249 insertions(+), 282 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c 
b/drivers/net/ethernet/intel/ice/ice_lib.c
index 232ca06974ea..21e3a3e70329 100644
--- a/drivers/net/ethernet/intel/ice/ice_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_lib.c
@@ -233,7 +233,7 @@ static int ice_vsi_ctrl_rx_rings(struct ice_vsi *vsi, bool 
ena)
  * On error: returns error code (negative)
  * On success: returns 0
  */
-int ice_vsi_alloc_arrays(struct ice_vsi *vsi, bool alloc_qvectors)
+static int ice_vsi_alloc_arrays(struct ice_vsi *vsi, bool alloc_qvectors)
 {
struct ice_pf *pf = vsi->back;
 
@@ -274,7 +274,7 @@ int ice_vsi_alloc_arrays(struct ice_vsi *vsi, bool 
alloc_qvectors)
  *
  * Return 0 on success and a negative value on error
  */
-void ice_vsi_set_num_qs(struct ice_vsi *vsi)
+static void ice_vsi_set_num_qs(struct ice_vsi *vsi)
 {
struct ice_pf *pf = vsi->back;
 
@@ -301,7 +301,7 @@ void ice_vsi_set_num_qs(struct ice_vsi *vsi)
  * void * is being used to keep the functionality generic. This lets us use 
this
  * function on any array of pointers.
  */
-int ice_get_free_slot(void *array, int size, int curr)
+static int ice_get_free_slot(void *array, int size, int curr)
 {
int **tmp_array = (int **)array;
int next;
@@ -423,6 +423,70 @@ irqreturn_t ice_msix_clean_rings(int __always_unused irq, 
void *data)
return IRQ_HANDLED;
 }
 
+/**
+ * ice_vsi_alloc - Allocates the next available struct VSI in the PF
+ * @pf: board private structure
+ * @type: type of VSI
+ *
+ * returns a pointer to a VSI on success, NULL on failure.
+ */
+static struct ice_vsi *ice_vsi_alloc(struct ice_pf *pf, enum ice_vsi_type type)
+{
+   struct ice_vsi *vsi = NULL;
+
+   /* Need to protect the allocation of the VSIs at the PF level */
+   mutex_lock(&pf->sw_mutex);
+
+   /* If we have already allocated our maximum number of VSIs,
+* pf->next_vsi will be ICE_NO_VSI. If not, pf->next_vsi index
+* is available to be populated
+*/
+   if (pf->next_vsi == ICE_NO_VSI) {
+   dev_dbg(&pf->pdev->dev, "out of VSI slots!\n");
+   goto unlock_pf;
+   }
+
+   vsi = devm_kzalloc(&pf->pdev->dev, sizeof(*vsi), GFP_KERNEL);
+   if (!vsi)
+   goto unlock_pf;
+
+   vsi->type = type;
+   vsi->back = pf;
+   set_bit(__ICE_DOWN, vsi->state);
+   vsi->idx = pf->next_vsi;
+   vsi->work_lmt = ICE_DFLT_IRQ_WORK;
+
+   ice_vsi_set_num_qs(vsi);
+
+   switch (vsi->type) {
+   case ICE_VSI_PF:
+   if (ice_vsi_alloc_arrays(vsi, true))
+   goto err_rings;
+
+   /* Setup default MSIX irq handler for VSI */
+   vsi->irq_handler = ice_msix_clean_rings;
+   break;
+   default:
+   dev_warn(&pf->pdev->dev, "Unknown VSI type %d\n", vsi->type);
+   goto unlock_pf;
+   }
+
+   /* fill VSI slot in the PF struct */
+   pf->vsi[pf->next_vsi] = vsi;
+
+   /* prepare pf->next_vsi for next use */
+   pf->next_vsi = ice_get_free_slot(pf->vsi, pf->num_alloc_vsi,
+pf->next_vsi);
+   goto unlock_pf;
+
+err_rings:
+   devm_kfree(&pf->pdev->dev, vsi);
+   vsi = NULL;
+unlock_pf:
+   mutex_unlock(&pf->sw_mutex);
+   return vsi;
+}
+
 /**
  * ice_vsi_get_qs_contig - Assign a contiguous chunk of queues to VSI
  * @vsi: the VSI getting queues
@@ -533,7 +597,7 @@ static int ice_vsi_get_qs_scatter(struct ice_vsi *vsi)
  *
  * Returns 0 on success and a negative value on error
  */
-int ice_vsi_get_qs(struct ice_vsi *vsi)
+static int ice_vsi_get_qs(struct ice_vsi *vsi)
 {
int ret = 0;
 
@@ -602,7 +666,7 @@ static void ice_rss_clean(struct ice_vsi *vsi)
  * ice_vsi_set_rss_params - Setup RSS capabilities per VSI type
  * @vsi: the VSI being configured
  */
-void ice_vsi_set_rss_params(struct ice_vsi *vsi)
+static void ice_vsi_set_rss_params(struct ice_vsi *vsi)
 {
struct ice_hw_common_caps *cap;
struct ice_pf *pf = vsi->back;
@@ -793,7 +857,7 @@ static void ice_set_rss_vsi_ctx(struct ice_vsi_ctx *ctxt, 
struct ice_vsi *vsi)
  * This initializes a VSI context depending on 

[net-next 4/8] ice: Move common functions out of ice_main.c part 4/7

2018-10-01 Thread Jeff Kirsher
From: Anirudh Venkataramanan 

This patch continues the code move out of ice_main.c

The following top level functions (and related dependency functions) were
moved to ice_lib.c:
ice_vsi_alloc_rings
ice_vsi_set_rss_params
ice_vsi_set_num_qs
ice_get_free_slot
ice_vsi_init
ice_vsi_clear_rings
ice_vsi_alloc_arrays

Signed-off-by: Anirudh Venkataramanan 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ice/ice_lib.c  | 414 +
 drivers/net/ethernet/intel/ice/ice_lib.h  |  14 +
 drivers/net/ethernet/intel/ice/ice_main.c | 418 +-
 3 files changed, 429 insertions(+), 417 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c 
b/drivers/net/ethernet/intel/ice/ice_lib.c
index 474ce5828bd4..df20d68c92ab 100644
--- a/drivers/net/ethernet/intel/ice/ice_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_lib.c
@@ -225,6 +225,102 @@ static int ice_vsi_ctrl_rx_rings(struct ice_vsi *vsi, 
bool ena)
return ret;
 }
 
+/**
+ * ice_vsi_alloc_arrays - Allocate queue and vector pointer arrays for the VSI
+ * @vsi: VSI pointer
+ * @alloc_qvectors: a bool to specify if q_vectors need to be allocated.
+ *
+ * On error: returns error code (negative)
+ * On success: returns 0
+ */
+int ice_vsi_alloc_arrays(struct ice_vsi *vsi, bool alloc_qvectors)
+{
+   struct ice_pf *pf = vsi->back;
+
+   /* allocate memory for both Tx and Rx ring pointers */
+   vsi->tx_rings = devm_kcalloc(&pf->pdev->dev, vsi->alloc_txq,
+sizeof(struct ice_ring *), GFP_KERNEL);
+   if (!vsi->tx_rings)
+   goto err_txrings;
+
+   vsi->rx_rings = devm_kcalloc(&pf->pdev->dev, vsi->alloc_rxq,
+sizeof(struct ice_ring *), GFP_KERNEL);
+   if (!vsi->rx_rings)
+   goto err_rxrings;
+
+   if (alloc_qvectors) {
+   /* allocate memory for q_vector pointers */
+   vsi->q_vectors = devm_kcalloc(&pf->pdev->dev,
+ vsi->num_q_vectors,
+ sizeof(struct ice_q_vector *),
+ GFP_KERNEL);
+   if (!vsi->q_vectors)
+   goto err_vectors;
+   }
+
+   return 0;
+
+err_vectors:
+   devm_kfree(&pf->pdev->dev, vsi->rx_rings);
+err_rxrings:
+   devm_kfree(&pf->pdev->dev, vsi->tx_rings);
+err_txrings:
+   return -ENOMEM;
+}
+
+/**
+ * ice_vsi_set_num_qs - Set num queues, descriptors and vectors for a VSI
+ * @vsi: the VSI being configured
+ *
+ * Return 0 on success and a negative value on error
+ */
+void ice_vsi_set_num_qs(struct ice_vsi *vsi)
+{
+   struct ice_pf *pf = vsi->back;
+
+   switch (vsi->type) {
+   case ICE_VSI_PF:
+   vsi->alloc_txq = pf->num_lan_tx;
+   vsi->alloc_rxq = pf->num_lan_rx;
+   vsi->num_desc = ALIGN(ICE_DFLT_NUM_DESC, ICE_REQ_DESC_MULTIPLE);
+   vsi->num_q_vectors = max_t(int, pf->num_lan_rx, pf->num_lan_tx);
+   break;
+   default:
+   dev_warn(&vsi->back->pdev->dev, "Unknown VSI type %d\n",
+vsi->type);
+   break;
+   }
+}
+
+/**
+ * ice_get_free_slot - get the next non-NULL location index in array
+ * @array: array to search
+ * @size: size of the array
+ * @curr: last known occupied index to be used as a search hint
+ *
+ * void * is being used to keep the functionality generic. This lets us use 
this
+ * function on any array of pointers.
+ */
+int ice_get_free_slot(void *array, int size, int curr)
+{
+   int **tmp_array = (int **)array;
+   int next;
+
+   if (curr < (size - 1) && !tmp_array[curr + 1]) {
+   next = curr + 1;
+   } else {
+   int i = 0;
+
+   while ((i < size) && (tmp_array[i]))
+   i++;
+   if (i == size)
+   next = ICE_NO_VSI;
+   else
+   next = i;
+   }
+   return next;
+}
+
 /**
  * ice_vsi_delete - delete a VSI from the switch
  * @vsi: pointer to VSI being removed
@@ -286,6 +382,324 @@ void ice_vsi_put_qs(struct ice_vsi *vsi)
 mutex_unlock(&pf->avail_q_mutex);
 }
 
+/**
+ * ice_vsi_set_rss_params - Setup RSS capabilities per VSI type
+ * @vsi: the VSI being configured
+ */
+void ice_vsi_set_rss_params(struct ice_vsi *vsi)
+{
+   struct ice_hw_common_caps *cap;
+   struct ice_pf *pf = vsi->back;
+
+   if (!test_bit(ICE_FLAG_RSS_ENA, pf->flags)) {
+   vsi->rss_size = 1;
+   return;
+   }
+
+   cap = &pf->hw.func_caps.common_cap;
+   switch (vsi->type) {
+   case ICE_VSI_PF:
+   /* PF VSI will inherit RSS instance of PF */
+   vsi->rss_table_size = cap->rss_table_size;
+   vsi->rss_size = min_t(int, num_online_cpus(),
+ BIT(cap->rss_table_entry_width));
+   

[net-next 6/8] ice: Move common functions out of ice_main.c part 6/7

2018-10-01 Thread Jeff Kirsher
From: Anirudh Venkataramanan 

This patch continues the code move out of ice_main.c

The following top level functions (and related dependency functions) were
moved to ice_lib.c:
ice_vsi_setup_vector_base
ice_vsi_alloc_q_vectors
ice_vsi_get_qs

The following functions were made static again:
ice_vsi_free_arrays
ice_vsi_clear_rings

Also, in this patch, the netdev and NAPI registration logic was de-coupled
from the VSI creation logic (ice_vsi_setup) as for SR-IOV, while we want to
create VF VSIs using ice_vsi_setup, we don't want to create netdevs.

Signed-off-by: Anirudh Venkataramanan 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ice/ice_lib.c  | 463 +++-
 drivers/net/ethernet/intel/ice/ice_lib.h  |  16 +-
 drivers/net/ethernet/intel/ice/ice_main.c | 492 +++---
 3 files changed, 521 insertions(+), 450 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c 
b/drivers/net/ethernet/intel/ice/ice_lib.c
index 6ba82337d017..232ca06974ea 100644
--- a/drivers/net/ethernet/intel/ice/ice_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_lib.c
@@ -346,7 +346,7 @@ void ice_vsi_delete(struct ice_vsi *vsi)
  * @vsi: pointer to VSI being cleared
  * @free_qvectors: bool to specify if q_vectors should be deallocated
  */
-void ice_vsi_free_arrays(struct ice_vsi *vsi, bool free_qvectors)
+static void ice_vsi_free_arrays(struct ice_vsi *vsi, bool free_qvectors)
 {
struct ice_pf *pf = vsi->back;
 
@@ -423,6 +423,141 @@ irqreturn_t ice_msix_clean_rings(int __always_unused irq, 
void *data)
return IRQ_HANDLED;
 }
 
+/**
+ * ice_vsi_get_qs_contig - Assign a contiguous chunk of queues to VSI
+ * @vsi: the VSI getting queues
+ *
+ * Return 0 on success and a negative value on error
+ */
+static int ice_vsi_get_qs_contig(struct ice_vsi *vsi)
+{
+   struct ice_pf *pf = vsi->back;
+   int offset, ret = 0;
+
+   mutex_lock(&pf->avail_q_mutex);
+   /* look for contiguous block of queues for Tx */
+   offset = bitmap_find_next_zero_area(pf->avail_txqs, ICE_MAX_TXQS,
+   0, vsi->alloc_txq, 0);
+   if (offset < ICE_MAX_TXQS) {
+   int i;
+
+   bitmap_set(pf->avail_txqs, offset, vsi->alloc_txq);
+   for (i = 0; i < vsi->alloc_txq; i++)
+   vsi->txq_map[i] = i + offset;
+   } else {
+   ret = -ENOMEM;
+   vsi->tx_mapping_mode = ICE_VSI_MAP_SCATTER;
+   }
+
+   /* look for contiguous block of queues for Rx */
+   offset = bitmap_find_next_zero_area(pf->avail_rxqs, ICE_MAX_RXQS,
+   0, vsi->alloc_rxq, 0);
+   if (offset < ICE_MAX_RXQS) {
+   int i;
+
+   bitmap_set(pf->avail_rxqs, offset, vsi->alloc_rxq);
+   for (i = 0; i < vsi->alloc_rxq; i++)
+   vsi->rxq_map[i] = i + offset;
+   } else {
+   ret = -ENOMEM;
+   vsi->rx_mapping_mode = ICE_VSI_MAP_SCATTER;
+   }
+   mutex_unlock(&pf->avail_q_mutex);
+
+   return ret;
+}
+
+/**
+ * ice_vsi_get_qs_scatter - Assign scattered queues to a VSI
+ * @vsi: the VSI getting queues
+ *
+ * Return 0 on success and a negative value on error
+ */
+static int ice_vsi_get_qs_scatter(struct ice_vsi *vsi)
+{
+   struct ice_pf *pf = vsi->back;
+   int i, index = 0;
+
+   mutex_lock(&pf->avail_q_mutex);
+
+   if (vsi->tx_mapping_mode == ICE_VSI_MAP_SCATTER) {
+   for (i = 0; i < vsi->alloc_txq; i++) {
+   index = find_next_zero_bit(pf->avail_txqs,
+  ICE_MAX_TXQS, index);
+   if (index < ICE_MAX_TXQS) {
+   set_bit(index, pf->avail_txqs);
+   vsi->txq_map[i] = index;
+   } else {
+   goto err_scatter_tx;
+   }
+   }
+   }
+
+   if (vsi->rx_mapping_mode == ICE_VSI_MAP_SCATTER) {
+   for (i = 0; i < vsi->alloc_rxq; i++) {
+   index = find_next_zero_bit(pf->avail_rxqs,
+  ICE_MAX_RXQS, index);
+   if (index < ICE_MAX_RXQS) {
+   set_bit(index, pf->avail_rxqs);
+   vsi->rxq_map[i] = index;
+   } else {
+   goto err_scatter_rx;
+   }
+   }
+   }
+
+   mutex_unlock(&pf->avail_q_mutex);
+   return 0;
+
+err_scatter_rx:
+   /* unflag any queues we have grabbed (i is failed position) */
+   for (index = 0; index < i; index++) {
+   clear_bit(vsi->rxq_map[index], pf->avail_rxqs);
+   vsi->rxq_map[index] = 0;
+   }
+   i = vsi->alloc_txq;
+err_scatter_tx:
+   /* i is either position of failed attempt 

[net-next 5/8] ice: Move common functions out of ice_main.c part 5/7

2018-10-01 Thread Jeff Kirsher
From: Anirudh Venkataramanan 

This patch continues the code move out of ice_main.c

The following top level functions (and related dependency functions) were
moved to ice_lib.c:
ice_vsi_clear
ice_vsi_close
ice_vsi_free_arrays
ice_vsi_map_rings_to_vectors

Signed-off-by: Anirudh Venkataramanan 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ice/ice_lib.c  | 133 ++
 drivers/net/ethernet/intel/ice/ice_lib.h  |   8 ++
 drivers/net/ethernet/intel/ice/ice_main.c | 132 -
 3 files changed, 141 insertions(+), 132 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c 
b/drivers/net/ethernet/intel/ice/ice_lib.c
index df20d68c92ab..6ba82337d017 100644
--- a/drivers/net/ethernet/intel/ice/ice_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_lib.c
@@ -341,6 +341,71 @@ void ice_vsi_delete(struct ice_vsi *vsi)
vsi->vsi_num);
 }
 
+/**
+ * ice_vsi_free_arrays - clean up VSI resources
+ * @vsi: pointer to VSI being cleared
+ * @free_qvectors: bool to specify if q_vectors should be deallocated
+ */
+void ice_vsi_free_arrays(struct ice_vsi *vsi, bool free_qvectors)
+{
+   struct ice_pf *pf = vsi->back;
+
+   /* free the ring and vector containers */
+   if (free_qvectors && vsi->q_vectors) {
+   devm_kfree(&pf->pdev->dev, vsi->q_vectors);
+   vsi->q_vectors = NULL;
+   }
+   if (vsi->tx_rings) {
+   devm_kfree(&pf->pdev->dev, vsi->tx_rings);
+   vsi->tx_rings = NULL;
+   }
+   if (vsi->rx_rings) {
+   devm_kfree(&pf->pdev->dev, vsi->rx_rings);
+   vsi->rx_rings = NULL;
+   }
+}
+
+/**
+ * ice_vsi_clear - clean up and deallocate the provided VSI
+ * @vsi: pointer to VSI being cleared
+ *
+ * This deallocates the VSI's queue resources, removes it from the PF's
+ * VSI array if necessary, and deallocates the VSI
+ *
+ * Returns 0 on success, negative on failure
+ */
+int ice_vsi_clear(struct ice_vsi *vsi)
+{
+   struct ice_pf *pf = NULL;
+
+   if (!vsi)
+   return 0;
+
+   if (!vsi->back)
+   return -EINVAL;
+
+   pf = vsi->back;
+
+   if (!pf->vsi[vsi->idx] || pf->vsi[vsi->idx] != vsi) {
+   dev_dbg(&pf->pdev->dev, "vsi does not exist at pf->vsi[%d]\n",
+   vsi->idx);
+   return -EINVAL;
+   }
+
+   mutex_lock(&pf->sw_mutex);
+   /* updates the PF for this cleared VSI */
+
+   pf->vsi[vsi->idx] = NULL;
+   if (vsi->idx < pf->next_vsi)
+   pf->next_vsi = vsi->idx;
+
+   ice_vsi_free_arrays(vsi, true);
+   mutex_unlock(&pf->sw_mutex);
+   devm_kfree(&pf->pdev->dev, vsi);
+
+   return 0;
+}
+
 /**
  * ice_msix_clean_rings - MSIX mode Interrupt Handler
  * @irq: interrupt number
@@ -700,6 +765,60 @@ int ice_vsi_alloc_rings(struct ice_vsi *vsi)
return -ENOMEM;
 }
 
+/**
+ * ice_vsi_map_rings_to_vectors - Map VSI rings to interrupt vectors
+ * @vsi: the VSI being configured
+ *
+ * This function maps descriptor rings to the queue-specific vectors allotted
+ * through the MSI-X enabling code. On a constrained vector budget, we map Tx
+ * and Rx rings to the vector as "efficiently" as possible.
+ */
+void ice_vsi_map_rings_to_vectors(struct ice_vsi *vsi)
+{
+   int q_vectors = vsi->num_q_vectors;
+   int tx_rings_rem, rx_rings_rem;
+   int v_id;
+
+   /* initially assigning remaining rings count to VSIs num queue value */
+   tx_rings_rem = vsi->num_txq;
+   rx_rings_rem = vsi->num_rxq;
+
+   for (v_id = 0; v_id < q_vectors; v_id++) {
+   struct ice_q_vector *q_vector = vsi->q_vectors[v_id];
+   int tx_rings_per_v, rx_rings_per_v, q_id, q_base;
+
+   /* Tx rings mapping to vector */
+   tx_rings_per_v = DIV_ROUND_UP(tx_rings_rem, q_vectors - v_id);
+   q_vector->num_ring_tx = tx_rings_per_v;
+   q_vector->tx.ring = NULL;
+   q_base = vsi->num_txq - tx_rings_rem;
+
+   for (q_id = q_base; q_id < (q_base + tx_rings_per_v); q_id++) {
+   struct ice_ring *tx_ring = vsi->tx_rings[q_id];
+
+   tx_ring->q_vector = q_vector;
+   tx_ring->next = q_vector->tx.ring;
+   q_vector->tx.ring = tx_ring;
+   }
+   tx_rings_rem -= tx_rings_per_v;
+
+   /* Rx rings mapping to vector */
+   rx_rings_per_v = DIV_ROUND_UP(rx_rings_rem, q_vectors - v_id);
+   q_vector->num_ring_rx = rx_rings_per_v;
+   q_vector->rx.ring = NULL;
+   q_base = vsi->num_rxq - rx_rings_rem;
+
+   for (q_id = q_base; q_id < (q_base + rx_rings_per_v); q_id++) {
+   struct ice_ring *rx_ring = vsi->rx_rings[q_id];
+
+   rx_ring->q_vector = q_vector;
+   rx_ring->next = q_vector->rx.ring;
+   

[net-next 0/8][pull request] 100GbE Intel Wired LAN Driver Updates 2018-10-01

2018-10-01 Thread Jeff Kirsher
This series contains updates to ice driver only.

Anirudh provides several changes to "prep" the driver for upcoming
features.  Specifically, the functions that are used for PF VSI/netdev
setup will also be used in SR-IOV support and to allow the reuse of
these functions, code needs to move.

Dave provides the only other change in the series, which updates the driver
to protect the reset path in its entirety.  This is done by adding the
various bit checks to determine if a reset is scheduled/initiated and
whether it came from the software or firmware.

The following are changes since commit 804fe108fc92e591ddfe9447e7fb4691ed16daee:
  openvswitch: Use correct reply values in datapath and vport ops
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue 100GbE

Anirudh Venkataramanan (7):
  ice: Move common functions out of ice_main.c part 1/7
  ice: Move common functions out of ice_main.c part 2/7
  ice: Move common functions out of ice_main.c part 3/7
  ice: Move common functions out of ice_main.c part 4/7
  ice: Move common functions out of ice_main.c part 5/7
  ice: Move common functions out of ice_main.c part 6/7
  ice: Move common functions out of ice_main.c part 7/7

Dave Ertman (1):
  ice: Change pf state behavior to protect reset path

 drivers/net/ethernet/intel/ice/Makefile |1 +
 drivers/net/ethernet/intel/ice/ice.h|2 +-
 drivers/net/ethernet/intel/ice/ice_common.c |   61 +
 drivers/net/ethernet/intel/ice/ice_common.h |4 +
 drivers/net/ethernet/intel/ice/ice_lib.c| 2379 +++
 drivers/net/ethernet/intel/ice/ice_lib.h|   74 +
 drivers/net/ethernet/intel/ice/ice_main.c   | 2901 ++-
 7 files changed, 2773 insertions(+), 2649 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/ice/ice_lib.c
 create mode 100644 drivers/net/ethernet/intel/ice/ice_lib.h

-- 
2.17.1



[net-next 3/8] ice: Move common functions out of ice_main.c part 3/7

2018-10-01 Thread Jeff Kirsher
From: Anirudh Venkataramanan 

This patch continues the code move out of ice_main.c

The following top level functions (and related dependency functions) were
moved to ice_lib.c:
ice_vsi_delete
ice_free_res
ice_get_res
ice_is_reset_recovery_pending
ice_vsi_put_qs
ice_vsi_dis_irq
ice_vsi_free_irq
ice_vsi_free_rx_rings
ice_vsi_free_tx_rings
ice_msix_clean_rings

Signed-off-by: Anirudh Venkataramanan 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ice/ice_lib.c  | 388 ++
 drivers/net/ethernet/intel/ice/ice_lib.h  |  22 ++
 drivers/net/ethernet/intel/ice/ice_main.c | 386 -
 3 files changed, 410 insertions(+), 386 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c 
b/drivers/net/ethernet/intel/ice/ice_lib.c
index 06a54d79fba8..474ce5828bd4 100644
--- a/drivers/net/ethernet/intel/ice/ice_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_lib.c
@@ -225,6 +225,67 @@ static int ice_vsi_ctrl_rx_rings(struct ice_vsi *vsi, bool 
ena)
return ret;
 }
 
+/**
+ * ice_vsi_delete - delete a VSI from the switch
+ * @vsi: pointer to VSI being removed
+ */
+void ice_vsi_delete(struct ice_vsi *vsi)
+{
+   struct ice_pf *pf = vsi->back;
+   struct ice_vsi_ctx ctxt;
+   enum ice_status status;
+
+   ctxt.vsi_num = vsi->vsi_num;
+
+   memcpy(&ctxt.info, &vsi->info, sizeof(struct ice_aqc_vsi_props));
+
+   status = ice_free_vsi(&pf->hw, vsi->idx, &ctxt, false, NULL);
+   if (status)
+   dev_err(&pf->pdev->dev, "Failed to delete VSI %i in FW\n",
+   vsi->vsi_num);
+}
+
+/**
+ * ice_msix_clean_rings - MSIX mode Interrupt Handler
+ * @irq: interrupt number
+ * @data: pointer to a q_vector
+ */
+irqreturn_t ice_msix_clean_rings(int __always_unused irq, void *data)
+{
+   struct ice_q_vector *q_vector = (struct ice_q_vector *)data;
+
+   if (!q_vector->tx.ring && !q_vector->rx.ring)
+   return IRQ_HANDLED;
+
+   napi_schedule(&q_vector->napi);
+
+   return IRQ_HANDLED;
+}
+
+/**
+ * ice_vsi_put_qs - Release queues from VSI to PF
+ * @vsi: the VSI that is going to release queues
+ */
+void ice_vsi_put_qs(struct ice_vsi *vsi)
+{
+   struct ice_pf *pf = vsi->back;
+   int i;
+
+   mutex_lock(&pf->avail_q_mutex);
+
+   for (i = 0; i < vsi->alloc_txq; i++) {
+   clear_bit(vsi->txq_map[i], pf->avail_txqs);
+   vsi->txq_map[i] = ICE_INVAL_Q_INDEX;
+   }
+
+   for (i = 0; i < vsi->alloc_rxq; i++) {
+   clear_bit(vsi->rxq_map[i], pf->avail_rxqs);
+   vsi->rxq_map[i] = ICE_INVAL_Q_INDEX;
+   }
+
+   mutex_unlock(&pf->avail_q_mutex);
+}
+
 /**
  * ice_add_mac_to_list - Add a mac address filter entry to the list
  * @vsi: the VSI to be forwarded to
@@ -747,3 +808,330 @@ int ice_vsi_stop_tx_rings(struct ice_vsi *vsi)
 
return err;
 }
+
+/**
+ * ice_cfg_vlan_pruning - enable or disable VLAN pruning on the VSI
+ * @vsi: VSI to enable or disable VLAN pruning on
+ * @ena: set to true to enable VLAN pruning and false to disable it
+ *
+ * returns 0 if VSI is updated, negative otherwise
+ */
+int ice_cfg_vlan_pruning(struct ice_vsi *vsi, bool ena)
+{
+   struct ice_vsi_ctx *ctxt;
+   struct device *dev;
+   int status;
+
+   if (!vsi)
+   return -EINVAL;
+
+   dev = &vsi->back->pdev->dev;
+   ctxt = devm_kzalloc(dev, sizeof(*ctxt), GFP_KERNEL);
+   if (!ctxt)
+   return -ENOMEM;
+
+   ctxt->info = vsi->info;
+
+   if (ena) {
+   ctxt->info.sec_flags |=
+   ICE_AQ_VSI_SEC_TX_VLAN_PRUNE_ENA <<
+   ICE_AQ_VSI_SEC_TX_PRUNE_ENA_S;
+   ctxt->info.sw_flags2 |= ICE_AQ_VSI_SW_FLAG_RX_VLAN_PRUNE_ENA;
+   } else {
+   ctxt->info.sec_flags &=
+   ~(ICE_AQ_VSI_SEC_TX_VLAN_PRUNE_ENA <<
+ ICE_AQ_VSI_SEC_TX_PRUNE_ENA_S);
+   ctxt->info.sw_flags2 &= ~ICE_AQ_VSI_SW_FLAG_RX_VLAN_PRUNE_ENA;
+   }
+
+   ctxt->info.valid_sections = cpu_to_le16(ICE_AQ_VSI_PROP_SECURITY_VALID |
+   ICE_AQ_VSI_PROP_SW_VALID);
+   ctxt->vsi_num = vsi->vsi_num;
+   status = ice_aq_update_vsi(&vsi->back->hw, ctxt, NULL);
+   if (status) {
+   netdev_err(vsi->netdev, "%sabling VLAN pruning on VSI %d failed, err = %d, aq_err = %d\n",
+  ena ? "Ena" : "Dis", vsi->vsi_num, status,
+  vsi->back->hw.adminq.sq_last_status);
+   goto err_out;
+   }
+
+   vsi->info.sec_flags = ctxt->info.sec_flags;
+   vsi->info.sw_flags2 = ctxt->info.sw_flags2;
+
+   devm_kfree(dev, ctxt);
+   return 0;
+
+err_out:
+   devm_kfree(dev, ctxt);
+   return -EIO;
+}
+
+/**
+ * ice_vsi_release_msix - Clear the queue to Interrupt mapping in HW
+ * @vsi: the VSI being cleaned up
+ */
+static void ice_vsi_release_msix(struct ice_vsi *vsi)
+{
+   

[net-next 1/8] ice: Move common functions out of ice_main.c part 1/7

2018-10-01 Thread Jeff Kirsher
From: Anirudh Venkataramanan 

The functions that are used for PF VSI/netdev setup will also be used
for SR-IOV support. To allow reuse of these functions, move these
functions out of ice_main.c to ice_common.c/ice_lib.c

This move is done across multiple patches. Each patch moves a few
functions and may have minor adjustments. For example, a function that was
previously static in ice_main.c will be made non-static temporarily in
its new location to allow the driver to build cleanly. These adjustments
will be removed in subsequent patches where more code is moved out of
ice_main.c

In this particular patch, the following functions were moved out of
ice_main.c:
int ice_add_mac_to_list
ice_free_fltr_list
ice_stat_update40
ice_stat_update32
ice_update_eth_stats
ice_vsi_add_vlan
ice_vsi_kill_vlan
ice_vsi_manage_vlan_insertion
ice_vsi_manage_vlan_stripping

Signed-off-by: Anirudh Venkataramanan 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ice/Makefile |   1 +
 drivers/net/ethernet/intel/ice/ice_common.c |  61 
 drivers/net/ethernet/intel/ice/ice_common.h |   4 +
 drivers/net/ethernet/intel/ice/ice_lib.c| 258 
 drivers/net/ethernet/intel/ice/ice_lib.h|  23 ++
 drivers/net/ethernet/intel/ice/ice_main.c   | 316 +---
 6 files changed, 348 insertions(+), 315 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/ice/ice_lib.c
 create mode 100644 drivers/net/ethernet/intel/ice/ice_lib.h

diff --git a/drivers/net/ethernet/intel/ice/Makefile 
b/drivers/net/ethernet/intel/ice/Makefile
index 4058673fd853..45125bd074d9 100644
--- a/drivers/net/ethernet/intel/ice/Makefile
+++ b/drivers/net/ethernet/intel/ice/Makefile
@@ -13,5 +13,6 @@ ice-y := ice_main.o   \
 ice_nvm.o  \
 ice_switch.o   \
 ice_sched.o\
+ice_lib.o  \
 ice_txrx.o \
 ice_ethtool.o
diff --git a/drivers/net/ethernet/intel/ice/ice_common.c 
b/drivers/net/ethernet/intel/ice/ice_common.c
index decfdb065a20..ef9229fa5510 100644
--- a/drivers/net/ethernet/intel/ice/ice_common.c
+++ b/drivers/net/ethernet/intel/ice/ice_common.c
@@ -2652,3 +2652,64 @@ ice_cfg_vsi_lan(struct ice_port_info *pi, u16 vsi_id, u8 
tc_bitmap,
return ice_cfg_vsi_qs(pi, vsi_id, tc_bitmap, max_lanqs,
  ICE_SCHED_NODE_OWNER_LAN);
 }
+
+/**
+ * ice_stat_update40 - read 40 bit stat from the chip and update stat values
+ * @hw: ptr to the hardware info
+ * @hireg: high 32 bit HW register to read from
+ * @loreg: low 32 bit HW register to read from
+ * @prev_stat_loaded: bool to specify if previous stats are loaded
+ * @prev_stat: ptr to previous loaded stat value
+ * @cur_stat: ptr to current stat value
+ */
+void ice_stat_update40(struct ice_hw *hw, u32 hireg, u32 loreg,
+  bool prev_stat_loaded, u64 *prev_stat, u64 *cur_stat)
+{
+   u64 new_data;
+
+   new_data = rd32(hw, loreg);
+   new_data |= ((u64)(rd32(hw, hireg) & 0xFF)) << 32;
+
+   /* device stats are not reset at PFR, they likely will not be zeroed
+* when the driver starts. So save the first values read and use them as
+* offsets to be subtracted from the raw values in order to report stats
+* that count from zero.
+*/
+   if (!prev_stat_loaded)
+   *prev_stat = new_data;
+   if (new_data >= *prev_stat)
+   *cur_stat = new_data - *prev_stat;
+   else
+   /* to manage the potential roll-over */
+   *cur_stat = (new_data + BIT_ULL(40)) - *prev_stat;
+   *cur_stat &= 0xFFFFFFFFFFULL;
+}
+
+/**
+ * ice_stat_update32 - read 32 bit stat from the chip and update stat values
+ * @hw: ptr to the hardware info
+ * @reg: HW register to read from
+ * @prev_stat_loaded: bool to specify if previous stats are loaded
+ * @prev_stat: ptr to previous loaded stat value
+ * @cur_stat: ptr to current stat value
+ */
+void ice_stat_update32(struct ice_hw *hw, u32 reg, bool prev_stat_loaded,
+  u64 *prev_stat, u64 *cur_stat)
+{
+   u32 new_data;
+
+   new_data = rd32(hw, reg);
+
+   /* device stats are not reset at PFR, they likely will not be zeroed
+* when the driver starts. So save the first values read and use them as
+* offsets to be subtracted from the raw values in order to report stats
+* that count from zero.
+*/
+   if (!prev_stat_loaded)
+   *prev_stat = new_data;
+   if (new_data >= *prev_stat)
+   *cur_stat = new_data - *prev_stat;
+   else
+   /* to manage the potential roll-over */
+   *cur_stat = (new_data + BIT_ULL(32)) - *prev_stat;
+}
diff --git a/drivers/net/ethernet/intel/ice/ice_common.h 
b/drivers/net/ethernet/intel/ice/ice_common.h
index aac2d6cadaaf..80d288a07731 100644
--- a/drivers/net/ethernet/intel/ice/ice_common.h
+++ b/drivers/net/ethernet/intel/ice/ice_common.h
@@ -96,4 +96,8 @@ 

[net-next 8/8] ice: Change pf state behavior to protect reset path

2018-10-01 Thread Jeff Kirsher
From: Dave Ertman 

Currently, there is no bit, or set of bits, that protect the entirety
of the reset path.

If the reset is originated by the driver, then the relevant
one of the following bits will be set when the reset is scheduled:
__ICE_PFR_REQ
__ICE_CORER_REQ
__ICE_GLOBR_REQ
This bit will not be cleared until after the rebuild has completed.

If the reset is originated by the FW, then the first indication the driver
gets will be the reception of the OICR interrupt.  The __ICE_RESET_OICR_RECV
bit will be set in the interrupt handler.  This will also be the indicator
in a SW originated reset that we have completed the pre-OICR tasks and
have informed the FW that a reset was requested.

To utilize these bits, change the function:
ice_is_reset_recovery_pending()
to be:
ice_is_reset_in_progress()

The new function will check all of the above bits in pf->state and
will return true if one or more of these bits are set.

Signed-off-by: Dave Ertman 
Signed-off-by: Anirudh Venkataramanan 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ice/ice.h  |  2 +-
 drivers/net/ethernet/intel/ice/ice_lib.c  | 13 ---
 drivers/net/ethernet/intel/ice/ice_lib.h  |  2 +-
 drivers/net/ethernet/intel/ice/ice_main.c | 44 +++
 4 files changed, 31 insertions(+), 30 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice.h 
b/drivers/net/ethernet/intel/ice/ice.h
index e84a612ffa71..9cce4cb91401 100644
--- a/drivers/net/ethernet/intel/ice/ice.h
+++ b/drivers/net/ethernet/intel/ice/ice.h
@@ -124,7 +124,7 @@ enum ice_state {
__ICE_DOWN,
__ICE_NEEDS_RESTART,
__ICE_PREPARED_FOR_RESET,   /* set by driver when prepared */
-   __ICE_RESET_RECOVERY_PENDING,   /* set by driver when reset starts */
+   __ICE_RESET_OICR_RECV,  /* set by driver after rcv reset OICR */
__ICE_PFR_REQ,  /* set by driver and peers */
__ICE_CORER_REQ,/* set by driver and peers */
__ICE_GLOBR_REQ,/* set by driver and peers */
diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c 
b/drivers/net/ethernet/intel/ice/ice_lib.c
index 21e3a3e70329..95588fe0e22f 100644
--- a/drivers/net/ethernet/intel/ice/ice_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_lib.c
@@ -2250,7 +2250,7 @@ int ice_vsi_release(struct ice_vsi *vsi)
 * currently. This is done to avoid check_flush_dependency() warning
 * on this wq
 */
-   if (vsi->netdev && !ice_is_reset_recovery_pending(pf->state)) {
+   if (vsi->netdev && !ice_is_reset_in_progress(pf->state)) {
unregister_netdev(vsi->netdev);
free_netdev(vsi->netdev);
vsi->netdev = NULL;
@@ -2280,7 +2280,7 @@ int ice_vsi_release(struct ice_vsi *vsi)
 * free VSI netdev when PF is not in reset recovery pending state,
 * for ex: during rmmod.
 */
-   if (!ice_is_reset_recovery_pending(pf->state))
+   if (!ice_is_reset_in_progress(pf->state))
ice_vsi_clear(vsi);
 
return 0;
@@ -2367,10 +2367,13 @@ int ice_vsi_rebuild(struct ice_vsi *vsi)
 }
 
 /**
- * ice_is_reset_recovery_pending - schedule a reset
+ * ice_is_reset_in_progress - check for a reset in progress
  * @state: pf state field
  */
-bool ice_is_reset_recovery_pending(unsigned long *state)
+bool ice_is_reset_in_progress(unsigned long *state)
 {
-   return test_bit(__ICE_RESET_RECOVERY_PENDING, state);
+   return test_bit(__ICE_RESET_OICR_RECV, state) ||
+  test_bit(__ICE_PFR_REQ, state) ||
+  test_bit(__ICE_CORER_REQ, state) ||
+  test_bit(__ICE_GLOBR_REQ, state);
 }
diff --git a/drivers/net/ethernet/intel/ice/ice_lib.h 
b/drivers/net/ethernet/intel/ice/ice_lib.h
index a76cde895bf3..4265464ee3d3 100644
--- a/drivers/net/ethernet/intel/ice/ice_lib.h
+++ b/drivers/net/ethernet/intel/ice/ice_lib.h
@@ -54,7 +54,7 @@ ice_get_res(struct ice_pf *pf, struct ice_res_tracker *res, 
u16 needed, u16 id);
 
 int ice_vsi_rebuild(struct ice_vsi *vsi);
 
-bool ice_is_reset_recovery_pending(unsigned long *state);
+bool ice_is_reset_in_progress(unsigned long *state);
 
 void ice_vsi_free_q_vectors(struct ice_vsi *vsi);
 
diff --git a/drivers/net/ethernet/intel/ice/ice_main.c 
b/drivers/net/ethernet/intel/ice/ice_main.c
index 58cb2edd1c74..a3513acd272b 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -364,21 +364,17 @@ static void ice_do_reset(struct ice_pf *pf, enum 
ice_reset_req reset_type)
dev_dbg(dev, "reset_type 0x%x requested\n", reset_type);
WARN_ON(in_interrupt());
 
-   /* PFR is a bit of a special case because it doesn't result in an OICR
-* interrupt. Set pending bit here which otherwise gets set in the
-* OICR handler.
-*/
-   if (reset_type == ICE_RESET_PFR)
-   set_bit(__ICE_RESET_RECOVERY_PENDING, pf->state);

Re: [PATCH bpf-next v2 0/5] xsk: fix bug when trying to use both copy and zero-copy mode

2018-10-01 Thread Jakub Kicinski
On Mon,  1 Oct 2018 14:51:32 +0200, Magnus Karlsson wrote:
> Jakub, please take a look at your patches. The last one I had to
> change slightly to make it fit with the new interface
> xdp_get_umem_from_qid(). An added bonus with this function is that we,
> in the future, can also use it from the driver to get a umem, thus
> simplifying driver implementations (and later remove the umem from the
> NDO completely). Björn will mail patches, at a later point in time,
> using this in the i40e and ixgbe drivers, that removes a good chunk of
> code from the ZC implementations. 

Nice, drivers which don't follow the prepare/commit model of handling
reconfigurations will benefit!

> I also made your code aware of Tx queues. If we create a socket that
> only has a Tx queue, then the queue id will refer to a Tx queue id
> only and could be larger than the available amount of Rx queues.
> Please take a look at it.

The semantics of Tx queue id are slightly unclear.  To me XDP is
associated with Rx, so the qid in driver context can only refer to 
Rx queue and its associated XDP Tx queue.  It does not mean the Tx
queue stack uses, like it does for copy fallback.  If one doesn't have
a Rx queue $id, there will be no associated XDP Tx queue $id (in all
drivers but Intel, and virtio, which use per-CPU Tx queues making TX
queue even more meaningless).

It's to be seen how others implement AF_XDP.  My general feeling is
that we should only talk about Rx queues in context of driver XDP. 


[net-next 13/13] net/mlx5: Cache the system image guid

2018-10-01 Thread Saeed Mahameed
From: Alaa Hleihel 

The system image guid is a read-only field which is used by the TC
offloads code to determine if two mlx5 devices belong to the same
ASIC while adding flows.

Read this once and save it on the core device rather than querying each
time an offloaded flow is added.

Signed-off-by: Alaa Hleihel 
Reviewed-by: Or Gerlitz 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 4 ++--
 drivers/net/ethernet/mellanox/mlx5/core/vport.c | 9 +
 include/linux/mlx5/driver.h | 1 +
 include/linux/mlx5/vport.h  | 2 ++
 4 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 9fed54017659..82723a0e509a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -2040,8 +2040,8 @@ static bool same_hw_devs(struct mlx5e_priv *priv, struct 
mlx5e_priv *peer_priv)
fmdev = priv->mdev;
pmdev = peer_priv->mdev;
 
-   mlx5_query_nic_vport_system_image_guid(fmdev, &fsystem_guid);
-   mlx5_query_nic_vport_system_image_guid(pmdev, &psystem_guid);
+   fsystem_guid = mlx5_query_nic_system_image_guid(fmdev);
+   psystem_guid = mlx5_query_nic_system_image_guid(pmdev);
 
return (fsystem_guid == psystem_guid);
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/vport.c 
b/drivers/net/ethernet/mellanox/mlx5/core/vport.c
index b02af317c125..cfbea66b4879 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/vport.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/vport.c
@@ -1201,3 +1201,12 @@ int mlx5_nic_vport_unaffiliate_multiport(struct 
mlx5_core_dev *port_mdev)
return err;
 }
 EXPORT_SYMBOL_GPL(mlx5_nic_vport_unaffiliate_multiport);
+
+u64 mlx5_query_nic_system_image_guid(struct mlx5_core_dev *mdev)
+{
+   if (!mdev->sys_image_guid)
+   mlx5_query_nic_vport_system_image_guid(mdev, &mdev->sys_image_guid);
+
+   return mdev->sys_image_guid;
+}
+EXPORT_SYMBOL_GPL(mlx5_query_nic_system_image_guid);
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index ed73b51f6697..26a92462f4ce 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -838,6 +838,7 @@ struct mlx5_core_dev {
u32 fpga[MLX5_ST_SZ_DW(fpga_cap)];
u32 qcam[MLX5_ST_SZ_DW(qcam_reg)];
} caps;
+   u64 sys_image_guid;
phys_addr_t iseg_base;
struct mlx5_init_seg __iomem *iseg;
enum mlx5_device_state  state;
diff --git a/include/linux/mlx5/vport.h b/include/linux/mlx5/vport.h
index 7e7c6dfcfb09..9c694808c212 100644
--- a/include/linux/mlx5/vport.h
+++ b/include/linux/mlx5/vport.h
@@ -121,4 +121,6 @@ int mlx5_nic_vport_query_local_lb(struct mlx5_core_dev 
*mdev, bool *status);
 int mlx5_nic_vport_affiliate_multiport(struct mlx5_core_dev *master_mdev,
   struct mlx5_core_dev *port_mdev);
 int mlx5_nic_vport_unaffiliate_multiport(struct mlx5_core_dev *port_mdev);
+
+u64 mlx5_query_nic_system_image_guid(struct mlx5_core_dev *mdev);
 #endif /* __MLX5_VPORT_H__ */
-- 
2.17.1



[net-next 10/13] net/mlx5e: Add ethtool control of ring params to VF representors

2018-10-01 Thread Saeed Mahameed
From: Gavi Teitz 

Added ethtool control to the representors for setting and querying
the ring params.

Signed-off-by: Gavi Teitz 
Signed-off-by: Saeed Mahameed 
---
 .../net/ethernet/mellanox/mlx5/core/en_rep.c   | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
index be435e76d316..9264c3332aa6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
@@ -180,6 +180,22 @@ static int mlx5e_rep_get_sset_count(struct net_device 
*dev, int sset)
}
 }
 
+static void mlx5e_rep_get_ringparam(struct net_device *dev,
+   struct ethtool_ringparam *param)
+{
+   struct mlx5e_priv *priv = netdev_priv(dev);
+
+   mlx5e_ethtool_get_ringparam(priv, param);
+}
+
+static int mlx5e_rep_set_ringparam(struct net_device *dev,
+  struct ethtool_ringparam *param)
+{
+   struct mlx5e_priv *priv = netdev_priv(dev);
+
+   return mlx5e_ethtool_set_ringparam(priv, param);
+}
+
 static int mlx5e_replace_rep_vport_rx_rule(struct mlx5e_priv *priv,
   struct mlx5_flow_destination *dest)
 {
@@ -260,6 +276,8 @@ static const struct ethtool_ops mlx5e_rep_ethtool_ops = {
.get_strings   = mlx5e_rep_get_strings,
.get_sset_count= mlx5e_rep_get_sset_count,
.get_ethtool_stats = mlx5e_rep_get_ethtool_stats,
+   .get_ringparam = mlx5e_rep_get_ringparam,
+   .set_ringparam = mlx5e_rep_set_ringparam,
.get_channels  = mlx5e_rep_get_channels,
.set_channels  = mlx5e_rep_set_channels,
.get_rxfh_key_size   = mlx5e_rep_get_rxfh_key_size,
-- 
2.17.1



[net-next 12/13] net/mlx5e: Allow reporting of checksum unnecessary

2018-10-01 Thread Saeed Mahameed
From: Or Gerlitz 

Currently we practically never report checksum unnecessary, because
for all IP packets we take the checksum complete path.

Enable non-default runs with reporting checksum unnecessary, using
an ethtool private flag. This can be useful for performance evals
and other explorations.

Signed-off-by: Or Gerlitz 
Reviewed-by: Tariq Toukan 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  2 ++
 .../ethernet/mellanox/mlx5/core/en_ethtool.c  | 28 +++
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  4 +++
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   |  3 ++
 4 files changed, 37 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index b3bd79833517..ef7a44eb9adb 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -209,6 +209,7 @@ enum mlx5e_priv_flag {
MLX5E_PFLAG_TX_CQE_BASED_MODER = (1 << 1),
MLX5E_PFLAG_RX_CQE_COMPRESS = (1 << 2),
MLX5E_PFLAG_RX_STRIDING_RQ = (1 << 3),
+   MLX5E_PFLAG_RX_NO_CSUM_COMPLETE = (1 << 4),
 };
 
 #define MLX5E_SET_PFLAG(params, pflag, enable) \
@@ -290,6 +291,7 @@ struct mlx5e_dcbx_dp {
 enum {
MLX5E_RQ_STATE_ENABLED,
MLX5E_RQ_STATE_AM,
+   MLX5E_RQ_STATE_NO_CSUM_COMPLETE,
 };
 
 struct mlx5e_cq {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index 33dafd8638b1..c86fd770c463 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -140,6 +140,7 @@ static const char mlx5e_priv_flags[][ETH_GSTRING_LEN] = {
"tx_cqe_moder",
"rx_cqe_compress",
"rx_striding_rq",
+   "rx_no_csum_complete",
 };
 
 int mlx5e_ethtool_get_sset_count(struct mlx5e_priv *priv, int sset)
@@ -1531,6 +1532,27 @@ static int set_pflag_rx_striding_rq(struct net_device 
*netdev, bool enable)
return 0;
 }
 
+static int set_pflag_rx_no_csum_complete(struct net_device *netdev, bool 
enable)
+{
+   struct mlx5e_priv *priv = netdev_priv(netdev);
+   struct mlx5e_channels *channels = &priv->channels;
+   struct mlx5e_channel *c;
+   int i;
+
+   if (!test_bit(MLX5E_STATE_OPENED, &priv->state))
+   return 0;
+
+   for (i = 0; i < channels->num; i++) {
+   c = channels->c[i];
+   if (enable)
+   __set_bit(MLX5E_RQ_STATE_NO_CSUM_COMPLETE, &c->rq.state);
+   else
+   __clear_bit(MLX5E_RQ_STATE_NO_CSUM_COMPLETE, &c->rq.state);
+   }
+
+   return 0;
+}
+
 static int mlx5e_handle_pflag(struct net_device *netdev,
  u32 wanted_flags,
  enum mlx5e_priv_flag flag,
@@ -1582,6 +1604,12 @@ static int mlx5e_set_priv_flags(struct net_device 
*netdev, u32 pflags)
err = mlx5e_handle_pflag(netdev, pflags,
 MLX5E_PFLAG_RX_STRIDING_RQ,
 set_pflag_rx_striding_rq);
+   if (err)
+   goto out;
+
+   err = mlx5e_handle_pflag(netdev, pflags,
+MLX5E_PFLAG_RX_NO_CSUM_COMPLETE,
+set_pflag_rx_no_csum_complete);
 
 out:
mutex_unlock(&priv->state_lock);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 6086f874c7bf..35aca9a8e3d6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -929,6 +929,9 @@ static int mlx5e_open_rq(struct mlx5e_channel *c,
if (params->rx_dim_enabled)
__set_bit(MLX5E_RQ_STATE_AM, &c->rq.state);
 
+   if (params->pflags & MLX5E_PFLAG_RX_NO_CSUM_COMPLETE)
+   __set_bit(MLX5E_RQ_STATE_NO_CSUM_COMPLETE, &c->rq.state);
+
return 0;
 
 err_destroy_rq:
@@ -4528,6 +4531,7 @@ void mlx5e_build_nic_params(struct mlx5_core_dev *mdev,
params->rx_cqe_compress_def = slow_pci_heuristic(mdev);
 
MLX5E_SET_PFLAG(params, MLX5E_PFLAG_RX_CQE_COMPRESS, 
params->rx_cqe_compress_def);
+   MLX5E_SET_PFLAG(params, MLX5E_PFLAG_RX_NO_CSUM_COMPLETE, false);
 
/* RQ */
mlx5e_build_rq_params(mdev, params);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 5a43cbf9103f..f19067c94272 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -782,6 +782,9 @@ static inline void mlx5e_handle_csum(struct net_device 
*netdev,
return;
}
 
+   if (unlikely(test_bit(MLX5E_RQ_STATE_NO_CSUM_COMPLETE, &rq->state)))
+   goto csum_unnecessary;
+
if (likely(is_last_ethertype_ip(skb, &network_depth, &proto))) {
if 

[net-next 02/13] net/mlx5e: Change VF representors' RQ type

2018-10-01 Thread Saeed Mahameed
From: Gavi Teitz 

The representors' RQ size was not large enough for them to achieve
high performance, and therefore needed to be enlarged, while
keeping the hit to memory usage minimal. To achieve this the
representors' RQ size was increased, and its type was changed to be a
striding RQ if it is supported.

Towards that goal the following changes were made:

* Extracted the sequence for setting the standard netdev's RQ params
  into a function

* Replaced the sequence for setting the representor's RQ params with
  the standard sequence

The impact of this change can be seen in the following measurements
taken on a setup of a VM over a VF, connected to OVS via the VF
representor, to an external host:

Before current change:
 TCP Throughput [Gb/s]
VM to external host ~  7.2

With the current change (measured with a striding RQ):
 TCP Throughput [Gb/s]
VM to external host ~ 23.5

Each representor now consumes 2 [MB] of memory for its packet
buffers.

Signed-off-by: Gavi Teitz 
Reviewed-by: Or Gerlitz 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  2 ++
 .../net/ethernet/mellanox/mlx5/core/en_main.c | 30 +++
 .../net/ethernet/mellanox/mlx5/core/en_rep.c  | 11 ---
 3 files changed, 25 insertions(+), 18 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 01a967e717e7..b298456da8e7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -966,6 +966,8 @@ void mlx5e_destroy_netdev(struct mlx5e_priv *priv);
 void mlx5e_build_nic_params(struct mlx5_core_dev *mdev,
struct mlx5e_params *params,
u16 max_channels, u16 mtu);
+void mlx5e_build_rq_params(struct mlx5_core_dev *mdev,
+  struct mlx5e_params *params);
 u8 mlx5e_params_calculate_tx_min_inline(struct mlx5_core_dev *mdev);
 void mlx5e_rx_dim_work(struct work_struct *work);
 void mlx5e_tx_dim_work(struct work_struct *work);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 5955b4d844cc..46001855d0e9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -4480,6 +4480,23 @@ static u32 mlx5e_choose_lro_timeout(struct mlx5_core_dev 
*mdev, u32 wanted_timeo
return MLX5_CAP_ETH(mdev, lro_timer_supported_periods[i]);
 }
 
+void mlx5e_build_rq_params(struct mlx5_core_dev *mdev,
+  struct mlx5e_params *params)
+{
+   /* Prefer Striding RQ, unless any of the following holds:
+* - Striding RQ configuration is not possible/supported.
+* - Slow PCI heuristic.
+* - Legacy RQ would use linear SKB while Striding RQ would use 
non-linear.
+*/
+   if (!slow_pci_heuristic(mdev) &&
+   mlx5e_striding_rq_possible(mdev, params) &&
+   (mlx5e_rx_mpwqe_is_linear_skb(mdev, params) ||
+!mlx5e_rx_is_linear_skb(mdev, params)))
+   MLX5E_SET_PFLAG(params, MLX5E_PFLAG_RX_STRIDING_RQ, true);
+   mlx5e_set_rq_type(mdev, params);
+   mlx5e_init_rq_type_params(mdev, params);
+}
+
 void mlx5e_build_nic_params(struct mlx5_core_dev *mdev,
struct mlx5e_params *params,
u16 max_channels, u16 mtu)
@@ -4505,18 +4522,7 @@ void mlx5e_build_nic_params(struct mlx5_core_dev *mdev,
MLX5E_SET_PFLAG(params, MLX5E_PFLAG_RX_CQE_COMPRESS, 
params->rx_cqe_compress_def);
 
/* RQ */
-   /* Prefer Striding RQ, unless any of the following holds:
-* - Striding RQ configuration is not possible/supported.
-* - Slow PCI heuristic.
-* - Legacy RQ would use linear SKB while Striding RQ would use 
non-linear.
-*/
-   if (!slow_pci_heuristic(mdev) &&
-   mlx5e_striding_rq_possible(mdev, params) &&
-   (mlx5e_rx_mpwqe_is_linear_skb(mdev, params) ||
-!mlx5e_rx_is_linear_skb(mdev, params)))
-   MLX5E_SET_PFLAG(params, MLX5E_PFLAG_RX_STRIDING_RQ, true);
-   mlx5e_set_rq_type(mdev, params);
-   mlx5e_init_rq_type_params(mdev, params);
+   mlx5e_build_rq_params(mdev, params);
 
/* HW LRO */
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
index f6eead24931f..fc4433e93846 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
@@ -46,8 +46,6 @@
 
 #define MLX5E_REP_PARAMS_LOG_SQ_SIZE \
max(0x6, MLX5E_PARAMS_MINIMUM_LOG_SQ_SIZE)
-#define MLX5E_REP_PARAMS_LOG_RQ_SIZE \
-   max(0x6, MLX5E_PARAMS_MINIMUM_LOG_RQ_SIZE)
 
 static const char mlx5e_rep_driver_name[] = "mlx5e_rep";
 
@@ -934,14 +932,15 @@ static void mlx5e_build_rep_params(struct 

[net-next 09/13] net/mlx5e: Enable multi-queue and RSS for VF representors

2018-10-01 Thread Saeed Mahameed
From: Gavi Teitz 

Increased the number of channels the representors can open to be the
number of CPUs. The default number opened remains one.

Used the standard NIC netdev functions to:
* Set RSS params when building the representors' params.
* Setup an indirect TIR and RQT for the representors upon
  initialization.
* Create a TTC flow table for the representors' indirect TIR (when
  creating the TTC table, mlx5e_set_ttc_basic_params() is not called,
  in order to avoid setting the inner_ttc param, which is not needed).

Added ethtool control to the representors for setting and querying
the amount of open channels. Additionally, included logic in the
representors' ethtool set channels handler which controls a
representor's vport rx rule, so that if there is one open channel
the rx rule steers traffic to the representor's direct TIR, whereas
if there is more than one channel, the rx rule steers traffic to the
new TTC flow table.

Signed-off-by: Gavi Teitz 
Reviewed-by: Or Gerlitz 
Signed-off-by: Saeed Mahameed 
---
 .../net/ethernet/mellanox/mlx5/core/en_rep.c  | 140 --
 1 file changed, 129 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
index 7392c70910e8..be435e76d316 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
@@ -180,12 +180,90 @@ static int mlx5e_rep_get_sset_count(struct net_device 
*dev, int sset)
}
 }
 
+static int mlx5e_replace_rep_vport_rx_rule(struct mlx5e_priv *priv,
+  struct mlx5_flow_destination *dest)
+{
+   struct mlx5_eswitch *esw = priv->mdev->priv.eswitch;
+   struct mlx5e_rep_priv *rpriv = priv->ppriv;
+   struct mlx5_eswitch_rep *rep = rpriv->rep;
+   struct mlx5_flow_handle *flow_rule;
+
+   flow_rule = mlx5_eswitch_create_vport_rx_rule(esw,
+ rep->vport,
+ dest);
+   if (IS_ERR(flow_rule))
+   return PTR_ERR(flow_rule);
+
+   mlx5_del_flow_rules(rpriv->vport_rx_rule);
+   rpriv->vport_rx_rule = flow_rule;
+   return 0;
+}
+
+static void mlx5e_rep_get_channels(struct net_device *dev,
+  struct ethtool_channels *ch)
+{
+   struct mlx5e_priv *priv = netdev_priv(dev);
+
+   mlx5e_ethtool_get_channels(priv, ch);
+}
+
+static int mlx5e_rep_set_channels(struct net_device *dev,
+ struct ethtool_channels *ch)
+{
+   struct mlx5e_priv *priv = netdev_priv(dev);
+   u16 curr_channels_amount = priv->channels.params.num_channels;
+   u32 new_channels_amount = ch->combined_count;
+   struct mlx5_flow_destination new_dest;
+   int err = 0;
+
+   err = mlx5e_ethtool_set_channels(priv, ch);
+   if (err)
+   return err;
+
+   if (curr_channels_amount == 1 && new_channels_amount > 1) {
+   new_dest.type = MLX5_FLOW_DESTINATION_TYPE_FLOW_TABLE;
+   new_dest.ft = priv->fs.ttc.ft.t;
+   } else if (new_channels_amount == 1 && curr_channels_amount > 1) {
+   new_dest.type = MLX5_FLOW_DESTINATION_TYPE_TIR;
+   new_dest.tir_num = priv->direct_tir[0].tirn;
+   } else {
+   return 0;
+   }
+
+   err = mlx5e_replace_rep_vport_rx_rule(priv, &new_dest);
+   if (err) {
+   netdev_warn(priv->netdev, "Failed to update vport rx rule, when 
going from (%d) channels to (%d) channels\n",
+   curr_channels_amount, new_channels_amount);
+   return err;
+   }
+
+   return 0;
+}
+
+static u32 mlx5e_rep_get_rxfh_key_size(struct net_device *netdev)
+{
+   struct mlx5e_priv *priv = netdev_priv(netdev);
+
+   return mlx5e_ethtool_get_rxfh_key_size(priv);
+}
+
+static u32 mlx5e_rep_get_rxfh_indir_size(struct net_device *netdev)
+{
+   struct mlx5e_priv *priv = netdev_priv(netdev);
+
+   return mlx5e_ethtool_get_rxfh_indir_size(priv);
+}
+
 static const struct ethtool_ops mlx5e_rep_ethtool_ops = {
.get_drvinfo   = mlx5e_rep_get_drvinfo,
.get_link  = ethtool_op_get_link,
.get_strings   = mlx5e_rep_get_strings,
.get_sset_count= mlx5e_rep_get_sset_count,
.get_ethtool_stats = mlx5e_rep_get_ethtool_stats,
+   .get_channels  = mlx5e_rep_get_channels,
+   .set_channels  = mlx5e_rep_set_channels,
+   .get_rxfh_key_size   = mlx5e_rep_get_rxfh_key_size,
+   .get_rxfh_indir_size = mlx5e_rep_get_rxfh_indir_size,
 };
 
 int mlx5e_attr_get(struct net_device *dev, struct switchdev_attr *attr)
@@ -943,6 +1021,9 @@ static void mlx5e_build_rep_params(struct mlx5_core_dev 
*mdev,
params->num_tc= 1;
 
mlx5_query_min_inline(mdev, &params->tx_min_inline_mode);
+
+   /* RSS */
+   

[net-next 06/13] net/mlx5e: Provide explicit directive if to create inner indirect tirs

2018-10-01 Thread Saeed Mahameed
From: Or Gerlitz 

Change the driver functions that deal with creating indirect tirs
to get a flag telling if inner ttc is desired.

A pre-step for enabling rss on the vport representors, where
inner ttc is not needed.

Signed-off-by: Or Gerlitz 
Reviewed-by: Or Gerlitz 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h   |  4 ++--
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 14 +++---
 .../net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c  |  6 +++---
 3 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index b298456da8e7..275af3bd63b3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -906,8 +906,8 @@ void mlx5e_close_drop_rq(struct mlx5e_rq *drop_rq);
 
 int mlx5e_create_indirect_rqt(struct mlx5e_priv *priv);
 
-int mlx5e_create_indirect_tirs(struct mlx5e_priv *priv);
-void mlx5e_destroy_indirect_tirs(struct mlx5e_priv *priv);
+int mlx5e_create_indirect_tirs(struct mlx5e_priv *priv, bool inner_ttc);
+void mlx5e_destroy_indirect_tirs(struct mlx5e_priv *priv, bool inner_ttc);
 
 int mlx5e_create_direct_rqts(struct mlx5e_priv *priv);
 void mlx5e_destroy_direct_rqts(struct mlx5e_priv *priv);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 46001855d0e9..114f6226b17d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -3175,7 +3175,7 @@ static void mlx5e_build_direct_tir_ctx(struct mlx5e_priv 
*priv, u32 rqtn, u32 *t
MLX5_SET(tirc, tirc, rx_hash_fn, MLX5_RX_HASH_FN_INVERTED_XOR8);
 }
 
-int mlx5e_create_indirect_tirs(struct mlx5e_priv *priv)
+int mlx5e_create_indirect_tirs(struct mlx5e_priv *priv, bool inner_ttc)
 {
struct mlx5e_tir *tir;
void *tirc;
@@ -3202,7 +3202,7 @@ int mlx5e_create_indirect_tirs(struct mlx5e_priv *priv)
}
}
 
-   if (!mlx5e_tunnel_inner_ft_supported(priv->mdev))
+   if (!inner_ttc || !mlx5e_tunnel_inner_ft_supported(priv->mdev))
goto out;
 
for (i = 0; i < MLX5E_NUM_INDIR_TIRS; i++) {
@@ -3273,14 +3273,14 @@ int mlx5e_create_direct_tirs(struct mlx5e_priv *priv)
return err;
 }
 
-void mlx5e_destroy_indirect_tirs(struct mlx5e_priv *priv)
+void mlx5e_destroy_indirect_tirs(struct mlx5e_priv *priv, bool inner_ttc)
 {
int i;
 
for (i = 0; i < MLX5E_NUM_INDIR_TIRS; i++)
mlx5e_destroy_tir(priv->mdev, &priv->indir_tir[i]);
 
-   if (!mlx5e_tunnel_inner_ft_supported(priv->mdev))
+   if (!inner_ttc || !mlx5e_tunnel_inner_ft_supported(priv->mdev))
return;
 
for (i = 0; i < MLX5E_NUM_INDIR_TIRS; i++)
@@ -4786,7 +4786,7 @@ static int mlx5e_init_nic_rx(struct mlx5e_priv *priv)
if (err)
goto err_destroy_indirect_rqts;
 
-   err = mlx5e_create_indirect_tirs(priv);
+   err = mlx5e_create_indirect_tirs(priv, true);
if (err)
goto err_destroy_direct_rqts;
 
@@ -4811,7 +4811,7 @@ static int mlx5e_init_nic_rx(struct mlx5e_priv *priv)
 err_destroy_direct_tirs:
mlx5e_destroy_direct_tirs(priv);
 err_destroy_indirect_tirs:
-   mlx5e_destroy_indirect_tirs(priv);
+   mlx5e_destroy_indirect_tirs(priv, true);
 err_destroy_direct_rqts:
mlx5e_destroy_direct_rqts(priv);
 err_destroy_indirect_rqts:
@@ -4828,7 +4828,7 @@ static void mlx5e_cleanup_nic_rx(struct mlx5e_priv *priv)
mlx5e_tc_nic_cleanup(priv);
mlx5e_destroy_flow_steering(priv);
mlx5e_destroy_direct_tirs(priv);
-   mlx5e_destroy_indirect_tirs(priv);
+   mlx5e_destroy_indirect_tirs(priv, true);
mlx5e_destroy_direct_rqts(priv);
mlx5e_destroy_rqt(priv, &priv->indir_rqt);
mlx5e_close_drop_rq(&priv->drop_rq);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c 
b/drivers/net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c
index a825ed093efd..299e2a897f7e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c
@@ -368,7 +368,7 @@ static int mlx5i_init_rx(struct mlx5e_priv *priv)
if (err)
goto err_destroy_indirect_rqts;
 
-   err = mlx5e_create_indirect_tirs(priv);
+   err = mlx5e_create_indirect_tirs(priv, true);
if (err)
goto err_destroy_direct_rqts;
 
@@ -385,7 +385,7 @@ static int mlx5i_init_rx(struct mlx5e_priv *priv)
 err_destroy_direct_tirs:
mlx5e_destroy_direct_tirs(priv);
 err_destroy_indirect_tirs:
-   mlx5e_destroy_indirect_tirs(priv);
+   mlx5e_destroy_indirect_tirs(priv, true);
 err_destroy_direct_rqts:
mlx5e_destroy_direct_rqts(priv);
 err_destroy_indirect_rqts:
@@ -401,7 +401,7 @@ static void mlx5i_cleanup_rx(struct mlx5e_priv *priv)
 {

[net-next 11/13] net/mlx5e: Enable reporting checksum unnecessary also for L3 packets

2018-10-01 Thread Saeed Mahameed
From: Or Gerlitz 

We can report checksum unnecessary also when the L3 checksum
flag on the cqe is set and there's no L4 header.

Signed-off-by: Or Gerlitz 
Reviewed-by: Tariq Toukan 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 424bc89184c6..5a43cbf9103f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -805,7 +805,8 @@ static inline void mlx5e_handle_csum(struct net_device 
*netdev,
 
 csum_unnecessary:
if (likely((cqe->hds_ip_ext & CQE_L3_OK) &&
-  (cqe->hds_ip_ext & CQE_L4_OK))) {
+  ((cqe->hds_ip_ext & CQE_L4_OK) ||
+   (get_cqe_l4_hdr_type(cqe) == CQE_L4_HDR_TYPE_NONE)))) {
skb->ip_summed = CHECKSUM_UNNECESSARY;
if (cqe_is_tunneled(cqe)) {
skb->csum_level = 1;
-- 
2.17.1



[net-next 08/13] net/mlx5e: Expose ethtool rss key size / indirection table functions

2018-10-01 Thread Saeed Mahameed
From: Or Gerlitz 

Towards enabling RSS for the vport representors, expose the functions for
querying the rss hash key size and indirection table size via ethtool.

Signed-off-by: Or Gerlitz 
Reviewed-by: Or Gerlitz 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h |  2 ++
 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c | 16 ++--
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 98390a5b106a..b3bd79833517 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -951,6 +951,8 @@ int mlx5e_ethtool_get_coalesce(struct mlx5e_priv *priv,
   struct ethtool_coalesce *coal);
 int mlx5e_ethtool_set_coalesce(struct mlx5e_priv *priv,
   struct ethtool_coalesce *coal);
+u32 mlx5e_ethtool_get_rxfh_key_size(struct mlx5e_priv *priv);
+u32 mlx5e_ethtool_get_rxfh_indir_size(struct mlx5e_priv *priv);
 int mlx5e_ethtool_get_ts_info(struct mlx5e_priv *priv,
  struct ethtool_ts_info *info);
 int mlx5e_ethtool_flash_device(struct mlx5e_priv *priv,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index 8cd338ceb237..33dafd8638b1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -859,18 +859,30 @@ static int mlx5e_set_link_ksettings(struct net_device 
*netdev,
return err;
 }
 
+u32 mlx5e_ethtool_get_rxfh_key_size(struct mlx5e_priv *priv)
+{
+   return sizeof(priv->channels.params.toeplitz_hash_key);
+}
+
 static u32 mlx5e_get_rxfh_key_size(struct net_device *netdev)
 {
struct mlx5e_priv *priv = netdev_priv(netdev);
 
-   return sizeof(priv->channels.params.toeplitz_hash_key);
+   return mlx5e_ethtool_get_rxfh_key_size(priv);
 }
 
-static u32 mlx5e_get_rxfh_indir_size(struct net_device *netdev)
+u32 mlx5e_ethtool_get_rxfh_indir_size(struct mlx5e_priv *priv)
 {
return MLX5E_INDIR_RQT_SIZE;
 }
 
+static u32 mlx5e_get_rxfh_indir_size(struct net_device *netdev)
+{
+   struct mlx5e_priv *priv = netdev_priv(netdev);
+
+   return mlx5e_ethtool_get_rxfh_indir_size(priv);
+}
+
 static int mlx5e_get_rxfh(struct net_device *netdev, u32 *indir, u8 *key,
  u8 *hfunc)
 {
-- 
2.17.1



[net-next 07/13] net/mlx5e: Expose function for building RSS params

2018-10-01 Thread Saeed Mahameed
From: Gavi Teitz 

Towards enabling RSS for the vport representors, extract the
procedure for building a device's RSS params, and expose the
function.

Signed-off-by: Gavi Teitz 
Reviewed-by: Or Gerlitz 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  1 +
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 13 +
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 275af3bd63b3..98390a5b106a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -968,6 +968,7 @@ void mlx5e_build_nic_params(struct mlx5_core_dev *mdev,
u16 max_channels, u16 mtu);
 void mlx5e_build_rq_params(struct mlx5_core_dev *mdev,
   struct mlx5e_params *params);
+void mlx5e_build_rss_params(struct mlx5e_params *params);
 u8 mlx5e_params_calculate_tx_min_inline(struct mlx5_core_dev *mdev);
 void mlx5e_rx_dim_work(struct work_struct *work);
 void mlx5e_tx_dim_work(struct work_struct *work);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 114f6226b17d..6086f874c7bf 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -4497,6 +4497,14 @@ void mlx5e_build_rq_params(struct mlx5_core_dev *mdev,
mlx5e_init_rq_type_params(mdev, params);
 }
 
+void mlx5e_build_rss_params(struct mlx5e_params *params)
+{
+   params->rss_hfunc = ETH_RSS_HASH_XOR;
+   netdev_rss_key_fill(params->toeplitz_hash_key, 
sizeof(params->toeplitz_hash_key));
+   mlx5e_build_default_indir_rqt(params->indirection_rqt,
+ MLX5E_INDIR_RQT_SIZE, 
params->num_channels);
+}
+
 void mlx5e_build_nic_params(struct mlx5_core_dev *mdev,
struct mlx5e_params *params,
u16 max_channels, u16 mtu)
@@ -4545,10 +4553,7 @@ void mlx5e_build_nic_params(struct mlx5_core_dev *mdev,
params->tx_min_inline_mode = mlx5e_params_calculate_tx_min_inline(mdev);
 
/* RSS */
-   params->rss_hfunc = ETH_RSS_HASH_XOR;
-   netdev_rss_key_fill(params->toeplitz_hash_key, 
sizeof(params->toeplitz_hash_key));
-   mlx5e_build_default_indir_rqt(params->indirection_rqt,
- MLX5E_INDIR_RQT_SIZE, max_channels);
+   mlx5e_build_rss_params(params);
 }
 
 static void mlx5e_build_nic_netdev_priv(struct mlx5_core_dev *mdev,
-- 
2.17.1



[pull request][net-next 00/13] Mellanox, mlx5e updates 2018-10-01

2018-10-01 Thread Saeed Mahameed
Hi Dave,

The following pull request includes updates to mlx5e ethernet netdevice
driver, for more information please see tag log below.

Please pull and let me know if there's any problem.

Thanks,
Saeed.

---

The following changes since commit 804fe108fc92e591ddfe9447e7fb4691ed16daee:

  openvswitch: Use correct reply values in datapath and vport ops (2018-09-29 
11:44:11 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git 
tags/mlx5e-updates-2018-10-01

for you to fetch changes up to 59c9d35ea9cd73c3a55642ec9a0097770baccb93:

  net/mlx5: Cache the system image guid (2018-10-01 11:32:47 -0700)


mlx5e-updates-2018-10-01

This series includes updates to mlx5e ethernet netdevice driver:

From Or Gerlitz:
1) Support masks for l3/l4 filters in ethtool flow steering
2) Report checksum unnecessary also when the L3 checksum flag on the
   cqe is set and there's no L4 header
3) Allow reporting of checksum unnecessary, using an ethtool private flag.

From Gavi Teitz and Or: VF representor netdev performance improvements
4) Allow striding RQ in VF representor and bigger RQ size, ~3X performance 
improvement
5) Enable stateless offloads for VF representor, csum and TSO, 1.5X performance 
improvement
6) RSS Support for VF representors
   6.1) Allow flow table destination for VF representor steering rule.
   6.2) Create RSS flow table per representor netdev
   6.3) Expose mlx5e RSS ethtool to be used by representor netdevs
   6.4) Enable multi-queue and RSS for VF representors, using mlx5e's existing
infrastructure for managing multi-queue RX RSS tables.

From Alaa Hleihel:
7) Cache the system image guid. The system image guid is a read-only field;
   read it once and save it on the core device.


Alaa Hleihel (1):
  net/mlx5: Cache the system image guid

Gavi Teitz (7):
  net/mlx5e: Change VF representors' RQ type
  net/mlx5e: Enable stateless offloads for VF representor netdevs
  net/mlx5e: Extract creation of rep's default flow rule
  net/mlx5: E-Switch, Provide flow dest when creating vport rx rule
  net/mlx5e: Expose function for building RSS params
  net/mlx5e: Enable multi-queue and RSS for VF representors
  net/mlx5e: Add ethtool control of ring params to VF representors

Or Gerlitz (5):
  net/mlx5e: Ethtool steering, Support masks for l3/l4 filters
  net/mlx5e: Provide explicit directive if to create inner indirect tirs
  net/mlx5e: Expose ethtool rss key size / indirection table functions
  net/mlx5e: Enable reporting checksum unnecessary also for L3 packets
  net/mlx5e: Allow reporting of checksum unnecessary

 drivers/net/ethernet/mellanox/mlx5/core/en.h   |  11 +-
 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   |  44 -
 .../ethernet/mellanox/mlx5/core/en_fs_ethtool.c|  56 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |  61 +++---
 drivers/net/ethernet/mellanox/mlx5/core/en_rep.c   | 205 ++---
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c|   6 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c|   4 +-
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.h  |   3 +-
 .../ethernet/mellanox/mlx5/core/eswitch_offloads.c |   8 +-
 .../net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c  |   6 +-
 drivers/net/ethernet/mellanox/mlx5/core/vport.c|   9 +
 include/linux/mlx5/driver.h|   1 +
 include/linux/mlx5/vport.h |   2 +
 13 files changed, 312 insertions(+), 104 deletions(-)


[net-next 05/13] net/mlx5: E-Switch, Provide flow dest when creating vport rx rule

2018-10-01 Thread Saeed Mahameed
From: Gavi Teitz 

Currently the destination for the representor e-switch rx rule is
a TIR number. Towards changing that to potentially be a flow table,
as part of enabling RSS for representors, modify the signature of
the related e-switch API to get a flow destination.

Signed-off-by: Gavi Teitz 
Reviewed-by: Or Gerlitz 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_rep.c  | 5 -
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.h | 3 ++-
 .../net/ethernet/mellanox/mlx5/core/eswitch_offloads.c| 8 +++-
 3 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
index 84946870d164..7392c70910e8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
@@ -1009,10 +1009,13 @@ static int mlx5e_create_rep_vport_rx_rule(struct 
mlx5e_priv *priv)
struct mlx5e_rep_priv *rpriv = priv->ppriv;
struct mlx5_eswitch_rep *rep = rpriv->rep;
struct mlx5_flow_handle *flow_rule;
+   struct mlx5_flow_destination dest;
 
+   dest.type = MLX5_FLOW_DESTINATION_TYPE_TIR;
+   dest.tir_num = priv->direct_tir[0].tirn;
flow_rule = mlx5_eswitch_create_vport_rx_rule(esw,
  rep->vport,
- priv->direct_tir[0].tirn);
+ &dest);
if (IS_ERR(flow_rule))
return PTR_ERR(flow_rule);
rpriv->vport_rx_rule = flow_rule;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h 
b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index c17bfcab517c..0b05bf2b91f6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -230,7 +230,8 @@ mlx5_eswitch_del_offloaded_rule(struct mlx5_eswitch *esw,
struct mlx5_esw_flow_attr *attr);
 
 struct mlx5_flow_handle *
-mlx5_eswitch_create_vport_rx_rule(struct mlx5_eswitch *esw, int vport, u32 
tirn);
+mlx5_eswitch_create_vport_rx_rule(struct mlx5_eswitch *esw, int vport,
+ struct mlx5_flow_destination *dest);
 
 enum {
SET_VLAN_STRIP  = BIT(0),
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c 
b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 3028e8d90920..21e957083f65 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -775,10 +775,10 @@ static void esw_destroy_vport_rx_group(struct 
mlx5_eswitch *esw)
 }
 
 struct mlx5_flow_handle *
-mlx5_eswitch_create_vport_rx_rule(struct mlx5_eswitch *esw, int vport, u32 
tirn)
+mlx5_eswitch_create_vport_rx_rule(struct mlx5_eswitch *esw, int vport,
+ struct mlx5_flow_destination *dest)
 {
struct mlx5_flow_act flow_act = {0};
-   struct mlx5_flow_destination dest = {};
struct mlx5_flow_handle *flow_rule;
struct mlx5_flow_spec *spec;
void *misc;
@@ -796,12 +796,10 @@ mlx5_eswitch_create_vport_rx_rule(struct mlx5_eswitch 
*esw, int vport, u32 tirn)
MLX5_SET_TO_ONES(fte_match_set_misc, misc, source_port);
 
spec->match_criteria_enable = MLX5_MATCH_MISC_PARAMETERS;
-   dest.type = MLX5_FLOW_DESTINATION_TYPE_TIR;
-   dest.tir_num = tirn;
 
flow_act.action = MLX5_FLOW_CONTEXT_ACTION_FWD_DEST;
flow_rule = mlx5_add_flow_rules(esw->offloads.ft_offloads, spec,
-   &flow_act, &dest, 1);
+   &flow_act, dest, 1);
if (IS_ERR(flow_rule)) {
esw_warn(esw->dev, "fs offloads: Failed to add vport rx rule 
err %ld\n", PTR_ERR(flow_rule));
goto out;
-- 
2.17.1



[net-next 04/13] net/mlx5e: Extract creation of rep's default flow rule

2018-10-01 Thread Saeed Mahameed
From: Gavi Teitz 

Cleaning up the flow of the representors' rx initialization, towards
enabling RSS for the representors.

Signed-off-by: Gavi Teitz 
Reviewed-by: Or Gerlitz 
Signed-off-by: Saeed Mahameed 
---
 .../net/ethernet/mellanox/mlx5/core/en_rep.c  | 25 ---
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
index 08ba2063e8f6..84946870d164 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
@@ -1003,13 +1003,25 @@ static void mlx5e_init_rep(struct mlx5_core_dev *mdev,
mlx5e_timestamp_init(priv);
 }
 
-static int mlx5e_init_rep_rx(struct mlx5e_priv *priv)
+static int mlx5e_create_rep_vport_rx_rule(struct mlx5e_priv *priv)
 {
struct mlx5_eswitch *esw = priv->mdev->priv.eswitch;
struct mlx5e_rep_priv *rpriv = priv->ppriv;
struct mlx5_eswitch_rep *rep = rpriv->rep;
-   struct mlx5_core_dev *mdev = priv->mdev;
struct mlx5_flow_handle *flow_rule;
+
+   flow_rule = mlx5_eswitch_create_vport_rx_rule(esw,
+ rep->vport,
+ priv->direct_tir[0].tirn);
+   if (IS_ERR(flow_rule))
+   return PTR_ERR(flow_rule);
+   rpriv->vport_rx_rule = flow_rule;
+   return 0;
+}
+
+static int mlx5e_init_rep_rx(struct mlx5e_priv *priv)
+{
+   struct mlx5_core_dev *mdev = priv->mdev;
int err;
 
mlx5e_init_l2_addr(priv);
@@ -1028,14 +1040,9 @@ static int mlx5e_init_rep_rx(struct mlx5e_priv *priv)
if (err)
goto err_destroy_direct_rqts;
 
-   flow_rule = mlx5_eswitch_create_vport_rx_rule(esw,
- rep->vport,
- priv->direct_tir[0].tirn);
-   if (IS_ERR(flow_rule)) {
-   err = PTR_ERR(flow_rule);
+   err = mlx5e_create_rep_vport_rx_rule(priv);
+   if (err)
goto err_destroy_direct_tirs;
-   }
-   rpriv->vport_rx_rule = flow_rule;
 
return 0;
 
-- 
2.17.1



[net-next 01/13] net/mlx5e: Ethtool steering, Support masks for l3/l4 filters

2018-10-01 Thread Saeed Mahameed
From: Or Gerlitz 

Allow using partial masks for L3 addresses and L4 ports across the board.

Signed-off-by: Or Gerlitz 
Signed-off-by: Saeed Mahameed 
---
 .../mellanox/mlx5/core/en_fs_ethtool.c| 56 ++-
 1 file changed, 16 insertions(+), 40 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_fs_ethtool.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_fs_ethtool.c
index 41cde926cdab..c18dcebe1462 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_fs_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_fs_ethtool.c
@@ -131,14 +131,14 @@ set_ip4(void *headers_c, void *headers_v, __be32 ip4src_m,
if (ip4src_m) {
memcpy(MLX5E_FTE_ADDR_OF(headers_v, src_ipv4_src_ipv6.ipv4_layout.ipv4),
   &ip4src_v, sizeof(ip4src_v));
-   memset(MLX5E_FTE_ADDR_OF(headers_c, src_ipv4_src_ipv6.ipv4_layout.ipv4),
-  0xff, sizeof(ip4src_m));
+   memcpy(MLX5E_FTE_ADDR_OF(headers_c, src_ipv4_src_ipv6.ipv4_layout.ipv4),
+  &ip4src_m, sizeof(ip4src_m));
}
if (ip4dst_m) {
memcpy(MLX5E_FTE_ADDR_OF(headers_v, dst_ipv4_dst_ipv6.ipv4_layout.ipv4),
   &ip4dst_v, sizeof(ip4dst_v));
-   memset(MLX5E_FTE_ADDR_OF(headers_c, dst_ipv4_dst_ipv6.ipv4_layout.ipv4),
-  0xff, sizeof(ip4dst_m));
+   memcpy(MLX5E_FTE_ADDR_OF(headers_c, dst_ipv4_dst_ipv6.ipv4_layout.ipv4),
+  &ip4dst_m, sizeof(ip4dst_m));
}
 
MLX5E_FTE_SET(headers_c, ethertype, 0xffff);
@@ -173,11 +173,11 @@ set_tcp(void *headers_c, void *headers_v, __be16 psrc_m, 
__be16 psrc_v,
__be16 pdst_m, __be16 pdst_v)
 {
if (psrc_m) {
-   MLX5E_FTE_SET(headers_c, tcp_sport, 0xffff);
+   MLX5E_FTE_SET(headers_c, tcp_sport, ntohs(psrc_m));
MLX5E_FTE_SET(headers_v, tcp_sport, ntohs(psrc_v));
}
if (pdst_m) {
-   MLX5E_FTE_SET(headers_c, tcp_dport, 0xffff);
+   MLX5E_FTE_SET(headers_c, tcp_dport, ntohs(pdst_m));
MLX5E_FTE_SET(headers_v, tcp_dport, ntohs(pdst_v));
}
 
@@ -190,12 +190,12 @@ set_udp(void *headers_c, void *headers_v, __be16 psrc_m, 
__be16 psrc_v,
__be16 pdst_m, __be16 pdst_v)
 {
if (psrc_m) {
-   MLX5E_FTE_SET(headers_c, udp_sport, 0xffff);
+   MLX5E_FTE_SET(headers_c, udp_sport, ntohs(psrc_m));
MLX5E_FTE_SET(headers_v, udp_sport, ntohs(psrc_v));
}
 
if (pdst_m) {
-   MLX5E_FTE_SET(headers_c, udp_dport, 0xffff);
+   MLX5E_FTE_SET(headers_c, udp_dport, ntohs(pdst_m));
MLX5E_FTE_SET(headers_v, udp_dport, ntohs(pdst_v));
}
 
@@ -508,26 +508,14 @@ static int validate_tcpudp4(struct ethtool_rx_flow_spec 
*fs)
if (l4_mask->tos)
return -EINVAL;
 
-   if (l4_mask->ip4src) {
-   if (!all_ones(l4_mask->ip4src))
-   return -EINVAL;
+   if (l4_mask->ip4src)
ntuples++;
-   }
-   if (l4_mask->ip4dst) {
-   if (!all_ones(l4_mask->ip4dst))
-   return -EINVAL;
+   if (l4_mask->ip4dst)
ntuples++;
-   }
-   if (l4_mask->psrc) {
-   if (!all_ones(l4_mask->psrc))
-   return -EINVAL;
+   if (l4_mask->psrc)
ntuples++;
-   }
-   if (l4_mask->pdst) {
-   if (!all_ones(l4_mask->pdst))
-   return -EINVAL;
+   if (l4_mask->pdst)
ntuples++;
-   }
/* Flow is TCP/UDP */
return ++ntuples;
 }
@@ -540,16 +528,10 @@ static int validate_ip4(struct ethtool_rx_flow_spec *fs)
if (l3_mask->l4_4_bytes || l3_mask->tos ||
fs->h_u.usr_ip4_spec.ip_ver != ETH_RX_NFC_IP4)
return -EINVAL;
-   if (l3_mask->ip4src) {
-   if (!all_ones(l3_mask->ip4src))
-   return -EINVAL;
+   if (l3_mask->ip4src)
ntuples++;
-   }
-   if (l3_mask->ip4dst) {
-   if (!all_ones(l3_mask->ip4dst))
-   return -EINVAL;
+   if (l3_mask->ip4dst)
ntuples++;
-   }
if (l3_mask->proto)
ntuples++;
/* Flow is IPv4 */
@@ -588,16 +570,10 @@ static int validate_tcpudp6(struct ethtool_rx_flow_spec 
*fs)
if (!ipv6_addr_any((struct in6_addr *)l4_mask->ip6dst))
ntuples++;
 
-   if (l4_mask->psrc) {
-   if (!all_ones(l4_mask->psrc))
-   return -EINVAL;
+   if (l4_mask->psrc)
ntuples++;
-   }
-   if (l4_mask->pdst) {
-   if (!all_ones(l4_mask->pdst))
-   return -EINVAL;
+   if (l4_mask->pdst)
ntuples++;
-   }
/* Flow is TCP/UDP */
return ++ntuples;
 }
-- 
2.17.1

[net-next 03/13] net/mlx5e: Enable stateless offloads for VF representor netdevs

2018-10-01 Thread Saeed Mahameed
From: Gavi Teitz 

Enabled checksum and TSO offloads for the representors, in
order to increase their performance, which is required to
increase the performance of flows that cannot be offloaded.

Checksum offloads contribute to a general acceleration of all
traffic (to around 150%), whereas the TSO offload contributes
to a prominent acceleration of the representor's TX for traffic
flows with larger than MTU sized packets (to around 200%). This
is the usual case for TCP streams, as the PF, which serves as
the uplink representor, and the VF representors employ GRO before
forwarding the packets to the representor.

GRO was enabled implicitly for the representors beforehand, and
is explicitly enabled here to ensure that the representors preserve
the performance boost it provides (of around 200%) when working in
tandem with the TSO offload by the forwardee, which is the standard
case as both the PF and the VF representors employ HW TSO.

The impact of these changes can be seen in the following
measurements taken on a setup of a VM over a VF, connected
to OVS via the VF representor, to an external host:

Before current changes:
 TCP Throughput [Gb/s]
External host to VM ~ 10.5
VM to external host ~ 23.5

With just checksum offloads enabled:
 TCP Throughput [Gb/s]
External host to VM ~ 14.9
VM to external host ~ 28.5

With the TSO offload also enabled:
 TCP Throughput [Gb/s]
External host to VM ~ 30.5

Signed-off-by: Gavi Teitz 
Reviewed-by: Or Gerlitz 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_rep.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
index fc4433e93846..08ba2063e8f6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
@@ -962,6 +962,16 @@ static void mlx5e_build_rep_netdev(struct net_device 
*netdev)
netdev->features |= NETIF_F_VLAN_CHALLENGED | NETIF_F_HW_TC | 
NETIF_F_NETNS_LOCAL;
netdev->hw_features  |= NETIF_F_HW_TC;
 
+   netdev->hw_features|= NETIF_F_SG;
+   netdev->hw_features|= NETIF_F_IP_CSUM;
+   netdev->hw_features|= NETIF_F_IPV6_CSUM;
+   netdev->hw_features|= NETIF_F_GRO;
+   netdev->hw_features|= NETIF_F_TSO;
+   netdev->hw_features|= NETIF_F_TSO6;
+   netdev->hw_features|= NETIF_F_RXCSUM;
+
+   netdev->features |= netdev->hw_features;
+
eth_hw_addr_random(netdev);
 
netdev->min_mtu = ETH_MIN_MTU;
-- 
2.17.1



Re: [bpf-next PATCH 1/3] net: fix generic XDP to handle if eth header was mangled

2018-10-01 Thread Daniel Borkmann
[ ping to Jesper wrt feedback ]

On 09/26/2018 07:36 AM, Song Liu wrote:
> On Tue, Sep 25, 2018 at 7:26 AM Jesper Dangaard Brouer
>  wrote:
>>
>> XDP can modify (and resize) the Ethernet header in the packet.
>>
>> There is a bug in generic-XDP, because skb->protocol and skb->pkt_type
>> are setup before reaching (netif_receive_)generic_xdp.
>>
>> This bug was hit when XDP were popping VLAN headers (changing
>> eth->h_proto), as skb->protocol still contains VLAN-indication
>> (ETH_P_8021Q) causing invocation of skb_vlan_untag(skb), which corrupt
>> the packet (basically popping the VLAN again).
>>
>> This patch catch if XDP changed eth header in such a way, that SKB
>> fields needs to be updated.
>>
>> Fixes: d445516966dc ("net: xdp: support xdp generic on virtual devices")
>> Signed-off-by: Jesper Dangaard Brouer 
>> ---
>>  net/core/dev.c |   14 ++
>>  1 file changed, 14 insertions(+)
>>
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index ca78dc5a79a3..db6d89f536cb 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -4258,6 +4258,9 @@ static u32 netif_receive_generic_xdp(struct sk_buff 
>> *skb,
>> struct netdev_rx_queue *rxqueue;
>> void *orig_data, *orig_data_end;
>> u32 metalen, act = XDP_DROP;
>> +   __be16 orig_eth_type;
>> +   struct ethhdr *eth;
>> +   bool orig_bcast;
>> int hlen, off;
>> u32 mac_len;
>>
>> @@ -4298,6 +4301,9 @@ static u32 netif_receive_generic_xdp(struct sk_buff 
>> *skb,
>> xdp->data_hard_start = skb->data - skb_headroom(skb);
>> orig_data_end = xdp->data_end;
>> orig_data = xdp->data;
>> +   eth = (struct ethhdr *)xdp->data;
>> +   orig_bcast = is_multicast_ether_addr_64bits(eth->h_dest);
>> +   orig_eth_type = eth->h_proto;
>>
>> rxqueue = netif_get_rxqueue(skb);
>> xdp->rxq = &rxqueue->xdp_rxq;
>> @@ -4321,6 +4327,14 @@ static u32 netif_receive_generic_xdp(struct sk_buff 
>> *skb,
>>
>> }
>>
>> +   /* check if XDP changed eth hdr such SKB needs update */
>> +   eth = (struct ethhdr *)xdp->data;
>> +   if ((orig_eth_type != eth->h_proto) ||
>> +   (orig_bcast != is_multicast_ether_addr_64bits(eth->h_dest))) {
> 
> Is the actions below always correct for the condition above? Do we need
> to confirm the SKB is updated properly?
> 
>> +   __skb_push(skb, mac_len);
>> +   skb->protocol = eth_type_trans(skb, skb->dev);
>> +   }
>> +
>> switch (act) {
>> case XDP_REDIRECT:
>> case XDP_TX:
>>



Re: [PATCH 0/3] bpf: allow zero-initialising hash map seed

2018-10-01 Thread Daniel Borkmann
On 10/01/2018 12:45 PM, Lorenz Bauer wrote:
> This patch set adds a new flag BPF_F_ZERO_SEED, which allows
> forcing the seed used by hash maps to zero. This makes
> it possible to write deterministic tests.
> 
> Based on an off-list conversation with Alexei Starovoitov and
> Daniel Borkmann.
> 
> Lorenz Bauer (3):
>   bpf: allow zero-initializing hash map seed
>   tools: sync linux/bpf.h
>   tools: add selftest for BPF_F_ZERO_SEED
> 
>  include/uapi/linux/bpf.h|  2 +
>  kernel/bpf/hashtab.c|  8 ++-
>  tools/include/uapi/linux/bpf.h  |  2 +
>  tools/testing/selftests/bpf/test_maps.c | 67 +
>  4 files changed, 66 insertions(+), 13 deletions(-)
> 

Please respin with proper SoB for each patch and non-empty commit
description. I think patch 1 should also have a more elaborate
commit description on the use case for BPF_F_ZERO_SEED, and I
think also a better comment in the uapi header that this is only
meant for testing and not production use.

Thanks,
Daniel


Re: [RFC PATCH ethtool] ethtool: better syntax for combinations of FEC modes

2018-10-01 Thread John W. Linville
Is this patch still RFC?

On Wed, Sep 19, 2018 at 05:06:25PM +0100, Edward Cree wrote:
> Instead of commas, just have them as separate argvs.
> 
> The parsing state machine might look heavyweight, but it makes it easy to add
>  more parameters later and distinguish parameter names from encoding names.
> 
> Suggested-by: Michal Kubecek 
> Signed-off-by: Edward Cree 
> ---
>  ethtool.8.in   |  6 +++---
>  ethtool.c  | 63 
> --
>  test-cmdline.c | 10 +-
>  3 files changed, 25 insertions(+), 54 deletions(-)
> 
> diff --git a/ethtool.8.in b/ethtool.8.in
> index 414eaa1..7ea2cc0 100644
> --- a/ethtool.8.in
> +++ b/ethtool.8.in
> @@ -390,7 +390,7 @@ ethtool \- query or control network driver and hardware 
> settings
>  .B ethtool \-\-set\-fec
>  .I devname
>  .B encoding
> -.BR auto | off | rs | baser [ , ...]
> +.BR auto | off | rs | baser \ [...]
>  .
>  .\" Adjust lines (i.e. full justification) and hyphenate.
>  .ad
> @@ -1120,11 +1120,11 @@ current FEC mode, the driver or firmware must take 
> the link down
>  administratively and report the problem in the system logs for users to 
> correct.
>  .RS 4
>  .TP
> -.BR encoding\ auto | off | rs | baser [ , ...]
> +.BR encoding\ auto | off | rs | baser \ [...]
>  
>  Sets the FEC encoding for the device.  Combinations of options are specified 
> as
>  e.g.
> -.B auto,rs
> +.B encoding auto rs
>  ; the semantics of such combinations vary between drivers.
>  .TS
>  nokeep;
> diff --git a/ethtool.c b/ethtool.c
> index 9997930..2f7e96b 100644
> --- a/ethtool.c
> +++ b/ethtool.c
> @@ -4979,39 +4979,6 @@ static int fecmode_str_to_type(const char *str)
>   return 0;
>  }
>  
> -/* Takes a comma-separated list of FEC modes, returns the bitwise OR of their
> - * corresponding ETHTOOL_FEC_* constants.
> - * Accepts repetitions (e.g. 'auto,auto') and trailing comma (e.g. 'off,').
> - */
> -static int parse_fecmode(const char *str)
> -{
> - int fecmode = 0;
> - char buf[6];
> -
> - if (!str)
> - return 0;
> - while (*str) {
> - size_t next;
> - int mode;
> -
> - next = strcspn(str, ",");
> - if (next >= 6) /* Bad mode, longest name is 5 chars */
> - return 0;
> - /* Copy into temp buffer and nul-terminate */
> - memcpy(buf, str, next);
> - buf[next] = 0;
> - mode = fecmode_str_to_type(buf);
> - if (!mode) /* Bad mode encountered */
> - return 0;
> - fecmode |= mode;
> - str += next;
> - /* Skip over ',' (but not nul) */
> - if (*str)
> - str++;
> - }
> - return fecmode;
> -}
> -
>  static int do_gfec(struct cmd_context *ctx)
>  {
>   struct ethtool_fecparam feccmd = { 0 };
> @@ -5041,22 +5008,26 @@ static int do_gfec(struct cmd_context *ctx)
>  
>  static int do_sfec(struct cmd_context *ctx)
>  {
> - char *fecmode_str = NULL;
> + enum { ARG_NONE, ARG_ENCODING } state = ARG_NONE;
>   struct ethtool_fecparam feccmd;
> - struct cmdline_info cmdline_fec[] = {
> - { "encoding", CMDL_STR, &fecmode_str, &changed },
> - };
> - int changed;
> - int fecmode;
> - int rv;
> + int fecmode = 0, newmode;
> + int rv, i;
>  
> - parse_generic_cmdline(ctx, &changed, cmdline_fec,
> -   ARRAY_SIZE(cmdline_fec));
> -
> - if (!fecmode_str)
> + for (i = 0; i < ctx->argc; i++) {
> + if (!strcmp(ctx->argp[i], "encoding")) {
> + state = ARG_ENCODING;
> + continue;
> + }
> + if (state == ARG_ENCODING) {
> + newmode = fecmode_str_to_type(ctx->argp[i]);
> + if (!newmode)
> + exit_bad_args();
> + fecmode |= newmode;
> + continue;
> + }
>   exit_bad_args();
> + }
>  
> - fecmode = parse_fecmode(fecmode_str);
>   if (!fecmode)
>   exit_bad_args();
>  
> @@ -5265,7 +5236,7 @@ static const struct option {
> " [ all ]\n"},
>   { "--show-fec", 1, do_gfec, "Show FEC settings"},
>   { "--set-fec", 1, do_sfec, "Set FEC settings",
> -   " [ encoding auto|off|rs|baser ]\n"},
> +   " [ encoding auto|off|rs|baser [...]]\n"},
>   { "-h|--help", 0, show_usage, "Show this help" },
>   { "--version", 0, do_version, "Show version number" },
>   {}
> diff --git a/test-cmdline.c b/test-cmdline.c
> index 9c51cca..84630a5 100644
> --- a/test-cmdline.c
> +++ b/test-cmdline.c
> @@ -268,12 +268,12 @@ static struct test_case {
>   { 1, "--set-eee devname advertise foo" },
>   { 1, "--set-fec devname" },
>   { 0, "--set-fec devname encoding auto" },
> - { 0, "--set-fec devname encoding off," },
> - { 0, "--set-fec devname 

Re: [PATCH net-next] tls: Add support for inplace records encryption

2018-10-01 Thread Dave Watson
On 09/30/18 08:04 AM, Vakul Garg wrote:
> Presently, for non-zero copy case, separate pages are allocated for
> storing plaintext and encrypted text of records. These pages are stored
> in sg_plaintext_data and sg_encrypted_data scatterlists inside record
> structure. Further, sg_plaintext_data & sg_encrypted_data are passed
> to cryptoapis for record encryption. Allocating separate pages for
> plaintext and encrypted text is inefficient from both required memory
> and performance point of view.
> 
> This patch adds support of inplace encryption of records. For non-zero
> copy case, we reuse the pages from sg_encrypted_data scatterlist to
> copy the application's plaintext data. For the movement of pages from
> sg_encrypted_data to sg_plaintext_data scatterlists, we introduce a new
> function move_to_plaintext_sg(). This function add pages into
> sg_plaintext_data from sg_encrypted_data scatterlists.
> 
> tls_do_encryption() is modified to pass the same scatterlist as both
> source and destination into aead_request_set_crypt() if inplace crypto
> has been enabled. A new variable 'inplace_crypto' has been introduced in
> record structure to signify whether the same scatterlist can be used.
> By default, the inplace_crypto is enabled in get_rec(). If zero-copy is
> used (i.e. plaintext data is not copied), inplace_crypto is set to '0'.
> 
> Signed-off-by: Vakul Garg 

Looks reasonable to me, thanks.

Reviewed-by: Dave Watson 


Re: [PATCH net-next v1 0/1] net/sched: Introduce the taprio scheduler

2018-10-01 Thread Vinicius Costa Gomes
Hi,

Just a small correction, one link on the cover letter is wrong.

Vinicius Costa Gomes  writes:

[...]

>
>
> [1] https://patchwork.ozlabs.org/cover/938991/
>
> [2] https://patchwork.ozlabs.org/cover/808504/
>
> [3] github doesn't make it clear, but the gist can be cloned like this:
> $ git clone https://gist.github.com/jeez/bd3afeff081ba64a695008dd8215866f 
> taprio-test
>
> [4] https://github.com/vcgomes/linux/tree/taprio-v1

The correct link is:

[4] https://github.com/vcgomes/net-next

>
> [5] https://github.com/vcgomes/iproute2/tree/taprio-v1
>
>
> Vinicius Costa Gomes (1):
>   tc: Add support for configuring the taprio scheduler
>
>  include/uapi/linux/pkt_sched.h |  46 ++
>  net/sched/Kconfig  |  11 +
>  net/sched/Makefile |   1 +
>  net/sched/sch_taprio.c | 962 +
>  4 files changed, 1020 insertions(+)
>  create mode 100644 net/sched/sch_taprio.c
>
> -- 
> 2.19.0


Cheers,
--
Vinicius


[PATCH net] inet: frags: rework rhashtable dismantle

2018-10-01 Thread Eric Dumazet
syzbot found an interesting use-after-free [1] happening
while IPv4 fragment rhashtable was destroyed at netns dismantle.

While no insertions can possibly happen at the time a dismantling
netns is destroying this rhashtable, timers can still fire and
attempt to remove elements from this rhashtable.

This is forbidden, since rhashtable_free_and_destroy() has
no synchronization against concurrent inserts and deletes.

It seems we need to clean the rhashtable before destroying it.

[1]
BUG: KASAN: use-after-free in __read_once_size include/linux/compiler.h:188 
[inline]
BUG: KASAN: use-after-free in rhashtable_last_table+0x216/0x240 
lib/rhashtable.c:217
Read of size 8 at addr 88019a4c8840 by task kworker/0:4/8279

CPU: 0 PID: 8279 Comm: kworker/0:4 Not tainted 4.19.0-rc5+ #61
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
Workqueue: events rht_deferred_worker
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1c4/0x2b4 lib/dump_stack.c:113
 print_address_description.cold.8+0x9/0x1ff mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report.cold.9+0x242/0x309 mm/kasan/report.c:412
 __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433
 __read_once_size include/linux/compiler.h:188 [inline]
 rhashtable_last_table+0x216/0x240 lib/rhashtable.c:217
 rht_deferred_worker+0x157/0x1de0 lib/rhashtable.c:410
 process_one_work+0xc90/0x1b90 kernel/workqueue.c:2153
 worker_thread+0x17f/0x1390 kernel/workqueue.c:2296
 kthread+0x35a/0x420 kernel/kthread.c:246
 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:413

Allocated by task 5:
 save_stack+0x43/0xd0 mm/kasan/kasan.c:448
 set_track mm/kasan/kasan.c:460 [inline]
 kasan_kmalloc+0xc7/0xe0 mm/kasan/kasan.c:553
 __do_kmalloc_node mm/slab.c:3682 [inline]
 __kmalloc_node+0x47/0x70 mm/slab.c:3689
 kmalloc_node include/linux/slab.h:555 [inline]
 kvmalloc_node+0xb9/0xf0 mm/util.c:423
 kvmalloc include/linux/mm.h:577 [inline]
 kvzalloc include/linux/mm.h:585 [inline]
 bucket_table_alloc+0x9a/0x4e0 lib/rhashtable.c:176
 rhashtable_rehash_alloc+0x73/0x100 lib/rhashtable.c:353
 rht_deferred_worker+0x278/0x1de0 lib/rhashtable.c:413
 process_one_work+0xc90/0x1b90 kernel/workqueue.c:2153
 worker_thread+0x17f/0x1390 kernel/workqueue.c:2296
 kthread+0x35a/0x420 kernel/kthread.c:246
 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:413

Freed by task 8283:
 save_stack+0x43/0xd0 mm/kasan/kasan.c:448
 set_track mm/kasan/kasan.c:460 [inline]
 __kasan_slab_free+0x102/0x150 mm/kasan/kasan.c:521
 kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
 __cache_free mm/slab.c:3498 [inline]
 kfree+0xcf/0x230 mm/slab.c:3813
 kvfree+0x61/0x70 mm/util.c:452
 bucket_table_free+0xda/0x250 lib/rhashtable.c:108
 rhashtable_free_and_destroy+0x152/0x900 lib/rhashtable.c:1163
 inet_frags_exit_net+0x3d/0x50 net/ipv4/inet_fragment.c:96
 ipv4_frags_exit_net+0x73/0x90 net/ipv4/ip_fragment.c:914
 ops_exit_list.isra.7+0xb0/0x160 net/core/net_namespace.c:153
 cleanup_net+0x555/0xb10 net/core/net_namespace.c:551
 process_one_work+0xc90/0x1b90 kernel/workqueue.c:2153
 worker_thread+0x17f/0x1390 kernel/workqueue.c:2296
 kthread+0x35a/0x420 kernel/kthread.c:246
 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:413

The buggy address belongs to the object at 88019a4c8800
 which belongs to the cache kmalloc-16384 of size 16384
The buggy address is located 64 bytes inside of
 16384-byte region [88019a4c8800, 88019a4cc800)
The buggy address belongs to the page:
page:ea0006693200 count:1 mapcount:0 mapping:8801da802200 index:0x0 
compound_mapcount: 0
flags: 0x2fffc008100(slab|head)
raw: 02fffc008100 ea0006685608 ea0006617c08 8801da802200
raw:  88019a4c8800 00010001 
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 88019a4c8700: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 88019a4c8780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>88019a4c8800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   ^
 88019a4c8880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 88019a4c8900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

Fixes: 648700f76b03 ("inet: frags: use rhashtables for reassembly units")
Signed-off-by: Eric Dumazet 
Reported-by: syzbot 
Cc: Thomas Graf 
Cc: Herbert Xu 
---
 net/ipv4/inet_fragment.c | 55 
 1 file changed, 33 insertions(+), 22 deletions(-)

diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index bcb11f3a27c0c34115af05034a5a20f57842eb0a..50d74a191ff14078bcb87c86640fe7dd342f9956 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -68,32 +68,43 @@ void inet_frags_fini(struct inet_frags *f)
 }
 EXPORT_SYMBOL(inet_frags_fini);
 
-static void inet_frags_free_cb(void *ptr, void *arg)
-{
-   struct inet_frag_queue *fq = ptr;
-
-   /* If we can 

[net 3/3] net/mlx5e: Set vlan masks for all offloaded TC rules

2018-10-01 Thread Saeed Mahameed
From: Jianbo Liu 

In flow steering, if asked to, the hardware matches on the first ethertype
which is not vlan. It's possible to set a rule as follows, which is meant
to match on untagged packets, but will also match vlan packets:
tc filter add dev eth0 parent ffff: protocol ip flower ...

To avoid this for packets with single tag, we set vlan masks to tell
hardware to check the tags for every matched packet.

Fixes: 095b6cfd69ce ('net/mlx5e: Add TC vlan match parsing')
Signed-off-by: Jianbo Liu 
Reviewed-by: Or Gerlitz 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 52e05f3ece50..85796727093e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -1368,6 +1368,9 @@ static int __parse_cls_flower(struct mlx5e_priv *priv,
 
*match_level = MLX5_MATCH_L2;
}
+   } else {
+   MLX5_SET(fte_match_set_lyr_2_4, headers_c, svlan_tag, 1);
+   MLX5_SET(fte_match_set_lyr_2_4, headers_c, cvlan_tag, 1);
}
 
if (dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_CVLAN)) {
-- 
2.17.1



[net 2/3] net/mlx5: E-Switch, Fix out of bound access when setting vport rate

2018-10-01 Thread Saeed Mahameed
From: Eran Ben Elisha 

The code that deals with eswitch vport bw guarantee was going beyond the
eswitch vport array limit, fix that.  This was pointed out by the kernel
address sanitizer (KASAN).

The error from KASAN log:
[2018-09-15 15:04:45] BUG: KASAN: slab-out-of-bounds in
mlx5_eswitch_set_vport_rate+0x8c1/0xae0 [mlx5_core]

Fixes: c9497c98901c ("net/mlx5: Add support for setting VF min rate")
Signed-off-by: Eran Ben Elisha 
Reviewed-by: Or Gerlitz 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c 
b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
index 2b252cde5cc2..ea7dedc2d5ad 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
@@ -2000,7 +2000,7 @@ static u32 calculate_vports_min_rate_divider(struct 
mlx5_eswitch *esw)
u32 max_guarantee = 0;
int i;
 
-   for (i = 0; i <= esw->total_vports; i++) {
+   for (i = 0; i < esw->total_vports; i++) {
evport = &esw->vports[i];
if (!evport->enabled || evport->info.min_rate < max_guarantee)
continue;
@@ -2020,7 +2020,7 @@ static int normalize_vports_min_rate(struct mlx5_eswitch 
*esw, u32 divider)
int err;
int i;
 
-   for (i = 0; i <= esw->total_vports; i++) {
+   for (i = 0; i < esw->total_vports; i++) {
evport = &esw->vports[i];
if (!evport->enabled)
continue;
-- 
2.17.1


