Re: [RFC net-next 1/2] net: net-porcfs: Reduce rcu lock critical section

2018-04-11 Thread Eric Dumazet


On 04/11/2018 04:47 PM, Saeed Mahameed wrote:
> 
> Well if we allow devices to access HW counters via FW command
> interfaces in ndo_get_stats and by testing mlx5 where we query up to 5
> hw registers, it could take 100us, still this is way smaller than 10sec
>  :) and it is really a nice rate to fetch HW stats on demand.

If hardware stats are slower than software ones, maybe it is time to use 
software stats,
instead of changing the whole stack ?

There are very few devices drivers having issues like that.



Re: [PATCH v2] net: dsa: b53: Using sleep-able operations in b53_switch_reset_gpio

2018-04-11 Thread Phil Reid

On 12/04/2018 09:48, Jia-Ju Bai wrote:

b53_switch_reset_gpio() is never called in atomic context.

The call chain ending up at b53_switch_reset_gpio() is:
[1] b53_switch_reset_gpio() <- b53_switch_reset() <-
 b53_reset_switch() <- b53_setup()

b53_switch_reset_gpio() is set as ".setup" in struct dsa_switch_ops.
This function is not called in atomic context.

Despite never getting called from atomic context, b53_switch_reset_gpio()
calls non-sleep operations mdelay() and gpio_set_value().
They are not necessary and can be replaced with msleep()
and gpio_set_value_cansleep().

This is found by a static analysis tool named DCNS written by myself.
And I also manually check it.

Signed-off-by: Jia-Ju Bai 
---
v2:
* Use gpio_set_value_cansleep() to replace gpio_set_value() additionally.
   Thanks for Florian and Phil for good advice.
---
  drivers/net/dsa/b53/b53_common.c | 8 
  1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/dsa/b53/b53_common.c b/drivers/net/dsa/b53/b53_common.c
index 274f367..36cc60d 100644
--- a/drivers/net/dsa/b53/b53_common.c
+++ b/drivers/net/dsa/b53/b53_common.c
@@ -596,11 +596,11 @@ static void b53_switch_reset_gpio(struct b53_device *dev)
  
  	/* Reset sequence: RESET low(50ms)->high(20ms)

 */
-   gpio_set_value(gpio, 0);
-   mdelay(50);
+   gpio_set_value_cansleep(gpio, 0);
+   msleep(50);
  
-	gpio_set_value(gpio, 1);

-   mdelay(20);
+   gpio_set_value_cansleep(gpio, 1);
+   msleep(20);
  
  	dev->current_page = 0xff;

  }



FWIW:
Reviewed-by: Phil Reid 


Re: [PATCH v2] dec: tulip: de4x5: Replace mdelay with usleep_range in de4x5_hw_init

2018-04-11 Thread Jia-Ju Bai



On 2018/4/12 10:21, arvindY wrote:



On Thursday 12 April 2018 07:00 AM, Jia-Ju Bai wrote:



On 2018/4/12 0:16, James Bottomley wrote:

On Wed, 2018-04-11 at 23:39 +0800, Jia-Ju Bai wrote:

de4x5_hw_init() is never called in atomic context.

de4x5_hw_init() is only called by de4x5_pci_probe(), which is only
set as ".probe" in struct pci_driver.

Despite never getting called from atomic context, de4x5_hw_init()
calls mdelay() to busily wait. This is not necessary and can be
replaced with usleep_range() to  avoid busy waiting.

This is found by a static analysis tool named DCNS written by myself.
And I also manually check it.

Did you actually test this?  The usual reason for wanting m/udelay is
that the timing must be exact.  The driver is filled with mdelay()s for
this reason.  The one you've picked on is in the init path so it won't
affect the runtime in any way.  I also don't think we have the hrtimer
machinery for usleep_range() to work properly on parisc, so I don't
think the replacement works.

James



Hello, James.
Thanks for your reply :)

I agree that usleep_range() here will not much affect the real 
execution of this driver.


But I think usleep_range() can more opportunity for other threads to 
use the CPU core to schedule during waiting.
That is why I detect mdelay() that can be replaced with msleep() or 
usleep_range().




James is right, You have added all usleep_range() during system 
boot-up time.

During boot-up system will run as single threaded. Where this change will
not make much sense. System first priority is match the exact timing on
each and every boot-up.



Hello, Arvind.
Thanks for your reply :)

I admit I am not familiar with this driver.
I did not know this driver is only loaded during system boot-up time,
I thought this driver can be loaded as a kernel module (like many 
drivers) after system booting.

After knowing this, I admit my patch is not proper, sorry...


Best wishes,
Jia-Ju Bai


iproute2-4.16.0 no longer accepts routes via fe80::1

2018-04-11 Thread Thomas Deutschmann
Hi,

I upgraded to iproute2-4.16.0 and rebooted my system.

On start I noticed that my IPv6 default route via "fe80::1" was rejected:

> # ip route add default via ff80::1
> Error: inet address is expected rather than "ff80::1".

This works fine with 

Re: [PATCH v2] dec: tulip: de4x5: Replace mdelay with usleep_range in de4x5_hw_init

2018-04-11 Thread arvindY



On Thursday 12 April 2018 07:00 AM, Jia-Ju Bai wrote:



On 2018/4/12 0:16, James Bottomley wrote:

On Wed, 2018-04-11 at 23:39 +0800, Jia-Ju Bai wrote:

de4x5_hw_init() is never called in atomic context.

de4x5_hw_init() is only called by de4x5_pci_probe(), which is only
set as ".probe" in struct pci_driver.

Despite never getting called from atomic context, de4x5_hw_init()
calls mdelay() to busily wait. This is not necessary and can be
replaced with usleep_range() to  avoid busy waiting.

This is found by a static analysis tool named DCNS written by myself.
And I also manually check it.

Did you actually test this?  The usual reason for wanting m/udelay is
that the timing must be exact.  The driver is filled with mdelay()s for
this reason.  The one you've picked on is in the init path so it won't
affect the runtime in any way.  I also don't think we have the hrtimer
machinery for usleep_range() to work properly on parisc, so I don't
think the replacement works.

James



Hello, James.
Thanks for your reply :)

I agree that usleep_range() here will not much affect the real 
execution of this driver.


But I think usleep_range() can more opportunity for other threads to 
use the CPU core to schedule during waiting.
That is why I detect mdelay() that can be replaced with msleep() or 
usleep_range().




James is right, You have added all usleep_range() during system boot-up 
time.

During boot-up system will run as single threaded. Where this change will
not make much sense. System first priority is match the exact timing on
each and every boot-up.

~arvind



Best wishes,
Jia-Ju Bai




Re: [RFC PATCH v2 03/14] xsk: add umem fill queue support and mmap

2018-04-11 Thread Michael S. Tsirkin
On Tue, Mar 27, 2018 at 06:59:08PM +0200, Björn Töpel wrote:
> @@ -30,4 +31,18 @@ struct xdp_umem_reg {
>   __u32 frame_headroom; /* Frame head room */
>  };
>  
> +/* Pgoff for mmaping the rings */
> +#define XDP_UMEM_PGOFF_FILL_QUEUE0x1
> +
> +struct xdp_queue {
> + __u32 head_idx __attribute__((aligned(64)));
> + __u32 tail_idx __attribute__((aligned(64)));
> +};
> +
> +/* Used for the fill and completion queues for buffers */
> +struct xdp_umem_queue {
> + struct xdp_queue ptrs;
> + __u32 desc[0] __attribute__((aligned(64)));
> +};
> +
>  #endif /* _LINUX_IF_XDP_H */

So IIUC it's a head/tail ring of 32 bit descriptors.

In my experience (from implementing ptr_ring) this
implies that head/tail cache lines bounce a lot between
CPUs. Caching will help some. You are also forced to
use barriers to check validity which is slow on
some architectures.

If instead you can use a special descriptor value (e.g. 0) as
a valid signal, things work much better:

- you read descriptor atomically, if it's not 0 it's fine
- same with write - write 0 to pass it to the other side
- there is a data dependency so no need for barriers (except on dec alpha)
- no need for power of 2 limitations, you can make it any size you like
- easy to resize too

architecture (if not implementation) would be shared with ptr_ring
so some of the optimization ideas like batched updates could
be lifted from there.

When I was building ptr_ring, any head/tail design underperformed
storing valid flag with data itself. YMMV.

-- 
MST


Re: [PATCH] net: dsa: b53: Replace mdelay with msleep in b53_switch_reset_gpio

2018-04-11 Thread Jia-Ju Bai



On 2018/4/12 0:19, Florian Fainelli wrote:

On 04/11/2018 12:14 AM, Jia-Ju Bai wrote:


On 2018/4/11 13:30, Phil Reid wrote:

On 11/04/2018 09:51, Jia-Ju Bai wrote:

b53_switch_reset_gpio() is never called in atomic context.

The call chain ending up at b53_switch_reset_gpio() is:
[1] b53_switch_reset_gpio() <- b53_switch_reset() <-
 b53_reset_switch() <- b53_setup()

b53_switch_reset_gpio() is set as ".setup" in struct dsa_switch_ops.
This function is not called in atomic context.

Despite never getting called from atomic context,
b53_switch_reset_gpio()
calls mdelay() to busily wait.
This is not necessary and can be replaced with msleep() to
avoid busy waiting.

This is found by a static analysis tool named DCNS written by myself.
And I also manually check it.

Signed-off-by: Jia-Ju Bai 
---
   drivers/net/dsa/b53/b53_common.c | 4 ++--
   1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/dsa/b53/b53_common.c
b/drivers/net/dsa/b53/b53_common.c
index 274f367..e070ff6 100644
--- a/drivers/net/dsa/b53/b53_common.c
+++ b/drivers/net/dsa/b53/b53_common.c
@@ -597,10 +597,10 @@ static void b53_switch_reset_gpio(struct
b53_device *dev)
   /* Reset sequence: RESET low(50ms)->high(20ms)
*/
   gpio_set_value(gpio, 0);
-mdelay(50);
+msleep(50);
 gpio_set_value(gpio, 1);
-mdelay(20);
+msleep(20);
 dev->current_page = 0xff;
   }


Would that also imply gpio_set_value could be gpio_set_value_cansleep?


Yes, I think gpio_set_value_cansleep() is okay here?
Do I need to send a V2 patch to replace gpio_set_value()?

Yes, I would lump these two changes in the same patch since this is
effectively about solving sleeping vs. non sleeping operations.


Okay, I have sent a V2 patch, and you can have a look :)


Best wishes,
Jia-Ju Bai


iproute2-4.16.0 no longer accepts routes via fe80::1

2018-04-11 Thread Thomas Deutschmann
Hi,

I upgraded to iproute2-4.16.0 and rebooted my system.

On start I noticed that my IPv6 default route via "fe80::1" was rejected:

> # ip route add default via ff80::1
> Error: inet address is expected rather than "ff80::1".

This works with 

[PATCH v2] net: dsa: b53: Using sleep-able operations in b53_switch_reset_gpio

2018-04-11 Thread Jia-Ju Bai
b53_switch_reset_gpio() is never called in atomic context.

The call chain ending up at b53_switch_reset_gpio() is:
[1] b53_switch_reset_gpio() <- b53_switch_reset() <-
b53_reset_switch() <- b53_setup()

b53_switch_reset_gpio() is set as ".setup" in struct dsa_switch_ops.
This function is not called in atomic context.

Despite never getting called from atomic context, b53_switch_reset_gpio()
calls non-sleep operations mdelay() and gpio_set_value().
They are not necessary and can be replaced with msleep() 
and gpio_set_value_cansleep().

This is found by a static analysis tool named DCNS written by myself.
And I also manually check it.

Signed-off-by: Jia-Ju Bai 
---
v2:
* Use gpio_set_value_cansleep() to replace gpio_set_value() additionally.
  Thanks for Florian and Phil for good advice.
---
 drivers/net/dsa/b53/b53_common.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/dsa/b53/b53_common.c b/drivers/net/dsa/b53/b53_common.c
index 274f367..36cc60d 100644
--- a/drivers/net/dsa/b53/b53_common.c
+++ b/drivers/net/dsa/b53/b53_common.c
@@ -596,11 +596,11 @@ static void b53_switch_reset_gpio(struct b53_device *dev)
 
/* Reset sequence: RESET low(50ms)->high(20ms)
 */
-   gpio_set_value(gpio, 0);
-   mdelay(50);
+   gpio_set_value_cansleep(gpio, 0);
+   msleep(50);
 
-   gpio_set_value(gpio, 1);
-   mdelay(20);
+   gpio_set_value_cansleep(gpio, 1);
+   msleep(20);
 
dev->current_page = 0xff;
 }
-- 
1.9.1



Re: [PATCH v2] dec: tulip: de4x5: Replace mdelay with usleep_range in de4x5_hw_init

2018-04-11 Thread Jia-Ju Bai



On 2018/4/12 0:16, James Bottomley wrote:

On Wed, 2018-04-11 at 23:39 +0800, Jia-Ju Bai wrote:

de4x5_hw_init() is never called in atomic context.

de4x5_hw_init() is only called by de4x5_pci_probe(), which is only
set as ".probe" in struct pci_driver.

Despite never getting called from atomic context, de4x5_hw_init()
calls mdelay() to busily wait. This is not necessary and can be
replaced with usleep_range() to  avoid busy waiting.

This is found by a static analysis tool named DCNS written by myself.
And I also manually check it.

Did you actually test this?  The usual reason for wanting m/udelay is
that the timing must be exact.  The driver is filled with mdelay()s for
this reason.  The one you've picked on is in the init path so it won't
affect the runtime in any way.  I also don't think we have the hrtimer
machinery for usleep_range() to work properly on parisc, so I don't
think the replacement works.

James



Hello, James.
Thanks for your reply :)

I agree that usleep_range() here will not much affect the real execution 
of this driver.


But I think usleep_range() can more opportunity for other threads to use 
the CPU core to schedule during waiting.
That is why I detect mdelay() that can be replaced with msleep() or 
usleep_range().



Best wishes,
Jia-Ju Bai


RE: WARNING in kobject_add_internal

2018-04-11 Thread Yuan, Linyu (NSB - CN/Shanghai)
Hi,

I have a question,
"can syzbot auto test each tree with newest changeset" ?

> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org]
> On Behalf Of Dmitry Vyukov
> Sent: Wednesday, April 11, 2018 10:58 PM
> To: syzbot
> Cc: bri...@lists.linux-foundation.org; David Miller; Greg Kroah-Hartman;
> LKML; netdev; stephen hemminger; syzkaller-bugs
> Subject: Re: WARNING in kobject_add_internal
> 
> On Fri, Jan 5, 2018 at 10:41 PM, syzbot
>  il.com>
> wrote:
> > syzkaller has found reproducer for the following crash on
> > 89876f275e8d562912d9c238cd888b52065cf25c
> > git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master
> > compiler: gcc (GCC) 7.1.1 20170620
> > .config is attached
> > Raw console output is attached.
> > C reproducer is attached
> > syzkaller reproducer is attached. See https://goo.gl/kgGztJ
> > for information about syzkaller reproducers
> >
> >
> > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > Reported-by:
> >
> syzbot+e204ced820ef739d71ef5438f5e1976a874abc8d@syzkaller.appspotmail
> .com
> > It will help syzbot understand when the bug is fixed.
> 
> #syz dup: WARNING: kobject bug in device_add
> 
> > [ cut here ]
> > kobject_add_internal failed for   (error: -12 parent: net)
> > WARNING: CPU: 1 PID: 3494 at lib/kobject.c:244
> > kobject_add_internal+0x3f6/0xbc0 lib/kobject.c:242
> > Kernel panic - not syncing: panic_on_warn set ...
> >
> > CPU: 1 PID: 3494 Comm: syzkaller425998 Not tainted 4.15.0-rc6+ #249
> > Hardware name: Google Google Compute Engine/Google Compute Engine,
> BIOS
> > Google 01/01/2011
> > Call Trace:
> >  __dump_stack lib/dump_stack.c:17 [inline]
> >  dump_stack+0x194/0x257 lib/dump_stack.c:53
> >  panic+0x1e4/0x41c kernel/panic.c:183
> >  __warn+0x1dc/0x200 kernel/panic.c:547
> >  report_bug+0x211/0x2d0 lib/bug.c:184
> >  fixup_bug.part.11+0x37/0x80 arch/x86/kernel/traps.c:178
> >  fixup_bug arch/x86/kernel/traps.c:247 [inline]
> >  do_error_trap+0x2d7/0x3e0 arch/x86/kernel/traps.c:296
> >  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
> >  invalid_op+0x22/0x40 arch/x86/entry/entry_64.S:1079
> > RIP: 0010:kobject_add_internal+0x3f6/0xbc0 lib/kobject.c:242
> > RSP: 0018:8801c53c76f0 EFLAGS: 00010286
> > RAX: dc08 RBX: 8801bf5a88d8 RCX: 8159da9e
> > RDX:  RSI: 110038a78e99 RDI: 8801c53c73f8
> > RBP: 8801c53c77e8 R08: 110038a78e5b R09: 
> > R10: 8801c53c74b0 R11:  R12: 110038a78ee4
> > R13: fff4 R14: 8801d8359a80 R15: 86201980
> >  kobject_add_varg lib/kobject.c:366 [inline]
> >  kobject_add+0x132/0x1f0 lib/kobject.c:411
> >  device_add+0x35d/0x1650 drivers/base/core.c:1787
> >  netdev_register_kobject+0x183/0x360 net/core/net-sysfs.c:1604
> >  register_netdevice+0xb2b/0x1010 net/core/dev.c:7698
> >  tun_set_iff drivers/net/tun.c:2319 [inline]
> >  __tun_chr_ioctl+0x1d89/0x3dd0 drivers/net/tun.c:2524
> >  tun_chr_ioctl+0x2a/0x40 drivers/net/tun.c:2773
> >  vfs_ioctl fs/ioctl.c:46 [inline]
> >  do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:686
> >  SYSC_ioctl fs/ioctl.c:701 [inline]
> >  SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692
> >  entry_SYSCALL_64_fastpath+0x23/0x9a
> > RIP: 0033:0x444fc9
> > RSP: 002b:7fff53389dc8 EFLAGS: 0246 ORIG_RAX:
> 0010
> > RAX: ffda RBX: 0001 RCX: 00444fc9
> > RDX: 20533000 RSI: 400454ca RDI: 0004
> > RBP: 0005 R08: 0002 R09: 006f3131
> > R10:  R11: 0246 R12: 00402500
> > R13: 00402590 R14:  R15: 
> >
> > Dumping ftrace buffer:
> >(ftrace buffer empty)
> > Kernel Offset: disabled
> > Rebooting in 86400 seconds..
> >


[bpf PATCH 3/3] bpf: sockmap, fix double put_page on ENOMEM error in redirect path

2018-04-11 Thread John Fastabend
In the case where the socket memory boundary is hit the redirect
path returns an ENOMEM error. However, before checking for this
condition the redirect scatterlist buffer is setup with a valid
page and length. This is never unwound so when the buffers are
released latter in the error path we do a put_page() and clear
the scatterlist fields. But, because the initial error happens
before completing the scatterlist buffer we end up with both the
original buffer and the redirect buffer pointing to the same page
resulting in duplicate put_page() calls.

To fix this simply move the initial configuration of the redirect
scatterlist buffer below the sock memory check.

Found this while running TCP_STREAM test with netperf using Cilium.

Fixes: fa246693a111 ("bpf: sockmap, BPF_F_INGRESS flag for 
BPF_SK_SKB_STREAM_VERDICT")
Signed-off-by: John Fastabend 
---
 kernel/bpf/sockmap.c |3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index ffe266b..0fefdb0 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -524,8 +524,6 @@ static int bpf_tcp_ingress(struct sock *sk, int apply_bytes,
i = md->sg_start;
 
do {
-   r->sg_data[i] = md->sg_data[i];
-
size = (apply && apply_bytes < md->sg_data[i].length) ?
apply_bytes : md->sg_data[i].length;
 
@@ -536,6 +534,7 @@ static int bpf_tcp_ingress(struct sock *sk, int apply_bytes,
}
 
sk_mem_charge(sk, size);
+   r->sg_data[i] = md->sg_data[i];
r->sg_data[i].length = size;
md->sg_data[i].length -= size;
md->sg_data[i].offset += size;



[bpf PATCH 1/3] bpf: sockmap, map_release does not hold refcnt for pinned maps

2018-04-11 Thread John Fastabend
Relying on map_release hook to decrement the reference counts when a
map is removed only works if the map is not being pinned. In the
pinned case the ref is decremented immediately and the BPF programs
released. After this BPF programs may not be in-use which is not
what the user would expect.

This patch moves the release logic into bpf_map_put_uref() and brings
sockmap in-line with how a similar case is handled in prog array maps.

Fixes: 3d9e952697de ("bpf: sockmap, fix leaking maps with attached but not 
detached progs")
Signed-off-by: John Fastabend 
---
 include/linux/bpf.h  |3 +++
 kernel/bpf/sockmap.c |3 +--
 kernel/bpf/syscall.c |   10 +-
 3 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 95a7abd..8a71058 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -646,6 +646,7 @@ static inline void bpf_map_offload_map_free(struct bpf_map 
*map)
 #if defined(CONFIG_STREAM_PARSER) && defined(CONFIG_BPF_SYSCALL) && 
defined(CONFIG_INET)
 struct sock  *__sock_map_lookup_elem(struct bpf_map *map, u32 key);
 int sock_map_prog(struct bpf_map *map, struct bpf_prog *prog, u32 type);
+void sock_map_release(struct bpf_map *map);
 #else
 static inline struct sock  *__sock_map_lookup_elem(struct bpf_map *map, u32 
key)
 {
@@ -658,6 +659,8 @@ static inline int sock_map_prog(struct bpf_map *map,
 {
return -EOPNOTSUPP;
 }
+
+void sock_map_release(struct bpf_map *map) {}
 #endif
 
 /* verifier prototypes for helper functions called from eBPF programs */
diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 8dd9210..cef187f 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -1834,7 +1834,7 @@ static int sock_map_update_elem(struct bpf_map *map,
return err;
 }
 
-static void sock_map_release(struct bpf_map *map, struct file *map_file)
+void sock_map_release(struct bpf_map *map)
 {
struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
struct bpf_prog *orig;
@@ -1858,7 +1858,6 @@ static void sock_map_release(struct bpf_map *map, struct 
file *map_file)
.map_get_next_key = sock_map_get_next_key,
.map_update_elem = sock_map_update_elem,
.map_delete_elem = sock_map_delete_elem,
-   .map_release = sock_map_release,
 };
 
 BPF_CALL_4(bpf_sock_map_update, struct bpf_sock_ops_kern *, bpf_sock,
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 0244973..449a618 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -257,8 +257,16 @@ static void bpf_map_free_deferred(struct work_struct *work)
 static void bpf_map_put_uref(struct bpf_map *map)
 {
if (atomic_dec_and_test(>usercnt)) {
-   if (map->map_type == BPF_MAP_TYPE_PROG_ARRAY)
+   switch (map->map_type) {
+   case BPF_MAP_TYPE_PROG_ARRAY:
bpf_fd_array_map_clear(map);
+   break;
+   case BPF_MAP_TYPE_SOCKMAP:
+   sock_map_release(map);
+   break;
+   default:
+   break;
+   }
}
 }
 



[bpf PATCH 0/3] BPF, a couple sockmap fixes

2018-04-11 Thread John Fastabend
While testing sockmap with more programs (besides our test programs)
I found a couple issues.

The attached series fixes an issue where pinned maps were not
working correctly, blocking sockets returned zero, and an error
path that when the sock hit an out of memory case resulted in a
double page_put() while doing ingress redirects.

See individual patches for more details.

Thanks,
John

---

John Fastabend (3):
  bpf: sockmap, map_release does not hold refcnt for pinned maps
  bpf: sockmap, sk_wait_event needed to handle blocking cases
  bpf: sockmap, fix double put_page on ENOMEM error in redirect path


 include/linux/bpf.h  |3 +++
 kernel/bpf/sockmap.c |   51 ++
 kernel/bpf/syscall.c |   10 +-
 3 files changed, 59 insertions(+), 5 deletions(-)

--


[bpf PATCH 2/3] bpf: sockmap, sk_wait_event needed to handle blocking cases

2018-04-11 Thread John Fastabend
In the recvmsg handler we need to add a wait event to support the
blocking use cases. Without this we return zero and may confuse
user applications. In the wait event any data received on the
sk either via sk_receive_queue or the psock ingress list will
wake up the sock.

Fixes: fa246693a111 ("bpf: sockmap, BPF_F_INGRESS flag for 
BPF_SK_SKB_STREAM_VERDICT")
Signed-off-by: John Fastabend 
---
 kernel/bpf/sockmap.c |   45 +
 1 file changed, 45 insertions(+)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index cef187f..ffe266b 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -43,6 +43,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define SOCK_CREATE_FLAG_MASK \
(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY)
@@ -732,6 +733,26 @@ static int bpf_exec_tx_verdict(struct smap_psock *psock,
return err;
 }
 
+static int bpf_wait_data(struct sock *sk,
+struct smap_psock *psk, int flags,
+long timeo, int *err)
+{
+   int rc;
+
+   DEFINE_WAIT_FUNC(wait, woken_wake_function);
+
+   add_wait_queue(sk_sleep(sk), );
+   sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk);
+   rc = sk_wait_event(sk, ,
+  !list_empty(>ingress) ||
+  !skb_queue_empty(>sk_receive_queue),
+  );
+   sk_clear_bit(SOCKWQ_ASYNC_WAITDATA, sk);
+   remove_wait_queue(sk_sleep(sk), );
+
+   return rc;
+}
+
 static int bpf_tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
   int nonblock, int flags, int *addr_len)
 {
@@ -755,6 +776,7 @@ static int bpf_tcp_recvmsg(struct sock *sk, struct msghdr 
*msg, size_t len,
return tcp_recvmsg(sk, msg, len, nonblock, flags, addr_len);
 
lock_sock(sk);
+bytes_ready:
while (copied != len) {
struct scatterlist *sg;
struct sk_msg_buff *md;
@@ -809,6 +831,29 @@ static int bpf_tcp_recvmsg(struct sock *sk, struct msghdr 
*msg, size_t len,
}
}
 
+   if (!copied) {
+   long timeo;
+   int data;
+   int err = 0;
+
+   timeo = sock_rcvtimeo(sk, nonblock);
+   data = bpf_wait_data(sk, psock, flags, timeo, );
+
+   if (data) {
+   if (!skb_queue_empty(>sk_receive_queue)) {
+   release_sock(sk);
+   smap_release_sock(psock, sk);
+   copied = tcp_recvmsg(sk, msg, len,
+nonblock, flags, addr_len);
+   return copied;
+   }
+   goto bytes_ready;
+   }
+
+   if (err)
+   copied = err;
+   }
+
release_sock(sk);
smap_release_sock(psock, sk);
return copied;



[PATCH net 4/4] nfp: flower: split and limit cmsg skb lists

2018-04-11 Thread Jakub Kicinski
From: Pieter Jansen van Vuuren 

Introduce a second skb list for handling control messages and limit the
number of allowed messages. Some control messages are considered more
crucial than others, resulting in the need for a second skb list. By
splitting the list into a separate high and low priority list we can
ensure that messages on the high list get added to the head of the list
that gets processed, this however has no functional impact. Previously
there was no limit on the number of messages allowed on the queue, this
could result in the queue growing boundlessly and eventually the host
running out of memory.

Fixes: b985f870a5f0 ("nfp: process control messages in workqueue in flower app")
Signed-off-by: Pieter Jansen van Vuuren 
Reviewed-by: Jakub Kicinski 
Reviewed-by: Simon Horman 
---
 drivers/net/ethernet/netronome/nfp/flower/cmsg.c | 38 +---
 drivers/net/ethernet/netronome/nfp/flower/cmsg.h |  2 ++
 drivers/net/ethernet/netronome/nfp/flower/main.c |  6 ++--
 drivers/net/ethernet/netronome/nfp/flower/main.h |  8 +++--
 4 files changed, 46 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/cmsg.c 
b/drivers/net/ethernet/netronome/nfp/flower/cmsg.c
index 8ad857eb89c6..577659f332e4 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/cmsg.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/cmsg.c
@@ -272,18 +272,49 @@ nfp_flower_cmsg_process_one_rx(struct nfp_app *app, 
struct sk_buff *skb)
 
 void nfp_flower_cmsg_process_rx(struct work_struct *work)
 {
+   struct sk_buff_head cmsg_joined;
struct nfp_flower_priv *priv;
struct sk_buff *skb;
 
priv = container_of(work, struct nfp_flower_priv, cmsg_work);
+   skb_queue_head_init(_joined);
 
-   while ((skb = skb_dequeue(>cmsg_skbs)))
+   spin_lock_bh(>cmsg_skbs_high.lock);
+   skb_queue_splice_tail_init(>cmsg_skbs_high, _joined);
+   spin_unlock_bh(>cmsg_skbs_high.lock);
+
+   spin_lock_bh(>cmsg_skbs_low.lock);
+   skb_queue_splice_tail_init(>cmsg_skbs_low, _joined);
+   spin_unlock_bh(>cmsg_skbs_low.lock);
+
+   while ((skb = __skb_dequeue(_joined)))
nfp_flower_cmsg_process_one_rx(priv->app, skb);
 }
 
-void nfp_flower_cmsg_rx(struct nfp_app *app, struct sk_buff *skb)
+static void
+nfp_flower_queue_ctl_msg(struct nfp_app *app, struct sk_buff *skb, int type)
 {
struct nfp_flower_priv *priv = app->priv;
+   struct sk_buff_head *skb_head;
+
+   if (type == NFP_FLOWER_CMSG_TYPE_PORT_REIFY ||
+   type == NFP_FLOWER_CMSG_TYPE_PORT_MOD)
+   skb_head = >cmsg_skbs_high;
+   else
+   skb_head = >cmsg_skbs_low;
+
+   if (skb_queue_len(skb_head) >= NFP_FLOWER_WORKQ_MAX_SKBS) {
+   nfp_flower_cmsg_warn(app, "Dropping queued control messages\n");
+   dev_kfree_skb_any(skb);
+   return;
+   }
+
+   skb_queue_tail(skb_head, skb);
+   schedule_work(>cmsg_work);
+}
+
+void nfp_flower_cmsg_rx(struct nfp_app *app, struct sk_buff *skb)
+{
struct nfp_flower_cmsg_hdr *cmsg_hdr;
 
cmsg_hdr = nfp_flower_cmsg_get_hdr(skb);
@@ -307,7 +338,6 @@ void nfp_flower_cmsg_rx(struct nfp_app *app, struct sk_buff 
*skb)
/* Acks from the NFP that the route is added - ignore. */
dev_consume_skb_any(skb);
} else {
-   skb_queue_tail(>cmsg_skbs, skb);
-   schedule_work(>cmsg_work);
+   nfp_flower_queue_ctl_msg(app, skb, cmsg_hdr->type);
}
 }
diff --git a/drivers/net/ethernet/netronome/nfp/flower/cmsg.h 
b/drivers/net/ethernet/netronome/nfp/flower/cmsg.h
index 96bc0e33980c..b6c0fd053a50 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/cmsg.h
+++ b/drivers/net/ethernet/netronome/nfp/flower/cmsg.h
@@ -108,6 +108,8 @@
 #define NFP_FL_IPV4_TUNNEL_TYPEGENMASK(7, 4)
 #define NFP_FL_IPV4_PRE_TUN_INDEX  GENMASK(2, 0)
 
+#define NFP_FLOWER_WORKQ_MAX_SKBS  3
+
 #define nfp_flower_cmsg_warn(app, fmt, args...) \
do {\
if (net_ratelimit())\
diff --git a/drivers/net/ethernet/netronome/nfp/flower/main.c 
b/drivers/net/ethernet/netronome/nfp/flower/main.c
index 6357e0720f43..ad02592a82b7 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/main.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/main.c
@@ -519,7 +519,8 @@ static int nfp_flower_init(struct nfp_app *app)
 
app->priv = app_priv;
app_priv->app = app;
-   skb_queue_head_init(_priv->cmsg_skbs);
+   skb_queue_head_init(_priv->cmsg_skbs_high);
+   skb_queue_head_init(_priv->cmsg_skbs_low);
INIT_WORK(_priv->cmsg_work, nfp_flower_cmsg_process_rx);

[PATCH net 3/4] nfp: flower: move route ack control messages out of the workqueue

2018-04-11 Thread Jakub Kicinski
From: Pieter Jansen van Vuuren 

Previously we processed the route ack control messages in the workqueue,
this unnecessarily loads the workqueue. We can deal with these messages
sooner as we know we are going to drop them.

Fixes: 8e6a9046b66a ("nfp: flower vxlan neighbour offload")
Signed-off-by: Pieter Jansen van Vuuren 
Reviewed-by: Jakub Kicinski 
Reviewed-by: Simon Horman 
---
 drivers/net/ethernet/netronome/nfp/flower/cmsg.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/cmsg.c 
b/drivers/net/ethernet/netronome/nfp/flower/cmsg.c
index 3735c09d2112..8ad857eb89c6 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/cmsg.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/cmsg.c
@@ -258,9 +258,6 @@ nfp_flower_cmsg_process_one_rx(struct nfp_app *app, struct 
sk_buff *skb)
case NFP_FLOWER_CMSG_TYPE_ACTIVE_TUNS:
nfp_tunnel_keep_alive(app, skb);
break;
-   case NFP_FLOWER_CMSG_TYPE_TUN_NEIGH:
-   /* Acks from the NFP that the route is added - ignore. */
-   break;
default:
nfp_flower_cmsg_warn(app, "Cannot handle invalid repr control 
type %u\n",
 type);
@@ -306,6 +303,9 @@ void nfp_flower_cmsg_rx(struct nfp_app *app, struct sk_buff 
*skb)
   nfp_flower_process_mtu_ack(app, skb)) {
/* Handle MTU acks outside wq to prevent RTNL conflict. */
dev_consume_skb_any(skb);
+   } else if (cmsg_hdr->type == NFP_FLOWER_CMSG_TYPE_TUN_NEIGH) {
+   /* Acks from the NFP that the route is added - ignore. */
+   dev_consume_skb_any(skb);
} else {
skb_queue_tail(>cmsg_skbs, skb);
schedule_work(>cmsg_work);
-- 
2.16.2



[PATCH net 1/4] nfp: ignore signals when communicating with management FW

2018-04-11 Thread Jakub Kicinski
We currently allow signals to interrupt the wait for management FW
commands.  Exiting the wait should not cause trouble, the FW will
just finish executing the command in the background and new commands
will wait for the old one to finish.

However, this may not be what users expect (Ctrl-C not actually stopping
the command).  Moreover some systems routinely request link information
with signals pending (Ubuntu 14.04 runs a landscape-sysinfo python tool
from MOTD) worrying users with errors like these:

nfp :04:00.0: nfp_nsp: Error -512 waiting for code 0x0007 to start
nfp :04:00.0: nfp: reading port table failed -512

Make the wait for management FW responses non-interruptible.

Fixes: 1a64821c6af7 ("nfp: add support for service processor access")
Signed-off-by: Jakub Kicinski 
Reviewed-by: Dirk van der Merwe 
---
 drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c 
b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c
index 99bb679a9801..2abee0fe3a7c 100644
--- a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c
+++ b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c
@@ -281,8 +281,7 @@ nfp_nsp_wait_reg(struct nfp_cpp *cpp, u64 *reg, u32 
nsp_cpp, u64 addr,
if ((*reg & mask) == val)
return 0;
 
-   if (msleep_interruptible(25))
-   return -ERESTARTSYS;
+   msleep(25);
 
if (time_after(start_time, wait_until))
return -ETIMEDOUT;
-- 
2.16.2



[PATCH net 2/4] nfp: print a message when mutex wait is interrupted

2018-04-11 Thread Jakub Kicinski
When waiting for an NFP mutex is interrupted print a message
to make root causing later error messages easier.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Dirk van der Merwe 
---
 drivers/net/ethernet/netronome/nfp/nfpcore/nfp_mutex.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_mutex.c 
b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_mutex.c
index f7b958181126..cb28ac03e4ca 100644
--- a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_mutex.c
+++ b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_mutex.c
@@ -211,8 +211,11 @@ int nfp_cpp_mutex_lock(struct nfp_cpp_mutex *mutex)
break;
 
err = msleep_interruptible(timeout_ms);
-   if (err != 0)
+   if (err != 0) {
+   nfp_info(mutex->cpp,
+"interrupted waiting for NFP mutex\n");
return -ERESTARTSYS;
+   }
 
if (time_is_before_eq_jiffies(warn_at)) {
warn_at = jiffies + NFP_MUTEX_WAIT_NEXT_WARN * HZ;
-- 
2.16.2



[PATCH net 0/4] nfp: improve signal handing on FW waits and flower control message processing

2018-04-11 Thread Jakub Kicinski
Hi!

The first part of this set aims to improve handling of interrupted
waits.  Patch 1 makes waiting for management FW responses
uninterruptible while patch 2 adds a message when signal arrives
while waiting for an NFP mutex.  We can't interrupt execution of
FW commands so uninterruptible sleep seems reasonable there.
Exiting a wait for a mutex should be clean and have no side affects
so we are allowing to abort it.  Note that both waits have rather
large timeouts (tens of seconds).

Patches 3 and 4 improve flower offload operation under heavy load.
Currently there is no cap on the number of queued FW notifications.
Some of the notifications have to be processed from a workqueue
which may lead to very large number of messages getting queued
if workqueue never gets a chance to run.  Pieter puts a limit
on number of queued messages, tries to drop some messages we ignore
without queuing and process more important messages first.

Jakub Kicinski (2):
  nfp: ignore signals when communicating with management FW
  nfp: print a message when mutex wait is interrupted

Pieter Jansen van Vuuren (2):
  nfp: flower: move route ack control messages out of the workqueue
  nfp: flower: split and limit cmsg skb lists

 drivers/net/ethernet/netronome/nfp/flower/cmsg.c   | 44 ++
 drivers/net/ethernet/netronome/nfp/flower/cmsg.h   |  2 +
 drivers/net/ethernet/netronome/nfp/flower/main.c   |  6 ++-
 drivers/net/ethernet/netronome/nfp/flower/main.h   |  8 +++-
 .../net/ethernet/netronome/nfp/nfpcore/nfp_mutex.c |  5 ++-
 .../net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c   |  3 +-
 6 files changed, 54 insertions(+), 14 deletions(-)

-- 
2.16.2



Re: [RFC net-next 1/2] net: net-porcfs: Reduce rcu lock critical section

2018-04-11 Thread Saeed Mahameed
On Wed, 2018-04-11 at 15:30 -0700, Eric Dumazet wrote:
> 
> On 04/11/2018 11:59 AM, Saeed Mahameed wrote:
> > On Tue, 2018-04-10 at 13:35 -0700, Eric Dumazet wrote:
> > > 
> > > On 04/10/2018 10:16 AM, David Miller wrote:
> > > > 
> > > > The tradeoff here is that now you are doing two unnecessary
> > > > atomic
> > > > operations per stats dump.
> > > > 
> > > > That is what the RCU lock allows us to avoid.
> > > > 
> > > 
> > > dev_hold() and dev_put() are actually per cpu increment and
> > > decrements,
> > > pretty cheap these days.
> > > 
> > 
> > Yes, i am adding only 2 cpu instructions here.
> > I think the trade-off here is too small and the price to finally
> > have
> > get_stats64 called from non atomic context is really worth it.
> 
> Oh... but you have not mentioned this reason in your changelog.
> 

from the commit message:

"This is really greedy and demanding from device drivers since
ndo_get_stats64 called from dev_seq_show while the rcu lock is held"

sorry if this wasn't clear enough I will fix this if this goes through.

> What about bonding stats ?
> 
> By sending partial patches like that, others have to take care of the
> details
> and this is not really acceptable.
> 

This is a RFC just to show the approach, if the approach is acceptable
of course i will provide a full series that will handle all other
places, the change should be the same, I already have 2 other patches
to address ovs and netlink stats, i just didn't want to waste your time
on small details like netlink messages.

> > 
> > It  looks really odd to me that the device chain locks are held for
> > such long periods, while we already have the means to avoid this,
> > same
> > goes for rtnl_lock, same trick can work here for many use cases and
> > many ndos, we are just over protective for no reasons.
> > 
> > 
> > > Problem here is that any preemption of the process holding device
> > > reference
> > > might trigger warnings in device unregister.
> > > 
> > 
> > This is true for any other place dev_hold is used,
> > as explained in the commit message dev_hold is used for a very
> > brief
> > moment before calling the stats ndo and released directly after.
> 
> Not really.
> 
> Other places usually have notifiers to remove the refcount when
> needed.
> 

Other places hold the netdev for the whole lifetime of the netdev, they
don't know when to release it, this is why we need notifiers.

in this patch the approach is:

hold
call ndo
put

notifier is not needed in this case.

> We worked quite hard in the past to remove extra dev_hold()
> (before we finally converted it to percpu counter)
> 
> > 
> > looking at 
> > 
> > netdev_wait_allrefs(...)
> > [...]
> > msleep(250);
> > 
> > refcnt = netdev_refcnt_read(dev);
> > 
> > if (time_after(jiffies, warning_time + 10 * HZ)) {
> > pr_emerg("unregister_netdevice: waiting for %s to
> > become free. Usage count = %d\n",
> >  dev->name, refcnt);
> > warning_time = jiffies;
> > }
> > 
> > The holder will get enough time to release the netdev way before
> > the
> > warning is triggered.
> > 
> > The warning is triggered only if someone holds the dev for more
> > than 10
> > seconds which is impossible for the stats ndo to take more than
> > this,
> > in fact i just did a quick measurement and it seems that in average
> > get_stats64 ndo takes 0.5us !
> 
> Average is nice, but what about max time ?
> 

Well if we allow devices to access HW counters via FW command
interfaces in ndo_get_stats and by testing mlx5 where we query up to 5
hw registers, it could take 100us, still this is way smaller than 10sec
 :) and it is really a nice rate to fetch HW stats on demand.

> Sleeping more than 10 seconds to satify GFP_KERNEL memory allocation
> can definitely
> happen in the real world, or simply if you get preempted by some
> RT/high prio tasks.
> 

Same issue can occur without this patch if an IRQ is triggered under
while under rcu lock.
And RT/high prio task shouldn't take more than 10sec.

In case GFP_KERNEL memory allocation takes more than 10sec you will
already get tons of OOM warnings.
Having a netdev warning in case of really the netdev is being
unregistered and ndo_get_stats was preempted for more than 10 seconds
will be the least of your problems.

> Just say no to 'might sleep in ndo_get_stats()', and you will save a
> lot of issues.

We are just being over protective here.

Just say no to 'might sleep in ndo_get_stats()' is by itself creating a
lot of issues, many drivers out there have a background thread running
N times a second just to cache HW stats. which is really silly and CPU
consuming for no reason.

The rate of ndo_get_stats should be equal to the rate the driver can
actually provide fresh stats, any background thread is just a a waste
of CPU. Counters should be fetched from HW on demand rather than
periodically for no reason.

The same goes for set_rx_mode ndo.

Bottom line it looks like the need for rcu 

Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc

2018-04-11 Thread Jesus Sanchez-Palencia
Hi,

On 04/11/2018 01:16 PM, Thomas Gleixner wrote:
 Putting it all together, we end up with:

 1) a new txtime aware qdisc, tbs, to be used per queue. Its cli will look 
 like:
 $ tc qdisc add (...) tbs clockid CLOCK_REALTIME delta 15 offload 
 sorting
>>>
>>> Why CLOCK_REALTIME? The only interesting time in a TSN network is
>>> CLOCK_TAI, really.
>>
>> REALTIME was just an example here to show that the qdisc has to be configured
>> with a clockid parameter. Are you suggesting that instead both of the new 
>> qdiscs
>> (i.e. tbs and taprio) should always be using CLOCK_TAI implicitly?
> 
> I think so. It's _the_ network time on which everything is based on.

Yes, but more on this below.


> 
 2) a new cmsg-interface for setting a per-packet timestamp that will be 
 used
 either as a txtime or as deadline by tbs (and further the NIC driver for 
 the
 offlaod case): SCM_TXTIME.

 3) a new socket option: SO_TXTIME. It will be used to enable the feature 
 for a
 socket, and will have as parameters a clockid and a txtime mode (deadline 
 or
 explicit), that defines the semantics of the timestamp set on packets using
 SCM_TXTIME.

 4) a new #define DYNAMIC_CLOCKID 15 added to include/uapi/linux/time.h .
>>>
>>> Can you remind me why we would need that?
>>
>> So there is a "clockid" that can be used for the full hw offload modes. On 
>> this
>> case, the txtimes are in reference to the NIC's PTP clock, and, as 
>> discussed, we
>> can't just use a clockid that was computed from the fd pointing to /dev/ptpX 
>> .
> 
> And the NICs PTP clock is CLOCK_TAI, so there should be no reason to have
> yet another clock, right?

Just breaking this down a bit, yes, TAI is the network time base, and the NICs
PTP clock use that because PTP is (commonly) based on TAI. After the PHCs have
been synchronized over the network (e.g. with ptp4l), my understanding is that
if applications want to use the clockid_t CLOCK_TAI as a network clock reference
it's required that something (i.e. phc2sys) is synchronizing the PHCs and the
system clock, and also that something calls adjtime to apply the TAI vs UTC
offset to CLOCK_TAI.

If we are fine with those 'dependencies', then I agree there is no need for
another clock.

I was thinking about the full offload use-cases, thus when no scheduling is
happening inside the qdiscs. Applications could just read the time from the PHC
clocks directly without having to rely on any of the above. On this case,
userspace would use DYNAMIC_CLOCK just to flag that this is the case, but I must
admit it's not clear to me how common of a use-case that is, or even if it makes
sense.


Thanks,
Jesus


> 
> Thanks,
> 
>   tglx
> 


[net 1/1] tipc: fix missing initializer in tipc_sendmsg()

2018-04-11 Thread Jon Maloy
The stack variable 'dnode' in __tipc_sendmsg() may theoretically
end up tipc_node_get_mtu() as an unitilalized variable.

We fix this by intializing the variable at declaration. We also add
a default else clause to the two conditional ones already there, so
that we never end up in the named function if the given address
type is illegal.

Reported-by: syzbot+b0975ce9355b347c1...@syzkaller.appspotmail.com
Signed-off-by: Jon Maloy 
---
 net/tipc/socket.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 1fd1c8b..252a52ae 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -1278,7 +1278,7 @@ static int __tipc_sendmsg(struct socket *sock, struct 
msghdr *m, size_t dlen)
struct tipc_msg *hdr = >phdr;
struct tipc_name_seq *seq;
struct sk_buff_head pkts;
-   u32 dnode, dport;
+   u32 dport, dnode = 0;
u32 type, inst;
int mtu, rc;
 
@@ -1348,6 +1348,8 @@ static int __tipc_sendmsg(struct socket *sock, struct 
msghdr *m, size_t dlen)
msg_set_destnode(hdr, dnode);
msg_set_destport(hdr, dest->addr.id.ref);
msg_set_hdr_sz(hdr, BASIC_H_SIZE);
+   } else {
+   return -EINVAL;
}
 
/* Block or return if destination link is congested */
-- 
2.1.4



Re: [PATCH net-next 1/5] virtio: Add support for SCTP checksum offloading

2018-04-11 Thread Michael S. Tsirkin
On Mon, Apr 02, 2018 at 09:40:02AM -0400, Vladislav Yasevich wrote:
> To support SCTP checksum offloading, we need to add a new feature
> to virtio_net, so we can negotiate support between the hypervisor
> and the guest.
> 
> The signalling to the guest that an alternate checksum needs to
> be used is done via a new flag in the virtio_net_hdr.  If the
> flag is set, the host will know to perform an alternate checksum
> calculation, which right now is only CRC32c.
> 
> Signed-off-by: Vladislav Yasevich 
> ---
>  drivers/net/virtio_net.c| 11 ---
>  include/linux/virtio_net.h  |  6 ++
>  include/uapi/linux/virtio_net.h |  2 ++
>  3 files changed, 16 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 7b187ec..b601294 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -2724,9 +2724,14 @@ static int virtnet_probe(struct virtio_device *vdev)
>   /* Do we support "hardware" checksums? */
>   if (virtio_has_feature(vdev, VIRTIO_NET_F_CSUM)) {
>   /* This opens up the world of extra features. */
> - dev->hw_features |= NETIF_F_HW_CSUM | NETIF_F_SG;
> + netdev_features_t sctp = 0;
> +
> + if (virtio_has_feature(vdev, VIRTIO_NET_F_SCTP_CSUM))
> + sctp |= NETIF_F_SCTP_CRC;
> +
> + dev->hw_features |= NETIF_F_HW_CSUM | NETIF_F_SG | sctp;
>   if (csum)
> - dev->features |= NETIF_F_HW_CSUM | NETIF_F_SG;
> + dev->features |= NETIF_F_HW_CSUM | NETIF_F_SG | sctp;
>  
>   if (virtio_has_feature(vdev, VIRTIO_NET_F_GSO)) {
>   dev->hw_features |= NETIF_F_TSO
> @@ -2952,7 +2957,7 @@ static struct virtio_device_id id_table[] = {
>  };
>  
>  #define VIRTNET_FEATURES \
> - VIRTIO_NET_F_CSUM, VIRTIO_NET_F_GUEST_CSUM, \
> + VIRTIO_NET_F_CSUM, VIRTIO_NET_F_GUEST_CSUM,  VIRTIO_NET_F_SCTP_CSUM, \
>   VIRTIO_NET_F_MAC, \
>   VIRTIO_NET_F_HOST_TSO4, VIRTIO_NET_F_HOST_UFO, VIRTIO_NET_F_HOST_TSO6, \
>   VIRTIO_NET_F_HOST_ECN, VIRTIO_NET_F_GUEST_TSO4, 
> VIRTIO_NET_F_GUEST_TSO6, \
> diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
> index f144216..2e7a64a 100644
> --- a/include/linux/virtio_net.h
> +++ b/include/linux/virtio_net.h
> @@ -39,6 +39,9 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb,
>  
>   if (!skb_partial_csum_set(skb, start, off))
>   return -EINVAL;
> +
> + if (hdr->flags & VIRTIO_NET_HDR_F_CSUM_NOT_INET)
> + skb->csum_not_inet = 1;
>   }
>  
>   if (hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
> @@ -96,6 +99,9 @@ static inline int virtio_net_hdr_from_skb(const struct 
> sk_buff *skb,
>   hdr->flags = VIRTIO_NET_HDR_F_DATA_VALID;
>   } /* else everything is zero */
>  
> + if (skb->csum_not_inet)
> + hdr->flags &= VIRTIO_NET_HDR_F_CSUM_NOT_INET;
> +
>   return 0;
>  }
>  
> diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
> index 5de6ed3..3f279c8 100644
> --- a/include/uapi/linux/virtio_net.h
> +++ b/include/uapi/linux/virtio_net.h
> @@ -36,6 +36,7 @@
>  #define VIRTIO_NET_F_GUEST_CSUM  1   /* Guest handles pkts w/ 
> partial csum */
>  #define VIRTIO_NET_F_CTRL_GUEST_OFFLOADS 2 /* Dynamic offload configuration. 
> */
>  #define VIRTIO_NET_F_MTU 3   /* Initial MTU advice */
> +#define VIRTIO_NET_F_SCTP_CSUM  4/* SCTP checksum offload support */
>  #define VIRTIO_NET_F_MAC 5   /* Host has given MAC address. */
>  #define VIRTIO_NET_F_GUEST_TSO4  7   /* Guest can handle TSOv4 in. */
>  #define VIRTIO_NET_F_GUEST_TSO6  8   /* Guest can handle TSOv6 in. */

Is this a guest or a host checksum? We should differenciate between the
two.


> @@ -101,6 +102,7 @@ struct virtio_net_config {
>  struct virtio_net_hdr_v1 {
>  #define VIRTIO_NET_HDR_F_NEEDS_CSUM  1   /* Use csum_start, csum_offset 
> */
>  #define VIRTIO_NET_HDR_F_DATA_VALID  2   /* Csum is valid */
> +#define VIRTIO_NET_HDR_F_CSUM_NOT_INET  4   /* Checksum is not inet */
>   __u8 flags;
>  #define VIRTIO_NET_HDR_GSO_NONE  0   /* Not a GSO frame */
>  #define VIRTIO_NET_HDR_GSO_TCPV4 1   /* GSO frame, IPv4 TCP (TSO) */
> -- 
> 2.9.5


Re: [PATCH net-next 1/5] virtio: Add support for SCTP checksum offloading

2018-04-11 Thread Marcelo Ricardo Leitner
On Mon, Apr 02, 2018 at 09:40:02AM -0400, Vladislav Yasevich wrote:
> To support SCTP checksum offloading, we need to add a new feature
> to virtio_net, so we can negotiate support between the hypervisor
> and the guest.
>
> The signalling to the guest that an alternate checksum needs to
> be used is done via a new flag in the virtio_net_hdr.  If the
> flag is set, the host will know to perform an alternate checksum
> calculation, which right now is only CRC32c.
>
> Signed-off-by: Vladislav Yasevich 
> ---
>  drivers/net/virtio_net.c| 11 ---
>  include/linux/virtio_net.h  |  6 ++
>  include/uapi/linux/virtio_net.h |  2 ++
>  3 files changed, 16 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 7b187ec..b601294 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -2724,9 +2724,14 @@ static int virtnet_probe(struct virtio_device *vdev)
>   /* Do we support "hardware" checksums? */
>   if (virtio_has_feature(vdev, VIRTIO_NET_F_CSUM)) {
>   /* This opens up the world of extra features. */
> - dev->hw_features |= NETIF_F_HW_CSUM | NETIF_F_SG;
> + netdev_features_t sctp = 0;
> +
> + if (virtio_has_feature(vdev, VIRTIO_NET_F_SCTP_CSUM))
> + sctp |= NETIF_F_SCTP_CRC;
> +
> + dev->hw_features |= NETIF_F_HW_CSUM | NETIF_F_SG | sctp;
>   if (csum)
> - dev->features |= NETIF_F_HW_CSUM | NETIF_F_SG;
> + dev->features |= NETIF_F_HW_CSUM | NETIF_F_SG | sctp;
>
>   if (virtio_has_feature(vdev, VIRTIO_NET_F_GSO)) {
>   dev->hw_features |= NETIF_F_TSO
> @@ -2952,7 +2957,7 @@ static struct virtio_device_id id_table[] = {
>  };
>
>  #define VIRTNET_FEATURES \
> - VIRTIO_NET_F_CSUM, VIRTIO_NET_F_GUEST_CSUM, \
> + VIRTIO_NET_F_CSUM, VIRTIO_NET_F_GUEST_CSUM,  VIRTIO_NET_F_SCTP_CSUM, \
>   VIRTIO_NET_F_MAC, \
>   VIRTIO_NET_F_HOST_TSO4, VIRTIO_NET_F_HOST_UFO, VIRTIO_NET_F_HOST_TSO6, \
>   VIRTIO_NET_F_HOST_ECN, VIRTIO_NET_F_GUEST_TSO4, 
> VIRTIO_NET_F_GUEST_TSO6, \
> diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
> index f144216..2e7a64a 100644
> --- a/include/linux/virtio_net.h
> +++ b/include/linux/virtio_net.h
> @@ -39,6 +39,9 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb,
>
>   if (!skb_partial_csum_set(skb, start, off))
>   return -EINVAL;
> +
> + if (hdr->flags & VIRTIO_NET_HDR_F_CSUM_NOT_INET)
> + skb->csum_not_inet = 1;
>   }
>
>   if (hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
> @@ -96,6 +99,9 @@ static inline int virtio_net_hdr_from_skb(const struct 
> sk_buff *skb,
>   hdr->flags = VIRTIO_NET_HDR_F_DATA_VALID;
>   } /* else everything is zero */
>
> + if (skb->csum_not_inet)
> + hdr->flags &= VIRTIO_NET_HDR_F_CSUM_NOT_INET;
Shouldn't this be  |=  instead?

> +
>   return 0;
>  }
>
> diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
> index 5de6ed3..3f279c8 100644
> --- a/include/uapi/linux/virtio_net.h
> +++ b/include/uapi/linux/virtio_net.h
> @@ -36,6 +36,7 @@
>  #define VIRTIO_NET_F_GUEST_CSUM  1   /* Guest handles pkts w/ 
> partial csum */
>  #define VIRTIO_NET_F_CTRL_GUEST_OFFLOADS 2 /* Dynamic offload configuration. 
> */
>  #define VIRTIO_NET_F_MTU 3   /* Initial MTU advice */
> +#define VIRTIO_NET_F_SCTP_CSUM  4/* SCTP checksum offload support */
>  #define VIRTIO_NET_F_MAC 5   /* Host has given MAC address. */
>  #define VIRTIO_NET_F_GUEST_TSO4  7   /* Guest can handle TSOv4 in. */
>  #define VIRTIO_NET_F_GUEST_TSO6  8   /* Guest can handle TSOv6 in. */
> @@ -101,6 +102,7 @@ struct virtio_net_config {
>  struct virtio_net_hdr_v1 {
>  #define VIRTIO_NET_HDR_F_NEEDS_CSUM  1   /* Use csum_start, csum_offset 
> */
>  #define VIRTIO_NET_HDR_F_DATA_VALID  2   /* Csum is valid */
> +#define VIRTIO_NET_HDR_F_CSUM_NOT_INET  4   /* Checksum is not inet */
>   __u8 flags;
>  #define VIRTIO_NET_HDR_GSO_NONE  0   /* Not a GSO frame */
>  #define VIRTIO_NET_HDR_GSO_TCPV4 1   /* GSO frame, IPv4 TCP (TSO) */
> --
> 2.9.5
>


Re: [RFC net-next 1/2] net: net-porcfs: Reduce rcu lock critical section

2018-04-11 Thread Eric Dumazet


On 04/11/2018 11:59 AM, Saeed Mahameed wrote:
> On Tue, 2018-04-10 at 13:35 -0700, Eric Dumazet wrote:
>>
>> On 04/10/2018 10:16 AM, David Miller wrote:
>>>
>>> The tradeoff here is that now you are doing two unnecessary atomic
>>> operations per stats dump.
>>>
>>> That is what the RCU lock allows us to avoid.
>>>
>>
>> dev_hold() and dev_put() are actually per cpu increment and
>> decrements,
>> pretty cheap these days.
>>
> 
> Yes, i am adding only 2 cpu instructions here.
> I think the trade-off here is too small and the price to finally have
> get_stats64 called from non atomic context is really worth it.

Oh... but you have not mentioned this reason in your changelog.

What about bonding stats ?

By sending partial patches like that, others have to take care of the details
and this is not really acceptable.

> 
> It  looks really odd to me that the device chain locks are held for
> such long periods, while we already have the means to avoid this, same
> goes for rtnl_lock, same trick can work here for many use cases and
> many ndos, we are just over protective for no reasons.
> 
> 
>> Problem here is that any preemption of the process holding device
>> reference
>> might trigger warnings in device unregister.
>>
> 
> This is true for any other place dev_hold is used,
> as explained in the commit message dev_hold is used for a very brief
> moment before calling the stats ndo and released directly after.

Not really.

Other places usually have notifiers to remove the refcount when needed.

We worked quite hard in the past to remove extra dev_hold()
(before we finally converted it to percpu counter)

> 
> looking at 
> 
> netdev_wait_allrefs(...)
> [...]
>   msleep(250);
> 
>   refcnt = netdev_refcnt_read(dev);
> 
>   if (time_after(jiffies, warning_time + 10 * HZ)) {
>   pr_emerg("unregister_netdevice: waiting for %s to
> become free. Usage count = %d\n",
>dev->name, refcnt);
>   warning_time = jiffies;
>   }
> 
> The holder will get enough time to release the netdev way before the
> warning is triggered.
> 
> The warning is triggered only if someone holds the dev for more than 10
> seconds which is impossible for the stats ndo to take more than this,
> in fact i just did a quick measurement and it seems that in average
> get_stats64 ndo takes 0.5us !

Average is nice, but what about max time ?

Sleeping more than 10 seconds to satify GFP_KERNEL memory allocation can 
definitely
happen in the real world, or simply if you get preempted by some RT/high prio 
tasks.

Just say no to 'might sleep in ndo_get_stats()', and you will save a lot of 
issues.


Re: [PATCH bpf-next v8 05/11] seccomp,landlock: Enforce Landlock programs per process hierarchy

2018-04-11 Thread Mickaël Salaün

On 04/10/2018 06:48 AM, Alexei Starovoitov wrote:
> On Mon, Apr 09, 2018 at 12:01:59AM +0200, Mickaël Salaün wrote:
>>
>> On 04/08/2018 11:06 PM, Andy Lutomirski wrote:
>>> On Sun, Apr 8, 2018 at 6:13 AM, Mickaël Salaün  wrote:

 On 02/27/2018 10:48 PM, Mickaël Salaün wrote:
>
> On 27/02/2018 17:39, Andy Lutomirski wrote:
>> On Tue, Feb 27, 2018 at 5:32 AM, Alexei Starovoitov
>>  wrote:
>>> On Tue, Feb 27, 2018 at 05:20:55AM +, Andy Lutomirski wrote:
 On Tue, Feb 27, 2018 at 4:54 AM, Alexei Starovoitov
  wrote:
> On Tue, Feb 27, 2018 at 04:40:34AM +, Andy Lutomirski wrote:
>> On Tue, Feb 27, 2018 at 2:08 AM, Alexei Starovoitov
>>  wrote:
>>> On Tue, Feb 27, 2018 at 01:41:15AM +0100, Mickaël Salaün wrote:
 The seccomp(2) syscall can be used by a task to apply a Landlock 
 program
 to itself. As a seccomp filter, a Landlock program is enforced for 
 the
 current task and all its future children. A program is immutable 
 and a
 task can only add new restricting programs to itself, forming a 
 list of
 programss.

 A Landlock program is tied to a Landlock hook. If the action on a 
 kernel
 object is allowed by the other Linux security mechanisms (e.g. DAC,
 capabilities, other LSM), then a Landlock hook related to this 
 kind of
 object is triggered. The list of programs for this hook is then
 evaluated. Each program return a 32-bit value which can deny the 
 action
 on a kernel object with a non-zero value. If every programs of the 
 list
 return zero, then the action on the object is allowed.

 Multiple Landlock programs can be chained to share a 64-bits value 
 for a
 call chain (e.g. evaluating multiple elements of a file path).  
 This
 chaining is restricted when a process construct this chain by 
 loading a
 program, but additional checks are performed when it requests to 
 apply
 this chain of programs to itself.  The restrictions ensure that it 
 is
 not possible to call multiple programs in a way that would imply to
 handle multiple shared values (i.e. cookies) for one chain.  For 
 now,
 only a fs_pick program can be chained to the same type of program,
 because it may make sense if they have different triggers (cf. next
 commits).  This restrictions still allows to reuse Landlock 
 programs in
 a safe way (e.g. use the same loaded fs_walk program with multiple
 chains of fs_pick programs).

 Signed-off-by: Mickaël Salaün 
>>>
>>> ...
>>>
 +struct landlock_prog_set *landlock_prepend_prog(
 + struct landlock_prog_set *current_prog_set,
 + struct bpf_prog *prog)
 +{
 + struct landlock_prog_set *new_prog_set = current_prog_set;
 + unsigned long pages;
 + int err;
 + size_t i;
 + struct landlock_prog_set tmp_prog_set = {};
 +
 + if (prog->type != BPF_PROG_TYPE_LANDLOCK_HOOK)
 + return ERR_PTR(-EINVAL);
 +
 + /* validate memory size allocation */
 + pages = prog->pages;
 + if (current_prog_set) {
 + size_t i;
 +
 + for (i = 0; i < 
 ARRAY_SIZE(current_prog_set->programs); i++) {
 + struct landlock_prog_list *walker_p;
 +
 + for (walker_p = 
 current_prog_set->programs[i];
 + walker_p; walker_p = 
 walker_p->prev)
 + pages += walker_p->prog->pages;
 + }
 + /* count a struct landlock_prog_set if we need to 
 allocate one */
 + if (refcount_read(_prog_set->usage) != 1)
 + pages += round_up(sizeof(*current_prog_set), 
 PAGE_SIZE)
 + / PAGE_SIZE;
 + }
 + if (pages > LANDLOCK_PROGRAMS_MAX_PAGES)
 + return ERR_PTR(-E2BIG);
 +
 + /* ensure early that we can allocate enough memory for the 
 new

[PATCH net] strparser: Fix incorrect strp->need_bytes value.

2018-04-11 Thread Doron Roberts-Kedes
strp_data_ready resets strp->need_bytes to 0 if strp_peek_len indicates
that the remainder of the message has been received. However,
do_strp_work does not reset strp->need_bytes to 0. If do_strp_work
completes a partial message, the value of strp->need_bytes will continue
to reflect the needed bytes of the previous message, causing
future invocations of strp_data_ready to return early if
strp->need_bytes is less than strp_peek_len. Resetting strp->need_bytes
to 0 in __strp_recv on handing a full message to the upper layer solves
this problem.

__strp_recv also calculates strp->need_bytes using stm->accum_len before
stm->accum_len has been incremented by cand_len. This can cause
strp->need_bytes to be equal to the full length of the message instead
of the full length minus the accumulated length. This, in turn, causes
strp_data_ready to return early, even when there is sufficient data to
complete the partial message. Incrementing stm->accum_len before using
it to calculate strp->need_bytes solves this problem.

Found while testing net/tls_sw recv path.

Fixes: 43a0c6751a322847 ("strparser: Stream parser for messages")
Signed-off-by: Doron Roberts-Kedes 
---
 net/strparser/strparser.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/net/strparser/strparser.c b/net/strparser/strparser.c
index b9283ce..805b139 100644
--- a/net/strparser/strparser.c
+++ b/net/strparser/strparser.c
@@ -296,9 +296,9 @@ static int __strp_recv(read_descriptor_t *desc, struct 
sk_buff *orig_skb,
strp_start_timer(strp, timeo);
}
 
+   stm->accum_len += cand_len;
strp->need_bytes = stm->strp.full_len -
   stm->accum_len;
-   stm->accum_len += cand_len;
stm->early_eaten = cand_len;
STRP_STATS_ADD(strp->stats.bytes, cand_len);
desc->count = 0; /* Stop reading socket */
@@ -321,6 +321,7 @@ static int __strp_recv(read_descriptor_t *desc, struct 
sk_buff *orig_skb,
/* Hurray, we have a new message! */
cancel_delayed_work(>msg_timer_work);
strp->skb_head = NULL;
+   strp->need_bytes = 0;
STRP_STATS_INCR(strp->stats.msgs);
 
/* Give skb to upper layer */
@@ -410,9 +411,7 @@ void strp_data_ready(struct strparser *strp)
return;
 
if (strp->need_bytes) {
-   if (strp_peek_len(strp) >= strp->need_bytes)
-   strp->need_bytes = 0;
-   else
+   if (strp_peek_len(strp) < strp->need_bytes)
return;
}
 
-- 
2.9.5



[GIT] Networking

2018-04-11 Thread David Miller

1) In ip_gre tunnel, handle the conflict between TUNNEL_{SEQ,CSUM} and
   GSO/LLTX properly.  From Sabrina Dubroca.

2) Stop properly on error in lan78xx_read_otp(), from Phil Elwell.

3) Don't uncompress in slip before rstate is initialized, from Tejaswi
   Tanikella.

4) When using 1.x firmware on aquantia, issue a deinit before we
   hardware reset the chip, otherwise we break dirty wake WOL.  From
   Igor Russkikh.

5) Correct log check in vhost_vq_access_ok(), from Stefan Hajnoczi.

6) Fix ethtool -x crashes in bnxt_en, from Michael Chan.

7) Fix races in l2tp tunnel creation and duplicate tunnel detection,
   from Guillaume Nault.

Please pull, thanks a lot!

The following changes since commit c18bb396d3d261ebbb4efbc05129c5d354c541e4:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (2018-04-09 
17:04:10 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git 

for you to fetch changes up to 0c84cee8f131a090f77f5a3dea5d6a7bd99c00db:

  Merge branch 'l2tp-tunnel-creation-fixes' (2018-04-11 17:41:28 -0400)


Andy Gospodarek (1):
  bnxt_en: do not allow wildcard matches for L2 flows

Bassem Boubaker (1):
  cdc_ether: flag the Cinterion AHS8 modem by gemalto as WWAN

David S. Miller (4):
  Merge branch 'Aquantia-atlantic-critical-fixes-04-2018'
  Merge branch 'vhost-fix-vhost_vq_access_ok-log-check'
  Merge branch 'bnxt_en-Fixes-for-net'
  Merge branch 'l2tp-tunnel-creation-fixes'

Eric Auger (1):
  vhost: Fix vhost_copy_to_user()

Guillaume Nault (2):
  l2tp: fix races in tunnel creation
  l2tp: fix race in duplicate tunnel detection

Igor Russkikh (2):
  net: aquantia: Regression on reset with 1.x firmware
  net: aquantia: oops when shutdown on already stopped device

Ka-Cheong Poon (1):
  rds: MP-RDS may use an invalid c_path

Michael Chan (3):
  bnxt_en: Fix ethtool -x crash when device is down.
  bnxt_en: Need to include RDMA rings in bnxt_check_rings().
  bnxt_en: Fix NULL pointer dereference at bnxt_free_irq().

Phil Elwell (3):
  lan78xx: Correctly indicate invalid OTP
  lan78xx: Avoid spurious kevent 4 "error"
  lan78xx: Don't reset the interface on open

Sabrina Dubroca (3):
  ip_gre: clear feature flags when incompatible o_flags are set
  tun: set the flags before registering the netdevice
  tun: send netlink notification when the device is modified

Sriharsha Basavapatna (2):
  bnxt_en: Ignore src port field in decap filter nodes
  bnxt_en: Support max-mtu with VF-reps

Stefan Hajnoczi (2):
  vhost: fix vhost_vq_access_ok() log check
  vhost: return bool from *_access_ok() functions

Tejaswi Tanikella (1):
  slip: Check if rstate is initialized before uncompressing

 drivers/net/ethernet/aquantia/atlantic/aq_nic.c  |   8 +-
 drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils.c |  16 
 drivers/net/ethernet/broadcom/bnxt/bnxt.c|   4 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c|  11 ++-
 drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c |  63 
+++-
 drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c|  30 
 drivers/net/slip/slhc.c  |   5 ++
 drivers/net/tun.c|  33 ++--
 drivers/net/usb/cdc_ether.c  |   6 ++
 drivers/net/usb/lan78xx.c|   9 +--
 drivers/vhost/vhost.c|  72 
+-
 drivers/vhost/vhost.h|   4 +-
 include/net/slhc_vj.h|   1 +
 net/ipv4/ip_gre.c|   6 ++
 net/l2tp/l2tp_core.c | 225 
---
 net/l2tp/l2tp_core.h |   4 +-
 net/l2tp/l2tp_netlink.c  |  22 +++---
 net/l2tp/l2tp_ppp.c  |   9 +++
 net/rds/send.c   |  15 ++--
 19 files changed, 345 insertions(+), 198 deletions(-)


Re: [PATCH net] net: validate attribute sizes in neigh_dump_table()

2018-04-11 Thread David Ahern
On 4/11/18 3:46 PM, Eric Dumazet wrote:
> Since neigh_dump_table() calls nlmsg_parse() without giving policy
> constraints, attributes can have arbirary size that we must validate
> 

...

> 
> Fixes: 21fdd092acc7 ("net: Add support for filtering neigh dump by master 
> device")
> Signed-off-by: Eric Dumazet 
> Cc: David Ahern 
> Reported-by: syzbot 
> ---
>  net/core/neighbour.c | 12 
>  1 file changed, 8 insertions(+), 4 deletions(-)
> 

Acked-by: David Ahern 

Thanks for fixing, Eric.


[PATCH net] net: validate attribute sizes in neigh_dump_table()

2018-04-11 Thread Eric Dumazet
Since neigh_dump_table() calls nlmsg_parse() without giving policy
constraints, attributes can have arbirary size that we must validate

Reported by syzbot/KMSAN :

BUG: KMSAN: uninit-value in neigh_master_filtered net/core/neighbour.c:2292 
[inline]
BUG: KMSAN: uninit-value in neigh_dump_table net/core/neighbour.c:2348 [inline]
BUG: KMSAN: uninit-value in neigh_dump_info+0x1af0/0x2250 
net/core/neighbour.c:2438
CPU: 1 PID: 3575 Comm: syzkaller268891 Not tainted 4.16.0+ #83
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x185/0x1d0 lib/dump_stack.c:53
 kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
 __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
 neigh_master_filtered net/core/neighbour.c:2292 [inline]
 neigh_dump_table net/core/neighbour.c:2348 [inline]
 neigh_dump_info+0x1af0/0x2250 net/core/neighbour.c:2438
 netlink_dump+0x9ad/0x1540 net/netlink/af_netlink.c:2225
 __netlink_dump_start+0x1167/0x12a0 net/netlink/af_netlink.c:2322
 netlink_dump_start include/linux/netlink.h:214 [inline]
 rtnetlink_rcv_msg+0x1435/0x1560 net/core/rtnetlink.c:4598
 netlink_rcv_skb+0x355/0x5f0 net/netlink/af_netlink.c:2447
 rtnetlink_rcv+0x50/0x60 net/core/rtnetlink.c:4653
 netlink_unicast_kernel net/netlink/af_netlink.c:1311 [inline]
 netlink_unicast+0x1672/0x1750 net/netlink/af_netlink.c:1337
 netlink_sendmsg+0x1048/0x1310 net/netlink/af_netlink.c:1900
 sock_sendmsg_nosec net/socket.c:630 [inline]
 sock_sendmsg net/socket.c:640 [inline]
 ___sys_sendmsg+0xec0/0x1310 net/socket.c:2046
 __sys_sendmsg net/socket.c:2080 [inline]
 SYSC_sendmsg+0x2a3/0x3d0 net/socket.c:2091
 SyS_sendmsg+0x54/0x80 net/socket.c:2087
 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x43fed9
RSP: 002b:7ffddbee2798 EFLAGS: 0213 ORIG_RAX: 002e
RAX: ffda RBX: 004002c8 RCX: 0043fed9
RDX:  RSI: 20005000 RDI: 0003
RBP: 006ca018 R08: 004002c8 R09: 004002c8
R10: 004002c8 R11: 0213 R12: 00401800
R13: 00401890 R14:  R15: 

Uninit was created at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
 kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188
 kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314
 kmsan_slab_alloc+0x11/0x20 mm/kmsan/kmsan.c:321
 slab_post_alloc_hook mm/slab.h:445 [inline]
 slab_alloc_node mm/slub.c:2737 [inline]
 __kmalloc_node_track_caller+0xaed/0x11c0 mm/slub.c:4369
 __kmalloc_reserve net/core/skbuff.c:138 [inline]
 __alloc_skb+0x2cf/0x9f0 net/core/skbuff.c:206
 alloc_skb include/linux/skbuff.h:984 [inline]
 netlink_alloc_large_skb net/netlink/af_netlink.c:1183 [inline]
 netlink_sendmsg+0x9a6/0x1310 net/netlink/af_netlink.c:1875
 sock_sendmsg_nosec net/socket.c:630 [inline]
 sock_sendmsg net/socket.c:640 [inline]
 ___sys_sendmsg+0xec0/0x1310 net/socket.c:2046
 __sys_sendmsg net/socket.c:2080 [inline]
 SYSC_sendmsg+0x2a3/0x3d0 net/socket.c:2091
 SyS_sendmsg+0x54/0x80 net/socket.c:2087
 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Fixes: 21fdd092acc7 ("net: Add support for filtering neigh dump by master 
device")
Signed-off-by: Eric Dumazet 
Cc: David Ahern 
Reported-by: syzbot 
---
 net/core/neighbour.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 
7b7a14abba28e2b77c6448f1c3d151287afc79ad..a8bc02bb339f9f4c914ae7b23408cd5ccc8b3b8e
 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -2323,12 +2323,16 @@ static int neigh_dump_table(struct neigh_table *tbl, 
struct sk_buff *skb,
 
err = nlmsg_parse(nlh, sizeof(struct ndmsg), tb, NDA_MAX, NULL, NULL);
if (!err) {
-   if (tb[NDA_IFINDEX])
+   if (tb[NDA_IFINDEX]) {
+   if (nla_len(tb[NDA_IFINDEX]) != sizeof(u32))
+   return -EINVAL;
filter_idx = nla_get_u32(tb[NDA_IFINDEX]);
-
-   if (tb[NDA_MASTER])
+   }
+   if (tb[NDA_MASTER]) {
+   if (nla_len(tb[NDA_MASTER]) != sizeof(u32))
+   return -EINVAL;
filter_master_idx = nla_get_u32(tb[NDA_MASTER]);
-
+   }
if (filter_idx || filter_master_idx)
flags |= NLM_F_DUMP_FILTERED;
}
-- 
2.17.0.484.g0c8726318c-goog



Re: [PATCH net 0/2] l2tp: tunnel creation fixes

2018-04-11 Thread David Miller
From: Guillaume Nault 
Date: Tue, 10 Apr 2018 21:01:07 +0200

> L2TP tunnel creation is racy. We need to make sure that the tunnel
> returned by l2tp_tunnel_create() isn't going to be freed while the
> caller is using it. This is done in patch #1, by separating tunnel
> creation from tunnel registration.
> 
> With the tunnel registration code in place, we can now check for
> duplicate tunnels in a race-free way. This is done in patch #2, which
> incidentally removes the last use of l2tp_tunnel_find().

Series applied and queued up for -stable, thanks.


[PATCH net] tcp: md5: reject TCP_MD5SIG or TCP_MD5SIG_EXT on established sockets

2018-04-11 Thread Eric Dumazet
syzbot/KMSAN reported an uninit-value in tcp_parse_options() [1]

I believe this was caused by a TCP_MD5SIG being set on live
flow.

This is highly unexpected, since TCP option space is limited.

For instance, presence of TCP MD5 option automatically disables
TCP TimeStamp option at SYN/SYNACK time, which we can not do
once flow has been established.

Really, adding/deleting an MD5 key only makes sense on sockets
in CLOSE or LISTEN state.

[1]
BUG: KMSAN: uninit-value in tcp_parse_options+0xd74/0x1a30 
net/ipv4/tcp_input.c:3720
CPU: 1 PID: 6177 Comm: syzkaller192004 Not tainted 4.16.0+ #83
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x185/0x1d0 lib/dump_stack.c:53
 kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
 __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
 tcp_parse_options+0xd74/0x1a30 net/ipv4/tcp_input.c:3720
 tcp_fast_parse_options net/ipv4/tcp_input.c:3858 [inline]
 tcp_validate_incoming+0x4f1/0x2790 net/ipv4/tcp_input.c:5184
 tcp_rcv_established+0xf60/0x2bb0 net/ipv4/tcp_input.c:5453
 tcp_v4_do_rcv+0x6cd/0xd90 net/ipv4/tcp_ipv4.c:1469
 sk_backlog_rcv include/net/sock.h:908 [inline]
 __release_sock+0x2d6/0x680 net/core/sock.c:2271
 release_sock+0x97/0x2a0 net/core/sock.c:2786
 tcp_sendmsg+0xd6/0x100 net/ipv4/tcp.c:1464
 inet_sendmsg+0x48d/0x740 net/ipv4/af_inet.c:764
 sock_sendmsg_nosec net/socket.c:630 [inline]
 sock_sendmsg net/socket.c:640 [inline]
 SYSC_sendto+0x6c3/0x7e0 net/socket.c:1747
 SyS_sendto+0x8a/0xb0 net/socket.c:1715
 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x448fe9
RSP: 002b:7fd472c64d38 EFLAGS: 0216 ORIG_RAX: 002c
RAX: ffda RBX: 006e5a30 RCX: 00448fe9
RDX: 029f RSI: 20a88f88 RDI: 0004
RBP: 006e5a34 R08: 20e68000 R09: 0010
R10: 27fd R11: 0216 R12: 
R13: 7fff074899ef R14: 7fd472c659c0 R15: 0009

Uninit was created at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
 kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188
 kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314
 kmsan_slab_alloc+0x11/0x20 mm/kmsan/kmsan.c:321
 slab_post_alloc_hook mm/slab.h:445 [inline]
 slab_alloc_node mm/slub.c:2737 [inline]
 __kmalloc_node_track_caller+0xaed/0x11c0 mm/slub.c:4369
 __kmalloc_reserve net/core/skbuff.c:138 [inline]
 __alloc_skb+0x2cf/0x9f0 net/core/skbuff.c:206
 alloc_skb include/linux/skbuff.h:984 [inline]
 tcp_send_ack+0x18c/0x910 net/ipv4/tcp_output.c:3624
 __tcp_ack_snd_check net/ipv4/tcp_input.c:5040 [inline]
 tcp_ack_snd_check net/ipv4/tcp_input.c:5053 [inline]
 tcp_rcv_established+0x2103/0x2bb0 net/ipv4/tcp_input.c:5469
 tcp_v4_do_rcv+0x6cd/0xd90 net/ipv4/tcp_ipv4.c:1469
 sk_backlog_rcv include/net/sock.h:908 [inline]
 __release_sock+0x2d6/0x680 net/core/sock.c:2271
 release_sock+0x97/0x2a0 net/core/sock.c:2786
 tcp_sendmsg+0xd6/0x100 net/ipv4/tcp.c:1464
 inet_sendmsg+0x48d/0x740 net/ipv4/af_inet.c:764
 sock_sendmsg_nosec net/socket.c:630 [inline]
 sock_sendmsg net/socket.c:640 [inline]
 SYSC_sendto+0x6c3/0x7e0 net/socket.c:1747
 SyS_sendto+0x8a/0xb0 net/socket.c:1715
 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Fixes: cfb6eeb4c860 ("[TCP]: MD5 Signature Option (RFC2385) support.")
Signed-off-by: Eric Dumazet 
Reported-by: syzbot 
---
 net/ipv4/tcp.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 
bccc4c2700870b8c7ff592a6bd27acebd9bc6471..4fa3f812b9ff8954a9b6a018c648ff12ab995721
 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2813,8 +2813,10 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 #ifdef CONFIG_TCP_MD5SIG
case TCP_MD5SIG:
case TCP_MD5SIG_EXT:
-   /* Read the IP->Key mappings from userspace */
-   err = tp->af_specific->md5_parse(sk, optname, optval, optlen);
+   if ((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN))
+   err = tp->af_specific->md5_parse(sk, optname, optval, 
optlen);
+   else
+   err = -EINVAL;
break;
 #endif
case TCP_USER_TIMEOUT:
-- 
2.17.0.484.g0c8726318c-goog



Re: [PATCH net-next] netns: filter uevents correctly

2018-04-11 Thread Eric W. Biederman
Christian Brauner  writes:

> On Wed, Apr 11, 2018 at 11:40:14AM -0500, Eric W. Biederman wrote:
>> Christian Brauner  writes:
>> > Yeah, agreed.
>> > But I think the patch is not complete. To guarantee that no non-initial
>> > user namespace actually receives uevents we need to:
>> > 1. only sent uevents to uevent sockets that are located in network
>> >namespaces that are owned by init_user_ns
>> > 2. filter uevents that are sent to sockets in mc_list that have opened a
>> >uevent socket that is owned by init_user_ns *from* a
>> >non-init_user_ns
>> >
>> > We account for 1. by only recording uevent sockets in the global uevent
>> > socket list who are owned by init_user_ns.
>> > But to account for 2. we need to filter by the user namespace who owns
>> > the socket in mc_list. So in addition to that we also need to slightly
>> > change the filter logic in kobj_bcast_filter() I think:
>> >
>> > diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
>> > index 22a2c1a98b8f..064d7d29ace5 100644
>> > --- a/lib/kobject_uevent.c
>> > +++ b/lib/kobject_uevent.c
>> > @@ -251,7 +251,8 @@ static int kobj_bcast_filter(struct sock *dsk, struct 
>> > sk_buff *skb, void *data)
>> >return sock_ns != ns;
>> >}
>> >  
>> > -  return 0;
>> > +  /* Check if socket was opened from non-initial user namespace. */
>> > +  return sk_user_ns(dsk) != _user_ns;
>> >  }
>> >  #endif
>> >  
>> >
>> > But correct me if I'm wrong.
>> 
>> You are worrying about NETLINK_LISTEN_ALL_NSID sockets.  That has
>> permissions and an explicit opt-in to receiving packets from multiple
>> network namespaces.
>
> I don't think that's what I'm talking about unless that is somehow the
> default for NETLINK_KOBJECT_UEVENT sockets. What I'm worried about is
> doing
>
> unshare -U --map-root
>
> then opening a NETLINK_KOBJECT_UEVENT socket and starting to listen to
> uevents. Imho, this should not be possible because I'm in a
> non-init_user_ns. But currently I'm able to - even with the patch to
> come - since the uevent socket in the kernel was created when init_net
> was created and hence is *owned* by the init_user_ns which means it is
> in the list of uevent sockets. Here's a demo of what I mean:
>
> https://asciinema.org/a/175632

Why do you care about this case?

Everyone is allowed to listen to the uevent netlink sockets with or
without user namespaces.  So there are no permission issues, and
this is not even a data information leak.

If you don't want programs in your user namespace to have access you
will be able to unshare the network namespace.

Eric


Re: [PATCH net v2 2/6] bnxt_en: do not allow wildcard matches for L2 flows

2018-04-11 Thread Jakub Kicinski
On Wed, 11 Apr 2018 13:55:11 -0700, Michael Chan wrote:
> On Wed, Apr 11, 2018 at 1:50 PM, Andy Gospodarek wrote:
> > On Wed, Apr 11, 2018 at 01:41:31PM -0700, Michael Chan wrote:  
> > True, but I'm not sure that tc_cls_common_offload is used in all cases.
> > Take red_offload() as one of those.  
> 
> For Flower, we know we have the extack pointer in
> tc_cls_common_offload struct and we can use it to set the netlink
> error message.  The point is that we don't have to modify
> ndo_setup_tc().

Yes, the extack is actually only populated when skip_sw is specified to
avoid warning users who don't care about offloads.

Flower offloads don't go via .ndo_setup_tc but TC block callbacks.  But
one day we will hopefully find a reasonable way to pass extack to qdisc
offloads as well..

FWIW your driver is actually already using extack under the veil of
tc_cls_can_offload_and_chain0() :)


Re: [PATCH net v2 0/6] bnxt_en: Fixes for net.

2018-04-11 Thread David Miller
From: Michael Chan 
Date: Wed, 11 Apr 2018 11:50:12 -0400

> This bug fix series include NULL pointer fixes in ethtool -x code path
> and in the error clean up path when freeing IRQs, a ring accounting bug
> that missed rings used by the RDMA driver, and 3 bug fixes related to TC
> Flower and VF-reps.
> 
> v2: Fixed commit message of patch 4.  Changed the pound sign to $ sign
> in front of the ip command.

Yep, that looks better.

Series applied, thanks Michael.


Re: [PATCH net v2 2/6] bnxt_en: do not allow wildcard matches for L2 flows

2018-04-11 Thread Jakub Kicinski
On Wed, 11 Apr 2018 11:50:14 -0400, Michael Chan wrote:
> @@ -764,6 +788,41 @@ static bool bnxt_tc_can_offload(struct bnxt *bp, struct 
> bnxt_tc_flow *flow)
>   return false;
>   }
>  
> + /* Currently source/dest MAC cannot be partial wildcard  */
> + if (bits_set(>l2_key.smac, sizeof(flow->l2_key.smac)) &&
> + !is_exactmatch(flow->l2_mask.smac, sizeof(flow->l2_mask.smac))) {
> + netdev_info(bp->dev, "Wildcard match unsupported for Source 
> MAC\n");

This wouldn't be something to do in net, but how do you feel about
using extack for messages like this?

> + return false;
> + }


BUG: corrupted list in team_nl_cmd_options_set

2018-04-11 Thread syzbot

Hello,

syzbot hit the following crash on upstream commit
b284d4d5a6785f8cd07eda2646a95782373cd01e (Tue Apr 10 19:25:30 2018 +)
Merge tag 'ceph-for-4.17-rc1' of git://github.com/ceph/ceph-client
syzbot dashboard link:  
https://syzkaller.appspot.com/bug?extid=4d4af685432dc0e56c91


C reproducer: https://syzkaller.appspot.com/x/repro.c?id=6161158629228544
syzkaller reproducer:  
https://syzkaller.appspot.com/x/repro.syz?id=5600380654190592
Raw console output:  
https://syzkaller.appspot.com/x/log.txt?id=4627738266697728
Kernel config:  
https://syzkaller.appspot.com/x/.config?id=-1223000601505858474

compiler: gcc (GCC) 8.0.1 20180301 (experimental)

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+4d4af685432dc0e56...@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed. See footer for  
details.

If you forward the report, please keep this part and the footer.

8021q: adding VLAN 0 to HW filter on device team0
netlink: 'syzkaller556835': attribute type 3 has an invalid length.
netlink: 'syzkaller556835': attribute type 3 has an invalid length.
list_add double add: new=04f859c0, prev=c9745291,  
next=04f859c0.

[ cut here ]
kernel BUG at lib/list_debug.c:31!
invalid opcode:  [#1] SMP KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 0 PID: 4466 Comm: syzkaller556835 Not tainted 4.16.0+ #17
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011

RIP: 0010:__list_add_valid+0xaa/0xb0 lib/list_debug.c:29
RSP: 0018:8801b04bf248 EFLAGS: 00010286
RAX: 0058 RBX: 8801c8fc7a90 RCX: 
RDX: 0058 RSI: 815fbf41 RDI: ed0036097e3f
RBP: 8801b04bf260 R08: 8801b0b2a700 R09: ed003b604f90
R10: ed003b604f90 R11: 8801db027c87 R12: 8801c8fc7a90
R13: 8801c8fc7a90 R14: dc00 R15: 
FS:  00b98880() GS:8801db00() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 0043fc30 CR3: 0001afe8e000 CR4: 001406f0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
 __list_add include/linux/list.h:60 [inline]
 list_add include/linux/list.h:79 [inline]
 team_nl_cmd_options_set+0x9ff/0x12b0 drivers/net/team/team.c:2571
 genl_family_rcv_msg+0x889/0x1120 net/netlink/genetlink.c:599
 genl_rcv_msg+0xc6/0x170 net/netlink/genetlink.c:624
 netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2448
 genl_rcv+0x28/0x40 net/netlink/genetlink.c:635
 netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
 netlink_unicast+0x58b/0x740 net/netlink/af_netlink.c:1336
 netlink_sendmsg+0x9f0/0xfa0 net/netlink/af_netlink.c:1901
 sock_sendmsg_nosec net/socket.c:629 [inline]
 sock_sendmsg+0xd5/0x120 net/socket.c:639
 ___sys_sendmsg+0x805/0x940 net/socket.c:2117
 __sys_sendmsg+0x115/0x270 net/socket.c:2155
 SYSC_sendmsg net/socket.c:2164 [inline]
 SyS_sendmsg+0x29/0x30 net/socket.c:2162
 do_syscall_64+0x29e/0x9d0 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x42/0xb7
RIP: 0033:0x4458b9
RSP: 002b:7ffd1d4a7278 EFLAGS: 0213 ORIG_RAX: 002e
RAX: ffda RBX: 001b RCX: 004458b9
RDX: 0010 RSI: 2d00 RDI: 0004
RBP: 004a74ed R08:  R09: 
R10:  R11: 0213 R12: 7ffd1d4a7348
R13: 00402a60 R14:  R15: 
Code: 75 e8 eb a9 48 89 f7 48 89 75 e8 e8 d1 85 7b fe 48 8b 75 e8 eb bb 48  
89 f2 48 89 d9 4c 89 e6 48 c7 c7 a0 84 d8 87 e8 ea 67 28 fe <0f> 0b 0f 1f  
40 00 48 b8 00 00 00 00 00 fc ff df 55 48 89 e5 41

RIP: __list_add_valid+0xaa/0xb0 lib/list_debug.c:29 RSP: 8801b04bf248
---[ end trace b4f71d7dd7ca6d10 ]---


---
This bug is generated by a dumb bot. It may contain errors.
See https://goo.gl/tpsmEJ for details.
Direct all questions to syzkal...@googlegroups.com.

syzbot will keep track of this bug report.
If you forgot to add the Reported-by tag, once the fix for this bug is  
merged

into any tree, please reply to this email with:
#syz fix: exact-commit-title
If you want to test a patch for this bug, please reply with:
#syz test: git://repo/address.git branch
and provide the patch inline or as an attachment.
To mark this as a duplicate of another syzbot report, please reply with:
#syz dup: exact-subject-of-another-report
If it's a one-off invalid bug report, please reply with:
#syz invalid
Note: if the crash happens again, it will cause creation of a new bug  
report.

Note: all commands must start from beginning of the line in the email body.


Re: [PATCH] net/mlx5: remove some extraneous spaces in indentations

2018-04-11 Thread David Miller
From: Saeed Mahameed 
Date: Wed, 11 Apr 2018 18:23:41 +

> Dave, if it is ok with you I would like to apply this into mlx5-next
> branch so it won't cause any issues with our next pull requests to rdma
> and netdev.

Ok.


Re: [RFC PATCH v2 00/14] Introducing AF_XDP support

2018-04-11 Thread Alexei Starovoitov

On 4/11/18 5:17 AM, Björn Töpel wrote:


In the current RFC you are required to create both an Rx and Tx queue
to bind the socket, which is just weird for your "Rx on one device, Tx
to another" scenario. I'll fix that in the next RFC.


I would defer on adding new features until the key functionality lands.
imo it's in good shape and I would submit it without RFC tag as soon as
net-next reopens.



[PATCH 4.4 004/190] x86/asm: Dont use RBP as a temporary register in csum_partial_copy_generic()

2018-04-11 Thread Greg Kroah-Hartman
4.4-stable review patch.  If anyone has any objections, please let me know.

--

From: Josh Poimboeuf 


[ Upstream commit 42fc6c6cb1662ba2fa727dd01c9473c63be4e3b6 ]

Andrey Konovalov reported the following warning while fuzzing the kernel
with syzkaller:

  WARNING: kernel stack regs at 8800686869f8 in a.out:4933 has bad 'bp' 
value c3fc855a10167ec0

The unwinder dump revealed that RBP had a bad value when an interrupt
occurred in csum_partial_copy_generic().

That function saves RBP on the stack and then overwrites it, using it as
a scratch register.  That's problematic because it breaks stack traces
if an interrupt occurs in the middle of the function.

Replace the usage of RBP with another callee-saved register (R15) so
stack traces are no longer affected.

Reported-by: Andrey Konovalov 
Tested-by: Andrey Konovalov 
Signed-off-by: Josh Poimboeuf 
Cc: Cong Wang 
Cc: David S . Miller 
Cc: Dmitry Vyukov 
Cc: Eric Dumazet 
Cc: Kostya Serebryany 
Cc: Linus Torvalds 
Cc: Marcelo Ricardo Leitner 
Cc: Neil Horman 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Vlad Yasevich 
Cc: linux-s...@vger.kernel.org
Cc: netdev 
Cc: syzkaller 
Link: 
http://lkml.kernel.org/r/4b03a961efda5ec9bfe46b7b9c9ad72d1efad343.1493909486.git.jpoim...@redhat.com
Signed-off-by: Ingo Molnar 
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/x86/lib/csum-copy_64.S |   12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

--- a/arch/x86/lib/csum-copy_64.S
+++ b/arch/x86/lib/csum-copy_64.S
@@ -55,7 +55,7 @@ ENTRY(csum_partial_copy_generic)
movq  %r12, 3*8(%rsp)
movq  %r14, 4*8(%rsp)
movq  %r13, 5*8(%rsp)
-   movq  %rbp, 6*8(%rsp)
+   movq  %r15, 6*8(%rsp)
 
movq  %r8, (%rsp)
movq  %r9, 1*8(%rsp)
@@ -74,7 +74,7 @@ ENTRY(csum_partial_copy_generic)
/* main loop. clear in 64 byte blocks */
/* r9: zero, r8: temp2, rbx: temp1, rax: sum, rcx: saved length */
/* r11: temp3, rdx: temp4, r12 loopcnt */
-   /* r10: temp5, rbp: temp6, r14 temp7, r13 temp8 */
+   /* r10: temp5, r15: temp6, r14 temp7, r13 temp8 */
.p2align 4
 .Lloop:
source
@@ -89,7 +89,7 @@ ENTRY(csum_partial_copy_generic)
source
movq  32(%rdi), %r10
source
-   movq  40(%rdi), %rbp
+   movq  40(%rdi), %r15
source
movq  48(%rdi), %r14
source
@@ -103,7 +103,7 @@ ENTRY(csum_partial_copy_generic)
adcq  %r11, %rax
adcq  %rdx, %rax
adcq  %r10, %rax
-   adcq  %rbp, %rax
+   adcq  %r15, %rax
adcq  %r14, %rax
adcq  %r13, %rax
 
@@ -121,7 +121,7 @@ ENTRY(csum_partial_copy_generic)
dest
movq %r10, 32(%rsi)
dest
-   movq %rbp, 40(%rsi)
+   movq %r15, 40(%rsi)
dest
movq %r14, 48(%rsi)
dest
@@ -203,7 +203,7 @@ ENTRY(csum_partial_copy_generic)
movq 3*8(%rsp), %r12
movq 4*8(%rsp), %r14
movq 5*8(%rsp), %r13
-   movq 6*8(%rsp), %rbp
+   movq 6*8(%rsp), %r15
addq $7*8, %rsp
ret
 




Re: [PATCH net v2 2/6] bnxt_en: do not allow wildcard matches for L2 flows

2018-04-11 Thread Michael Chan
On Wed, Apr 11, 2018 at 1:50 PM, Andy Gospodarek
 wrote:
> On Wed, Apr 11, 2018 at 01:41:31PM -0700, Michael Chan wrote:
>> On Wed, Apr 11, 2018 at 1:31 PM, Andy Gospodarek
>>  wrote:
>> > On Wed, Apr 11, 2018 at 11:43:14AM -0700, Jakub Kicinski wrote:
>> >> On Wed, 11 Apr 2018 11:50:14 -0400, Michael Chan wrote:
>> >> > @@ -764,6 +788,41 @@ static bool bnxt_tc_can_offload(struct bnxt *bp, 
>> >> > struct bnxt_tc_flow *flow)
>> >> > return false;
>> >> > }
>> >> >
>> >> > +   /* Currently source/dest MAC cannot be partial wildcard  */
>> >> > +   if (bits_set(>l2_key.smac, sizeof(flow->l2_key.smac)) &&
>> >> > +   !is_exactmatch(flow->l2_mask.smac, sizeof(flow->l2_mask.smac))) 
>> >> > {
>> >> > +   netdev_info(bp->dev, "Wildcard match unsupported for Source 
>> >> > MAC\n");
>> >>
>> >> This wouldn't be something to do in net, but how do you feel about
>> >> using extack for messages like this?
>> >>
>> >
>> > I agree 'net' would not have been the place for a change like that, but
>> > I do think that would be a good idea.  It looks like we could easily
>> > change the ndo_setup_tc to something like this:
>> >
>> > int (*ndo_setup_tc)(struct net_device *dev,
>> > enum tc_setup_type type,
>> > void *type_data,
>> > struct netlink_ext_ack 
>> > *extack);
>>
>> I think the extack pointer is already in the tc_cls_common_offload
>> struct inside tc_cls_flower_offload struct.
>
> True, but I'm not sure that tc_cls_common_offload is used in all cases.
> Take red_offload() as one of those.

For Flower, we know we have the extack pointer in
tc_cls_common_offload struct and we can use it to set the netlink
error message.  The point is that we don't have to modify
ndo_setup_tc().


[net 1/1] tipc: fix unbalanced reference counter

2018-04-11 Thread Jon Maloy
When a topology subscription is created, we may encounter (or KASAN
may provoke) a failure to create a corresponding service instance in
the binding table. Instead of letting the tipc_nametbl_subscribe()
report the failure back to the caller, the function just makes a warning
printout and returns, without incrementing the subscription reference
counter as expected by the caller.

This makes the caller believe that the subscription was successful, so
it will at a later moment try to unsubscribe the item. This involves
a sub_put() call. Since the reference counter never was incremented
in the first place, we get a premature delete of the subscription item,
followed by a "use-after-free" warning.

We fix this by adding a return value to tipc_nametbl_subscribe() and
make the caller aware of the failure to subscribe.

This bug seems to always have been around, but this fix only applies
back to the commit shown below. Given the low risk of this happening
we believe this to be sufficient.

Fixes: commit 218527fe27ad ("tipc: replace name table service range
array with rb tree")
Reported-by: syzbot+aa245f26d42b8305d...@syzkaller.appspotmail.com

Signed-off-by: Jon Maloy 
---
 net/tipc/name_table.c | 5 -
 net/tipc/name_table.h | 2 +-
 net/tipc/subscr.c | 5 -
 3 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/net/tipc/name_table.c b/net/tipc/name_table.c
index b1fe209..4068eaa 100644
--- a/net/tipc/name_table.c
+++ b/net/tipc/name_table.c
@@ -665,13 +665,14 @@ int tipc_nametbl_withdraw(struct net *net, u32 type, u32 
lower,
 /**
  * tipc_nametbl_subscribe - add a subscription object to the name table
  */
-void tipc_nametbl_subscribe(struct tipc_subscription *sub)
+bool tipc_nametbl_subscribe(struct tipc_subscription *sub)
 {
struct name_table *nt = tipc_name_table(sub->net);
struct tipc_net *tn = tipc_net(sub->net);
struct tipc_subscr *s = >evt.s;
u32 type = tipc_sub_read(s, seq.type);
struct tipc_service *sc;
+   bool res = true;
 
spin_lock_bh(>nametbl_lock);
sc = tipc_service_find(sub->net, type);
@@ -685,8 +686,10 @@ void tipc_nametbl_subscribe(struct tipc_subscription *sub)
pr_warn("Failed to subscribe for {%u,%u,%u}\n", type,
tipc_sub_read(s, seq.lower),
tipc_sub_read(s, seq.upper));
+   res = false;
}
spin_unlock_bh(>nametbl_lock);
+   return res;
 }
 
 /**
diff --git a/net/tipc/name_table.h b/net/tipc/name_table.h
index 4b14fc2..0febba4 100644
--- a/net/tipc/name_table.h
+++ b/net/tipc/name_table.h
@@ -126,7 +126,7 @@ struct publication *tipc_nametbl_insert_publ(struct net 
*net, u32 type,
 struct publication *tipc_nametbl_remove_publ(struct net *net, u32 type,
 u32 lower, u32 upper,
 u32 node, u32 key);
-void tipc_nametbl_subscribe(struct tipc_subscription *s);
+bool tipc_nametbl_subscribe(struct tipc_subscription *s);
 void tipc_nametbl_unsubscribe(struct tipc_subscription *s);
 int tipc_nametbl_init(struct net *net);
 void tipc_nametbl_stop(struct net *net);
diff --git a/net/tipc/subscr.c b/net/tipc/subscr.c
index b7d80bc..f340e53 100644
--- a/net/tipc/subscr.c
+++ b/net/tipc/subscr.c
@@ -153,7 +153,10 @@ struct tipc_subscription *tipc_sub_subscribe(struct net 
*net,
memcpy(>evt.s, s, sizeof(*s));
spin_lock_init(>lock);
kref_init(>kref);
-   tipc_nametbl_subscribe(sub);
+   if (!tipc_nametbl_subscribe(sub)) {
+   kfree(sub);
+   return NULL;
+   }
timer_setup(>timer, tipc_sub_timeout, 0);
timeout = tipc_sub_read(>evt.s, timeout);
if (timeout != TIPC_WAIT_FOREVER)
-- 
2.1.4



Re: [PATCH] lan78xx: Don't reset the interface on open

2018-04-11 Thread David Miller
From: Phil Elwell 
Date: Tue, 10 Apr 2018 13:18:25 +0100

> Commit 92571a1aae40 ("lan78xx: Connect phy early") moves the PHY
> initialisation into lan78xx_probe, but lan78xx_open subsequently calls
> lan78xx_reset. As well as forcing a second round of link negotiation,
> this reset frequently prevents the phy interrupt from being generated
> (even though the link is up), rendering the interface unusable.
> 
> Fix this issue by removing the lan78xx_reset call from lan78xx_open.
> 
> Fixes: 92571a1aae40 ("lan78xx: Connect phy early")
> Signed-off-by: Phil Elwell 

Applied and queued up for -stable, thanks.


Re: [PATCH net v2 2/6] bnxt_en: do not allow wildcard matches for L2 flows

2018-04-11 Thread Andy Gospodarek
On Wed, Apr 11, 2018 at 01:41:31PM -0700, Michael Chan wrote:
> On Wed, Apr 11, 2018 at 1:31 PM, Andy Gospodarek
>  wrote:
> > On Wed, Apr 11, 2018 at 11:43:14AM -0700, Jakub Kicinski wrote:
> >> On Wed, 11 Apr 2018 11:50:14 -0400, Michael Chan wrote:
> >> > @@ -764,6 +788,41 @@ static bool bnxt_tc_can_offload(struct bnxt *bp, 
> >> > struct bnxt_tc_flow *flow)
> >> > return false;
> >> > }
> >> >
> >> > +   /* Currently source/dest MAC cannot be partial wildcard  */
> >> > +   if (bits_set(>l2_key.smac, sizeof(flow->l2_key.smac)) &&
> >> > +   !is_exactmatch(flow->l2_mask.smac, sizeof(flow->l2_mask.smac))) {
> >> > +   netdev_info(bp->dev, "Wildcard match unsupported for Source 
> >> > MAC\n");
> >>
> >> This wouldn't be something to do in net, but how do you feel about
> >> using extack for messages like this?
> >>
> >
> > I agree 'net' would not have been the place for a change like that, but
> > I do think that would be a good idea.  It looks like we could easily
> > change the ndo_setup_tc to something like this:
> >
> > int (*ndo_setup_tc)(struct net_device *dev,
> > enum tc_setup_type type,
> > void *type_data,
> > struct netlink_ext_ack 
> > *extack);
> 
> I think the extack pointer is already in the tc_cls_common_offload
> struct inside tc_cls_flower_offload struct.

True, but I'm not sure that tc_cls_common_offload is used in all cases.
Take red_offload() as one of those.


Re: [PATCH net v2 2/6] bnxt_en: do not allow wildcard matches for L2 flows

2018-04-11 Thread Michael Chan
On Wed, Apr 11, 2018 at 1:31 PM, Andy Gospodarek
 wrote:
> On Wed, Apr 11, 2018 at 11:43:14AM -0700, Jakub Kicinski wrote:
>> On Wed, 11 Apr 2018 11:50:14 -0400, Michael Chan wrote:
>> > @@ -764,6 +788,41 @@ static bool bnxt_tc_can_offload(struct bnxt *bp, 
>> > struct bnxt_tc_flow *flow)
>> > return false;
>> > }
>> >
>> > +   /* Currently source/dest MAC cannot be partial wildcard  */
>> > +   if (bits_set(>l2_key.smac, sizeof(flow->l2_key.smac)) &&
>> > +   !is_exactmatch(flow->l2_mask.smac, sizeof(flow->l2_mask.smac))) {
>> > +   netdev_info(bp->dev, "Wildcard match unsupported for Source 
>> > MAC\n");
>>
>> This wouldn't be something to do in net, but how do you feel about
>> using extack for messages like this?
>>
>
> I agree 'net' would not have been the place for a change like that, but
> I do think that would be a good idea.  It looks like we could easily
> change the ndo_setup_tc to something like this:
>
> int (*ndo_setup_tc)(struct net_device *dev,
> enum tc_setup_type type,
> void *type_data,
> struct netlink_ext_ack 
> *extack);

I think the extack pointer is already in the tc_cls_common_offload
struct inside tc_cls_flower_offload struct.


Re: [PATCH net 1/2] tun: set the flags before registering the netdevice

2018-04-11 Thread David Miller
From: Sabrina Dubroca 
Date: Tue, 10 Apr 2018 16:28:55 +0200

> Otherwise, register_netdevice advertises the creation of the device with
> the default flags, instead of what the user requested.
> 
> Reported-by: Thomas Haller 
> Fixes: 1ec010e70593 ("tun: export flags, uid, gid, queue information over 
> netlink")
> Signed-off-by: Sabrina Dubroca 

Applied and queued up for -stable.


Re: [PATCH net 2/2] tun: send netlink notification when the device is modified

2018-04-11 Thread David Miller
From: Sabrina Dubroca 
Date: Tue, 10 Apr 2018 16:28:56 +0200

> I added dumping of link information about tun devices over netlink in
> commit 1ec010e70593 ("tun: export flags, uid, gid, queue information
> over netlink"), but didn't add the missing netlink notifications when
> the device's exported properties change.
> 
> This patch adds notifications when owner/group or flags are modified,
> when queues are attached/detached, and when a tun fd is closed.
> 
> Reported-by: Thomas Haller 
> Fixes: 1ec010e70593 ("tun: export flags, uid, gid, queue information over 
> netlink")
> Signed-off-by: Sabrina Dubroca 

Applied and queued up for -stable.


Re: [PATCH net v2 2/6] bnxt_en: do not allow wildcard matches for L2 flows

2018-04-11 Thread Andy Gospodarek
On Wed, Apr 11, 2018 at 11:43:14AM -0700, Jakub Kicinski wrote:
> On Wed, 11 Apr 2018 11:50:14 -0400, Michael Chan wrote:
> > @@ -764,6 +788,41 @@ static bool bnxt_tc_can_offload(struct bnxt *bp, 
> > struct bnxt_tc_flow *flow)
> > return false;
> > }
> >  
> > +   /* Currently source/dest MAC cannot be partial wildcard  */
> > +   if (bits_set(>l2_key.smac, sizeof(flow->l2_key.smac)) &&
> > +   !is_exactmatch(flow->l2_mask.smac, sizeof(flow->l2_mask.smac))) {
> > +   netdev_info(bp->dev, "Wildcard match unsupported for Source 
> > MAC\n");
> 
> This wouldn't be something to do in net, but how do you feel about
> using extack for messages like this?
> 

I agree 'net' would not have been the place for a change like that, but
I do think that would be a good idea.  It looks like we could easily
change the ndo_setup_tc to something like this:

int (*ndo_setup_tc)(struct net_device *dev,
enum tc_setup_type type,
void *type_data,
struct netlink_ext_ack *extack);

It also looks like most of the callers of ndo_setup_tc have infra in
place to pass extack easily when the call is sourced from a netlink
message.   The others can just pass in NULL or define a local
netlink_ext_ack variable for short-term use.



Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc

2018-04-11 Thread Ivan Briano


On 04/11/2018 01:16 PM, Thomas Gleixner wrote:
> On Tue, 10 Apr 2018, Jesus Sanchez-Palencia wrote:
 This will be provided by tbs if the socket which is transmitting packets is
 configured for deadline mode.
>>>
>>> You don't want the socket to decide that. The qdisc into which a socket
>>> feeds defines the mode and the qdisc rejects requests with the wrong mode.
>>>
>>> Making a qdisc doing both and let the user decide what he wants it to be is
>>> not really going to fly. Especially if you have different users which want
>>> a different mode. It's clearly distinct functionality.
>>
>>
>> Ok, so just to make sure I got this right, are you suggesting that both the
>> 'tbs' qdisc *and* the socket (i.e. through SO_TXTIME) should have a config
>> parameter for specifying the txtime mode? This way if there is a mismatch,
>> packets from that socket are rejected by the qdisc.
> 
> Correct. The same is true if you try to set SO_TXTIME for something which
> is just routing regular traffic.
> 
>> (...)
>>>
 Another question for this mode (but perhaps that applies to both modes) 
 is, what
 if the qdisc misses the deadline for *any* reason? I'm assuming it should 
 drop
 the packet during dequeue.
>>>
>>> There the question is how user space is notified about that issue. The
>>> application which queued the packet on time does rightfully assume that
>>> it's going to be on the wire on time.
>>>
>>> This is a violation of the overall scheduling plan, so you need to have
>>> a sane design to handle that.
>>
>> In addition to the qdisc stats, we could look into using the socket's error
>> queue to notify the application about that.
> 
> Makes sense.
>  
 Putting it all together, we end up with:

 1) a new txtime aware qdisc, tbs, to be used per queue. Its cli will look 
 like:
 $ tc qdisc add (...) tbs clockid CLOCK_REALTIME delta 15 offload 
 sorting
>>>
>>> Why CLOCK_REALTIME? The only interesting time in a TSN network is
>>> CLOCK_TAI, really.
>>
>> REALTIME was just an example here to show that the qdisc has to be configured
>> with a clockid parameter. Are you suggesting that instead both of the new 
>> qdiscs
>> (i.e. tbs and taprio) should always be using CLOCK_TAI implicitly?
> 
> I think so. It's _the_ network time on which everything is based on.
> 
 2) a new cmsg-interface for setting a per-packet timestamp that will be 
 used
 either as a txtime or as deadline by tbs (and further the NIC driver for 
 the
 offlaod case): SCM_TXTIME.

 3) a new socket option: SO_TXTIME. It will be used to enable the feature 
 for a
 socket, and will have as parameters a clockid and a txtime mode (deadline 
 or
 explicit), that defines the semantics of the timestamp set on packets using
 SCM_TXTIME.

 4) a new #define DYNAMIC_CLOCKID 15 added to include/uapi/linux/time.h .
>>>
>>> Can you remind me why we would need that?
>>
>> So there is a "clockid" that can be used for the full hw offload modes. On 
>> this
>> case, the txtimes are in reference to the NIC's PTP clock, and, as 
>> discussed, we
>> can't just use a clockid that was computed from the fd pointing to /dev/ptpX 
>> .
> 
> And the NICs PTP clock is CLOCK_TAI, so there should be no reason to have
> yet another clock, right?
> 

Most likely, though you can technically have a different time domain
that is not based on TAI.

> Thanks,
> 
>   tglx
> 


Re: [PATCH net v2 2/6] bnxt_en: do not allow wildcard matches for L2 flows

2018-04-11 Thread Michael Chan
On Wed, Apr 11, 2018 at 11:43 AM, Jakub Kicinski  wrote:
> On Wed, 11 Apr 2018 11:50:14 -0400, Michael Chan wrote:
>> @@ -764,6 +788,41 @@ static bool bnxt_tc_can_offload(struct bnxt *bp, struct 
>> bnxt_tc_flow *flow)
>>   return false;
>>   }
>>
>> + /* Currently source/dest MAC cannot be partial wildcard  */
>> + if (bits_set(>l2_key.smac, sizeof(flow->l2_key.smac)) &&
>> + !is_exactmatch(flow->l2_mask.smac, sizeof(flow->l2_mask.smac))) {
>> + netdev_info(bp->dev, "Wildcard match unsupported for Source 
>> MAC\n");
>
> This wouldn't be something to do in net, but how do you feel about
> using extack for messages like this?
>

Sounds reasonable to me.  Just need to pass in the extack pointer to
this function to set the netlink error message.


[RFC PATCH] lan78xx: lan78xx_regs[] can be static

2018-04-11 Thread kbuild test robot

Fixes: 1373bfc60bd8 ("lan78xx: Add support to dump lan78xx registers")
Signed-off-by: Fengguang Wu 
---
 lan78xx.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
index cbeec78..df5c3eb 100644
--- a/drivers/net/usb/lan78xx.c
+++ b/drivers/net/usb/lan78xx.c
@@ -278,7 +278,7 @@ struct lan78xx_statstage64 {
u64 eee_tx_lpi_time;
 };
 
-u32 lan78xx_regs[] = {
+static u32 lan78xx_regs[] = {
ID_REV,
INT_STS,
HW_CFG,


Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc

2018-04-11 Thread Thomas Gleixner
On Tue, 10 Apr 2018, Jesus Sanchez-Palencia wrote:
> >> This will be provided by tbs if the socket which is transmitting packets is
> >> configured for deadline mode.
> > 
> > You don't want the socket to decide that. The qdisc into which a socket
> > feeds defines the mode and the qdisc rejects requests with the wrong mode.
> > 
> > Making a qdisc doing both and let the user decide what he wants it to be is
> > not really going to fly. Especially if you have different users which want
> > a different mode. It's clearly distinct functionality.
> 
> 
> Ok, so just to make sure I got this right, are you suggesting that both the
> 'tbs' qdisc *and* the socket (i.e. through SO_TXTIME) should have a config
> parameter for specifying the txtime mode? This way if there is a mismatch,
> packets from that socket are rejected by the qdisc.

Correct. The same is true if you try to set SO_TXTIME for something which
is just routing regular traffic.

> (...)
> > 
> >> Another question for this mode (but perhaps that applies to both modes) 
> >> is, what
> >> if the qdisc misses the deadline for *any* reason? I'm assuming it should 
> >> drop
> >> the packet during dequeue.
> > 
> > There the question is how user space is notified about that issue. The
> > application which queued the packet on time does rightfully assume that
> > it's going to be on the wire on time.
> > 
> > This is a violation of the overall scheduling plan, so you need to have
> > a sane design to handle that.
> 
> In addition to the qdisc stats, we could look into using the socket's error
> queue to notify the application about that.

Makes sense.
 
> >> Putting it all together, we end up with:
> >>
> >> 1) a new txtime aware qdisc, tbs, to be used per queue. Its cli will look 
> >> like:
> >> $ tc qdisc add (...) tbs clockid CLOCK_REALTIME delta 15 offload 
> >> sorting
> > 
> > Why CLOCK_REALTIME? The only interesting time in a TSN network is
> > CLOCK_TAI, really.
> 
> REALTIME was just an example here to show that the qdisc has to be configured
> with a clockid parameter. Are you suggesting that instead both of the new 
> qdiscs
> (i.e. tbs and taprio) should always be using CLOCK_TAI implicitly?

I think so. It's _the_ network time on which everything is based on.

> >> 2) a new cmsg-interface for setting a per-packet timestamp that will be 
> >> used
> >> either as a txtime or as deadline by tbs (and further the NIC driver for 
> >> the
> >> offlaod case): SCM_TXTIME.
> >>
> >> 3) a new socket option: SO_TXTIME. It will be used to enable the feature 
> >> for a
> >> socket, and will have as parameters a clockid and a txtime mode (deadline 
> >> or
> >> explicit), that defines the semantics of the timestamp set on packets using
> >> SCM_TXTIME.
> >>
> >> 4) a new #define DYNAMIC_CLOCKID 15 added to include/uapi/linux/time.h .
> > 
> > Can you remind me why we would need that?
> 
> So there is a "clockid" that can be used for the full hw offload modes. On 
> this
> case, the txtimes are in reference to the NIC's PTP clock, and, as discussed, 
> we
> can't just use a clockid that was computed from the fd pointing to /dev/ptpX .

And the NICs PTP clock is CLOCK_TAI, so there should be no reason to have
yet another clock, right?

Thanks,

tglx


[PATCH 4.9 005/310] x86/asm: Dont use RBP as a temporary register in csum_partial_copy_generic()

2018-04-11 Thread Greg Kroah-Hartman
4.9-stable review patch.  If anyone has any objections, please let me know.

--

From: Josh Poimboeuf 


[ Upstream commit 42fc6c6cb1662ba2fa727dd01c9473c63be4e3b6 ]

Andrey Konovalov reported the following warning while fuzzing the kernel
with syzkaller:

  WARNING: kernel stack regs at 8800686869f8 in a.out:4933 has bad 'bp' 
value c3fc855a10167ec0

The unwinder dump revealed that RBP had a bad value when an interrupt
occurred in csum_partial_copy_generic().

That function saves RBP on the stack and then overwrites it, using it as
a scratch register.  That's problematic because it breaks stack traces
if an interrupt occurs in the middle of the function.

Replace the usage of RBP with another callee-saved register (R15) so
stack traces are no longer affected.

Reported-by: Andrey Konovalov 
Tested-by: Andrey Konovalov 
Signed-off-by: Josh Poimboeuf 
Cc: Cong Wang 
Cc: David S . Miller 
Cc: Dmitry Vyukov 
Cc: Eric Dumazet 
Cc: Kostya Serebryany 
Cc: Linus Torvalds 
Cc: Marcelo Ricardo Leitner 
Cc: Neil Horman 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Vlad Yasevich 
Cc: linux-s...@vger.kernel.org
Cc: netdev 
Cc: syzkaller 
Link: 
http://lkml.kernel.org/r/4b03a961efda5ec9bfe46b7b9c9ad72d1efad343.1493909486.git.jpoim...@redhat.com
Signed-off-by: Ingo Molnar 
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/x86/lib/csum-copy_64.S |   12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

--- a/arch/x86/lib/csum-copy_64.S
+++ b/arch/x86/lib/csum-copy_64.S
@@ -55,7 +55,7 @@ ENTRY(csum_partial_copy_generic)
movq  %r12, 3*8(%rsp)
movq  %r14, 4*8(%rsp)
movq  %r13, 5*8(%rsp)
-   movq  %rbp, 6*8(%rsp)
+   movq  %r15, 6*8(%rsp)
 
movq  %r8, (%rsp)
movq  %r9, 1*8(%rsp)
@@ -74,7 +74,7 @@ ENTRY(csum_partial_copy_generic)
/* main loop. clear in 64 byte blocks */
/* r9: zero, r8: temp2, rbx: temp1, rax: sum, rcx: saved length */
/* r11: temp3, rdx: temp4, r12 loopcnt */
-   /* r10: temp5, rbp: temp6, r14 temp7, r13 temp8 */
+   /* r10: temp5, r15: temp6, r14 temp7, r13 temp8 */
.p2align 4
 .Lloop:
source
@@ -89,7 +89,7 @@ ENTRY(csum_partial_copy_generic)
source
movq  32(%rdi), %r10
source
-   movq  40(%rdi), %rbp
+   movq  40(%rdi), %r15
source
movq  48(%rdi), %r14
source
@@ -103,7 +103,7 @@ ENTRY(csum_partial_copy_generic)
adcq  %r11, %rax
adcq  %rdx, %rax
adcq  %r10, %rax
-   adcq  %rbp, %rax
+   adcq  %r15, %rax
adcq  %r14, %rax
adcq  %r13, %rax
 
@@ -121,7 +121,7 @@ ENTRY(csum_partial_copy_generic)
dest
movq %r10, 32(%rsi)
dest
-   movq %rbp, 40(%rsi)
+   movq %r15, 40(%rsi)
dest
movq %r14, 48(%rsi)
dest
@@ -203,7 +203,7 @@ ENTRY(csum_partial_copy_generic)
movq 3*8(%rsp), %r12
movq 4*8(%rsp), %r14
movq 5*8(%rsp), %r13
-   movq 6*8(%rsp), %rbp
+   movq 6*8(%rsp), %r15
addq $7*8, %rsp
ret
 




Re: [patch net] devlink: convert occ_get op to separate registration

2018-04-11 Thread Sasha Levin
On Wed, Apr 11, 2018 at 10:46:00AM +0200, gre...@linuxfoundation.org wrote:
>On Tue, Apr 10, 2018 at 08:43:31PM +, Sasha Levin wrote:
>> >Bots are starting to overwhelm actual content from human beings
>> >on this list, and I want to put my foot on the brake right now
>> >before it gets even more out of control.
>>
>> I think we're just hitting the limitations of using a mailing list. Bots
>> aren't around to spam everyone for fun, right?
>>
>> 0day bot is spammy because patches fail to compile, syzbot is spammy
>> because we have tons of bugs users can hit and the stable bot is
>> spammy because we miss lots of commits that should be in stable.
>>
>> As the kernel grows, not only the codebase is expanding but also the
>> tooling around it. While spammy, bots provide valuable input that in the
>> past would take blood, sweat and tears to acquire. We just need a better
>> way to consume it rather than blocking off these inputs.
>
>As one of the primary abusers of the "I send too much email" flood of
>mess, I'm going to start cutting down on what type of message I respond
>to a public vger list from now on.  I'll write more on the stable@ list
>about this, but for your individual patches like this, how about just
>responding to the developers themselves and not a public list?  I do
>that when I commit a patch to my tree, stripping out lists, as it's not
>a useful message for anyone other than the person who wrote the patch
>and the reviewers.

Sure.

I think we should have a discussion as to the best way to "spam" people.
There are a few alternatives I can think of (a "digest" mail once a
day/week? Web UI? different mailing list?) but we keep using the one
size fits all approach for all our bots.

I don't see a reason why we can't tailor bot output based on
maintainer/author preference? Maybe David would be less upset if he'd
see just one mail per week with commits we ask to review?

We're stuck with mailing lists for kernel development, might as well
make the best out of it.

Re: [PATCH net] sctp: do not check port in sctp_inet6_cmp_addr

2018-04-11 Thread Neil Horman
On Thu, Apr 12, 2018 at 12:16:58AM +0800, Xin Long wrote:
> On Wed, Apr 11, 2018 at 10:59 PM, Marcelo Ricardo Leitner
>  wrote:
> > On Wed, Apr 11, 2018 at 10:36:07AM -0400, Neil Horman wrote:
> >> On Wed, Apr 11, 2018 at 08:58:05PM +0800, Xin Long wrote:
> >> > pf->cmp_addr() is called before binding a v6 address to the sock. It
> >> > should not check ports, like in sctp_inet_cmp_addr.
> >> >
> >> > But sctp_inet6_cmp_addr checks the addr by invoking af(6)->cmp_addr,
> >> > sctp_v6_cmp_addr where it also compares the ports.
> >> >
> >> > This would cause that setsockopt(SCTP_SOCKOPT_BINDX_ADD) could bind
> >> > multiple duplicated IPv6 addresses after Commit 40b4f0fd74e4 ("sctp:
> >> > lack the check for ports in sctp_v6_cmp_addr").
> >> >
> >> > This patch is to remove af->cmp_addr called in sctp_inet6_cmp_addr,
> >> > but do the proper check for both v6 addrs and v4mapped addrs.
> >> >
> >> > Fixes: 40b4f0fd74e4 ("sctp: lack the check for ports in 
> >> > sctp_v6_cmp_addr")
> >> > Reported-by: Jianwen Ji 
> >> > Signed-off-by: Xin Long 
> >> > ---
> >> >  net/sctp/ipv6.c | 27 ---
> >> >  1 file changed, 24 insertions(+), 3 deletions(-)
> >> >
> >> > diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
> >> > index f1fc48e..be4b72c 100644
> >> > --- a/net/sctp/ipv6.c
> >> > +++ b/net/sctp/ipv6.c
> >> > @@ -846,8 +846,8 @@ static int sctp_inet6_cmp_addr(const union sctp_addr 
> >> > *addr1,
> >> >const union sctp_addr *addr2,
> >> >struct sctp_sock *opt)
> >> >  {
> >> > -   struct sctp_af *af1, *af2;
> >> > struct sock *sk = sctp_opt2sk(opt);
> >> > +   struct sctp_af *af1, *af2;
> >> >
> >> > af1 = sctp_get_af_specific(addr1->sa.sa_family);
> >> > af2 = sctp_get_af_specific(addr2->sa.sa_family);
> >> > @@ -863,10 +863,31 @@ static int sctp_inet6_cmp_addr(const union 
> >> > sctp_addr *addr1,
> >> > if (sctp_is_any(sk, addr1) || sctp_is_any(sk, addr2))
> >> > return 1;
> >> >
> >> > -   if (addr1->sa.sa_family != addr2->sa.sa_family)
> >> > +   if (addr1->sa.sa_family != addr2->sa.sa_family) {
> >> > +   if (addr1->sa.sa_family == AF_INET &&
> >> > +   addr2->sa.sa_family == AF_INET6 &&
> >> > +   ipv6_addr_v4mapped(>v6.sin6_addr))
> >> > +   if (addr2->v6.sin6_addr.s6_addr32[3] ==
> >> > +   addr1->v4.sin_addr.s_addr)
> >> > +   return 1;
> >> > +   if (addr2->sa.sa_family == AF_INET &&
> >> > +   addr1->sa.sa_family == AF_INET6 &&
> >> > +   ipv6_addr_v4mapped(>v6.sin6_addr))
> >> > +   if (addr1->v6.sin6_addr.s6_addr32[3] ==
> >> > +   addr2->v4.sin_addr.s_addr)
> >> > +   return 1;
> >> > +   return 0;
> >> > +   }
> >> > +
> >> > +   if (!ipv6_addr_equal(>v6.sin6_addr, >v6.sin6_addr))
> >> > +   return 0;
> >> > +
> >> > +   if ((ipv6_addr_type(>v6.sin6_addr) & IPV6_ADDR_LINKLOCAL) &&
> >> > +   addr1->v6.sin6_scope_id && addr2->v6.sin6_scope_id &&
> >> > +   addr1->v6.sin6_scope_id != addr2->v6.sin6_scope_id)
> >> > return 0;
> >> >
> >> > -   return af1->cmp_addr(addr1, addr2);
> >> > +   return 1;
> >> >  }
> >> >
> >> >  /* Verify that the provided sockaddr looks bindable.   Common 
> >> > verification,
> >> > --
> >> > 2.1.0
> >> >
> >> This looks correct to me, but is it worth duplicating the comparison code 
> >> like
> >> this from the cmp_addr function?  It might be more worthwhile to add a 
> >> flag to
> >> the cmp_addr method to direct weather it needs to check port values or not.
> >> That way you could continue to use the cmp_addr function here.
> >
> > Adding a flag into sctp_v6_cmp_addr will get us a terrible code to
> > read. It's already not one of the best looking part of it. Maybe
> > still duplicate part of it it, but at 'af' level? As in:
> > - af->cmp_addr
> > - af->cmp_addr_port
> >
> What do you think of:
> 
> static int sctp_v6_cmp_addr(const union sctp_addr *addr1,
> const union sctp_addr *addr2)
> {
> return __sctp_v6_cmp_addr(addr1, addr2) &&
>addr1->v6.sin_port == addr2->v6.sin_port;
> }
> 
> (v6.sin_port and v4.sin_port have the same offset in union sctp_addr,
>  we've exploited this in many places in SCTP)
Yes, I'd be ok with that
Neil

> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


Re: [PATCH net-next] netns: filter uevents correctly

2018-04-11 Thread Christian Brauner
On Wed, Apr 11, 2018 at 02:16:23PM -0500, Eric W. Biederman wrote:
> Christian Brauner  writes:
> 
> > On Wed, Apr 11, 2018 at 01:37:18PM -0500, Eric W. Biederman wrote:
> >> Christian Brauner  writes:
> >> 
> >> > On Wed, Apr 11, 2018 at 11:40:14AM -0500, Eric W. Biederman wrote:
> >> >> Christian Brauner  writes:
> >> >> > Yeah, agreed.
> >> >> > But I think the patch is not complete. To guarantee that no 
> >> >> > non-initial
> >> >> > user namespace actually receives uevents we need to:
> >> >> > 1. only sent uevents to uevent sockets that are located in network
> >> >> >namespaces that are owned by init_user_ns
> >> >> > 2. filter uevents that are sent to sockets in mc_list that have 
> >> >> > opened a
> >> >> >uevent socket that is owned by init_user_ns *from* a
> >> >> >non-init_user_ns
> >> >> >
> >> >> > We account for 1. by only recording uevent sockets in the global 
> >> >> > uevent
> >> >> > socket list who are owned by init_user_ns.
> >> >> > But to account for 2. we need to filter by the user namespace who owns
> >> >> > the socket in mc_list. So in addition to that we also need to slightly
> >> >> > change the filter logic in kobj_bcast_filter() I think:
> >> >> >
> >> >> > diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> >> >> > index 22a2c1a98b8f..064d7d29ace5 100644
> >> >> > --- a/lib/kobject_uevent.c
> >> >> > +++ b/lib/kobject_uevent.c
> >> >> > @@ -251,7 +251,8 @@ static int kobj_bcast_filter(struct sock *dsk, 
> >> >> > struct sk_buff *skb, void *data)
> >> >> >   return sock_ns != ns;
> >> >> >   }
> >> >> >  
> >> >> > - return 0;
> >> >> > + /* Check if socket was opened from non-initial user namespace. 
> >> >> > */
> >> >> > + return sk_user_ns(dsk) != _user_ns;
> >> >> >  }
> >> >> >  #endif
> >> >> >  
> >> >> >
> >> >> > But correct me if I'm wrong.
> >> >> 
> >> >> You are worrying about NETLINK_LISTEN_ALL_NSID sockets.  That has
> >> >> permissions and an explicit opt-in to receiving packets from multiple
> >> >> network namespaces.
> >> >
> >> > I don't think that's what I'm talking about unless that is somehow the
> >> > default for NETLINK_KOBJECT_UEVENT sockets. What I'm worried about is
> >> > doing
> >> >
> >> > unshare -U --map-root
> >> >
> >> > then opening a NETLINK_KOBJECT_UEVENT socket and starting to listen to
> >> > uevents. Imho, this should not be possible because I'm in a
> >> > non-init_user_ns. But currently I'm able to - even with the patch to
> >> > come - since the uevent socket in the kernel was created when init_net
> >> > was created and hence is *owned* by the init_user_ns which means it is
> >> > in the list of uevent sockets. Here's a demo of what I mean:
> >> >
> >> > https://asciinema.org/a/175632
> >> 
> >> Why do you care about this case?
> >
> > It's not so much that I care about this case since any workload that
> > wants to run a separate udevd will have to unshare the network namespace
> > and the user namespace for it to make complete sense.
> > What I do care about is that the two of us are absolutely in the clear
> > about what semantics we are going to expose to userspace and it seems
> > that we were talking past each other wrt to this "corner case".
> > For userspace, it needs to be very clear that the intention is to filter
> > by *owning user namespace of the network namespace a given task resides
> > in* and not by user namespace of the task per se. This is what this
> > corner case basically shows, I think.
> 
> If this is just a clarification of semantics then yes this is a
> productive question.  I almost agree with your definition above.
> 
> I would make the definition very simple.  Uevents will not be broadcast
> via netlink in a network namespace where net->user_ns != _user_ns,
> with the exception of uevents for network devices in that network
> namespace.

Well, for the sake of posterity :) I should add that I'd prefer we'd add
what I suggested above:

-   return 0;
+   /* Check if socket was opened from non-initial user namespace. */
+   return sk_user_ns(dsk) != _user_ns;
 }

to slam the door shut once and for all for all non-init_user_ns
namespaces because it *seems* like the cleanest solution: uevents are
owned by init_user_ns; period. Because it is the only user namespace
that can do anything interesting with them *by default*.
But what we have now right now with my upcoming patch is at least
sufficient and safe.

Christian

> 
> The existing filtering by the sending uid and verifying that it is uid 0
> gives a little more room to filter if we want (as udev & friends will
> ignore the uevent), but I don't see the point.
> 
> Eric


Re: [PATCH net-next] netns: filter uevents correctly

2018-04-11 Thread Christian Brauner
On Wed, Apr 11, 2018 at 01:37:18PM -0500, Eric W. Biederman wrote:
> Christian Brauner  writes:
> 
> > On Wed, Apr 11, 2018 at 11:40:14AM -0500, Eric W. Biederman wrote:
> >> Christian Brauner  writes:
> >> > Yeah, agreed.
> >> > But I think the patch is not complete. To guarantee that no non-initial
> >> > user namespace actually receives uevents we need to:
> >> > 1. only sent uevents to uevent sockets that are located in network
> >> >namespaces that are owned by init_user_ns
> >> > 2. filter uevents that are sent to sockets in mc_list that have opened a
> >> >uevent socket that is owned by init_user_ns *from* a
> >> >non-init_user_ns
> >> >
> >> > We account for 1. by only recording uevent sockets in the global uevent
> >> > socket list who are owned by init_user_ns.
> >> > But to account for 2. we need to filter by the user namespace who owns
> >> > the socket in mc_list. So in addition to that we also need to slightly
> >> > change the filter logic in kobj_bcast_filter() I think:
> >> >
> >> > diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> >> > index 22a2c1a98b8f..064d7d29ace5 100644
> >> > --- a/lib/kobject_uevent.c
> >> > +++ b/lib/kobject_uevent.c
> >> > @@ -251,7 +251,8 @@ static int kobj_bcast_filter(struct sock *dsk, 
> >> > struct sk_buff *skb, void *data)
> >> >  return sock_ns != ns;
> >> >  }
> >> >  
> >> > -return 0;
> >> > +/* Check if socket was opened from non-initial user namespace. 
> >> > */
> >> > +return sk_user_ns(dsk) != _user_ns;
> >> >  }
> >> >  #endif
> >> >  
> >> >
> >> > But correct me if I'm wrong.
> >> 
> >> You are worrying about NETLINK_LISTEN_ALL_NSID sockets.  That has
> >> permissions and an explicit opt-in to receiving packets from multiple
> >> network namespaces.
> >
> > I don't think that's what I'm talking about unless that is somehow the
> > default for NETLINK_KOBJECT_UEVENT sockets. What I'm worried about is
> > doing
> >
> > unshare -U --map-root
> >
> > then opening a NETLINK_KOBJECT_UEVENT socket and starting to listen to
> > uevents. Imho, this should not be possible because I'm in a
> > non-init_user_ns. But currently I'm able to - even with the patch to
> > come - since the uevent socket in the kernel was created when init_net
> > was created and hence is *owned* by the init_user_ns which means it is
> > in the list of uevent sockets. Here's a demo of what I mean:
> >
> > https://asciinema.org/a/175632
> 
> Why do you care about this case?

It's not so much that I care about this case since any workload that
wants to run a separate udevd will have to unshare the network namespace
and the user namespace for it to make complete sense.
What I do care about is that the two of us are absolutely in the clear
about what semantics we are going to expose to userspace and it seems
that we were talking past each other wrt to this "corner case".
For userspace, it needs to be very clear that the intention is to filter
by *owning user namespace of the network namespace a given task resides
in* and not by user namespace of the task per se. This is what this
corner case basically shows, I think.

Christian

> 
> Everyone is allowed to listen to the uevent netlink sockets with or
> without user namespaces.  So there are no permission issues, and
> this is not even a data information leak.
> 
> If you don't want programs in your user namespace to have access you
> will be able to unshare the network namespace.
> 
> Eric


Re: [RFC net-next 1/2] net: net-porcfs: Reduce rcu lock critical section

2018-04-11 Thread Saeed Mahameed
On Tue, 2018-04-10 at 13:35 -0700, Eric Dumazet wrote:
> 
> On 04/10/2018 10:16 AM, David Miller wrote:
> > 
> > The tradeoff here is that now you are doing two unnecessary atomic
> > operations per stats dump.
> > 
> > That is what the RCU lock allows us to avoid.
> > 
> 
> dev_hold() and dev_put() are actually per cpu increment and
> decrements,
> pretty cheap these days.
> 

Yes, i am adding only 2 cpu instructions here.
I think the trade-off here is too small and the price to finally have
get_stats64 called from non atomic context is really worth it.

It  looks really odd to me that the device chain locks are held for
such long periods, while we already have the means to avoid this, same
goes for rtnl_lock, same trick can work here for many use cases and
many ndos, we are just over protective for no reasons.


> Problem here is that any preemption of the process holding device
> reference
> might trigger warnings in device unregister.
> 

This is true for any other place dev_hold is used,
as explained in the commit message dev_hold is used for a very brief
moment before calling the stats ndo and released directly after.

looking at 

netdev_wait_allrefs(...)
[...]
msleep(250);

refcnt = netdev_refcnt_read(dev);

if (time_after(jiffies, warning_time + 10 * HZ)) {
pr_emerg("unregister_netdevice: waiting for %s to
become free. Usage count = %d\n",
 dev->name, refcnt);
warning_time = jiffies;
}

The holder will get enough time to release the netdev way before the
warning is triggered.

The warning is triggered only if someone holds the dev for more than 10
seconds which is impossible for the stats ndo to take more than this,
in fact i just did a quick measurement and it seems that in average
get_stats64 ndo takes 0.5us !



Re: WARNING: possible recursive locking detected

2018-04-11 Thread Julian Anastasov

Hello,

On Wed, 11 Apr 2018, Dmitry Vyukov wrote:

> On Wed, Apr 11, 2018 at 4:02 PM, syzbot
>  wrote:
> > Hello,
> >
> > syzbot hit the following crash on upstream commit
> > b284d4d5a6785f8cd07eda2646a95782373cd01e (Tue Apr 10 19:25:30 2018 +)
> > Merge tag 'ceph-for-4.17-rc1' of git://github.com/ceph/ceph-client
> > syzbot dashboard link:
> > https://syzkaller.appspot.com/bug?extid=3c43eecd7745a5ce1640
> >
> > So far this crash happened 3 times on upstream.
> > C reproducer: https://syzkaller.appspot.com/x/repro.c?id=5103706542440448
> > syzkaller reproducer:
> > https://syzkaller.appspot.com/x/repro.syz?id=5641659786199040
> > Raw console output:
> > https://syzkaller.appspot.com/x/log.txt?id=5099510896263168
> > Kernel config:
> > https://syzkaller.appspot.com/x/.config?id=-1223000601505858474
> > compiler: gcc (GCC) 8.0.1 20180301 (experimental)
> >
> > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > Reported-by: syzbot+3c43eecd7745a5ce1...@syzkaller.appspotmail.com
> > It will help syzbot understand when the bug is fixed. See footer for
> > details.
> > If you forward the report, please keep this part and the footer.
> 
> #syz dup: possible deadlock in rtnl_lock (5)

Yes, patch is now in the "nf" tree, so all these
lockups around start_sync_thread should be resolved soon...

> > IPVS: sync thread started: state = BACKUP, mcast_ifn = lo, syncid = 0, id =
> > 0
> > IPVS: stopping backup sync thread 4546 ...
> >
> > 
> > IPVS: stopping backup sync thread 4559 ...
> > WARNING: possible recursive locking detected

Regards

--
Julian Anastasov 


Re: [PATCH net-next] Per interface IPv4 stats (CONFIG_IP_IFSTATS_TABLE)

2018-04-11 Thread Julian Anastasov

Hello,

On Tue, 10 Apr 2018, Stephen Suryaputra wrote:

> This is enhanced from the proposed patch by Igor Maravic in 2011 to
> support per interface IPv4 stats. The enhancement is mainly adding a
> kernel configuration option CONFIG_IP_IFSTATS_TABLE.
> 
> Signed-off-by: Stephen Suryaputra 
> ---

...

> diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c
> index b54b948..bb9be11 100644
> --- a/net/ipv4/ip_forward.c
> +++ b/net/ipv4/ip_forward.c
> @@ -66,8 +66,8 @@ static int ip_forward_finish(struct net *net, struct sock 
> *sk, struct sk_buff *s
>  {
>   struct ip_options *opt  = &(IPCB(skb)->opt);
>  
> - __IP_INC_STATS(net, IPSTATS_MIB_OUTFORWDATAGRAMS);
> - __IP_ADD_STATS(net, IPSTATS_MIB_OUTOCTETS, skb->len);
> + __IP_INC_STATS(net, skb_dst(skb)->dev, IPSTATS_MIB_OUTFORWDATAGRAMS);
> + __IP_ADD_STATS(net, skb_dst(skb)->dev, IPSTATS_MIB_OUTOCTETS, skb->len);
>  
>   if (unlikely(opt->optlen))
>   ip_forward_options(skb);
> @@ -121,7 +121,7 @@ int ip_forward(struct sk_buff *skb)
>   IPCB(skb)->flags |= IPSKB_FORWARDED;
>   mtu = ip_dst_mtu_maybe_forward(>dst, true);
>   if (ip_exceeds_mtu(skb, mtu)) {
> - IP_INC_STATS(net, IPSTATS_MIB_FRAGFAILS);
> + IP_INC_STATS(net, rt->dst.dev, IPSTATS_MIB_FRAGFAILS);
>   icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
> htonl(mtu));
>   goto drop;
> @@ -158,7 +158,7 @@ int ip_forward(struct sk_buff *skb)
>  
>  too_many_hops:
>   /* Tell the sender its packet died... */
> - __IP_INC_STATS(net, IPSTATS_MIB_INHDRERRORS);
> + __IP_INC_STATS(net, skb_dst(skb)->dev, IPSTATS_MIB_INHDRERRORS);

May be skb->dev if we want to account it to the
input device.

>   icmp_send(skb, ICMP_TIME_EXCEEDED, ICMP_EXC_TTL, 0);
>  drop:
>   kfree_skb(skb);

...

> diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
> index 4527921..32bd3af 100644
> --- a/net/netfilter/ipvs/ip_vs_xmit.c
> +++ b/net/netfilter/ipvs/ip_vs_xmit.c
> @@ -286,7 +286,7 @@ static inline bool decrement_ttl(struct netns_ipvs *ipvs,
>   {
>   if (ip_hdr(skb)->ttl <= 1) {
>   /* Tell the sender its packet died... */
> - __IP_INC_STATS(net, IPSTATS_MIB_INHDRERRORS);
> + __IP_INC_STATS(net, skb_dst(skb)->dev, 
> IPSTATS_MIB_INHDRERRORS);

At this point, skb_dst(skb) can be:

- input route at LOCAL_IN => dst->dev is "lo", skb->dev = input_device
- output route at LOCAL_OUT => dst->dev is output_device, skb->dev = NULL

We should see this error on LOCAL_IN but better to be
safe: use 'skb->dev ? : skb_dst(skb)->dev' instead of just
'skb_dst(skb)->dev'.

>   icmp_send(skb, ICMP_TIME_EXCEEDED, ICMP_EXC_TTL, 0);
>   return false;
>   }

The patch probably has other errors, for example,
using rt->dst.dev (lo) when rt->dst.error != 0 in ip_error,
may be 'dev' should be used instead...

Regards

--
Julian Anastasov 


Re: [PATCH net-next] netns: filter uevents correctly

2018-04-11 Thread Eric W. Biederman
Christian Brauner  writes:

> On Wed, Apr 11, 2018 at 01:37:18PM -0500, Eric W. Biederman wrote:
>> Christian Brauner  writes:
>> 
>> > On Wed, Apr 11, 2018 at 11:40:14AM -0500, Eric W. Biederman wrote:
>> >> Christian Brauner  writes:
>> >> > Yeah, agreed.
>> >> > But I think the patch is not complete. To guarantee that no non-initial
>> >> > user namespace actually receives uevents we need to:
>> >> > 1. only sent uevents to uevent sockets that are located in network
>> >> >namespaces that are owned by init_user_ns
>> >> > 2. filter uevents that are sent to sockets in mc_list that have opened a
>> >> >uevent socket that is owned by init_user_ns *from* a
>> >> >non-init_user_ns
>> >> >
>> >> > We account for 1. by only recording uevent sockets in the global uevent
>> >> > socket list who are owned by init_user_ns.
>> >> > But to account for 2. we need to filter by the user namespace who owns
>> >> > the socket in mc_list. So in addition to that we also need to slightly
>> >> > change the filter logic in kobj_bcast_filter() I think:
>> >> >
>> >> > diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
>> >> > index 22a2c1a98b8f..064d7d29ace5 100644
>> >> > --- a/lib/kobject_uevent.c
>> >> > +++ b/lib/kobject_uevent.c
>> >> > @@ -251,7 +251,8 @@ static int kobj_bcast_filter(struct sock *dsk, 
>> >> > struct sk_buff *skb, void *data)
>> >> > return sock_ns != ns;
>> >> > }
>> >> >  
>> >> > -   return 0;
>> >> > +   /* Check if socket was opened from non-initial user namespace. 
>> >> > */
>> >> > +   return sk_user_ns(dsk) != _user_ns;
>> >> >  }
>> >> >  #endif
>> >> >  
>> >> >
>> >> > But correct me if I'm wrong.
>> >> 
>> >> You are worrying about NETLINK_LISTEN_ALL_NSID sockets.  That has
>> >> permissions and an explicit opt-in to receiving packets from multiple
>> >> network namespaces.
>> >
>> > I don't think that's what I'm talking about unless that is somehow the
>> > default for NETLINK_KOBJECT_UEVENT sockets. What I'm worried about is
>> > doing
>> >
>> > unshare -U --map-root
>> >
>> > then opening a NETLINK_KOBJECT_UEVENT socket and starting to listen to
>> > uevents. Imho, this should not be possible because I'm in a
>> > non-init_user_ns. But currently I'm able to - even with the patch to
>> > come - since the uevent socket in the kernel was created when init_net
>> > was created and hence is *owned* by the init_user_ns which means it is
>> > in the list of uevent sockets. Here's a demo of what I mean:
>> >
>> > https://asciinema.org/a/175632
>> 
>> Why do you care about this case?
>
> It's not so much that I care about this case since any workload that
> wants to run a separate udevd will have to unshare the network namespace
> and the user namespace for it to make complete sense.
> What I do care about is that the two of us are absolutely in the clear
> about what semantics we are going to expose to userspace and it seems
> that we were talking past each other wrt to this "corner case".
> For userspace, it needs to be very clear that the intention is to filter
> by *owning user namespace of the network namespace a given task resides
> in* and not by user namespace of the task per se. This is what this
> corner case basically shows, I think.

If this is just a clarification of semantics then yes this is a
productive question.  I almost agree with your definition above.

I would make the definition very simple.  Uevents will not be broadcast
via netlink in a network namespace where net->user_ns != _user_ns,
with the exception of uevents for network devices in that network
namespace.

The existing filtering by the sending uid and verifying that it is uid 0
gives a little more room to filter if we want (as udev & friends will
ignore the uevent), but I don't see the point.

Eric


Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module

2018-04-11 Thread Samudrala, Sridhar

On 4/11/2018 8:51 AM, Jiri Pirko wrote:

Tue, Apr 10, 2018 at 08:59:48PM CEST, sridhar.samudr...@intel.com wrote:

This provides a generic interface for paravirtual drivers to listen
for netdev register/unregister/link change events from pci ethernet
devices with the same MAC and takeover their datapath. The notifier and
event handling code is based on the existing netvsc implementation.

It exposes 2 sets of interfaces to the paravirtual drivers.
1. existing netvsc driver that uses 2 netdev model. In this model, no
master netdev is created. The paravirtual driver registers each bypass
instance along with a set of ops to manage the slave events.
 bypass_master_register()
 bypass_master_unregister()
2. new virtio_net based solution that uses 3 netdev model. In this model,
the bypass module provides interfaces to create/destroy additional master
netdev and all the slave events are managed internally.
  bypass_master_create()
  bypass_master_destroy()

Signed-off-by: Sridhar Samudrala 
---
include/linux/netdevice.h |  14 +
include/net/bypass.h  |  96 ++
net/Kconfig   |  18 +
net/core/Makefile |   1 +
net/core/bypass.c | 844 ++
5 files changed, 973 insertions(+)
create mode 100644 include/net/bypass.h
create mode 100644 net/core/bypass.c

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index cf44503ea81a..587293728f70 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1430,6 +1430,8 @@ enum netdev_priv_flags {
IFF_PHONY_HEADROOM  = 1<<24,
IFF_MACSEC  = 1<<25,
IFF_NO_RX_HANDLER   = 1<<26,
+   IFF_BYPASS  = 1 << 27,
+   IFF_BYPASS_SLAVE= 1 << 28,

I wonder, why you don't follow the existing coding style... Also, please
add these to into the comment above.


To avoid checkpatch warnings. If it is OK to ignore these warnings, I can 
switch back
to the existing coding style to be consistent.





};

#define IFF_802_1Q_VLAN IFF_802_1Q_VLAN
@@ -1458,6 +1460,8 @@ enum netdev_priv_flags {
#define IFF_RXFH_CONFIGURED IFF_RXFH_CONFIGURED
#define IFF_MACSEC  IFF_MACSEC
#define IFF_NO_RX_HANDLER   IFF_NO_RX_HANDLER
+#define IFF_BYPASS IFF_BYPASS
+#define IFF_BYPASS_SLAVE   IFF_BYPASS_SLAVE

/**
  * struct net_device - The DEVICE structure.
@@ -4308,6 +4312,16 @@ static inline bool netif_is_rxfh_configured(const struct 
net_device *dev)
return dev->priv_flags & IFF_RXFH_CONFIGURED;
}

+static inline bool netif_is_bypass_master(const struct net_device *dev)
+{
+   return dev->priv_flags & IFF_BYPASS;
+}
+
+static inline bool netif_is_bypass_slave(const struct net_device *dev)
+{
+   return dev->priv_flags & IFF_BYPASS_SLAVE;
+}
+
/* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
static inline void netif_keep_dst(struct net_device *dev)
{
diff --git a/include/net/bypass.h b/include/net/bypass.h
new file mode 100644
index ..86b02cb894cf
--- /dev/null
+++ b/include/net/bypass.h
@@ -0,0 +1,96 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2018, Intel Corporation. */
+
+#ifndef _NET_BYPASS_H
+#define _NET_BYPASS_H
+
+#include 
+
+struct bypass_ops {
+   int (*slave_pre_register)(struct net_device *slave_netdev,
+ struct net_device *bypass_netdev);
+   int (*slave_join)(struct net_device *slave_netdev,
+ struct net_device *bypass_netdev);
+   int (*slave_pre_unregister)(struct net_device *slave_netdev,
+   struct net_device *bypass_netdev);
+   int (*slave_release)(struct net_device *slave_netdev,
+struct net_device *bypass_netdev);
+   int (*slave_link_change)(struct net_device *slave_netdev,
+struct net_device *bypass_netdev);
+   rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
+};
+
+struct bypass_master {
+   struct list_head list;
+   struct net_device __rcu *bypass_netdev;
+   struct bypass_ops __rcu *ops;
+};
+
+/* bypass state */
+struct bypass_info {
+   /* passthru netdev with same MAC */
+   struct net_device __rcu *active_netdev;

You still use "active"/"backup" names which is highly misleading as
it has completely different meaning that in bond for example.
I noted that in my previous review already. Please change it.


I guess the issue is with only the 'active'  name. 'backup' should be fine as 
it also
matches with the BACKUP feature bit we are adding to virtio_net.

With regards to alternate names for 'active', you suggested 'stolen', but i
am not too happy with it.
netvsc uses vf_netdev, are you OK with this? Or another option is 'passthru'







+
+   /* virtio_net 

[PULL] virtio: feature pull

2018-04-11 Thread Michael S. Tsirkin
The following changes since commit dc32bb678e103afbcfa4d814489af0566307f528:

  vhost: add vsock compat ioctl (2018-03-20 03:17:42 +0200)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus

for you to fetch changes up to 6c64fe7f2adcee21d7c4247f1ec021fd18428fc4:

  virtio_balloon: export hugetlb page allocation counts (2018-04-10 21:23:55 
+0300)


virtio: feature

This adds reporting hugepage stats to virtio-balloon.

Signed-off-by: Michael S. Tsirkin 


Jonathan Helman (1):
  virtio_balloon: export hugetlb page allocation counts

 drivers/virtio/virtio_balloon.c | 6 ++
 include/uapi/linux/virtio_balloon.h | 4 +++-
 2 files changed, 9 insertions(+), 1 deletion(-)


Re: [PATCH v1 net 2/3] lan78xx: Add support to dump lan78xx registers

2018-04-11 Thread kbuild test robot
Hi Raghuram,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]
[also build test WARNING on v4.16 next-20180411]
[cannot apply to net/master]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Raghuram-Chary-J/lan78xx-Fixes-and-enhancements/20180411-231134
reproduce:
# apt-get install sparse
make ARCH=x86_64 allmodconfig
make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

>> drivers/net/usb/lan78xx.c:281:5: sparse: symbol 'lan78xx_regs' was not 
>> declared. Should it be static?
   drivers/net/usb/lan78xx.c:2992:29: sparse: cast to restricted __be16
   drivers/net/usb/lan78xx.c:2992:29: sparse: cast to restricted __be16
   drivers/net/usb/lan78xx.c:2992:29: sparse: cast to restricted __be16
   drivers/net/usb/lan78xx.c:2992:29: sparse: cast to restricted __be16
   drivers/net/usb/lan78xx.c:2992:27: sparse: incorrect type in assignment 
(different base types) @@expected restricted __wsum [usertype] csum @@
got e] csum @@
   drivers/net/usb/lan78xx.c:2992:27:expected restricted __wsum [usertype] 
csum
   drivers/net/usb/lan78xx.c:2992:27:got int

Please review and possibly fold the followup patch.

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


Re: [PATCH] net/mlx5: remove some extraneous spaces in indentations

2018-04-11 Thread Saeed Mahameed
On Tue, 2018-04-10 at 17:27 -0600, Jason Gunthorpe wrote:
> On Tue, Apr 10, 2018 at 05:22:54PM -0600, Jason Gunthorpe wrote:
> > On Mon, Apr 09, 2018 at 01:43:36PM +0100, Colin Ian King wrote:
> > > From: Colin Ian King 
> > > 
> > > There are several lines where there is an extraneous space
> > > causing
> > > indentation misalignment. Remove them.
> > > 
> > > Cleans up Cocconelle warnings:
> > > 
> > > ./drivers/net/ethernet/mellanox/mlx5/core/qp.c:409:3-18: code
> > > aligned
> > > with following code on line 410
> > > ./drivers/net/ethernet/mellanox/mlx5/core/qp.c:415:3-18: code
> > > aligned
> > > with following code on line 416
> > > ./drivers/net/ethernet/mellanox/mlx5/core/qp.c:421:3-18: code
> > > aligned
> > > with following code on line 422
> > > 
> > > Signed-off-by: Colin Ian King 
> > >  drivers/net/ethernet/mellanox/mlx5/core/qp.c | 18 +-
> > > 
> > >  1 file changed, 9 insertions(+), 9 deletions(-)
> > 
> > applied to for-next thanks
> 
> Oh wait, this is for netdev, not rdma sorry. Never mind, DaveM should
> pick it up.
> 

net-next is currently closed.

Dave, if it is ok with you I would like to apply this into mlx5-next
branch so it won't cause any issues with our next pull requests to rdma
and netdev.

> Jason

Re: [RFC bpf-next v2 7/8] bpf: add documentation for eBPF helpers (51-57)

2018-04-11 Thread Quentin Monnet
2018-04-11 16:44 UTC+0100 ~ Quentin Monnet 
> 2018-04-10 09:58 UTC-0700 ~ Yonghong Song 
>> On 4/10/18 7:41 AM, Quentin Monnet wrote:
>>> Add documentation for eBPF helper functions to bpf.h user header file.
>>> This documentation can be parsed with the Python script provided in
>>> another commit of the patch series, in order to provide a RST document
>>> that can later be converted into a man page.
>>>
>>> The objective is to make the documentation easily understandable and
>>> accessible to all eBPF developers, including beginners.
>>>
>>> This patch contains descriptions for the following helper functions:
>>>
>>> Helpers from Lawrence:
>>> - bpf_setsockopt()
>>> - bpf_getsockopt()
>>> - bpf_sock_ops_cb_flags_set()
>>>
>>> Helpers from Yonghong:
>>> - bpf_perf_event_read_value()
>>> - bpf_perf_prog_read_value()
>>>
>>> Helper from Josef:
>>> - bpf_override_return()
>>>
>>> Helper from Andrey:
>>> - bpf_bind()
>>>
>>> Cc: Lawrence Brakmo 
>>> Cc: Yonghong Song 
>>> Cc: Josef Bacik 
>>> Cc: Andrey Ignatov 
>>> Signed-off-by: Quentin Monnet 
>>> ---

> [...]
> 
> Thanks Yonghong for the review!
> 
> I have a favor to ask of you. I got a bounce for Kaixu Xia's email
> address, and I don't know what alternative email address I could use. I
> CC-ed to have a review for helper bpf_perf_event_read() (in patch 6 of
> this series), which is rather close to bpf_perf_event_read_value().
> Would you mind having a look at that one too, please? The description is
> not long.

Well I read again the description I wrote, and actually the one for
bpf_perf_evnet_read() is nearly a subset of the one for
perf_event_read_value(). So the same comments that you raised earlier
apply, there's probably nothing more to review. But if you notice that
some important info is missing for bpf_perf_event_read(), I'm interested
too!

Quentin


Re: [PATCH net-next] Per interface IPv4 stats (CONFIG_IP_IFSTATS_TABLE)

2018-04-11 Thread Stephen Suryaputra
I read the thread and understand some of the concerns, but these are
useful for network equipments. That's why in this patch, I made it as
configuration option.
If you have feedbacks, I'll appreciate them.

I can resend when net-next is opened again.

Thanks.

On Wed, Apr 11, 2018 at 12:14 PM, Stephen Hemminger
 wrote:
> On Tue, 10 Apr 2018 22:55:35 -0400
> Stephen Suryaputra  wrote:
>
>> This is enhanced from the proposed patch by Igor Maravic in 2011 to
>> support per interface IPv4 stats. The enhancement is mainly adding a
>> kernel configuration option CONFIG_IP_IFSTATS_TABLE.
>>
>> Signed-off-by: Stephen Suryaputra 
>
>
> Net-next is closed. http://vger.kernel.org/~davem/net-next.html
>
> Also, these kind of statistics have a negative performance impact.


Re: [PATCH net-next] netns: filter uevents correctly

2018-04-11 Thread Christian Brauner
On Wed, Apr 11, 2018 at 11:40:14AM -0500, Eric W. Biederman wrote:
> Christian Brauner  writes:
> 
> > On Tue, Apr 10, 2018 at 10:04:46AM -0500, Eric W. Biederman wrote:
> >> Christian Brauner  writes:
> >> 
> >> > On Mon, Apr 09, 2018 at 06:21:31PM -0500, Eric W. Biederman wrote:
> >> >> Christian Brauner  writes:
> >> >> 
> >> >> > On Thu, Apr 05, 2018 at 10:59:49PM -0500, Eric W. Biederman wrote:
> >> >> >> Christian Brauner  writes:
> >> >> >> 
> >> >> >> > On Thu, Apr 05, 2018 at 05:26:59PM +0300, Kirill Tkhai wrote:
> >> >> >> >> On 05.04.2018 17:07, Christian Brauner wrote:
> >> >> >> >> > On Thu, Apr 05, 2018 at 04:01:03PM +0300, Kirill Tkhai wrote:
> >> >> >> >> >> On 04.04.2018 22:48, Christian Brauner wrote:
> >> >> >> >> >>> commit 07e98962fa77 ("kobject: Send hotplug events in all 
> >> >> >> >> >>> network namespaces")
> >> >> >> >> >>>
> >> >> >> >> >>> enabled sending hotplug events into all network namespaces 
> >> >> >> >> >>> back in 2010.
> >> >> >> >> >>> Over time the set of uevents that get sent into all network 
> >> >> >> >> >>> namespaces has
> >> >> >> >> >>> shrunk. We have now reached the point where hotplug events 
> >> >> >> >> >>> for all devices
> >> >> >> >> >>> that carry a namespace tag are filtered according to that 
> >> >> >> >> >>> namespace.
> >> >> >> >> >>>
> >> >> >> >> >>> Specifically, they are filtered whenever the namespace tag of 
> >> >> >> >> >>> the kobject
> >> >> >> >> >>> does not match the namespace tag of the netlink socket. One 
> >> >> >> >> >>> example are
> >> >> >> >> >>> network devices. Uevents for network devices only show up in 
> >> >> >> >> >>> the network
> >> >> >> >> >>> namespaces these devices are moved to or created in.
> >> >> >> >> >>>
> >> >> >> >> >>> However, any uevent for a kobject that does not have a 
> >> >> >> >> >>> namespace tag
> >> >> >> >> >>> associated with it will not be filtered and we will *try* to 
> >> >> >> >> >>> broadcast it
> >> >> >> >> >>> into all network namespaces.
> >> >> >> >> >>>
> >> >> >> >> >>> The original patchset was written in 2010 before user 
> >> >> >> >> >>> namespaces were a
> >> >> >> >> >>> thing. With the introduction of user namespaces sending out 
> >> >> >> >> >>> uevents became
> >> >> >> >> >>> partially isolated as they were filtered by user namespaces:
> >> >> >> >> >>>
> >> >> >> >> >>> net/netlink/af_netlink.c:do_one_broadcast()
> >> >> >> >> >>>
> >> >> >> >> >>> if (!net_eq(sock_net(sk), p->net)) {
> >> >> >> >> >>> if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
> >> >> >> >> >>> return;
> >> >> >> >> >>>
> >> >> >> >> >>> if (!peernet_has_id(sock_net(sk), p->net))
> >> >> >> >> >>> return;
> >> >> >> >> >>>
> >> >> >> >> >>> if (!file_ns_capable(sk->sk_socket->file, 
> >> >> >> >> >>> p->net->user_ns,
> >> >> >> >> >>>  CAP_NET_BROADCAST))
> >> >> >> >> >>> j   return;
> >> >> >> >> >>> }
> >> >> >> >> >>>
> >> >> >> >> >>> The file_ns_capable() check will check whether the caller had
> >> >> >> >> >>> CAP_NET_BROADCAST at the time of opening the netlink socket 
> >> >> >> >> >>> in the user
> >> >> >> >> >>> namespace of interest. This check is fine in general but 
> >> >> >> >> >>> seems insufficient
> >> >> >> >> >>> to me when paired with uevents. The reason is that devices 
> >> >> >> >> >>> always belong to
> >> >> >> >> >>> the initial user namespace so uevents for kobjects that do 
> >> >> >> >> >>> not carry a
> >> >> >> >> >>> namespace tag should never be sent into another user 
> >> >> >> >> >>> namespace. This has
> >> >> >> >> >>> been the intention all along. But there's one case where this 
> >> >> >> >> >>> breaks,
> >> >> >> >> >>> namely if a new user namespace is created by root on the host 
> >> >> >> >> >>> and an
> >> >> >> >> >>> identity mapping is established between root on the host and 
> >> >> >> >> >>> root in the
> >> >> >> >> >>> new user namespace. Here's a reproducer:
> >> >> >> >> >>>
> >> >> >> >> >>>  sudo unshare -U --map-root
> >> >> >> >> >>>  udevadm monitor -k
> >> >> >> >> >>>  # Now change to initial user namespace and e.g. do
> >> >> >> >> >>>  modprobe kvm
> >> >> >> >> >>>  # or
> >> >> >> >> >>>  rmmod kvm
> >> >> >> >> >>>
> >> >> >> >> >>> will allow the non-initial user namespace to retrieve all 
> >> >> >> >> >>> uevents from the
> >> >> >> >> >>> host. This seems very anecdotal given that in the general 
> >> >> >> >> >>> case user
> >> >> >> >> >>> namespaces do not see any uevents and also can't really do 
> >> >> >> >> >>> anything useful
> >> >> >> >> >>> with them.
> >> >> >> >> >>>
> >> >> >> >> >>> Additionally, it is now possible to send uevents from 
> >> >> >> >> >>> userspace. As such we
> >> >> >> >> >>> can let a sufficiently privileged (CAP_SYS_ADMIN in the 
> >> >> >> >> >>> owning 

Re: [PATCH net-next] netns: filter uevents correctly

2018-04-11 Thread Eric W. Biederman
Christian Brauner  writes:

> On Tue, Apr 10, 2018 at 10:04:46AM -0500, Eric W. Biederman wrote:
>> Christian Brauner  writes:
>> 
>> > On Mon, Apr 09, 2018 at 06:21:31PM -0500, Eric W. Biederman wrote:
>> >> Christian Brauner  writes:
>> >> 
>> >> > On Thu, Apr 05, 2018 at 10:59:49PM -0500, Eric W. Biederman wrote:
>> >> >> Christian Brauner  writes:
>> >> >> 
>> >> >> > On Thu, Apr 05, 2018 at 05:26:59PM +0300, Kirill Tkhai wrote:
>> >> >> >> On 05.04.2018 17:07, Christian Brauner wrote:
>> >> >> >> > On Thu, Apr 05, 2018 at 04:01:03PM +0300, Kirill Tkhai wrote:
>> >> >> >> >> On 04.04.2018 22:48, Christian Brauner wrote:
>> >> >> >> >>> commit 07e98962fa77 ("kobject: Send hotplug events in all 
>> >> >> >> >>> network namespaces")
>> >> >> >> >>>
>> >> >> >> >>> enabled sending hotplug events into all network namespaces back 
>> >> >> >> >>> in 2010.
>> >> >> >> >>> Over time the set of uevents that get sent into all network 
>> >> >> >> >>> namespaces has
>> >> >> >> >>> shrunk. We have now reached the point where hotplug events for 
>> >> >> >> >>> all devices
>> >> >> >> >>> that carry a namespace tag are filtered according to that 
>> >> >> >> >>> namespace.
>> >> >> >> >>>
>> >> >> >> >>> Specifically, they are filtered whenever the namespace tag of 
>> >> >> >> >>> the kobject
>> >> >> >> >>> does not match the namespace tag of the netlink socket. One 
>> >> >> >> >>> example are
>> >> >> >> >>> network devices. Uevents for network devices only show up in 
>> >> >> >> >>> the network
>> >> >> >> >>> namespaces these devices are moved to or created in.
>> >> >> >> >>>
>> >> >> >> >>> However, any uevent for a kobject that does not have a 
>> >> >> >> >>> namespace tag
>> >> >> >> >>> associated with it will not be filtered and we will *try* to 
>> >> >> >> >>> broadcast it
>> >> >> >> >>> into all network namespaces.
>> >> >> >> >>>
>> >> >> >> >>> The original patchset was written in 2010 before user 
>> >> >> >> >>> namespaces were a
>> >> >> >> >>> thing. With the introduction of user namespaces sending out 
>> >> >> >> >>> uevents became
>> >> >> >> >>> partially isolated as they were filtered by user namespaces:
>> >> >> >> >>>
>> >> >> >> >>> net/netlink/af_netlink.c:do_one_broadcast()
>> >> >> >> >>>
>> >> >> >> >>> if (!net_eq(sock_net(sk), p->net)) {
>> >> >> >> >>> if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
>> >> >> >> >>> return;
>> >> >> >> >>>
>> >> >> >> >>> if (!peernet_has_id(sock_net(sk), p->net))
>> >> >> >> >>> return;
>> >> >> >> >>>
>> >> >> >> >>> if (!file_ns_capable(sk->sk_socket->file, 
>> >> >> >> >>> p->net->user_ns,
>> >> >> >> >>>  CAP_NET_BROADCAST))
>> >> >> >> >>> j   return;
>> >> >> >> >>> }
>> >> >> >> >>>
>> >> >> >> >>> The file_ns_capable() check will check whether the caller had
>> >> >> >> >>> CAP_NET_BROADCAST at the time of opening the netlink socket in 
>> >> >> >> >>> the user
>> >> >> >> >>> namespace of interest. This check is fine in general but seems 
>> >> >> >> >>> insufficient
>> >> >> >> >>> to me when paired with uevents. The reason is that devices 
>> >> >> >> >>> always belong to
>> >> >> >> >>> the initial user namespace so uevents for kobjects that do not 
>> >> >> >> >>> carry a
>> >> >> >> >>> namespace tag should never be sent into another user namespace. 
>> >> >> >> >>> This has
>> >> >> >> >>> been the intention all along. But there's one case where this 
>> >> >> >> >>> breaks,
>> >> >> >> >>> namely if a new user namespace is created by root on the host 
>> >> >> >> >>> and an
>> >> >> >> >>> identity mapping is established between root on the host and 
>> >> >> >> >>> root in the
>> >> >> >> >>> new user namespace. Here's a reproducer:
>> >> >> >> >>>
>> >> >> >> >>>  sudo unshare -U --map-root
>> >> >> >> >>>  udevadm monitor -k
>> >> >> >> >>>  # Now change to initial user namespace and e.g. do
>> >> >> >> >>>  modprobe kvm
>> >> >> >> >>>  # or
>> >> >> >> >>>  rmmod kvm
>> >> >> >> >>>
>> >> >> >> >>> will allow the non-initial user namespace to retrieve all 
>> >> >> >> >>> uevents from the
>> >> >> >> >>> host. This seems very anecdotal given that in the general case 
>> >> >> >> >>> user
>> >> >> >> >>> namespaces do not see any uevents and also can't really do 
>> >> >> >> >>> anything useful
>> >> >> >> >>> with them.
>> >> >> >> >>>
>> >> >> >> >>> Additionally, it is now possible to send uevents from 
>> >> >> >> >>> userspace. As such we
>> >> >> >> >>> can let a sufficiently privileged (CAP_SYS_ADMIN in the owning 
>> >> >> >> >>> user
>> >> >> >> >>> namespace of the network namespace of the netlink socket) 
>> >> >> >> >>> userspace process
>> >> >> >> >>> make a decision what uevents should be sent.
>> >> >> >> >>>
>> >> >> >> >>> This makes me think that we should simply ensure that 

Re: [PATCH net] sctp: do not check port in sctp_inet6_cmp_addr

2018-04-11 Thread David Miller
From: Xin Long 
Date: Thu, 12 Apr 2018 00:16:58 +0800

> What do you think of:
> 
> static int sctp_v6_cmp_addr(const union sctp_addr *addr1,
> const union sctp_addr *addr2)
> {
> return __sctp_v6_cmp_addr(addr1, addr2) &&
>addr1->v6.sin_port == addr2->v6.sin_port;
> }
> 
> (v6.sin_port and v4.sin_port have the same offset in union sctp_addr,
>  we've exploited this in many places in SCTP)

>From my perspective this is OK.


Re: [PATCH] net: dsa: b53: Replace mdelay with msleep in b53_switch_reset_gpio

2018-04-11 Thread Florian Fainelli
On 04/11/2018 12:14 AM, Jia-Ju Bai wrote:
> 
> 
> On 2018/4/11 13:30, Phil Reid wrote:
>> On 11/04/2018 09:51, Jia-Ju Bai wrote:
>>> b53_switch_reset_gpio() is never called in atomic context.
>>>
>>> The call chain ending up at b53_switch_reset_gpio() is:
>>> [1] b53_switch_reset_gpio() <- b53_switch_reset() <-
>>>     b53_reset_switch() <- b53_setup()
>>>
>>> b53_switch_reset_gpio() is set as ".setup" in struct dsa_switch_ops.
>>> This function is not called in atomic context.
>>>
>>> Despite never getting called from atomic context,
>>> b53_switch_reset_gpio()
>>> calls mdelay() to busily wait.
>>> This is not necessary and can be replaced with msleep() to
>>> avoid busy waiting.
>>>
>>> This is found by a static analysis tool named DCNS written by myself.
>>> And I also manually check it.
>>>
>>> Signed-off-by: Jia-Ju Bai 
>>> ---
>>>   drivers/net/dsa/b53/b53_common.c | 4 ++--
>>>   1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/net/dsa/b53/b53_common.c
>>> b/drivers/net/dsa/b53/b53_common.c
>>> index 274f367..e070ff6 100644
>>> --- a/drivers/net/dsa/b53/b53_common.c
>>> +++ b/drivers/net/dsa/b53/b53_common.c
>>> @@ -597,10 +597,10 @@ static void b53_switch_reset_gpio(struct
>>> b53_device *dev)
>>>   /* Reset sequence: RESET low(50ms)->high(20ms)
>>>    */
>>>   gpio_set_value(gpio, 0);
>>> -    mdelay(50);
>>> +    msleep(50);
>>>     gpio_set_value(gpio, 1);
>>> -    mdelay(20);
>>> +    msleep(20);
>>>     dev->current_page = 0xff;
>>>   }
>>>
>> Would that also imply gpio_set_value could be gpio_set_value_cansleep?
>>
> 
> Yes, I think gpio_set_value_cansleep() is okay here?
> Do I need to send a V2 patch to replace gpio_set_value()?

Yes, I would lump these two changes in the same patch since this is
effectively about solving sleeping vs. non sleeping operations.

Thanks!
-- 
Florian


Re: [PATCH net] sctp: do not check port in sctp_inet6_cmp_addr

2018-04-11 Thread Xin Long
On Wed, Apr 11, 2018 at 10:59 PM, Marcelo Ricardo Leitner
 wrote:
> On Wed, Apr 11, 2018 at 10:36:07AM -0400, Neil Horman wrote:
>> On Wed, Apr 11, 2018 at 08:58:05PM +0800, Xin Long wrote:
>> > pf->cmp_addr() is called before binding a v6 address to the sock. It
>> > should not check ports, like in sctp_inet_cmp_addr.
>> >
>> > But sctp_inet6_cmp_addr checks the addr by invoking af(6)->cmp_addr,
>> > sctp_v6_cmp_addr where it also compares the ports.
>> >
>> > This would cause that setsockopt(SCTP_SOCKOPT_BINDX_ADD) could bind
>> > multiple duplicated IPv6 addresses after Commit 40b4f0fd74e4 ("sctp:
>> > lack the check for ports in sctp_v6_cmp_addr").
>> >
>> > This patch is to remove af->cmp_addr called in sctp_inet6_cmp_addr,
>> > but do the proper check for both v6 addrs and v4mapped addrs.
>> >
>> > Fixes: 40b4f0fd74e4 ("sctp: lack the check for ports in sctp_v6_cmp_addr")
>> > Reported-by: Jianwen Ji 
>> > Signed-off-by: Xin Long 
>> > ---
>> >  net/sctp/ipv6.c | 27 ---
>> >  1 file changed, 24 insertions(+), 3 deletions(-)
>> >
>> > diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
>> > index f1fc48e..be4b72c 100644
>> > --- a/net/sctp/ipv6.c
>> > +++ b/net/sctp/ipv6.c
>> > @@ -846,8 +846,8 @@ static int sctp_inet6_cmp_addr(const union sctp_addr 
>> > *addr1,
>> >const union sctp_addr *addr2,
>> >struct sctp_sock *opt)
>> >  {
>> > -   struct sctp_af *af1, *af2;
>> > struct sock *sk = sctp_opt2sk(opt);
>> > +   struct sctp_af *af1, *af2;
>> >
>> > af1 = sctp_get_af_specific(addr1->sa.sa_family);
>> > af2 = sctp_get_af_specific(addr2->sa.sa_family);
>> > @@ -863,10 +863,31 @@ static int sctp_inet6_cmp_addr(const union sctp_addr 
>> > *addr1,
>> > if (sctp_is_any(sk, addr1) || sctp_is_any(sk, addr2))
>> > return 1;
>> >
>> > -   if (addr1->sa.sa_family != addr2->sa.sa_family)
>> > +   if (addr1->sa.sa_family != addr2->sa.sa_family) {
>> > +   if (addr1->sa.sa_family == AF_INET &&
>> > +   addr2->sa.sa_family == AF_INET6 &&
>> > +   ipv6_addr_v4mapped(>v6.sin6_addr))
>> > +   if (addr2->v6.sin6_addr.s6_addr32[3] ==
>> > +   addr1->v4.sin_addr.s_addr)
>> > +   return 1;
>> > +   if (addr2->sa.sa_family == AF_INET &&
>> > +   addr1->sa.sa_family == AF_INET6 &&
>> > +   ipv6_addr_v4mapped(>v6.sin6_addr))
>> > +   if (addr1->v6.sin6_addr.s6_addr32[3] ==
>> > +   addr2->v4.sin_addr.s_addr)
>> > +   return 1;
>> > +   return 0;
>> > +   }
>> > +
>> > +   if (!ipv6_addr_equal(>v6.sin6_addr, >v6.sin6_addr))
>> > +   return 0;
>> > +
>> > +   if ((ipv6_addr_type(>v6.sin6_addr) & IPV6_ADDR_LINKLOCAL) &&
>> > +   addr1->v6.sin6_scope_id && addr2->v6.sin6_scope_id &&
>> > +   addr1->v6.sin6_scope_id != addr2->v6.sin6_scope_id)
>> > return 0;
>> >
>> > -   return af1->cmp_addr(addr1, addr2);
>> > +   return 1;
>> >  }
>> >
>> >  /* Verify that the provided sockaddr looks bindable.   Common 
>> > verification,
>> > --
>> > 2.1.0
>> >
>> This looks correct to me, but is it worth duplicating the comparison code 
>> like
>> this from the cmp_addr function?  It might be more worthwhile to add a flag 
>> to
>> the cmp_addr method to direct weather it needs to check port values or not.
>> That way you could continue to use the cmp_addr function here.
>
> Adding a flag into sctp_v6_cmp_addr will get us a terrible code to
> read. It's already not one of the best looking part of it. Maybe
> still duplicate part of it it, but at 'af' level? As in:
> - af->cmp_addr
> - af->cmp_addr_port
>
What do you think of:

static int sctp_v6_cmp_addr(const union sctp_addr *addr1,
const union sctp_addr *addr2)
{
return __sctp_v6_cmp_addr(addr1, addr2) &&
   addr1->v6.sin_port == addr2->v6.sin_port;
}

(v6.sin_port and v4.sin_port have the same offset in union sctp_addr,
 we've exploited this in many places in SCTP)


Re: [PATCH v2] dec: tulip: de4x5: Replace mdelay with usleep_range in de4x5_hw_init

2018-04-11 Thread James Bottomley
On Wed, 2018-04-11 at 23:39 +0800, Jia-Ju Bai wrote:
> de4x5_hw_init() is never called in atomic context.
> 
> de4x5_hw_init() is only called by de4x5_pci_probe(), which is only 
> set as ".probe" in struct pci_driver.
> 
> Despite never getting called from atomic context, de4x5_hw_init() 
> calls mdelay() to busily wait. This is not necessary and can be
> replaced with usleep_range() to  avoid busy waiting.
> 
> This is found by a static analysis tool named DCNS written by myself.
> And I also manually check it.

Did you actually test this?  The usual reason for wanting m/udelay is
that the timing must be exact.  The driver is filled with mdelay()s for
this reason.  The one you've picked on is in the init path so it won't
affect the runtime in any way.  I also don't think we have the hrtimer
machinery for usleep_range() to work properly on parisc, so I don't
think the replacement works.

James



Re: [PATCH net-next] Per interface IPv4 stats (CONFIG_IP_IFSTATS_TABLE)

2018-04-11 Thread Stephen Hemminger
On Tue, 10 Apr 2018 22:55:35 -0400
Stephen Suryaputra  wrote:

> This is enhanced from the proposed patch by Igor Maravic in 2011 to
> support per interface IPv4 stats. The enhancement is mainly adding a
> kernel configuration option CONFIG_IP_IFSTATS_TABLE.
> 
> Signed-off-by: Stephen Suryaputra 


Net-next is closed. http://vger.kernel.org/~davem/net-next.html

Also, these kind of statistics have a negative performance impact.


4.15.13 kernel panic, ip_rcv_finish, nf_xfrm_me_harder warnings continue to fill dmesg

2018-04-11 Thread Denys Fedoryshchenko

Apr 11 18:01:34[99194.935520] general protection fault:  [#1] SMP
Apr 11 18:01:34[99194.935998] Modules linked in: pppoe pppox ppp_generic 
slhc ip_set_hash_net xt_nat xt_string xt_connmark xt_TCPMSS xt_mark 
xt_CT xt_set xt_tcpudp ip_set_bitmap_port ip_set nfnetlink
iptable_raw iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 
nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle ip_tables x_tables 
netconsole configfs 8021q garp mrp stp llc ixgbe dca ipv6
Apr 11 18:01:34[99194.938313] CPU: 23 PID: 150 Comm: ksoftirqd/23 
Tainted: GW4.15.13-build-0135 #4
Apr 11 18:01:34[99194.939258] Hardware name: Intel Corporation 
S2600GZ/S2600GZ, BIOS SE5C600.86B.02.04.0003.102320141138 10/23/2014

Apr 11 18:01:34[99194.940189] RIP: 0010:ip_rcv_finish+0x2b5/0x2e5
Apr 11 18:01:34[99194.940716] RSP: 0018:c9000cad7cf8 EFLAGS: 
00010286
Apr 11 18:01:34[99194.941214] RAX: 00e2476d RBX: 
88178f944400 RCX: 88179a32a800
Apr 11 18:01:34[99194.941741] RDX: 88178f944400 RSI: 
 RDI: 88178f944400
Apr 11 18:01:34[99194.942234] RBP: 882fd580d000 R08: 
0001 R09: 882fd034ee00
Apr 11 18:01:34[99194.942771] R10: c9000cad7b58 R11: 
e92316b9 R12: 88179a32a8d6
Apr 11 18:01:34[99194.943286] R13: 882fd580d000 R14: 
00ea0008 R15: 882fd580d078
Apr 11 18:01:34[99194.943821] FS:  () 
GS:88303fcc() knlGS:
Apr 11 18:01:34[99194.944779] CS:  0010 DS:  ES:  CR0: 
80050033
Apr 11 18:01:34[99194.945287] CR2: 7f8bb37888f0 CR3: 
00303e209003 CR4: 001606e0

Apr 11 18:01:34[99194.945808] Call Trace:
Apr 11 18:01:34[99194.946307]  ip_rcv+0x2f2/0x325
Apr 11 18:01:34[99194.946816]  ? ip_local_deliver_finish+0x187/0x187
Apr 11 18:01:34[99194.947331]  __netif_receive_skb_core+0x81c/0x89c
Apr 11 18:01:34[99194.947872]  ? napi_complete_done+0xb4/0xba
Apr 11 18:01:34[99194.948391]  ? ixgbe_poll+0xf96/0x104d [ixgbe]
Apr 11 18:01:34[99194.948931]  ? process_backlog+0x8b/0x10d
Apr 11 18:01:34[99194.949424]  process_backlog+0x8b/0x10d
Apr 11 18:01:34[99194.949953]  net_rx_action+0x127/0x2b5
Apr 11 18:01:34[99194.950445]  __do_softirq+0xc1/0x1b1
Apr 11 18:01:34[99194.950951]  ? sort_range+0x17/0x17
Apr 11 18:01:34[99194.951442]  run_ksoftirqd+0x11/0x22
Apr 11 18:01:34[99194.951972]  smpboot_thread_fn+0x121/0x136
Apr 11 18:01:34[99194.952489]  kthread+0xfd/0x105
Apr 11 18:01:34[99194.953018]  ? kthread_create_on_node+0x3a/0x3a
Apr 11 18:01:34[99194.953528]  ret_from_fork+0x1f/0x30
Apr 11 18:01:34[99194.954047] Code: 15 77 9e 99 00 83 7a 7c 00 75 37 83 
b8 2c 01 00 00 00 75 2e 48 8b 43 58 48 89 df 5b 5d 48 83 e0 fe 41 5c 41 
5d 41 5e 48 8b 40 50  e0 83 f8 ee 75 10 49 8b 84 24

90 01 00 00 65 48 ff 80 40 02
Apr 11 18:01:34[99194.955449] RIP: ip_rcv_finish+0x2b5/0x2e5 RSP: 
c9000cad7cf8

Apr 11 18:01:34[99194.956008] ---[ end trace 312b0bf537b4709a ]---
Apr 11 18:01:34[99195.007900] Kernel panic - not syncing: Fatal 
exception in interrupt

Apr 11 18:01:34[99195.008400] Kernel Offset: disabled
Apr 11 18:01:34[99195.013950] Rebooting in 5 seconds..
--


and i reported before about warnings in nf_frm_me_harder, but probably 
nobody have interest to take a look, and it is seems plaguing 4.15.x and 
nearby versions kernels . Here is one of such warnings.


---
Apr 11 00:00:17[34320.802349] dst_release: dst:b32dca17 
refcnt:-2
Apr 11 00:00:19[34323.018468] WARNING: CPU: 7 PID: 0 at 
./include/net/dst.h:256 nf_xfrm_me_harder+0x62/0xfe [nf_nat]
Apr 11 00:00:19[34323.019357] Modules linked in: pppoe pppox ppp_generic 
slhc ip_set_hash_net xt_nat xt_string xt_connmark xt_TCPMSS xt_mark 
xt_CT xt_set xt_tcpudp ip_set_bitmap_port ip_set nfnetlink
iptable_raw iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 
nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle ip_tables x_tables 
netconsole configfs 8021q garp mrp stp llc ixgbe dca ipv6
Apr 11 00:00:19[34323.021503] CPU: 7 PID: 0 Comm: swapper/7 Tainted: G   
 W4.15.13-build-0135 #4
Apr 11 00:00:19[34323.022380] Hardware name: Intel Corporation 
S2600GZ/S2600GZ, BIOS SE5C600.86B.02.04.0003.102320141138 10/23/2014
Apr 11 00:00:19[34323.023261] RIP: 0010:nf_xfrm_me_harder+0x62/0xfe 
[nf_nat]
Apr 11 00:00:19[34323.023737] RSP: 0018:88303fa43c90 EFLAGS: 
00010246
Apr 11 00:00:19[34323.024218] RAX:  RBX: 
8817b2c35200 RCX: 
Apr 11 00:00:19[34323.024703] RDX: 0002 RSI: 
88178fab3700 RDI: 88303fa43cd0
Apr 11 00:00:19[34323.025214] RBP: 822a6180 R08: 
0005 R09: 0001
Apr 11 00:00:19[34323.025717] R10: 00d6 R11: 
8817c945bca0 R12: 0001
Apr 11 00:00:19[34323.026214] R13: 88303fa43d60 R14: 
00ce0008 R15: 8817b7477078
Apr 11 00:00:19[34323.026736] FS:  () 
GS:88303fa4() knlGS:
Apr 11 00:00:19[34323.027680] 

[PATCH net-next] Per interface IPv4 stats (CONFIG_IP_IFSTATS_TABLE)

2018-04-11 Thread Stephen Suryaputra
This is enhanced from the proposed patch by Igor Maravic in 2011 to
support per interface IPv4 stats. The enhancement is mainly adding a
kernel configuration option CONFIG_IP_IFSTATS_TABLE.

Signed-off-by: Stephen Suryaputra 
---
 drivers/net/vrf.c   |   2 +-
 include/linux/inetdevice.h  |  22 +++
 include/net/icmp.h  |  44 --
 include/net/ip.h|  72 +--
 include/net/ipv6.h  |  62 
 include/net/netns/mib.h |   3 +
 include/net/snmp.h  |  95 ++
 net/bridge/br_netfilter_hooks.c |  10 ++--
 net/dccp/ipv4.c |   4 +-
 net/ipv4/Kconfig|   8 +++
 net/ipv4/af_inet.c  |   7 ++-
 net/ipv4/datagram.c |   2 +-
 net/ipv4/devinet.c  |  85 ++-
 net/ipv4/icmp.c |  32 +-
 net/ipv4/inet_connection_sock.c |   8 ++-
 net/ipv4/ip_forward.c   |   8 +--
 net/ipv4/ip_fragment.c  |  20 ---
 net/ipv4/ip_input.c |  29 -
 net/ipv4/ip_output.c|  40 -
 net/ipv4/ipmr.c |   6 +-
 net/ipv4/ping.c |   9 ++-
 net/ipv4/proc.c | 126 
 net/ipv4/raw.c  |   4 +-
 net/ipv4/route.c|   6 +-
 net/ipv4/tcp_ipv4.c |   4 +-
 net/ipv4/udp.c  |   4 +-
 net/l2tp/l2tp_ip.c  |   4 +-
 net/l2tp/l2tp_ip6.c |   2 +-
 net/mpls/af_mpls.c  |   2 +-
 net/netfilter/ipvs/ip_vs_xmit.c |   2 +-
 net/sctp/input.c|   2 +-
 net/sctp/output.c   |   2 +-
 32 files changed, 570 insertions(+), 156 deletions(-)

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index 0a2b180..509c2ca 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -592,7 +592,7 @@ static int vrf_output(struct net *net, struct sock *sk, 
struct sk_buff *skb)
 {
struct net_device *dev = skb_dst(skb)->dev;
 
-   IP_UPD_PO_STATS(net, IPSTATS_MIB_OUT, skb->len);
+   IP_UPD_PO_STATS(net, dev, IPSTATS_MIB_OUT, skb->len);
 
skb->dev = dev;
skb->protocol = htons(ETH_P_IP);
diff --git a/include/linux/inetdevice.h b/include/linux/inetdevice.h
index e16fe7d..3d120cb 100644
--- a/include/linux/inetdevice.h
+++ b/include/linux/inetdevice.h
@@ -22,6 +22,15 @@ struct ipv4_devconf {
 
 #define MC_HASH_SZ_LOG 9
 
+#ifdef CONFIG_IP_IFSTATS_TABLE
+struct ipv4_devstat {
+   struct proc_dir_entry *proc_dir_entry;
+   DEFINE_SNMP_STAT(struct ipstats_mib, ip);
+   DEFINE_SNMP_STAT_ATOMIC(struct icmp_mib_device, icmpdev);
+   DEFINE_SNMP_STAT_ATOMIC(struct icmpmsg_mib_device, icmpmsgdev);
+};
+#endif
+
 struct in_device {
struct net_device   *dev;
refcount_t  refcnt;
@@ -45,6 +54,9 @@ struct in_device {
 
struct neigh_parms  *arp_parms;
struct ipv4_devconf cnf;
+#ifdef CONFIG_IP_IFSTATS_TABLE
+   struct ipv4_devstat stats;
+#endif
struct rcu_head rcu_head;
 };
 
@@ -216,6 +228,16 @@ static inline struct in_device *__in_dev_get_rcu(const 
struct net_device *dev)
return rcu_dereference(dev->ip_ptr);
 }
 
+#ifdef CONFIG_IP_IFSTATS_TABLE
+static inline struct in_device *__in_dev_get_rcu_safely(const struct 
net_device *dev)
+{
+   if (likely(dev))
+   return rcu_dereference(dev->ip_ptr);
+   else
+   return NULL;
+}
+#endif
+
 static inline struct in_device *in_dev_get(const struct net_device *dev)
 {
struct in_device *in_dev;
diff --git a/include/net/icmp.h b/include/net/icmp.h
index 3ef2743..fdfbc0f 100644
--- a/include/net/icmp.h
+++ b/include/net/icmp.h
@@ -29,10 +29,44 @@ struct icmp_err {
 };
 
 extern const struct icmp_err icmp_err_convert[];
-#define ICMP_INC_STATS(net, field) 
SNMP_INC_STATS((net)->mib.icmp_statistics, field)
-#define __ICMP_INC_STATS(net, field)   
__SNMP_INC_STATS((net)->mib.icmp_statistics, field)
-#define ICMPMSGOUT_INC_STATS(net, field)   
SNMP_INC_STATS_ATOMIC_LONG((net)->mib.icmpmsg_statistics, field+256)
-#define ICMPMSGIN_INC_STATS(net, field)
SNMP_INC_STATS_ATOMIC_LONG((net)->mib.icmpmsg_statistics, field)
+#ifdef CONFIG_IP_IFSTATS_TABLE
+#define ICMP_INC_STATS(net, dev, field)\
+   ({  \
+   rcu_read_lock();\
+   _DEVINCATOMIC(net, icmp, struct in_device,  \
+   __in_dev_get_rcu_safely(dev), field);   \
+   rcu_read_unlock();  \
+   })
+
+#define __ICMP_INC_STATS(net, dev, field)  \
+   ({  \
+   rcu_read_lock();\
+   ___DEVINCATOMIC(net, icmp, struct in_device,\
+   __in_dev_get_rcu_safely(dev), field);   \
+   rcu_read_unlock();  \
+   })
+
+#define 

Re: [PATCH] net: samsung: sxgbe: Replace mdelay with usleep_range in sxgbe_sw_reset

2018-04-11 Thread kbuild test robot
Hi Jia-Ju,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on net-next/master]
[also build test ERROR on v4.16 next-20180411]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Jia-Ju-Bai/net-samsung-sxgbe-Replace-mdelay-with-usleep_range-in-sxgbe_sw_reset/20180411-225900
config: i386-randconfig-s1-201814 (attached as .config)
compiler: gcc-6 (Debian 6.4.0-9) 6.4.0 20171026
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All errors (new ones prefixed by >>):

   drivers/net//ethernet/samsung/sxgbe/sxgbe_main.c: In function 
'sxgbe_sw_reset':
>> drivers/net//ethernet/samsung/sxgbe/sxgbe_main.c:2039:3: error: implicit 
>> declaration of function 'usleep' [-Werror=implicit-function-declaration]
  usleep(1, 11000);
  ^~
   cc1: some warnings being treated as errors

vim +/usleep +2039 drivers/net//ethernet/samsung/sxgbe/sxgbe_main.c

  2029  
  2030  static int sxgbe_sw_reset(void __iomem *addr)
  2031  {
  2032  int retry_count = 10;
  2033  
  2034  writel(SXGBE_DMA_SOFT_RESET, addr + SXGBE_DMA_MODE_REG);
  2035  while (retry_count--) {
  2036  if (!(readl(addr + SXGBE_DMA_MODE_REG) &
  2037SXGBE_DMA_SOFT_RESET))
  2038  break;
> 2039  usleep(1, 11000);
  2040  }
  2041  
  2042  if (retry_count < 0)
  2043  return -EBUSY;
  2044  
  2045  return 0;
  2046  }
  2047  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


[PATCH v2] net: samsung: sxgbe: Replace mdelay with usleep_range in sxgbe_sw_reset

2018-04-11 Thread Jia-Ju Bai
sxgbe_sw_reset() is never called in atomic context.

sxgbe_sw_reset() is only called by sxgbe_drv_probe(), which is 
only called by sxgbe_platform_probe().
sxgbe_platform_probe() is set as ".probe" in struct platform_driver.
This function is not called in atomic context.

Despite never getting called from atomic context, sxgbe_sw_reset()
calls mdelay() to busily wait.
This is not necessary and can be replaced with usleep_range() to
avoid busy waiting.

This is found by a static analysis tool named DCNS written by myself.
And I also manually check it.

Signed-off-by: Jia-Ju Bai 
---
v2:
* Use usleep_range() to correct usleep() in v1.

--- 
 drivers/net/ethernet/samsung/sxgbe/sxgbe_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/samsung/sxgbe/sxgbe_main.c 
b/drivers/net/ethernet/samsung/sxgbe/sxgbe_main.c
index 89831ad..99cd586 100644
--- a/drivers/net/ethernet/samsung/sxgbe/sxgbe_main.c
+++ b/drivers/net/ethernet/samsung/sxgbe/sxgbe_main.c
@@ -2038,7 +2038,7 @@ static int sxgbe_sw_reset(void __iomem *addr)
if (!(readl(addr + SXGBE_DMA_MODE_REG) &
  SXGBE_DMA_SOFT_RESET))
break;
-   mdelay(10);
+   usleep_range(1, 11000);
}
 
if (retry_count < 0)
-- 
1.9.1



[PATCH net v2 4/6] bnxt_en: Support max-mtu with VF-reps

2018-04-11 Thread Michael Chan
From: Sriharsha Basavapatna 

While a VF is configured with a bigger mtu (> 1500), any packets that
are punted to the VF-rep (slow-path) get dropped by OVS kernel-datapath
with the following message: "dropped over-mtu packet". Fix this by
returning the max-mtu value for a VF-rep derived from its corresponding VF.
VF-rep's mtu can be changed using 'ip' command as shown in this example:

$ ip link set bnxt0_pf0vf0 mtu 9000

Signed-off-by: Sriharsha Basavapatna 
Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c | 30 +++
 1 file changed, 30 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c
index 2629040..38f635c 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c
@@ -64,6 +64,31 @@ static int hwrm_cfa_vfr_free(struct bnxt *bp, u16 vf_idx)
return rc;
 }
 
+static int bnxt_hwrm_vfr_qcfg(struct bnxt *bp, struct bnxt_vf_rep *vf_rep,
+ u16 *max_mtu)
+{
+   struct hwrm_func_qcfg_output *resp = bp->hwrm_cmd_resp_addr;
+   struct hwrm_func_qcfg_input req = {0};
+   u16 mtu;
+   int rc;
+
+   bnxt_hwrm_cmd_hdr_init(bp, , HWRM_FUNC_QCFG, -1, -1);
+   req.fid = cpu_to_le16(bp->pf.vf[vf_rep->vf_idx].fw_fid);
+
+   mutex_lock(>hwrm_cmd_lock);
+
+   rc = _hwrm_send_message(bp, , sizeof(req), HWRM_CMD_TIMEOUT);
+   if (!rc) {
+   mtu = le16_to_cpu(resp->max_mtu_configured);
+   if (!mtu)
+   *max_mtu = BNXT_MAX_MTU;
+   else
+   *max_mtu = mtu;
+   }
+   mutex_unlock(>hwrm_cmd_lock);
+   return rc;
+}
+
 static int bnxt_vf_rep_open(struct net_device *dev)
 {
struct bnxt_vf_rep *vf_rep = netdev_priv(dev);
@@ -365,6 +390,7 @@ static void bnxt_vf_rep_netdev_init(struct bnxt *bp, struct 
bnxt_vf_rep *vf_rep,
struct net_device *dev)
 {
struct net_device *pf_dev = bp->dev;
+   u16 max_mtu;
 
dev->netdev_ops = _vf_rep_netdev_ops;
dev->ethtool_ops = _vf_rep_ethtool_ops;
@@ -380,6 +406,10 @@ static void bnxt_vf_rep_netdev_init(struct bnxt *bp, 
struct bnxt_vf_rep *vf_rep,
bnxt_vf_rep_eth_addr_gen(bp->pf.mac_addr, vf_rep->vf_idx,
 dev->perm_addr);
ether_addr_copy(dev->dev_addr, dev->perm_addr);
+   /* Set VF-Rep's max-mtu to the corresponding VF's max-mtu */
+   if (!bnxt_hwrm_vfr_qcfg(bp, vf_rep, _mtu))
+   dev->max_mtu = max_mtu;
+   dev->min_mtu = ETH_ZLEN;
 }
 
 static int bnxt_pcie_dsn_get(struct bnxt *bp, u8 dsn[])
-- 
1.8.3.1



Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module

2018-04-11 Thread Jiri Pirko
Tue, Apr 10, 2018 at 08:59:48PM CEST, sridhar.samudr...@intel.com wrote:
>This provides a generic interface for paravirtual drivers to listen
>for netdev register/unregister/link change events from pci ethernet
>devices with the same MAC and takeover their datapath. The notifier and
>event handling code is based on the existing netvsc implementation.
>
>It exposes 2 sets of interfaces to the paravirtual drivers.
>1. existing netvsc driver that uses 2 netdev model. In this model, no
>master netdev is created. The paravirtual driver registers each bypass
>instance along with a set of ops to manage the slave events.
> bypass_master_register()
> bypass_master_unregister()
>2. new virtio_net based solution that uses 3 netdev model. In this model,
>the bypass module provides interfaces to create/destroy additional master
>netdev and all the slave events are managed internally.
>  bypass_master_create()
>  bypass_master_destroy()
>
>Signed-off-by: Sridhar Samudrala 
>---
> include/linux/netdevice.h |  14 +
> include/net/bypass.h  |  96 ++
> net/Kconfig   |  18 +
> net/core/Makefile |   1 +
> net/core/bypass.c | 844 ++
> 5 files changed, 973 insertions(+)
> create mode 100644 include/net/bypass.h
> create mode 100644 net/core/bypass.c
>
>diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>index cf44503ea81a..587293728f70 100644
>--- a/include/linux/netdevice.h
>+++ b/include/linux/netdevice.h
>@@ -1430,6 +1430,8 @@ enum netdev_priv_flags {
>   IFF_PHONY_HEADROOM  = 1<<24,
>   IFF_MACSEC  = 1<<25,
>   IFF_NO_RX_HANDLER   = 1<<26,
>+  IFF_BYPASS  = 1 << 27,
>+  IFF_BYPASS_SLAVE= 1 << 28,

I wonder, why you don't follow the existing coding style... Also, please
add these to into the comment above.


> };
> 
> #define IFF_802_1Q_VLAN   IFF_802_1Q_VLAN
>@@ -1458,6 +1460,8 @@ enum netdev_priv_flags {
> #define IFF_RXFH_CONFIGURED   IFF_RXFH_CONFIGURED
> #define IFF_MACSECIFF_MACSEC
> #define IFF_NO_RX_HANDLER IFF_NO_RX_HANDLER
>+#define IFF_BYPASSIFF_BYPASS
>+#define IFF_BYPASS_SLAVE  IFF_BYPASS_SLAVE
> 
> /**
>  *struct net_device - The DEVICE structure.
>@@ -4308,6 +4312,16 @@ static inline bool netif_is_rxfh_configured(const 
>struct net_device *dev)
>   return dev->priv_flags & IFF_RXFH_CONFIGURED;
> }
> 
>+static inline bool netif_is_bypass_master(const struct net_device *dev)
>+{
>+  return dev->priv_flags & IFF_BYPASS;
>+}
>+
>+static inline bool netif_is_bypass_slave(const struct net_device *dev)
>+{
>+  return dev->priv_flags & IFF_BYPASS_SLAVE;
>+}
>+
> /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
> static inline void netif_keep_dst(struct net_device *dev)
> {
>diff --git a/include/net/bypass.h b/include/net/bypass.h
>new file mode 100644
>index ..86b02cb894cf
>--- /dev/null
>+++ b/include/net/bypass.h
>@@ -0,0 +1,96 @@
>+// SPDX-License-Identifier: GPL-2.0
>+/* Copyright (c) 2018, Intel Corporation. */
>+
>+#ifndef _NET_BYPASS_H
>+#define _NET_BYPASS_H
>+
>+#include 
>+
>+struct bypass_ops {
>+  int (*slave_pre_register)(struct net_device *slave_netdev,
>+struct net_device *bypass_netdev);
>+  int (*slave_join)(struct net_device *slave_netdev,
>+struct net_device *bypass_netdev);
>+  int (*slave_pre_unregister)(struct net_device *slave_netdev,
>+  struct net_device *bypass_netdev);
>+  int (*slave_release)(struct net_device *slave_netdev,
>+   struct net_device *bypass_netdev);
>+  int (*slave_link_change)(struct net_device *slave_netdev,
>+   struct net_device *bypass_netdev);
>+  rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
>+};
>+
>+struct bypass_master {
>+  struct list_head list;
>+  struct net_device __rcu *bypass_netdev;
>+  struct bypass_ops __rcu *ops;
>+};
>+
>+/* bypass state */
>+struct bypass_info {
>+  /* passthru netdev with same MAC */
>+  struct net_device __rcu *active_netdev;

You still use "active"/"backup" names which is highly misleading as
it has completely different meaning that in bond for example.
I noted that in my previous review already. Please change it.


>+
>+  /* virtio_net netdev */
>+  struct net_device __rcu *backup_netdev;
>+
>+  /* active netdev stats */
>+  struct rtnl_link_stats64 active_stats;
>+
>+  /* backup netdev stats */
>+  struct rtnl_link_stats64 backup_stats;
>+
>+  /* aggregated stats */
>+  struct rtnl_link_stats64 bypass_stats;
>+
>+  /* spinlock while updating stats */
>+  spinlock_t stats_lock;
>+};
>+
>+#if 

[PATCH net v2 2/6] bnxt_en: do not allow wildcard matches for L2 flows

2018-04-11 Thread Michael Chan
From: Andy Gospodarek 

Before this patch the following commands would succeed as far as the
user was concerned:

$ tc qdisc add dev p1p1 ingress
$ tc filter add dev p1p1 parent : protocol all \
flower skip_sw action drop
$ tc filter add dev p1p1 parent : protocol ipv4 \
flower skip_sw src_mac 00:02:00:00:00:01/44 action drop

The current flow offload infrastructure used does not support wildcard
matching for ethernet headers, so do not allow the second or third
commands to succeed.  If a user wants to drop traffic on that interface
the protocol and MAC addresses need to be specified explicitly:

$ tc qdisc add dev p1p1 ingress
$ tc filter add dev p1p1 parent : protocol arp \
flower skip_sw action drop
$ tc filter add dev p1p1 parent : protocol ipv4 \
flower skip_sw action drop
...
$ tc filter add dev p1p1 parent : protocol ipv4 \
flower skip_sw src_mac 00:02:00:00:00:01 action drop
$ tc filter add dev p1p1 parent : protocol ipv4 \
flower skip_sw src_mac 00:02:00:00:00:02 action drop
...

There are also checks for VLAN parameters in this patch as other callers
may wildcard those parameters even if tc does not.  Using different
flow infrastructure could allow this to work in the future for L2 flows,
but for now it does not.

Fixes: 2ae7408fedfe ("bnxt_en: bnxt: add TC flower filter offload support")
Signed-off-by: Andy Gospodarek 
Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c | 59 
 1 file changed, 59 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
index 65c2cee..ac193408 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
@@ -377,6 +377,30 @@ static bool is_wildcard(void *mask, int len)
return true;
 }
 
+static bool is_exactmatch(void *mask, int len)
+{
+   const u8 *p = mask;
+   int i;
+
+   for (i = 0; i < len; i++)
+   if (p[i] != 0xff)
+   return false;
+
+   return true;
+}
+
+static bool bits_set(void *key, int len)
+{
+   const u8 *p = key;
+   int i;
+
+   for (i = 0; i < len; i++)
+   if (p[i] != 0)
+   return true;
+
+   return false;
+}
+
 static int bnxt_hwrm_cfa_flow_alloc(struct bnxt *bp, struct bnxt_tc_flow *flow,
__le16 ref_flow_handle,
__le32 tunnel_handle, __le16 *flow_handle)
@@ -764,6 +788,41 @@ static bool bnxt_tc_can_offload(struct bnxt *bp, struct 
bnxt_tc_flow *flow)
return false;
}
 
+   /* Currently source/dest MAC cannot be partial wildcard  */
+   if (bits_set(>l2_key.smac, sizeof(flow->l2_key.smac)) &&
+   !is_exactmatch(flow->l2_mask.smac, sizeof(flow->l2_mask.smac))) {
+   netdev_info(bp->dev, "Wildcard match unsupported for Source 
MAC\n");
+   return false;
+   }
+   if (bits_set(>l2_key.dmac, sizeof(flow->l2_key.dmac)) &&
+   !is_exactmatch(>l2_mask.dmac, sizeof(flow->l2_mask.dmac))) {
+   netdev_info(bp->dev, "Wildcard match unsupported for Dest 
MAC\n");
+   return false;
+   }
+
+   /* Currently VLAN fields cannot be partial wildcard */
+   if (bits_set(>l2_key.inner_vlan_tci,
+sizeof(flow->l2_key.inner_vlan_tci)) &&
+   !is_exactmatch(>l2_mask.inner_vlan_tci,
+  sizeof(flow->l2_mask.inner_vlan_tci))) {
+   netdev_info(bp->dev, "Wildcard match unsupported for VLAN 
TCI\n");
+   return false;
+   }
+   if (bits_set(>l2_key.inner_vlan_tpid,
+sizeof(flow->l2_key.inner_vlan_tpid)) &&
+   !is_exactmatch(>l2_mask.inner_vlan_tpid,
+  sizeof(flow->l2_mask.inner_vlan_tpid))) {
+   netdev_info(bp->dev, "Wildcard match unsupported for VLAN 
TPID\n");
+   return false;
+   }
+
+   /* Currently Ethertype must be set */
+   if (!is_exactmatch(>l2_mask.ether_type,
+  sizeof(flow->l2_mask.ether_type))) {
+   netdev_info(bp->dev, "Wildcard match unsupported for 
Ethertype\n");
+   return false;
+   }
+
return true;
 }
 
-- 
1.8.3.1



Re: [PATCH] net: samsung: sxgbe: Replace mdelay with usleep_range in sxgbe_sw_reset

2018-04-11 Thread kbuild test robot
Hi Jia-Ju,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on net-next/master]
[also build test ERROR on v4.16 next-20180411]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Jia-Ju-Bai/net-samsung-sxgbe-Replace-mdelay-with-usleep_range-in-sxgbe_sw_reset/20180411-225900
config: i386-randconfig-x019-201814 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All errors (new ones prefixed by >>):

   drivers/net//ethernet/samsung/sxgbe/sxgbe_main.c: In function 
'sxgbe_sw_reset':
>> drivers/net//ethernet/samsung/sxgbe/sxgbe_main.c:2039:3: error: implicit 
>> declaration of function 'usleep'; did you mean 'ssleep'? 
>> [-Werror=implicit-function-declaration]
  usleep(1, 11000);
  ^~
  ssleep
   cc1: some warnings being treated as errors

vim +2039 drivers/net//ethernet/samsung/sxgbe/sxgbe_main.c

  2029  
  2030  static int sxgbe_sw_reset(void __iomem *addr)
  2031  {
  2032  int retry_count = 10;
  2033  
  2034  writel(SXGBE_DMA_SOFT_RESET, addr + SXGBE_DMA_MODE_REG);
  2035  while (retry_count--) {
  2036  if (!(readl(addr + SXGBE_DMA_MODE_REG) &
  2037SXGBE_DMA_SOFT_RESET))
  2038  break;
> 2039  usleep(1, 11000);
  2040  }
  2041  
  2042  if (retry_count < 0)
  2043  return -EBUSY;
  2044  
  2045  return 0;
  2046  }
  2047  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


[PATCH net v2 1/6] bnxt_en: Fix ethtool -x crash when device is down.

2018-04-11 Thread Michael Chan
Fix ethtool .get_rxfh() crash by checking for valid indirection table
address before copying the data.

Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
index 8d8ccd6..1f622ca 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
@@ -870,17 +870,22 @@ static int bnxt_get_rxfh(struct net_device *dev, u32 
*indir, u8 *key,
 u8 *hfunc)
 {
struct bnxt *bp = netdev_priv(dev);
-   struct bnxt_vnic_info *vnic = >vnic_info[0];
+   struct bnxt_vnic_info *vnic;
int i = 0;
 
if (hfunc)
*hfunc = ETH_RSS_HASH_TOP;
 
-   if (indir)
+   if (!bp->vnic_info)
+   return 0;
+
+   vnic = >vnic_info[0];
+   if (indir && vnic->rss_table) {
for (i = 0; i < HW_HASH_INDEX_SIZE; i++)
indir[i] = le16_to_cpu(vnic->rss_table[i]);
+   }
 
-   if (key)
+   if (key && vnic->rss_hash_key)
memcpy(key, vnic->rss_hash_key, HW_HASH_KEY_SIZE);
 
return 0;
-- 
1.8.3.1



[PATCH net v2 6/6] bnxt_en: Fix NULL pointer dereference at bnxt_free_irq().

2018-04-11 Thread Michael Chan
When open fails during ethtool -L ring change, for example, the driver
may crash at bnxt_free_irq() because bp->bnapi is NULL.

If we fail to allocate all the new rings, bnxt_open_nic() will free
all the memory including bp->bnapi.  Subsequent call to bnxt_close_nic()
will try to dereference bp->bnapi in bnxt_free_irq().

Fix it by checking for !bp->bnapi in bnxt_free_irq().

Fixes: e5811b8c09df ("bnxt_en: Add IRQ remapping logic.")
Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 9cb8b4b..f83769d 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -6090,7 +6090,7 @@ static void bnxt_free_irq(struct bnxt *bp)
free_irq_cpu_rmap(bp->dev->rx_cpu_rmap);
bp->dev->rx_cpu_rmap = NULL;
 #endif
-   if (!bp->irq_tbl)
+   if (!bp->irq_tbl || !bp->bnapi)
return;
 
for (i = 0; i < bp->cp_nr_rings; i++) {
-- 
1.8.3.1



[PATCH net v2 0/6] bnxt_en: Fixes for net.

2018-04-11 Thread Michael Chan
This bug fix series include NULL pointer fixes in ethtool -x code path
and in the error clean up path when freeing IRQs, a ring accounting bug
that missed rings used by the RDMA driver, and 3 bug fixes related to TC
Flower and VF-reps.

v2: Fixed commit message of patch 4.  Changed the pound sign to $ sign
in front of the ip command.
 
Andy Gospodarek (1):
  bnxt_en: do not allow wildcard matches for L2 flows

Michael Chan (3):
  bnxt_en: Fix ethtool -x crash when device is down.
  bnxt_en: Need to include RDMA rings in bnxt_check_rings().
  bnxt_en: Fix NULL pointer dereference at bnxt_free_irq().

Sriharsha Basavapatna (2):
  bnxt_en: Ignore src port field in decap filter nodes
  bnxt_en: Support max-mtu with VF-reps

 drivers/net/ethernet/broadcom/bnxt/bnxt.c |  4 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 11 ++--
 drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c  | 63 ++-
 drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c | 30 +++
 4 files changed, 103 insertions(+), 5 deletions(-)

-- 
1.8.3.1



[PATCH net v2 5/6] bnxt_en: Need to include RDMA rings in bnxt_check_rings().

2018-04-11 Thread Michael Chan
With recent changes to reserve both L2 and RDMA rings, we need to include
the RDMA rings in bnxt_check_rings().  Otherwise we will under-estimate
the rings we need during ethtool -L and may lead to failure.

Fixes: fbcfc8e46741 ("bnxt_en: Reserve completion rings and MSIX for bnxt_re 
RDMA driver.")
Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 1991f0c..9cb8b4b 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -7686,6 +7686,8 @@ int bnxt_check_rings(struct bnxt *bp, int tx, int rx, 
bool sh, int tcs,
if (bp->flags & BNXT_FLAG_AGG_RINGS)
rx_rings <<= 1;
cp = sh ? max_t(int, tx_rings_needed, rx) : tx_rings_needed + rx;
+   if (bp->flags & BNXT_FLAG_NEW_RM)
+   cp += bnxt_get_ulp_msix_num(bp);
return bnxt_hwrm_check_rings(bp, tx_rings_needed, rx_rings, rx, cp,
 vnics);
 }
-- 
1.8.3.1



[PATCH net v2 3/6] bnxt_en: Ignore src port field in decap filter nodes

2018-04-11 Thread Michael Chan
From: Sriharsha Basavapatna 

The driver currently uses src port field (along with other fields) in the
decap tunnel key, while looking up and adding tunnel nodes. This leads to
redundant cfa_decap_filter_alloc() requests to the FW and flow-miss in the
flow engine. Fix this by ignoring the src port field in decap tunnel nodes.

Fixes: f484f6782e01 ("bnxt_en: add hwrm FW cmds for cfa_encap_record and 
decap_filter")
Signed-off-by: Sriharsha Basavapatna 
Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
index ac193408..795f450 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
@@ -1051,8 +1051,10 @@ static int bnxt_tc_get_decap_handle(struct bnxt *bp, 
struct bnxt_tc_flow *flow,
 
/* Check if there's another flow using the same tunnel decap.
 * If not, add this tunnel to the table and resolve the other
-* tunnel header fileds
+* tunnel header fileds. Ignore src_port in the tunnel_key,
+* since it is not required for decap filters.
 */
+   decap_key->tp_src = 0;
decap_node = bnxt_tc_get_tunnel_node(bp, _info->decap_table,
 _info->decap_ht_params,
 decap_key);
-- 
1.8.3.1



[RFC bpf-next v2 8/8] bpf: add documentation for eBPF helpers (58-64)

2018-04-11 Thread Quentin Monnet
2018-04-11 12:17 UTC+0200 ~ Jesper Dangaard Brouer 
> On Tue, 10 Apr 2018 15:41:57 +0100
> Quentin Monnet  wrote:
> 
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 7343af4196c8..db090ad03626 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -1250,6 +1250,51 @@ union bpf_attr {
>>   *  Return
>>   *  0 on success, or a negative error in case of failure.
>>   *
>> + * int bpf_redirect_map(struct bpf_map *map, u32 key, u64 flags)
>> + *  Description
>> + *  Redirect the packet to the endpoint referenced by *map* at
>> + *  index *key*. Depending on its type, his *map* can contain
>> + *  references to net devices (for forwarding packets through other
>> + *  ports), or to CPUs (for redirecting XDP frames to another CPU;
>> + *  but this is not fully implemented as of this writing).
> 
> Stating that CPUMAP redirect "is not fully implemented" is confusing.
> The issue is that CPUMAP only works for "native" XDP.
> 
> What about saying:
> 
> "[...] or to CPUs (for redirecting XDP frames to another CPU;
>  but this is only implemented for native XDP as of this writing)"
> 

Fine by me, I will change it. Thank you Jesper for the review!

Quentin




[RFC bpf-next v2 7/8] bpf: add documentation for eBPF helpers (51-57)

2018-04-11 Thread Quentin Monnet
2018-04-10 10:50 UTC-0700 ~ Andrey Ignatov 
> Quentin Monnet  [Tue, 2018-04-10 07:43 -0700]:
>> + * int bpf_bind(struct bpf_sock_addr_kern *ctx, struct sockaddr *addr, int 
>> addr_len)
>> + *  Description
>> + *  Bind the socket associated to *ctx* to the address pointed by
>> + *  *addr*, of length *addr_len*. This allows for making outgoing
>> + *  connection from the desired IP address, which can be useful for
>> + *  example when all processes inside a cgroup should use one
>> + *  single IP address on a host that has multiple IP configured.
>> + *
>> + *  This helper works for IPv4 and IPv6, TCP and UDP sockets. The
>> + *  domain (*addr*\ **->sa_family**) must be **AF_INET** (or
>> + *  **AF_INET6**). Looking for a free port to bind to can be
>> + *  expensive, therefore binding to port is not permitted by the
>> + *  helper: *addr*\ **->sin_port** (or **sin6_port**, respectively)
>> + *  must be set to zero.
>> + *
>> + *  As for the remote end, both parts of it can be overridden,
>> + *  remote IP and remote port. This can be useful if an application
>> + *  inside a cgroup wants to connect to another application inside
>> + *  the same cgroup or to itself, but knows nothing about the IP
>> + *  address assigned to the cgroup.
> 
> The last paragraph ("As for the remote end ...") is not relevant to
> bpf_bind() and should be removed. It's about sys_connect hook itself
> that can call to bpf_bind() but also has other functionality (and that
> other functionality is described by this paragraph).

Thanks Andrey, I will remove this paragraph.

Quentin





[RFC bpf-next v2 7/8] bpf: add documentation for eBPF helpers (51-57)

2018-04-11 Thread Quentin Monnet
2018-04-10 09:58 UTC-0700 ~ Yonghong Song 
> On 4/10/18 7:41 AM, Quentin Monnet wrote:
>> Add documentation for eBPF helper functions to bpf.h user header file.
>> This documentation can be parsed with the Python script provided in
>> another commit of the patch series, in order to provide a RST document
>> that can later be converted into a man page.
>>
>> The objective is to make the documentation easily understandable and
>> accessible to all eBPF developers, including beginners.
>>
>> This patch contains descriptions for the following helper functions:
>>
>> Helpers from Lawrence:
>> - bpf_setsockopt()
>> - bpf_getsockopt()
>> - bpf_sock_ops_cb_flags_set()
>>
>> Helpers from Yonghong:
>> - bpf_perf_event_read_value()
>> - bpf_perf_prog_read_value()
>>
>> Helper from Josef:
>> - bpf_override_return()
>>
>> Helper from Andrey:
>> - bpf_bind()
>>
>> Cc: Lawrence Brakmo 
>> Cc: Yonghong Song 
>> Cc: Josef Bacik 
>> Cc: Andrey Ignatov 
>> Signed-off-by: Quentin Monnet 
>> ---
>>   include/uapi/linux/bpf.h | 184
>> +++
>>   1 file changed, 184 insertions(+)
>>
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 15d9ccafebbe..7343af4196c8 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h

[...]

>> @@ -1255,6 +1277,168 @@ union bpf_attr {
>>    * performed again.
>>    * Return
>>    * 0 on success, or a negative error in case of failure.
>> + *
>> + * int bpf_perf_event_read_value(struct bpf_map *map, u64 flags,
>> struct bpf_perf_event_value *buf, u32 buf_size)
>> + * Description
>> + * Read the value of a perf event counter, and store it into
>> *buf*
>> + * of size *buf_size*. This helper relies on a *map* of type
>> + * **BPF_MAP_TYPE_PERF_EVENT_ARRAY**. The nature of the perf
>> + * event counter is selected at the creation of the *map*. The
> 
> The nature of the perf event counter is selected when *map* is updated
> with perf_event fd's.
> 

Thanks, I will fix it.

>> + * *map* is an array whose size is the number of available CPU
>> + * cores, and each cell contains a value relative to one
>> core. The
> 
> It is confusing to mix core/cpu here. Maybe just use perf_event
> convention, always using cpu?
> 

Right, I'll remove occurrences of "core".

>> + * value to retrieve is indicated by *flags*, that contains the
>> + * index of the core to look up, masked with
>> + * **BPF_F_INDEX_MASK**. Alternatively, *flags* can be set to
>> + * **BPF_F_CURRENT_CPU** to indicate that the value for the
>> + * current CPU core should be retrieved.
>> + *
>> + * This helper behaves in a way close to
>> + * **bpf_perf_event_read**\ () helper, save that instead of
>> + * just returning the value observed, it fills the *buf*
>> + * structure. This allows for additional data to be
>> retrieved: in
>> + * particular, the enabled and running times (in *buf*\
>> + * **->enabled** and *buf*\ **->running**, respectively) are
>> + * copied.
>> + *
>> + * These values are interesting, because hardware PMU
>> (Performance
>> + * Monitoring Unit) counters are limited resources. When
>> there are
>> + * more PMU based perf events opened than available counters,
>> + * kernel will multiplex these events so each event gets certain
>> + * percentage (but not all) of the PMU time. In case that
>> + * multiplexing happens, the number of samples or counter value
>> + * will not reflect the case compared to when no multiplexing
>> + * occurs. This makes comparison between different runs
>> difficult.
>> + * Typically, the counter value should be normalized before
>> + * comparing to other experiments. The usual normalization is
>> done
>> + * as follows.
>> + *
>> + * ::
>> + *
>> + * normalized_counter = counter * t_enabled / t_running
>> + *
>> + * Where t_enabled is the time enabled for event and
>> t_running is
>> + * the time running for event since last normalization. The
>> + * enabled and running times are accumulated since the perf
>> event
>> + * open. To achieve scaling factor between two invocations of an
>> + * eBPF program, users can can use CPU id as the key (which is
>> + * typical for perf array usage model) to remember the previous
>> + * value and do the calculation inside the eBPF program.
>> + * Return
>> + * 0 on success, or a negative error in case of failure.
>> + *

[...]

Thanks Yonghong for the review!

I have a favor to ask of you. I got a bounce for Kaixu Xia's email
address, and I don't know what alternative email address I could use. I
CC-ed to have a review for helper 

[RFC bpf-next v2 3/8] bpf: add documentation for eBPF helpers (12-22)

2018-04-11 Thread Quentin Monnet
2018-04-10 15:43 UTC-0700 ~ Alexei Starovoitov

> On Tue, Apr 10, 2018 at 03:41:52PM +0100, Quentin Monnet wrote:
>> Add documentation for eBPF helper functions to bpf.h user header file.
>> This documentation can be parsed with the Python script provided in
>> another commit of the patch series, in order to provide a RST document
>> that can later be converted into a man page.
>>
>> The objective is to make the documentation easily understandable and
>> accessible to all eBPF developers, including beginners.
>>
>> This patch contains descriptions for the following helper functions, all
>> writter by Alexei:
>>
>> - bpf_get_current_pid_tgid()
>> - bpf_get_current_uid_gid()
>> - bpf_get_current_comm()
>> - bpf_skb_vlan_push()
>> - bpf_skb_vlan_pop()
>> - bpf_skb_get_tunnel_key()
>> - bpf_skb_set_tunnel_key()
>> - bpf_redirect()
>> - bpf_perf_event_output()
>> - bpf_get_stackid()
>> - bpf_get_current_task()
>>
>> Cc: Alexei Starovoitov 
>> Signed-off-by: Quentin Monnet 
>> ---
>>  include/uapi/linux/bpf.h | 237 
>> +++
>>  1 file changed, 237 insertions(+)
>>
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 2bc653a3a20f..f3ea8824efbc 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -580,6 +580,243 @@ union bpf_attr {
>>   *  performed again.
>>   *  Return
>>   *  0 on success, or a negative error in case of failure.
>> + *
>> + * u64 bpf_get_current_pid_tgid(void)
>> + *  Return
>> + *  A 64-bit integer containing the current tgid and pid, and
>> + *  created as such:
>> + *  *current_task*\ **->tgid << 32 \|**
>> + *  *current_task*\ **->pid**.
>> + *
>> + * u64 bpf_get_current_uid_gid(void)
>> + *  Return
>> + *  A 64-bit integer containing the current GID and UID, and
>> + *  created as such: *current_gid* **<< 32 \|** *current_uid*.
>> + *
>> + * int bpf_get_current_comm(char *buf, u32 size_of_buf)
>> + *  Description
>> + *  Copy the **comm** attribute of the current task into *buf* of
>> + *  *size_of_buf*. The **comm** attribute contains the name of
>> + *  the executable (excluding the path) for the current task. The
>> + *  *size_of_buf* must be strictly positive. On success, the
> 
> that reminds me that we probably should relax it to ARG_CONST_SIZE_OR_ZERO.
> The programs won't be passing an actual zero into it, but it helps
> a lot to tell verifier that zero is also valid, since programs
> become much simpler.
> 

Ok. No change to helper description for now, we will update here when
your patch lands.

>> + *  helper makes sure that the *buf* is NUL-terminated. On failure,
>> + *  it is filled with zeroes.
>> + *  Return
>> + *  0 on success, or a negative error in case of failure.
>> + *
>> + * int bpf_skb_vlan_push(struct sk_buff *skb, __be16 vlan_proto, u16 
>> vlan_tci)
>> + *  Description
>> + *  Push a *vlan_tci* (VLAN tag control information) of protocol
>> + *  *vlan_proto* to the packet associated to *skb*, then update
>> + *  the checksum. Note that if *vlan_proto* is different from
>> + *  **ETH_P_8021Q** and **ETH_P_8021AD**, it is considered to
>> + *  be **ETH_P_8021Q**.
>> + *
>> + *  A call to this helper is susceptible to change data from the
>> + *  packet. Therefore, at load time, all checks on pointers
>> + *  previously done by the verifier are invalidated and must be
>> + *  performed again.
>> + *  Return
>> + *  0 on success, or a negative error in case of failure.
>> + *
>> + * int bpf_skb_vlan_pop(struct sk_buff *skb)
>> + *  Description
>> + *  Pop a VLAN header from the packet associated to *skb*.
>> + *
>> + *  A call to this helper is susceptible to change data from the
>> + *  packet. Therefore, at load time, all checks on pointers
>> + *  previously done by the verifier are invalidated and must be
>> + *  performed again.
>> + *  Return
>> + *  0 on success, or a negative error in case of failure.
>> + *
>> + * int bpf_skb_get_tunnel_key(struct sk_buff *skb, struct bpf_tunnel_key 
>> *key, u32 size, u64 flags)
>> + *  Description
>> + *  Get tunnel metadata. This helper takes a pointer *key* to an
>> + *  empty **struct bpf_tunnel_key** of **size**, that will be
>> + *  filled with tunnel metadata for the packet associated to *skb*.
>> + *  The *flags* can be set to **BPF_F_TUNINFO_IPV6**, which
>> + *  indicates that the tunnel is based on IPv6 protocol instead of
>> + *  IPv4.
>> + *
>> + *  This is typically used on the receive path to perform a lookup
>> + *  or a packet redirection based on the value of *key*:
> 
> above is correct, but feels a bit cryptic.
> May be give more 

[RFC bpf-next v2 2/8] bpf: add documentation for eBPF helpers (01-11)

2018-04-11 Thread Quentin Monnet
2018-04-10 10:56 UTC-0700 ~ Alexei Starovoitov

> On Tue, Apr 10, 2018 at 03:41:51PM +0100, Quentin Monnet wrote:
>> Add documentation for eBPF helper functions to bpf.h user header file.
>> This documentation can be parsed with the Python script provided in
>> another commit of the patch series, in order to provide a RST document
>> that can later be converted into a man page.
>>
>> The objective is to make the documentation easily understandable and
>> accessible to all eBPF developers, including beginners.
>>
>> This patch contains descriptions for the following helper functions, all
>> written by Alexei:
>>
>> - bpf_map_lookup_elem()
>> - bpf_map_update_elem()
>> - bpf_map_delete_elem()
>> - bpf_probe_read()
>> - bpf_ktime_get_ns()
>> - bpf_trace_printk()
>> - bpf_skb_store_bytes()
>> - bpf_l3_csum_replace()
>> - bpf_l4_csum_replace()
>> - bpf_tail_call()
>> - bpf_clone_redirect()
>>
>> Cc: Alexei Starovoitov 
>> Signed-off-by: Quentin Monnet 
>> ---
>>  include/uapi/linux/bpf.h | 199 
>> +++
>>  1 file changed, 199 insertions(+)
>>
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 45f77f01e672..2bc653a3a20f 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -381,6 +381,205 @@ union bpf_attr {
>>   * intentional, removing them would break paragraphs for rst2man.
>>   *
>>   * Start of BPF helper function descriptions:
>> + *
>> + * void *bpf_map_lookup_elem(struct bpf_map *map, void *key)
>> + *  Description
>> + *  Perform a lookup in *map* for an entry associated to *key*.
>> + *  Return
>> + *  Map value associated to *key*, or **NULL** if no entry was
>> + *  found.
>> + *
>> + * int bpf_map_update_elem(struct bpf_map *map, void *key, void *value, u64 
>> flags)
>> + *  Description
>> + *  Add or update the value of the entry associated to *key* in
>> + *  *map* with *value*. *flags* is one of:
>> + *
>> + *  **BPF_NOEXIST**
>> + *  The entry for *key* must not exist in the map.
>> + *  **BPF_EXIST**
>> + *  The entry for *key* must already exist in the map.
>> + *  **BPF_ANY**
>> + *  No condition on the existence of the entry for *key*.
>> + *
>> + *  These flags are only useful for maps of type
>> + *  **BPF_MAP_TYPE_HASH**. For all other map types, **BPF_ANY**
>> + *  should be used.
> 
> I think that's not entirely accurate.
> The flags work as expected for all other map types as well
> and for lru map, sockmap, map in map the flags have practical use cases.
> 

Ok, I missed that. I have to go back and check how the flags are used
for those maps. I will cook up something cleaner for the next version of
the set.

>> + *  Return
>> + *  0 on success, or a negative error in case of failure.
>> + *

[...]

>> + *
>> + * int bpf_trace_printk(const char *fmt, u32 fmt_size, ...)
>> + *  Description
>> + *  This helper is a "printk()-like" facility for debugging. It
>> + *  prints a message defined by format *fmt* (of size *fmt_size*)
>> + *  to file *\/sys/kernel/debug/tracing/trace* from DebugFS, if
>> + *  available. It can take up to three additional **u64**
>> + *  arguments (as an eBPF helpers, the total number of arguments is
>> + *  limited to five). Each time the helper is called, it appends a
>> + *  line that looks like the following:
>> + *
>> + *  ::
>> + *
>> + *  telnet-470   [001] .N.. 419421.045894: 0x0001: BPF 
>> command: 2
>> + *
>> + *  In the above:
>> + *
>> + *  * ``telnet`` is the name of the current task.
>> + *  * ``470`` is the PID of the current task.
>> + *  * ``001`` is the CPU number on which the task is
>> + *running.
>> + *  * In ``.N..``, each character refers to a set of
>> + *options (whether irqs are enabled, scheduling
>> + *options, whether hard/softirqs are running, level of
>> + *preempt_disabled respectively). **N** means that
>> + ***TIF_NEED_RESCHED** and **PREEMPT_NEED_RESCHED**
>> + *are set.
>> + *  * ``419421.045894`` is a timestamp.
>> + *  * ``0x0001`` is a fake value used by BPF for the
>> + *instruction pointer register.
>> + *  * ``BPF command: 2`` is the message formatted with
>> + **fmt*.
> 
> the above depends on how trace_pipe was configured. It's a default
> configuration for many, but would be good to explain this a bit better.
> 

I did not know about that. Would you have a pointer about how to
configure trace_pipe, please?

>> + *
>> + *  The conversion 

[RFC bpf-next v2 1/8] bpf: add script and prepare bpf.h for new helpers documentation

2018-04-11 Thread Quentin Monnet
2018-04-10 11:16 UTC-0700 ~ Alexei Starovoitov

> On Tue, Apr 10, 2018 at 03:41:50PM +0100, Quentin Monnet wrote:
>> Remove previous "overview" of eBPF helpers from user bpf.h header.
>> Replace it by a comment explaining how to process the new documentation
>> (to come in following patches) with a Python script to produce RST, then
>> man page documentation.
>>
>> Also add the aforementioned Python script under scripts/. It is used to
>> process include/uapi/linux/bpf.h and to extract helper descriptions, to
>> turn it into a RST document that can further be processed with rst2man
>> to produce a man page. The script takes one "--filename "
>> option. If the script is launched from scripts/ in the kernel root
>> directory, it should be able to find the location of the header to
>> parse, and "--filename " is then optional. If it cannot
>> find the file, then the option becomes mandatory. RST-formatted
>> documentation is printed to standard output.
>>
>> Typical workflow for producing the final man page would be:
>>
>> $ ./scripts/bpf_helpers_doc.py \
>> --filename include/uapi/linux/bpf.h > /tmp/bpf-helpers.rst
>> $ rst2man /tmp/bpf-helpers.rst > /tmp/bpf-helpers.7
>> $ man /tmp/bpf-helpers.7
>>
>> Note that the tool kernel-doc cannot be used to document eBPF helpers,
>> whose signatures are not available directly in the header files
>> (pre-processor directives are used to produce them at the beginning of
>> the compilation process).
>>
>> Signed-off-by: Quentin Monnet 
>> ---
>>  include/uapi/linux/bpf.h   | 406 
>> ++--
>>  scripts/bpf_helpers_doc.py | 414 
>> +
>>  2 files changed, 430 insertions(+), 390 deletions(-)
>>  create mode 100755 scripts/bpf_helpers_doc.py
>>

[...]

>> diff --git a/scripts/bpf_helpers_doc.py b/scripts/bpf_helpers_doc.py
>> new file mode 100755
>> index ..3a15ba3f0a83
>> --- /dev/null
>> +++ b/scripts/bpf_helpers_doc.py
>> @@ -0,0 +1,414 @@
>> +#!/usr/bin/python3
>> +#
>> +# Copyright (C) 2018 Netronome Systems, Inc.
>> +#
>> +# This software is licensed under the GNU General License Version 2,
>> +# June 1991 as shown in the file COPYING in the top-level directory of this
>> +# source tree.
> 
> please use SPDX instead.
> 

Same as for bpftool, our layers remain a bit cautious about it. I'd be
happy to change it for SPDX as a follow-up when I get the green light.

>> +#
>> +# THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS"
>> +# WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING,
>> +# BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
>> +# FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND 
>> PERFORMANCE
>> +# OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME
>> +# THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
>> +

[...]

>> +class PrinterRST(Printer):
>> +"""
>> +A printer for dumping collected information about helpers as a 
>> ReStructured
>> +Text page compatible with the rst2man program, which can be used to
>> +generate a manual page for the helpers.
>> +@helpers: array of Helper objects to print to standard output
>> +"""
>> +def print_header(self):
>> +header = '''\
>> +.. Copyright (C) 2018 Netronome Systems, Inc.
> 
> I think would be good to capture copyrights of all authors that added
> the helpers being documented. Since a lot of text was copied from commit
> logs it's only fair to preserve the copyrights.
> Such man page file is automatically generated by the python script
> and script itself is copyrighted by Netronome. That's fine, but the text
> of man page is not netronome only.
> I'm not sure what would be the solution. May be something like:
> "
> Copyright (C) All BPF authors and contributors from 2011 to present
> See git log include/uapi/linux/bpf.h for details
> "
> ?

Seems fair indeed. I do not have a better suggestion myself, so I will
stick to yours.

Out of curiosity, why 2011 for the year? I thought you introduced eBPF
in the kernel in 2014 (bd4cf0ed331a), and I do not believe helpers have
any link with cBPF?

>> +.. 
>> +.. %%%LICENSE_START(VERBATIM)
>> +.. Permission is granted to make and distribute verbatim copies of this
>> +.. manual provided the copyright notice and this permission notice are
>> +.. preserved on all copies.
>> +.. 
>> +.. Permission is granted to copy and distribute modified versions of this
>> +.. manual under the conditions for verbatim copying, provided that the
>> +.. entire resulting derived work is distributed under the terms of a
>> +.. permission notice identical to this one.
>> +.. 
>> +.. Since the Linux kernel and libraries are constantly changing, this
>> +.. manual page may be incorrect or out-of-date.  The author(s) assume no
>> +.. responsibility for errors or omissions, or 

[PATCH v2] dec: tulip: de4x5: Replace mdelay with usleep_range in de4x5_hw_init

2018-04-11 Thread Jia-Ju Bai
de4x5_hw_init() is never called in atomic context.

de4x5_hw_init() is only called by de4x5_pci_probe(), which is only 
set as ".probe" in struct pci_driver.

Despite never getting called from atomic context, de4x5_hw_init() 
calls mdelay() to busily wait.
This is not necessary and can be replaced with usleep_range() to 
avoid busy waiting.

This is found by a static analysis tool named DCNS written by myself.
And I also manually check it.

Signed-off-by: Jia-Ju Bai 
---
v2:
* Use usleep_range() to correct usleep() in v1.

---
 drivers/net/ethernet/dec/tulip/de4x5.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/dec/tulip/de4x5.c 
b/drivers/net/ethernet/dec/tulip/de4x5.c
index 0affee9..3fb0119 100644
--- a/drivers/net/ethernet/dec/tulip/de4x5.c
+++ b/drivers/net/ethernet/dec/tulip/de4x5.c
@@ -1107,7 +1107,7 @@ static int (*dc_infoblock[])(struct net_device *dev, 
u_char, u_char *) = {
pdev = to_pci_dev (gendev);
pci_write_config_byte(pdev, PCI_CFDA_PSM, WAKEUP);
 }
-mdelay(10);
+usleep_range(1, 11000);
 
 RESET_DE4X5;
 
-- 
1.9.1



[PATCH v2 net 1/2] ibmvnic: Handle all login error conditions

2018-04-11 Thread Nathan Fontenot
There is a bug in handling the possible return codes from sending the
login CRQ. The current code treats any non-success return value,
minus failure to send the crq and a timeout waiting for a login response,
as a need to re-send the login CRQ. This can put the drive in an
infinite loop of trying to login when getting return values other
that a partial success such as a return code of aborted. For these
scenarios the login will not ever succeed at this point and the
driver would need to be reset again.

To resolve this loop trying to login is updated to only retry the
login if the driver gets a return code of a partial success. Other
return codes are treated as an error and the driver returns an error
from ibmvnic_login().

To avoid infinite looping in the partial success return cases, the
number of retries is capped at the maximum number of supported
queues. This value was chosen because the driver does a renegotiation
of capabilities which sets the number of queues possible and allows
the driver to attempt a login for possible value for the number
of queues supported.

Signed-off-by: Nathan Fontenot 
---
 drivers/net/ethernet/ibm/ibmvnic.c |   55 +++-
 drivers/net/ethernet/ibm/ibmvnic.h |1 -
 2 files changed, 35 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index aad5658d79d5..c6d5e9448eef 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -794,46 +794,61 @@ static int ibmvnic_login(struct net_device *netdev)
 {
struct ibmvnic_adapter *adapter = netdev_priv(netdev);
unsigned long timeout = msecs_to_jiffies(3);
-   struct device *dev = >vdev->dev;
+   int retry_count = 0;
int rc;
 
do {
-   if (adapter->renegotiate) {
-   adapter->renegotiate = false;
+   if (retry_count > IBMVNIC_MAX_QUEUES) {
+   netdev_warn(netdev, "Login attempts exceeded\n");
+   return -1;
+   }
+
+   adapter->init_done_rc = 0;
+   reinit_completion(>init_done);
+   rc = send_login(adapter);
+   if (rc) {
+   netdev_warn(netdev, "Unable to login\n");
+   return rc;
+   }
+
+   if (!wait_for_completion_timeout(>init_done,
+timeout)) {
+   netdev_warn(netdev, "Login timed out\n");
+   return -1;
+   }
+
+   if (adapter->init_done_rc == PARTIALSUCCESS) {
+   retry_count++;
release_sub_crqs(adapter, 1);
 
+   adapter->init_done_rc = 0;
reinit_completion(>init_done);
send_cap_queries(adapter);
if (!wait_for_completion_timeout(>init_done,
 timeout)) {
-   dev_err(dev, "Capabilities query timeout\n");
+   netdev_warn(netdev,
+   "Capabilities query timed out\n");
return -1;
}
+
rc = init_sub_crqs(adapter);
if (rc) {
-   dev_err(dev,
-   "Initialization of SCRQ's failed\n");
+   netdev_warn(netdev,
+   "SCRQ initialization failed\n");
return -1;
}
+
rc = init_sub_crq_irqs(adapter);
if (rc) {
-   dev_err(dev,
-   "Initialization of SCRQ's irqs 
failed\n");
+   netdev_warn(netdev,
+   "SCRQ irq initialization failed\n");
return -1;
}
-   }
-
-   reinit_completion(>init_done);
-   rc = send_login(adapter);
-   if (rc) {
-   dev_err(dev, "Unable to attempt device login\n");
-   return rc;
-   } else if (!wait_for_completion_timeout(>init_done,
-timeout)) {
-   dev_err(dev, "Login timeout\n");
+   } else if (adapter->init_done_rc) {
+   netdev_warn(netdev, "Adapter login failed\n");
return -1;
}
-   } while (adapter->renegotiate);
+   } while (adapter->init_done_rc == PARTIALSUCCESS);
 
/* handle pending MAC address changes after successful login */
if 

[PATCH v2 net 0/2] ibmvnic: Fix parameter change request handling

2018-04-11 Thread Nathan Fontenot
When updating parameters for the ibmvnic driver there is a possibility
of entering an infinite loop if a return value other that a partial
success is received from sending the login CRQ.

Also, a deadlock can occur on the rtnl lock if netdev_notify_peers()
is called during driver reset for a parameter change reset.

This patch set corrects both of these issues by updating the return
code handling in ibmvnic_login() nand gaurding against calling
netdev_notify_peers() for parameter change requests.

-Nathan

Updates for V2: Correct spelling mistakes in commit messages.
---

Nathan Fontenot (2):
  ibmvnic: Handle all login error conditions
  ibmvnic: Do not notify peers on parameter change resets


 drivers/net/ethernet/ibm/ibmvnic.c |   58 +++-
 drivers/net/ethernet/ibm/ibmvnic.h |1 -
 2 files changed, 37 insertions(+), 22 deletions(-)



[PATCH v2 net 2/2] ibmvnic: Do not notify peers on parameter change resets

2018-04-11 Thread Nathan Fontenot
When attempting to change the driver parameters, such as the MTU
value or number of queues, do not call netdev_notify_peers().
Doing so will deadlock on the rtnl_lock.

Signed-off-by: Nathan Fontenot 
---
 drivers/net/ethernet/ibm/ibmvnic.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index c6d5e9448eef..fdca1ea5bbf9 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -1843,7 +1843,8 @@ static int do_reset(struct ibmvnic_adapter *adapter,
for (i = 0; i < adapter->req_rx_queues; i++)
napi_schedule(>napi[i]);
 
-   if (adapter->reset_reason != VNIC_RESET_FAILOVER)
+   if (adapter->reset_reason != VNIC_RESET_FAILOVER &&
+   adapter->reset_reason != VNIC_RESET_CHANGE_PARAM)
netdev_notify_peers(netdev);
 
netif_carrier_on(netdev);



Re: [PATCH net 1/2] ibmvnic: Handle all login error conditions

2018-04-11 Thread Nathan Fontenot
On 04/11/2018 10:07 AM, David Miller wrote:
> From: Nathan Fontenot 
> Date: Wed, 11 Apr 2018 09:37:21 -0500
> 
>> There is a bug in handling the possible return codes from sending the
>> login CRQ. The current code treats any non-success return value,
>> minus failure to send the crq and a timeout waiting for a login response,
>> as a need to re-send the login CRQ. This can put the drive in an
>^
> 
> "driver"
> 
>> inifinite loop of trying to login when getting return values other
>   ^
> 
> "infinite"
> 
>> that a partial success such as a return code of aborted. For these
>> scenarios the login will not ever succeed at this point and the
>> driver would need to be reset again.
>>
>> To resolve this loop trying to login is updated to only retry the
>> login if the driver gets a return code of a partial success. Other
>> return coes are treated as an error and the driver returns an error
>  
> 
> "codes"
> 
>> from ibmvnic_login().
>>
>> To avoid infinite looping in the partial success return cases, the
>> number of retries is capped at the maximum number of supported
>> queues. This value was chosen because the driver does a renegatiation
>   ^
> 
> "renegotiation"
> 
>> of capabilities which sets the number of queus possible and allows
>^
> 
> "queues"
> 

Oh man, complete spelling failure. resending.

-Nathan



Re: WARNING: kobject bug in br_add_if

2018-04-11 Thread Dmitry Vyukov
On Wed, Apr 11, 2018 at 5:15 PM, syzbot
 wrote:
> Hello,
>
> syzbot hit the following crash on upstream commit
> 10b84daddbec72c6b440216a69de9a9605127f7a (Sat Mar 31 17:59:00 2018 +)
> Merge branch 'perf-urgent-for-linus' of
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
> syzbot dashboard link:
> https://syzkaller.appspot.com/bug?extid=de73361ee4971b6e6f75
>
> So far this crash happened 4 times on net-next, upstream.
> Unfortunately, I don't have any reproducer for this crash yet.
> Raw console output:
> https://syzkaller.appspot.com/x/log.txt?id=5007286875455488
> Kernel config:
> https://syzkaller.appspot.com/x/.config?id=-2760467897697295172
> compiler: gcc (GCC) 7.1.1 20170620
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+de73361ee4971b6e6...@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed. See footer for
> details.
> If you forward the report, please keep this part and the footer.

+Greg

The plan is to remove this WARNING from kobject_add, if there are no objections.

> [ cut here ]
> binder: 23650:23651 unknown command 1078223622
> kobject_add_internal failed for brport (error: -12 parent: bond0)
> binder: 23650:23651 ioctl c0306201 2000dfd0 returned -22
> WARNING: CPU: 1 PID: 23647 at lib/kobject.c:242
> kobject_add_internal+0x3f6/0xbc0 lib/kobject.c:240
> Kernel panic - not syncing: panic_on_warn set ...
>
> CPU: 1 PID: 23647 Comm: syz-executor7 Not tainted 4.16.0-rc7+ #374
> binder: BINDER_SET_CONTEXT_MGR already set
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:17 [inline]
>  dump_stack+0x194/0x24d lib/dump_stack.c:53
>  panic+0x1e4/0x41c kernel/panic.c:183
>  __warn+0x1dc/0x200 kernel/panic.c:547
>  report_bug+0x1f4/0x2b0 lib/bug.c:186
>  fixup_bug.part.10+0x37/0x80 arch/x86/kernel/traps.c:178
>  fixup_bug arch/x86/kernel/traps.c:247 [inline]
>  do_error_trap+0x2d7/0x3e0 arch/x86/kernel/traps.c:296
>  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
>  invalid_op+0x1b/0x40 arch/x86/entry/entry_64.S:986
> RIP: 0010:kobject_add_internal+0x3f6/0xbc0 lib/kobject.c:240
> RSP: 0018:8801d089f560 EFLAGS: 00010286
> RAX: dc08 RBX: 8801adbee178 RCX: 815b193e
> RDX: 0004 RSI: c900022aa000 RDI: 11003a113e31
> RBP: 8801d089f658 R08: 11003a113df3 R09: 
> R10:  R11:  R12: 11003a113eb2
> R13: fff4 R14: 8801abd88828 R15: 8801d75a1e00
>  kobject_add_varg lib/kobject.c:364 [inline]
>  kobject_init_and_add+0xf9/0x150 lib/kobject.c:436
>  br_add_if+0x79a/0x1a70 net/bridge/br_if.c:533
>  add_del_if+0xf4/0x140 net/bridge/br_ioctl.c:101
>  br_dev_ioctl+0xa2/0xc0 net/bridge/br_ioctl.c:396
>  dev_ifsioc+0x333/0x9b0 net/core/dev_ioctl.c:334
>  dev_ioctl+0x176/0xbe0 net/core/dev_ioctl.c:500
>  sock_do_ioctl+0x1ba/0x390 net/socket.c:981
>  sock_ioctl+0x367/0x670 net/socket.c:1081
>  vfs_ioctl fs/ioctl.c:46 [inline]
>  do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:686
>  SYSC_ioctl fs/ioctl.c:701 [inline]
>  SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692
>  do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
>  entry_SYSCALL_64_after_hwframe+0x42/0xb7
> RIP: 0033:0x454e79
> RSP: 002b:7eff7dab7c68 EFLAGS: 0246 ORIG_RAX: 0010
> RAX: ffda RBX: 7eff7dab86d4 RCX: 00454e79
> RDX: 2000 RSI: 89a2 RDI: 0014
> RBP: 0072bea0 R08:  R09: 
> R10:  R11: 0246 R12: 0015
> R13: 0369 R14: 006f7278 R15: 0006
> Dumping ftrace buffer:
>(ftrace buffer empty)
> Kernel Offset: disabled
> Rebooting in 86400 seconds..
>
>
> ---
> This bug is generated by a dumb bot. It may contain errors.
> See https://goo.gl/tpsmEJ for details.
> Direct all questions to syzkal...@googlegroups.com.
>
> syzbot will keep track of this bug report.
> If you forgot to add the Reported-by tag, once the fix for this bug is
> merged
> into any tree, please reply to this email with:
> #syz fix: exact-commit-title
> To mark this as a duplicate of another syzbot report, please reply with:
> #syz dup: exact-subject-of-another-report
> If it's a one-off invalid bug report, please reply with:
> #syz invalid
> Note: if the crash happens again, it will cause creation of a new bug
> report.
> Note: all commands must start from beginning of the line in the email body.
>
> --
> You received this message because you are subscribed to the Google Groups
> "syzkaller-bugs" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to syzkaller-bugs+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> 

  1   2   >