[PATCH net v2] net/sched: act_simple: fix parsing of TCA_DEF_DATA

2018-06-07 Thread Davide Caratti
use nla_strlcpy() to avoid copying data beyond the length of TCA_DEF_DATA
netlink attribute, in case it is less than SIMP_MAX_DATA and it does not
end with '\0' character.

v2: fix errors in the commit message, thanks Hangbin Liu

Fixes: fa1b1cff3d06 ("net_cls_act: Make act_simple use of netlink policy.")
Signed-off-by: Davide Caratti 
---
 net/sched/act_simple.c | 15 ++-
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/net/sched/act_simple.c b/net/sched/act_simple.c
index 9618b4a83cee..98c4afe7c15b 100644
--- a/net/sched/act_simple.c
+++ b/net/sched/act_simple.c
@@ -53,22 +53,22 @@ static void tcf_simp_release(struct tc_action *a)
kfree(d->tcfd_defdata);
 }
 
-static int alloc_defdata(struct tcf_defact *d, char *defdata)
+static int alloc_defdata(struct tcf_defact *d, const struct nlattr *defdata)
 {
d->tcfd_defdata = kzalloc(SIMP_MAX_DATA, GFP_KERNEL);
if (unlikely(!d->tcfd_defdata))
return -ENOMEM;
-   strlcpy(d->tcfd_defdata, defdata, SIMP_MAX_DATA);
+   nla_strlcpy(d->tcfd_defdata, defdata, SIMP_MAX_DATA);
return 0;
 }
 
-static void reset_policy(struct tcf_defact *d, char *defdata,
+static void reset_policy(struct tcf_defact *d, const struct nlattr *defdata,
 struct tc_defact *p)
 {
spin_lock_bh(>tcf_lock);
d->tcf_action = p->action;
memset(d->tcfd_defdata, 0, SIMP_MAX_DATA);
-   strlcpy(d->tcfd_defdata, defdata, SIMP_MAX_DATA);
+   nla_strlcpy(d->tcfd_defdata, defdata, SIMP_MAX_DATA);
spin_unlock_bh(>tcf_lock);
 }
 
@@ -87,7 +87,6 @@ static int tcf_simp_init(struct net *net, struct nlattr *nla,
struct tcf_defact *d;
bool exists = false;
int ret = 0, err;
-   char *defdata;
 
if (nla == NULL)
return -EINVAL;
@@ -110,8 +109,6 @@ static int tcf_simp_init(struct net *net, struct nlattr 
*nla,
return -EINVAL;
}
 
-   defdata = nla_data(tb[TCA_DEF_DATA]);
-
if (!exists) {
ret = tcf_idr_create(tn, parm->index, est, a,
 _simp_ops, bind, false);
@@ -119,7 +116,7 @@ static int tcf_simp_init(struct net *net, struct nlattr 
*nla,
return ret;
 
d = to_defact(*a);
-   ret = alloc_defdata(d, defdata);
+   ret = alloc_defdata(d, tb[TCA_DEF_DATA]);
if (ret < 0) {
tcf_idr_release(*a, bind);
return ret;
@@ -133,7 +130,7 @@ static int tcf_simp_init(struct net *net, struct nlattr 
*nla,
if (!ovr)
return -EEXIST;
 
-   reset_policy(d, defdata, parm);
+   reset_policy(d, tb[TCA_DEF_DATA], parm);
}
 
if (ret == ACT_P_CREATED)
-- 
2.17.0



Re: [PATCH net] net/sched: act_simple: fix parsing of TCA_DEFDATA

2018-06-07 Thread Davide Caratti
On Fri, 2018-06-08 at 10:07 +0800, Hangbin Liu wrote:
> On Thu, Jun 07, 2018 at 03:46:43PM +0200, Davide Caratti wrote:
> > use nla_strlcpy() to avoid copying data beyond the length of TCA_DEFDATA
> 
> s/TCA_DEFDATA/TCA_DEF_DATA/, incase someone search the commit history but
> could not find it.
> 
> Thanks
> Hangbin

sure, thanks, and after another look I think also the 'Fixes:' tag is
wrong. More probably it was introduced with

fa1b1cff3d06 "net_cls_act: Make act_simple use of netlink policy."

I will send a v2 in minutes.

regards,
-- 
davide


Re: [PATCH net-next] net: ipv6: Generate random IID for addresses on RAWIP devices

2018-06-07 Thread Lorenzo Colitti
On Mon, Jun 4, 2018 at 8:51 AM 吉藤英明  wrote:
>
> > +   if (ipv6_get_lladdr(dev, , IFA_F_TENTATIVE))
> > +   get_random_bytes(eui, 8);
>
> Please be aware of I/G bit and G/L bit.


Actually, I think this is fine. RFC 7136 clarified this, and says:

==
   Thus, we can conclude that the value of the "u" bit in IIDs has no
   particular meaning.  In the case of an IID created from a MAC address
   according to RFC 4291, its value is determined by the MAC address,
   but that is all.
[...]
   Specifications of other forms of 64-bit IIDs MUST specify how all 64
   bits are set, but a generic semantic meaning for the "u" and "g" bits
   MUST NOT be defined.  However, the method of generating IIDs for
   specific link types MAY define some local significance for certain
   bits.

   In all cases, the bits in an IID have no generic semantics; in other
   words, they have opaque values.  In fact, the whole IID value MUST be
   viewed as an opaque bit string by third parties, except possibly in
   the local context.
==

That said - we already have a way to form all-random IIDs:
IN6_ADDR_GEN_MODE_RANDOM. Can't you just ensure that links of type
ARPHRD_RAWIP always use IN6_ADDR_GEN_MODE_RANDOM? That might lead to
less special-casing.


Re: [PATCH net] net/sched: act_simple: fix parsing of TCA_DEFDATA

2018-06-07 Thread Hangbin Liu
On Thu, Jun 07, 2018 at 03:46:43PM +0200, Davide Caratti wrote:
> use nla_strlcpy() to avoid copying data beyond the length of TCA_DEFDATA

s/TCA_DEFDATA/TCA_DEF_DATA/, incase someone search the commit history but
could not find it.

Thanks
Hangbin


Proposal

2018-06-07 Thread Mr. Fawaz KhE. Al Saleh




--
Good day,

i know you do not know me personally but i have checked your profile  
and i see generosity in you, There's an urgent offer attach

to your name here in the office of Mr. Fawaz KhE. Al Saleh Member of
the Board of Directors, Kuveyt Türk Participation Bank  (Turkey) and
head of private banking and wealth management
Regards,
Mr. Fawaz KhE. Al Saleh



Re: [Bug 199637] New: UBSAN: Undefined behaviour in net/ipv4/fib_trie.c:503:6

2018-06-07 Thread David Ahern
On 6/7/18 5:07 PM, Jakub Kicinski wrote:

>> After recompiling the 4.16.7 kernel with gcc 8.1, UBSAN reports the 
>> following:
>>
>> [   25.427424]
>> 
>> [   25.429680] UBSAN: Undefined behaviour in net/ipv4/fib_trie.c:503:6
>> [   25.431920] member access within null pointer of type 'struct tnode'
>> [   25.434153] CPU: 3 PID: 1 Comm: systemd Not tainted 4.16.7-CUSTOM #1
>> [   25.436384] Hardware name: Gigabyte Technology Co., Ltd.
>> H67MA-UD2H-B3/H67MA-UD2H-B3, BIOS F8 03/27/2012
>> [   25.438647] Call Trace:
>> [   25.440889]  dump_stack+0x62/0x9f
>> [   25.443104]  ubsan_epilogue+0x9/0x35
>> [   25.445293]  handle_null_ptr_deref+0x80/0x90
>> [   25.447464]  __ubsan_handle_type_mismatch_v1+0x6a/0x80
>> [   25.449628]  tnode_free+0xce/0x120

arguably this one should be guarded:

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 5bc0c89e81e4..32c589059fb3 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -501,7 +501,8 @@ static void tnode_free(struct key_vector *tn)
tnode_free_size += TNODE_SIZE(1ul << tn->bits);
node_free(tn);

-   tn = container_of(head, struct tnode, rcu)->kv;
+   if (head)
+   tn = container_of(head, struct tnode, rcu)->kv;
}

if (tnode_free_size >= PAGE_SIZE * sync_pages) {


but if head is NULL, tn is set but not dereferenced as the loop breaks.


Re: [Bug 199643] New: UBSAN: Undefined behaviour in ./include/net/route.h:240:2

2018-06-07 Thread Jakub Kicinski
CC: Andrey

On Thu, 7 Jun 2018 17:53:35 -0700, David Ahern wrote:
> On 6/7/18 5:49 PM, Jakub Kicinski wrote:
> > On Thu, 7 Jun 2018 17:28:59 -0700, Eric Dumazet wrote:  
> >> On 06/07/2018 05:11 PM, David Miller wrote:  
> >>> From: Jakub Kicinski 
> >>> Date: Thu, 7 Jun 2018 17:06:23 -0700
> >>> 
>  [  293.213661]  ip_send_unicast_reply+0x1b67/0x1d0e
> >>>
> >>> This calls ip_setup_cork() which can NULL out the 'rt' route
> >>> pointer.  Hmmm... :-/
> >>
> >> UBSAN seems unhappy  with dst being NULL in :
> >>
> >> dst_release(>dst);
> >>
> >> But the code obviously is ready for dst being NULL, it is even documented 
> >> :)  
> > 
> > Oh, so the code depends on dst being the first member?  Would it make
> > sense to just cast the pointer instead?
> >   
> 
> I've been going the other way with 'rt to dst' and 'dst to rt'
> transformations.
> 
> Perhaps UBSAN should be updated to understand that NULL + 0 is ok.


Re: [Bug 199643] New: UBSAN: Undefined behaviour in ./include/net/route.h:240:2

2018-06-07 Thread David Ahern
On 6/7/18 5:49 PM, Jakub Kicinski wrote:
> On Thu, 7 Jun 2018 17:28:59 -0700, Eric Dumazet wrote:
>> On 06/07/2018 05:11 PM, David Miller wrote:
>>> From: Jakub Kicinski 
>>> Date: Thu, 7 Jun 2018 17:06:23 -0700
>>>   
 [  293.213661]  ip_send_unicast_reply+0x1b67/0x1d0e  
>>>
>>> This calls ip_setup_cork() which can NULL out the 'rt' route
>>> pointer.  Hmmm... :-/
>>>   
>>
>>
>> UBSAN seems unhappy  with dst being NULL in :
>>
>> dst_release(>dst);
>>
>> But the code obviously is ready for dst being NULL, it is even documented :)
> 
> Oh, so the code depends on dst being the first member?  Would it make
> sense to just cast the pointer instead?
> 

I've been going the other way with 'rt to dst' and 'dst to rt'
transformations.

Perhaps UBSAN should be updated to understand that NULL + 0 is ok.


Re: [Bug 199643] New: UBSAN: Undefined behaviour in ./include/net/route.h:240:2

2018-06-07 Thread Jakub Kicinski
On Thu, 7 Jun 2018 17:28:59 -0700, Eric Dumazet wrote:
> On 06/07/2018 05:11 PM, David Miller wrote:
> > From: Jakub Kicinski 
> > Date: Thu, 7 Jun 2018 17:06:23 -0700
> >   
> >> [  293.213661]  ip_send_unicast_reply+0x1b67/0x1d0e  
> > 
> > This calls ip_setup_cork() which can NULL out the 'rt' route
> > pointer.  Hmmm... :-/
> >   
> 
> 
> UBSAN seems unhappy  with dst being NULL in :
> 
> dst_release(>dst);
> 
> But the code obviously is ready for dst being NULL, it is even documented :)

Oh, so the code depends on dst being the first member?  Would it make
sense to just cast the pointer instead?


Re: [Bug 199643] New: UBSAN: Undefined behaviour in ./include/net/route.h:240:2

2018-06-07 Thread Eric Dumazet



On 06/07/2018 05:11 PM, David Miller wrote:
> From: Jakub Kicinski 
> Date: Thu, 7 Jun 2018 17:06:23 -0700
> 
>> [  293.213661]  ip_send_unicast_reply+0x1b67/0x1d0e
> 
> This calls ip_setup_cork() which can NULL out the 'rt' route
> pointer.  Hmmm... :-/
> 


UBSAN seems unhappy  with dst being NULL in :

dst_release(>dst);

But the code obviously is ready for dst being NULL, it is even documented :)


Re: [PATCH v2] net-fq: Add WARN_ON check for null flow.

2018-06-07 Thread Cong Wang
On Thu, Jun 7, 2018 at 4:48 PM,   wrote:
> From: Ben Greear 
>
> While testing an ath10k firmware that often crashed under load,
> I was seeing kernel crashes as well.  One of them appeared to
> be a dereference of a NULL flow object in fq_tin_dequeue.
>
> I have since fixed the firmware flaw, but I think it would be
> worth adding the WARN_ON in case the problem appears again.
>
> BUG: unable to handle kernel NULL pointer dereference at 003c
> IP: ieee80211_tx_dequeue+0xfb/0xb10 [mac80211]

Instead of adding WARN_ON(), you need to think about
the locking there, it is suspicious:

fq is from struct ieee80211_local:

struct fq *fq = >fq;

tin is from struct txq_info:

struct fq_tin *tin = >tin;

I don't know if fq and tin are supposed to be 1:1, if not there is
a bug in the locking, because ->new_flows and ->old_flows are
both inside tin instead of fq, but they are protected by fq->lock


Re: [Bug 199643] New: UBSAN: Undefined behaviour in ./include/net/route.h:240:2

2018-06-07 Thread David Miller
From: Jakub Kicinski 
Date: Thu, 7 Jun 2018 17:06:23 -0700

> [  293.213661]  ip_send_unicast_reply+0x1b67/0x1d0e

This calls ip_setup_cork() which can NULL out the 'rt' route
pointer.  Hmmm... :-/


Re: [Bug 199637] New: UBSAN: Undefined behaviour in net/ipv4/fib_trie.c:503:6

2018-06-07 Thread Jakub Kicinski
On Mon, 7 May 2018 10:33:45 -0700, Stephen Hemminger wrote:
> Begin forwarded message:
> 
> Date: Mon, 07 May 2018 16:07:24 +
> From: bugzilla-dae...@bugzilla.kernel.org
> To: step...@networkplumber.org
> Subject: [Bug 199637] New: UBSAN: Undefined behaviour in 
> net/ipv4/fib_trie.c:503:6
> 
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=199637
> 
> Bug ID: 199637
>Summary: UBSAN: Undefined behaviour in
> net/ipv4/fib_trie.c:503:6
>Product: Networking
>Version: 2.5
> Kernel Version: 4.16.7
>   Hardware: x86-64
> OS: Linux
>   Tree: Mainline
> Status: NEW
>   Severity: normal
>   Priority: P1
>  Component: IPV4
>   Assignee: step...@networkplumber.org
>   Reporter: combus...@archlinux.us
> Regression: No
> 
> After recompiling the 4.16.7 kernel with gcc 8.1, UBSAN reports the following:
> 
> [   25.427424]
> 
> [   25.429680] UBSAN: Undefined behaviour in net/ipv4/fib_trie.c:503:6
> [   25.431920] member access within null pointer of type 'struct tnode'
> [   25.434153] CPU: 3 PID: 1 Comm: systemd Not tainted 4.16.7-CUSTOM #1
> [   25.436384] Hardware name: Gigabyte Technology Co., Ltd.
> H67MA-UD2H-B3/H67MA-UD2H-B3, BIOS F8 03/27/2012
> [   25.438647] Call Trace:
> [   25.440889]  dump_stack+0x62/0x9f
> [   25.443104]  ubsan_epilogue+0x9/0x35
> [   25.445293]  handle_null_ptr_deref+0x80/0x90
> [   25.447464]  __ubsan_handle_type_mismatch_v1+0x6a/0x80
> [   25.449628]  tnode_free+0xce/0x120
> [   25.451749]  ? replace+0xa0/0x1f0
> [   25.453833]  ? resize+0x4e2/0xb70
> [   25.455916]  ? __kmalloc+0x1fe/0x2d0
> [   25.457997]  ? tnode_new+0x66/0x160
> [   25.460072]  ? fib_insert_alias+0x4a8/0x9e0
> [   25.462145]  ? fib_table_insert+0x208/0x690
> [   25.464214]  ? fib_magic+0x20c/0x310
> [   25.466280]  ? fib_netdev_event+0x81/0x200
> [   25.468339]  ? notifier_call_chain+0x63/0x110
> [   25.470407]  ? __dev_notify_flags+0xa8/0x170
> [   25.472472]  ? dev_change_flags+0x56/0x80
> [   25.474538]  ? do_setlink+0x3c2/0x1a00
> [   25.476603]  ? fib_magic+0x20c/0x310
> [   25.478666]  ? rtnl_setlink+0x129/0x1e0
> [   25.480728]  ? rtnetlink_rcv_msg+0x2a4/0x7d0
> [   25.482765]  ? rtnetlink_rcv+0x10/0x10
> [   25.484757]  ? netlink_rcv_skb+0x6f/0x170
> [   25.486741]  ? netlink_unicast+0x1c0/0x2d0
> [   25.488716]  ? netlink_sendmsg+0x2c1/0x630
> [   25.490661]  ? sock_sendmsg+0x49/0xb0
> [   25.492564]  ? SyS_sendto+0x12b/0x1d0
> [   25.494449]  ? do_syscall_64+0xad/0x5cc
> [   25.496305]  ? page_fault+0x2f/0x50
> [   25.498140]  ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> [   25.499974]
> 
> 
> UBSAN reported nothing when the same kernel was compiled with gcc 7.3.1 from
> Arch Linux repositories.
> 
> I have three more similar reports to make, if I continue to c/p in each I'm
> gonna feel like a fuzzbot...
> 

And this one I'm seeing too (once at boot):

[   32.459535] 

[   32.469133] UBSAN: Undefined behaviour in ../net/ipv4/fib_trie.c:504:6
[   32.476534] member access within null pointer of type 'struct tnode'
[   32.483733] CPU: 8 PID: 1 Comm: systemd Not tainted 
4.17.0-rc7-debug-01088-g47bffcfef048 #9
[   32.493191] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.3.4 
11/08/2016
[   32.501680] Call Trace:
[   32.504513]  dump_stack+0xe6/0x1a0
[   32.508412]  ? dump_stack_print_info.cold.0+0x1b/0x1b
[   32.514164]  ? do_raw_spin_lock+0xcf/0x220
[   32.518848]  ubsan_epilogue+0x9/0x7a
[   32.522940]  handle_null_ptr_deref+0x16b/0x1e0
[   32.528008]  ? ucs2_as_utf8+0x6b0/0x6b0
[   32.532397]  ? __x64_sys_sendto+0xe6/0x1d0
[   32.537079]  ? entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   32.543025]  __ubsan_handle_type_mismatch_v1+0x16b/0x19e
[   32.549054]  ? ubsan_type_mismatch_common.part.5.cold.9+0x1bb/0x1bb
[   32.556168]  ? fib_find_node+0x350/0x350
[   32.560655]  tnode_free+0x115/0x180
[   32.564655]  replace+0x21d/0x5e0
[   32.568361]  ? fib_insert_alias+0x1b20/0x1b20
[   32.573332]  ? put_child+0x546/0x7b0
[   32.577427]  ? __kmalloc+0x1b1/0x5f0
[   32.581520]  ? fib_trie_seq_start+0x510/0x510
[   32.586497]  resize+0x1253/0x2150
[   32.590299]  ? netlink_sendmsg+0x7b5/0x10c0
[   32.595074]  ? __sys_sendto+0x340/0x680
[   32.599460]  ? do_syscall_64+0x14b/0x720
[   32.603954]  ? __node_free_rcu+0x70/0x70
[   32.608442]  ? rcu_lockdep_current_cpu_online+0x1e7/0x2c0
[   32.614578]  ? rcu_dynticks_curr_cpu_in_eqs+0xd6/0x1f0
[   32.620435]  ? lockdep_rtnl_is_held+0x16/0x20
[   32.625401]  ? put_child+0x546/0x7b0
[   32.629494]  ? __kmalloc+0x1b1/0x5f0
[   32.633586]  ? fib_trie_seq_start+0x510/0x510
[   32.638561]  ? tnode_new+0x6c/0x310
[   32.642561]  fib_insert_alias+0xe9c/0x1b20
[   

Re: [PATCH net] bpfilter: fix race in pipe access

2018-06-07 Thread David Miller
From: Alexei Starovoitov 
Date: Thu, 7 Jun 2018 15:31:14 -0700

> syzbot reported the following crash
> [  338.293946] bpfilter: read fail -512
> [  338.304515] kasan: GPF could be caused by NULL-ptr deref or user memory 
> access
> [  338.311863] general protection fault:  [#1] SMP KASAN
> [  338.344360] RIP: 0010:__vfs_write+0x4a6/0x960
> [  338.426363] Call Trace:
> [  338.456967]  __kernel_write+0x10c/0x380
> [  338.460928]  __bpfilter_process_sockopt+0x1d8/0x35b
> [  338.487103]  bpfilter_mbox_request+0x4d/0xb0
> [  338.491492]  bpfilter_ip_get_sockopt+0x6b/0x90
> 
> This can happen when multiple cpus trying to talk to user mode process
> via bpfilter_mbox_request(). One cpu grabs the mutex while another goes to
> sleep on the same mutex. Then former cpu sees that umh pipe is down and
> shuts down the pipes. Later cpu finally acquires the mutex and crashes
> on freed pipe.
> Fix the race by using info.pid as an indicator that umh and pipes are healthy
> and check it after acquiring the mutex.
> 
> Fixes: d2ba09c17a06 ("net: add skeleton of bpfilter kernel module")
> Reported-by: syzbot+7ade6c94abb2774c0...@syzkaller.appspotmail.com
> Signed-off-by: Alexei Starovoitov 

Applied, thanks Alexei.


Re: pull-request: bpf 2018-06-08

2018-06-07 Thread David Miller
From: Daniel Borkmann 
Date: Fri,  8 Jun 2018 01:30:00 +0200

> The following pull-request contains BPF updates for your *net* tree.
> 
> The main changes are:
 ...
> Please consider pulling these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Pulled, thanks Daniel.


Re: [Bug 199643] New: UBSAN: Undefined behaviour in ./include/net/route.h:240:2

2018-06-07 Thread Jakub Kicinski
On Tue, 8 May 2018 08:52:35 -0600, David Ahern wrote:
> On 5/7/18 10:12 PM, David Miller wrote:
> > From: Stephen Hemminger 
> > Date: Mon, 7 May 2018 10:34:00 -0700
> >   
> >> Subject: [Bug 199643] New: UBSAN: Undefined behaviour in 
> >> ./include/net/route.h:240:2  
> > 
> > That's an empty line in both of my trees.
> >   
> 
> In 4.16.7 it is the dst_release in:
> 
> static inline void ip_rt_put(struct rtable *rt)
> {
> /* dst_release() accepts a NULL parameter.
>  * We rely on dst being first structure in struct rtable
>  */
> BUILD_BUG_ON(offsetof(struct rtable, dst) != 0);
> --->dst_release(>dst);  

I'm seeing these on net-next as of yesterday, but admittedly I haven't
run with UBSAN enabled for a while :(  Was it resolved?

[  293.130007] UBSAN: Undefined behaviour in ../include/net/route.h:239:2
[  293.137408] member access within null pointer of type 'struct rtable'
[  293.144716] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 
4.17.0-rc7-debug-01088-g47bffcfef048 #9
[  293.154374] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.3.4 
11/08/2016
[  293.162866] Call Trace:
[  293.165696]  
[  293.168045]  dump_stack+0xe6/0x1a0
[  293.171943]  ? dump_stack_print_info.cold.0+0x1b/0x1b
[  293.177699]  ? do_raw_spin_lock+0xcf/0x220
[  293.182379]  ubsan_epilogue+0x9/0x7a
[  293.186471]  handle_null_ptr_deref+0x16b/0x1e0
[  293.191535]  ? ucs2_as_utf8+0x6b0/0x6b0
[  293.195919]  ? ip_mc_output+0x1610/0x1610
[  293.200505]  __ubsan_handle_type_mismatch_v1+0x16b/0x19e
[  293.206543]  ? ubsan_type_mismatch_common.part.5.cold.9+0x1bb/0x1bb
[  293.213661]  ip_send_unicast_reply+0x1b67/0x1d0e
[  293.218935]  ? ip_make_skb+0x410/0x410
[  293.223232]  ? lock_acquire+0x1a2/0x5a0
[  293.227622]  ? lock_release+0x980/0x980
[  293.232011]  ? free_user_ns+0x300/0x300
[  293.236396]  ? rcu_dynticks_curr_cpu_in_eqs+0xd6/0x1f0
[  293.242239]  ? rcu_bh_qs+0x500/0x500
[  293.246342]  tcp_v4_send_reset+0x13c6/0x29f0
[  293.251224]  ? tcp_v4_inbound_md5_hash+0x650/0x650
[  293.256698]  ? debug_check_no_locks_freed+0x260/0x260
[  293.262453]  ? rcu_lockdep_current_cpu_online+0x1e7/0x2c0
[  293.268586]  ? rcu_dynticks_curr_cpu_in_eqs+0xd6/0x1f0
[  293.274430]  ? rcu_start_gp_advanced+0x740/0x740
[  293.279688]  ? rcu_bh_qs+0x500/0x500
[  293.283790]  ? tcp_v4_rcv+0xf9f/0x3ec0
[  293.288075]  tcp_v4_rcv+0xf9f/0x3ec0
[  293.292189]  ? tcp_v4_early_demux+0xa70/0xa70
[  293.297179]  ? __isolate_free_page+0x890/0x890
[  293.302258]  ? __accumulate_pelt_segments+0x29/0x40
[  293.307819]  ? lock_acquire+0x1a2/0x5a0
[  293.312204]  ? ip_local_deliver_finish+0x189/0xcd0
[  293.317661]  ? raw_rcv+0x510/0x510
[  293.321564]  ? rcu_lockdep_current_cpu_online+0x1e7/0x2c0
[  293.327700]  ? rcu_dynticks_curr_cpu_in_eqs+0xd6/0x1f0
[  293.333546]  ? rcu_start_gp_advanced+0x740/0x740
[  293.338808]  ? rcu_bh_qs+0x500/0x500
[  293.342913]  ip_local_deliver_finish+0x475/0xcd0
[  293.348180]  ? inet_add_protocol.cold.0+0x28/0x28
[  293.353538]  ? rcu_read_lock_bh_held+0xc0/0xc0
[  293.358607]  ? rcu_dynticks_curr_cpu_in_eqs+0xd6/0x1f0
[  293.364455]  ip_local_deliver+0x1a1/0x680
[  293.369039]  ? ip_call_ra_chain+0x700/0x700
[  293.373816]  ? rcu_lockdep_current_cpu_online+0x1e7/0x2c0
[  293.379950]  ? rcu_dynticks_curr_cpu_in_eqs+0xd6/0x1f0
[  293.385792]  ? rcu_start_gp_advanced+0x740/0x740
[  293.391050]  ? rcu_bh_qs+0x500/0x500
[  293.395143]  ? rb_erase+0x3460/0x3460
[  293.399342]  ip_rcv_finish+0x727/0x25c0
[  293.403733]  ? ip_local_deliver_finish+0xcd0/0xcd0
[  293.409218]  ? print_irqtrace_events+0x280/0x280
[  293.414478]  ? print_irqtrace_events+0x280/0x280
[  293.419746]  ? tcp_v4_send_synack+0x450/0x450
[  293.424721]  ? print_irqtrace_events+0x280/0x280
[  293.429982]  ? enqueue_entity+0x3760/0x3760
[  293.434760]  ? print_irqtrace_events+0x280/0x280
[  293.440028]  ip_rcv+0x973/0x1758
[  293.443738]  ? ip_local_deliver+0x680/0x680
[  293.448513]  ? print_irqtrace_events+0x280/0x280
[  293.453771]  ? print_irqtrace_events+0x280/0x280
[  293.459021]  ? print_irqtrace_events+0x280/0x280
[  293.464283]  ? print_irqtrace_events+0x280/0x280
[  293.469549]  ? rcu_lockdep_current_cpu_online+0x1e7/0x2c0
[  293.475681]  ? rcu_dynticks_curr_cpu_in_eqs+0xd6/0x1f0
[  293.481526]  ? rcu_start_gp_advanced+0x740/0x740
[  293.486785]  ? rcu_bh_qs+0x500/0x500
[  293.490883]  ? ip_local_deliver+0x680/0x680
[  293.495659]  __netif_receive_skb_core+0x23e7/0x5a80
[  293.501244]  ? debug_check_no_locks_freed+0x1e0/0x260
[  293.506996]  ? netif_schedule_queue+0x2c0/0x2c0
[  293.512159]  ? __lock_acquire+0x6ad/0x3b10
[  293.516860]  ? rcu_start_gp_advanced+0x740/0x740
[  293.522122]  ? debug_check_no_locks_freed+0x260/0x260
[  293.527872]  ? rcu_read_lock_sched_held+0x107/0x120
[  293.533437]  ? nfp_net_poll+0x87/0x1a0 [nfp]
[  293.538306]  ? module_assert_mutex_or_preempt+0x41/0x70
[  293.544244]  ? __module_address+0xb4/0x860
[  293.548935]  ? unwind_next_frame+0x12e5/0x24d0
[  293.554002]  ? 

Re: [PATCH v2] net-fq: Add WARN_ON check for null flow.

2018-06-07 Thread Cong Wang
On Thu, Jun 7, 2018 at 4:48 PM,   wrote:
> diff --git a/include/net/fq_impl.h b/include/net/fq_impl.h
> index be7c0fa..cb911f0 100644
> --- a/include/net/fq_impl.h
> +++ b/include/net/fq_impl.h
> @@ -78,7 +78,10 @@ static struct sk_buff *fq_tin_dequeue(struct fq *fq,
> return NULL;
> }
>
> -   flow = list_first_entry(head, struct fq_flow, flowchain);
> +   flow = list_first_entry_or_null(head, struct fq_flow, flowchain);
> +
> +   if (WARN_ON_ONCE(!flow))
> +   return NULL;

This does not make sense either. list_first_entry_or_null()
returns NULL only when the list is empty, but we already check
list_empty() right before this code, and it is protected by fq->lock.


[PATCH v2] net-fq: Add WARN_ON check for null flow.

2018-06-07 Thread greearb
From: Ben Greear 

While testing an ath10k firmware that often crashed under load,
I was seeing kernel crashes as well.  One of them appeared to
be a dereference of a NULL flow object in fq_tin_dequeue.

I have since fixed the firmware flaw, but I think it would be
worth adding the WARN_ON in case the problem appears again.

BUG: unable to handle kernel NULL pointer dereference at 003c
IP: ieee80211_tx_dequeue+0xfb/0xb10 [mac80211]
PGD 8001417fe067 P4D 8001417fe067 PUD 13db41067 PMD 0
Oops:  [#1] PREEMPT SMP PTI
Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 
libcrc32c vrf 8021q garp mrp stp llc fuse macvlan wanlink(O) pktgen lm78 ]
CPU: 2 PID: 21733 Comm: ip Tainted: GW  O 4.16.8+ #35
Hardware name: _ _/, BIOS 5.11 08/26/2016
RIP: 0010:ieee80211_tx_dequeue+0xfb/0xb10 [mac80211]
RSP: 0018:880172d03c30 EFLAGS: 00010286
RAX: 88013b2c RBX: 88013b2c00b8 RCX: 0898
RDX: 0001 RSI: 88013b2c00d8 RDI: 88016ac40820
RBP: 88016ac42ba0 R08: 0020 R09: 
R10: 0010 R11: 001256c89fd8 R12: 88013b2c
R13: 88013b2c00d8 R14:  R15: 88013b2c00d8
FS:  7f04e3606700() GS:880172d0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 003c CR3: 00013b35a005 CR4: 003606e0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
 
 ? update_load_avg+0x607/0x6f0
 ath10k_mac_tx_push_txq+0x6e/0x220 [ath10k_core]
 ath10k_mac_tx_push_pending+0x151/0x1e0 [ath10k_core]
 ath10k_htt_txrx_compl_task+0x113e/0x1940 [ath10k_core]
 ? ath10k_ce_completed_send_next_nolock+0x6f/0x90 [ath10k_pci]
 ? ath10k_ce_completed_send_next+0x31/0x40 [ath10k_pci]
 ? ath10k_pci_htc_tx_cb+0x30/0xc0 [ath10k_pci]
 ? ath10k_bus_pci_write32+0x3c/0xa0 [ath10k_pci]
 ath10k_pci_napi_poll+0x44/0xf0 [ath10k_pci]
 net_rx_action+0x250/0x3b0
 __do_softirq+0xc2/0x2c2
 irq_exit+0x93/0xa0
 do_IRQ+0x45/0xc0
 common_interrupt+0xf/0xf
 

Signed-off-by: Ben Greear 
---

* v2:  Use list_first_entry_or_null as suggested.

 include/net/fq_impl.h | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/net/fq_impl.h b/include/net/fq_impl.h
index be7c0fa..cb911f0 100644
--- a/include/net/fq_impl.h
+++ b/include/net/fq_impl.h
@@ -78,7 +78,10 @@ static struct sk_buff *fq_tin_dequeue(struct fq *fq,
return NULL;
}
 
-   flow = list_first_entry(head, struct fq_flow, flowchain);
+   flow = list_first_entry_or_null(head, struct fq_flow, flowchain);
+
+   if (WARN_ON_ONCE(!flow))
+   return NULL;
 
if (flow->deficit <= 0) {
flow->deficit += fq->quantum;
-- 
2.4.11



pull-request: bpf 2018-06-08

2018-06-07 Thread Daniel Borkmann
Hi David,

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix in the BPF verifier to reject modified ctx pointers on helper
   functions, from Daniel.

2) Fix in BPF kselftests for get_cgroup_id_user() helper to only
   record the cgroup id for a provided pid in order to reduce test
   failures from processes interferring with the test, from Yonghong.

3) Fix a crash in AF_XDP's mem accounting when the process owning
   the sock has CAP_IPC_LOCK capabilities set, from Daniel.

4) Fix an issue for AF_XDP on 32 bit machines where XDP_UMEM_PGOFF_*_RING
   defines need ULL suffixes and use loff_t type as they are otherwise
   truncated, from Geert.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Thanks a lot!



The following changes since commit 1c8c5a9d38f607c0b6fd12c91cbe1a4418762a21:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 
(2018-06-06 18:39:49 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git 

for you to fetch changes up to c09290c5637692a9bfe7740e4c5e693efff12810:

  bpf, xdp: fix crash in xdp_umem_unaccount_pages (2018-06-07 15:32:28 -0700)


Daniel Borkmann (2):
  bpf: reject passing modified ctx to helper functions
  bpf, xdp: fix crash in xdp_umem_unaccount_pages

Geert Uytterhoeven (1):
  xsk: Fix umem fill/completion queue mmap on 32-bit

Yonghong Song (1):
  tools/bpf: fix selftest get_cgroup_id_user

 include/uapi/linux/if_xdp.h  |  4 +-
 kernel/bpf/verifier.c| 48 +---
 net/xdp/xdp_umem.c   |  6 ++-
 net/xdp/xsk.c|  2 +-
 tools/testing/selftests/bpf/get_cgroup_id_kern.c | 14 +-
 tools/testing/selftests/bpf/get_cgroup_id_user.c | 12 -
 tools/testing/selftests/bpf/test_verifier.c  | 58 +++-
 7 files changed, 118 insertions(+), 26 deletions(-)


Re: [PATCH bpf-next v6 1/2] trace_helpers.c: Add helpers to poll multiple perf FDs for events

2018-06-07 Thread Daniel Borkmann
Hi Toke,

On 06/06/2018 07:58 PM, Toke Høiland-Jørgensen wrote:
> Add two new helper functions to trace_helpers that supports polling
> multiple perf file descriptors for events. These are used to the XDP
> perf_event_output example, which needs to work with one perf fd per CPU.
> 
> Reviewed-by: Jakub Kicinski 
> Signed-off-by: Toke Høiland-Jørgensen 

Didn't make it in time unfortunately as bpf-next closed, please resend
these two once merge window is over and bpf-next open again.

Thanks a lot,
Daniel


Re: [PATCH bpf] bpf, xdp: fix crash in xdp_umem_unaccount_pages

2018-06-07 Thread Alexei Starovoitov
On Fri, Jun 08, 2018 at 12:06:01AM +0200, Daniel Borkmann wrote:
> syzkaller was able to trigger the following panic for AF_XDP:
> 
>   BUG: KASAN: null-ptr-deref in atomic64_sub 
> include/asm-generic/atomic-instrumented.h:144 [inline]
>   BUG: KASAN: null-ptr-deref in atomic_long_sub 
> include/asm-generic/atomic-long.h:199 [inline]
>   BUG: KASAN: null-ptr-deref in xdp_umem_unaccount_pages.isra.4+0x3d/0x80 
> net/xdp/xdp_umem.c:135
>   Write of size 8 at addr 0060 by task syz-executor246/4527
> 
>   CPU: 1 PID: 4527 Comm: syz-executor246 Not tainted 4.17.0+ #89
>   Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
>   Call Trace:
>__dump_stack lib/dump_stack.c:77 [inline]
>dump_stack+0x1b9/0x294 lib/dump_stack.c:113
>kasan_report_error mm/kasan/report.c:352 [inline]
>kasan_report.cold.7+0x6d/0x2fe mm/kasan/report.c:412
>check_memory_region_inline mm/kasan/kasan.c:260 [inline]
>check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
>kasan_check_write+0x14/0x20 mm/kasan/kasan.c:278
>atomic64_sub include/asm-generic/atomic-instrumented.h:144 [inline]
>atomic_long_sub include/asm-generic/atomic-long.h:199 [inline]
>xdp_umem_unaccount_pages.isra.4+0x3d/0x80 net/xdp/xdp_umem.c:135
>xdp_umem_reg net/xdp/xdp_umem.c:334 [inline]
>xdp_umem_create+0xd6c/0x10f0 net/xdp/xdp_umem.c:349
>xsk_setsockopt+0x443/0x550 net/xdp/xsk.c:531
>__sys_setsockopt+0x1bd/0x390 net/socket.c:1935
>__do_sys_setsockopt net/socket.c:1946 [inline]
>__se_sys_setsockopt net/socket.c:1943 [inline]
>__x64_sys_setsockopt+0xbe/0x150 net/socket.c:1943
>do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
>entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> In xdp_umem_reg() the call to xdp_umem_account_pages() passed
> with CAP_IPC_LOCK where we didn't need to end up charging rlimit
> on memlock for the current user and therefore umem->user continues
> to be NULL. Later on through fault injection syzkaller triggered
> a failure in either umem->pgs or umem->pages allocation such that
> we bail out and undo accounting in xdp_umem_unaccount_pages()
> where we eventually hit the panic since it tries to deref the
> umem->user.
> 
> The code is pretty close to mm_account_pinned_pages() and
> mm_unaccount_pinned_pages() pair and potentially could reuse
> it even in a later cleanup, and it appears that the initial
> commit c0c77d8fb787 ("xsk: add user memory registration support
> sockopt") got this right while later follow-up introduced the
> bug via a49049ea2576 ("xsk: simplified umem setup").
> 
> Fixes: a49049ea2576 ("xsk: simplified umem setup")
> Reported-by: syzbot+979217770b09ebf5c...@syzkaller.appspotmail.com
> Signed-off-by: Daniel Borkmann 

Applied, Thanks


Re: [Patch net] socket: close race condition between sock_close() and sockfs_setattr()

2018-06-07 Thread Al Viro
On Thu, Jun 07, 2018 at 03:15:15PM -0700, Cong Wang wrote:

> > You do realize that the same ->setattr() can be called by way of
> > chown() on /proc/self/fd/, right?  What would you do there -
> > bump refcount on that struct file when traversing that symlink and
> > hold it past the end of pathname resolution, until... what exactly?
> 
> I was thinking about this:
> 
> error = user_path_at(dfd,); // hold dfd when needed
> 
> if (!error) {
> error = chown_common(, mode);
> path_put();  // release dfd if held
> 
> With this, we can guarantee ->release() is only possibly called
> after chown_common() which is after ->setattr() too.

No, we can't.  You are assuming that there *is* dfd and that it points
to the opened socket we are going to operate upon.  That is not guaranteed.
At all.  If e.g. 42 is a file descriptor of an opened socket, plain chown(2)
on /proc/self/fd/42 will trigger that ->setattr().  No dfd in sight.
We do run across an opened file at some point, all right - when we traverse
the symlink in procfs.  You can't bump ->f_count there.  Even leaving aside
the "where would I stash the reference to that file?" and "how long would I
hold it?", you can't bump ->f_count on other process' files.  That would
bugger the expectations of close(2) callers.


[PATCH net] bpfilter: fix race in pipe access

2018-06-07 Thread Alexei Starovoitov
syzbot reported the following crash
[  338.293946] bpfilter: read fail -512
[  338.304515] kasan: GPF could be caused by NULL-ptr deref or user memory 
access
[  338.311863] general protection fault:  [#1] SMP KASAN
[  338.344360] RIP: 0010:__vfs_write+0x4a6/0x960
[  338.426363] Call Trace:
[  338.456967]  __kernel_write+0x10c/0x380
[  338.460928]  __bpfilter_process_sockopt+0x1d8/0x35b
[  338.487103]  bpfilter_mbox_request+0x4d/0xb0
[  338.491492]  bpfilter_ip_get_sockopt+0x6b/0x90

This can happen when multiple cpus trying to talk to user mode process
via bpfilter_mbox_request(). One cpu grabs the mutex while another goes to
sleep on the same mutex. Then former cpu sees that umh pipe is down and
shuts down the pipes. Later cpu finally acquires the mutex and crashes
on freed pipe.
Fix the race by using info.pid as an indicator that umh and pipes are healthy
and check it after acquiring the mutex.

Fixes: d2ba09c17a06 ("net: add skeleton of bpfilter kernel module")
Reported-by: syzbot+7ade6c94abb2774c0...@syzkaller.appspotmail.com
Signed-off-by: Alexei Starovoitov 
---
 net/bpfilter/bpfilter_kern.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/net/bpfilter/bpfilter_kern.c b/net/bpfilter/bpfilter_kern.c
index b13d058f8c34..09522573f611 100644
--- a/net/bpfilter/bpfilter_kern.c
+++ b/net/bpfilter/bpfilter_kern.c
@@ -24,17 +24,19 @@ static void shutdown_umh(struct umh_info *info)
 {
struct task_struct *tsk;
 
+   if (!info->pid)
+   return;
tsk = pid_task(find_vpid(info->pid), PIDTYPE_PID);
if (tsk)
force_sig(SIGKILL, tsk);
fput(info->pipe_to_umh);
fput(info->pipe_from_umh);
+   info->pid = 0;
 }
 
 static void __stop_umh(void)
 {
-   if (IS_ENABLED(CONFIG_INET) &&
-   bpfilter_process_sockopt) {
+   if (IS_ENABLED(CONFIG_INET)) {
bpfilter_process_sockopt = NULL;
shutdown_umh();
}
@@ -55,7 +57,7 @@ static int __bpfilter_process_sockopt(struct sock *sk, int 
optname,
struct mbox_reply reply;
loff_t pos;
ssize_t n;
-   int ret;
+   int ret = -EFAULT;
 
req.is_set = is_set;
req.pid = current->pid;
@@ -63,6 +65,8 @@ static int __bpfilter_process_sockopt(struct sock *sk, int 
optname,
req.addr = (long)optval;
req.len = optlen;
mutex_lock(_lock);
+   if (!info.pid)
+   goto out;
n = __kernel_write(info.pipe_to_umh, , sizeof(req), );
if (n != sizeof(req)) {
pr_err("write fail %zd\n", n);
-- 
2.9.5



Re: [PATCH bpf] tools/bpf: fix selftest get_cgroup_id_user

2018-06-07 Thread Daniel Borkmann
On 06/06/2018 06:12 PM, Yonghong Song wrote:
> Commit f269099a7e7a ("tools/bpf: add a selftest for
> bpf_get_current_cgroup_id() helper") added a test
> for bpf_get_current_cgroup_id() helper. The bpf program
> is attached to tracepoint syscalls/sys_enter_nanosleep
> and will record the cgroup id if the tracepoint is hit.
> The test program creates a cgroup and attachs itself to
> this cgroup and expects that the test program process
> cgroup id is the same as the cgroup_id retrieved
> by the bpf program.
> 
> In a light system where no other processes called
> nanosleep syscall, the test case can pass.
> In a busy system where many different processes can hit
> syscalls/sys_enter_nanosleep tracepoint, the cgroup id
> recorded by bpf program may not match the test program
> process cgroup_id.
> 
> This patch fixed an issue by communicating the test program
> pid to bpf program. The bpf program only records
> cgroup id if the current task pid is the same as
> passed-in pid. This ensures that the recorded cgroup_id
> is for the cgroup within which the test program resides.
> 
> Fixes: f269099a7e7a ("tools/bpf: add a selftest for 
> bpf_get_current_cgroup_id() helper")
> Signed-off-by: Yonghong Song 

Applied to bpf, thanks Yonghong!


Re: [Patch net] socket: close race condition between sock_close() and sockfs_setattr()

2018-06-07 Thread Cong Wang
On Thu, Jun 7, 2018 at 3:04 PM, Al Viro  wrote:
> On Thu, Jun 07, 2018 at 02:45:58PM -0700, Cong Wang wrote:
>> On Thu, Jun 7, 2018 at 2:26 PM, Al Viro  wrote:
>> > On Thu, Jun 07, 2018 at 01:39:49PM -0700, Cong Wang wrote:
>> >> fchownat() doesn't even hold refcnt of fd until it figures out
>> >> fd is really needed (otherwise is ignored) and releases it after
>> >> it resolves the path. This means sock_close() could race with
>> >> sockfs_setattr(), which leads to a NULL pointer dereference
>> >> since typically we set sock->sk to NULL in ->release().
>> >>
>> >> As pointed out by Al, this is unique to sockfs. So we can fix this
>> >> in socket layer by acquiring inode_lock in sock_close() and
>> >> checking against NULL in sockfs_setattr().
>> >
>> > That looks like a massive overkill - it's way heavier than it should be.
>>
>> I don't see any other quick way to fix this. My initial thought is
>> to keep that refcnt until path_put(), apparently you don't like it
>> either.
>
> You do realize that the same ->setattr() can be called by way of
> chown() on /proc/self/fd/, right?  What would you do there -
> bump refcount on that struct file when traversing that symlink and
> hold it past the end of pathname resolution, until... what exactly?

I was thinking about this:

error = user_path_at(dfd,); // hold dfd when needed

if (!error) {
error = chown_common(, mode);
path_put();  // release dfd if held

With this, we can guarantee ->release() is only possibly called
after chown_common() which is after ->setattr() too.


Hello Dear

2018-06-07 Thread Charles Koch
Hello Fellow

Charles Koch here, ceo charles koch foundation worldwide (the largest charity 
foundation in the world).
This year under our motto of "Give while living", i and the foundation have 
decided to award $500,000.00
to a few selected lucky individuals and on receipt of this email kindly get 
back to me.


Regards,
Charles Koch.
chkoch2...@dbzmail.com


Re: [PATCH] net-fq: Add WARN_ON check for null flow.

2018-06-07 Thread Ben Greear

On 06/07/2018 02:52 PM, Cong Wang wrote:

On Thu, Jun 7, 2018 at 2:41 PM, Ben Greear  wrote:

On 06/07/2018 02:29 PM, Cong Wang wrote:


On Thu, Jun 7, 2018 at 9:06 AM,   wrote:


--- a/include/net/fq_impl.h
+++ b/include/net/fq_impl.h
@@ -80,6 +80,9 @@ static struct sk_buff *fq_tin_dequeue(struct fq *fq,

flow = list_first_entry(head, struct fq_flow, flowchain);

+   if (WARN_ON_ONCE(!flow))
+   return NULL;
+



How could even possibly list_first_entry() returns NULL?
You need list_first_entry_or_null().



I don't know for certain flow as null, but something was NULL in this method
near that line and it looked like a likely culprit.

I guess possibly tin or fq was passed in as NULL?


A NULL pointer is not always 0. You can trigger a NULL-ptr-def with 0x3c
too, but you are checking against 0 in your patch, that is the problem and
that is why list_first_entry_or_null() exists.



Ahh, I see what you mean, and that is my mistake.  In my case, it did seem to
be a mostly-null deref, not a 0x0 deref.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



[PATCH bpf] bpf, xdp: fix crash in xdp_umem_unaccount_pages

2018-06-07 Thread Daniel Borkmann
syzkaller was able to trigger the following panic for AF_XDP:

  BUG: KASAN: null-ptr-deref in atomic64_sub 
include/asm-generic/atomic-instrumented.h:144 [inline]
  BUG: KASAN: null-ptr-deref in atomic_long_sub 
include/asm-generic/atomic-long.h:199 [inline]
  BUG: KASAN: null-ptr-deref in xdp_umem_unaccount_pages.isra.4+0x3d/0x80 
net/xdp/xdp_umem.c:135
  Write of size 8 at addr 0060 by task syz-executor246/4527

  CPU: 1 PID: 4527 Comm: syz-executor246 Not tainted 4.17.0+ #89
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
Google 01/01/2011
  Call Trace:
   __dump_stack lib/dump_stack.c:77 [inline]
   dump_stack+0x1b9/0x294 lib/dump_stack.c:113
   kasan_report_error mm/kasan/report.c:352 [inline]
   kasan_report.cold.7+0x6d/0x2fe mm/kasan/report.c:412
   check_memory_region_inline mm/kasan/kasan.c:260 [inline]
   check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
   kasan_check_write+0x14/0x20 mm/kasan/kasan.c:278
   atomic64_sub include/asm-generic/atomic-instrumented.h:144 [inline]
   atomic_long_sub include/asm-generic/atomic-long.h:199 [inline]
   xdp_umem_unaccount_pages.isra.4+0x3d/0x80 net/xdp/xdp_umem.c:135
   xdp_umem_reg net/xdp/xdp_umem.c:334 [inline]
   xdp_umem_create+0xd6c/0x10f0 net/xdp/xdp_umem.c:349
   xsk_setsockopt+0x443/0x550 net/xdp/xsk.c:531
   __sys_setsockopt+0x1bd/0x390 net/socket.c:1935
   __do_sys_setsockopt net/socket.c:1946 [inline]
   __se_sys_setsockopt net/socket.c:1943 [inline]
   __x64_sys_setsockopt+0xbe/0x150 net/socket.c:1943
   do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

In xdp_umem_reg() the call to xdp_umem_account_pages() passed
with CAP_IPC_LOCK where we didn't need to end up charging rlimit
on memlock for the current user and therefore umem->user continues
to be NULL. Later on through fault injection syzkaller triggered
a failure in either umem->pgs or umem->pages allocation such that
we bail out and undo accounting in xdp_umem_unaccount_pages()
where we eventually hit the panic since it tries to deref the
umem->user.

The code is pretty close to mm_account_pinned_pages() and
mm_unaccount_pinned_pages() pair and potentially could reuse
it even in a later cleanup, and it appears that the initial
commit c0c77d8fb787 ("xsk: add user memory registration support
sockopt") got this right while later follow-up introduced the
bug via a49049ea2576 ("xsk: simplified umem setup").

Fixes: a49049ea2576 ("xsk: simplified umem setup")
Reported-by: syzbot+979217770b09ebf5c...@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann 
---
 net/xdp/xdp_umem.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 7eb4948..b9ef487 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -132,8 +132,10 @@ static void xdp_umem_unpin_pages(struct xdp_umem *umem)
 
 static void xdp_umem_unaccount_pages(struct xdp_umem *umem)
 {
-   atomic_long_sub(umem->npgs, >user->locked_vm);
-   free_uid(umem->user);
+   if (umem->user) {
+   atomic_long_sub(umem->npgs, >user->locked_vm);
+   free_uid(umem->user);
+   }
 }
 
 static void xdp_umem_release(struct xdp_umem *umem)
-- 
2.9.5



Re: [Patch net] socket: close race condition between sock_close() and sockfs_setattr()

2018-06-07 Thread Al Viro
On Thu, Jun 07, 2018 at 02:45:58PM -0700, Cong Wang wrote:
> On Thu, Jun 7, 2018 at 2:26 PM, Al Viro  wrote:
> > On Thu, Jun 07, 2018 at 01:39:49PM -0700, Cong Wang wrote:
> >> fchownat() doesn't even hold refcnt of fd until it figures out
> >> fd is really needed (otherwise is ignored) and releases it after
> >> it resolves the path. This means sock_close() could race with
> >> sockfs_setattr(), which leads to a NULL pointer dereference
> >> since typically we set sock->sk to NULL in ->release().
> >>
> >> As pointed out by Al, this is unique to sockfs. So we can fix this
> >> in socket layer by acquiring inode_lock in sock_close() and
> >> checking against NULL in sockfs_setattr().
> >
> > That looks like a massive overkill - it's way heavier than it should be.
> 
> I don't see any other quick way to fix this. My initial thought is
> to keep that refcnt until path_put(), apparently you don't like it
> either.

You do realize that the same ->setattr() can be called by way of
chown() on /proc/self/fd/, right?  What would you do there -
bump refcount on that struct file when traversing that symlink and
hold it past the end of pathname resolution, until... what exactly?


[PATCH net] net: aquantia: fix unsigned numvecs comparison with less than zero

2018-06-07 Thread Igor Russkikh
From: Colin Ian King 

From: Colin Ian King 

This was originally mistakenly submitted to net-next. Resubmitting to net.

The comparison of numvecs < 0 is always false because numvecs is a u32
and hence the error return from a failed call to pci_alloc_irq_vectores
is never detected.  Fix this by using the signed int ret to handle the
error return and assign numvecs to err.

Detected by CoverityScan, CID#1468650 ("Unsigned compared against 0")

Fixes: a09bd81b5413 ("net: aquantia: Limit number of vectors to actually 
allocated irqs")
Signed-off-by: Colin Ian King 
Signed-off-by: Igor Russkikh 
---
 drivers/net/ethernet/aquantia/atlantic/aq_pci_func.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_pci_func.c 
b/drivers/net/ethernet/aquantia/atlantic/aq_pci_func.c
index a50e08bb4748..750007513f9d 100644
--- a/drivers/net/ethernet/aquantia/atlantic/aq_pci_func.c
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_pci_func.c
@@ -267,14 +267,13 @@ static int aq_pci_probe(struct pci_dev *pdev,
numvecs = min(numvecs, num_online_cpus());
/*enable interrupts */
 #if !AQ_CFG_FORCE_LEGACY_INT
-   numvecs = pci_alloc_irq_vectors(self->pdev, 1, numvecs,
-   PCI_IRQ_MSIX | PCI_IRQ_MSI |
-   PCI_IRQ_LEGACY);
+   err = pci_alloc_irq_vectors(self->pdev, 1, numvecs,
+   PCI_IRQ_MSIX | PCI_IRQ_MSI |
+   PCI_IRQ_LEGACY);
 
-   if (numvecs < 0) {
-   err = numvecs;
+   if (err < 0)
goto err_hwinit;
-   }
+   numvecs = err;
 #endif
self->irqvecs = numvecs;
 
-- 
2.17.1



Re: [PATCH] net-fq: Add WARN_ON check for null flow.

2018-06-07 Thread Cong Wang
On Thu, Jun 7, 2018 at 2:41 PM, Ben Greear  wrote:
> On 06/07/2018 02:29 PM, Cong Wang wrote:
>>
>> On Thu, Jun 7, 2018 at 9:06 AM,   wrote:
>>>
>>> --- a/include/net/fq_impl.h
>>> +++ b/include/net/fq_impl.h
>>> @@ -80,6 +80,9 @@ static struct sk_buff *fq_tin_dequeue(struct fq *fq,
>>>
>>> flow = list_first_entry(head, struct fq_flow, flowchain);
>>>
>>> +   if (WARN_ON_ONCE(!flow))
>>> +   return NULL;
>>> +
>>
>>
>> How could even possibly list_first_entry() returns NULL?
>> You need list_first_entry_or_null().
>>
>
> I don't know for certain flow as null, but something was NULL in this method
> near that line and it looked like a likely culprit.
>
> I guess possibly tin or fq was passed in as NULL?

A NULL pointer is not always 0. You can trigger a NULL-ptr-def with 0x3c
too, but you are checking against 0 in your patch, that is the problem and
that is why list_first_entry_or_null() exists.


Re: [Patch net] socket: close race condition between sock_close() and sockfs_setattr()

2018-06-07 Thread Cong Wang
On Thu, Jun 7, 2018 at 2:26 PM, Al Viro  wrote:
> On Thu, Jun 07, 2018 at 01:39:49PM -0700, Cong Wang wrote:
>> fchownat() doesn't even hold refcnt of fd until it figures out
>> fd is really needed (otherwise is ignored) and releases it after
>> it resolves the path. This means sock_close() could race with
>> sockfs_setattr(), which leads to a NULL pointer dereference
>> since typically we set sock->sk to NULL in ->release().
>>
>> As pointed out by Al, this is unique to sockfs. So we can fix this
>> in socket layer by acquiring inode_lock in sock_close() and
>> checking against NULL in sockfs_setattr().
>
> That looks like a massive overkill - it's way heavier than it should be.

I don't see any other quick way to fix this. My initial thought is
to keep that refcnt until path_put(), apparently you don't like it
either.


> And it's very likely to trigger shitloads of deadlock warnings, some
> possibly even true.

I do audit the inode_lock usage in networking, I don't see any
deadlock, of course, there could be some non-networking code
uses socket API that I missed. But _generally_, socket doesn't
have a pointer to retrieve this inode.


Re: [PATCH] net-fq: Add WARN_ON check for null flow.

2018-06-07 Thread Ben Greear

On 06/07/2018 02:29 PM, Cong Wang wrote:

On Thu, Jun 7, 2018 at 9:06 AM,   wrote:

--- a/include/net/fq_impl.h
+++ b/include/net/fq_impl.h
@@ -80,6 +80,9 @@ static struct sk_buff *fq_tin_dequeue(struct fq *fq,

flow = list_first_entry(head, struct fq_flow, flowchain);

+   if (WARN_ON_ONCE(!flow))
+   return NULL;
+


How could even possibly list_first_entry() returns NULL?
You need list_first_entry_or_null().



I don't know for certain flow as null, but something was NULL in this method
near that line and it looked like a likely culprit.

I guess possibly tin or fq was passed in as NULL?

Anyway, if the patch seems worthless just ignore it.  I'll leave it in my tree
since it should be harmless and will let you know if I ever hit it.

If someone else hits a similar crash, hopefully they can report it.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH] net-fq: Add WARN_ON check for null flow.

2018-06-07 Thread Cong Wang
On Thu, Jun 7, 2018 at 9:06 AM,   wrote:
> --- a/include/net/fq_impl.h
> +++ b/include/net/fq_impl.h
> @@ -80,6 +80,9 @@ static struct sk_buff *fq_tin_dequeue(struct fq *fq,
>
> flow = list_first_entry(head, struct fq_flow, flowchain);
>
> +   if (WARN_ON_ONCE(!flow))
> +   return NULL;
> +

How could even possibly list_first_entry() returns NULL?
You need list_first_entry_or_null().


Re: [Patch net] socket: close race condition between sock_close() and sockfs_setattr()

2018-06-07 Thread Al Viro
On Thu, Jun 07, 2018 at 01:39:49PM -0700, Cong Wang wrote:
> fchownat() doesn't even hold refcnt of fd until it figures out
> fd is really needed (otherwise is ignored) and releases it after
> it resolves the path. This means sock_close() could race with
> sockfs_setattr(), which leads to a NULL pointer dereference
> since typically we set sock->sk to NULL in ->release().
> 
> As pointed out by Al, this is unique to sockfs. So we can fix this
> in socket layer by acquiring inode_lock in sock_close() and
> checking against NULL in sockfs_setattr().

That looks like a massive overkill - it's way heavier than it should be.
And it's very likely to trigger shitloads of deadlock warnings, some
possibly even true.


Re: [PATCH net-next] bpfilter: fix OUTPUT_FORMAT

2018-06-07 Thread David Miller
From: Alexei Starovoitov 
Date: Thu, 7 Jun 2018 10:29:29 -0700

> From: Alexei Starovoitov 
> 
> CONFIG_OUTPUT_FORMAT is x86 only macro.
> Used objdump to extract elf file format.
> 
> Fixes: d2ba09c17a06 ("net: add skeleton of bpfilter kernel module")
> Reported-by: David S. Miller 
> Signed-off-by: Alexei Starovoitov 

Applied.


Re: [PATCH net] bonding: re-evaluate force_primary when the primary slave name changes

2018-06-07 Thread David Miller
From: Xiangning Yu 
Date: Thu, 7 Jun 2018 13:39:59 +0800

> From: Xiangning Yu 
> 
> There is a timing issue under active-standy mode, when bond_enslave() is
> called, bond->params.primary might not be initialized yet.
> 
> Any time the primary slave string changes, bond->force_primary should be
> set to true to make sure the primary becomes the active slave.
> 
> Signed-off-by: Xiangning Yu 

Applied and queued up for -stable.


[Patch net] socket: close race condition between sock_close() and sockfs_setattr()

2018-06-07 Thread Cong Wang
fchownat() doesn't even hold refcnt of fd until it figures out
fd is really needed (otherwise is ignored) and releases it after
it resolves the path. This means sock_close() could race with
sockfs_setattr(), which leads to a NULL pointer dereference
since typically we set sock->sk to NULL in ->release().

As pointed out by Al, this is unique to sockfs. So we can fix this
in socket layer by acquiring inode_lock in sock_close() and
checking against NULL in sockfs_setattr().

sock_release() is called in many places, only the sock_close()
path matters here. And fortunately, this should not affect normal
sock_close() as it is only called when the last fd refcnt is gone.
It only affects sock_close() with a parallel sockfs_setattr() in
progress, which is not common.

Fixes: 86741ec25462 ("net: core: Add a UID field to struct sock.")
Reported-by: shankarapailoor 
Cc: Tetsuo Handa 
Cc: Lorenzo Colitti 
Cc: Al Viro 
Signed-off-by: Cong Wang 
---
 net/socket.c | 18 +++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/net/socket.c b/net/socket.c
index af57d85bcb48..8a109012608a 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -541,7 +541,10 @@ static int sockfs_setattr(struct dentry *dentry, struct 
iattr *iattr)
if (!err && (iattr->ia_valid & ATTR_UID)) {
struct socket *sock = SOCKET_I(d_inode(dentry));
 
-   sock->sk->sk_uid = iattr->ia_uid;
+   if (sock->sk)
+   sock->sk->sk_uid = iattr->ia_uid;
+   else
+   err = -ENOENT;
}
 
return err;
@@ -590,12 +593,16 @@ EXPORT_SYMBOL(sock_alloc);
  * an inode not a file.
  */
 
-void sock_release(struct socket *sock)
+static void __sock_release(struct socket *sock, struct inode *inode)
 {
if (sock->ops) {
struct module *owner = sock->ops->owner;
 
+   if (inode)
+   inode_lock(inode);
sock->ops->release(sock);
+   if (inode)
+   inode_unlock(inode);
sock->ops = NULL;
module_put(owner);
}
@@ -609,6 +616,11 @@ void sock_release(struct socket *sock)
}
sock->file = NULL;
 }
+
+void sock_release(struct socket *sock)
+{
+   __sock_release(sock, NULL);
+}
 EXPORT_SYMBOL(sock_release);
 
 void __sock_tx_timestamp(__u16 tsflags, __u8 *tx_flags)
@@ -1171,7 +1183,7 @@ static int sock_mmap(struct file *file, struct 
vm_area_struct *vma)
 
 static int sock_close(struct inode *inode, struct file *filp)
 {
-   sock_release(SOCKET_I(inode));
+   __sock_release(SOCKET_I(inode), inode);
return 0;
 }
 
-- 
2.13.0



Re: [PATCH] ip_tunnel: Fix name string concatenate in __ip_tunnel_create()

2018-06-07 Thread David Miller
From: Sultan Alsawaf 
Date: Wed,  6 Jun 2018 15:56:54 -0700

> By passing a limit of 2 bytes to strncat, strncat is limited to writing
> fewer bytes than what it's supposed to append to the name here.
> 
> Since the bounds are checked on the line above this, just remove the string
> bounds checks entirely since they're unneeded.
> 
> Signed-off-by: Sultan Alsawaf 

Applied.


Re: [PATCH bpf-next v5 00/10] BTF: BPF Type Format

2018-06-07 Thread Arnaldo Carvalho de Melo
Em Thu, Jun 07, 2018 at 01:07:01PM -0700, Martin KaFai Lau escreveu:
> On Thu, Jun 07, 2018 at 04:30:29PM -0300, Arnaldo Carvalho de Melo wrote:
> > So this must be available in a newer llvm version? Which one?

> I should have put in the details in my last email or
> in the commit message, my bad.
 
> 1. The tools/testing/selftests/bpf/Makefile has the CLANG_FLAGS and
>LLC_FLAGS needed to compile the bpf prog.  It requires a new
>"-mattr=dwarf" llc option which was added to the future
>llvm 7.0.
 
>Hence, I have been using the llvm's master in github which
>also has the llvm-objcopy.
 
> 2. The kernel's btf part only focus on the BPF map.
>Hence, the testing bpf program should have the map's key
>and map's value.  e.g. tools/testing/selftests/bpf/test_btf_haskv.c

Thanks for the version required to test this, but where is this
test_btf_haskv.c file? Which tree? net-next?

Ok, just pulled torvalds/master and there it is. Gotcha.

struct bpf_map_def SEC("maps") __bpf_stdout__ = {
   .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
   .key_size = sizeof(int),
   .value_size = sizeof(u32),
   .max_entries = __NR_CPUS__,
};

This map is in the above hello.c example, but I guess its way too simple
:-)

Ok, I'll test this at home in another machine where I have the llvm's
git repo.

- Arnaldo


Re: [PATCH net] net: in virtio_net_hdr only add VLAN_HLEN to csum_start if payload holds vlan

2018-06-07 Thread David Miller
From: Willem de Bruijn 
Date: Wed,  6 Jun 2018 11:23:01 -0400

> From: Willem de Bruijn 
> 
> Tun, tap, virtio, packet and uml vector all use struct virtio_net_hdr
> to communicate packet metadata to userspace.
> 
> For skbuffs with vlan, the first two return the packet as it may have
> existed on the wire, inserting the VLAN tag in the user buffer.  Then
> virtio_net_hdr.csum_start needs to be adjusted by VLAN_HLEN bytes.
> 
> Commit f09e2249c4f5 ("macvtap: restore vlan header on user read")
> added this feature to macvtap. Commit 3ce9b20f1971 ("macvtap: Fix
> csum_start when VLAN tags are present") then fixed up csum_start.
> 
> Virtio, packet and uml do not insert the vlan header in the user
> buffer.
> 
> When introducing virtio_net_hdr_from_skb to deduplicate filling in
> the virtio_net_hdr, the variant from macvtap which adds VLAN_HLEN was
> applied uniformly, breaking csum offset for packets with vlan on
> virtio and packet.
> 
> Make insertion of VLAN_HLEN optional. Convert the callers to pass it
> when needed.
> 
> Fixes: e858fae2b0b8f4 ("virtio_net: use common code for virtio_net_hdr and 
> skb GSO conversion")
> Fixes: 1276f24eeef2 ("packet: use common code for virtio_net_hdr and skb GSO 
> conversion")
> Signed-off-by: Willem de Bruijn 

Applied and queued up for -stable, thanks Willem.


Re: [PATCH bpf-next v5 00/10] BTF: BPF Type Format

2018-06-07 Thread Martin KaFai Lau
On Thu, Jun 07, 2018 at 04:30:29PM -0300, Arnaldo Carvalho de Melo wrote:
> Em Thu, Jun 07, 2018 at 12:05:10PM -0700, Martin KaFai Lau escreveu:
> > On Thu, Jun 07, 2018 at 11:03:37AM -0300, Arnaldo Carvalho de Melo wrote:
> > > Em Thu, Jun 07, 2018 at 10:54:01AM -0300, Arnaldo Carvalho de Melo 
> > > escreveu:
> > > > Em Tue, Jun 05, 2018 at 02:25:48PM -0700, Martin KaFai Lau escreveu:
> > > > > [ btw, the latest commit (1 commit) should be 94a11b59e592 ].
> 
> > > So, the commit log message for the pahole patch is non-existent:
> 
> > > https://github.com/iamkafai/pahole/commit/94a11b59e5920908085bfc8d24c92f95c8ffceaf
> 
> > > we should do better in describing what is done and how, I'm staring
> > > with a message you sent to the kernel part:
> > > --
> > > This patch introduces BPF Type Format (BTF).
> 
> > > BTF (BPF Type Format) is the meta data format which describes
> > > the data types of BPF program/map.  Hence, it basically focus
> > > on the C programming language which the modern BPF is primary
> > > using.  The first use case is to provide a generic pretty print
> > > capability for a BPF map.
> 
> > I will add details in the next github respin/push.
> 
> Ok, but I can do that if there is nothing else to do in the code at this
> stage :-)
>  
> > > Now I'm going to do the step-by-step guide on testing the feature just
> > > introduced, and will try to convert from dwarf to BTF and back, compare
> > > the pahole output for types encoded in DWARF and BTF, etc.
> 
> > > If you have something ressembling this already, please share.
> 
> > The pahole only has the encoder part.  I tested with the verbose output
> > from the "pahole -V -J".  Loading the btf to the kernel is also tested.
> 
> Ok, so here it goes my first stab at testing, using perf's BPF
> integration:
> 
> [root@jouet bpf]# cat hello.c
> #include "stdio.h"
> 
> int syscall_enter(openat)(void *ctx)
> {
>   puts("Hello, world\n");
>   return 0;
> }
> [root@jouet bpf]# cat ~/.perfconfig
> [llvm]
> dump-obj = true
> [root@jouet bpf]# perf trace -e open*,hello.c touch /tmp/hello.BTF
> LLVM: dumping hello.o
>  0.017 ( ): __bpf_stdout__:Hello, world
>  0.019 ( 0.011 ms): touch/28147 openat(dfd: CWD, filename: 
> /etc/ld.so.cache, flags: CLOEXEC   ) = 3
>  0.053 ( ): __bpf_stdout__:Hello, world
>  0.055 ( 0.011 ms): touch/28147 openat(dfd: CWD, filename: 
> /lib64/libc.so.6, flags: CLOEXEC   ) = 3
>  0.354 ( 0.012 ms): touch/28147 open(filename: 
> /usr/lib/locale/locale-archive, flags: CLOEXEC ) = 3
>  0.411 ( ): __bpf_stdout__:Hello, world
>  0.412 ( 0.198 ms): touch/28147 openat(dfd: CWD, filename: 
> /tmp/hello.BTF, flags: CREAT|NOCTTY|NONBLOCK|WRONLY, mode: IRUGO|IWUGO) = 3
> [root@jouet bpf]# 
> [root@jouet bpf]# file hello.o
> hello.o: ELF 64-bit LSB relocatable, *unknown arch 0xf7* version 1 (SYSV), 
> not stripped
> [root@jouet bpf]# pahole --btf_encode hello.o
> pahole: hello.o: No debugging information found
> [root@jouet bpf]#
> 
> [root@jouet bpf]# readelf -s hello.o
> 
> Symbol table '.symtab' contains 5 entries:
>Num:Value  Size TypeBind   Vis  Ndx Name
>  0:  0 NOTYPE  LOCAL  DEFAULT  UND 
>  1:  0 NOTYPE  GLOBAL DEFAULT7 __bpf_stdout__
>  2:  0 NOTYPE  GLOBAL DEFAULT5 _license
>  3:  0 NOTYPE  GLOBAL DEFAULT6 _version
>  4:  0 NOTYPE  GLOBAL DEFAULT3 
> syscall_enter_openat
> [root@jouet bpf]#
> [root@jouet bpf]# readelf -SW hello.o
> There are 10 section headers, starting at offset 0x1f8:
> 
> Section Headers:
>   [Nr] Name  TypeAddress  OffSize   ES 
> Flg Lk Inf Al
>   [ 0]   NULL 00 00 00
>   0   0  0
>   [ 1] .strtab   STRTAB   000178 7f 00
>   0   0  1
>   [ 2] .text PROGBITS 40 00 00  
> AX  0   0  4
>   [ 3] syscalls:sys_enter_openat PROGBITS 40 
> 88 00  AX  0   0  8
>   [ 4] .relsyscalls:sys_enter_openat REL  000168 
> 10 10  9   3  8
>   [ 5] license   PROGBITS c8 04 00  
> WA  0   0  1
>   [ 6] version   PROGBITS cc 04 00  
> WA  0   0  4
>   [ 7] maps  PROGBITS d0 10 00  
> WA  0   0  4
>   [ 8] .rodata.str1.1PROGBITS e0 0e 01 
> AMS  0   0  1
>   [ 9] .symtab   SYMTAB   f0 78 18
>   1   1  8
> Key to Flags:
>   W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
>   L (link order), O (extra OS processing required), G (group), T (TLS),
>   C (compressed), x (unknown), o (OS specific), E (exclude),
>   p (processor 

Re: [PATCH bpf] bpf: reject passing modified ctx to helper functions

2018-06-07 Thread Alexei Starovoitov
On Thu, Jun 07, 2018 at 05:40:03PM +0200, Daniel Borkmann wrote:
> As commit 28e33f9d78ee ("bpf: disallow arithmetic operations on
> context pointer") already describes, f1174f77b50c ("bpf/verifier:
> rework value tracking") removed the specific white-listed cases
> we had previously where we would allow for pointer arithmetic in
> order to further generalize it, and allow e.g. context access via
> modified registers. While the dereferencing of modified context
> pointers had been forbidden through 28e33f9d78ee, syzkaller did
> recently manage to trigger several KASAN splats for slab out of
> bounds access and use after frees by simply passing a modified
> context pointer to a helper function which would then do the bad
> access since verifier allowed it in adjust_ptr_min_max_vals().
> 
> Rejecting arithmetic on ctx pointer in adjust_ptr_min_max_vals()
> generally could break existing programs as there's a valid use
> case in tracing in combination with passing the ctx to helpers as
> bpf_probe_read(), where the register then becomes unknown at
> verification time due to adding a non-constant offset to it. An
> access sequence may look like the following:
> 
>   offset = args->filename;  /* field __data_loc filename */
>   bpf_probe_read(, len, (char *)args + offset); // args is ctx
> 
> There are two options: i) we could special case the ctx and as
> soon as we add a constant or bounded offset to it (hence ctx type
> wouldn't change) we could turn the ctx into an unknown scalar, or
> ii) we generalize the sanity test for ctx member access into a
> small helper and assert it on the ctx register that was passed
> as a function argument. Fwiw, latter is more obvious and less
> complex at the same time, and one case that may potentially be
> legitimate in future for ctx member access at least would be for
> ctx to carry a const offset. Therefore, fix follows approach
> from ii) and adds test cases to BPF kselftests.
> 
> Fixes: f1174f77b50c ("bpf/verifier: rework value tracking")
> Reported-by: syzbot+3d0b2441dbb717516...@syzkaller.appspotmail.com
> Reported-by: syzbot+c8504affd4fdd0c1b...@syzkaller.appspotmail.com
> Reported-by: syzbot+e5190cb881d8660fb...@syzkaller.appspotmail.com
> Reported-by: syzbot+efae31b384d5badbd...@syzkaller.appspotmail.com
> Signed-off-by: Daniel Borkmann 
> Acked-by: Alexei Starovoitov 

Applied, Thanks



Re: [PATCH bpf-next v5 00/10] BTF: BPF Type Format

2018-06-07 Thread Arnaldo Carvalho de Melo
Em Thu, Jun 07, 2018 at 12:05:10PM -0700, Martin KaFai Lau escreveu:
> On Thu, Jun 07, 2018 at 11:03:37AM -0300, Arnaldo Carvalho de Melo wrote:
> > Em Thu, Jun 07, 2018 at 10:54:01AM -0300, Arnaldo Carvalho de Melo escreveu:
> > > Em Tue, Jun 05, 2018 at 02:25:48PM -0700, Martin KaFai Lau escreveu:
> > > > [ btw, the latest commit (1 commit) should be 94a11b59e592 ].

> > So, the commit log message for the pahole patch is non-existent:

> > https://github.com/iamkafai/pahole/commit/94a11b59e5920908085bfc8d24c92f95c8ffceaf

> > we should do better in describing what is done and how, I'm staring
> > with a message you sent to the kernel part:
> > --
> > This patch introduces BPF Type Format (BTF).

> > BTF (BPF Type Format) is the meta data format which describes
> > the data types of BPF program/map.  Hence, it basically focus
> > on the C programming language which the modern BPF is primary
> > using.  The first use case is to provide a generic pretty print
> > capability for a BPF map.

> I will add details in the next github respin/push.

Ok, but I can do that if there is nothing else to do in the code at this
stage :-)
 
> > Now I'm going to do the step-by-step guide on testing the feature just
> > introduced, and will try to convert from dwarf to BTF and back, compare
> > the pahole output for types encoded in DWARF and BTF, etc.

> > If you have something ressembling this already, please share.

> The pahole only has the encoder part.  I tested with the verbose output
> from the "pahole -V -J".  Loading the btf to the kernel is also tested.

Ok, so here it goes my first stab at testing, using perf's BPF
integration:

[root@jouet bpf]# cat hello.c
#include "stdio.h"

int syscall_enter(openat)(void *ctx)
{
puts("Hello, world\n");
return 0;
}
[root@jouet bpf]# cat ~/.perfconfig
[llvm]
dump-obj = true
[root@jouet bpf]# perf trace -e open*,hello.c touch /tmp/hello.BTF
LLVM: dumping hello.o
 0.017 ( ): __bpf_stdout__:Hello, world
 0.019 ( 0.011 ms): touch/28147 openat(dfd: CWD, filename: 
/etc/ld.so.cache, flags: CLOEXEC   ) = 3
 0.053 ( ): __bpf_stdout__:Hello, world
 0.055 ( 0.011 ms): touch/28147 openat(dfd: CWD, filename: 
/lib64/libc.so.6, flags: CLOEXEC   ) = 3
 0.354 ( 0.012 ms): touch/28147 open(filename: 
/usr/lib/locale/locale-archive, flags: CLOEXEC ) = 3
 0.411 ( ): __bpf_stdout__:Hello, world
 0.412 ( 0.198 ms): touch/28147 openat(dfd: CWD, filename: /tmp/hello.BTF, 
flags: CREAT|NOCTTY|NONBLOCK|WRONLY, mode: IRUGO|IWUGO) = 3
[root@jouet bpf]# 
[root@jouet bpf]# file hello.o
hello.o: ELF 64-bit LSB relocatable, *unknown arch 0xf7* version 1 (SYSV), not 
stripped
[root@jouet bpf]# pahole --btf_encode hello.o
pahole: hello.o: No debugging information found
[root@jouet bpf]#

[root@jouet bpf]# readelf -s hello.o

Symbol table '.symtab' contains 5 entries:
   Num:Value  Size TypeBind   Vis  Ndx Name
 0:  0 NOTYPE  LOCAL  DEFAULT  UND 
 1:  0 NOTYPE  GLOBAL DEFAULT7 __bpf_stdout__
 2:  0 NOTYPE  GLOBAL DEFAULT5 _license
 3:  0 NOTYPE  GLOBAL DEFAULT6 _version
 4:  0 NOTYPE  GLOBAL DEFAULT3 syscall_enter_openat
[root@jouet bpf]#
[root@jouet bpf]# readelf -SW hello.o
There are 10 section headers, starting at offset 0x1f8:

Section Headers:
  [Nr] Name  TypeAddress  OffSize   ES Flg 
Lk Inf Al
  [ 0]   NULL 00 00 00  
0   0  0
  [ 1] .strtab   STRTAB   000178 7f 00  
0   0  1
  [ 2] .text PROGBITS 40 00 00  AX  
0   0  4
  [ 3] syscalls:sys_enter_openat PROGBITS 40 88 
00  AX  0   0  8
  [ 4] .relsyscalls:sys_enter_openat REL  000168 
10 10  9   3  8
  [ 5] license   PROGBITS c8 04 00  WA  
0   0  1
  [ 6] version   PROGBITS cc 04 00  WA  
0   0  4
  [ 7] maps  PROGBITS d0 10 00  WA  
0   0  4
  [ 8] .rodata.str1.1PROGBITS e0 0e 01 AMS  
0   0  1
  [ 9] .symtab   SYMTAB   f0 78 18  
1   1  8
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
  L (link order), O (extra OS processing required), G (group), T (TLS),
  C (compressed), x (unknown), o (OS specific), E (exclude),
  p (processor specific)
[root@jouet bpf]#

Humm, lemme try something, add -g to clang-opt:

[root@jouet bpf]# cat ~/.perfconfig
[llvm]
dump-obj = true
clang-opt = -g
[root@jouet bpf]# perf trace -e open*,hello.c touch /tmp/hello.BTF
LLVM: dumping hello.o
 0.185 ( ): __bpf_stdout__:Hello, world
 0.188 

Re: [PATCH bpf-next v5 00/10] BTF: BPF Type Format

2018-06-07 Thread Martin KaFai Lau
On Thu, Jun 07, 2018 at 11:03:37AM -0300, Arnaldo Carvalho de Melo wrote:
> Em Thu, Jun 07, 2018 at 10:54:01AM -0300, Arnaldo Carvalho de Melo escreveu:
> > Em Tue, Jun 05, 2018 at 02:25:48PM -0700, Martin KaFai Lau escreveu:
> > > [ btw, the latest commit (1 commit) should be 94a11b59e592 ].
> 
> So, the commit log message for the pahole patch is non-existent:
> 
> https://github.com/iamkafai/pahole/commit/94a11b59e5920908085bfc8d24c92f95c8ffceaf
> 
> we should do better in describing what is done and how, I'm staring
> with a message you sent to the kernel part:
> 
> --
> This patch introduces BPF Type Format (BTF).
> 
> BTF (BPF Type Format) is the meta data format which describes
> the data types of BPF program/map.  Hence, it basically focus
> on the C programming language which the modern BPF is primary
> using.  The first use case is to provide a generic pretty print
> capability for a BPF map.
> --
I will add details in the next github respin/push.

> 
> Now I'm going to do the step-by-step guide on testing the feature just
> introduced, and will try to convert from dwarf to BTF and back, compare
> the pahole output for types encoded in DWARF and BTF, etc.
> 
> If you have something ressembling this already, please share.
The pahole only has the encoder part.  I tested with the verbose output
from the "pahole -V -J".  Loading the btf to the kernel is also tested.


Re: [PATCH bpf] bpf: reject passing modified ctx to helper functions

2018-06-07 Thread Edward Cree
On 07/06/18 16:40, Daniel Borkmann wrote:
> As commit 28e33f9d78ee ("bpf: disallow arithmetic operations on
> context pointer") already describes, f1174f77b50c ("bpf/verifier:
> rework value tracking") removed the specific white-listed cases
> we had previously where we would allow for pointer arithmetic in
> order to further generalize it, and allow e.g. context access via
> modified registers. While the dereferencing of modified context
> pointers had been forbidden through 28e33f9d78ee, syzkaller did
> recently manage to trigger several KASAN splats for slab out of
> bounds access and use after frees by simply passing a modified
> context pointer to a helper function which would then do the bad
> access since verifier allowed it in adjust_ptr_min_max_vals().
>
> Rejecting arithmetic on ctx pointer in adjust_ptr_min_max_vals()
> generally could break existing programs as there's a valid use
> case in tracing in combination with passing the ctx to helpers as
> bpf_probe_read(), where the register then becomes unknown at
> verification time due to adding a non-constant offset to it. An
> access sequence may look like the following:
>
>   offset = args->filename;  /* field __data_loc filename */
>   bpf_probe_read(, len, (char *)args + offset); // args is ctx
>
> There are two options: i) we could special case the ctx and as
> soon as we add a constant or bounded offset to it (hence ctx type
> wouldn't change) we could turn the ctx into an unknown scalar, or
> ii) we generalize the sanity test for ctx member access into a
> small helper and assert it on the ctx register that was passed
> as a function argument. Fwiw, latter is more obvious and less
> complex at the same time, and one case that may potentially be
> legitimate in future for ctx member access at least would be for
> ctx to carry a const offset. Therefore, fix follows approach
> from ii) and adds test cases to BPF kselftests.
>
> Fixes: f1174f77b50c ("bpf/verifier: rework value tracking")
> Reported-by: syzbot+3d0b2441dbb717516...@syzkaller.appspotmail.com
> Reported-by: syzbot+c8504affd4fdd0c1b...@syzkaller.appspotmail.com
> Reported-by: syzbot+e5190cb881d8660fb...@syzkaller.appspotmail.com
> Reported-by: syzbot+efae31b384d5badbd...@syzkaller.appspotmail.com
> Signed-off-by: Daniel Borkmann 
> Acked-by: Alexei Starovoitov 
Acked-by: Edward Cree 


Re: [PATCH bpf] bpf: reject passing modified ctx to helper functions

2018-06-07 Thread Y Song
On Thu, Jun 7, 2018 at 8:40 AM, Daniel Borkmann  wrote:
> As commit 28e33f9d78ee ("bpf: disallow arithmetic operations on
> context pointer") already describes, f1174f77b50c ("bpf/verifier:
> rework value tracking") removed the specific white-listed cases
> we had previously where we would allow for pointer arithmetic in
> order to further generalize it, and allow e.g. context access via
> modified registers. While the dereferencing of modified context
> pointers had been forbidden through 28e33f9d78ee, syzkaller did
> recently manage to trigger several KASAN splats for slab out of
> bounds access and use after frees by simply passing a modified
> context pointer to a helper function which would then do the bad
> access since verifier allowed it in adjust_ptr_min_max_vals().
>
> Rejecting arithmetic on ctx pointer in adjust_ptr_min_max_vals()
> generally could break existing programs as there's a valid use
> case in tracing in combination with passing the ctx to helpers as
> bpf_probe_read(), where the register then becomes unknown at
> verification time due to adding a non-constant offset to it. An
> access sequence may look like the following:
>
>   offset = args->filename;  /* field __data_loc filename */
>   bpf_probe_read(, len, (char *)args + offset); // args is ctx
>
> There are two options: i) we could special case the ctx and as
> soon as we add a constant or bounded offset to it (hence ctx type
> wouldn't change) we could turn the ctx into an unknown scalar, or
> ii) we generalize the sanity test for ctx member access into a
> small helper and assert it on the ctx register that was passed
> as a function argument. Fwiw, latter is more obvious and less
> complex at the same time, and one case that may potentially be
> legitimate in future for ctx member access at least would be for
> ctx to carry a const offset. Therefore, fix follows approach
> from ii) and adds test cases to BPF kselftests.
>
> Fixes: f1174f77b50c ("bpf/verifier: rework value tracking")
> Reported-by: syzbot+3d0b2441dbb717516...@syzkaller.appspotmail.com
> Reported-by: syzbot+c8504affd4fdd0c1b...@syzkaller.appspotmail.com
> Reported-by: syzbot+e5190cb881d8660fb...@syzkaller.appspotmail.com
> Reported-by: syzbot+efae31b384d5badbd...@syzkaller.appspotmail.com
> Signed-off-by: Daniel Borkmann 
> Acked-by: Alexei Starovoitov 

Tested with bcc for the above ctx + unknown_offset usage for bpf_probe_read.
No breakage.

Acked-by: Yonghong Song 

> ---
>  kernel/bpf/verifier.c   | 48 +++-
>  tools/testing/selftests/bpf/test_verifier.c | 58 
> -
>  2 files changed, 88 insertions(+), 18 deletions(-)
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index d6403b5..cced0c1 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -1617,6 +1617,30 @@ static int get_callee_stack_depth(struct 
> bpf_verifier_env *env,
>  }
>  #endif
>
> +static int check_ctx_reg(struct bpf_verifier_env *env,
> +const struct bpf_reg_state *reg, int regno)
> +{
> +   /* Access to ctx or passing it to a helper is only allowed in
> +* its original, unmodified form.
> +*/
> +
> +   if (reg->off) {
> +   verbose(env, "dereference of modified ctx ptr R%d off=%d 
> disallowed\n",
> +   regno, reg->off);
> +   return -EACCES;
> +   }
> +
> +   if (!tnum_is_const(reg->var_off) || reg->var_off.value) {
> +   char tn_buf[48];
> +
> +   tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off);
> +   verbose(env, "variable ctx access var_off=%s disallowed\n", 
> tn_buf);
> +   return -EACCES;
> +   }
> +
> +   return 0;
> +}
> +
>  /* truncate register to smaller size (in bytes)
>   * must be called with size < BPF_REG_SIZE
>   */
> @@ -1686,24 +1710,11 @@ static int check_mem_access(struct bpf_verifier_env 
> *env, int insn_idx, u32 regn
> verbose(env, "R%d leaks addr into ctx\n", 
> value_regno);
> return -EACCES;
> }
> -   /* ctx accesses must be at a fixed offset, so that we can
> -* determine what type of data were returned.
> -*/
> -   if (reg->off) {
> -   verbose(env,
> -   "dereference of modified ctx ptr R%d 
> off=%d+%d, ctx+const is allowed, ctx+const+const is not\n",
> -   regno, reg->off, off - reg->off);
> -   return -EACCES;
> -   }
> -   if (!tnum_is_const(reg->var_off) || reg->var_off.value) {
> -   char tn_buf[48];
>
> -   tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off);
> -   verbose(env,
> -   "variable ctx access var_off=%s off=%d 
> size=%d",
> -   tn_buf, off, 

[PATCH net-next] bpfilter: fix OUTPUT_FORMAT

2018-06-07 Thread Alexei Starovoitov
From: Alexei Starovoitov 

CONFIG_OUTPUT_FORMAT is x86 only macro.
Used objdump to extract elf file format.

Fixes: d2ba09c17a06 ("net: add skeleton of bpfilter kernel module")
Reported-by: David S. Miller 
Signed-off-by: Alexei Starovoitov 
---
 net/bpfilter/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/bpfilter/Makefile b/net/bpfilter/Makefile
index aafa72001fcd..e0bbe7583e58 100644
--- a/net/bpfilter/Makefile
+++ b/net/bpfilter/Makefile
@@ -21,7 +21,7 @@ endif
 # which bpfilter_kern.c passes further into umh blob loader at run-time
 quiet_cmd_copy_umh = GEN $@
   cmd_copy_umh = echo ':' > $(obj)/.bpfilter_umh.o.cmd; \
-  $(OBJCOPY) -I binary -O $(CONFIG_OUTPUT_FORMAT) \
+  $(OBJCOPY) -I binary -O `$(OBJDUMP) -f $<|grep format|cut -d' ' -f8` \
   -B `$(OBJDUMP) -f $<|grep architecture|cut -d, -f1|cut -d' ' -f2` \
   --rename-section .data=.init.rodata $< $@
 
-- 
2.9.5



KMSAN: uninit-value in ebt_stp_mt_check (2)

2018-06-07 Thread syzbot

Hello,

syzbot found the following crash on:

HEAD commit:c6a6aed994b6 kmsan: remove dead code to trigger syzbot build
git tree:   https://github.com/google/kmsan.git/master
console output: https://syzkaller.appspot.com/x/log.txt?x=17bde74f80
kernel config:  https://syzkaller.appspot.com/x/.config?x=848e40757852af3e
dashboard link: https://syzkaller.appspot.com/bug?extid=da4494182233c23a5fcf
compiler:   clang version 7.0.0 (trunk 334104)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+da4494182233c23a5...@syzkaller.appspotmail.com

==
BUG: KMSAN: uninit-value in ebt_stp_mt_check+0x24b/0x450  
net/bridge/netfilter/ebt_stp.c:162

CPU: 0 PID: 12006 Comm: syz-executor7 Not tainted 4.17.0+ #3
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011

Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x185/0x1d0 lib/dump_stack.c:113
 kmsan_report+0x149/0x260 mm/kmsan/kmsan.c:1084
 __msan_warning_32+0x6e/0xc0 mm/kmsan/kmsan_instr.c:620
 ebt_stp_mt_check+0x24b/0x450 net/bridge/netfilter/ebt_stp.c:162
 xt_check_match+0x1438/0x1650 net/netfilter/x_tables.c:506
 ebt_check_match net/bridge/netfilter/ebtables.c:372 [inline]
 ebt_check_entry net/bridge/netfilter/ebtables.c:702 [inline]
 translate_table+0x4e88/0x6120 net/bridge/netfilter/ebtables.c:943
 do_replace_finish+0x1258/0x2ea0 net/bridge/netfilter/ebtables.c:999
 do_replace+0x719/0x780 net/bridge/netfilter/ebtables.c:1138
 do_ebt_set_ctl+0x2ab/0x3c0 net/bridge/netfilter/ebtables.c:1517
 nf_sockopt net/netfilter/nf_sockopt.c:106 [inline]
 nf_setsockopt+0x47c/0x4e0 net/netfilter/nf_sockopt.c:115
 ip_setsockopt+0x24b/0x2b0 net/ipv4/ip_sockglue.c:1251
 udp_setsockopt+0x108/0x1b0 net/ipv4/udp.c:2416
 sock_common_setsockopt+0x13b/0x170 net/core/sock.c:3039
 __sys_setsockopt+0x496/0x540 net/socket.c:1903
 __do_sys_setsockopt net/socket.c:1914 [inline]
 __se_sys_setsockopt net/socket.c:1911 [inline]
 __x64_sys_setsockopt+0x15c/0x1c0 net/socket.c:1911
 do_syscall_64+0x15b/0x230 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x4559f9
RSP: 002b:7f45b9246c68 EFLAGS: 0246 ORIG_RAX: 0036
RAX: ffda RBX: 7f45b92476d4 RCX: 004559f9
RDX: 0080 RSI:  RDI: 0014
RBP: 0072bea0 R08: 0300 R09: 
R10: 2480 R11: 0246 R12: 
R13: 004c0d6d R14: 004d07c8 R15: 

Local variable description: mtpar.i@translate_table
Variable was created at:
 translate_table+0xbb/0x6120 net/bridge/netfilter/ebtables.c:831
 do_replace_finish+0x1258/0x2ea0 net/bridge/netfilter/ebtables.c:999
==


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkal...@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with  
syzbot.


Re: [PATCH net] failover: eliminate callback hell

2018-06-07 Thread Michael S. Tsirkin
On Thu, Jun 07, 2018 at 09:17:42AM -0700, Stephen Hemminger wrote:
> On Thu, 7 Jun 2018 18:41:31 +0300
> "Michael S. Tsirkin"  wrote:
> 
> > > > Why would DPDK care what we do in the kernel? Isn't it just slapping
> > > > vfio-pci on the netdevs it sees?  
> > > 
> > > Alex, you are correct for Intel devices; but DPDK on Azure is not Intel 
> > > based.,.
> > > The DPDK support uses:
> > >  * Mellanox MLX5 which uses the Infinband hooks to do DMA directly to
> > >userspace. This means VF netdev device must exist and be visible.
> > >  * Slow path using kernel netvsc device, TAP and BPF to get exception
> > >path packets to userspace.
> > >  * A autodiscovery mechanism that to set all this up that relies on
> > >2 device model and sysfs.  
> > 
> > Could you describe what does it look for exactly? What will break if
> > instead of MLX5 being a child of the PV, it's a child of the failover
> > device?
> 
> So in DPDK there is an internal four device model:
>   1. failsafe is like failover in your model
>   2. TAP is used like netvsc in kernel
>   3. MLX5 is the VF
>   4. vdev_netvsc is a pseudo device whose only reason to exist
>  is to glue everything together.
> 
> Digging deeper inside...
> 
> Vdev_netvsc does:
>* driver is started in a convuluted way off device arguments
>* probe routine for driver runs
>   - scans list of kernel interfaces in sysfs
>   - matches those using VMBUS 

Could you tell a bit more what does this step entail?

>   - skip netvsc devices that have an IPV4 route
>* scan for PCI devices that have same MAC address as kernel netvsc
>  devices discovered in previous step
>* add these interfaces to arguments to failsafe
> 
> Then failsafe configures based on arguments on device
> 
> The code works but is specific to the Azure hardware model, and exposes lots
> of things to application that it should not have to care about.
> 
> If you  try and walk through this code in DPDK, you will see why I have 
> developed
> a dislike for high levels of indirection.
> 
> 
>  

Thanks that was helpful!  I'll try to poke at it next week.  Just from
the description it seems the kernel is merely used to locate the MAC
address through sysfs and that for this DPDK code to keep working the
hidden device must be hidden from it in sysfs - is that a fair summary?

-- 
MST


Re: [RFC PATCH 1/2] net: macb: Add CAP to disable hardware TX checksum offloading

2018-06-07 Thread Jennifer Dahm

Hi Nicolas,

On 06/04/2018 10:13 AM, Nicolas Ferre wrote:

On 25/05/2018 at 23:44, Jennifer Dahm wrote:
diff --git a/drivers/net/ethernet/cadence/macb_main.c 
b/drivers/net/ethernet/cadence/macb_main.c

index 3e93df5..a5d564b 100644
--- a/drivers/net/ethernet/cadence/macb_main.c
+++ b/drivers/net/ethernet/cadence/macb_main.c
@@ -3360,8 +3360,12 @@ static int macb_init(struct platform_device 
*pdev)

  dev->hw_features |= MACB_NETIF_LSO;
    /* Checksum offload is only available on gem with packet 
buffer */

-    if (macb_is_gem(bp) && !(bp->caps & MACB_CAPS_FIFO_MODE))
-    dev->hw_features |= NETIF_F_HW_CSUM | NETIF_F_RXCSUM;
+    if (macb_is_gem(bp) && !(bp->caps & MACB_CAPS_FIFO_MODE)) {
+    if (!(bp->caps & MACB_CAPS_DISABLE_TX_HW_CSUM))


Why not the other way around? negating a "disabled" feature is always 
challenge ;-)



+    dev->hw_features |= NETIF_F_HW_CSUM | NETIF_F_RXCSUM;
+    else
+    dev->hw_features |= NETIF_F_RXCSUM;
+    }
  if (bp->caps & MACB_CAPS_SG_DISABLED)
  dev->hw_features &= ~NETIF_F_SG;
  dev->features = dev->hw_features;


I can switch the ordering of the if-else clauses if that's what you're 
nitpicking. ;)


Alternatively, if you're asking why the flag is used to disable rather 
than enable checksum offloading: I was working under the assumption that 
this was an isolated bug, and so an opt-out would require less 
maintainance than an opt-in. If we discover that this is a problem 
across a wide variety of Cadence IP, it would definitely make sense to 
replace it with an opt-in (i.e. MACB_CAPS_TX_HW_CSUM_ENABLED).


Best,
Jennifer Dahm


Re: [PATCH] net-fq: Add WARN_ON check for null flow.

2018-06-07 Thread Ben Greear

On 06/07/2018 09:17 AM, Eric Dumazet wrote:



On 06/07/2018 09:06 AM, gree...@candelatech.com wrote:

From: Ben Greear 

While testing an ath10k firmware that often crashed under load,
I was seeing kernel crashes as well.  One of them appeared to
be a dereference of a NULL flow object in fq_tin_dequeue.

I have since fixed the firmware flaw, but I think it would be
worth adding the WARN_ON in case the problem appears again.

 common_interrupt+0xf/0xf
 



Please find the exact commit that brought this bug,
and add a corresponding Fixes: tag


It will be a total pain to bisect this problem since my test
case that causes this is running my modified firmware (and a buggy one at that),
modified ath10k driver (to work with this firmware and support my test case 
easily),
and the failure case appears to cause multiple different-but-probably-related
crashes and often hangs or reboots the test system.

Probably this is all caused by some nasty race or buggy logic related to
dealing with a crashed ath10k firmware tearing down txq logic from the
bottom up.  There have been many such bugs in the past, I and others fixed a 
few,
and very likely more remain.

For what it is worth, I didn't see this crash in 4.13, and I spent some time
testing buggy firmware there occasionally.

If someone else has interest in debugging the ath10k driver, I will be happy to 
generate
a mostly-stock firmware image with ability to crash in the TX path and give it 
to them.
It will crash the stock upstream code reliably in my experience.

Thanks,
Ben





Signed-off-by: Ben Greear 
---
 include/net/fq_impl.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/net/fq_impl.h b/include/net/fq_impl.h
index be7c0fa..e40354d 100644
--- a/include/net/fq_impl.h
+++ b/include/net/fq_impl.h
@@ -80,6 +80,9 @@ static struct sk_buff *fq_tin_dequeue(struct fq *fq,

flow = list_first_entry(head, struct fq_flow, flowchain);

+   if (WARN_ON_ONCE(!flow))
+   return NULL;
+
if (flow->deficit <= 0) {
flow->deficit += fq->quantum;
list_move_tail(>flowchain,






--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH net] net: ipv6: ip6_output: alloc skb with tailroom

2018-06-07 Thread Willem de Bruijn
On Thu, Jun 7, 2018 at 12:22 PM, Alexander Aring  wrote:
> Hi,
>
> On Wed, Jun 06, 2018 at 04:26:19PM -0400, Willem de Bruijn wrote:
>> On Wed, Jun 6, 2018 at 2:11 PM, David Miller  wrote:
>> > From: Alexander Aring 
>> > Date: Wed, 6 Jun 2018 14:09:20 -0400
>> >
>> >> okay, then you want to have this patch for net-next? As an optimization?
>> >>
>> >> Of course, when it's open again.
>> >
>> > Like you, I have questions about where this adjustment is applied and
>> > why.  So I'm not sure yet.
>> >
>> > For example, only IPV6 really takes it into consideration and as you
>> > saw only really for the fragmentation path and not the normal output
>> > path.
>> >
>> > This needs more consideration and investigation.
>>
>> This is the unconditional skb_put in ieee802154_tx. In many cases
>> there is some tailroom due to SKB_DATA_ALIGN in __alloc_skb,
>> so it may take a specific case to not have even 2 bytes of tailroom
>> available.
>
> Yes it's in ieee802154_tx, but we need tailroom not just for checksum.
> The bugreport is related to the two bytes of tailroom, because virtual
> hardware doing checksum by software. The most real transceivers offload
> this feature, so zero tailroom is needed.
>
> I will of course add checks before adding L2 header for headroom and
> tailroom in related subsystem code.
>
> In IEEE 802.15.4 and secured enabled frames we need a MIC field at the
> end of the frame. In worst case this can be 16 bytes.
>
> I looked ethernet macsec feature and it seems they need to have a similar
> reseved tailroom which is 16 bytes by default (max 32 bytes).

Allocating with necessary tailroom to avoid a realloc in the path
sounds fine to me, too. Packet sockets also take needed_tailroom
into account, to give another example.

We just have to avoid papering over bugs. Fixing the locations that
unconditionally move the tail pointer beyond the allocated
region is more important.


Re: Fw: [Bug 199963] New: UDP rx_queue incorrect calculation in /proc/net/udp

2018-06-07 Thread Paolo Abeni
On Thu, 2018-06-07 at 07:39 -0700, Stephen Hemminger wrote:
> 
> Begin forwarded message:
> 
> Date: Thu, 07 Jun 2018 13:21:23 +
> From: bugzilla-dae...@bugzilla.kernel.org
> To: step...@networkplumber.org
> Subject: [Bug 199963] New: UDP rx_queue incorrect calculation in /proc/net/udp
> 
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=199963
> 
> Bug ID: 199963
>Summary: UDP rx_queue incorrect calculation in /proc/net/udp
>Product: Networking
>Version: 2.5
> Kernel Version: Kernels >= 4.15
>   Hardware: All
> OS: Linux
>   Tree: Mainline
> Status: NEW
>   Severity: normal
>   Priority: P1
>  Component: IPV4
>   Assignee: step...@networkplumber.org
>   Reporter: trevor.fran...@46labs.com
> Regression: No
> 
> since upgrading to any kernel >= 4.15 the rx_queue in /proc/net/udp is now
> reporting a queue, regardless of system load and regardless of what
> applications are running on it. The tx_queue is always 0, but rx_queue has
> seemingly random spikes of udp queueing. This is observed across hundreds of
> servers with either varying or no workload.
> 
> netstat -nl|grep ^udp
> udp 4352 0 0.0.0.0:68 0.0.0.0:*
> 
> sl local_address rem_address st tx_queue rx_queue tr tm->when retrnsmt uid
> timeout inode ref pointer drops
> 14645: 357F:0035 : 07 :C900 00: 
> 101 0 3367 2 8da177fdcc00 0

after:

commit 0d4a6608f68c7532dcbfec2ea1150c9761767d03
Author: Paolo Abeni 
Date:   Tue Sep 19 12:11:43 2017 +0200

udp: do rmem bulk free even if the rx sk queue is empty

rmem can be a non-0, small, value even if the rx queue is empty,
because it reports the (forward) allocated memory for this socket, not
the queue length in bytes. 

I'll cook a patch to fix the above.

Cheers,

Paolo


Re: [PATCH net] net: ipv6: ip6_output: alloc skb with tailroom

2018-06-07 Thread Alexander Aring
Hi,

On Wed, Jun 06, 2018 at 04:26:19PM -0400, Willem de Bruijn wrote:
> On Wed, Jun 6, 2018 at 2:11 PM, David Miller  wrote:
> > From: Alexander Aring 
> > Date: Wed, 6 Jun 2018 14:09:20 -0400
> >
> >> okay, then you want to have this patch for net-next? As an optimization?
> >>
> >> Of course, when it's open again.
> >
> > Like you, I have questions about where this adjustment is applied and
> > why.  So I'm not sure yet.
> >
> > For example, only IPV6 really takes it into consideration and as you
> > saw only really for the fragmentation path and not the normal output
> > path.
> >
> > This needs more consideration and investigation.
> 
> This is the unconditional skb_put in ieee802154_tx. In many cases
> there is some tailroom due to SKB_DATA_ALIGN in __alloc_skb,
> so it may take a specific case to not have even 2 bytes of tailroom
> available.

Yes it's in ieee802154_tx, but we need tailroom not just for checksum.
The bugreport is related to the two bytes of tailroom, because virtual
hardware doing checksum by software. The most real transceivers offload
this feature, so zero tailroom is needed.

I will of course add checks before adding L2 header for headroom and
tailroom in related subsystem code.

In IEEE 802.15.4 and secured enabled frames we need a MIC field at the
end of the frame. In worst case this can be 16 bytes.

I looked ethernet macsec feature and it seems they need to have a similar
reseved tailroom which is 16 bytes by default (max 32 bytes).

Maybe it's worth to take care for the tailroom in this path since it's
not just 2 bytes in some cases.

---

Meanwhile I think I found a bug in macsec, I cc Sabrina here:

diff --git a/drivers/net/macsec.c b/drivers/net/macsec.c
index 7de88b33d5b9..687323c0caf5 100644
--- a/drivers/net/macsec.c
+++ b/drivers/net/macsec.c
@@ -522,7 +522,7 @@ static bool macsec_validate_skb(struct sk_buff *skb, u16 
icv_len)
 }
 
 #define MACSEC_NEEDED_HEADROOM (macsec_extra_len(true))
-#define MACSEC_NEEDED_TAILROOM MACSEC_STD_ICV_LEN
+#define MACSEC_NEEDED_TAILROOM MACSEC_MAX_ICV_LEN
 
 static void macsec_fill_iv(unsigned char *iv, sci_t sci, u32 pn)
 {

---

MACSEC_NEEDED_TAILROOM is the define to check and run skb_copy_expand()
and should use the ?worst case? or the the value (icv_len + ?extra_foo?)
is set as runtime generation on newlink.

I see that in macsec_newlink() following code:

if (data && data[IFLA_MACSEC_ICV_LEN])
icv_len = nla_get_u8(data[IFLA_MACSEC_ICV_LEN]);

so the user can change it to (even a value above 32?, there is no check
for that). Anyway everything higher than MACSEC_STD_ICV_LEN could run
into a skb_over_panic().

- Alex


Re: [PATCH net] failover: eliminate callback hell

2018-06-07 Thread Stephen Hemminger
On Thu, 7 Jun 2018 18:41:31 +0300
"Michael S. Tsirkin"  wrote:

> > > Why would DPDK care what we do in the kernel? Isn't it just slapping
> > > vfio-pci on the netdevs it sees?  
> > 
> > Alex, you are correct for Intel devices; but DPDK on Azure is not Intel 
> > based.,.
> > The DPDK support uses:
> >  * Mellanox MLX5 which uses the Infinband hooks to do DMA directly to
> >userspace. This means VF netdev device must exist and be visible.
> >  * Slow path using kernel netvsc device, TAP and BPF to get exception
> >path packets to userspace.
> >  * A autodiscovery mechanism that to set all this up that relies on
> >2 device model and sysfs.  
> 
> Could you describe what does it look for exactly? What will break if
> instead of MLX5 being a child of the PV, it's a child of the failover
> device?

So in DPDK there is an internal four device model:
1. failsafe is like failover in your model
2. TAP is used like netvsc in kernel
3. MLX5 is the VF
4. vdev_netvsc is a pseudo device whose only reason to exist
   is to glue everything together.

Digging deeper inside...

Vdev_netvsc does:
   * driver is started in a convuluted way off device arguments
   * probe routine for driver runs
  - scans list of kernel interfaces in sysfs
  - matches those using VMBUS 
  - skip netvsc devices that have an IPV4 route
   * scan for PCI devices that have same MAC address as kernel netvsc
 devices discovered in previous step
   * add these interfaces to arguments to failsafe

Then failsafe configures based on arguments on device

The code works but is specific to the Azure hardware model, and exposes lots
of things to application that it should not have to care about.

If you  try and walk through this code in DPDK, you will see why I have 
developed
a dislike for high levels of indirection.


   


Re: [PATCH] net-fq: Add WARN_ON check for null flow.

2018-06-07 Thread Eric Dumazet



On 06/07/2018 09:06 AM, gree...@candelatech.com wrote:
> From: Ben Greear 
> 
> While testing an ath10k firmware that often crashed under load,
> I was seeing kernel crashes as well.  One of them appeared to
> be a dereference of a NULL flow object in fq_tin_dequeue.
> 
> I have since fixed the firmware flaw, but I think it would be
> worth adding the WARN_ON in case the problem appears again.
> 
>  common_interrupt+0xf/0xf
>  
> 

Please find the exact commit that brought this bug,
and add a corresponding Fixes: tag

> Signed-off-by: Ben Greear 
> ---
>  include/net/fq_impl.h | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/include/net/fq_impl.h b/include/net/fq_impl.h
> index be7c0fa..e40354d 100644
> --- a/include/net/fq_impl.h
> +++ b/include/net/fq_impl.h
> @@ -80,6 +80,9 @@ static struct sk_buff *fq_tin_dequeue(struct fq *fq,
>  
>   flow = list_first_entry(head, struct fq_flow, flowchain);
>  
> + if (WARN_ON_ONCE(!flow))
> + return NULL;
> +
>   if (flow->deficit <= 0) {
>   flow->deficit += fq->quantum;
>   list_move_tail(>flowchain,
> 


[PATCH] net-fq: Add WARN_ON check for null flow.

2018-06-07 Thread greearb
From: Ben Greear 

While testing an ath10k firmware that often crashed under load,
I was seeing kernel crashes as well.  One of them appeared to
be a dereference of a NULL flow object in fq_tin_dequeue.

I have since fixed the firmware flaw, but I think it would be
worth adding the WARN_ON in case the problem appears again.

BUG: unable to handle kernel NULL pointer dereference at 003c
IP: ieee80211_tx_dequeue+0xfb/0xb10 [mac80211]
PGD 8001417fe067 P4D 8001417fe067 PUD 13db41067 PMD 0
Oops:  [#1] PREEMPT SMP PTI
Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 
libcrc32c vrf 8021q garp mrp stp llc fuse macvlan wanlink(O) pktgen lm78 ]
CPU: 2 PID: 21733 Comm: ip Tainted: GW  O 4.16.8+ #35
Hardware name: _ _/, BIOS 5.11 08/26/2016
RIP: 0010:ieee80211_tx_dequeue+0xfb/0xb10 [mac80211]
RSP: 0018:880172d03c30 EFLAGS: 00010286
RAX: 88013b2c RBX: 88013b2c00b8 RCX: 0898
RDX: 0001 RSI: 88013b2c00d8 RDI: 88016ac40820
RBP: 88016ac42ba0 R08: 0020 R09: 
R10: 0010 R11: 001256c89fd8 R12: 88013b2c
R13: 88013b2c00d8 R14:  R15: 88013b2c00d8
FS:  7f04e3606700() GS:880172d0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 003c CR3: 00013b35a005 CR4: 003606e0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
 
 ? update_load_avg+0x607/0x6f0
 ath10k_mac_tx_push_txq+0x6e/0x220 [ath10k_core]
 ath10k_mac_tx_push_pending+0x151/0x1e0 [ath10k_core]
 ath10k_htt_txrx_compl_task+0x113e/0x1940 [ath10k_core]
 ? ath10k_ce_completed_send_next_nolock+0x6f/0x90 [ath10k_pci]
 ? ath10k_ce_completed_send_next+0x31/0x40 [ath10k_pci]
 ? ath10k_pci_htc_tx_cb+0x30/0xc0 [ath10k_pci]
 ? ath10k_bus_pci_write32+0x3c/0xa0 [ath10k_pci]
 ath10k_pci_napi_poll+0x44/0xf0 [ath10k_pci]
 net_rx_action+0x250/0x3b0
 __do_softirq+0xc2/0x2c2
 irq_exit+0x93/0xa0
 do_IRQ+0x45/0xc0
 common_interrupt+0xf/0xf
 

Signed-off-by: Ben Greear 
---
 include/net/fq_impl.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/net/fq_impl.h b/include/net/fq_impl.h
index be7c0fa..e40354d 100644
--- a/include/net/fq_impl.h
+++ b/include/net/fq_impl.h
@@ -80,6 +80,9 @@ static struct sk_buff *fq_tin_dequeue(struct fq *fq,
 
flow = list_first_entry(head, struct fq_flow, flowchain);
 
+   if (WARN_ON_ONCE(!flow))
+   return NULL;
+
if (flow->deficit <= 0) {
flow->deficit += fq->quantum;
list_move_tail(>flowchain,
-- 
2.4.11



Fw: [Bug 199963] New: UDP rx_queue incorrect calculation in /proc/net/udp

2018-06-07 Thread Stephen Hemminger



Begin forwarded message:

Date: Thu, 07 Jun 2018 13:21:23 +
From: bugzilla-dae...@bugzilla.kernel.org
To: step...@networkplumber.org
Subject: [Bug 199963] New: UDP rx_queue incorrect calculation in /proc/net/udp


https://bugzilla.kernel.org/show_bug.cgi?id=199963

Bug ID: 199963
   Summary: UDP rx_queue incorrect calculation in /proc/net/udp
   Product: Networking
   Version: 2.5
Kernel Version: Kernels >= 4.15
  Hardware: All
OS: Linux
  Tree: Mainline
Status: NEW
  Severity: normal
  Priority: P1
 Component: IPV4
  Assignee: step...@networkplumber.org
  Reporter: trevor.fran...@46labs.com
Regression: No

since upgrading to any kernel >= 4.15 the rx_queue in /proc/net/udp is now
reporting a queue, regardless of system load and regardless of what
applications are running on it. The tx_queue is always 0, but rx_queue has
seemingly random spikes of udp queueing. This is observed across hundreds of
servers with either varying or no workload.

netstat -nl|grep ^udp
udp 4352 0 0.0.0.0:68 0.0.0.0:*

sl local_address rem_address st tx_queue rx_queue tr tm->when retrnsmt uid
timeout inode ref pointer drops
14645: 357F:0035 : 07 :C900 00: 
101 0 3367 2 8da177fdcc00 0

-- 
You are receiving this mail because:
You are the assignee for the bug.


Re: [PATCH net] failover: eliminate callback hell

2018-06-07 Thread Michael S. Tsirkin
On Thu, Jun 07, 2018 at 07:51:12AM -0700, Stephen Hemminger wrote:
> On Thu, 7 Jun 2018 07:17:51 -0700
> Alexander Duyck  wrote:
> 
> > On Wed, Jun 6, 2018 at 3:25 PM, Stephen Hemminger
> >  wrote:
> > > On Wed, 6 Jun 2018 14:54:04 -0700
> > > "Samudrala, Sridhar"  wrote:
> > >  
> > >> On 6/6/2018 2:24 PM, Stephen Hemminger wrote:  
> > >> > On Wed, 6 Jun 2018 15:30:27 +0300
> > >> > "Michael S. Tsirkin"  wrote:
> > >> >  
> > >> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:  
> > >> >>> Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org 
> > >> >>> wrote:  
> > >>  The net failover should be a simple library, not a virtual
> > >>  object with function callbacks (see callback hell).  
> > >> >>> Why just a library? It should do a common things. I think it should 
> > >> >>> be a
> > >> >>> virtual object. Looks like your patch again splits the common
> > >> >>> functionality into multiple drivers. That is kind of backwards 
> > >> >>> attitude.
> > >> >>> I don't get it. We should rather focus on fixing the mess the
> > >> >>> introduction of netvsc-bonding caused and switch netvsc to 3-netdev
> > >> >>> model.  
> > >> >> So it seems that at least one benefit for netvsc would be better
> > >> >> handling of renames.
> > >> >>
> > >> >> Question is how can this change to 3-netdev happen?  Stephen is
> > >> >> concerned about risk of breaking some userspace.
> > >> >>
> > >> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
> > >> >> address, and you said then "why not use existing network namespaces
> > >> >> rather than inventing a new abstraction". So how about it then? Do you
> > >> >> want to find a way to use namespaces to hide the PV device for netvsc
> > >> >> compatibility?
> > >> >>  
> > >> > Netvsc can't work with 3 dev model. MS has worked with enough distro's 
> > >> > and
> > >> > startups that all demand eth0 always be present. And VF may come and 
> > >> > go.
> > >> > After this history, there is a strong motivation not to change how 
> > >> > kernel
> > >> > behaves. Switching to 3 device model would be perceived as breaking
> > >> > existing userspace.  
> > >>
> > >> I think it should be possible for netvsc to work with 3 dev model if the 
> > >> only
> > >> requirement is that eth0 will always be present. With net_failover, you 
> > >> will
> > >> see eth0 and eth0nsby OR with older distros eth0 and eth1.  It may be an 
> > >> issue
> > >> if somehow there is userspace requirement that there can be only 2 
> > >> netdevs, not 3
> > >> when VF is plugged.
> > >>
> > >> eth0 will be the net_failover device and eth0nsby/eth1 will be the 
> > >> netvsc device
> > >> and the IP address gets configured on eth0. Will this be an issue?  
> > >
> > > DPDK drivers in 18.05 depend on 2 device model. Yes it is a bit of mess
> > > but that is the way it is.  
> > 
> > Why would DPDK care what we do in the kernel? Isn't it just slapping
> > vfio-pci on the netdevs it sees?
> 
> Alex, you are correct for Intel devices; but DPDK on Azure is not Intel 
> based.,.
> The DPDK support uses:
>  * Mellanox MLX5 which uses the Infinband hooks to do DMA directly to
>userspace. This means VF netdev device must exist and be visible.
>  * Slow path using kernel netvsc device, TAP and BPF to get exception
>path packets to userspace.
>  * A autodiscovery mechanism that to set all this up that relies on
>2 device model and sysfs.

Could you describe what does it look for exactly? What will break if
instead of MLX5 being a child of the PV, it's a child of the failover
device?

> In this version, there is no VFIO-PCI. And also Hyper-V does not have virtual
> IOMMU so VFIO will not work there at all.
>


[PATCH bpf] bpf: reject passing modified ctx to helper functions

2018-06-07 Thread Daniel Borkmann
As commit 28e33f9d78ee ("bpf: disallow arithmetic operations on
context pointer") already describes, f1174f77b50c ("bpf/verifier:
rework value tracking") removed the specific white-listed cases
we had previously where we would allow for pointer arithmetic in
order to further generalize it, and allow e.g. context access via
modified registers. While the dereferencing of modified context
pointers had been forbidden through 28e33f9d78ee, syzkaller did
recently manage to trigger several KASAN splats for slab out of
bounds access and use after frees by simply passing a modified
context pointer to a helper function which would then do the bad
access since verifier allowed it in adjust_ptr_min_max_vals().

Rejecting arithmetic on ctx pointer in adjust_ptr_min_max_vals()
generally could break existing programs as there's a valid use
case in tracing in combination with passing the ctx to helpers as
bpf_probe_read(), where the register then becomes unknown at
verification time due to adding a non-constant offset to it. An
access sequence may look like the following:

  offset = args->filename;  /* field __data_loc filename */
  bpf_probe_read(, len, (char *)args + offset); // args is ctx

There are two options: i) we could special case the ctx and as
soon as we add a constant or bounded offset to it (hence ctx type
wouldn't change) we could turn the ctx into an unknown scalar, or
ii) we generalize the sanity test for ctx member access into a
small helper and assert it on the ctx register that was passed
as a function argument. Fwiw, latter is more obvious and less
complex at the same time, and one case that may potentially be
legitimate in future for ctx member access at least would be for
ctx to carry a const offset. Therefore, fix follows approach
from ii) and adds test cases to BPF kselftests.

Fixes: f1174f77b50c ("bpf/verifier: rework value tracking")
Reported-by: syzbot+3d0b2441dbb717516...@syzkaller.appspotmail.com
Reported-by: syzbot+c8504affd4fdd0c1b...@syzkaller.appspotmail.com
Reported-by: syzbot+e5190cb881d8660fb...@syzkaller.appspotmail.com
Reported-by: syzbot+efae31b384d5badbd...@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 kernel/bpf/verifier.c   | 48 +++-
 tools/testing/selftests/bpf/test_verifier.c | 58 -
 2 files changed, 88 insertions(+), 18 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index d6403b5..cced0c1 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1617,6 +1617,30 @@ static int get_callee_stack_depth(struct 
bpf_verifier_env *env,
 }
 #endif
 
+static int check_ctx_reg(struct bpf_verifier_env *env,
+const struct bpf_reg_state *reg, int regno)
+{
+   /* Access to ctx or passing it to a helper is only allowed in
+* its original, unmodified form.
+*/
+
+   if (reg->off) {
+   verbose(env, "dereference of modified ctx ptr R%d off=%d 
disallowed\n",
+   regno, reg->off);
+   return -EACCES;
+   }
+
+   if (!tnum_is_const(reg->var_off) || reg->var_off.value) {
+   char tn_buf[48];
+
+   tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off);
+   verbose(env, "variable ctx access var_off=%s disallowed\n", 
tn_buf);
+   return -EACCES;
+   }
+
+   return 0;
+}
+
 /* truncate register to smaller size (in bytes)
  * must be called with size < BPF_REG_SIZE
  */
@@ -1686,24 +1710,11 @@ static int check_mem_access(struct bpf_verifier_env 
*env, int insn_idx, u32 regn
verbose(env, "R%d leaks addr into ctx\n", value_regno);
return -EACCES;
}
-   /* ctx accesses must be at a fixed offset, so that we can
-* determine what type of data were returned.
-*/
-   if (reg->off) {
-   verbose(env,
-   "dereference of modified ctx ptr R%d off=%d+%d, 
ctx+const is allowed, ctx+const+const is not\n",
-   regno, reg->off, off - reg->off);
-   return -EACCES;
-   }
-   if (!tnum_is_const(reg->var_off) || reg->var_off.value) {
-   char tn_buf[48];
 
-   tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off);
-   verbose(env,
-   "variable ctx access var_off=%s off=%d size=%d",
-   tn_buf, off, size);
-   return -EACCES;
-   }
+   err = check_ctx_reg(env, reg, regno);
+   if (err < 0)
+   return err;
+
err = check_ctx_access(env, insn_idx, off, size, t, _type);
if (!err && t == BPF_READ && value_regno >= 0) {
/* ctx access returns either a 

Re: [PATCH net] failover: eliminate callback hell

2018-06-07 Thread Stephen Hemminger
On Thu, 7 Jun 2018 17:57:50 +0300
"Michael S. Tsirkin"  wrote:

> On Wed, Jun 06, 2018 at 03:24:08PM -0700, Stephen Hemminger wrote:
> > On Thu, 7 Jun 2018 00:47:52 +0300
> > "Michael S. Tsirkin"  wrote:
> >   
> > > On Wed, Jun 06, 2018 at 02:24:47PM -0700, Stephen Hemminger wrote:  
> > > > On Wed, 6 Jun 2018 15:30:27 +0300
> > > > "Michael S. Tsirkin"  wrote:
> > > > 
> > > > > On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:
> > > > > > Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org 
> > > > > > wrote:  
> > > > > > >The net failover should be a simple library, not a virtual
> > > > > > >object with function callbacks (see callback hell).  
> > > > > > 
> > > > > > Why just a library? It should do a common things. I think it should 
> > > > > > be a
> > > > > > virtual object. Looks like your patch again splits the common
> > > > > > functionality into multiple drivers. That is kind of backwards 
> > > > > > attitude.
> > > > > > I don't get it. We should rather focus on fixing the mess the
> > > > > > introduction of netvsc-bonding caused and switch netvsc to 3-netdev
> > > > > > model.  
> > > > > 
> > > > > So it seems that at least one benefit for netvsc would be better
> > > > > handling of renames.
> > > > > 
> > > > > Question is how can this change to 3-netdev happen?  Stephen is
> > > > > concerned about risk of breaking some userspace.
> > > > > 
> > > > > Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
> > > > > address, and you said then "why not use existing network namespaces
> > > > > rather than inventing a new abstraction". So how about it then? Do you
> > > > > want to find a way to use namespaces to hide the PV device for netvsc
> > > > > compatibility?
> > > > > 
> > > > 
> > > > Netvsc can't work with 3 dev model. MS has worked with enough distro's 
> > > > and
> > > > startups that all demand eth0 always be present. And VF may come and 
> > > > go.
> > > 
> > > Well failover seems to maintain this invariant with the 3 dev model.
> > >   
> > > > After this history, there is a strong motivation not to change how 
> > > > kernel
> > > > behaves. Switching to 3 device model would be perceived as breaking
> > > > existing userspace.
> > > 
> > > I feel I'm misunderstood. I was asking whether a 3-rd device can be
> > > hidden so that userspace does not know that you switched to a 3 device
> > > model. It will think there are 2 devices and will keep working.
> > > 
> > > If you do that, then there won't be anything that
> > > would be perceived as breaking existing userspace, will there?  
> > 
> > DPDK now knows about the netvsc 2 device model and drivers in userspace
> > depend on it.  
> 
> Interesting but I'm not sure how this answers the question. How would
> DPDK care that there's a hidden device? If you can point out the
> code in question, maybe a way can be found to make changes while
> keeping it working.

See http://dpdk.org/browse/dpdk/tree/drivers/net/vdev_netvsc/vdev_netvsc.c

I am working to eliminate the necessity of this complex model in DPDK.
Having a 3 device model inside DPDK has just as many problems as in the
kernel.



Re: [PATCH net] failover: eliminate callback hell

2018-06-07 Thread Michael S. Tsirkin
On Wed, Jun 06, 2018 at 03:24:08PM -0700, Stephen Hemminger wrote:
> On Thu, 7 Jun 2018 00:47:52 +0300
> "Michael S. Tsirkin"  wrote:
> 
> > On Wed, Jun 06, 2018 at 02:24:47PM -0700, Stephen Hemminger wrote:
> > > On Wed, 6 Jun 2018 15:30:27 +0300
> > > "Michael S. Tsirkin"  wrote:
> > >   
> > > > On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:  
> > > > > Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org 
> > > > > wrote:
> > > > > >The net failover should be a simple library, not a virtual
> > > > > >object with function callbacks (see callback hell).
> > > > > 
> > > > > Why just a library? It should do a common things. I think it should 
> > > > > be a
> > > > > virtual object. Looks like your patch again splits the common
> > > > > functionality into multiple drivers. That is kind of backwards 
> > > > > attitude.
> > > > > I don't get it. We should rather focus on fixing the mess the
> > > > > introduction of netvsc-bonding caused and switch netvsc to 3-netdev
> > > > > model.
> > > > 
> > > > So it seems that at least one benefit for netvsc would be better
> > > > handling of renames.
> > > > 
> > > > Question is how can this change to 3-netdev happen?  Stephen is
> > > > concerned about risk of breaking some userspace.
> > > > 
> > > > Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
> > > > address, and you said then "why not use existing network namespaces
> > > > rather than inventing a new abstraction". So how about it then? Do you
> > > > want to find a way to use namespaces to hide the PV device for netvsc
> > > > compatibility?
> > > >   
> > > 
> > > Netvsc can't work with 3 dev model. MS has worked with enough distro's and
> > > startups that all demand eth0 always be present. And VF may come and go.  
> > 
> > Well failover seems to maintain this invariant with the 3 dev model.
> > 
> > > After this history, there is a strong motivation not to change how kernel
> > > behaves. Switching to 3 device model would be perceived as breaking
> > > existing userspace.  
> > 
> > I feel I'm misunderstood. I was asking whether a 3-rd device can be
> > hidden so that userspace does not know that you switched to a 3 device
> > model. It will think there are 2 devices and will keep working.
> > 
> > If you do that, then there won't be anything that
> > would be perceived as breaking existing userspace, will there?
> 
> DPDK now knows about the netvsc 2 device model and drivers in userspace
> depend on it.

Interesting but I'm not sure how this answers the question. How would
DPDK care that there's a hidden device? If you can point out the
code in question, maybe a way can be found to make changes while
keeping it working.

> > 
> > 
> > > With virtio you can  work it out with the distro's yourself.
> > > There is no pre-existing semantics to deal with.
> > > 
> > > For the virtio, I don't see the need for IFF_HIDDEN.
> > > With 3-dev model as long as you mark the PV and VF devices
> > > as slaves, then userspace knows to leave them alone. Assuming userspace
> > > is already able to deal with team and bond devices.  
> > 
> > That's clear enough.
> > 
> > > Any time you introduce new UAPI behavior something breaks.  
> > 
> > Not if we do it right.
> > 
> > > On the rename front, I really don't care if VF can be renamed.  
> > 
> > OK that's nice.
> > 
> > > And for
> > > netvsc want to allow the PV device to be renamed.  
> > 
> > That's because of the 2 device model, right?  So that explains why even
> > if the delayed hack is good for the goose it might not be good for the
> > gander :)
> 
> You are bringing up the VF right away. How does the 3-device initialization
> state machine work?  Do you give a window for udev to possibly rename the
> VF? Do you rely on that?
> 
> > 
> > > Udev developers want that
> > > but have not found a stable/persistent value to expose to userspace
> > > to allow it.  


Re: [PATCH net] failover: eliminate callback hell

2018-06-07 Thread Stephen Hemminger
On Thu, 7 Jun 2018 07:17:51 -0700
Alexander Duyck  wrote:

> On Wed, Jun 6, 2018 at 3:25 PM, Stephen Hemminger
>  wrote:
> > On Wed, 6 Jun 2018 14:54:04 -0700
> > "Samudrala, Sridhar"  wrote:
> >  
> >> On 6/6/2018 2:24 PM, Stephen Hemminger wrote:  
> >> > On Wed, 6 Jun 2018 15:30:27 +0300
> >> > "Michael S. Tsirkin"  wrote:
> >> >  
> >> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:  
> >> >>> Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org 
> >> >>> wrote:  
> >>  The net failover should be a simple library, not a virtual
> >>  object with function callbacks (see callback hell).  
> >> >>> Why just a library? It should do a common things. I think it should be 
> >> >>> a
> >> >>> virtual object. Looks like your patch again splits the common
> >> >>> functionality into multiple drivers. That is kind of backwards 
> >> >>> attitude.
> >> >>> I don't get it. We should rather focus on fixing the mess the
> >> >>> introduction of netvsc-bonding caused and switch netvsc to 3-netdev
> >> >>> model.  
> >> >> So it seems that at least one benefit for netvsc would be better
> >> >> handling of renames.
> >> >>
> >> >> Question is how can this change to 3-netdev happen?  Stephen is
> >> >> concerned about risk of breaking some userspace.
> >> >>
> >> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
> >> >> address, and you said then "why not use existing network namespaces
> >> >> rather than inventing a new abstraction". So how about it then? Do you
> >> >> want to find a way to use namespaces to hide the PV device for netvsc
> >> >> compatibility?
> >> >>  
> >> > Netvsc can't work with 3 dev model. MS has worked with enough distro's 
> >> > and
> >> > startups that all demand eth0 always be present. And VF may come and go.
> >> > After this history, there is a strong motivation not to change how kernel
> >> > behaves. Switching to 3 device model would be perceived as breaking
> >> > existing userspace.  
> >>
> >> I think it should be possible for netvsc to work with 3 dev model if the 
> >> only
> >> requirement is that eth0 will always be present. With net_failover, you 
> >> will
> >> see eth0 and eth0nsby OR with older distros eth0 and eth1.  It may be an 
> >> issue
> >> if somehow there is userspace requirement that there can be only 2 
> >> netdevs, not 3
> >> when VF is plugged.
> >>
> >> eth0 will be the net_failover device and eth0nsby/eth1 will be the netvsc 
> >> device
> >> and the IP address gets configured on eth0. Will this be an issue?  
> >
> > DPDK drivers in 18.05 depend on 2 device model. Yes it is a bit of mess
> > but that is the way it is.  
> 
> Why would DPDK care what we do in the kernel? Isn't it just slapping
> vfio-pci on the netdevs it sees?

Alex, you are correct for Intel devices; but DPDK on Azure is not Intel based.,.
The DPDK support uses:
 * Mellanox MLX5 which uses the Infinband hooks to do DMA directly to
   userspace. This means VF netdev device must exist and be visible.
 * Slow path using kernel netvsc device, TAP and BPF to get exception
   path packets to userspace.
 * A autodiscovery mechanism that to set all this up that relies on
   2 device model and sysfs.

In this version, there is no VFIO-PCI. And also Hyper-V does not have virtual
IOMMU so VFIO will not work there at all.
   


Re: [PATCH net] failover: eliminate callback hell

2018-06-07 Thread Alexander Duyck
On Wed, Jun 6, 2018 at 3:25 PM, Stephen Hemminger
 wrote:
> On Wed, 6 Jun 2018 14:54:04 -0700
> "Samudrala, Sridhar"  wrote:
>
>> On 6/6/2018 2:24 PM, Stephen Hemminger wrote:
>> > On Wed, 6 Jun 2018 15:30:27 +0300
>> > "Michael S. Tsirkin"  wrote:
>> >
>> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:
>> >>> Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org wrote:
>>  The net failover should be a simple library, not a virtual
>>  object with function callbacks (see callback hell).
>> >>> Why just a library? It should do a common things. I think it should be a
>> >>> virtual object. Looks like your patch again splits the common
>> >>> functionality into multiple drivers. That is kind of backwards attitude.
>> >>> I don't get it. We should rather focus on fixing the mess the
>> >>> introduction of netvsc-bonding caused and switch netvsc to 3-netdev
>> >>> model.
>> >> So it seems that at least one benefit for netvsc would be better
>> >> handling of renames.
>> >>
>> >> Question is how can this change to 3-netdev happen?  Stephen is
>> >> concerned about risk of breaking some userspace.
>> >>
>> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
>> >> address, and you said then "why not use existing network namespaces
>> >> rather than inventing a new abstraction". So how about it then? Do you
>> >> want to find a way to use namespaces to hide the PV device for netvsc
>> >> compatibility?
>> >>
>> > Netvsc can't work with 3 dev model. MS has worked with enough distro's and
>> > startups that all demand eth0 always be present. And VF may come and go.
>> > After this history, there is a strong motivation not to change how kernel
>> > behaves. Switching to 3 device model would be perceived as breaking
>> > existing userspace.
>>
>> I think it should be possible for netvsc to work with 3 dev model if the only
>> requirement is that eth0 will always be present. With net_failover, you will
>> see eth0 and eth0nsby OR with older distros eth0 and eth1.  It may be an 
>> issue
>> if somehow there is userspace requirement that there can be only 2 netdevs, 
>> not 3
>> when VF is plugged.
>>
>> eth0 will be the net_failover device and eth0nsby/eth1 will be the netvsc 
>> device
>> and the IP address gets configured on eth0. Will this be an issue?
>
> DPDK drivers in 18.05 depend on 2 device model. Yes it is a bit of mess
> but that is the way it is.

Why would DPDK care what we do in the kernel? Isn't it just slapping
vfio-pci on the netdevs it sees?


Re: [PATCH bpf-next v5 00/10] BTF: BPF Type Format

2018-06-07 Thread Arnaldo Carvalho de Melo
Em Thu, Jun 07, 2018 at 10:54:01AM -0300, Arnaldo Carvalho de Melo escreveu:
> Em Tue, Jun 05, 2018 at 02:25:48PM -0700, Martin KaFai Lau escreveu:
> > [ btw, the latest commit (1 commit) should be 94a11b59e592 ].

So, the commit log message for the pahole patch is non-existent:

https://github.com/iamkafai/pahole/commit/94a11b59e5920908085bfc8d24c92f95c8ffceaf

we should do better in describing what is done and how, I'm staring
with a message you sent to the kernel part:

--
This patch introduces BPF Type Format (BTF).

BTF (BPF Type Format) is the meta data format which describes
the data types of BPF program/map.  Hence, it basically focus
on the C programming language which the modern BPF is primary
using.  The first use case is to provide a generic pretty print
capability for a BPF map.
--

Now I'm going to do the step-by-step guide on testing the feature just
introduced, and will try to convert from dwarf to BTF and back, compare
the pahole output for types encoded in DWARF and BTF, etc.

If you have something ressembling this already, please share.

Thanks,

- Arnaldo


Re: [PATCH bpf-next v5 00/10] BTF: BPF Type Format

2018-06-07 Thread Arnaldo Carvalho de Melo
Em Tue, Jun 05, 2018 at 02:25:48PM -0700, Martin KaFai Lau escreveu:
> On Thu, Apr 19, 2018 at 04:40:34PM -0300, Arnaldo Carvalho de Melo wrote:
> > Em Wed, Apr 18, 2018 at 03:55:56PM -0700, Martin KaFai Lau escreveu:
> > > This patch introduces BPF Type Format (BTF).
> > > 
> > > BTF (BPF Type Format) is the meta data format which describes
> > > the data types of BPF program/map.  Hence, it basically focus
> > > on the C programming language which the modern BPF is primary
> > > using.  The first use case is to provide a generic pretty print
> > > capability for a BPF map.
> > > 
> > > A modified pahole that can convert dwarf to BTF is here:
> > > https://github.com/iamkafai/pahole/tree/btf
> > > (Arnaldo, there is some BTF_KIND numbering changes on
> > >  Apr 18th, d61426c1571)
> > 
> > Thanks for letting me know, I'm starting to look at this,
> Hi Arnaldo,
> 
> Do you have a chance to take a look and pull it?  The kernel
> changes will be in 4.18, so it will be handy if it is available in
> the pahole repository.
> 
> [ btw, the latest commit (1 commit) should be 94a11b59e592 ].

Yeah, the one I had before had:

It also raises the number of types (and functions) limit from 0x7fff to
0x7fff.



And on this last one I see that:

 /* Max # of type identifier */
-#define BTF_MAX_TYPE   0x7fff
+#define BTF_MAX_TYPE   0x
 /* Max offset into the string section */
-#define BTF_MAX_NAME_OFFSET0x7fff
+#define BTF_MAX_NAME_OFFSET0x

So somehow (still reading) you'll be able to get more space, if we find
necessary, to have more types and names, ok.

Continuing...

- Arnaldo
 
> > 
> > - Arnaldo
> >  
> > > Please see individual patch for details.
> > > 
> > > v5:
> > > - Remove BTF_KIND_FLOAT and BTF_KIND_FUNC which are not
> > >   currently used.  They can be added in the future.
> > >   Some bpf_df_xxx() are removed together.
> > > - Add comment in patch 7 to clarify that the new bpffs_map_fops
> > >   should not be extended further.
> > > 
> > > v4:
> > > - Fix warning (remove unneeded semicolon)
> > > - Remove a redundant variable (nr_bytes) from btf_int_check_meta() in
> > >   patch 1.  Caught by W=1.
> > > 
> > > v3:
> > > - Rebase to bpf-next
> > > - Fix sparse warning (by adding static)
> > > - Add BTF header logging: btf_verifier_log_hdr()
> > > - Fix the alignment test on btf->type_off
> > > - Add tests for the BTF header
> > > - Lower the max BTF size to 16MB.  It should be enough
> > >   for some time.  We could raise it later if it would
> > >   be needed.
> > > 
> > > v2:
> > > - Use kvfree where needed in patch 1 and 2
> > > - Also consider BTF_INT_OFFSET() in the btf_int_check_meta()
> > >   in patch 1
> > > - Fix an incorrect goto target in map_create() during
> > >   the btf-error-path in patch 7
> > > - re-org some local vars to keep the rev xmas tree in btf.c
> > > 
> > > Martin KaFai Lau (10):
> > >   bpf: btf: Introduce BPF Type Format (BTF)
> > >   bpf: btf: Validate type reference
> > >   bpf: btf: Check members of struct/union
> > >   bpf: btf: Add pretty print capability for data with BTF type info
> > >   bpf: btf: Add BPF_BTF_LOAD command
> > >   bpf: btf: Add BPF_OBJ_GET_INFO_BY_FD support to BTF fd
> > >   bpf: btf: Add pretty print support to the basic arraymap
> > >   bpf: btf: Sync bpf.h and btf.h to tools/
> > >   bpf: btf: Add BTF support to libbpf
> > >   bpf: btf: Add BTF tests
> > > 
> > >  include/linux/bpf.h  |   20 +-
> > >  include/linux/btf.h  |   48 +
> > >  include/uapi/linux/bpf.h |   12 +
> > >  include/uapi/linux/btf.h |  130 ++
> > >  kernel/bpf/Makefile  |1 +
> > >  kernel/bpf/arraymap.c|   50 +
> > >  kernel/bpf/btf.c | 2064 
> > > ++
> > >  kernel/bpf/inode.c   |  156 +-
> > >  kernel/bpf/syscall.c |   51 +-
> > >  tools/include/uapi/linux/bpf.h   |   12 +
> > >  tools/include/uapi/linux/btf.h   |  130 ++
> > >  tools/lib/bpf/Build  |2 +-
> > >  tools/lib/bpf/bpf.c  |   92 +-
> > >  tools/lib/bpf/bpf.h  |   16 +
> > >  tools/lib/bpf/btf.c  |  374 +
> > >  tools/lib/bpf/btf.h  |   22 +
> > >  tools/lib/bpf/libbpf.c   |  148 +-
> > >  tools/lib/bpf/libbpf.h   |3 +
> > >  tools/testing/selftests/bpf/Makefile |   26 +-
> > >  tools/testing/selftests/bpf/test_btf.c   | 1669 +
> > >  tools/testing/selftests/bpf/test_btf_haskv.c |   48 +
> > >  tools/testing/selftests/bpf/test_btf_nokv.c  |   43 +
> > >  22 files changed, 5076 insertions(+), 41 deletions(-)
> > >  create mode 100644 include/linux/btf.h
> > >  create mode 100644 include/uapi/linux/btf.h
> > >  

[PATCH net] net/sched: act_simple: fix parsing of TCA_DEFDATA

2018-06-07 Thread Davide Caratti
use nla_strlcpy() to avoid copying data beyond the length of TCA_DEFDATA
netlink attribute, in case it is less than SIMP_MAX_DATA and it does not
end with '\0' character.

Fixes: 0eff683f737b ("net/sched: potential data corruption")
Signed-off-by: Davide Caratti 
---
 net/sched/act_simple.c | 15 ++-
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/net/sched/act_simple.c b/net/sched/act_simple.c
index 9618b4a83cee..98c4afe7c15b 100644
--- a/net/sched/act_simple.c
+++ b/net/sched/act_simple.c
@@ -53,22 +53,22 @@ static void tcf_simp_release(struct tc_action *a)
kfree(d->tcfd_defdata);
 }
 
-static int alloc_defdata(struct tcf_defact *d, char *defdata)
+static int alloc_defdata(struct tcf_defact *d, const struct nlattr *defdata)
 {
d->tcfd_defdata = kzalloc(SIMP_MAX_DATA, GFP_KERNEL);
if (unlikely(!d->tcfd_defdata))
return -ENOMEM;
-   strlcpy(d->tcfd_defdata, defdata, SIMP_MAX_DATA);
+   nla_strlcpy(d->tcfd_defdata, defdata, SIMP_MAX_DATA);
return 0;
 }
 
-static void reset_policy(struct tcf_defact *d, char *defdata,
+static void reset_policy(struct tcf_defact *d, const struct nlattr *defdata,
 struct tc_defact *p)
 {
spin_lock_bh(>tcf_lock);
d->tcf_action = p->action;
memset(d->tcfd_defdata, 0, SIMP_MAX_DATA);
-   strlcpy(d->tcfd_defdata, defdata, SIMP_MAX_DATA);
+   nla_strlcpy(d->tcfd_defdata, defdata, SIMP_MAX_DATA);
spin_unlock_bh(>tcf_lock);
 }
 
@@ -87,7 +87,6 @@ static int tcf_simp_init(struct net *net, struct nlattr *nla,
struct tcf_defact *d;
bool exists = false;
int ret = 0, err;
-   char *defdata;
 
if (nla == NULL)
return -EINVAL;
@@ -110,8 +109,6 @@ static int tcf_simp_init(struct net *net, struct nlattr 
*nla,
return -EINVAL;
}
 
-   defdata = nla_data(tb[TCA_DEF_DATA]);
-
if (!exists) {
ret = tcf_idr_create(tn, parm->index, est, a,
 _simp_ops, bind, false);
@@ -119,7 +116,7 @@ static int tcf_simp_init(struct net *net, struct nlattr 
*nla,
return ret;
 
d = to_defact(*a);
-   ret = alloc_defdata(d, defdata);
+   ret = alloc_defdata(d, tb[TCA_DEF_DATA]);
if (ret < 0) {
tcf_idr_release(*a, bind);
return ret;
@@ -133,7 +130,7 @@ static int tcf_simp_init(struct net *net, struct nlattr 
*nla,
if (!ovr)
return -EEXIST;
 
-   reset_policy(d, defdata, parm);
+   reset_policy(d, tb[TCA_DEF_DATA], parm);
}
 
if (ret == ACT_P_CREATED)
-- 
2.17.0



Re: [RFC net-next] ipv4: Don't promote secondaries when flushing addresses

2018-06-07 Thread Michal Kubecek
On Thu, Jun 07, 2018 at 02:35:39PM +0200, Phil Sutter wrote:
> Yes, I agree with Michal. IIRC, flushing a specific primary along with
> all it's secondaries from an interface is not even supported by
> iproute2, so no need to optimize for that I guess. OTOH, if your
> solution allowed to get rid of that nasty loop in ipaddr_flush(), I owe
> you one extra beer at the next occasion. :)

I'm afraid it will have to stay as fallback for older kernels not
supporting flush requests. But there would be no need to actually use
it. If we know RTM_DELADDR request for zero address is guaranteed to
fail with current and older kernels, we could do

  - use RTM_DELADDR with IFA_F_FLUSH and zero address
  - if it fails, get the list and run the loop

If not, it could still be

  - use RTM_DELADDR with IFA_F_FLUSH and zero address
  - get the list of addresses (empty if first step worked)
  - run the loop

Michal Kubecek


Re: [RFC net-next] ipv4: Don't promote secondaries when flushing addresses

2018-06-07 Thread Phil Sutter
Hi Jakub,

On Thu, Jun 07, 2018 at 02:17:50PM +0200, Jakub Sitnicki wrote:
> On Thu, 7 Jun 2018 13:00:29 +0200
> Michal Kubecek  wrote:
> 
> > On Thu, Jun 07, 2018 at 12:13:01PM +0200, Jakub Sitnicki wrote:
> > > Promoting secondary addresses on address removal makes flushing all
> > > addresses from a device with 1000's of them slow. This is because we
> > > cannot take down the secondary addresses when we are removing the
> > > primary one, which would make it faster.
> > > 
> > > However, the userspace, when performing a flush, will in the end remove
> > > all the addresses regardless of secondary address promotion taking
> > > place. Unfortunately the kernel currently cannot distinguish between a
> > > single address removal and a flush of all addresses.
> > > 
> > > To help with this case introduce a IFA_F_FLUSH flag that can be used by
> > > userspace to signal that a removal operation is being done because of a
> > > flush. When the flag is set, don't bother with secondary address
> > > promotion as we expect that secondary addresses will be removed soon as
> > > well.  
> > 
> > Unless you intend to use the flag to allow deleting a specific address
> > with its secondaries (overriding promote_secondaries), maybe it would
> > be more practical to go even further and delete all addresses on the
> > interface if IFA_F_FLUSH is set so that userspace could delete all
> > addresses with one request.
> 
> Thanks for input, Michal. The intend as I understand it is to make
> flushing all the addresses fast(er). Let me see if I can rework it
> according to your suggestion. It does make more sense to do it like
> that to me too.

Yes, I agree with Michal. IIRC, flushing a specific primary along with
all it's secondaries from an interface is not even supported by
iproute2, so no need to optimize for that I guess. OTOH, if your
solution allowed to get rid of that nasty loop in ipaddr_flush(), I owe
you one extra beer at the next occasion. :)

Thanks for holding on to this old ticket!

Cheers, Phil


Re: [RFC net-next] ipv4: Don't promote secondaries when flushing addresses

2018-06-07 Thread Jakub Sitnicki
On Thu, 7 Jun 2018 13:00:29 +0200
Michal Kubecek  wrote:

> On Thu, Jun 07, 2018 at 12:13:01PM +0200, Jakub Sitnicki wrote:
> > Promoting secondary addresses on address removal makes flushing all
> > addresses from a device with 1000's of them slow. This is because we
> > cannot take down the secondary addresses when we are removing the
> > primary one, which would make it faster.
> > 
> > However, the userspace, when performing a flush, will in the end remove
> > all the addresses regardless of secondary address promotion taking
> > place. Unfortunately the kernel currently cannot distinguish between a
> > single address removal and a flush of all addresses.
> > 
> > To help with this case introduce a IFA_F_FLUSH flag that can be used by
> > userspace to signal that a removal operation is being done because of a
> > flush. When the flag is set, don't bother with secondary address
> > promotion as we expect that secondary addresses will be removed soon as
> > well.  
> 
> Unless you intend to use the flag to allow deleting a specific address
> with its secondaries (overriding promote_secondaries), maybe it would
> be more practical to go even further and delete all addresses on the
> interface if IFA_F_FLUSH is set so that userspace could delete all
> addresses with one request.

Thanks for input, Michal. The intend as I understand it is to make
flushing all the addresses fast(er). Let me see if I can rework it
according to your suggestion. It does make more sense to do it like
that to me too.

Thanks,
Jakub



Re: [RFC net-next] ipv4: Don't promote secondaries when flushing addresses

2018-06-07 Thread Michal Kubecek
On Thu, Jun 07, 2018 at 12:13:01PM +0200, Jakub Sitnicki wrote:
> Promoting secondary addresses on address removal makes flushing all
> addresses from a device with 1000's of them slow. This is because we
> cannot take down the secondary addresses when we are removing the
> primary one, which would make it faster.
> 
> However, the userspace, when performing a flush, will in the end remove
> all the addresses regardless of secondary address promotion taking
> place. Unfortunately the kernel currently cannot distinguish between a
> single address removal and a flush of all addresses.
> 
> To help with this case introduce a IFA_F_FLUSH flag that can be used by
> userspace to signal that a removal operation is being done because of a
> flush. When the flag is set, don't bother with secondary address
> promotion as we expect that secondary addresses will be removed soon as
> well.

Unless you intend to use the flag to allow deleting a specific address
with its secondaries (overriding promote_secondaries), maybe it would
be more practical to go even further and delete all addresses on the
interface if IFA_F_FLUSH is set so that userspace could delete all
addresses with one request.

Michal Kubecek


[RFC net-next] ipv4: Don't promote secondaries when flushing addresses

2018-06-07 Thread Jakub Sitnicki
Promoting secondary addresses on address removal makes flushing all
addresses from a device with 1000's of them slow. This is because we
cannot take down the secondary addresses when we are removing the
primary one, which would make it faster.

However, the userspace, when performing a flush, will in the end remove
all the addresses regardless of secondary address promotion taking
place. Unfortunately the kernel currently cannot distinguish between a
single address removal and a flush of all addresses.

To help with this case introduce a IFA_F_FLUSH flag that can be used by
userspace to signal that a removal operation is being done because of a
flush. When the flag is set, don't bother with secondary address
promotion as we expect that secondary addresses will be removed soon as
well.

Signed-off-by: Jakub Sitnicki 

---

A benchmark involving a flush of 40,000 addresses from a dummy device
shows a x4 speed-up of the 'flush' operation. 'ip' had to be modified to
set the IFA_F_FLUSH flag for RTM_DELADDR requests issued for the
'flush':

  # time $IP -stats addr flush dev dum0

Before:

  real0m30.596s
  user0m0.000s
  sys 0m30.567s

After:

  real0m7.601s
  user0m0.000s
  sys 0m7.569s

It's also worth noting that promote_secondaries sysctl param is enabled by
default since systemd 216 thus making it the new "normal" on some distros.


 include/uapi/linux/if_addr.h |  1 +
 net/ipv4/devinet.c   | 14 ++
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/if_addr.h b/include/uapi/linux/if_addr.h
index ebaf5701c9db..19aab9a9cec5 100644
--- a/include/uapi/linux/if_addr.h
+++ b/include/uapi/linux/if_addr.h
@@ -54,6 +54,7 @@ enum {
 #define IFA_F_NOPREFIXROUTE0x200
 #define IFA_F_MCAUTOJOIN   0x400
 #define IFA_F_STABLE_PRIVACY   0x800
+#define IFA_F_FLUSH0x1000

 struct ifa_cacheinfo {
__u32   ifa_prefered;
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index d7585ab1a77a..1f436e1e5222 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -331,13 +331,14 @@ int inet_addr_onlink(struct in_device *in_dev, __be32 a, 
__be32 b)
 }

 static void __inet_del_ifa(struct in_device *in_dev, struct in_ifaddr **ifap,
-int destroy, struct nlmsghdr *nlh, u32 portid)
+  int destroy, struct nlmsghdr *nlh, u32 portid,
+  bool flush)
 {
struct in_ifaddr *promote = NULL;
struct in_ifaddr *ifa, *ifa1 = *ifap;
struct in_ifaddr *last_prim = in_dev->ifa_list;
struct in_ifaddr *prev_prom = NULL;
-   int do_promote = IN_DEV_PROMOTE_SECONDARIES(in_dev);
+   int do_promote = IN_DEV_PROMOTE_SECONDARIES(in_dev) && !flush;

ASSERT_RTNL();

@@ -437,7 +438,7 @@ static void __inet_del_ifa(struct in_device *in_dev, struct 
in_ifaddr **ifap,
 static void inet_del_ifa(struct in_device *in_dev, struct in_ifaddr **ifap,
 int destroy)
 {
-   __inet_del_ifa(in_dev, ifap, destroy, NULL, 0);
+   __inet_del_ifa(in_dev, ifap, destroy, NULL, 0, false);
 }

 static void check_lifetime(struct work_struct *work);
@@ -607,6 +608,7 @@ static int inet_rtm_deladdr(struct sk_buff *skb, struct 
nlmsghdr *nlh,
struct in_device *in_dev;
struct ifaddrmsg *ifm;
struct in_ifaddr *ifa, **ifap;
+   bool flush = false;
int err = -EINVAL;

ASSERT_RTNL();
@@ -623,6 +625,9 @@ static int inet_rtm_deladdr(struct sk_buff *skb, struct 
nlmsghdr *nlh,
goto errout;
}

+   if (tb[IFA_FLAGS])
+   flush = !!(nla_get_u32(tb[IFA_FLAGS]) & IFA_F_FLUSH);
+
for (ifap = _dev->ifa_list; (ifa = *ifap) != NULL;
 ifap = >ifa_next) {
if (tb[IFA_LOCAL] &&
@@ -639,7 +644,8 @@ static int inet_rtm_deladdr(struct sk_buff *skb, struct 
nlmsghdr *nlh,

if (ipv4_is_multicast(ifa->ifa_address))
ip_mc_config(net->ipv4.mc_autojoin_sk, false, ifa);
-   __inet_del_ifa(in_dev, ifap, 1, nlh, NETLINK_CB(skb).portid);
+   __inet_del_ifa(in_dev, ifap, 1, nlh, NETLINK_CB(skb).portid,
+  flush);
return 0;
}

--
2.14.4


Re: [PATCH bpf-next 09/11] i40e: implement AF_XDP zero-copy support for Rx

2018-06-07 Thread Björn Töpel
Den mån 4 juni 2018 kl 22:35 skrev Alexander Duyck :
>
> On Mon, Jun 4, 2018 at 5:05 AM, Björn Töpel  wrote:
> > From: Björn Töpel 
> >
> > This commit adds initial AF_XDP zero-copy support for i40e-based
> > NICs. First we add support for the new XDP_QUERY_XSK_UMEM and
> > XDP_SETUP_XSK_UMEM commands in ndo_bpf. This allows the AF_XDP socket
> > to pass a UMEM to the driver. The driver will then DMA map all the
> > frames in the UMEM for the driver. Next, the Rx code will allocate
> > frames from the UMEM fill queue, instead of the regular page
> > allocator.
> >
> > Externally, for the rest of the XDP code, the driver internal UMEM
> > allocator will appear as a MEM_TYPE_ZERO_COPY.
> >
> > The commit also introduces a completely new clean_rx_irq/allocator
> > functions for zero-copy, and means (functions pointers) to set
> > allocators and clean_rx functions.
> >
> > This first version does not support:
> > * passing frames to the stack via XDP_PASS (clone/copy to skb).
> > * doing XDP redirect to other than AF_XDP sockets
> >   (convert_to_xdp_frame does not clone the frame yet).
> >
> > Signed-off-by: Björn Töpel 
> > ---
> >  drivers/net/ethernet/intel/i40e/Makefile|   3 +-
> >  drivers/net/ethernet/intel/i40e/i40e.h  |  23 ++
> >  drivers/net/ethernet/intel/i40e/i40e_main.c |  35 +-
> >  drivers/net/ethernet/intel/i40e/i40e_txrx.c | 163 ++---
> >  drivers/net/ethernet/intel/i40e/i40e_txrx.h | 128 ++-
> >  drivers/net/ethernet/intel/i40e/i40e_xsk.c  | 537 
> > 
> >  drivers/net/ethernet/intel/i40e/i40e_xsk.h  |  17 +
> >  include/net/xdp_sock.h  |  19 +
> >  net/xdp/xdp_umem.h  |  10 -
> >  9 files changed, 789 insertions(+), 146 deletions(-)
> >  create mode 100644 drivers/net/ethernet/intel/i40e/i40e_xsk.c
> >  create mode 100644 drivers/net/ethernet/intel/i40e/i40e_xsk.h
> >
> > diff --git a/drivers/net/ethernet/intel/i40e/Makefile 
> > b/drivers/net/ethernet/intel/i40e/Makefile
> > index 14397e7e9925..50590e8d1fd1 100644
> > --- a/drivers/net/ethernet/intel/i40e/Makefile
> > +++ b/drivers/net/ethernet/intel/i40e/Makefile
> > @@ -22,6 +22,7 @@ i40e-objs := i40e_main.o \
> > i40e_txrx.o \
> > i40e_ptp.o  \
> > i40e_client.o   \
> > -   i40e_virtchnl_pf.o
> > +   i40e_virtchnl_pf.o \
> > +   i40e_xsk.o
> >
> >  i40e-$(CONFIG_I40E_DCB) += i40e_dcb.o i40e_dcb_nl.o
> > diff --git a/drivers/net/ethernet/intel/i40e/i40e.h 
> > b/drivers/net/ethernet/intel/i40e/i40e.h
> > index 7a80652e2500..20955e5dce02 100644
> > --- a/drivers/net/ethernet/intel/i40e/i40e.h
> > +++ b/drivers/net/ethernet/intel/i40e/i40e.h
> > @@ -786,6 +786,12 @@ struct i40e_vsi {
> >
> > /* VSI specific handlers */
> > irqreturn_t (*irq_handler)(int irq, void *data);
> > +
> > +   /* AF_XDP zero-copy */
> > +   struct xdp_umem **xsk_umems;
> > +   u16 num_xsk_umems_used;
> > +   u16 num_xsk_umems;
> > +
> >  } cacheline_internodealigned_in_smp;
> >
> >  struct i40e_netdev_priv {
> > @@ -1090,6 +1096,20 @@ static inline bool i40e_enabled_xdp_vsi(struct 
> > i40e_vsi *vsi)
> > return !!vsi->xdp_prog;
> >  }
> >
> > +static inline struct xdp_umem *i40e_xsk_umem(struct i40e_ring *ring)
> > +{
> > +   bool xdp_on = i40e_enabled_xdp_vsi(ring->vsi);
> > +   int qid = ring->queue_index;
> > +
> > +   if (ring_is_xdp(ring))
> > +   qid -= ring->vsi->alloc_queue_pairs;
> > +
> > +   if (!ring->vsi->xsk_umems || !ring->vsi->xsk_umems[qid] || !xdp_on)
> > +   return NULL;
> > +
> > +   return ring->vsi->xsk_umems[qid];
> > +}
> > +
> >  int i40e_create_queue_channel(struct i40e_vsi *vsi, struct i40e_channel 
> > *ch);
> >  int i40e_set_bw_limit(struct i40e_vsi *vsi, u16 seid, u64 max_tx_rate);
> >  int i40e_add_del_cloud_filter(struct i40e_vsi *vsi,
> > @@ -1098,4 +1118,7 @@ int i40e_add_del_cloud_filter(struct i40e_vsi *vsi,
> >  int i40e_add_del_cloud_filter_big_buf(struct i40e_vsi *vsi,
> >   struct i40e_cloud_filter *filter,
> >   bool add);
> > +int i40e_queue_pair_enable(struct i40e_vsi *vsi, int queue_pair);
> > +int i40e_queue_pair_disable(struct i40e_vsi *vsi, int queue_pair);
> > +
> >  #endif /* _I40E_H_ */
> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
> > b/drivers/net/ethernet/intel/i40e/i40e_main.c
> > index 369a116edaa1..8c602424d339 100644
> > --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
> > +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
> > @@ -5,6 +5,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >
> >  /* Local includes */
> >  #include "i40e.h"
> > @@ -16,6 +17,7 @@
> >   */
> >  #define CREATE_TRACE_POINTS
> >  #include "i40e_trace.h"
> > +#include "i40e_xsk.h"
> >
> >  const char i40e_driver_name[] = "i40e";
> >  static const char i40e_driver_string[] =
> > @@ -3071,6 +3073,9 @@ static int 

[PATCH ipsec] vti6: fix PMTU caching and reporting on xmit

2018-06-07 Thread Eyal Birger
When setting the skb->dst before doing the MTU check, the route PMTU
caching and reporting is done on the new dst which is about to be
released.

Instead, PMTU handling should be done using the original dst.

This is aligned with IPv4 VTI.

Signed-off-by: Eyal Birger 
Fixes: ccd740cbc6 ("vti6: Add pmtu handling to vti6_xmit.")
---
 net/ipv6/ip6_vti.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/net/ipv6/ip6_vti.c b/net/ipv6/ip6_vti.c
index ca957dd..e675ec7 100644
--- a/net/ipv6/ip6_vti.c
+++ b/net/ipv6/ip6_vti.c
@@ -480,10 +480,6 @@ vti6_xmit(struct sk_buff *skb, struct net_device *dev, 
struct flowi *fl)
goto tx_err_dst_release;
}
 
-   skb_scrub_packet(skb, !net_eq(t->net, dev_net(dev)));
-   skb_dst_set(skb, dst);
-   skb->dev = skb_dst(skb)->dev;
-
mtu = dst_mtu(dst);
if (!skb->ignore_df && skb->len > mtu) {
skb_dst_update_pmtu(skb, mtu);
@@ -498,9 +494,14 @@ vti6_xmit(struct sk_buff *skb, struct net_device *dev, 
struct flowi *fl)
  htonl(mtu));
}
 
-   return -EMSGSIZE;
+   err = -EMSGSIZE;
+   goto tx_err_dst_release;
}
 
+   skb_scrub_packet(skb, !net_eq(t->net, dev_net(dev)));
+   skb_dst_set(skb, dst);
+   skb->dev = skb_dst(skb)->dev;
+
err = dst_output(t->net, skb->sk, skb);
if (net_xmit_eval(err) == 0) {
struct pcpu_sw_netstats *tstats = this_cpu_ptr(dev->tstats);
-- 
2.7.4