[PATCH net v2] net/sched: act_simple: fix parsing of TCA_DEF_DATA
use nla_strlcpy() to avoid copying data beyond the length of TCA_DEF_DATA netlink attribute, in case it is less than SIMP_MAX_DATA and it does not end with '\0' character. v2: fix errors in the commit message, thanks Hangbin Liu Fixes: fa1b1cff3d06 ("net_cls_act: Make act_simple use of netlink policy.") Signed-off-by: Davide Caratti --- net/sched/act_simple.c | 15 ++- 1 file changed, 6 insertions(+), 9 deletions(-) diff --git a/net/sched/act_simple.c b/net/sched/act_simple.c index 9618b4a83cee..98c4afe7c15b 100644 --- a/net/sched/act_simple.c +++ b/net/sched/act_simple.c @@ -53,22 +53,22 @@ static void tcf_simp_release(struct tc_action *a) kfree(d->tcfd_defdata); } -static int alloc_defdata(struct tcf_defact *d, char *defdata) +static int alloc_defdata(struct tcf_defact *d, const struct nlattr *defdata) { d->tcfd_defdata = kzalloc(SIMP_MAX_DATA, GFP_KERNEL); if (unlikely(!d->tcfd_defdata)) return -ENOMEM; - strlcpy(d->tcfd_defdata, defdata, SIMP_MAX_DATA); + nla_strlcpy(d->tcfd_defdata, defdata, SIMP_MAX_DATA); return 0; } -static void reset_policy(struct tcf_defact *d, char *defdata, +static void reset_policy(struct tcf_defact *d, const struct nlattr *defdata, struct tc_defact *p) { spin_lock_bh(>tcf_lock); d->tcf_action = p->action; memset(d->tcfd_defdata, 0, SIMP_MAX_DATA); - strlcpy(d->tcfd_defdata, defdata, SIMP_MAX_DATA); + nla_strlcpy(d->tcfd_defdata, defdata, SIMP_MAX_DATA); spin_unlock_bh(>tcf_lock); } @@ -87,7 +87,6 @@ static int tcf_simp_init(struct net *net, struct nlattr *nla, struct tcf_defact *d; bool exists = false; int ret = 0, err; - char *defdata; if (nla == NULL) return -EINVAL; @@ -110,8 +109,6 @@ static int tcf_simp_init(struct net *net, struct nlattr *nla, return -EINVAL; } - defdata = nla_data(tb[TCA_DEF_DATA]); - if (!exists) { ret = tcf_idr_create(tn, parm->index, est, a, _simp_ops, bind, false); @@ -119,7 +116,7 @@ static int tcf_simp_init(struct net *net, struct nlattr *nla, return ret; d = to_defact(*a); - ret = alloc_defdata(d, defdata); + ret = alloc_defdata(d, tb[TCA_DEF_DATA]); if (ret < 0) { tcf_idr_release(*a, bind); return ret; @@ -133,7 +130,7 @@ static int tcf_simp_init(struct net *net, struct nlattr *nla, if (!ovr) return -EEXIST; - reset_policy(d, defdata, parm); + reset_policy(d, tb[TCA_DEF_DATA], parm); } if (ret == ACT_P_CREATED) -- 2.17.0
Re: [PATCH net] net/sched: act_simple: fix parsing of TCA_DEFDATA
On Fri, 2018-06-08 at 10:07 +0800, Hangbin Liu wrote: > On Thu, Jun 07, 2018 at 03:46:43PM +0200, Davide Caratti wrote: > > use nla_strlcpy() to avoid copying data beyond the length of TCA_DEFDATA > > s/TCA_DEFDATA/TCA_DEF_DATA/, incase someone search the commit history but > could not find it. > > Thanks > Hangbin sure, thanks, and after another look I think also the 'Fixes:' tag is wrong. More probably it was introduced with fa1b1cff3d06 "net_cls_act: Make act_simple use of netlink policy." I will send a v2 in minutes. regards, -- davide
Re: [PATCH net-next] net: ipv6: Generate random IID for addresses on RAWIP devices
On Mon, Jun 4, 2018 at 8:51 AM 吉藤英明 wrote: > > > + if (ipv6_get_lladdr(dev, , IFA_F_TENTATIVE)) > > + get_random_bytes(eui, 8); > > Please be aware of I/G bit and G/L bit. Actually, I think this is fine. RFC 7136 clarified this, and says: == Thus, we can conclude that the value of the "u" bit in IIDs has no particular meaning. In the case of an IID created from a MAC address according to RFC 4291, its value is determined by the MAC address, but that is all. [...] Specifications of other forms of 64-bit IIDs MUST specify how all 64 bits are set, but a generic semantic meaning for the "u" and "g" bits MUST NOT be defined. However, the method of generating IIDs for specific link types MAY define some local significance for certain bits. In all cases, the bits in an IID have no generic semantics; in other words, they have opaque values. In fact, the whole IID value MUST be viewed as an opaque bit string by third parties, except possibly in the local context. == That said - we already have a way to form all-random IIDs: IN6_ADDR_GEN_MODE_RANDOM. Can't you just ensure that links of type ARPHRD_RAWIP always use IN6_ADDR_GEN_MODE_RANDOM? That might lead to less special-casing.
Re: [PATCH net] net/sched: act_simple: fix parsing of TCA_DEFDATA
On Thu, Jun 07, 2018 at 03:46:43PM +0200, Davide Caratti wrote: > use nla_strlcpy() to avoid copying data beyond the length of TCA_DEFDATA s/TCA_DEFDATA/TCA_DEF_DATA/, incase someone search the commit history but could not find it. Thanks Hangbin
Proposal
-- Good day, i know you do not know me personally but i have checked your profile and i see generosity in you, There's an urgent offer attach to your name here in the office of Mr. Fawaz KhE. Al Saleh Member of the Board of Directors, Kuveyt Türk Participation Bank (Turkey) and head of private banking and wealth management Regards, Mr. Fawaz KhE. Al Saleh
Re: [Bug 199637] New: UBSAN: Undefined behaviour in net/ipv4/fib_trie.c:503:6
On 6/7/18 5:07 PM, Jakub Kicinski wrote: >> After recompiling the 4.16.7 kernel with gcc 8.1, UBSAN reports the >> following: >> >> [ 25.427424] >> >> [ 25.429680] UBSAN: Undefined behaviour in net/ipv4/fib_trie.c:503:6 >> [ 25.431920] member access within null pointer of type 'struct tnode' >> [ 25.434153] CPU: 3 PID: 1 Comm: systemd Not tainted 4.16.7-CUSTOM #1 >> [ 25.436384] Hardware name: Gigabyte Technology Co., Ltd. >> H67MA-UD2H-B3/H67MA-UD2H-B3, BIOS F8 03/27/2012 >> [ 25.438647] Call Trace: >> [ 25.440889] dump_stack+0x62/0x9f >> [ 25.443104] ubsan_epilogue+0x9/0x35 >> [ 25.445293] handle_null_ptr_deref+0x80/0x90 >> [ 25.447464] __ubsan_handle_type_mismatch_v1+0x6a/0x80 >> [ 25.449628] tnode_free+0xce/0x120 arguably this one should be guarded: diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c index 5bc0c89e81e4..32c589059fb3 100644 --- a/net/ipv4/fib_trie.c +++ b/net/ipv4/fib_trie.c @@ -501,7 +501,8 @@ static void tnode_free(struct key_vector *tn) tnode_free_size += TNODE_SIZE(1ul << tn->bits); node_free(tn); - tn = container_of(head, struct tnode, rcu)->kv; + if (head) + tn = container_of(head, struct tnode, rcu)->kv; } if (tnode_free_size >= PAGE_SIZE * sync_pages) { but if head is NULL, tn is set but not dereferenced as the loop breaks.
Re: [Bug 199643] New: UBSAN: Undefined behaviour in ./include/net/route.h:240:2
CC: Andrey On Thu, 7 Jun 2018 17:53:35 -0700, David Ahern wrote: > On 6/7/18 5:49 PM, Jakub Kicinski wrote: > > On Thu, 7 Jun 2018 17:28:59 -0700, Eric Dumazet wrote: > >> On 06/07/2018 05:11 PM, David Miller wrote: > >>> From: Jakub Kicinski > >>> Date: Thu, 7 Jun 2018 17:06:23 -0700 > >>> > [ 293.213661] ip_send_unicast_reply+0x1b67/0x1d0e > >>> > >>> This calls ip_setup_cork() which can NULL out the 'rt' route > >>> pointer. Hmmm... :-/ > >> > >> UBSAN seems unhappy with dst being NULL in : > >> > >> dst_release(>dst); > >> > >> But the code obviously is ready for dst being NULL, it is even documented > >> :) > > > > Oh, so the code depends on dst being the first member? Would it make > > sense to just cast the pointer instead? > > > > I've been going the other way with 'rt to dst' and 'dst to rt' > transformations. > > Perhaps UBSAN should be updated to understand that NULL + 0 is ok.
Re: [Bug 199643] New: UBSAN: Undefined behaviour in ./include/net/route.h:240:2
On 6/7/18 5:49 PM, Jakub Kicinski wrote: > On Thu, 7 Jun 2018 17:28:59 -0700, Eric Dumazet wrote: >> On 06/07/2018 05:11 PM, David Miller wrote: >>> From: Jakub Kicinski >>> Date: Thu, 7 Jun 2018 17:06:23 -0700 >>> [ 293.213661] ip_send_unicast_reply+0x1b67/0x1d0e >>> >>> This calls ip_setup_cork() which can NULL out the 'rt' route >>> pointer. Hmmm... :-/ >>> >> >> >> UBSAN seems unhappy with dst being NULL in : >> >> dst_release(>dst); >> >> But the code obviously is ready for dst being NULL, it is even documented :) > > Oh, so the code depends on dst being the first member? Would it make > sense to just cast the pointer instead? > I've been going the other way with 'rt to dst' and 'dst to rt' transformations. Perhaps UBSAN should be updated to understand that NULL + 0 is ok.
Re: [Bug 199643] New: UBSAN: Undefined behaviour in ./include/net/route.h:240:2
On Thu, 7 Jun 2018 17:28:59 -0700, Eric Dumazet wrote: > On 06/07/2018 05:11 PM, David Miller wrote: > > From: Jakub Kicinski > > Date: Thu, 7 Jun 2018 17:06:23 -0700 > > > >> [ 293.213661] ip_send_unicast_reply+0x1b67/0x1d0e > > > > This calls ip_setup_cork() which can NULL out the 'rt' route > > pointer. Hmmm... :-/ > > > > > UBSAN seems unhappy with dst being NULL in : > > dst_release(>dst); > > But the code obviously is ready for dst being NULL, it is even documented :) Oh, so the code depends on dst being the first member? Would it make sense to just cast the pointer instead?
Re: [Bug 199643] New: UBSAN: Undefined behaviour in ./include/net/route.h:240:2
On 06/07/2018 05:11 PM, David Miller wrote: > From: Jakub Kicinski > Date: Thu, 7 Jun 2018 17:06:23 -0700 > >> [ 293.213661] ip_send_unicast_reply+0x1b67/0x1d0e > > This calls ip_setup_cork() which can NULL out the 'rt' route > pointer. Hmmm... :-/ > UBSAN seems unhappy with dst being NULL in : dst_release(>dst); But the code obviously is ready for dst being NULL, it is even documented :)
Re: [PATCH v2] net-fq: Add WARN_ON check for null flow.
On Thu, Jun 7, 2018 at 4:48 PM, wrote: > From: Ben Greear > > While testing an ath10k firmware that often crashed under load, > I was seeing kernel crashes as well. One of them appeared to > be a dereference of a NULL flow object in fq_tin_dequeue. > > I have since fixed the firmware flaw, but I think it would be > worth adding the WARN_ON in case the problem appears again. > > BUG: unable to handle kernel NULL pointer dereference at 003c > IP: ieee80211_tx_dequeue+0xfb/0xb10 [mac80211] Instead of adding WARN_ON(), you need to think about the locking there, it is suspicious: fq is from struct ieee80211_local: struct fq *fq = >fq; tin is from struct txq_info: struct fq_tin *tin = >tin; I don't know if fq and tin are supposed to be 1:1, if not there is a bug in the locking, because ->new_flows and ->old_flows are both inside tin instead of fq, but they are protected by fq->lock
Re: [Bug 199643] New: UBSAN: Undefined behaviour in ./include/net/route.h:240:2
From: Jakub Kicinski Date: Thu, 7 Jun 2018 17:06:23 -0700 > [ 293.213661] ip_send_unicast_reply+0x1b67/0x1d0e This calls ip_setup_cork() which can NULL out the 'rt' route pointer. Hmmm... :-/
Re: [Bug 199637] New: UBSAN: Undefined behaviour in net/ipv4/fib_trie.c:503:6
On Mon, 7 May 2018 10:33:45 -0700, Stephen Hemminger wrote: > Begin forwarded message: > > Date: Mon, 07 May 2018 16:07:24 + > From: bugzilla-dae...@bugzilla.kernel.org > To: step...@networkplumber.org > Subject: [Bug 199637] New: UBSAN: Undefined behaviour in > net/ipv4/fib_trie.c:503:6 > > > https://bugzilla.kernel.org/show_bug.cgi?id=199637 > > Bug ID: 199637 >Summary: UBSAN: Undefined behaviour in > net/ipv4/fib_trie.c:503:6 >Product: Networking >Version: 2.5 > Kernel Version: 4.16.7 > Hardware: x86-64 > OS: Linux > Tree: Mainline > Status: NEW > Severity: normal > Priority: P1 > Component: IPV4 > Assignee: step...@networkplumber.org > Reporter: combus...@archlinux.us > Regression: No > > After recompiling the 4.16.7 kernel with gcc 8.1, UBSAN reports the following: > > [ 25.427424] > > [ 25.429680] UBSAN: Undefined behaviour in net/ipv4/fib_trie.c:503:6 > [ 25.431920] member access within null pointer of type 'struct tnode' > [ 25.434153] CPU: 3 PID: 1 Comm: systemd Not tainted 4.16.7-CUSTOM #1 > [ 25.436384] Hardware name: Gigabyte Technology Co., Ltd. > H67MA-UD2H-B3/H67MA-UD2H-B3, BIOS F8 03/27/2012 > [ 25.438647] Call Trace: > [ 25.440889] dump_stack+0x62/0x9f > [ 25.443104] ubsan_epilogue+0x9/0x35 > [ 25.445293] handle_null_ptr_deref+0x80/0x90 > [ 25.447464] __ubsan_handle_type_mismatch_v1+0x6a/0x80 > [ 25.449628] tnode_free+0xce/0x120 > [ 25.451749] ? replace+0xa0/0x1f0 > [ 25.453833] ? resize+0x4e2/0xb70 > [ 25.455916] ? __kmalloc+0x1fe/0x2d0 > [ 25.457997] ? tnode_new+0x66/0x160 > [ 25.460072] ? fib_insert_alias+0x4a8/0x9e0 > [ 25.462145] ? fib_table_insert+0x208/0x690 > [ 25.464214] ? fib_magic+0x20c/0x310 > [ 25.466280] ? fib_netdev_event+0x81/0x200 > [ 25.468339] ? notifier_call_chain+0x63/0x110 > [ 25.470407] ? __dev_notify_flags+0xa8/0x170 > [ 25.472472] ? dev_change_flags+0x56/0x80 > [ 25.474538] ? do_setlink+0x3c2/0x1a00 > [ 25.476603] ? fib_magic+0x20c/0x310 > [ 25.478666] ? rtnl_setlink+0x129/0x1e0 > [ 25.480728] ? rtnetlink_rcv_msg+0x2a4/0x7d0 > [ 25.482765] ? rtnetlink_rcv+0x10/0x10 > [ 25.484757] ? netlink_rcv_skb+0x6f/0x170 > [ 25.486741] ? netlink_unicast+0x1c0/0x2d0 > [ 25.488716] ? netlink_sendmsg+0x2c1/0x630 > [ 25.490661] ? sock_sendmsg+0x49/0xb0 > [ 25.492564] ? SyS_sendto+0x12b/0x1d0 > [ 25.494449] ? do_syscall_64+0xad/0x5cc > [ 25.496305] ? page_fault+0x2f/0x50 > [ 25.498140] ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2 > [ 25.499974] > > > UBSAN reported nothing when the same kernel was compiled with gcc 7.3.1 from > Arch Linux repositories. > > I have three more similar reports to make, if I continue to c/p in each I'm > gonna feel like a fuzzbot... > And this one I'm seeing too (once at boot): [ 32.459535] [ 32.469133] UBSAN: Undefined behaviour in ../net/ipv4/fib_trie.c:504:6 [ 32.476534] member access within null pointer of type 'struct tnode' [ 32.483733] CPU: 8 PID: 1 Comm: systemd Not tainted 4.17.0-rc7-debug-01088-g47bffcfef048 #9 [ 32.493191] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.3.4 11/08/2016 [ 32.501680] Call Trace: [ 32.504513] dump_stack+0xe6/0x1a0 [ 32.508412] ? dump_stack_print_info.cold.0+0x1b/0x1b [ 32.514164] ? do_raw_spin_lock+0xcf/0x220 [ 32.518848] ubsan_epilogue+0x9/0x7a [ 32.522940] handle_null_ptr_deref+0x16b/0x1e0 [ 32.528008] ? ucs2_as_utf8+0x6b0/0x6b0 [ 32.532397] ? __x64_sys_sendto+0xe6/0x1d0 [ 32.537079] ? entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 32.543025] __ubsan_handle_type_mismatch_v1+0x16b/0x19e [ 32.549054] ? ubsan_type_mismatch_common.part.5.cold.9+0x1bb/0x1bb [ 32.556168] ? fib_find_node+0x350/0x350 [ 32.560655] tnode_free+0x115/0x180 [ 32.564655] replace+0x21d/0x5e0 [ 32.568361] ? fib_insert_alias+0x1b20/0x1b20 [ 32.573332] ? put_child+0x546/0x7b0 [ 32.577427] ? __kmalloc+0x1b1/0x5f0 [ 32.581520] ? fib_trie_seq_start+0x510/0x510 [ 32.586497] resize+0x1253/0x2150 [ 32.590299] ? netlink_sendmsg+0x7b5/0x10c0 [ 32.595074] ? __sys_sendto+0x340/0x680 [ 32.599460] ? do_syscall_64+0x14b/0x720 [ 32.603954] ? __node_free_rcu+0x70/0x70 [ 32.608442] ? rcu_lockdep_current_cpu_online+0x1e7/0x2c0 [ 32.614578] ? rcu_dynticks_curr_cpu_in_eqs+0xd6/0x1f0 [ 32.620435] ? lockdep_rtnl_is_held+0x16/0x20 [ 32.625401] ? put_child+0x546/0x7b0 [ 32.629494] ? __kmalloc+0x1b1/0x5f0 [ 32.633586] ? fib_trie_seq_start+0x510/0x510 [ 32.638561] ? tnode_new+0x6c/0x310 [ 32.642561] fib_insert_alias+0xe9c/0x1b20 [
Re: [PATCH net] bpfilter: fix race in pipe access
From: Alexei Starovoitov Date: Thu, 7 Jun 2018 15:31:14 -0700 > syzbot reported the following crash > [ 338.293946] bpfilter: read fail -512 > [ 338.304515] kasan: GPF could be caused by NULL-ptr deref or user memory > access > [ 338.311863] general protection fault: [#1] SMP KASAN > [ 338.344360] RIP: 0010:__vfs_write+0x4a6/0x960 > [ 338.426363] Call Trace: > [ 338.456967] __kernel_write+0x10c/0x380 > [ 338.460928] __bpfilter_process_sockopt+0x1d8/0x35b > [ 338.487103] bpfilter_mbox_request+0x4d/0xb0 > [ 338.491492] bpfilter_ip_get_sockopt+0x6b/0x90 > > This can happen when multiple cpus trying to talk to user mode process > via bpfilter_mbox_request(). One cpu grabs the mutex while another goes to > sleep on the same mutex. Then former cpu sees that umh pipe is down and > shuts down the pipes. Later cpu finally acquires the mutex and crashes > on freed pipe. > Fix the race by using info.pid as an indicator that umh and pipes are healthy > and check it after acquiring the mutex. > > Fixes: d2ba09c17a06 ("net: add skeleton of bpfilter kernel module") > Reported-by: syzbot+7ade6c94abb2774c0...@syzkaller.appspotmail.com > Signed-off-by: Alexei Starovoitov Applied, thanks Alexei.
Re: pull-request: bpf 2018-06-08
From: Daniel Borkmann Date: Fri, 8 Jun 2018 01:30:00 +0200 > The following pull-request contains BPF updates for your *net* tree. > > The main changes are: ... > Please consider pulling these changes from: > > git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git Pulled, thanks Daniel.
Re: [Bug 199643] New: UBSAN: Undefined behaviour in ./include/net/route.h:240:2
On Tue, 8 May 2018 08:52:35 -0600, David Ahern wrote: > On 5/7/18 10:12 PM, David Miller wrote: > > From: Stephen Hemminger > > Date: Mon, 7 May 2018 10:34:00 -0700 > > > >> Subject: [Bug 199643] New: UBSAN: Undefined behaviour in > >> ./include/net/route.h:240:2 > > > > That's an empty line in both of my trees. > > > > In 4.16.7 it is the dst_release in: > > static inline void ip_rt_put(struct rtable *rt) > { > /* dst_release() accepts a NULL parameter. > * We rely on dst being first structure in struct rtable > */ > BUILD_BUG_ON(offsetof(struct rtable, dst) != 0); > --->dst_release(>dst); I'm seeing these on net-next as of yesterday, but admittedly I haven't run with UBSAN enabled for a while :( Was it resolved? [ 293.130007] UBSAN: Undefined behaviour in ../include/net/route.h:239:2 [ 293.137408] member access within null pointer of type 'struct rtable' [ 293.144716] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.17.0-rc7-debug-01088-g47bffcfef048 #9 [ 293.154374] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.3.4 11/08/2016 [ 293.162866] Call Trace: [ 293.165696] [ 293.168045] dump_stack+0xe6/0x1a0 [ 293.171943] ? dump_stack_print_info.cold.0+0x1b/0x1b [ 293.177699] ? do_raw_spin_lock+0xcf/0x220 [ 293.182379] ubsan_epilogue+0x9/0x7a [ 293.186471] handle_null_ptr_deref+0x16b/0x1e0 [ 293.191535] ? ucs2_as_utf8+0x6b0/0x6b0 [ 293.195919] ? ip_mc_output+0x1610/0x1610 [ 293.200505] __ubsan_handle_type_mismatch_v1+0x16b/0x19e [ 293.206543] ? ubsan_type_mismatch_common.part.5.cold.9+0x1bb/0x1bb [ 293.213661] ip_send_unicast_reply+0x1b67/0x1d0e [ 293.218935] ? ip_make_skb+0x410/0x410 [ 293.223232] ? lock_acquire+0x1a2/0x5a0 [ 293.227622] ? lock_release+0x980/0x980 [ 293.232011] ? free_user_ns+0x300/0x300 [ 293.236396] ? rcu_dynticks_curr_cpu_in_eqs+0xd6/0x1f0 [ 293.242239] ? rcu_bh_qs+0x500/0x500 [ 293.246342] tcp_v4_send_reset+0x13c6/0x29f0 [ 293.251224] ? tcp_v4_inbound_md5_hash+0x650/0x650 [ 293.256698] ? debug_check_no_locks_freed+0x260/0x260 [ 293.262453] ? rcu_lockdep_current_cpu_online+0x1e7/0x2c0 [ 293.268586] ? rcu_dynticks_curr_cpu_in_eqs+0xd6/0x1f0 [ 293.274430] ? rcu_start_gp_advanced+0x740/0x740 [ 293.279688] ? rcu_bh_qs+0x500/0x500 [ 293.283790] ? tcp_v4_rcv+0xf9f/0x3ec0 [ 293.288075] tcp_v4_rcv+0xf9f/0x3ec0 [ 293.292189] ? tcp_v4_early_demux+0xa70/0xa70 [ 293.297179] ? __isolate_free_page+0x890/0x890 [ 293.302258] ? __accumulate_pelt_segments+0x29/0x40 [ 293.307819] ? lock_acquire+0x1a2/0x5a0 [ 293.312204] ? ip_local_deliver_finish+0x189/0xcd0 [ 293.317661] ? raw_rcv+0x510/0x510 [ 293.321564] ? rcu_lockdep_current_cpu_online+0x1e7/0x2c0 [ 293.327700] ? rcu_dynticks_curr_cpu_in_eqs+0xd6/0x1f0 [ 293.333546] ? rcu_start_gp_advanced+0x740/0x740 [ 293.338808] ? rcu_bh_qs+0x500/0x500 [ 293.342913] ip_local_deliver_finish+0x475/0xcd0 [ 293.348180] ? inet_add_protocol.cold.0+0x28/0x28 [ 293.353538] ? rcu_read_lock_bh_held+0xc0/0xc0 [ 293.358607] ? rcu_dynticks_curr_cpu_in_eqs+0xd6/0x1f0 [ 293.364455] ip_local_deliver+0x1a1/0x680 [ 293.369039] ? ip_call_ra_chain+0x700/0x700 [ 293.373816] ? rcu_lockdep_current_cpu_online+0x1e7/0x2c0 [ 293.379950] ? rcu_dynticks_curr_cpu_in_eqs+0xd6/0x1f0 [ 293.385792] ? rcu_start_gp_advanced+0x740/0x740 [ 293.391050] ? rcu_bh_qs+0x500/0x500 [ 293.395143] ? rb_erase+0x3460/0x3460 [ 293.399342] ip_rcv_finish+0x727/0x25c0 [ 293.403733] ? ip_local_deliver_finish+0xcd0/0xcd0 [ 293.409218] ? print_irqtrace_events+0x280/0x280 [ 293.414478] ? print_irqtrace_events+0x280/0x280 [ 293.419746] ? tcp_v4_send_synack+0x450/0x450 [ 293.424721] ? print_irqtrace_events+0x280/0x280 [ 293.429982] ? enqueue_entity+0x3760/0x3760 [ 293.434760] ? print_irqtrace_events+0x280/0x280 [ 293.440028] ip_rcv+0x973/0x1758 [ 293.443738] ? ip_local_deliver+0x680/0x680 [ 293.448513] ? print_irqtrace_events+0x280/0x280 [ 293.453771] ? print_irqtrace_events+0x280/0x280 [ 293.459021] ? print_irqtrace_events+0x280/0x280 [ 293.464283] ? print_irqtrace_events+0x280/0x280 [ 293.469549] ? rcu_lockdep_current_cpu_online+0x1e7/0x2c0 [ 293.475681] ? rcu_dynticks_curr_cpu_in_eqs+0xd6/0x1f0 [ 293.481526] ? rcu_start_gp_advanced+0x740/0x740 [ 293.486785] ? rcu_bh_qs+0x500/0x500 [ 293.490883] ? ip_local_deliver+0x680/0x680 [ 293.495659] __netif_receive_skb_core+0x23e7/0x5a80 [ 293.501244] ? debug_check_no_locks_freed+0x1e0/0x260 [ 293.506996] ? netif_schedule_queue+0x2c0/0x2c0 [ 293.512159] ? __lock_acquire+0x6ad/0x3b10 [ 293.516860] ? rcu_start_gp_advanced+0x740/0x740 [ 293.522122] ? debug_check_no_locks_freed+0x260/0x260 [ 293.527872] ? rcu_read_lock_sched_held+0x107/0x120 [ 293.533437] ? nfp_net_poll+0x87/0x1a0 [nfp] [ 293.538306] ? module_assert_mutex_or_preempt+0x41/0x70 [ 293.544244] ? __module_address+0xb4/0x860 [ 293.548935] ? unwind_next_frame+0x12e5/0x24d0 [ 293.554002] ?
Re: [PATCH v2] net-fq: Add WARN_ON check for null flow.
On Thu, Jun 7, 2018 at 4:48 PM, wrote: > diff --git a/include/net/fq_impl.h b/include/net/fq_impl.h > index be7c0fa..cb911f0 100644 > --- a/include/net/fq_impl.h > +++ b/include/net/fq_impl.h > @@ -78,7 +78,10 @@ static struct sk_buff *fq_tin_dequeue(struct fq *fq, > return NULL; > } > > - flow = list_first_entry(head, struct fq_flow, flowchain); > + flow = list_first_entry_or_null(head, struct fq_flow, flowchain); > + > + if (WARN_ON_ONCE(!flow)) > + return NULL; This does not make sense either. list_first_entry_or_null() returns NULL only when the list is empty, but we already check list_empty() right before this code, and it is protected by fq->lock.
[PATCH v2] net-fq: Add WARN_ON check for null flow.
From: Ben Greear While testing an ath10k firmware that often crashed under load, I was seeing kernel crashes as well. One of them appeared to be a dereference of a NULL flow object in fq_tin_dequeue. I have since fixed the firmware flaw, but I think it would be worth adding the WARN_ON in case the problem appears again. BUG: unable to handle kernel NULL pointer dereference at 003c IP: ieee80211_tx_dequeue+0xfb/0xb10 [mac80211] PGD 8001417fe067 P4D 8001417fe067 PUD 13db41067 PMD 0 Oops: [#1] PREEMPT SMP PTI Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 libcrc32c vrf 8021q garp mrp stp llc fuse macvlan wanlink(O) pktgen lm78 ] CPU: 2 PID: 21733 Comm: ip Tainted: GW O 4.16.8+ #35 Hardware name: _ _/, BIOS 5.11 08/26/2016 RIP: 0010:ieee80211_tx_dequeue+0xfb/0xb10 [mac80211] RSP: 0018:880172d03c30 EFLAGS: 00010286 RAX: 88013b2c RBX: 88013b2c00b8 RCX: 0898 RDX: 0001 RSI: 88013b2c00d8 RDI: 88016ac40820 RBP: 88016ac42ba0 R08: 0020 R09: R10: 0010 R11: 001256c89fd8 R12: 88013b2c R13: 88013b2c00d8 R14: R15: 88013b2c00d8 FS: 7f04e3606700() GS:880172d0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 003c CR3: 00013b35a005 CR4: 003606e0 DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400 Call Trace: ? update_load_avg+0x607/0x6f0 ath10k_mac_tx_push_txq+0x6e/0x220 [ath10k_core] ath10k_mac_tx_push_pending+0x151/0x1e0 [ath10k_core] ath10k_htt_txrx_compl_task+0x113e/0x1940 [ath10k_core] ? ath10k_ce_completed_send_next_nolock+0x6f/0x90 [ath10k_pci] ? ath10k_ce_completed_send_next+0x31/0x40 [ath10k_pci] ? ath10k_pci_htc_tx_cb+0x30/0xc0 [ath10k_pci] ? ath10k_bus_pci_write32+0x3c/0xa0 [ath10k_pci] ath10k_pci_napi_poll+0x44/0xf0 [ath10k_pci] net_rx_action+0x250/0x3b0 __do_softirq+0xc2/0x2c2 irq_exit+0x93/0xa0 do_IRQ+0x45/0xc0 common_interrupt+0xf/0xf Signed-off-by: Ben Greear --- * v2: Use list_first_entry_or_null as suggested. include/net/fq_impl.h | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/include/net/fq_impl.h b/include/net/fq_impl.h index be7c0fa..cb911f0 100644 --- a/include/net/fq_impl.h +++ b/include/net/fq_impl.h @@ -78,7 +78,10 @@ static struct sk_buff *fq_tin_dequeue(struct fq *fq, return NULL; } - flow = list_first_entry(head, struct fq_flow, flowchain); + flow = list_first_entry_or_null(head, struct fq_flow, flowchain); + + if (WARN_ON_ONCE(!flow)) + return NULL; if (flow->deficit <= 0) { flow->deficit += fq->quantum; -- 2.4.11
pull-request: bpf 2018-06-08
Hi David, The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) Fix in the BPF verifier to reject modified ctx pointers on helper functions, from Daniel. 2) Fix in BPF kselftests for get_cgroup_id_user() helper to only record the cgroup id for a provided pid in order to reduce test failures from processes interferring with the test, from Yonghong. 3) Fix a crash in AF_XDP's mem accounting when the process owning the sock has CAP_IPC_LOCK capabilities set, from Daniel. 4) Fix an issue for AF_XDP on 32 bit machines where XDP_UMEM_PGOFF_*_RING defines need ULL suffixes and use loff_t type as they are otherwise truncated, from Geert. Please consider pulling these changes from: git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git Thanks a lot! The following changes since commit 1c8c5a9d38f607c0b6fd12c91cbe1a4418762a21: Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next (2018-06-06 18:39:49 -0700) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git for you to fetch changes up to c09290c5637692a9bfe7740e4c5e693efff12810: bpf, xdp: fix crash in xdp_umem_unaccount_pages (2018-06-07 15:32:28 -0700) Daniel Borkmann (2): bpf: reject passing modified ctx to helper functions bpf, xdp: fix crash in xdp_umem_unaccount_pages Geert Uytterhoeven (1): xsk: Fix umem fill/completion queue mmap on 32-bit Yonghong Song (1): tools/bpf: fix selftest get_cgroup_id_user include/uapi/linux/if_xdp.h | 4 +- kernel/bpf/verifier.c| 48 +--- net/xdp/xdp_umem.c | 6 ++- net/xdp/xsk.c| 2 +- tools/testing/selftests/bpf/get_cgroup_id_kern.c | 14 +- tools/testing/selftests/bpf/get_cgroup_id_user.c | 12 - tools/testing/selftests/bpf/test_verifier.c | 58 +++- 7 files changed, 118 insertions(+), 26 deletions(-)
Re: [PATCH bpf-next v6 1/2] trace_helpers.c: Add helpers to poll multiple perf FDs for events
Hi Toke, On 06/06/2018 07:58 PM, Toke Høiland-Jørgensen wrote: > Add two new helper functions to trace_helpers that supports polling > multiple perf file descriptors for events. These are used to the XDP > perf_event_output example, which needs to work with one perf fd per CPU. > > Reviewed-by: Jakub Kicinski > Signed-off-by: Toke Høiland-Jørgensen Didn't make it in time unfortunately as bpf-next closed, please resend these two once merge window is over and bpf-next open again. Thanks a lot, Daniel
Re: [PATCH bpf] bpf, xdp: fix crash in xdp_umem_unaccount_pages
On Fri, Jun 08, 2018 at 12:06:01AM +0200, Daniel Borkmann wrote: > syzkaller was able to trigger the following panic for AF_XDP: > > BUG: KASAN: null-ptr-deref in atomic64_sub > include/asm-generic/atomic-instrumented.h:144 [inline] > BUG: KASAN: null-ptr-deref in atomic_long_sub > include/asm-generic/atomic-long.h:199 [inline] > BUG: KASAN: null-ptr-deref in xdp_umem_unaccount_pages.isra.4+0x3d/0x80 > net/xdp/xdp_umem.c:135 > Write of size 8 at addr 0060 by task syz-executor246/4527 > > CPU: 1 PID: 4527 Comm: syz-executor246 Not tainted 4.17.0+ #89 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > Google 01/01/2011 > Call Trace: >__dump_stack lib/dump_stack.c:77 [inline] >dump_stack+0x1b9/0x294 lib/dump_stack.c:113 >kasan_report_error mm/kasan/report.c:352 [inline] >kasan_report.cold.7+0x6d/0x2fe mm/kasan/report.c:412 >check_memory_region_inline mm/kasan/kasan.c:260 [inline] >check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267 >kasan_check_write+0x14/0x20 mm/kasan/kasan.c:278 >atomic64_sub include/asm-generic/atomic-instrumented.h:144 [inline] >atomic_long_sub include/asm-generic/atomic-long.h:199 [inline] >xdp_umem_unaccount_pages.isra.4+0x3d/0x80 net/xdp/xdp_umem.c:135 >xdp_umem_reg net/xdp/xdp_umem.c:334 [inline] >xdp_umem_create+0xd6c/0x10f0 net/xdp/xdp_umem.c:349 >xsk_setsockopt+0x443/0x550 net/xdp/xsk.c:531 >__sys_setsockopt+0x1bd/0x390 net/socket.c:1935 >__do_sys_setsockopt net/socket.c:1946 [inline] >__se_sys_setsockopt net/socket.c:1943 [inline] >__x64_sys_setsockopt+0xbe/0x150 net/socket.c:1943 >do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287 >entry_SYSCALL_64_after_hwframe+0x49/0xbe > > In xdp_umem_reg() the call to xdp_umem_account_pages() passed > with CAP_IPC_LOCK where we didn't need to end up charging rlimit > on memlock for the current user and therefore umem->user continues > to be NULL. Later on through fault injection syzkaller triggered > a failure in either umem->pgs or umem->pages allocation such that > we bail out and undo accounting in xdp_umem_unaccount_pages() > where we eventually hit the panic since it tries to deref the > umem->user. > > The code is pretty close to mm_account_pinned_pages() and > mm_unaccount_pinned_pages() pair and potentially could reuse > it even in a later cleanup, and it appears that the initial > commit c0c77d8fb787 ("xsk: add user memory registration support > sockopt") got this right while later follow-up introduced the > bug via a49049ea2576 ("xsk: simplified umem setup"). > > Fixes: a49049ea2576 ("xsk: simplified umem setup") > Reported-by: syzbot+979217770b09ebf5c...@syzkaller.appspotmail.com > Signed-off-by: Daniel Borkmann Applied, Thanks
Re: [Patch net] socket: close race condition between sock_close() and sockfs_setattr()
On Thu, Jun 07, 2018 at 03:15:15PM -0700, Cong Wang wrote: > > You do realize that the same ->setattr() can be called by way of > > chown() on /proc/self/fd/, right? What would you do there - > > bump refcount on that struct file when traversing that symlink and > > hold it past the end of pathname resolution, until... what exactly? > > I was thinking about this: > > error = user_path_at(dfd,); // hold dfd when needed > > if (!error) { > error = chown_common(, mode); > path_put(); // release dfd if held > > With this, we can guarantee ->release() is only possibly called > after chown_common() which is after ->setattr() too. No, we can't. You are assuming that there *is* dfd and that it points to the opened socket we are going to operate upon. That is not guaranteed. At all. If e.g. 42 is a file descriptor of an opened socket, plain chown(2) on /proc/self/fd/42 will trigger that ->setattr(). No dfd in sight. We do run across an opened file at some point, all right - when we traverse the symlink in procfs. You can't bump ->f_count there. Even leaving aside the "where would I stash the reference to that file?" and "how long would I hold it?", you can't bump ->f_count on other process' files. That would bugger the expectations of close(2) callers.
[PATCH net] bpfilter: fix race in pipe access
syzbot reported the following crash [ 338.293946] bpfilter: read fail -512 [ 338.304515] kasan: GPF could be caused by NULL-ptr deref or user memory access [ 338.311863] general protection fault: [#1] SMP KASAN [ 338.344360] RIP: 0010:__vfs_write+0x4a6/0x960 [ 338.426363] Call Trace: [ 338.456967] __kernel_write+0x10c/0x380 [ 338.460928] __bpfilter_process_sockopt+0x1d8/0x35b [ 338.487103] bpfilter_mbox_request+0x4d/0xb0 [ 338.491492] bpfilter_ip_get_sockopt+0x6b/0x90 This can happen when multiple cpus trying to talk to user mode process via bpfilter_mbox_request(). One cpu grabs the mutex while another goes to sleep on the same mutex. Then former cpu sees that umh pipe is down and shuts down the pipes. Later cpu finally acquires the mutex and crashes on freed pipe. Fix the race by using info.pid as an indicator that umh and pipes are healthy and check it after acquiring the mutex. Fixes: d2ba09c17a06 ("net: add skeleton of bpfilter kernel module") Reported-by: syzbot+7ade6c94abb2774c0...@syzkaller.appspotmail.com Signed-off-by: Alexei Starovoitov --- net/bpfilter/bpfilter_kern.c | 10 +++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/net/bpfilter/bpfilter_kern.c b/net/bpfilter/bpfilter_kern.c index b13d058f8c34..09522573f611 100644 --- a/net/bpfilter/bpfilter_kern.c +++ b/net/bpfilter/bpfilter_kern.c @@ -24,17 +24,19 @@ static void shutdown_umh(struct umh_info *info) { struct task_struct *tsk; + if (!info->pid) + return; tsk = pid_task(find_vpid(info->pid), PIDTYPE_PID); if (tsk) force_sig(SIGKILL, tsk); fput(info->pipe_to_umh); fput(info->pipe_from_umh); + info->pid = 0; } static void __stop_umh(void) { - if (IS_ENABLED(CONFIG_INET) && - bpfilter_process_sockopt) { + if (IS_ENABLED(CONFIG_INET)) { bpfilter_process_sockopt = NULL; shutdown_umh(); } @@ -55,7 +57,7 @@ static int __bpfilter_process_sockopt(struct sock *sk, int optname, struct mbox_reply reply; loff_t pos; ssize_t n; - int ret; + int ret = -EFAULT; req.is_set = is_set; req.pid = current->pid; @@ -63,6 +65,8 @@ static int __bpfilter_process_sockopt(struct sock *sk, int optname, req.addr = (long)optval; req.len = optlen; mutex_lock(_lock); + if (!info.pid) + goto out; n = __kernel_write(info.pipe_to_umh, , sizeof(req), ); if (n != sizeof(req)) { pr_err("write fail %zd\n", n); -- 2.9.5
Re: [PATCH bpf] tools/bpf: fix selftest get_cgroup_id_user
On 06/06/2018 06:12 PM, Yonghong Song wrote: > Commit f269099a7e7a ("tools/bpf: add a selftest for > bpf_get_current_cgroup_id() helper") added a test > for bpf_get_current_cgroup_id() helper. The bpf program > is attached to tracepoint syscalls/sys_enter_nanosleep > and will record the cgroup id if the tracepoint is hit. > The test program creates a cgroup and attachs itself to > this cgroup and expects that the test program process > cgroup id is the same as the cgroup_id retrieved > by the bpf program. > > In a light system where no other processes called > nanosleep syscall, the test case can pass. > In a busy system where many different processes can hit > syscalls/sys_enter_nanosleep tracepoint, the cgroup id > recorded by bpf program may not match the test program > process cgroup_id. > > This patch fixed an issue by communicating the test program > pid to bpf program. The bpf program only records > cgroup id if the current task pid is the same as > passed-in pid. This ensures that the recorded cgroup_id > is for the cgroup within which the test program resides. > > Fixes: f269099a7e7a ("tools/bpf: add a selftest for > bpf_get_current_cgroup_id() helper") > Signed-off-by: Yonghong Song Applied to bpf, thanks Yonghong!
Re: [Patch net] socket: close race condition between sock_close() and sockfs_setattr()
On Thu, Jun 7, 2018 at 3:04 PM, Al Viro wrote: > On Thu, Jun 07, 2018 at 02:45:58PM -0700, Cong Wang wrote: >> On Thu, Jun 7, 2018 at 2:26 PM, Al Viro wrote: >> > On Thu, Jun 07, 2018 at 01:39:49PM -0700, Cong Wang wrote: >> >> fchownat() doesn't even hold refcnt of fd until it figures out >> >> fd is really needed (otherwise is ignored) and releases it after >> >> it resolves the path. This means sock_close() could race with >> >> sockfs_setattr(), which leads to a NULL pointer dereference >> >> since typically we set sock->sk to NULL in ->release(). >> >> >> >> As pointed out by Al, this is unique to sockfs. So we can fix this >> >> in socket layer by acquiring inode_lock in sock_close() and >> >> checking against NULL in sockfs_setattr(). >> > >> > That looks like a massive overkill - it's way heavier than it should be. >> >> I don't see any other quick way to fix this. My initial thought is >> to keep that refcnt until path_put(), apparently you don't like it >> either. > > You do realize that the same ->setattr() can be called by way of > chown() on /proc/self/fd/, right? What would you do there - > bump refcount on that struct file when traversing that symlink and > hold it past the end of pathname resolution, until... what exactly? I was thinking about this: error = user_path_at(dfd,); // hold dfd when needed if (!error) { error = chown_common(, mode); path_put(); // release dfd if held With this, we can guarantee ->release() is only possibly called after chown_common() which is after ->setattr() too.
Hello Dear
Hello Fellow Charles Koch here, ceo charles koch foundation worldwide (the largest charity foundation in the world). This year under our motto of "Give while living", i and the foundation have decided to award $500,000.00 to a few selected lucky individuals and on receipt of this email kindly get back to me. Regards, Charles Koch. chkoch2...@dbzmail.com
Re: [PATCH] net-fq: Add WARN_ON check for null flow.
On 06/07/2018 02:52 PM, Cong Wang wrote: On Thu, Jun 7, 2018 at 2:41 PM, Ben Greear wrote: On 06/07/2018 02:29 PM, Cong Wang wrote: On Thu, Jun 7, 2018 at 9:06 AM, wrote: --- a/include/net/fq_impl.h +++ b/include/net/fq_impl.h @@ -80,6 +80,9 @@ static struct sk_buff *fq_tin_dequeue(struct fq *fq, flow = list_first_entry(head, struct fq_flow, flowchain); + if (WARN_ON_ONCE(!flow)) + return NULL; + How could even possibly list_first_entry() returns NULL? You need list_first_entry_or_null(). I don't know for certain flow as null, but something was NULL in this method near that line and it looked like a likely culprit. I guess possibly tin or fq was passed in as NULL? A NULL pointer is not always 0. You can trigger a NULL-ptr-def with 0x3c too, but you are checking against 0 in your patch, that is the problem and that is why list_first_entry_or_null() exists. Ahh, I see what you mean, and that is my mistake. In my case, it did seem to be a mostly-null deref, not a 0x0 deref. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
[PATCH bpf] bpf, xdp: fix crash in xdp_umem_unaccount_pages
syzkaller was able to trigger the following panic for AF_XDP: BUG: KASAN: null-ptr-deref in atomic64_sub include/asm-generic/atomic-instrumented.h:144 [inline] BUG: KASAN: null-ptr-deref in atomic_long_sub include/asm-generic/atomic-long.h:199 [inline] BUG: KASAN: null-ptr-deref in xdp_umem_unaccount_pages.isra.4+0x3d/0x80 net/xdp/xdp_umem.c:135 Write of size 8 at addr 0060 by task syz-executor246/4527 CPU: 1 PID: 4527 Comm: syz-executor246 Not tainted 4.17.0+ #89 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1b9/0x294 lib/dump_stack.c:113 kasan_report_error mm/kasan/report.c:352 [inline] kasan_report.cold.7+0x6d/0x2fe mm/kasan/report.c:412 check_memory_region_inline mm/kasan/kasan.c:260 [inline] check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267 kasan_check_write+0x14/0x20 mm/kasan/kasan.c:278 atomic64_sub include/asm-generic/atomic-instrumented.h:144 [inline] atomic_long_sub include/asm-generic/atomic-long.h:199 [inline] xdp_umem_unaccount_pages.isra.4+0x3d/0x80 net/xdp/xdp_umem.c:135 xdp_umem_reg net/xdp/xdp_umem.c:334 [inline] xdp_umem_create+0xd6c/0x10f0 net/xdp/xdp_umem.c:349 xsk_setsockopt+0x443/0x550 net/xdp/xsk.c:531 __sys_setsockopt+0x1bd/0x390 net/socket.c:1935 __do_sys_setsockopt net/socket.c:1946 [inline] __se_sys_setsockopt net/socket.c:1943 [inline] __x64_sys_setsockopt+0xbe/0x150 net/socket.c:1943 do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x49/0xbe In xdp_umem_reg() the call to xdp_umem_account_pages() passed with CAP_IPC_LOCK where we didn't need to end up charging rlimit on memlock for the current user and therefore umem->user continues to be NULL. Later on through fault injection syzkaller triggered a failure in either umem->pgs or umem->pages allocation such that we bail out and undo accounting in xdp_umem_unaccount_pages() where we eventually hit the panic since it tries to deref the umem->user. The code is pretty close to mm_account_pinned_pages() and mm_unaccount_pinned_pages() pair and potentially could reuse it even in a later cleanup, and it appears that the initial commit c0c77d8fb787 ("xsk: add user memory registration support sockopt") got this right while later follow-up introduced the bug via a49049ea2576 ("xsk: simplified umem setup"). Fixes: a49049ea2576 ("xsk: simplified umem setup") Reported-by: syzbot+979217770b09ebf5c...@syzkaller.appspotmail.com Signed-off-by: Daniel Borkmann --- net/xdp/xdp_umem.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c index 7eb4948..b9ef487 100644 --- a/net/xdp/xdp_umem.c +++ b/net/xdp/xdp_umem.c @@ -132,8 +132,10 @@ static void xdp_umem_unpin_pages(struct xdp_umem *umem) static void xdp_umem_unaccount_pages(struct xdp_umem *umem) { - atomic_long_sub(umem->npgs, >user->locked_vm); - free_uid(umem->user); + if (umem->user) { + atomic_long_sub(umem->npgs, >user->locked_vm); + free_uid(umem->user); + } } static void xdp_umem_release(struct xdp_umem *umem) -- 2.9.5
Re: [Patch net] socket: close race condition between sock_close() and sockfs_setattr()
On Thu, Jun 07, 2018 at 02:45:58PM -0700, Cong Wang wrote: > On Thu, Jun 7, 2018 at 2:26 PM, Al Viro wrote: > > On Thu, Jun 07, 2018 at 01:39:49PM -0700, Cong Wang wrote: > >> fchownat() doesn't even hold refcnt of fd until it figures out > >> fd is really needed (otherwise is ignored) and releases it after > >> it resolves the path. This means sock_close() could race with > >> sockfs_setattr(), which leads to a NULL pointer dereference > >> since typically we set sock->sk to NULL in ->release(). > >> > >> As pointed out by Al, this is unique to sockfs. So we can fix this > >> in socket layer by acquiring inode_lock in sock_close() and > >> checking against NULL in sockfs_setattr(). > > > > That looks like a massive overkill - it's way heavier than it should be. > > I don't see any other quick way to fix this. My initial thought is > to keep that refcnt until path_put(), apparently you don't like it > either. You do realize that the same ->setattr() can be called by way of chown() on /proc/self/fd/, right? What would you do there - bump refcount on that struct file when traversing that symlink and hold it past the end of pathname resolution, until... what exactly?
[PATCH net] net: aquantia: fix unsigned numvecs comparison with less than zero
From: Colin Ian King From: Colin Ian King This was originally mistakenly submitted to net-next. Resubmitting to net. The comparison of numvecs < 0 is always false because numvecs is a u32 and hence the error return from a failed call to pci_alloc_irq_vectores is never detected. Fix this by using the signed int ret to handle the error return and assign numvecs to err. Detected by CoverityScan, CID#1468650 ("Unsigned compared against 0") Fixes: a09bd81b5413 ("net: aquantia: Limit number of vectors to actually allocated irqs") Signed-off-by: Colin Ian King Signed-off-by: Igor Russkikh --- drivers/net/ethernet/aquantia/atlantic/aq_pci_func.c | 11 +-- 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_pci_func.c b/drivers/net/ethernet/aquantia/atlantic/aq_pci_func.c index a50e08bb4748..750007513f9d 100644 --- a/drivers/net/ethernet/aquantia/atlantic/aq_pci_func.c +++ b/drivers/net/ethernet/aquantia/atlantic/aq_pci_func.c @@ -267,14 +267,13 @@ static int aq_pci_probe(struct pci_dev *pdev, numvecs = min(numvecs, num_online_cpus()); /*enable interrupts */ #if !AQ_CFG_FORCE_LEGACY_INT - numvecs = pci_alloc_irq_vectors(self->pdev, 1, numvecs, - PCI_IRQ_MSIX | PCI_IRQ_MSI | - PCI_IRQ_LEGACY); + err = pci_alloc_irq_vectors(self->pdev, 1, numvecs, + PCI_IRQ_MSIX | PCI_IRQ_MSI | + PCI_IRQ_LEGACY); - if (numvecs < 0) { - err = numvecs; + if (err < 0) goto err_hwinit; - } + numvecs = err; #endif self->irqvecs = numvecs; -- 2.17.1
Re: [PATCH] net-fq: Add WARN_ON check for null flow.
On Thu, Jun 7, 2018 at 2:41 PM, Ben Greear wrote: > On 06/07/2018 02:29 PM, Cong Wang wrote: >> >> On Thu, Jun 7, 2018 at 9:06 AM, wrote: >>> >>> --- a/include/net/fq_impl.h >>> +++ b/include/net/fq_impl.h >>> @@ -80,6 +80,9 @@ static struct sk_buff *fq_tin_dequeue(struct fq *fq, >>> >>> flow = list_first_entry(head, struct fq_flow, flowchain); >>> >>> + if (WARN_ON_ONCE(!flow)) >>> + return NULL; >>> + >> >> >> How could even possibly list_first_entry() returns NULL? >> You need list_first_entry_or_null(). >> > > I don't know for certain flow as null, but something was NULL in this method > near that line and it looked like a likely culprit. > > I guess possibly tin or fq was passed in as NULL? A NULL pointer is not always 0. You can trigger a NULL-ptr-def with 0x3c too, but you are checking against 0 in your patch, that is the problem and that is why list_first_entry_or_null() exists.
Re: [Patch net] socket: close race condition between sock_close() and sockfs_setattr()
On Thu, Jun 7, 2018 at 2:26 PM, Al Viro wrote: > On Thu, Jun 07, 2018 at 01:39:49PM -0700, Cong Wang wrote: >> fchownat() doesn't even hold refcnt of fd until it figures out >> fd is really needed (otherwise is ignored) and releases it after >> it resolves the path. This means sock_close() could race with >> sockfs_setattr(), which leads to a NULL pointer dereference >> since typically we set sock->sk to NULL in ->release(). >> >> As pointed out by Al, this is unique to sockfs. So we can fix this >> in socket layer by acquiring inode_lock in sock_close() and >> checking against NULL in sockfs_setattr(). > > That looks like a massive overkill - it's way heavier than it should be. I don't see any other quick way to fix this. My initial thought is to keep that refcnt until path_put(), apparently you don't like it either. > And it's very likely to trigger shitloads of deadlock warnings, some > possibly even true. I do audit the inode_lock usage in networking, I don't see any deadlock, of course, there could be some non-networking code uses socket API that I missed. But _generally_, socket doesn't have a pointer to retrieve this inode.
Re: [PATCH] net-fq: Add WARN_ON check for null flow.
On 06/07/2018 02:29 PM, Cong Wang wrote: On Thu, Jun 7, 2018 at 9:06 AM, wrote: --- a/include/net/fq_impl.h +++ b/include/net/fq_impl.h @@ -80,6 +80,9 @@ static struct sk_buff *fq_tin_dequeue(struct fq *fq, flow = list_first_entry(head, struct fq_flow, flowchain); + if (WARN_ON_ONCE(!flow)) + return NULL; + How could even possibly list_first_entry() returns NULL? You need list_first_entry_or_null(). I don't know for certain flow as null, but something was NULL in this method near that line and it looked like a likely culprit. I guess possibly tin or fq was passed in as NULL? Anyway, if the patch seems worthless just ignore it. I'll leave it in my tree since it should be harmless and will let you know if I ever hit it. If someone else hits a similar crash, hopefully they can report it. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: [PATCH] net-fq: Add WARN_ON check for null flow.
On Thu, Jun 7, 2018 at 9:06 AM, wrote: > --- a/include/net/fq_impl.h > +++ b/include/net/fq_impl.h > @@ -80,6 +80,9 @@ static struct sk_buff *fq_tin_dequeue(struct fq *fq, > > flow = list_first_entry(head, struct fq_flow, flowchain); > > + if (WARN_ON_ONCE(!flow)) > + return NULL; > + How could even possibly list_first_entry() returns NULL? You need list_first_entry_or_null().
Re: [Patch net] socket: close race condition between sock_close() and sockfs_setattr()
On Thu, Jun 07, 2018 at 01:39:49PM -0700, Cong Wang wrote: > fchownat() doesn't even hold refcnt of fd until it figures out > fd is really needed (otherwise is ignored) and releases it after > it resolves the path. This means sock_close() could race with > sockfs_setattr(), which leads to a NULL pointer dereference > since typically we set sock->sk to NULL in ->release(). > > As pointed out by Al, this is unique to sockfs. So we can fix this > in socket layer by acquiring inode_lock in sock_close() and > checking against NULL in sockfs_setattr(). That looks like a massive overkill - it's way heavier than it should be. And it's very likely to trigger shitloads of deadlock warnings, some possibly even true.
Re: [PATCH net-next] bpfilter: fix OUTPUT_FORMAT
From: Alexei Starovoitov Date: Thu, 7 Jun 2018 10:29:29 -0700 > From: Alexei Starovoitov > > CONFIG_OUTPUT_FORMAT is x86 only macro. > Used objdump to extract elf file format. > > Fixes: d2ba09c17a06 ("net: add skeleton of bpfilter kernel module") > Reported-by: David S. Miller > Signed-off-by: Alexei Starovoitov Applied.
Re: [PATCH net] bonding: re-evaluate force_primary when the primary slave name changes
From: Xiangning Yu Date: Thu, 7 Jun 2018 13:39:59 +0800 > From: Xiangning Yu > > There is a timing issue under active-standy mode, when bond_enslave() is > called, bond->params.primary might not be initialized yet. > > Any time the primary slave string changes, bond->force_primary should be > set to true to make sure the primary becomes the active slave. > > Signed-off-by: Xiangning Yu Applied and queued up for -stable.
[Patch net] socket: close race condition between sock_close() and sockfs_setattr()
fchownat() doesn't even hold refcnt of fd until it figures out fd is really needed (otherwise is ignored) and releases it after it resolves the path. This means sock_close() could race with sockfs_setattr(), which leads to a NULL pointer dereference since typically we set sock->sk to NULL in ->release(). As pointed out by Al, this is unique to sockfs. So we can fix this in socket layer by acquiring inode_lock in sock_close() and checking against NULL in sockfs_setattr(). sock_release() is called in many places, only the sock_close() path matters here. And fortunately, this should not affect normal sock_close() as it is only called when the last fd refcnt is gone. It only affects sock_close() with a parallel sockfs_setattr() in progress, which is not common. Fixes: 86741ec25462 ("net: core: Add a UID field to struct sock.") Reported-by: shankarapailoor Cc: Tetsuo Handa Cc: Lorenzo Colitti Cc: Al Viro Signed-off-by: Cong Wang --- net/socket.c | 18 +++--- 1 file changed, 15 insertions(+), 3 deletions(-) diff --git a/net/socket.c b/net/socket.c index af57d85bcb48..8a109012608a 100644 --- a/net/socket.c +++ b/net/socket.c @@ -541,7 +541,10 @@ static int sockfs_setattr(struct dentry *dentry, struct iattr *iattr) if (!err && (iattr->ia_valid & ATTR_UID)) { struct socket *sock = SOCKET_I(d_inode(dentry)); - sock->sk->sk_uid = iattr->ia_uid; + if (sock->sk) + sock->sk->sk_uid = iattr->ia_uid; + else + err = -ENOENT; } return err; @@ -590,12 +593,16 @@ EXPORT_SYMBOL(sock_alloc); * an inode not a file. */ -void sock_release(struct socket *sock) +static void __sock_release(struct socket *sock, struct inode *inode) { if (sock->ops) { struct module *owner = sock->ops->owner; + if (inode) + inode_lock(inode); sock->ops->release(sock); + if (inode) + inode_unlock(inode); sock->ops = NULL; module_put(owner); } @@ -609,6 +616,11 @@ void sock_release(struct socket *sock) } sock->file = NULL; } + +void sock_release(struct socket *sock) +{ + __sock_release(sock, NULL); +} EXPORT_SYMBOL(sock_release); void __sock_tx_timestamp(__u16 tsflags, __u8 *tx_flags) @@ -1171,7 +1183,7 @@ static int sock_mmap(struct file *file, struct vm_area_struct *vma) static int sock_close(struct inode *inode, struct file *filp) { - sock_release(SOCKET_I(inode)); + __sock_release(SOCKET_I(inode), inode); return 0; } -- 2.13.0
Re: [PATCH] ip_tunnel: Fix name string concatenate in __ip_tunnel_create()
From: Sultan Alsawaf Date: Wed, 6 Jun 2018 15:56:54 -0700 > By passing a limit of 2 bytes to strncat, strncat is limited to writing > fewer bytes than what it's supposed to append to the name here. > > Since the bounds are checked on the line above this, just remove the string > bounds checks entirely since they're unneeded. > > Signed-off-by: Sultan Alsawaf Applied.
Re: [PATCH bpf-next v5 00/10] BTF: BPF Type Format
Em Thu, Jun 07, 2018 at 01:07:01PM -0700, Martin KaFai Lau escreveu: > On Thu, Jun 07, 2018 at 04:30:29PM -0300, Arnaldo Carvalho de Melo wrote: > > So this must be available in a newer llvm version? Which one? > I should have put in the details in my last email or > in the commit message, my bad. > 1. The tools/testing/selftests/bpf/Makefile has the CLANG_FLAGS and >LLC_FLAGS needed to compile the bpf prog. It requires a new >"-mattr=dwarf" llc option which was added to the future >llvm 7.0. >Hence, I have been using the llvm's master in github which >also has the llvm-objcopy. > 2. The kernel's btf part only focus on the BPF map. >Hence, the testing bpf program should have the map's key >and map's value. e.g. tools/testing/selftests/bpf/test_btf_haskv.c Thanks for the version required to test this, but where is this test_btf_haskv.c file? Which tree? net-next? Ok, just pulled torvalds/master and there it is. Gotcha. struct bpf_map_def SEC("maps") __bpf_stdout__ = { .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY, .key_size = sizeof(int), .value_size = sizeof(u32), .max_entries = __NR_CPUS__, }; This map is in the above hello.c example, but I guess its way too simple :-) Ok, I'll test this at home in another machine where I have the llvm's git repo. - Arnaldo
Re: [PATCH net] net: in virtio_net_hdr only add VLAN_HLEN to csum_start if payload holds vlan
From: Willem de Bruijn Date: Wed, 6 Jun 2018 11:23:01 -0400 > From: Willem de Bruijn > > Tun, tap, virtio, packet and uml vector all use struct virtio_net_hdr > to communicate packet metadata to userspace. > > For skbuffs with vlan, the first two return the packet as it may have > existed on the wire, inserting the VLAN tag in the user buffer. Then > virtio_net_hdr.csum_start needs to be adjusted by VLAN_HLEN bytes. > > Commit f09e2249c4f5 ("macvtap: restore vlan header on user read") > added this feature to macvtap. Commit 3ce9b20f1971 ("macvtap: Fix > csum_start when VLAN tags are present") then fixed up csum_start. > > Virtio, packet and uml do not insert the vlan header in the user > buffer. > > When introducing virtio_net_hdr_from_skb to deduplicate filling in > the virtio_net_hdr, the variant from macvtap which adds VLAN_HLEN was > applied uniformly, breaking csum offset for packets with vlan on > virtio and packet. > > Make insertion of VLAN_HLEN optional. Convert the callers to pass it > when needed. > > Fixes: e858fae2b0b8f4 ("virtio_net: use common code for virtio_net_hdr and > skb GSO conversion") > Fixes: 1276f24eeef2 ("packet: use common code for virtio_net_hdr and skb GSO > conversion") > Signed-off-by: Willem de Bruijn Applied and queued up for -stable, thanks Willem.
Re: [PATCH bpf-next v5 00/10] BTF: BPF Type Format
On Thu, Jun 07, 2018 at 04:30:29PM -0300, Arnaldo Carvalho de Melo wrote: > Em Thu, Jun 07, 2018 at 12:05:10PM -0700, Martin KaFai Lau escreveu: > > On Thu, Jun 07, 2018 at 11:03:37AM -0300, Arnaldo Carvalho de Melo wrote: > > > Em Thu, Jun 07, 2018 at 10:54:01AM -0300, Arnaldo Carvalho de Melo > > > escreveu: > > > > Em Tue, Jun 05, 2018 at 02:25:48PM -0700, Martin KaFai Lau escreveu: > > > > > [ btw, the latest commit (1 commit) should be 94a11b59e592 ]. > > > > So, the commit log message for the pahole patch is non-existent: > > > > https://github.com/iamkafai/pahole/commit/94a11b59e5920908085bfc8d24c92f95c8ffceaf > > > > we should do better in describing what is done and how, I'm staring > > > with a message you sent to the kernel part: > > > -- > > > This patch introduces BPF Type Format (BTF). > > > > BTF (BPF Type Format) is the meta data format which describes > > > the data types of BPF program/map. Hence, it basically focus > > > on the C programming language which the modern BPF is primary > > > using. The first use case is to provide a generic pretty print > > > capability for a BPF map. > > > I will add details in the next github respin/push. > > Ok, but I can do that if there is nothing else to do in the code at this > stage :-) > > > > Now I'm going to do the step-by-step guide on testing the feature just > > > introduced, and will try to convert from dwarf to BTF and back, compare > > > the pahole output for types encoded in DWARF and BTF, etc. > > > > If you have something ressembling this already, please share. > > > The pahole only has the encoder part. I tested with the verbose output > > from the "pahole -V -J". Loading the btf to the kernel is also tested. > > Ok, so here it goes my first stab at testing, using perf's BPF > integration: > > [root@jouet bpf]# cat hello.c > #include "stdio.h" > > int syscall_enter(openat)(void *ctx) > { > puts("Hello, world\n"); > return 0; > } > [root@jouet bpf]# cat ~/.perfconfig > [llvm] > dump-obj = true > [root@jouet bpf]# perf trace -e open*,hello.c touch /tmp/hello.BTF > LLVM: dumping hello.o > 0.017 ( ): __bpf_stdout__:Hello, world > 0.019 ( 0.011 ms): touch/28147 openat(dfd: CWD, filename: > /etc/ld.so.cache, flags: CLOEXEC ) = 3 > 0.053 ( ): __bpf_stdout__:Hello, world > 0.055 ( 0.011 ms): touch/28147 openat(dfd: CWD, filename: > /lib64/libc.so.6, flags: CLOEXEC ) = 3 > 0.354 ( 0.012 ms): touch/28147 open(filename: > /usr/lib/locale/locale-archive, flags: CLOEXEC ) = 3 > 0.411 ( ): __bpf_stdout__:Hello, world > 0.412 ( 0.198 ms): touch/28147 openat(dfd: CWD, filename: > /tmp/hello.BTF, flags: CREAT|NOCTTY|NONBLOCK|WRONLY, mode: IRUGO|IWUGO) = 3 > [root@jouet bpf]# > [root@jouet bpf]# file hello.o > hello.o: ELF 64-bit LSB relocatable, *unknown arch 0xf7* version 1 (SYSV), > not stripped > [root@jouet bpf]# pahole --btf_encode hello.o > pahole: hello.o: No debugging information found > [root@jouet bpf]# > > [root@jouet bpf]# readelf -s hello.o > > Symbol table '.symtab' contains 5 entries: >Num:Value Size TypeBind Vis Ndx Name > 0: 0 NOTYPE LOCAL DEFAULT UND > 1: 0 NOTYPE GLOBAL DEFAULT7 __bpf_stdout__ > 2: 0 NOTYPE GLOBAL DEFAULT5 _license > 3: 0 NOTYPE GLOBAL DEFAULT6 _version > 4: 0 NOTYPE GLOBAL DEFAULT3 > syscall_enter_openat > [root@jouet bpf]# > [root@jouet bpf]# readelf -SW hello.o > There are 10 section headers, starting at offset 0x1f8: > > Section Headers: > [Nr] Name TypeAddress OffSize ES > Flg Lk Inf Al > [ 0] NULL 00 00 00 > 0 0 0 > [ 1] .strtab STRTAB 000178 7f 00 > 0 0 1 > [ 2] .text PROGBITS 40 00 00 > AX 0 0 4 > [ 3] syscalls:sys_enter_openat PROGBITS 40 > 88 00 AX 0 0 8 > [ 4] .relsyscalls:sys_enter_openat REL 000168 > 10 10 9 3 8 > [ 5] license PROGBITS c8 04 00 > WA 0 0 1 > [ 6] version PROGBITS cc 04 00 > WA 0 0 4 > [ 7] maps PROGBITS d0 10 00 > WA 0 0 4 > [ 8] .rodata.str1.1PROGBITS e0 0e 01 > AMS 0 0 1 > [ 9] .symtab SYMTAB f0 78 18 > 1 1 8 > Key to Flags: > W (write), A (alloc), X (execute), M (merge), S (strings), I (info), > L (link order), O (extra OS processing required), G (group), T (TLS), > C (compressed), x (unknown), o (OS specific), E (exclude), > p (processor
Re: [PATCH bpf] bpf: reject passing modified ctx to helper functions
On Thu, Jun 07, 2018 at 05:40:03PM +0200, Daniel Borkmann wrote: > As commit 28e33f9d78ee ("bpf: disallow arithmetic operations on > context pointer") already describes, f1174f77b50c ("bpf/verifier: > rework value tracking") removed the specific white-listed cases > we had previously where we would allow for pointer arithmetic in > order to further generalize it, and allow e.g. context access via > modified registers. While the dereferencing of modified context > pointers had been forbidden through 28e33f9d78ee, syzkaller did > recently manage to trigger several KASAN splats for slab out of > bounds access and use after frees by simply passing a modified > context pointer to a helper function which would then do the bad > access since verifier allowed it in adjust_ptr_min_max_vals(). > > Rejecting arithmetic on ctx pointer in adjust_ptr_min_max_vals() > generally could break existing programs as there's a valid use > case in tracing in combination with passing the ctx to helpers as > bpf_probe_read(), where the register then becomes unknown at > verification time due to adding a non-constant offset to it. An > access sequence may look like the following: > > offset = args->filename; /* field __data_loc filename */ > bpf_probe_read(, len, (char *)args + offset); // args is ctx > > There are two options: i) we could special case the ctx and as > soon as we add a constant or bounded offset to it (hence ctx type > wouldn't change) we could turn the ctx into an unknown scalar, or > ii) we generalize the sanity test for ctx member access into a > small helper and assert it on the ctx register that was passed > as a function argument. Fwiw, latter is more obvious and less > complex at the same time, and one case that may potentially be > legitimate in future for ctx member access at least would be for > ctx to carry a const offset. Therefore, fix follows approach > from ii) and adds test cases to BPF kselftests. > > Fixes: f1174f77b50c ("bpf/verifier: rework value tracking") > Reported-by: syzbot+3d0b2441dbb717516...@syzkaller.appspotmail.com > Reported-by: syzbot+c8504affd4fdd0c1b...@syzkaller.appspotmail.com > Reported-by: syzbot+e5190cb881d8660fb...@syzkaller.appspotmail.com > Reported-by: syzbot+efae31b384d5badbd...@syzkaller.appspotmail.com > Signed-off-by: Daniel Borkmann > Acked-by: Alexei Starovoitov Applied, Thanks
Re: [PATCH bpf-next v5 00/10] BTF: BPF Type Format
Em Thu, Jun 07, 2018 at 12:05:10PM -0700, Martin KaFai Lau escreveu: > On Thu, Jun 07, 2018 at 11:03:37AM -0300, Arnaldo Carvalho de Melo wrote: > > Em Thu, Jun 07, 2018 at 10:54:01AM -0300, Arnaldo Carvalho de Melo escreveu: > > > Em Tue, Jun 05, 2018 at 02:25:48PM -0700, Martin KaFai Lau escreveu: > > > > [ btw, the latest commit (1 commit) should be 94a11b59e592 ]. > > So, the commit log message for the pahole patch is non-existent: > > https://github.com/iamkafai/pahole/commit/94a11b59e5920908085bfc8d24c92f95c8ffceaf > > we should do better in describing what is done and how, I'm staring > > with a message you sent to the kernel part: > > -- > > This patch introduces BPF Type Format (BTF). > > BTF (BPF Type Format) is the meta data format which describes > > the data types of BPF program/map. Hence, it basically focus > > on the C programming language which the modern BPF is primary > > using. The first use case is to provide a generic pretty print > > capability for a BPF map. > I will add details in the next github respin/push. Ok, but I can do that if there is nothing else to do in the code at this stage :-) > > Now I'm going to do the step-by-step guide on testing the feature just > > introduced, and will try to convert from dwarf to BTF and back, compare > > the pahole output for types encoded in DWARF and BTF, etc. > > If you have something ressembling this already, please share. > The pahole only has the encoder part. I tested with the verbose output > from the "pahole -V -J". Loading the btf to the kernel is also tested. Ok, so here it goes my first stab at testing, using perf's BPF integration: [root@jouet bpf]# cat hello.c #include "stdio.h" int syscall_enter(openat)(void *ctx) { puts("Hello, world\n"); return 0; } [root@jouet bpf]# cat ~/.perfconfig [llvm] dump-obj = true [root@jouet bpf]# perf trace -e open*,hello.c touch /tmp/hello.BTF LLVM: dumping hello.o 0.017 ( ): __bpf_stdout__:Hello, world 0.019 ( 0.011 ms): touch/28147 openat(dfd: CWD, filename: /etc/ld.so.cache, flags: CLOEXEC ) = 3 0.053 ( ): __bpf_stdout__:Hello, world 0.055 ( 0.011 ms): touch/28147 openat(dfd: CWD, filename: /lib64/libc.so.6, flags: CLOEXEC ) = 3 0.354 ( 0.012 ms): touch/28147 open(filename: /usr/lib/locale/locale-archive, flags: CLOEXEC ) = 3 0.411 ( ): __bpf_stdout__:Hello, world 0.412 ( 0.198 ms): touch/28147 openat(dfd: CWD, filename: /tmp/hello.BTF, flags: CREAT|NOCTTY|NONBLOCK|WRONLY, mode: IRUGO|IWUGO) = 3 [root@jouet bpf]# [root@jouet bpf]# file hello.o hello.o: ELF 64-bit LSB relocatable, *unknown arch 0xf7* version 1 (SYSV), not stripped [root@jouet bpf]# pahole --btf_encode hello.o pahole: hello.o: No debugging information found [root@jouet bpf]# [root@jouet bpf]# readelf -s hello.o Symbol table '.symtab' contains 5 entries: Num:Value Size TypeBind Vis Ndx Name 0: 0 NOTYPE LOCAL DEFAULT UND 1: 0 NOTYPE GLOBAL DEFAULT7 __bpf_stdout__ 2: 0 NOTYPE GLOBAL DEFAULT5 _license 3: 0 NOTYPE GLOBAL DEFAULT6 _version 4: 0 NOTYPE GLOBAL DEFAULT3 syscall_enter_openat [root@jouet bpf]# [root@jouet bpf]# readelf -SW hello.o There are 10 section headers, starting at offset 0x1f8: Section Headers: [Nr] Name TypeAddress OffSize ES Flg Lk Inf Al [ 0] NULL 00 00 00 0 0 0 [ 1] .strtab STRTAB 000178 7f 00 0 0 1 [ 2] .text PROGBITS 40 00 00 AX 0 0 4 [ 3] syscalls:sys_enter_openat PROGBITS 40 88 00 AX 0 0 8 [ 4] .relsyscalls:sys_enter_openat REL 000168 10 10 9 3 8 [ 5] license PROGBITS c8 04 00 WA 0 0 1 [ 6] version PROGBITS cc 04 00 WA 0 0 4 [ 7] maps PROGBITS d0 10 00 WA 0 0 4 [ 8] .rodata.str1.1PROGBITS e0 0e 01 AMS 0 0 1 [ 9] .symtab SYMTAB f0 78 18 1 1 8 Key to Flags: W (write), A (alloc), X (execute), M (merge), S (strings), I (info), L (link order), O (extra OS processing required), G (group), T (TLS), C (compressed), x (unknown), o (OS specific), E (exclude), p (processor specific) [root@jouet bpf]# Humm, lemme try something, add -g to clang-opt: [root@jouet bpf]# cat ~/.perfconfig [llvm] dump-obj = true clang-opt = -g [root@jouet bpf]# perf trace -e open*,hello.c touch /tmp/hello.BTF LLVM: dumping hello.o 0.185 ( ): __bpf_stdout__:Hello, world 0.188
Re: [PATCH bpf-next v5 00/10] BTF: BPF Type Format
On Thu, Jun 07, 2018 at 11:03:37AM -0300, Arnaldo Carvalho de Melo wrote: > Em Thu, Jun 07, 2018 at 10:54:01AM -0300, Arnaldo Carvalho de Melo escreveu: > > Em Tue, Jun 05, 2018 at 02:25:48PM -0700, Martin KaFai Lau escreveu: > > > [ btw, the latest commit (1 commit) should be 94a11b59e592 ]. > > So, the commit log message for the pahole patch is non-existent: > > https://github.com/iamkafai/pahole/commit/94a11b59e5920908085bfc8d24c92f95c8ffceaf > > we should do better in describing what is done and how, I'm staring > with a message you sent to the kernel part: > > -- > This patch introduces BPF Type Format (BTF). > > BTF (BPF Type Format) is the meta data format which describes > the data types of BPF program/map. Hence, it basically focus > on the C programming language which the modern BPF is primary > using. The first use case is to provide a generic pretty print > capability for a BPF map. > -- I will add details in the next github respin/push. > > Now I'm going to do the step-by-step guide on testing the feature just > introduced, and will try to convert from dwarf to BTF and back, compare > the pahole output for types encoded in DWARF and BTF, etc. > > If you have something ressembling this already, please share. The pahole only has the encoder part. I tested with the verbose output from the "pahole -V -J". Loading the btf to the kernel is also tested.
Re: [PATCH bpf] bpf: reject passing modified ctx to helper functions
On 07/06/18 16:40, Daniel Borkmann wrote: > As commit 28e33f9d78ee ("bpf: disallow arithmetic operations on > context pointer") already describes, f1174f77b50c ("bpf/verifier: > rework value tracking") removed the specific white-listed cases > we had previously where we would allow for pointer arithmetic in > order to further generalize it, and allow e.g. context access via > modified registers. While the dereferencing of modified context > pointers had been forbidden through 28e33f9d78ee, syzkaller did > recently manage to trigger several KASAN splats for slab out of > bounds access and use after frees by simply passing a modified > context pointer to a helper function which would then do the bad > access since verifier allowed it in adjust_ptr_min_max_vals(). > > Rejecting arithmetic on ctx pointer in adjust_ptr_min_max_vals() > generally could break existing programs as there's a valid use > case in tracing in combination with passing the ctx to helpers as > bpf_probe_read(), where the register then becomes unknown at > verification time due to adding a non-constant offset to it. An > access sequence may look like the following: > > offset = args->filename; /* field __data_loc filename */ > bpf_probe_read(, len, (char *)args + offset); // args is ctx > > There are two options: i) we could special case the ctx and as > soon as we add a constant or bounded offset to it (hence ctx type > wouldn't change) we could turn the ctx into an unknown scalar, or > ii) we generalize the sanity test for ctx member access into a > small helper and assert it on the ctx register that was passed > as a function argument. Fwiw, latter is more obvious and less > complex at the same time, and one case that may potentially be > legitimate in future for ctx member access at least would be for > ctx to carry a const offset. Therefore, fix follows approach > from ii) and adds test cases to BPF kselftests. > > Fixes: f1174f77b50c ("bpf/verifier: rework value tracking") > Reported-by: syzbot+3d0b2441dbb717516...@syzkaller.appspotmail.com > Reported-by: syzbot+c8504affd4fdd0c1b...@syzkaller.appspotmail.com > Reported-by: syzbot+e5190cb881d8660fb...@syzkaller.appspotmail.com > Reported-by: syzbot+efae31b384d5badbd...@syzkaller.appspotmail.com > Signed-off-by: Daniel Borkmann > Acked-by: Alexei Starovoitov Acked-by: Edward Cree
Re: [PATCH bpf] bpf: reject passing modified ctx to helper functions
On Thu, Jun 7, 2018 at 8:40 AM, Daniel Borkmann wrote: > As commit 28e33f9d78ee ("bpf: disallow arithmetic operations on > context pointer") already describes, f1174f77b50c ("bpf/verifier: > rework value tracking") removed the specific white-listed cases > we had previously where we would allow for pointer arithmetic in > order to further generalize it, and allow e.g. context access via > modified registers. While the dereferencing of modified context > pointers had been forbidden through 28e33f9d78ee, syzkaller did > recently manage to trigger several KASAN splats for slab out of > bounds access and use after frees by simply passing a modified > context pointer to a helper function which would then do the bad > access since verifier allowed it in adjust_ptr_min_max_vals(). > > Rejecting arithmetic on ctx pointer in adjust_ptr_min_max_vals() > generally could break existing programs as there's a valid use > case in tracing in combination with passing the ctx to helpers as > bpf_probe_read(), where the register then becomes unknown at > verification time due to adding a non-constant offset to it. An > access sequence may look like the following: > > offset = args->filename; /* field __data_loc filename */ > bpf_probe_read(, len, (char *)args + offset); // args is ctx > > There are two options: i) we could special case the ctx and as > soon as we add a constant or bounded offset to it (hence ctx type > wouldn't change) we could turn the ctx into an unknown scalar, or > ii) we generalize the sanity test for ctx member access into a > small helper and assert it on the ctx register that was passed > as a function argument. Fwiw, latter is more obvious and less > complex at the same time, and one case that may potentially be > legitimate in future for ctx member access at least would be for > ctx to carry a const offset. Therefore, fix follows approach > from ii) and adds test cases to BPF kselftests. > > Fixes: f1174f77b50c ("bpf/verifier: rework value tracking") > Reported-by: syzbot+3d0b2441dbb717516...@syzkaller.appspotmail.com > Reported-by: syzbot+c8504affd4fdd0c1b...@syzkaller.appspotmail.com > Reported-by: syzbot+e5190cb881d8660fb...@syzkaller.appspotmail.com > Reported-by: syzbot+efae31b384d5badbd...@syzkaller.appspotmail.com > Signed-off-by: Daniel Borkmann > Acked-by: Alexei Starovoitov Tested with bcc for the above ctx + unknown_offset usage for bpf_probe_read. No breakage. Acked-by: Yonghong Song > --- > kernel/bpf/verifier.c | 48 +++- > tools/testing/selftests/bpf/test_verifier.c | 58 > - > 2 files changed, 88 insertions(+), 18 deletions(-) > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c > index d6403b5..cced0c1 100644 > --- a/kernel/bpf/verifier.c > +++ b/kernel/bpf/verifier.c > @@ -1617,6 +1617,30 @@ static int get_callee_stack_depth(struct > bpf_verifier_env *env, > } > #endif > > +static int check_ctx_reg(struct bpf_verifier_env *env, > +const struct bpf_reg_state *reg, int regno) > +{ > + /* Access to ctx or passing it to a helper is only allowed in > +* its original, unmodified form. > +*/ > + > + if (reg->off) { > + verbose(env, "dereference of modified ctx ptr R%d off=%d > disallowed\n", > + regno, reg->off); > + return -EACCES; > + } > + > + if (!tnum_is_const(reg->var_off) || reg->var_off.value) { > + char tn_buf[48]; > + > + tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off); > + verbose(env, "variable ctx access var_off=%s disallowed\n", > tn_buf); > + return -EACCES; > + } > + > + return 0; > +} > + > /* truncate register to smaller size (in bytes) > * must be called with size < BPF_REG_SIZE > */ > @@ -1686,24 +1710,11 @@ static int check_mem_access(struct bpf_verifier_env > *env, int insn_idx, u32 regn > verbose(env, "R%d leaks addr into ctx\n", > value_regno); > return -EACCES; > } > - /* ctx accesses must be at a fixed offset, so that we can > -* determine what type of data were returned. > -*/ > - if (reg->off) { > - verbose(env, > - "dereference of modified ctx ptr R%d > off=%d+%d, ctx+const is allowed, ctx+const+const is not\n", > - regno, reg->off, off - reg->off); > - return -EACCES; > - } > - if (!tnum_is_const(reg->var_off) || reg->var_off.value) { > - char tn_buf[48]; > > - tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off); > - verbose(env, > - "variable ctx access var_off=%s off=%d > size=%d", > - tn_buf, off,
[PATCH net-next] bpfilter: fix OUTPUT_FORMAT
From: Alexei Starovoitov CONFIG_OUTPUT_FORMAT is x86 only macro. Used objdump to extract elf file format. Fixes: d2ba09c17a06 ("net: add skeleton of bpfilter kernel module") Reported-by: David S. Miller Signed-off-by: Alexei Starovoitov --- net/bpfilter/Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/bpfilter/Makefile b/net/bpfilter/Makefile index aafa72001fcd..e0bbe7583e58 100644 --- a/net/bpfilter/Makefile +++ b/net/bpfilter/Makefile @@ -21,7 +21,7 @@ endif # which bpfilter_kern.c passes further into umh blob loader at run-time quiet_cmd_copy_umh = GEN $@ cmd_copy_umh = echo ':' > $(obj)/.bpfilter_umh.o.cmd; \ - $(OBJCOPY) -I binary -O $(CONFIG_OUTPUT_FORMAT) \ + $(OBJCOPY) -I binary -O `$(OBJDUMP) -f $<|grep format|cut -d' ' -f8` \ -B `$(OBJDUMP) -f $<|grep architecture|cut -d, -f1|cut -d' ' -f2` \ --rename-section .data=.init.rodata $< $@ -- 2.9.5
KMSAN: uninit-value in ebt_stp_mt_check (2)
Hello, syzbot found the following crash on: HEAD commit:c6a6aed994b6 kmsan: remove dead code to trigger syzbot build git tree: https://github.com/google/kmsan.git/master console output: https://syzkaller.appspot.com/x/log.txt?x=17bde74f80 kernel config: https://syzkaller.appspot.com/x/.config?x=848e40757852af3e dashboard link: https://syzkaller.appspot.com/bug?extid=da4494182233c23a5fcf compiler: clang version 7.0.0 (trunk 334104) Unfortunately, I don't have any reproducer for this crash yet. IMPORTANT: if you fix the bug, please add the following tag to the commit: Reported-by: syzbot+da4494182233c23a5...@syzkaller.appspotmail.com == BUG: KMSAN: uninit-value in ebt_stp_mt_check+0x24b/0x450 net/bridge/netfilter/ebt_stp.c:162 CPU: 0 PID: 12006 Comm: syz-executor7 Not tainted 4.17.0+ #3 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x185/0x1d0 lib/dump_stack.c:113 kmsan_report+0x149/0x260 mm/kmsan/kmsan.c:1084 __msan_warning_32+0x6e/0xc0 mm/kmsan/kmsan_instr.c:620 ebt_stp_mt_check+0x24b/0x450 net/bridge/netfilter/ebt_stp.c:162 xt_check_match+0x1438/0x1650 net/netfilter/x_tables.c:506 ebt_check_match net/bridge/netfilter/ebtables.c:372 [inline] ebt_check_entry net/bridge/netfilter/ebtables.c:702 [inline] translate_table+0x4e88/0x6120 net/bridge/netfilter/ebtables.c:943 do_replace_finish+0x1258/0x2ea0 net/bridge/netfilter/ebtables.c:999 do_replace+0x719/0x780 net/bridge/netfilter/ebtables.c:1138 do_ebt_set_ctl+0x2ab/0x3c0 net/bridge/netfilter/ebtables.c:1517 nf_sockopt net/netfilter/nf_sockopt.c:106 [inline] nf_setsockopt+0x47c/0x4e0 net/netfilter/nf_sockopt.c:115 ip_setsockopt+0x24b/0x2b0 net/ipv4/ip_sockglue.c:1251 udp_setsockopt+0x108/0x1b0 net/ipv4/udp.c:2416 sock_common_setsockopt+0x13b/0x170 net/core/sock.c:3039 __sys_setsockopt+0x496/0x540 net/socket.c:1903 __do_sys_setsockopt net/socket.c:1914 [inline] __se_sys_setsockopt net/socket.c:1911 [inline] __x64_sys_setsockopt+0x15c/0x1c0 net/socket.c:1911 do_syscall_64+0x15b/0x230 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x4559f9 RSP: 002b:7f45b9246c68 EFLAGS: 0246 ORIG_RAX: 0036 RAX: ffda RBX: 7f45b92476d4 RCX: 004559f9 RDX: 0080 RSI: RDI: 0014 RBP: 0072bea0 R08: 0300 R09: R10: 2480 R11: 0246 R12: R13: 004c0d6d R14: 004d07c8 R15: Local variable description: mtpar.i@translate_table Variable was created at: translate_table+0xbb/0x6120 net/bridge/netfilter/ebtables.c:831 do_replace_finish+0x1258/0x2ea0 net/bridge/netfilter/ebtables.c:999 == --- This bug is generated by a bot. It may contain errors. See https://goo.gl/tpsmEJ for more information about syzbot. syzbot engineers can be reached at syzkal...@googlegroups.com. syzbot will keep track of this bug report. See: https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with syzbot.
Re: [PATCH net] failover: eliminate callback hell
On Thu, Jun 07, 2018 at 09:17:42AM -0700, Stephen Hemminger wrote: > On Thu, 7 Jun 2018 18:41:31 +0300 > "Michael S. Tsirkin" wrote: > > > > > Why would DPDK care what we do in the kernel? Isn't it just slapping > > > > vfio-pci on the netdevs it sees? > > > > > > Alex, you are correct for Intel devices; but DPDK on Azure is not Intel > > > based.,. > > > The DPDK support uses: > > > * Mellanox MLX5 which uses the Infinband hooks to do DMA directly to > > >userspace. This means VF netdev device must exist and be visible. > > > * Slow path using kernel netvsc device, TAP and BPF to get exception > > >path packets to userspace. > > > * A autodiscovery mechanism that to set all this up that relies on > > >2 device model and sysfs. > > > > Could you describe what does it look for exactly? What will break if > > instead of MLX5 being a child of the PV, it's a child of the failover > > device? > > So in DPDK there is an internal four device model: > 1. failsafe is like failover in your model > 2. TAP is used like netvsc in kernel > 3. MLX5 is the VF > 4. vdev_netvsc is a pseudo device whose only reason to exist > is to glue everything together. > > Digging deeper inside... > > Vdev_netvsc does: >* driver is started in a convuluted way off device arguments >* probe routine for driver runs > - scans list of kernel interfaces in sysfs > - matches those using VMBUS Could you tell a bit more what does this step entail? > - skip netvsc devices that have an IPV4 route >* scan for PCI devices that have same MAC address as kernel netvsc > devices discovered in previous step >* add these interfaces to arguments to failsafe > > Then failsafe configures based on arguments on device > > The code works but is specific to the Azure hardware model, and exposes lots > of things to application that it should not have to care about. > > If you try and walk through this code in DPDK, you will see why I have > developed > a dislike for high levels of indirection. > > > Thanks that was helpful! I'll try to poke at it next week. Just from the description it seems the kernel is merely used to locate the MAC address through sysfs and that for this DPDK code to keep working the hidden device must be hidden from it in sysfs - is that a fair summary? -- MST
Re: [RFC PATCH 1/2] net: macb: Add CAP to disable hardware TX checksum offloading
Hi Nicolas, On 06/04/2018 10:13 AM, Nicolas Ferre wrote: On 25/05/2018 at 23:44, Jennifer Dahm wrote: diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c index 3e93df5..a5d564b 100644 --- a/drivers/net/ethernet/cadence/macb_main.c +++ b/drivers/net/ethernet/cadence/macb_main.c @@ -3360,8 +3360,12 @@ static int macb_init(struct platform_device *pdev) dev->hw_features |= MACB_NETIF_LSO; /* Checksum offload is only available on gem with packet buffer */ - if (macb_is_gem(bp) && !(bp->caps & MACB_CAPS_FIFO_MODE)) - dev->hw_features |= NETIF_F_HW_CSUM | NETIF_F_RXCSUM; + if (macb_is_gem(bp) && !(bp->caps & MACB_CAPS_FIFO_MODE)) { + if (!(bp->caps & MACB_CAPS_DISABLE_TX_HW_CSUM)) Why not the other way around? negating a "disabled" feature is always challenge ;-) + dev->hw_features |= NETIF_F_HW_CSUM | NETIF_F_RXCSUM; + else + dev->hw_features |= NETIF_F_RXCSUM; + } if (bp->caps & MACB_CAPS_SG_DISABLED) dev->hw_features &= ~NETIF_F_SG; dev->features = dev->hw_features; I can switch the ordering of the if-else clauses if that's what you're nitpicking. ;) Alternatively, if you're asking why the flag is used to disable rather than enable checksum offloading: I was working under the assumption that this was an isolated bug, and so an opt-out would require less maintainance than an opt-in. If we discover that this is a problem across a wide variety of Cadence IP, it would definitely make sense to replace it with an opt-in (i.e. MACB_CAPS_TX_HW_CSUM_ENABLED). Best, Jennifer Dahm
Re: [PATCH] net-fq: Add WARN_ON check for null flow.
On 06/07/2018 09:17 AM, Eric Dumazet wrote: On 06/07/2018 09:06 AM, gree...@candelatech.com wrote: From: Ben Greear While testing an ath10k firmware that often crashed under load, I was seeing kernel crashes as well. One of them appeared to be a dereference of a NULL flow object in fq_tin_dequeue. I have since fixed the firmware flaw, but I think it would be worth adding the WARN_ON in case the problem appears again. common_interrupt+0xf/0xf Please find the exact commit that brought this bug, and add a corresponding Fixes: tag It will be a total pain to bisect this problem since my test case that causes this is running my modified firmware (and a buggy one at that), modified ath10k driver (to work with this firmware and support my test case easily), and the failure case appears to cause multiple different-but-probably-related crashes and often hangs or reboots the test system. Probably this is all caused by some nasty race or buggy logic related to dealing with a crashed ath10k firmware tearing down txq logic from the bottom up. There have been many such bugs in the past, I and others fixed a few, and very likely more remain. For what it is worth, I didn't see this crash in 4.13, and I spent some time testing buggy firmware there occasionally. If someone else has interest in debugging the ath10k driver, I will be happy to generate a mostly-stock firmware image with ability to crash in the TX path and give it to them. It will crash the stock upstream code reliably in my experience. Thanks, Ben Signed-off-by: Ben Greear --- include/net/fq_impl.h | 3 +++ 1 file changed, 3 insertions(+) diff --git a/include/net/fq_impl.h b/include/net/fq_impl.h index be7c0fa..e40354d 100644 --- a/include/net/fq_impl.h +++ b/include/net/fq_impl.h @@ -80,6 +80,9 @@ static struct sk_buff *fq_tin_dequeue(struct fq *fq, flow = list_first_entry(head, struct fq_flow, flowchain); + if (WARN_ON_ONCE(!flow)) + return NULL; + if (flow->deficit <= 0) { flow->deficit += fq->quantum; list_move_tail(>flowchain, -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: [PATCH net] net: ipv6: ip6_output: alloc skb with tailroom
On Thu, Jun 7, 2018 at 12:22 PM, Alexander Aring wrote: > Hi, > > On Wed, Jun 06, 2018 at 04:26:19PM -0400, Willem de Bruijn wrote: >> On Wed, Jun 6, 2018 at 2:11 PM, David Miller wrote: >> > From: Alexander Aring >> > Date: Wed, 6 Jun 2018 14:09:20 -0400 >> > >> >> okay, then you want to have this patch for net-next? As an optimization? >> >> >> >> Of course, when it's open again. >> > >> > Like you, I have questions about where this adjustment is applied and >> > why. So I'm not sure yet. >> > >> > For example, only IPV6 really takes it into consideration and as you >> > saw only really for the fragmentation path and not the normal output >> > path. >> > >> > This needs more consideration and investigation. >> >> This is the unconditional skb_put in ieee802154_tx. In many cases >> there is some tailroom due to SKB_DATA_ALIGN in __alloc_skb, >> so it may take a specific case to not have even 2 bytes of tailroom >> available. > > Yes it's in ieee802154_tx, but we need tailroom not just for checksum. > The bugreport is related to the two bytes of tailroom, because virtual > hardware doing checksum by software. The most real transceivers offload > this feature, so zero tailroom is needed. > > I will of course add checks before adding L2 header for headroom and > tailroom in related subsystem code. > > In IEEE 802.15.4 and secured enabled frames we need a MIC field at the > end of the frame. In worst case this can be 16 bytes. > > I looked ethernet macsec feature and it seems they need to have a similar > reseved tailroom which is 16 bytes by default (max 32 bytes). Allocating with necessary tailroom to avoid a realloc in the path sounds fine to me, too. Packet sockets also take needed_tailroom into account, to give another example. We just have to avoid papering over bugs. Fixing the locations that unconditionally move the tail pointer beyond the allocated region is more important.
Re: Fw: [Bug 199963] New: UDP rx_queue incorrect calculation in /proc/net/udp
On Thu, 2018-06-07 at 07:39 -0700, Stephen Hemminger wrote: > > Begin forwarded message: > > Date: Thu, 07 Jun 2018 13:21:23 + > From: bugzilla-dae...@bugzilla.kernel.org > To: step...@networkplumber.org > Subject: [Bug 199963] New: UDP rx_queue incorrect calculation in /proc/net/udp > > > https://bugzilla.kernel.org/show_bug.cgi?id=199963 > > Bug ID: 199963 >Summary: UDP rx_queue incorrect calculation in /proc/net/udp >Product: Networking >Version: 2.5 > Kernel Version: Kernels >= 4.15 > Hardware: All > OS: Linux > Tree: Mainline > Status: NEW > Severity: normal > Priority: P1 > Component: IPV4 > Assignee: step...@networkplumber.org > Reporter: trevor.fran...@46labs.com > Regression: No > > since upgrading to any kernel >= 4.15 the rx_queue in /proc/net/udp is now > reporting a queue, regardless of system load and regardless of what > applications are running on it. The tx_queue is always 0, but rx_queue has > seemingly random spikes of udp queueing. This is observed across hundreds of > servers with either varying or no workload. > > netstat -nl|grep ^udp > udp 4352 0 0.0.0.0:68 0.0.0.0:* > > sl local_address rem_address st tx_queue rx_queue tr tm->when retrnsmt uid > timeout inode ref pointer drops > 14645: 357F:0035 : 07 :C900 00: > 101 0 3367 2 8da177fdcc00 0 after: commit 0d4a6608f68c7532dcbfec2ea1150c9761767d03 Author: Paolo Abeni Date: Tue Sep 19 12:11:43 2017 +0200 udp: do rmem bulk free even if the rx sk queue is empty rmem can be a non-0, small, value even if the rx queue is empty, because it reports the (forward) allocated memory for this socket, not the queue length in bytes. I'll cook a patch to fix the above. Cheers, Paolo
Re: [PATCH net] net: ipv6: ip6_output: alloc skb with tailroom
Hi, On Wed, Jun 06, 2018 at 04:26:19PM -0400, Willem de Bruijn wrote: > On Wed, Jun 6, 2018 at 2:11 PM, David Miller wrote: > > From: Alexander Aring > > Date: Wed, 6 Jun 2018 14:09:20 -0400 > > > >> okay, then you want to have this patch for net-next? As an optimization? > >> > >> Of course, when it's open again. > > > > Like you, I have questions about where this adjustment is applied and > > why. So I'm not sure yet. > > > > For example, only IPV6 really takes it into consideration and as you > > saw only really for the fragmentation path and not the normal output > > path. > > > > This needs more consideration and investigation. > > This is the unconditional skb_put in ieee802154_tx. In many cases > there is some tailroom due to SKB_DATA_ALIGN in __alloc_skb, > so it may take a specific case to not have even 2 bytes of tailroom > available. Yes it's in ieee802154_tx, but we need tailroom not just for checksum. The bugreport is related to the two bytes of tailroom, because virtual hardware doing checksum by software. The most real transceivers offload this feature, so zero tailroom is needed. I will of course add checks before adding L2 header for headroom and tailroom in related subsystem code. In IEEE 802.15.4 and secured enabled frames we need a MIC field at the end of the frame. In worst case this can be 16 bytes. I looked ethernet macsec feature and it seems they need to have a similar reseved tailroom which is 16 bytes by default (max 32 bytes). Maybe it's worth to take care for the tailroom in this path since it's not just 2 bytes in some cases. --- Meanwhile I think I found a bug in macsec, I cc Sabrina here: diff --git a/drivers/net/macsec.c b/drivers/net/macsec.c index 7de88b33d5b9..687323c0caf5 100644 --- a/drivers/net/macsec.c +++ b/drivers/net/macsec.c @@ -522,7 +522,7 @@ static bool macsec_validate_skb(struct sk_buff *skb, u16 icv_len) } #define MACSEC_NEEDED_HEADROOM (macsec_extra_len(true)) -#define MACSEC_NEEDED_TAILROOM MACSEC_STD_ICV_LEN +#define MACSEC_NEEDED_TAILROOM MACSEC_MAX_ICV_LEN static void macsec_fill_iv(unsigned char *iv, sci_t sci, u32 pn) { --- MACSEC_NEEDED_TAILROOM is the define to check and run skb_copy_expand() and should use the ?worst case? or the the value (icv_len + ?extra_foo?) is set as runtime generation on newlink. I see that in macsec_newlink() following code: if (data && data[IFLA_MACSEC_ICV_LEN]) icv_len = nla_get_u8(data[IFLA_MACSEC_ICV_LEN]); so the user can change it to (even a value above 32?, there is no check for that). Anyway everything higher than MACSEC_STD_ICV_LEN could run into a skb_over_panic(). - Alex
Re: [PATCH net] failover: eliminate callback hell
On Thu, 7 Jun 2018 18:41:31 +0300 "Michael S. Tsirkin" wrote: > > > Why would DPDK care what we do in the kernel? Isn't it just slapping > > > vfio-pci on the netdevs it sees? > > > > Alex, you are correct for Intel devices; but DPDK on Azure is not Intel > > based.,. > > The DPDK support uses: > > * Mellanox MLX5 which uses the Infinband hooks to do DMA directly to > >userspace. This means VF netdev device must exist and be visible. > > * Slow path using kernel netvsc device, TAP and BPF to get exception > >path packets to userspace. > > * A autodiscovery mechanism that to set all this up that relies on > >2 device model and sysfs. > > Could you describe what does it look for exactly? What will break if > instead of MLX5 being a child of the PV, it's a child of the failover > device? So in DPDK there is an internal four device model: 1. failsafe is like failover in your model 2. TAP is used like netvsc in kernel 3. MLX5 is the VF 4. vdev_netvsc is a pseudo device whose only reason to exist is to glue everything together. Digging deeper inside... Vdev_netvsc does: * driver is started in a convuluted way off device arguments * probe routine for driver runs - scans list of kernel interfaces in sysfs - matches those using VMBUS - skip netvsc devices that have an IPV4 route * scan for PCI devices that have same MAC address as kernel netvsc devices discovered in previous step * add these interfaces to arguments to failsafe Then failsafe configures based on arguments on device The code works but is specific to the Azure hardware model, and exposes lots of things to application that it should not have to care about. If you try and walk through this code in DPDK, you will see why I have developed a dislike for high levels of indirection.
Re: [PATCH] net-fq: Add WARN_ON check for null flow.
On 06/07/2018 09:06 AM, gree...@candelatech.com wrote: > From: Ben Greear > > While testing an ath10k firmware that often crashed under load, > I was seeing kernel crashes as well. One of them appeared to > be a dereference of a NULL flow object in fq_tin_dequeue. > > I have since fixed the firmware flaw, but I think it would be > worth adding the WARN_ON in case the problem appears again. > > common_interrupt+0xf/0xf > > Please find the exact commit that brought this bug, and add a corresponding Fixes: tag > Signed-off-by: Ben Greear > --- > include/net/fq_impl.h | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/include/net/fq_impl.h b/include/net/fq_impl.h > index be7c0fa..e40354d 100644 > --- a/include/net/fq_impl.h > +++ b/include/net/fq_impl.h > @@ -80,6 +80,9 @@ static struct sk_buff *fq_tin_dequeue(struct fq *fq, > > flow = list_first_entry(head, struct fq_flow, flowchain); > > + if (WARN_ON_ONCE(!flow)) > + return NULL; > + > if (flow->deficit <= 0) { > flow->deficit += fq->quantum; > list_move_tail(>flowchain, >
[PATCH] net-fq: Add WARN_ON check for null flow.
From: Ben Greear While testing an ath10k firmware that often crashed under load, I was seeing kernel crashes as well. One of them appeared to be a dereference of a NULL flow object in fq_tin_dequeue. I have since fixed the firmware flaw, but I think it would be worth adding the WARN_ON in case the problem appears again. BUG: unable to handle kernel NULL pointer dereference at 003c IP: ieee80211_tx_dequeue+0xfb/0xb10 [mac80211] PGD 8001417fe067 P4D 8001417fe067 PUD 13db41067 PMD 0 Oops: [#1] PREEMPT SMP PTI Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 libcrc32c vrf 8021q garp mrp stp llc fuse macvlan wanlink(O) pktgen lm78 ] CPU: 2 PID: 21733 Comm: ip Tainted: GW O 4.16.8+ #35 Hardware name: _ _/, BIOS 5.11 08/26/2016 RIP: 0010:ieee80211_tx_dequeue+0xfb/0xb10 [mac80211] RSP: 0018:880172d03c30 EFLAGS: 00010286 RAX: 88013b2c RBX: 88013b2c00b8 RCX: 0898 RDX: 0001 RSI: 88013b2c00d8 RDI: 88016ac40820 RBP: 88016ac42ba0 R08: 0020 R09: R10: 0010 R11: 001256c89fd8 R12: 88013b2c R13: 88013b2c00d8 R14: R15: 88013b2c00d8 FS: 7f04e3606700() GS:880172d0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 003c CR3: 00013b35a005 CR4: 003606e0 DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400 Call Trace: ? update_load_avg+0x607/0x6f0 ath10k_mac_tx_push_txq+0x6e/0x220 [ath10k_core] ath10k_mac_tx_push_pending+0x151/0x1e0 [ath10k_core] ath10k_htt_txrx_compl_task+0x113e/0x1940 [ath10k_core] ? ath10k_ce_completed_send_next_nolock+0x6f/0x90 [ath10k_pci] ? ath10k_ce_completed_send_next+0x31/0x40 [ath10k_pci] ? ath10k_pci_htc_tx_cb+0x30/0xc0 [ath10k_pci] ? ath10k_bus_pci_write32+0x3c/0xa0 [ath10k_pci] ath10k_pci_napi_poll+0x44/0xf0 [ath10k_pci] net_rx_action+0x250/0x3b0 __do_softirq+0xc2/0x2c2 irq_exit+0x93/0xa0 do_IRQ+0x45/0xc0 common_interrupt+0xf/0xf Signed-off-by: Ben Greear --- include/net/fq_impl.h | 3 +++ 1 file changed, 3 insertions(+) diff --git a/include/net/fq_impl.h b/include/net/fq_impl.h index be7c0fa..e40354d 100644 --- a/include/net/fq_impl.h +++ b/include/net/fq_impl.h @@ -80,6 +80,9 @@ static struct sk_buff *fq_tin_dequeue(struct fq *fq, flow = list_first_entry(head, struct fq_flow, flowchain); + if (WARN_ON_ONCE(!flow)) + return NULL; + if (flow->deficit <= 0) { flow->deficit += fq->quantum; list_move_tail(>flowchain, -- 2.4.11
Fw: [Bug 199963] New: UDP rx_queue incorrect calculation in /proc/net/udp
Begin forwarded message: Date: Thu, 07 Jun 2018 13:21:23 + From: bugzilla-dae...@bugzilla.kernel.org To: step...@networkplumber.org Subject: [Bug 199963] New: UDP rx_queue incorrect calculation in /proc/net/udp https://bugzilla.kernel.org/show_bug.cgi?id=199963 Bug ID: 199963 Summary: UDP rx_queue incorrect calculation in /proc/net/udp Product: Networking Version: 2.5 Kernel Version: Kernels >= 4.15 Hardware: All OS: Linux Tree: Mainline Status: NEW Severity: normal Priority: P1 Component: IPV4 Assignee: step...@networkplumber.org Reporter: trevor.fran...@46labs.com Regression: No since upgrading to any kernel >= 4.15 the rx_queue in /proc/net/udp is now reporting a queue, regardless of system load and regardless of what applications are running on it. The tx_queue is always 0, but rx_queue has seemingly random spikes of udp queueing. This is observed across hundreds of servers with either varying or no workload. netstat -nl|grep ^udp udp 4352 0 0.0.0.0:68 0.0.0.0:* sl local_address rem_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode ref pointer drops 14645: 357F:0035 : 07 :C900 00: 101 0 3367 2 8da177fdcc00 0 -- You are receiving this mail because: You are the assignee for the bug.
Re: [PATCH net] failover: eliminate callback hell
On Thu, Jun 07, 2018 at 07:51:12AM -0700, Stephen Hemminger wrote: > On Thu, 7 Jun 2018 07:17:51 -0700 > Alexander Duyck wrote: > > > On Wed, Jun 6, 2018 at 3:25 PM, Stephen Hemminger > > wrote: > > > On Wed, 6 Jun 2018 14:54:04 -0700 > > > "Samudrala, Sridhar" wrote: > > > > > >> On 6/6/2018 2:24 PM, Stephen Hemminger wrote: > > >> > On Wed, 6 Jun 2018 15:30:27 +0300 > > >> > "Michael S. Tsirkin" wrote: > > >> > > > >> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote: > > >> >>> Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org > > >> >>> wrote: > > >> The net failover should be a simple library, not a virtual > > >> object with function callbacks (see callback hell). > > >> >>> Why just a library? It should do a common things. I think it should > > >> >>> be a > > >> >>> virtual object. Looks like your patch again splits the common > > >> >>> functionality into multiple drivers. That is kind of backwards > > >> >>> attitude. > > >> >>> I don't get it. We should rather focus on fixing the mess the > > >> >>> introduction of netvsc-bonding caused and switch netvsc to 3-netdev > > >> >>> model. > > >> >> So it seems that at least one benefit for netvsc would be better > > >> >> handling of renames. > > >> >> > > >> >> Question is how can this change to 3-netdev happen? Stephen is > > >> >> concerned about risk of breaking some userspace. > > >> >> > > >> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to > > >> >> address, and you said then "why not use existing network namespaces > > >> >> rather than inventing a new abstraction". So how about it then? Do you > > >> >> want to find a way to use namespaces to hide the PV device for netvsc > > >> >> compatibility? > > >> >> > > >> > Netvsc can't work with 3 dev model. MS has worked with enough distro's > > >> > and > > >> > startups that all demand eth0 always be present. And VF may come and > > >> > go. > > >> > After this history, there is a strong motivation not to change how > > >> > kernel > > >> > behaves. Switching to 3 device model would be perceived as breaking > > >> > existing userspace. > > >> > > >> I think it should be possible for netvsc to work with 3 dev model if the > > >> only > > >> requirement is that eth0 will always be present. With net_failover, you > > >> will > > >> see eth0 and eth0nsby OR with older distros eth0 and eth1. It may be an > > >> issue > > >> if somehow there is userspace requirement that there can be only 2 > > >> netdevs, not 3 > > >> when VF is plugged. > > >> > > >> eth0 will be the net_failover device and eth0nsby/eth1 will be the > > >> netvsc device > > >> and the IP address gets configured on eth0. Will this be an issue? > > > > > > DPDK drivers in 18.05 depend on 2 device model. Yes it is a bit of mess > > > but that is the way it is. > > > > Why would DPDK care what we do in the kernel? Isn't it just slapping > > vfio-pci on the netdevs it sees? > > Alex, you are correct for Intel devices; but DPDK on Azure is not Intel > based.,. > The DPDK support uses: > * Mellanox MLX5 which uses the Infinband hooks to do DMA directly to >userspace. This means VF netdev device must exist and be visible. > * Slow path using kernel netvsc device, TAP and BPF to get exception >path packets to userspace. > * A autodiscovery mechanism that to set all this up that relies on >2 device model and sysfs. Could you describe what does it look for exactly? What will break if instead of MLX5 being a child of the PV, it's a child of the failover device? > In this version, there is no VFIO-PCI. And also Hyper-V does not have virtual > IOMMU so VFIO will not work there at all. >
[PATCH bpf] bpf: reject passing modified ctx to helper functions
As commit 28e33f9d78ee ("bpf: disallow arithmetic operations on context pointer") already describes, f1174f77b50c ("bpf/verifier: rework value tracking") removed the specific white-listed cases we had previously where we would allow for pointer arithmetic in order to further generalize it, and allow e.g. context access via modified registers. While the dereferencing of modified context pointers had been forbidden through 28e33f9d78ee, syzkaller did recently manage to trigger several KASAN splats for slab out of bounds access and use after frees by simply passing a modified context pointer to a helper function which would then do the bad access since verifier allowed it in adjust_ptr_min_max_vals(). Rejecting arithmetic on ctx pointer in adjust_ptr_min_max_vals() generally could break existing programs as there's a valid use case in tracing in combination with passing the ctx to helpers as bpf_probe_read(), where the register then becomes unknown at verification time due to adding a non-constant offset to it. An access sequence may look like the following: offset = args->filename; /* field __data_loc filename */ bpf_probe_read(, len, (char *)args + offset); // args is ctx There are two options: i) we could special case the ctx and as soon as we add a constant or bounded offset to it (hence ctx type wouldn't change) we could turn the ctx into an unknown scalar, or ii) we generalize the sanity test for ctx member access into a small helper and assert it on the ctx register that was passed as a function argument. Fwiw, latter is more obvious and less complex at the same time, and one case that may potentially be legitimate in future for ctx member access at least would be for ctx to carry a const offset. Therefore, fix follows approach from ii) and adds test cases to BPF kselftests. Fixes: f1174f77b50c ("bpf/verifier: rework value tracking") Reported-by: syzbot+3d0b2441dbb717516...@syzkaller.appspotmail.com Reported-by: syzbot+c8504affd4fdd0c1b...@syzkaller.appspotmail.com Reported-by: syzbot+e5190cb881d8660fb...@syzkaller.appspotmail.com Reported-by: syzbot+efae31b384d5badbd...@syzkaller.appspotmail.com Signed-off-by: Daniel Borkmann Acked-by: Alexei Starovoitov --- kernel/bpf/verifier.c | 48 +++- tools/testing/selftests/bpf/test_verifier.c | 58 - 2 files changed, 88 insertions(+), 18 deletions(-) diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index d6403b5..cced0c1 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -1617,6 +1617,30 @@ static int get_callee_stack_depth(struct bpf_verifier_env *env, } #endif +static int check_ctx_reg(struct bpf_verifier_env *env, +const struct bpf_reg_state *reg, int regno) +{ + /* Access to ctx or passing it to a helper is only allowed in +* its original, unmodified form. +*/ + + if (reg->off) { + verbose(env, "dereference of modified ctx ptr R%d off=%d disallowed\n", + regno, reg->off); + return -EACCES; + } + + if (!tnum_is_const(reg->var_off) || reg->var_off.value) { + char tn_buf[48]; + + tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off); + verbose(env, "variable ctx access var_off=%s disallowed\n", tn_buf); + return -EACCES; + } + + return 0; +} + /* truncate register to smaller size (in bytes) * must be called with size < BPF_REG_SIZE */ @@ -1686,24 +1710,11 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn verbose(env, "R%d leaks addr into ctx\n", value_regno); return -EACCES; } - /* ctx accesses must be at a fixed offset, so that we can -* determine what type of data were returned. -*/ - if (reg->off) { - verbose(env, - "dereference of modified ctx ptr R%d off=%d+%d, ctx+const is allowed, ctx+const+const is not\n", - regno, reg->off, off - reg->off); - return -EACCES; - } - if (!tnum_is_const(reg->var_off) || reg->var_off.value) { - char tn_buf[48]; - tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off); - verbose(env, - "variable ctx access var_off=%s off=%d size=%d", - tn_buf, off, size); - return -EACCES; - } + err = check_ctx_reg(env, reg, regno); + if (err < 0) + return err; + err = check_ctx_access(env, insn_idx, off, size, t, _type); if (!err && t == BPF_READ && value_regno >= 0) { /* ctx access returns either a
Re: [PATCH net] failover: eliminate callback hell
On Thu, 7 Jun 2018 17:57:50 +0300 "Michael S. Tsirkin" wrote: > On Wed, Jun 06, 2018 at 03:24:08PM -0700, Stephen Hemminger wrote: > > On Thu, 7 Jun 2018 00:47:52 +0300 > > "Michael S. Tsirkin" wrote: > > > > > On Wed, Jun 06, 2018 at 02:24:47PM -0700, Stephen Hemminger wrote: > > > > On Wed, 6 Jun 2018 15:30:27 +0300 > > > > "Michael S. Tsirkin" wrote: > > > > > > > > > On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote: > > > > > > Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org > > > > > > wrote: > > > > > > >The net failover should be a simple library, not a virtual > > > > > > >object with function callbacks (see callback hell). > > > > > > > > > > > > Why just a library? It should do a common things. I think it should > > > > > > be a > > > > > > virtual object. Looks like your patch again splits the common > > > > > > functionality into multiple drivers. That is kind of backwards > > > > > > attitude. > > > > > > I don't get it. We should rather focus on fixing the mess the > > > > > > introduction of netvsc-bonding caused and switch netvsc to 3-netdev > > > > > > model. > > > > > > > > > > So it seems that at least one benefit for netvsc would be better > > > > > handling of renames. > > > > > > > > > > Question is how can this change to 3-netdev happen? Stephen is > > > > > concerned about risk of breaking some userspace. > > > > > > > > > > Stephen, this seems to be the usecase that IFF_HIDDEN was trying to > > > > > address, and you said then "why not use existing network namespaces > > > > > rather than inventing a new abstraction". So how about it then? Do you > > > > > want to find a way to use namespaces to hide the PV device for netvsc > > > > > compatibility? > > > > > > > > > > > > > Netvsc can't work with 3 dev model. MS has worked with enough distro's > > > > and > > > > startups that all demand eth0 always be present. And VF may come and > > > > go. > > > > > > Well failover seems to maintain this invariant with the 3 dev model. > > > > > > > After this history, there is a strong motivation not to change how > > > > kernel > > > > behaves. Switching to 3 device model would be perceived as breaking > > > > existing userspace. > > > > > > I feel I'm misunderstood. I was asking whether a 3-rd device can be > > > hidden so that userspace does not know that you switched to a 3 device > > > model. It will think there are 2 devices and will keep working. > > > > > > If you do that, then there won't be anything that > > > would be perceived as breaking existing userspace, will there? > > > > DPDK now knows about the netvsc 2 device model and drivers in userspace > > depend on it. > > Interesting but I'm not sure how this answers the question. How would > DPDK care that there's a hidden device? If you can point out the > code in question, maybe a way can be found to make changes while > keeping it working. See http://dpdk.org/browse/dpdk/tree/drivers/net/vdev_netvsc/vdev_netvsc.c I am working to eliminate the necessity of this complex model in DPDK. Having a 3 device model inside DPDK has just as many problems as in the kernel.
Re: [PATCH net] failover: eliminate callback hell
On Wed, Jun 06, 2018 at 03:24:08PM -0700, Stephen Hemminger wrote: > On Thu, 7 Jun 2018 00:47:52 +0300 > "Michael S. Tsirkin" wrote: > > > On Wed, Jun 06, 2018 at 02:24:47PM -0700, Stephen Hemminger wrote: > > > On Wed, 6 Jun 2018 15:30:27 +0300 > > > "Michael S. Tsirkin" wrote: > > > > > > > On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote: > > > > > Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org > > > > > wrote: > > > > > >The net failover should be a simple library, not a virtual > > > > > >object with function callbacks (see callback hell). > > > > > > > > > > Why just a library? It should do a common things. I think it should > > > > > be a > > > > > virtual object. Looks like your patch again splits the common > > > > > functionality into multiple drivers. That is kind of backwards > > > > > attitude. > > > > > I don't get it. We should rather focus on fixing the mess the > > > > > introduction of netvsc-bonding caused and switch netvsc to 3-netdev > > > > > model. > > > > > > > > So it seems that at least one benefit for netvsc would be better > > > > handling of renames. > > > > > > > > Question is how can this change to 3-netdev happen? Stephen is > > > > concerned about risk of breaking some userspace. > > > > > > > > Stephen, this seems to be the usecase that IFF_HIDDEN was trying to > > > > address, and you said then "why not use existing network namespaces > > > > rather than inventing a new abstraction". So how about it then? Do you > > > > want to find a way to use namespaces to hide the PV device for netvsc > > > > compatibility? > > > > > > > > > > Netvsc can't work with 3 dev model. MS has worked with enough distro's and > > > startups that all demand eth0 always be present. And VF may come and go. > > > > Well failover seems to maintain this invariant with the 3 dev model. > > > > > After this history, there is a strong motivation not to change how kernel > > > behaves. Switching to 3 device model would be perceived as breaking > > > existing userspace. > > > > I feel I'm misunderstood. I was asking whether a 3-rd device can be > > hidden so that userspace does not know that you switched to a 3 device > > model. It will think there are 2 devices and will keep working. > > > > If you do that, then there won't be anything that > > would be perceived as breaking existing userspace, will there? > > DPDK now knows about the netvsc 2 device model and drivers in userspace > depend on it. Interesting but I'm not sure how this answers the question. How would DPDK care that there's a hidden device? If you can point out the code in question, maybe a way can be found to make changes while keeping it working. > > > > > > > With virtio you can work it out with the distro's yourself. > > > There is no pre-existing semantics to deal with. > > > > > > For the virtio, I don't see the need for IFF_HIDDEN. > > > With 3-dev model as long as you mark the PV and VF devices > > > as slaves, then userspace knows to leave them alone. Assuming userspace > > > is already able to deal with team and bond devices. > > > > That's clear enough. > > > > > Any time you introduce new UAPI behavior something breaks. > > > > Not if we do it right. > > > > > On the rename front, I really don't care if VF can be renamed. > > > > OK that's nice. > > > > > And for > > > netvsc want to allow the PV device to be renamed. > > > > That's because of the 2 device model, right? So that explains why even > > if the delayed hack is good for the goose it might not be good for the > > gander :) > > You are bringing up the VF right away. How does the 3-device initialization > state machine work? Do you give a window for udev to possibly rename the > VF? Do you rely on that? > > > > > > Udev developers want that > > > but have not found a stable/persistent value to expose to userspace > > > to allow it.
Re: [PATCH net] failover: eliminate callback hell
On Thu, 7 Jun 2018 07:17:51 -0700 Alexander Duyck wrote: > On Wed, Jun 6, 2018 at 3:25 PM, Stephen Hemminger > wrote: > > On Wed, 6 Jun 2018 14:54:04 -0700 > > "Samudrala, Sridhar" wrote: > > > >> On 6/6/2018 2:24 PM, Stephen Hemminger wrote: > >> > On Wed, 6 Jun 2018 15:30:27 +0300 > >> > "Michael S. Tsirkin" wrote: > >> > > >> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote: > >> >>> Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org > >> >>> wrote: > >> The net failover should be a simple library, not a virtual > >> object with function callbacks (see callback hell). > >> >>> Why just a library? It should do a common things. I think it should be > >> >>> a > >> >>> virtual object. Looks like your patch again splits the common > >> >>> functionality into multiple drivers. That is kind of backwards > >> >>> attitude. > >> >>> I don't get it. We should rather focus on fixing the mess the > >> >>> introduction of netvsc-bonding caused and switch netvsc to 3-netdev > >> >>> model. > >> >> So it seems that at least one benefit for netvsc would be better > >> >> handling of renames. > >> >> > >> >> Question is how can this change to 3-netdev happen? Stephen is > >> >> concerned about risk of breaking some userspace. > >> >> > >> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to > >> >> address, and you said then "why not use existing network namespaces > >> >> rather than inventing a new abstraction". So how about it then? Do you > >> >> want to find a way to use namespaces to hide the PV device for netvsc > >> >> compatibility? > >> >> > >> > Netvsc can't work with 3 dev model. MS has worked with enough distro's > >> > and > >> > startups that all demand eth0 always be present. And VF may come and go. > >> > After this history, there is a strong motivation not to change how kernel > >> > behaves. Switching to 3 device model would be perceived as breaking > >> > existing userspace. > >> > >> I think it should be possible for netvsc to work with 3 dev model if the > >> only > >> requirement is that eth0 will always be present. With net_failover, you > >> will > >> see eth0 and eth0nsby OR with older distros eth0 and eth1. It may be an > >> issue > >> if somehow there is userspace requirement that there can be only 2 > >> netdevs, not 3 > >> when VF is plugged. > >> > >> eth0 will be the net_failover device and eth0nsby/eth1 will be the netvsc > >> device > >> and the IP address gets configured on eth0. Will this be an issue? > > > > DPDK drivers in 18.05 depend on 2 device model. Yes it is a bit of mess > > but that is the way it is. > > Why would DPDK care what we do in the kernel? Isn't it just slapping > vfio-pci on the netdevs it sees? Alex, you are correct for Intel devices; but DPDK on Azure is not Intel based.,. The DPDK support uses: * Mellanox MLX5 which uses the Infinband hooks to do DMA directly to userspace. This means VF netdev device must exist and be visible. * Slow path using kernel netvsc device, TAP and BPF to get exception path packets to userspace. * A autodiscovery mechanism that to set all this up that relies on 2 device model and sysfs. In this version, there is no VFIO-PCI. And also Hyper-V does not have virtual IOMMU so VFIO will not work there at all.
Re: [PATCH net] failover: eliminate callback hell
On Wed, Jun 6, 2018 at 3:25 PM, Stephen Hemminger wrote: > On Wed, 6 Jun 2018 14:54:04 -0700 > "Samudrala, Sridhar" wrote: > >> On 6/6/2018 2:24 PM, Stephen Hemminger wrote: >> > On Wed, 6 Jun 2018 15:30:27 +0300 >> > "Michael S. Tsirkin" wrote: >> > >> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote: >> >>> Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org wrote: >> The net failover should be a simple library, not a virtual >> object with function callbacks (see callback hell). >> >>> Why just a library? It should do a common things. I think it should be a >> >>> virtual object. Looks like your patch again splits the common >> >>> functionality into multiple drivers. That is kind of backwards attitude. >> >>> I don't get it. We should rather focus on fixing the mess the >> >>> introduction of netvsc-bonding caused and switch netvsc to 3-netdev >> >>> model. >> >> So it seems that at least one benefit for netvsc would be better >> >> handling of renames. >> >> >> >> Question is how can this change to 3-netdev happen? Stephen is >> >> concerned about risk of breaking some userspace. >> >> >> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to >> >> address, and you said then "why not use existing network namespaces >> >> rather than inventing a new abstraction". So how about it then? Do you >> >> want to find a way to use namespaces to hide the PV device for netvsc >> >> compatibility? >> >> >> > Netvsc can't work with 3 dev model. MS has worked with enough distro's and >> > startups that all demand eth0 always be present. And VF may come and go. >> > After this history, there is a strong motivation not to change how kernel >> > behaves. Switching to 3 device model would be perceived as breaking >> > existing userspace. >> >> I think it should be possible for netvsc to work with 3 dev model if the only >> requirement is that eth0 will always be present. With net_failover, you will >> see eth0 and eth0nsby OR with older distros eth0 and eth1. It may be an >> issue >> if somehow there is userspace requirement that there can be only 2 netdevs, >> not 3 >> when VF is plugged. >> >> eth0 will be the net_failover device and eth0nsby/eth1 will be the netvsc >> device >> and the IP address gets configured on eth0. Will this be an issue? > > DPDK drivers in 18.05 depend on 2 device model. Yes it is a bit of mess > but that is the way it is. Why would DPDK care what we do in the kernel? Isn't it just slapping vfio-pci on the netdevs it sees?
Re: [PATCH bpf-next v5 00/10] BTF: BPF Type Format
Em Thu, Jun 07, 2018 at 10:54:01AM -0300, Arnaldo Carvalho de Melo escreveu: > Em Tue, Jun 05, 2018 at 02:25:48PM -0700, Martin KaFai Lau escreveu: > > [ btw, the latest commit (1 commit) should be 94a11b59e592 ]. So, the commit log message for the pahole patch is non-existent: https://github.com/iamkafai/pahole/commit/94a11b59e5920908085bfc8d24c92f95c8ffceaf we should do better in describing what is done and how, I'm staring with a message you sent to the kernel part: -- This patch introduces BPF Type Format (BTF). BTF (BPF Type Format) is the meta data format which describes the data types of BPF program/map. Hence, it basically focus on the C programming language which the modern BPF is primary using. The first use case is to provide a generic pretty print capability for a BPF map. -- Now I'm going to do the step-by-step guide on testing the feature just introduced, and will try to convert from dwarf to BTF and back, compare the pahole output for types encoded in DWARF and BTF, etc. If you have something ressembling this already, please share. Thanks, - Arnaldo
Re: [PATCH bpf-next v5 00/10] BTF: BPF Type Format
Em Tue, Jun 05, 2018 at 02:25:48PM -0700, Martin KaFai Lau escreveu: > On Thu, Apr 19, 2018 at 04:40:34PM -0300, Arnaldo Carvalho de Melo wrote: > > Em Wed, Apr 18, 2018 at 03:55:56PM -0700, Martin KaFai Lau escreveu: > > > This patch introduces BPF Type Format (BTF). > > > > > > BTF (BPF Type Format) is the meta data format which describes > > > the data types of BPF program/map. Hence, it basically focus > > > on the C programming language which the modern BPF is primary > > > using. The first use case is to provide a generic pretty print > > > capability for a BPF map. > > > > > > A modified pahole that can convert dwarf to BTF is here: > > > https://github.com/iamkafai/pahole/tree/btf > > > (Arnaldo, there is some BTF_KIND numbering changes on > > > Apr 18th, d61426c1571) > > > > Thanks for letting me know, I'm starting to look at this, > Hi Arnaldo, > > Do you have a chance to take a look and pull it? The kernel > changes will be in 4.18, so it will be handy if it is available in > the pahole repository. > > [ btw, the latest commit (1 commit) should be 94a11b59e592 ]. Yeah, the one I had before had: It also raises the number of types (and functions) limit from 0x7fff to 0x7fff. And on this last one I see that: /* Max # of type identifier */ -#define BTF_MAX_TYPE 0x7fff +#define BTF_MAX_TYPE 0x /* Max offset into the string section */ -#define BTF_MAX_NAME_OFFSET0x7fff +#define BTF_MAX_NAME_OFFSET0x So somehow (still reading) you'll be able to get more space, if we find necessary, to have more types and names, ok. Continuing... - Arnaldo > > > > - Arnaldo > > > > > Please see individual patch for details. > > > > > > v5: > > > - Remove BTF_KIND_FLOAT and BTF_KIND_FUNC which are not > > > currently used. They can be added in the future. > > > Some bpf_df_xxx() are removed together. > > > - Add comment in patch 7 to clarify that the new bpffs_map_fops > > > should not be extended further. > > > > > > v4: > > > - Fix warning (remove unneeded semicolon) > > > - Remove a redundant variable (nr_bytes) from btf_int_check_meta() in > > > patch 1. Caught by W=1. > > > > > > v3: > > > - Rebase to bpf-next > > > - Fix sparse warning (by adding static) > > > - Add BTF header logging: btf_verifier_log_hdr() > > > - Fix the alignment test on btf->type_off > > > - Add tests for the BTF header > > > - Lower the max BTF size to 16MB. It should be enough > > > for some time. We could raise it later if it would > > > be needed. > > > > > > v2: > > > - Use kvfree where needed in patch 1 and 2 > > > - Also consider BTF_INT_OFFSET() in the btf_int_check_meta() > > > in patch 1 > > > - Fix an incorrect goto target in map_create() during > > > the btf-error-path in patch 7 > > > - re-org some local vars to keep the rev xmas tree in btf.c > > > > > > Martin KaFai Lau (10): > > > bpf: btf: Introduce BPF Type Format (BTF) > > > bpf: btf: Validate type reference > > > bpf: btf: Check members of struct/union > > > bpf: btf: Add pretty print capability for data with BTF type info > > > bpf: btf: Add BPF_BTF_LOAD command > > > bpf: btf: Add BPF_OBJ_GET_INFO_BY_FD support to BTF fd > > > bpf: btf: Add pretty print support to the basic arraymap > > > bpf: btf: Sync bpf.h and btf.h to tools/ > > > bpf: btf: Add BTF support to libbpf > > > bpf: btf: Add BTF tests > > > > > > include/linux/bpf.h | 20 +- > > > include/linux/btf.h | 48 + > > > include/uapi/linux/bpf.h | 12 + > > > include/uapi/linux/btf.h | 130 ++ > > > kernel/bpf/Makefile |1 + > > > kernel/bpf/arraymap.c| 50 + > > > kernel/bpf/btf.c | 2064 > > > ++ > > > kernel/bpf/inode.c | 156 +- > > > kernel/bpf/syscall.c | 51 +- > > > tools/include/uapi/linux/bpf.h | 12 + > > > tools/include/uapi/linux/btf.h | 130 ++ > > > tools/lib/bpf/Build |2 +- > > > tools/lib/bpf/bpf.c | 92 +- > > > tools/lib/bpf/bpf.h | 16 + > > > tools/lib/bpf/btf.c | 374 + > > > tools/lib/bpf/btf.h | 22 + > > > tools/lib/bpf/libbpf.c | 148 +- > > > tools/lib/bpf/libbpf.h |3 + > > > tools/testing/selftests/bpf/Makefile | 26 +- > > > tools/testing/selftests/bpf/test_btf.c | 1669 + > > > tools/testing/selftests/bpf/test_btf_haskv.c | 48 + > > > tools/testing/selftests/bpf/test_btf_nokv.c | 43 + > > > 22 files changed, 5076 insertions(+), 41 deletions(-) > > > create mode 100644 include/linux/btf.h > > > create mode 100644 include/uapi/linux/btf.h > > >
[PATCH net] net/sched: act_simple: fix parsing of TCA_DEFDATA
use nla_strlcpy() to avoid copying data beyond the length of TCA_DEFDATA netlink attribute, in case it is less than SIMP_MAX_DATA and it does not end with '\0' character. Fixes: 0eff683f737b ("net/sched: potential data corruption") Signed-off-by: Davide Caratti --- net/sched/act_simple.c | 15 ++- 1 file changed, 6 insertions(+), 9 deletions(-) diff --git a/net/sched/act_simple.c b/net/sched/act_simple.c index 9618b4a83cee..98c4afe7c15b 100644 --- a/net/sched/act_simple.c +++ b/net/sched/act_simple.c @@ -53,22 +53,22 @@ static void tcf_simp_release(struct tc_action *a) kfree(d->tcfd_defdata); } -static int alloc_defdata(struct tcf_defact *d, char *defdata) +static int alloc_defdata(struct tcf_defact *d, const struct nlattr *defdata) { d->tcfd_defdata = kzalloc(SIMP_MAX_DATA, GFP_KERNEL); if (unlikely(!d->tcfd_defdata)) return -ENOMEM; - strlcpy(d->tcfd_defdata, defdata, SIMP_MAX_DATA); + nla_strlcpy(d->tcfd_defdata, defdata, SIMP_MAX_DATA); return 0; } -static void reset_policy(struct tcf_defact *d, char *defdata, +static void reset_policy(struct tcf_defact *d, const struct nlattr *defdata, struct tc_defact *p) { spin_lock_bh(>tcf_lock); d->tcf_action = p->action; memset(d->tcfd_defdata, 0, SIMP_MAX_DATA); - strlcpy(d->tcfd_defdata, defdata, SIMP_MAX_DATA); + nla_strlcpy(d->tcfd_defdata, defdata, SIMP_MAX_DATA); spin_unlock_bh(>tcf_lock); } @@ -87,7 +87,6 @@ static int tcf_simp_init(struct net *net, struct nlattr *nla, struct tcf_defact *d; bool exists = false; int ret = 0, err; - char *defdata; if (nla == NULL) return -EINVAL; @@ -110,8 +109,6 @@ static int tcf_simp_init(struct net *net, struct nlattr *nla, return -EINVAL; } - defdata = nla_data(tb[TCA_DEF_DATA]); - if (!exists) { ret = tcf_idr_create(tn, parm->index, est, a, _simp_ops, bind, false); @@ -119,7 +116,7 @@ static int tcf_simp_init(struct net *net, struct nlattr *nla, return ret; d = to_defact(*a); - ret = alloc_defdata(d, defdata); + ret = alloc_defdata(d, tb[TCA_DEF_DATA]); if (ret < 0) { tcf_idr_release(*a, bind); return ret; @@ -133,7 +130,7 @@ static int tcf_simp_init(struct net *net, struct nlattr *nla, if (!ovr) return -EEXIST; - reset_policy(d, defdata, parm); + reset_policy(d, tb[TCA_DEF_DATA], parm); } if (ret == ACT_P_CREATED) -- 2.17.0
Re: [RFC net-next] ipv4: Don't promote secondaries when flushing addresses
On Thu, Jun 07, 2018 at 02:35:39PM +0200, Phil Sutter wrote: > Yes, I agree with Michal. IIRC, flushing a specific primary along with > all it's secondaries from an interface is not even supported by > iproute2, so no need to optimize for that I guess. OTOH, if your > solution allowed to get rid of that nasty loop in ipaddr_flush(), I owe > you one extra beer at the next occasion. :) I'm afraid it will have to stay as fallback for older kernels not supporting flush requests. But there would be no need to actually use it. If we know RTM_DELADDR request for zero address is guaranteed to fail with current and older kernels, we could do - use RTM_DELADDR with IFA_F_FLUSH and zero address - if it fails, get the list and run the loop If not, it could still be - use RTM_DELADDR with IFA_F_FLUSH and zero address - get the list of addresses (empty if first step worked) - run the loop Michal Kubecek
Re: [RFC net-next] ipv4: Don't promote secondaries when flushing addresses
Hi Jakub, On Thu, Jun 07, 2018 at 02:17:50PM +0200, Jakub Sitnicki wrote: > On Thu, 7 Jun 2018 13:00:29 +0200 > Michal Kubecek wrote: > > > On Thu, Jun 07, 2018 at 12:13:01PM +0200, Jakub Sitnicki wrote: > > > Promoting secondary addresses on address removal makes flushing all > > > addresses from a device with 1000's of them slow. This is because we > > > cannot take down the secondary addresses when we are removing the > > > primary one, which would make it faster. > > > > > > However, the userspace, when performing a flush, will in the end remove > > > all the addresses regardless of secondary address promotion taking > > > place. Unfortunately the kernel currently cannot distinguish between a > > > single address removal and a flush of all addresses. > > > > > > To help with this case introduce a IFA_F_FLUSH flag that can be used by > > > userspace to signal that a removal operation is being done because of a > > > flush. When the flag is set, don't bother with secondary address > > > promotion as we expect that secondary addresses will be removed soon as > > > well. > > > > Unless you intend to use the flag to allow deleting a specific address > > with its secondaries (overriding promote_secondaries), maybe it would > > be more practical to go even further and delete all addresses on the > > interface if IFA_F_FLUSH is set so that userspace could delete all > > addresses with one request. > > Thanks for input, Michal. The intend as I understand it is to make > flushing all the addresses fast(er). Let me see if I can rework it > according to your suggestion. It does make more sense to do it like > that to me too. Yes, I agree with Michal. IIRC, flushing a specific primary along with all it's secondaries from an interface is not even supported by iproute2, so no need to optimize for that I guess. OTOH, if your solution allowed to get rid of that nasty loop in ipaddr_flush(), I owe you one extra beer at the next occasion. :) Thanks for holding on to this old ticket! Cheers, Phil
Re: [RFC net-next] ipv4: Don't promote secondaries when flushing addresses
On Thu, 7 Jun 2018 13:00:29 +0200 Michal Kubecek wrote: > On Thu, Jun 07, 2018 at 12:13:01PM +0200, Jakub Sitnicki wrote: > > Promoting secondary addresses on address removal makes flushing all > > addresses from a device with 1000's of them slow. This is because we > > cannot take down the secondary addresses when we are removing the > > primary one, which would make it faster. > > > > However, the userspace, when performing a flush, will in the end remove > > all the addresses regardless of secondary address promotion taking > > place. Unfortunately the kernel currently cannot distinguish between a > > single address removal and a flush of all addresses. > > > > To help with this case introduce a IFA_F_FLUSH flag that can be used by > > userspace to signal that a removal operation is being done because of a > > flush. When the flag is set, don't bother with secondary address > > promotion as we expect that secondary addresses will be removed soon as > > well. > > Unless you intend to use the flag to allow deleting a specific address > with its secondaries (overriding promote_secondaries), maybe it would > be more practical to go even further and delete all addresses on the > interface if IFA_F_FLUSH is set so that userspace could delete all > addresses with one request. Thanks for input, Michal. The intend as I understand it is to make flushing all the addresses fast(er). Let me see if I can rework it according to your suggestion. It does make more sense to do it like that to me too. Thanks, Jakub
Re: [RFC net-next] ipv4: Don't promote secondaries when flushing addresses
On Thu, Jun 07, 2018 at 12:13:01PM +0200, Jakub Sitnicki wrote: > Promoting secondary addresses on address removal makes flushing all > addresses from a device with 1000's of them slow. This is because we > cannot take down the secondary addresses when we are removing the > primary one, which would make it faster. > > However, the userspace, when performing a flush, will in the end remove > all the addresses regardless of secondary address promotion taking > place. Unfortunately the kernel currently cannot distinguish between a > single address removal and a flush of all addresses. > > To help with this case introduce a IFA_F_FLUSH flag that can be used by > userspace to signal that a removal operation is being done because of a > flush. When the flag is set, don't bother with secondary address > promotion as we expect that secondary addresses will be removed soon as > well. Unless you intend to use the flag to allow deleting a specific address with its secondaries (overriding promote_secondaries), maybe it would be more practical to go even further and delete all addresses on the interface if IFA_F_FLUSH is set so that userspace could delete all addresses with one request. Michal Kubecek
[RFC net-next] ipv4: Don't promote secondaries when flushing addresses
Promoting secondary addresses on address removal makes flushing all addresses from a device with 1000's of them slow. This is because we cannot take down the secondary addresses when we are removing the primary one, which would make it faster. However, the userspace, when performing a flush, will in the end remove all the addresses regardless of secondary address promotion taking place. Unfortunately the kernel currently cannot distinguish between a single address removal and a flush of all addresses. To help with this case introduce a IFA_F_FLUSH flag that can be used by userspace to signal that a removal operation is being done because of a flush. When the flag is set, don't bother with secondary address promotion as we expect that secondary addresses will be removed soon as well. Signed-off-by: Jakub Sitnicki --- A benchmark involving a flush of 40,000 addresses from a dummy device shows a x4 speed-up of the 'flush' operation. 'ip' had to be modified to set the IFA_F_FLUSH flag for RTM_DELADDR requests issued for the 'flush': # time $IP -stats addr flush dev dum0 Before: real0m30.596s user0m0.000s sys 0m30.567s After: real0m7.601s user0m0.000s sys 0m7.569s It's also worth noting that promote_secondaries sysctl param is enabled by default since systemd 216 thus making it the new "normal" on some distros. include/uapi/linux/if_addr.h | 1 + net/ipv4/devinet.c | 14 ++ 2 files changed, 11 insertions(+), 4 deletions(-) diff --git a/include/uapi/linux/if_addr.h b/include/uapi/linux/if_addr.h index ebaf5701c9db..19aab9a9cec5 100644 --- a/include/uapi/linux/if_addr.h +++ b/include/uapi/linux/if_addr.h @@ -54,6 +54,7 @@ enum { #define IFA_F_NOPREFIXROUTE0x200 #define IFA_F_MCAUTOJOIN 0x400 #define IFA_F_STABLE_PRIVACY 0x800 +#define IFA_F_FLUSH0x1000 struct ifa_cacheinfo { __u32 ifa_prefered; diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c index d7585ab1a77a..1f436e1e5222 100644 --- a/net/ipv4/devinet.c +++ b/net/ipv4/devinet.c @@ -331,13 +331,14 @@ int inet_addr_onlink(struct in_device *in_dev, __be32 a, __be32 b) } static void __inet_del_ifa(struct in_device *in_dev, struct in_ifaddr **ifap, -int destroy, struct nlmsghdr *nlh, u32 portid) + int destroy, struct nlmsghdr *nlh, u32 portid, + bool flush) { struct in_ifaddr *promote = NULL; struct in_ifaddr *ifa, *ifa1 = *ifap; struct in_ifaddr *last_prim = in_dev->ifa_list; struct in_ifaddr *prev_prom = NULL; - int do_promote = IN_DEV_PROMOTE_SECONDARIES(in_dev); + int do_promote = IN_DEV_PROMOTE_SECONDARIES(in_dev) && !flush; ASSERT_RTNL(); @@ -437,7 +438,7 @@ static void __inet_del_ifa(struct in_device *in_dev, struct in_ifaddr **ifap, static void inet_del_ifa(struct in_device *in_dev, struct in_ifaddr **ifap, int destroy) { - __inet_del_ifa(in_dev, ifap, destroy, NULL, 0); + __inet_del_ifa(in_dev, ifap, destroy, NULL, 0, false); } static void check_lifetime(struct work_struct *work); @@ -607,6 +608,7 @@ static int inet_rtm_deladdr(struct sk_buff *skb, struct nlmsghdr *nlh, struct in_device *in_dev; struct ifaddrmsg *ifm; struct in_ifaddr *ifa, **ifap; + bool flush = false; int err = -EINVAL; ASSERT_RTNL(); @@ -623,6 +625,9 @@ static int inet_rtm_deladdr(struct sk_buff *skb, struct nlmsghdr *nlh, goto errout; } + if (tb[IFA_FLAGS]) + flush = !!(nla_get_u32(tb[IFA_FLAGS]) & IFA_F_FLUSH); + for (ifap = _dev->ifa_list; (ifa = *ifap) != NULL; ifap = >ifa_next) { if (tb[IFA_LOCAL] && @@ -639,7 +644,8 @@ static int inet_rtm_deladdr(struct sk_buff *skb, struct nlmsghdr *nlh, if (ipv4_is_multicast(ifa->ifa_address)) ip_mc_config(net->ipv4.mc_autojoin_sk, false, ifa); - __inet_del_ifa(in_dev, ifap, 1, nlh, NETLINK_CB(skb).portid); + __inet_del_ifa(in_dev, ifap, 1, nlh, NETLINK_CB(skb).portid, + flush); return 0; } -- 2.14.4
Re: [PATCH bpf-next 09/11] i40e: implement AF_XDP zero-copy support for Rx
Den mån 4 juni 2018 kl 22:35 skrev Alexander Duyck : > > On Mon, Jun 4, 2018 at 5:05 AM, Björn Töpel wrote: > > From: Björn Töpel > > > > This commit adds initial AF_XDP zero-copy support for i40e-based > > NICs. First we add support for the new XDP_QUERY_XSK_UMEM and > > XDP_SETUP_XSK_UMEM commands in ndo_bpf. This allows the AF_XDP socket > > to pass a UMEM to the driver. The driver will then DMA map all the > > frames in the UMEM for the driver. Next, the Rx code will allocate > > frames from the UMEM fill queue, instead of the regular page > > allocator. > > > > Externally, for the rest of the XDP code, the driver internal UMEM > > allocator will appear as a MEM_TYPE_ZERO_COPY. > > > > The commit also introduces a completely new clean_rx_irq/allocator > > functions for zero-copy, and means (functions pointers) to set > > allocators and clean_rx functions. > > > > This first version does not support: > > * passing frames to the stack via XDP_PASS (clone/copy to skb). > > * doing XDP redirect to other than AF_XDP sockets > > (convert_to_xdp_frame does not clone the frame yet). > > > > Signed-off-by: Björn Töpel > > --- > > drivers/net/ethernet/intel/i40e/Makefile| 3 +- > > drivers/net/ethernet/intel/i40e/i40e.h | 23 ++ > > drivers/net/ethernet/intel/i40e/i40e_main.c | 35 +- > > drivers/net/ethernet/intel/i40e/i40e_txrx.c | 163 ++--- > > drivers/net/ethernet/intel/i40e/i40e_txrx.h | 128 ++- > > drivers/net/ethernet/intel/i40e/i40e_xsk.c | 537 > > > > drivers/net/ethernet/intel/i40e/i40e_xsk.h | 17 + > > include/net/xdp_sock.h | 19 + > > net/xdp/xdp_umem.h | 10 - > > 9 files changed, 789 insertions(+), 146 deletions(-) > > create mode 100644 drivers/net/ethernet/intel/i40e/i40e_xsk.c > > create mode 100644 drivers/net/ethernet/intel/i40e/i40e_xsk.h > > > > diff --git a/drivers/net/ethernet/intel/i40e/Makefile > > b/drivers/net/ethernet/intel/i40e/Makefile > > index 14397e7e9925..50590e8d1fd1 100644 > > --- a/drivers/net/ethernet/intel/i40e/Makefile > > +++ b/drivers/net/ethernet/intel/i40e/Makefile > > @@ -22,6 +22,7 @@ i40e-objs := i40e_main.o \ > > i40e_txrx.o \ > > i40e_ptp.o \ > > i40e_client.o \ > > - i40e_virtchnl_pf.o > > + i40e_virtchnl_pf.o \ > > + i40e_xsk.o > > > > i40e-$(CONFIG_I40E_DCB) += i40e_dcb.o i40e_dcb_nl.o > > diff --git a/drivers/net/ethernet/intel/i40e/i40e.h > > b/drivers/net/ethernet/intel/i40e/i40e.h > > index 7a80652e2500..20955e5dce02 100644 > > --- a/drivers/net/ethernet/intel/i40e/i40e.h > > +++ b/drivers/net/ethernet/intel/i40e/i40e.h > > @@ -786,6 +786,12 @@ struct i40e_vsi { > > > > /* VSI specific handlers */ > > irqreturn_t (*irq_handler)(int irq, void *data); > > + > > + /* AF_XDP zero-copy */ > > + struct xdp_umem **xsk_umems; > > + u16 num_xsk_umems_used; > > + u16 num_xsk_umems; > > + > > } cacheline_internodealigned_in_smp; > > > > struct i40e_netdev_priv { > > @@ -1090,6 +1096,20 @@ static inline bool i40e_enabled_xdp_vsi(struct > > i40e_vsi *vsi) > > return !!vsi->xdp_prog; > > } > > > > +static inline struct xdp_umem *i40e_xsk_umem(struct i40e_ring *ring) > > +{ > > + bool xdp_on = i40e_enabled_xdp_vsi(ring->vsi); > > + int qid = ring->queue_index; > > + > > + if (ring_is_xdp(ring)) > > + qid -= ring->vsi->alloc_queue_pairs; > > + > > + if (!ring->vsi->xsk_umems || !ring->vsi->xsk_umems[qid] || !xdp_on) > > + return NULL; > > + > > + return ring->vsi->xsk_umems[qid]; > > +} > > + > > int i40e_create_queue_channel(struct i40e_vsi *vsi, struct i40e_channel > > *ch); > > int i40e_set_bw_limit(struct i40e_vsi *vsi, u16 seid, u64 max_tx_rate); > > int i40e_add_del_cloud_filter(struct i40e_vsi *vsi, > > @@ -1098,4 +1118,7 @@ int i40e_add_del_cloud_filter(struct i40e_vsi *vsi, > > int i40e_add_del_cloud_filter_big_buf(struct i40e_vsi *vsi, > > struct i40e_cloud_filter *filter, > > bool add); > > +int i40e_queue_pair_enable(struct i40e_vsi *vsi, int queue_pair); > > +int i40e_queue_pair_disable(struct i40e_vsi *vsi, int queue_pair); > > + > > #endif /* _I40E_H_ */ > > diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c > > b/drivers/net/ethernet/intel/i40e/i40e_main.c > > index 369a116edaa1..8c602424d339 100644 > > --- a/drivers/net/ethernet/intel/i40e/i40e_main.c > > +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c > > @@ -5,6 +5,7 @@ > > #include > > #include > > #include > > +#include > > > > /* Local includes */ > > #include "i40e.h" > > @@ -16,6 +17,7 @@ > > */ > > #define CREATE_TRACE_POINTS > > #include "i40e_trace.h" > > +#include "i40e_xsk.h" > > > > const char i40e_driver_name[] = "i40e"; > > static const char i40e_driver_string[] = > > @@ -3071,6 +3073,9 @@ static int
[PATCH ipsec] vti6: fix PMTU caching and reporting on xmit
When setting the skb->dst before doing the MTU check, the route PMTU caching and reporting is done on the new dst which is about to be released. Instead, PMTU handling should be done using the original dst. This is aligned with IPv4 VTI. Signed-off-by: Eyal Birger Fixes: ccd740cbc6 ("vti6: Add pmtu handling to vti6_xmit.") --- net/ipv6/ip6_vti.c | 11 ++- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/net/ipv6/ip6_vti.c b/net/ipv6/ip6_vti.c index ca957dd..e675ec7 100644 --- a/net/ipv6/ip6_vti.c +++ b/net/ipv6/ip6_vti.c @@ -480,10 +480,6 @@ vti6_xmit(struct sk_buff *skb, struct net_device *dev, struct flowi *fl) goto tx_err_dst_release; } - skb_scrub_packet(skb, !net_eq(t->net, dev_net(dev))); - skb_dst_set(skb, dst); - skb->dev = skb_dst(skb)->dev; - mtu = dst_mtu(dst); if (!skb->ignore_df && skb->len > mtu) { skb_dst_update_pmtu(skb, mtu); @@ -498,9 +494,14 @@ vti6_xmit(struct sk_buff *skb, struct net_device *dev, struct flowi *fl) htonl(mtu)); } - return -EMSGSIZE; + err = -EMSGSIZE; + goto tx_err_dst_release; } + skb_scrub_packet(skb, !net_eq(t->net, dev_net(dev))); + skb_dst_set(skb, dst); + skb->dev = skb_dst(skb)->dev; + err = dst_output(t->net, skb->sk, skb); if (net_xmit_eval(err) == 0) { struct pcpu_sw_netstats *tstats = this_cpu_ptr(dev->tstats); -- 2.7.4