Re: [PATCH net-next v4 0/3] net: mpls: fragmentation and gso fixes for locally originated traffic
From: David Ahern
Date: Wed, 24 Aug 2016 20:10:42 -0700

> This series fixes mtu and fragmentation for tunnels using lwtunnel
> output redirect, and fixes GSO for MPLS for locally originated traffic
> reported by Lennert Buytenhek.
>
> A follow-on series will address fragmentation and GSO for forwarded
> MPLS traffic. Hardware offload of GSO with MPLS also needs to be
> addressed.
>
> Simon: Can you verify this works with OVS for single and multiple
> labels?

Series applied, thanks.
Re: [PATCH net-next] net: batch calls to flush_all_backlogs()
From: Eric Dumazet
Date: Fri, 26 Aug 2016 12:50:39 -0700

> From: Eric Dumazet
>
> After commit 145dd5f9c88f ("net: flush the softnet backlog in process
> context"), we can easily batch calls to flush_all_backlogs() for all
> devices processed in rollback_registered_many()
>
> Tested: ...
> Signed-off-by: Eric Dumazet

Applied, thanks.
Re: [net-next] ixgbe: Eliminate useless message and improve logic
From: Jeff Kirsher
Date: Tue, 30 Aug 2016 11:33:43 -0700

> From: Mark Rustad
>
> Remove a useless log message and improve the logic for setting
> a PHY address from the contents of the MNG_IF_SEL register.
>
> Signed-off-by: Mark Rustad
> Tested-by: Andrew Bowers
> Signed-off-by: Jeff Kirsher

Applied.
Re: [PATCH net-next 0/8] rxrpc: Preparation for removal of use of skbs from AFS
From: David Howells <dhowe...@redhat.com>
Date: Tue, 30 Aug 2016 16:41:37 +0100

> Here's a set of patches that prepare the way for the removal of the use of
> sk_buffs from fs/afs (they'll be entirely retained within net/rxrpc):
>
> (1) Fix a potential NULL-pointer deref in rxrpc_abort_calls().
>
> (2) Condense all the terminal call state machine states to a single one
>     plus supplementary info.
>
> (3) Add a trace point for rxrpc call usage debugging.
>
> (4) Cleanups and missing headers.
>
> (5) Provide a way for AFS to ask about a call's peer address without
>     having an sk_buff to query.
>
> (6) Use call->peer directly rather than going via call->conn (which might
>     be NULL).
>
> (7) Pass struct socket * to various rxrpc kernel interface functions so
>     they can use that directly rather than getting it from the rxrpc_call
>     struct.
> ...
> Tagged thusly:
>
> 	git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
> 	rxrpc-rewrite-20160830-1

Pulled, thanks David.
Re: [PATCH 0/7] Netfilter fixes for net
From: Pablo Neira Ayuso
Date: Tue, 30 Aug 2016 13:26:16 +0200

> The following patchset contains Netfilter fixes for your net tree,
> they are: ...
>
> You can pull these changes from:
>
> 	git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git

Pulled, thanks a lot Pablo.
[PATCH net-next] rtnetlink: fdb dump: optimize by saving last interface markers
From: Roopa Prabhu

fdb dumps spanning multiple skbs currently restart from the first interface again for every skb. This results in unnecessary iterations over the already-visited interfaces and their fdb entries. In large-scale setups we have seen this slow down fdb dumps considerably. On a system with 30k MACs we see fdb dumps spanning more than 300 skbs.

To fix the problem, this patch replaces the existing single fdb marker with three markers: netdev hash entry, netdev, and fdb index, so a dump continues where it left off instead of restarting from the first netdev. This is consistent with link dumps.

In the process of fixing the performance issue, this patch also re-implements the fix done by commit 472681d57a5d ("net: ndo_fdb_dump should report -EMSGSIZE to rtnl_fdb_dump") (with an internal fix from Wilson Kok) in the following ways:
- change ndo_fdb_dump handlers to return an error code instead of the last fdb index
- use cb->args strictly for dump frag markers and not error codes, consistent with other dump functions
Below results were taken on a system with 1000 netdevs and 35085 fdb entries:

before patch:
$ time bridge fdb show | wc -l
15065
real	1m11.791s
user	0m0.070s
sys	1m8.395s
(existing code does not return all macs)

after patch:
$ time bridge fdb show | wc -l
35085
real	0m2.017s
user	0m0.113s
sys	0m1.942s

Signed-off-by: Roopa Prabhu
Signed-off-by: Wilson Kok
---
 drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c |   7 +-
 drivers/net/vxlan.c                              |  14 ++-
 include/linux/netdevice.h                        |   4 +-
 include/linux/rtnetlink.h                        |   2 +-
 include/net/switchdev.h                          |   4 +-
 net/bridge/br_fdb.c                              |  23 ++---
 net/bridge/br_private.h                          |   2 +-
 net/core/rtnetlink.c                             | 105 ++-
 net/switchdev/switchdev.c                        |  10 +--
 9 files changed, 98 insertions(+), 73 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c
index 3ebef27..3ae3968 100644
--- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c
+++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c
@@ -432,18 +432,19 @@ static int qlcnic_fdb_add(struct ndmsg *ndm, struct nlattr *tb[],
 static int qlcnic_fdb_dump(struct sk_buff *skb, struct netlink_callback *ncb,
 			   struct net_device *netdev,
-			   struct net_device *filter_dev, int idx)
+			   struct net_device *filter_dev, int *idx)
 {
 	struct qlcnic_adapter *adapter = netdev_priv(netdev);
+	int err = 0;
 
 	if (!adapter->fdb_mac_learn)
 		return ndo_dflt_fdb_dump(skb, ncb, netdev, filter_dev, idx);
 
 	if ((adapter->flags & QLCNIC_ESWITCH_ENABLED) ||
 	    qlcnic_sriov_check(adapter))
-		idx = ndo_dflt_fdb_dump(skb, ncb, netdev, filter_dev, idx);
+		err = ndo_dflt_fdb_dump(skb, ncb, netdev, filter_dev, idx);
 
-	return idx;
+	return err;
 }
 
 static void qlcnic_82xx_cancel_idc_work(struct qlcnic_adapter *adapter)
diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index c0dda6f..f5b381d 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -861,20 +861,20 @@ out:
 /* Dump forwarding table */
 static int vxlan_fdb_dump(struct sk_buff *skb, struct netlink_callback *cb,
 			  struct net_device *dev,
-			  struct net_device *filter_dev, int idx)
+			  struct net_device *filter_dev, int *idx)
 {
 	struct vxlan_dev *vxlan = netdev_priv(dev);
 	unsigned int h;
+	int err = 0;
 
 	for (h = 0; h < FDB_HASH_SIZE; ++h) {
 		struct vxlan_fdb *f;
-		int err;
 
 		hlist_for_each_entry_rcu(f, &vxlan->fdb_head[h], hlist) {
 			struct vxlan_rdst *rd;
 
 			list_for_each_entry_rcu(rd, &f->remotes, list) {
-				if (idx < cb->args[0])
+				if (*idx < cb->args[2])
 					goto skip;
 
 				err = vxlan_fdb_info(skb, vxlan, f,
@@ -882,17 +882,15 @@ static int vxlan_fdb_dump(struct sk_buff *skb, struct netlink_callback *cb,
 						     cb->nlh->nlmsg_seq,
 						     RTM_NEWNEIGH,
 						     NLM_F_MULTI, rd);
-				if (err < 0) {
-					cb->args[1] = err;
+				if (err < 0)
 					goto out;
-				}
 skip:
-				++idx;
+
Re: [PATCH v3 0/5] meson: Meson8b and GXBB DWMAC glue driver
From: Martin Blumenstingl
Date: Tue, 30 Aug 2016 20:49:28 +0200

> On Mon, Aug 29, 2016 at 5:40 AM, David Miller wrote:
>> From: Martin Blumenstingl
>> Date: Sun, 28 Aug 2016 18:16:32 +0200
>>
>>> This adds a DWMAC glue driver for the PRG_ETHERNET registers found in
>>> Meson8b and GXBB SoCs. Compared to the "old" meson6b-dwmac glue driver
>>> the register layout is completely different.
>>> Thus I introduced a separate driver.
>>>
>>> Changes since v2:
>>> - fixed unloading the glue driver when built as a module. This pulls in a
>>>   patch from Joachim Eastwood (thanks) to get our private data structure
>>>   (bsp_priv).
>>
>> This doesn't apply cleanly at all to the net-next tree, so I have
>> no idea where you expect these changes to be applied.
> OK, maybe Kevin can help me out here, as I think the patches should go
> to various trees.
>
> I think patches 1, 3 and 4 should go through the net-next tree (as
> these touch drivers/net/ethernet/stmicro/stmmac/ and the corresponding
> documentation).
> Patch 2 should probably go through clk-meson-gxbb / clk-next (just
> like the other clk changes we had).
> The last patch (patch 5) should probably go through the ARM SoC tree
> (just like the other dts changes we had).
>
> @David, Kevin: would this be fine for you?

I would prefer that all of the patches go through one tree; that way all the dependencies are satisfied in one place.
Re: pull-request: mac80211 2016-08-30
From: Johannes Berg
Date: Tue, 30 Aug 2016 08:19:18 +0200

> Nothing much, but we have three little fixes, see below. I've included the
> static inline so that BATMAN_ADV_BATMAN_V can be changed to be allowed w/o
> cfg80211 sooner, and it's a trivial change.
>
> Let me know if there's any problem.

Pulled, thanks Johannes.
Re: [PATCH net] rps: flow_dissector: Fix uninitialized flow_keys used in __skb_get_hash possibly
On Wed, Aug 31, 2016 at 12:14 PM, Eric Dumazet wrote:
> On Wed, 2016-08-31 at 10:56 +0800, f...@ikuai8.com wrote:
>> From: Gao Feng
>>
>> The original code depends on the function parameters being evaluated from
>> left to right. But the evaluation order of function arguments is not
>> defined by the C standard.
>>
>> When flow_keys_have_l4() is invoked before ___skb_get_hash(skb, &keys,
>> hashrnd) with some compilers or environments, the keys passed to
>> flow_keys_have_l4() are not initialized.
>>
>> Signed-off-by: Gao Feng
>> ---
>
> Good catch, please add
>
> Fixes: 6db61d79c1e1 ("flow_dissector: Ignore flow dissector return value from ___skb_get_hash")
> Acked-by: Eric Dumazet

Should I add it to the description and resend the patch?

Best Regards
Feng
Re: [PATCH net] rps: flow_dissector: Fix uninitialized flow_keys used in __skb_get_hash possibly
On Wed, 2016-08-31 at 10:56 +0800, f...@ikuai8.com wrote:
> From: Gao Feng
>
> The original code depends on the function parameters being evaluated from
> left to right. But the evaluation order of function arguments is not
> defined by the C standard.
>
> When flow_keys_have_l4() is invoked before ___skb_get_hash(skb, &keys,
> hashrnd) with some compilers or environments, the keys passed to
> flow_keys_have_l4() are not initialized.
>
> Signed-off-by: Gao Feng
> ---

Good catch, please add

Fixes: 6db61d79c1e1 ("flow_dissector: Ignore flow dissector return value from ___skb_get_hash")
Acked-by: Eric Dumazet
Re: [PATCH net-next 4/6] perf, bpf: add perf events core support for BPF_PROG_TYPE_PERF_EVENT programs
On Mon, Aug 29, 2016 at 02:17:18PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 26, 2016 at 07:31:22PM -0700, Alexei Starovoitov wrote:
> > +static int perf_event_set_bpf_handler(struct perf_event *event, u32 prog_fd)
> > +{
> > +	struct bpf_prog *prog;
> > +
> > +	if (event->overflow_handler_context)
> > +		/* hw breakpoint or kernel counter */
> > +		return -EINVAL;
> > +
> > +	if (event->prog)
> > +		return -EEXIST;
> > +
> > +	prog = bpf_prog_get_type(prog_fd, BPF_PROG_TYPE_PERF_EVENT);
> > +	if (IS_ERR(prog))
> > +		return PTR_ERR(prog);
> > +
> > +	event->prog = prog;
> > +	event->orig_overflow_handler = READ_ONCE(event->overflow_handler);
> > +	WRITE_ONCE(event->overflow_handler, bpf_overflow_handler);
> > +	return 0;
> > +}
> > +
> > +static void perf_event_free_bpf_handler(struct perf_event *event)
> > +{
> > +	struct bpf_prog *prog = event->prog;
> > +
> > +	if (!prog)
> > +		return;
>
> Does it make sense to do something like:
>
> 	WARN_ON_ONCE(event->overflow_handler != bpf_overflow_handler);

Yes, that's an implicit assumption here, but checking for it would be overkill. event->overflow_handler and event->prog are set back to back in two places and reset here once, together. Such a WARN_ON will only make people reading this code in the future think that this bit is too complex to analyze by hand.
> > +
> > +	WRITE_ONCE(event->overflow_handler, event->orig_overflow_handler);
> > +	event->prog = NULL;
> > +	bpf_prog_put(prog);
> > +}
> >
> >  static int perf_event_set_bpf_prog(struct perf_event *event, u32 prog_fd)
> >  {
> >  	bool is_kprobe, is_tracepoint;
> >  	struct bpf_prog *prog;
> >
> > +	if (event->attr.type == PERF_TYPE_HARDWARE ||
> > +	    event->attr.type == PERF_TYPE_SOFTWARE)
> > +		return perf_event_set_bpf_handler(event, prog_fd);
> > +
> >  	if (event->attr.type != PERF_TYPE_TRACEPOINT)
> >  		return -EINVAL;
> >
> > @@ -7647,6 +7711,8 @@ static void perf_event_free_bpf_prog(struct perf_event *event)
> >  {
> >  	struct bpf_prog *prog;
> >
> > +	perf_event_free_bpf_handler(event);
> > +
> >  	if (!event->tp_event)
> >  		return;
> >
>
> Does it at all make sense to merge the tp_event->prog thing into this
> new event->prog?

'struct trace_event_call *tp_event' is global while tp_event->perf_events are per cpu, so I don't see how we can do that without breaking user space logic. Right now users do a single perf_event_open of a kprobe and attach a prog that is executed on all cpus where the kprobe is firing. Additional per-cpu filtering is done from within the bpf prog.

> >  #ifdef CONFIG_HAVE_HW_BREAKPOINT
> > @@ -8957,6 +9029,14 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
> >  	if (!overflow_handler && parent_event) {
> >  		overflow_handler = parent_event->overflow_handler;
> >  		context = parent_event->overflow_handler_context;
> > +		if (overflow_handler == bpf_overflow_handler) {
> > +			event->prog = bpf_prog_inc(parent_event->prog);
> > +			event->orig_overflow_handler = parent_event->orig_overflow_handler;
> > +			if (IS_ERR(event->prog)) {
> > +				event->prog = NULL;
> > +				overflow_handler = NULL;
> > +			}
> > +		}
> >  	}
>
> Should we not fail the entire perf_event_alloc() call in that IS_ERR()
> case?

Yes. Good point. Will do.
Re: [RFC v2 09/10] landlock: Handle cgroups (performance)
On Tue, Aug 30, 2016 at 6:36 PM, Alexei Starovoitov wrote:
> On Tue, Aug 30, 2016 at 02:45:14PM -0700, Andy Lutomirski wrote:
>>
>> One might argue that landlock shouldn't be tied to seccomp (in theory,
>> attached progs could be given access to syscall_get_xyz()), but I
>
> proposed lsm is way more powerful than syscall_get_xyz.
> no need to dumb it down.

I think you're misunderstanding me. Mickaël's code allows one to make the LSM hook filters depend on the syscall using SECCOMP_RET_LANDLOCK. I'm suggesting that a similar effect could be achieved by allowing the eBPF LSM hook to call syscall_get_xyz() if it wants to.

>> think that the seccomp attachment mechanism is the right way to
>> install unprivileged filters. It handles the no_new_privs stuff, it
>> allows TSYNC, it's totally independent of systemwide policy, etc.
>>
>> Trying to use cgroups or similar for this is going to be much nastier.
>> Some tighter sandboxes (Sandstorm, etc) aren't even going to dream of
>> putting cgroupfs in their containers, so requiring cgroups or similar
>> would be a mess for that type of application.
>
> I don't see why it is a 'mess'. cgroups are already used by the majority
> of systems, so I don't see why requiring a cgroup is such a big deal.

Requiring cgroup to be configured in isn't a big deal. Requiring

> But let's say we don't do them. How is the implementation going to look
> for a task-based hierarchy? Note that we need an array of bpf_prog
> pointers, one for each lsm hook. Where is this array going to be stored?
> We cannot put it in task_struct, since it's too large. Cannot put it
> into 'struct seccomp' directly either, unless it becomes a pointer.
> Is that the proposal?

It would go in struct seccomp_filter or in something pointed to from there.

> So now we will be wasting an extra 1 kbyte of memory per task. Not great.
> We'd want to optimize it by sharing such a struct seccomp with prog array
> across threads of the same task? Or dynamically allocating it when
> landlock is in use? May sound nice, but how do we account for that kernel
> memory? I guess also solvable by charging memlock.
> With a cgroup-based approach we don't need to worry about all that.

The considerations are essentially identical either way. With cgroups, if you want to share the memory between multiple separate sandboxes (Firejail instances, Sandstorm grains, Chromium instances, xdg-apps, etc), you'd need to get them all to coordinate to share a cgroup. With a seccomp-like interface, you'd need to get them to coordinate to share an installed layer (using my FD idea or similar). There would *not* be any duplication of this memory just because a sandboxed process called fork().

--Andy

--
Andy Lutomirski
AMA Capital Management, LLC
[PATCH net-next] rps: flow_dissector: Add the const for the parameter of flow_keys_have_l4
From: Gao Feng

Add const to the parameter of flow_keys_have_l4() for readability.

Signed-off-by: Gao Feng
---
 include/net/flow_dissector.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index f266b51..d953492 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -183,7 +183,7 @@ struct flow_keys_digest {
 void make_flow_keys_digest(struct flow_keys_digest *digest,
 			   const struct flow_keys *flow);
 
-static inline bool flow_keys_have_l4(struct flow_keys *keys)
+static inline bool flow_keys_have_l4(const struct flow_keys *keys)
 {
 	return (keys->ports.ports || keys->tags.flow_label);
 }
-- 
1.9.1
Re: [PATCH RFC 4/4] xfs: Transmit flow steering
On Tue, Aug 30, 2016 at 5:00 PM, Tom Herbert wrote:
> XFS maintains a per-device flow table that is indexed by the skbuff
> hash. The XFS table is only consulted when there is no queue saved in
> a transmit socket for an skbuff.
>
> Each entry in the flow table contains a queue index and a queue
> pointer. The queue pointer is set when a queue is chosen using a
> flow table entry. This pointer is set to the head pointer in the
> transmit queue (which is maintained by BQL).
>
> The new function get_xfs_index() looks up flows in the XPS table.
> The entry returned gives the last queue a matching flow used. The
> returned queue is compared against the normal XPS queue. If they
> are different, then we only switch if the tail pointer in the TX
> queue has advanced past the pointer saved in the entry. In this
> way OOO should be avoided when XPS wants to use a different queue.
>
> Signed-off-by: Tom Herbert

This looks pretty good. I haven't had a chance to test it though, as it will probably take me a few days. A few minor items called out below.

Thanks.

- Alex

> ---
>  net/Kconfig    |  6
>  net/core/dev.c | 93 --
>  2 files changed, 84 insertions(+), 15 deletions(-)
>
> diff --git a/net/Kconfig b/net/Kconfig
> index 7b6cd34..5e3eddf 100644
> --- a/net/Kconfig
> +++ b/net/Kconfig
> @@ -255,6 +255,12 @@ config XPS
>  	depends on SMP
>  	default y
>
> +config XFS
> +	bool
> +	depends on XPS
> +	depends on BQL
> +	default y
> +
>  config HWBM
>  	bool
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 1d5c6dd..722e487 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3210,6 +3210,7 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
>  }
>  #endif /* CONFIG_NET_EGRESS */
>
> +/* Must be called with RCU read_lock */
>  static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
>  {
>  #ifdef CONFIG_XPS
> @@ -3217,7 +3218,6 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
>  	struct xps_map *map;
>  	int queue_index = -1;
>
> -	rcu_read_lock();
>  	dev_maps = rcu_dereference(dev->xps_maps);
>  	if (dev_maps) {
>  		map = rcu_dereference(
> @@ -3232,7 +3232,6 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
>  			queue_index = -1;
>  		}
>  	}
> -	rcu_read_unlock();
>
>  	return queue_index;
>  #else
> @@ -3240,26 +3239,90 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
>  #endif
>  }
>
> -static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb)
> +/* Must be called with RCU read_lock */
> +static int get_xfs_index(struct net_device *dev, struct sk_buff *skb)
>  {
> -	struct sock *sk = skb->sk;
> -	int queue_index = sk_tx_queue_get(sk);
> +#ifdef CONFIG_XFS
> +	struct xps_dev_flow_table *flow_table;
> +	struct xps_dev_flow ent;
> +	int queue_index;
> +	struct netdev_queue *txq;
> +	u32 hash;
>
> -	if (queue_index < 0 || skb->ooo_okay ||
> -	    queue_index >= dev->real_num_tx_queues) {
> -		int new_index = get_xps_queue(dev, skb);
> -		if (new_index < 0)
> -			new_index = skb_tx_hash(dev, skb);
> +	flow_table = rcu_dereference(dev->xps_flow_table);
> +	if (!flow_table)
> +		return -1;
>
> -		if (queue_index != new_index && sk &&
> -		    sk_fullsock(sk) &&
> -		    rcu_access_pointer(sk->sk_dst_cache))
> -			sk_tx_queue_set(sk, new_index);
> +	queue_index = get_xps_queue(dev, skb);
> +	if (queue_index < 0)
> +		return -1;

Actually I think this bit here probably needs to fall back to using skb_tx_hash if you don't get a usable result. The problem is you could have a system that is running with a mix of XFS assigned for some CPUs and just using skb_tx_hash for others. We shouldn't steal flows from the ones selected using skb_tx_hash until they have met the flow transition criteria.

> -		queue_index = new_index;
> +	hash = skb_get_hash(skb);
> +	if (!hash)
> +		return -1;

I'm not sure the !hash test makes any sense. Isn't 0 a valid hash value?

> +	ent.v64 = flow_table->flows[hash & flow_table->mask].v64;
> +	if (ent.queue_index >= 0 &&
> +	    ent.queue_index < dev->real_num_tx_queues) {
> +		txq = netdev_get_tx_queue(dev, ent.queue_index);
> +		if (queue_index != ent.queue_index) {
> +			if ((int)(txq->tail_cnt - ent.queue_ptr) >= 0) {
> +				/* The current queue's tail has advanced
> +				 * beyond the last packet
[PATCH net] rps: flow_dissector: Fix uninitialized flow_keys used in __skb_get_hash possibly
From: Gao Feng

The original code depends on the function parameters being evaluated from left to right. But the evaluation order of function arguments is not defined by the C standard.

When flow_keys_have_l4() is invoked before ___skb_get_hash(skb, &keys, hashrnd) with some compilers or environments, the keys passed to flow_keys_have_l4() are not initialized.

Signed-off-by: Gao Feng
---
 net/core/flow_dissector.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 61ad43f..52742a0 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -680,11 +680,13 @@ EXPORT_SYMBOL_GPL(__skb_get_hash_symmetric);
 void __skb_get_hash(struct sk_buff *skb)
 {
 	struct flow_keys keys;
+	u32 hash;
 
 	__flow_hash_secret_init();
 
-	__skb_set_sw_hash(skb, ___skb_get_hash(skb, &keys, hashrnd),
-			  flow_keys_have_l4(&keys));
+	hash = ___skb_get_hash(skb, &keys, hashrnd);
+
+	__skb_set_sw_hash(skb, hash, flow_keys_have_l4(&keys));
 }
 EXPORT_SYMBOL(__skb_get_hash);
-- 
1.9.1
Re: [PATCH] net/mlx4_en: protect ring->xdp_prog with rcu_read_lock
On Tue, Aug 30, 2016 at 12:35:58PM +0300, Saeed Mahameed wrote:
> On Mon, Aug 29, 2016 at 8:46 PM, Tom Herbert wrote:
> > On Mon, Aug 29, 2016 at 8:55 AM, Brenden Blanco wrote:
> >> On Mon, Aug 29, 2016 at 05:59:26PM +0300, Tariq Toukan wrote:
> >>> Hi Brenden,
> >>>
> >>> The solution direction should be XDP specific so that it does not hurt
> >>> the regular flow.
> >> An rcu_read_lock is _already_ taken for _every_ packet. This is 1/64th of
>
> In other words, "let's add a new small speed bump; we already have
> plenty ahead, so why not slow down now anyway".
>
> Every single new instruction hurts performance. In this case maybe you
> are right, maybe we won't feel any performance
> impact, but that doesn't mean it is ok to do this.

Actually, I will make a stronger assertion. Unless your .config contains CONFIG_PREEMPT=y (not most distros) or something like DEBUG_ATOMIC_SLEEP (to trigger PREEMPT_COUNT), the code in this patch will be a nop. Therefore, adding the protections that you mention below will be _slower_ than the code already proposed.

> >> that.
> >>>
> >>> On 26/08/2016 11:38 PM, Brenden Blanco wrote:
> >>> > Depending on the preempt mode, the bpf_prog stored in xdp_prog may be
> >>> > freed despite the use of call_rcu inside bpf_prog_put. The situation is
> >>> > possible when running in PREEMPT_RCU=y mode, for instance, since the rcu
> >>> > callback for destroying the bpf prog can run even during the bh handling
> >>> > in the mlx4 rx path.
> >>> >
> >>> > Several options were considered before this patch was settled on:
> >>> >
> >>> > Add a napi_synchronize loop in mlx4_xdp_set, which would occur after all
> >>> > of the rings are updated with the new program.
> >>> > This approach has the disadvantage that as the number of rings
> >>> > increases, the speed of update will slow down significantly due to
> >>> > napi_synchronize's msleep(1).
> >>> I prefer this option as it doesn't hurt the data path. A delay in a
> >>> control command can be tolerated.
> >>> > Add a new rcu_head in bpf_prog_aux, to be used by a new bpf_prog_put_bh.
> >>> > The action of the bpf_prog_put_bh would be to then call bpf_prog_put
> >>> > later. Those drivers that consume a bpf prog in a bh context (like mlx4)
> >>> > would then use the bpf_prog_put_bh instead when the ring is up. This has
> >>> > the problem of complexity, in maintaining proper refcnts and rcu lists,
> >>> > and would likely be harder to review. In addition, this approach to
> >>> > freeing must be exclusive with other frees of the bpf prog, for instance
> >>> > a _bh prog must not be referenced from a prog array that is consumed by
> >>> > a non-_bh prog.
> >>> >
> >>> > The placement of rcu_read_lock in this patch is functionally the same as
> >>> > putting an rcu_read_lock in napi_poll. Actually doing so could be a
> >>> > potentially controversial change, but would bring the implementation in
> >>> > line with sk_busy_loop (though of course the nature of those two paths
> >>> > is substantially different), and would also avoid future copy/paste
> >>> > problems with future supporters of XDP. Still, this patch does not take
> >>> > that opinionated option.
> >>> So you decided to add a lock for all non-XDP flows, which are 99% of
> >>> the cases.
> >>> We should avoid this.
> >> The whole point of the rcu_read_lock architecture is to be taken in the
> >> fast path. There won't be a performance impact from this patch.
> >
> > +1, this is nothing at all like a spinlock and really this should be
> > just like any other rcu-like access.
> >
> > Brenden, tracking down how the structure is freed needed a few steps,
> > please make sure the RCU requirements are well documented. Also, I'm
> > still not a fan of using xchg to set the program; it seems that a lock
> > could be used in that path.
> >
> > Thanks,
> > Tom
>
> Sorry folks, I am with Tariq on this. You can't just add a single
> instruction which is only valid/needed for 1% of the use cases
> to the driver's general data path, even if it was as cheap as one cpu cycle!

How about 0?

$ diff mlx4_en.ko.norcu.s mlx4_en.ko.rcu.s | wc -l
0

> Let me try to suggest something:
> instead of taking the rcu_read_lock for the whole
> mlx4_en_process_rx_cq, we can minimize it to the XDP code path only
> by double-checking xdp_prog (a non-protected check followed by a
> protected check inside the mlx4 XDP critical path).
>
> i.e. instead of:
>
> rcu_read_lock();
>
> xdp_prog = ring->xdp_prog;
>
> //__Do lots of non-XDP related stuff__
>
> if (xdp_prog) {
>     //Do XDP magic ..
> }
> //__Do more of non-XDP related stuff__
>
> rcu_read_unlock();
>
> We can minimize it to the XDP critical path only:
>
> //Non-protected xdp_prog dereference.
> if (xdp_prog) {
>
Re: [PATCH V2] rtl_bt: Add firmware and config file for RTL8822BE
On Tue, 2016-08-30 at 20:11 -0500, Larry Finger wrote:
> This device is a new model from Realtek. Updates to driver btrtl will
> soon be submitted to the kernel.
>
> These files were provided by the Realtek developer.
>
> Signed-off-by: 陆朱伟
> Signed-off-by: Larry Finger
> Cc: linux-blueto...@vger.kernel.org
> ---
>
> V2 - fix error in file names in WHENCE
> ---
[...]
> Found in vendor driver, linux_bt_usb_2.11.20140423_8723be.rar
> From https://github.com/troy-tan/driver_store
> +Files rtl_bt/rtl8822e_* came directly from Realtek.
[...]

You missed this wildcard, but I fixed it up.

Applied and pushed, thanks.

Ben.

--
Ben Hutchings
Anthony's Law of Force: Don't force it, get a larger hammer.
Re: [Bridge] [PATCH net-next v2 2/2] net: bridge: add per-port multicast flood flag
On Tue, Aug 30, 2016 at 05:23:08PM +0200, Nikolay Aleksandrov via Bridge wrote:
> diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
> index 1da3221845f1..ed0dd3340084 100644
> --- a/net/bridge/br_if.c
> +++ b/net/bridge/br_if.c
> @@ -362,7 +362,7 @@ static struct net_bridge_port *new_nbp(struct net_bridge *br,
>  	p->path_cost = port_cost(dev);
>  	p->priority = 0x8000 >> BR_PORT_BITS;
>  	p->port_no = index;
> -	p->flags = BR_LEARNING | BR_FLOOD;
> +	p->flags = BR_LEARNING | BR_FLOOD | BR_MCAST_FLOOD;

I'm discontent with this new flag becoming the default. Could you elaborate a little more on your use case: when/why do you want/need this flag?
Re: [RFC v2 09/10] landlock: Handle cgroups (performance)
On Tue, Aug 30, 2016 at 02:45:14PM -0700, Andy Lutomirski wrote:
>
> One might argue that landlock shouldn't be tied to seccomp (in theory,
> attached progs could be given access to syscall_get_xyz()), but I

proposed lsm is way more powerful than syscall_get_xyz.
no need to dumb it down.

> think that the seccomp attachment mechanism is the right way to
> install unprivileged filters. It handles the no_new_privs stuff, it
> allows TSYNC, it's totally independent of systemwide policy, etc.
>
> Trying to use cgroups or similar for this is going to be much nastier.
> Some tighter sandboxes (Sandstorm, etc) aren't even going to dream of
> putting cgroupfs in their containers, so requiring cgroups or similar
> would be a mess for that type of application.

I don't see why it is a 'mess'. cgroups are already used by the majority of systems, so I don't see why requiring a cgroup is such a big deal.

But let's say we don't do them. How is the implementation going to look for a task-based hierarchy? Note that we need an array of bpf_prog pointers, one for each lsm hook. Where is this array going to be stored? We cannot put it in task_struct, since it's too large. Cannot put it into 'struct seccomp' directly either, unless it becomes a pointer. Is that the proposal? So now we will be wasting an extra 1 kbyte of memory per task. Not great. We'd want to optimize it by sharing such a struct seccomp with prog array across threads of the same task? Or dynamically allocating it when landlock is in use? May sound nice, but how do we account for that kernel memory? I guess also solvable by charging memlock. With a cgroup-based approach we don't need to worry about all that.
Re: [RFCv2 07/16] bpf: enable non-core use of the verfier
On Tue, Aug 30, 2016 at 11:00:38PM +0200, Daniel Borkmann wrote: > On 08/30/2016 10:48 PM, Alexei Starovoitov wrote: > >On Tue, Aug 30, 2016 at 10:22:46PM +0200, Jakub Kicinski wrote: > >>On Tue, 30 Aug 2016 21:07:50 +0200, Daniel Borkmann wrote: > Having two modes seems more straight forward and I think we would only > need to pay attention in the LD_IMM64 case, I don't think I've seen > LLVM generating XORs, it's just the cBPF -> eBPF conversion. > >>> > >>>Okay, though, I think that the cBPF to eBPF migration wouldn't even > >>>pass through the bpf_parse() handling, since verifier is not aware on > >>>some of their aspects such as emitting calls directly (w/o *proto) or > >>>arg mappings. Probably make sense to reject these (bpf_prog_was_classic()) > >>>if they cannot be handled anyway? > >> > >>TBH again I only use cBPF for testing. It's a convenient way of > >>generating certain instruction sequences. I can probably just drop > >>it completely but the XOR patch is just 3 lines of code so not a huge > >>cost either... I'll keep patch 6 in my tree for now. > > > >if xor matching is only need for classic, I would drop that patch > >just to avoid unnecessary state collection. The number of lines > >is not a concern, but extra state for state prunning is. > > > >>Alternatively - is there any eBPF assembler out there? Something > >>converting verifier output back into ELF would be quite cool. > > > >would certainly be nice. I don't think there is anything standalone. > >btw llvm can be made to work as assembler only, but simple flex/bison > >is probably better. > > Never tried it out, but seems llvm backend doesn't have asm parser > implemented? > > $ clang -target bpf -O2 -c foo.c -S -o foo.S > $ llvm-mc -arch bpf foo.S -filetype=obj -o foo.o > llvm-mc: error: this target does not support assembly parsing. > > LLVM IR might work, but maybe too high level(?); alternatively, we could > make bpf_asm from tools/net/ eBPF aware for debugging purposes. 
If you > have a toolchain supporting libbfd et al, you could probably make use > of bpf_jit_dump() (like JITs do) and then bpf_jit_disasm tool (from > same dir as bpf_asm). Yes. llvm-based bpf asm is not complete. It's straightforward to add though. It won't be going through IR. Only 'mc' (machine instruction) layer.
[PATCH V2] rtl_bt: Add firmware and config file for RTL8822BE
This device is a new model from Realtek. Updates to driver btrtl will soon be submitted to the kernel. These files were provided by the Realtek developer.

Signed-off-by: 陆朱伟
Signed-off-by: Larry Finger
Cc: linux-blueto...@vger.kernel.org
---
V2 - fix error in file names in WHENCE
---
 WHENCE                     | 3 +++
 rtl_bt/rtl8822b_config.bin | Bin 0 -> 32 bytes
 rtl_bt/rtl8822b_fw.bin     | Bin 0 -> 51756 bytes
 3 files changed, 3 insertions(+)
 create mode 100644 rtl_bt/rtl8822b_config.bin
 create mode 100644 rtl_bt/rtl8822b_fw.bin

diff --git a/WHENCE b/WHENCE
index d0bef0d..a9d7c97 100644
--- a/WHENCE
+++ b/WHENCE
@@ -2755,11 +2755,14 @@ File: rtl_bt/rtl8723b_fw.bin
 File: rtl_bt/rtl8761a_fw.bin
 File: rtl_bt/rtl8812ae_fw.bin
 File: rtl_bt/rtl8821a_fw.bin
+File: rtl_bt/rtl8822b_fw.bin
+File: rtl_bt/rtl8822b_config.bin
 Licence: Redistributable. See LICENCE.rtlwifi_firmware.txt for details.
 Found in vendor driver, linux_bt_usb_2.11.20140423_8723be.rar
 From https://github.com/troy-tan/driver_store
+Files rtl_bt/rtl8822b_* came directly from Realtek.
--
diff --git a/rtl_bt/rtl8822b_config.bin b/rtl_bt/rtl8822b_config.bin
new file mode 100644
index ..a691e7ca258b0e7dc4ff2bdbdc1d13f2a613526b
GIT binary patch
literal 32
[base85-encoded binary data omitted]

diff --git a/rtl_bt/rtl8822b_fw.bin b/rtl_bt/rtl8822b_fw.bin
new file mode 100644
index ..b7d6d1229491314875b3d4a7266462c47998c0fb
GIT binary patch
literal 51756
[base85-encoded binary data omitted]
Re: [PATCH] rtl_bt: Add firmware and config file for RTL8822BE
On 08/30/2016 09:51 AM, Ben Hutchings wrote: On Tue, 2016-08-30 at 09:08 -0500, Larry Finger wrote: This device is a new model from Realtek. Updates to driver btrtl will soon be submitted to the kernel. These files were provided by the Realtek developer. Signed-off-by: 陆朱伟Signed-off-by: Larry Finger Cc: linux-blueto...@vger.kernel.org --- WHENCE | 3 +++ rtl_bt/rtl8822b_config.bin | Bin 0 -> 32 bytes rtl_bt/rtl8822b_fw.bin | Bin 0 -> 51756 bytes 3 files changed, 3 insertions(+) create mode 100644 rtl_bt/rtl8822b_config.bin create mode 100644 rtl_bt/rtl8822b_fw.bin diff --git a/WHENCE b/WHENCE index d0bef0d..a9d7c97 100644 --- a/WHENCE +++ b/WHENCE @@ -2755,11 +2755,14 @@ File: rtl_bt/rtl8723b_fw.bin File: rtl_bt/rtl8761a_fw.bin File: rtl_bt/rtl8812ae_fw.bin File: rtl_bt/rtl8821a_fw.bin +File: rtl_bt/rtl8822e_fw.bin +File: rtl_bt/rtl8822e_config.bin [...] Should the filenames begin with "rtl822b" or "rtl822e"? They should start with rtl8822b. V2 will be sent shortly. Thanks, Larry
[PATCH RFC 0/4] xfs: Transmit flow steering
This patch set introduces transmit flow steering. The idea is that we record the transmit queues in a flow table that is indexed by skbuff hash. The flow table entries have two values: the queue_index and the head cnt of packets from the TX queue. We only allow a queue to change for a flow if the tail cnt in the TX queue advances beyond the recorded head cnt. That is the condition indicating that all outstanding packets for the flow have completed transmission, so the queue can change.

Tracking of inflight packets is performed as part of BQL. Two fields are added to the netdev_queue structure: head_cnt and tail_cnt. head_cnt is incremented in netdev_tx_sent_queue and tail_cnt is incremented in netdev_tx_completed_queue by the number of packets completed.

This patch set creates /sys/class/net/eth*/xps_dev_flow_table_cnt, which gives the number of entries in the XPS flow table.

Tom Herbert (4):
  net: Set SW hash in skb_set_hash_from_sk
  bql: Add tracking of inflight packets
  net: Add xps_dev_flow_table_cnt
  xfs: Transmit flow steering

 include/linux/netdevice.h | 26 +
 include/net/sock.h        | 6 +--
 net/Kconfig               | 6 +++
 net/core/dev.c            | 93 +++
 net/core/net-sysfs.c      | 87 
 5 files changed, 199 insertions(+), 19 deletions(-)
--
2.8.0.rc2
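The queue-switch rule described in this cover letter can be modeled in plain userspace C. This is an illustrative sketch, not the patch's code: names such as `flow_ent` and `pick_queue` are made up, and the wraparound-safe comparison mirrors the `(int)(txq->tail_cnt - ent.queue_ptr) >= 0` test used in patch 4.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one flow-table entry: the last queue used and
 * the old queue's head_cnt recorded when the flow last enqueued. */
struct flow_ent {
	int queue_index;
	uint32_t queue_ptr;
};

/* Wraparound-safe "tail has reached ptr" test, as in the patch. */
static int tail_passed(uint32_t tail_cnt, uint32_t ptr)
{
	return (int32_t)(tail_cnt - ptr) >= 0;
}

/* Switch the flow to wanted_queue only when every packet previously
 * sent on the old queue has completed; otherwise stay on the old
 * queue to avoid out-of-order delivery. (The real code also
 * refreshes queue_ptr from the chosen queue's head_cnt.) */
static int pick_queue(const struct flow_ent *ent, int wanted_queue,
		      uint32_t old_queue_tail_cnt)
{
	if (ent->queue_index == wanted_queue)
		return wanted_queue;
	if (tail_passed(old_queue_tail_cnt, ent->queue_ptr))
		return wanted_queue;
	return ent->queue_index;
}
```

The unsigned-subtract-then-signed-compare idiom tolerates counter wraparound, which matters since head_cnt/tail_cnt are free-running 32-bit counters.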
[PATCH bug-fix] iproute: fix documentation for ip rule scan order
From 416f45b62f33017d19a9b14e7b0179807c993cbe Mon Sep 17 00:00:00 2001
From: Iskren Chernev
Date: Tue, 30 Aug 2016 17:08:54 -0700
Subject: [PATCH bug-fix] iproute: fix documentation for ip rule scan order

---
 man/man8/ip-rule.8 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/man/man8/ip-rule.8 b/man/man8/ip-rule.8
index 1774ae3..3508d80 100644
--- a/man/man8/ip-rule.8
+++ b/man/man8/ip-rule.8
@@ -93,7 +93,7 @@ Each policy routing rule consists of a
 .B selector
 and an
 .B action predicate.
-The RPDB is scanned in order of decreasing priority. The selector
+The RPDB is scanned in order of increasing priority. The selector
 of each rule is applied to {source address, destination address,
 incoming interface, tos, fwmark} and, if the selector matches
 the packet, the action is performed. The action predicate may
 return with success.
--
2.4.5
[PATCH RFC 2/4] bql: Add tracking of inflight packets
Add two fields to netdev_queue as head_cnt and tail_cnt. head_cnt is incremented for every sent packet in netdev_tx_sent_queue and tail_cnt is incremented by the number of packets in netdev_tx_completed_queue. So then the number of inflight packets for a queue is simply queue->head_cnt - queue->tail_cnt. Add inflight_pkts to be reported in sys-fs. Signed-off-by: Tom Herbert--- include/linux/netdevice.h | 4 net/core/net-sysfs.c | 11 +++ 2 files changed, 15 insertions(+) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index d122be9..487d1df 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -592,6 +592,8 @@ struct netdev_queue { #ifdef CONFIG_BQL struct dql dql; + unsigned inthead_cnt; + unsigned inttail_cnt; #endif } cacheline_aligned_in_smp; @@ -2958,6 +2960,7 @@ static inline void netdev_tx_sent_queue(struct netdev_queue *dev_queue, unsigned int bytes) { #ifdef CONFIG_BQL + dev_queue->head_cnt++; dql_queued(_queue->dql, bytes); if (likely(dql_avail(_queue->dql) >= 0)) @@ -2999,6 +3002,7 @@ static inline void netdev_tx_completed_queue(struct netdev_queue *dev_queue, if (unlikely(!bytes)) return; + dev_queue->tail_cnt += pkts; dql_completed(_queue->dql, bytes); /* diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c index 6e4f347..5a33f6a 100644 --- a/net/core/net-sysfs.c +++ b/net/core/net-sysfs.c @@ -1147,6 +1147,16 @@ static ssize_t bql_show_inflight(struct netdev_queue *queue, static struct netdev_queue_attribute bql_inflight_attribute = __ATTR(inflight, S_IRUGO, bql_show_inflight, NULL); +static ssize_t bql_show_inflight_pkts(struct netdev_queue *queue, + struct netdev_queue_attribute *attr, + char *buf) +{ + return sprintf(buf, "%u\n", queue->head_cnt - queue->tail_cnt); +} + +static struct netdev_queue_attribute bql_inflight_pkts_attribute = + __ATTR(inflight_pkts, S_IRUGO, bql_show_inflight_pkts, NULL); + #define BQL_ATTR(NAME, FIELD) \ static ssize_t bql_show_ ## NAME(struct netdev_queue *queue, \ struct 
netdev_queue_attribute *attr, \ @@ -1176,6 +1186,7 @@ static struct attribute *dql_attrs[] = { _limit_min_attribute.attr, _hold_time_attribute.attr, _inflight_attribute.attr, + _inflight_pkts_attribute.attr, NULL }; -- 2.8.0.rc2
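A userspace sketch of the two counters this patch adds (illustrative only; `struct txq_cnt` and the helper names are stand-ins, not the kernel structures): unsigned arithmetic makes `head_cnt - tail_cnt` the inflight packet count even after either counter wraps.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-ins for the head_cnt/tail_cnt fields added to
 * netdev_queue: sent() models the increment in netdev_tx_sent_queue,
 * completed() the one in netdev_tx_completed_queue. */
struct txq_cnt {
	uint32_t head_cnt;
	uint32_t tail_cnt;
};

static void sent(struct txq_cnt *q)
{
	q->head_cnt++;		/* one packet handed to the NIC */
}

static void completed(struct txq_cnt *q, uint32_t pkts)
{
	q->tail_cnt += pkts;	/* pkts packets reported complete */
}

/* Same expression the new bql_show_inflight_pkts() sysfs hook prints. */
static uint32_t inflight(const struct txq_cnt *q)
{
	return q->head_cnt - q->tail_cnt;
}
```

Because both counters are uint32_t, the difference stays correct across 2^32 wraparound as long as fewer than 2^31 packets are ever in flight at once.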
[PATCH RFC 4/4] xfs: Transmit flow steering
XFS maintains a per device flow table that is indexed by the skbuff hash. The XFS table is only consulted when there is no queue saved in a transmit socket for an skbuff. Each entry in the flow table contains a queue index and a queue pointer. The queue pointer is set when a queue is chosen using a flow table entry. This pointer is set to the head pointer in the transmit queue (which is maintained by BQL). The new function get_xfs_index looks up flows in the XPS table. The entry returned gives the last queue a matching flow used. The returned queue is compared against the normal XPS queue. If they are different, then we only switch if the tail pointer in the TX queue has advanced past the pointer saved in the entry. In this way OOO should be avoided when XPS wants to use a different queue. Signed-off-by: Tom Herbert--- net/Kconfig| 6  net/core/dev.c | 93 -- 2 files changed, 84 insertions(+), 15 deletions(-) diff --git a/net/Kconfig b/net/Kconfig index 7b6cd34..5e3eddf 100644 --- a/net/Kconfig +++ b/net/Kconfig @@ -255,6 +255,12 @@ config XPS depends on SMP default y +config XFS + bool + depends on XPS + depends on BQL + default y + config HWBM bool diff --git a/net/core/dev.c b/net/core/dev.c index 1d5c6dd..722e487 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -3210,6 +3210,7 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev) } #endif /* CONFIG_NET_EGRESS */ +/* Must be called with RCU read_lock */ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb) { #ifdef CONFIG_XPS @@ -3217,7 +3218,6 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb) struct xps_map *map; int queue_index = -1; - rcu_read_lock(); dev_maps = rcu_dereference(dev->xps_maps); if (dev_maps) { map = rcu_dereference( @@ -3232,7 +3232,6 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb) queue_index = -1; } } - rcu_read_unlock(); return queue_index; #else @@ -3240,26 +3239,90 @@ static
inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb) #endif } -static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb) +/* Must be called with RCU read_lock */ +static int get_xfs_index(struct net_device *dev, struct sk_buff *skb) { - struct sock *sk = skb->sk; - int queue_index = sk_tx_queue_get(sk); +#ifdef CONFIG_XFS + struct xps_dev_flow_table *flow_table; + struct xps_dev_flow ent; + int queue_index; + struct netdev_queue *txq; + u32 hash; - if (queue_index < 0 || skb->ooo_okay || - queue_index >= dev->real_num_tx_queues) { - int new_index = get_xps_queue(dev, skb); - if (new_index < 0) - new_index = skb_tx_hash(dev, skb); + flow_table = rcu_dereference(dev->xps_flow_table); + if (!flow_table) + return -1; - if (queue_index != new_index && sk && - sk_fullsock(sk) && - rcu_access_pointer(sk->sk_dst_cache)) - sk_tx_queue_set(sk, new_index); + queue_index = get_xps_queue(dev, skb); + if (queue_index < 0) + return -1; - queue_index = new_index; + hash = skb_get_hash(skb); + if (!hash) + return -1; + + ent.v64 = flow_table->flows[hash & flow_table->mask].v64; + if (ent.queue_index >= 0 && + ent.queue_index < dev->real_num_tx_queues) { + txq = netdev_get_tx_queue(dev, ent.queue_index); + if (queue_index != ent.queue_index) { + if ((int)(txq->tail_cnt - ent.queue_ptr) >= 0) { + /* The current queue's tail has advanced +* beyond the last packet that was +* enqueued using the table entry. All +* previous packets sent for this flow +* should have been completed so the +* queue for the flow can be changed. +*/ + ent.queue_index = queue_index; + txq = netdev_get_tx_queue(dev, queue_index); + } else { + queue_index = ent.queue_index; + } + } + } else { + /* Queue from the table was bad, use the new one. */ + ent.queue_index = queue_index; + txq = netdev_get_tx_queue(dev, queue_index); } + /* Save the updated entry */ + ent.queue_ptr = txq->head_cnt; +
[PATCH RFC 1/4] net: Set SW hash in skb_set_hash_from_sk
Use the __skb_set_sw_hash to set the hash in an skbuff from the socket txhash. Signed-off-by: Tom Herbert--- include/net/sock.h | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/include/net/sock.h b/include/net/sock.h index c797c57..12e585c 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -1910,10 +1910,8 @@ static inline void sock_poll_wait(struct file *filp, static inline void skb_set_hash_from_sk(struct sk_buff *skb, struct sock *sk) { - if (sk->sk_txhash) { - skb->l4_hash = 1; - skb->hash = sk->sk_txhash; - } + if (sk->sk_txhash) + __skb_set_sw_hash(skb, sk->sk_txhash, true); } void skb_set_owner_w(struct sk_buff *skb, struct sock *sk); -- 2.8.0.rc2
[PATCH RFC 3/4] net: Add xps_dev_flow_table_cnt
Add infrastructure and definitions to create XFS flow tables. This creates the new sys entry /sys/class/net/eth*/xps_dev_flow_table_cnt Signed-off-by: Tom Herbert--- include/linux/netdevice.h | 22 ++ net/core/net-sysfs.c | 76 +++ 2 files changed, 98 insertions(+) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 487d1df..d30e1bb 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -736,8 +736,28 @@ struct xps_dev_maps { }; #define XPS_DEV_MAPS_SIZE (sizeof(struct xps_dev_maps) + \ (nr_cpu_ids * sizeof(struct xps_map *))) + +struct xps_dev_flow { + union { + u64 v64; + struct { + int queue_index; + unsigned intqueue_ptr; + }; + }; +}; + +struct xps_dev_flow_table { + unsigned int mask; + struct rcu_head rcu; + struct xps_dev_flow flows[0]; +}; +#define XPS_DEV_FLOW_TABLE_SIZE(_num) (sizeof(struct xps_dev_flow_table) + \ + ((_num) * sizeof(struct xps_dev_flow))) + #endif /* CONFIG_XPS */ + #define TC_MAX_QUEUE 16 #define TC_BITMASK 15 /* HW offloaded queuing disciplines txq count and offset maps */ @@ -1810,6 +1830,8 @@ struct net_device { #ifdef CONFIG_XPS struct xps_dev_maps __rcu *xps_maps; + struct xps_dev_flow_table __rcu *xps_flow_table; + unsigned int xps_dev_flow_table_cnt; #endif #ifdef CONFIG_NET_CLS_ACT struct tcf_proto __rcu *egress_cl_list; diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c index 5a33f6a..41d0bc9 100644 --- a/net/core/net-sysfs.c +++ b/net/core/net-sysfs.c @@ -503,6 +503,79 @@ static ssize_t phys_switch_id_show(struct device *dev, } static DEVICE_ATTR_RO(phys_switch_id); +#ifdef CONFIG_XPS +static void xps_dev_flow_table_release(struct rcu_head *rcu) +{ + struct xps_dev_flow_table *table = container_of(rcu, + struct xps_dev_flow_table, rcu); + vfree(table); +} + +static int change_xps_dev_flow_table_cnt(struct net_device *dev, +unsigned long count) +{ + unsigned long mask; + struct xps_dev_flow_table *table, *old_table; + static DEFINE_SPINLOCK(xps_dev_flow_lock); + + if 
(!capable(CAP_NET_ADMIN)) + return -EPERM; + + if (count) { + mask = count - 1; + /* mask = roundup_pow_of_two(count) - 1; +* without overflows... +*/ + while ((mask | (mask >> 1)) != mask) + mask |= (mask >> 1); + /* On 64 bit arches, must check mask fits in table->mask (u32), +* and on 32bit arches, must check +* XPS_DEV_FLOW_TABLE_SIZE(mask + 1) doesn't overflow. +*/ +#if BITS_PER_LONG > 32 + if (mask > (unsigned long)(u32)mask) + return -EINVAL; +#else + if (mask > (ULONG_MAX - XPS_DEV_FLOW_TABLE_SIZE(1)) + / sizeof(struct xps_dev_flow)) { + /* Enforce a limit to prevent overflow */ + return -EINVAL; + } +#endif + table = vmalloc(XPS_DEV_FLOW_TABLE_SIZE(mask + 1)); + if (!table) + return -ENOMEM; + + table->mask = mask; + for (count = 0; count <= mask; count++) + table->flows[count].queue_index = -1; + } else + table = NULL; + + spin_lock(_dev_flow_lock); + old_table = rcu_dereference_protected(dev->xps_flow_table, + lockdep_is_held(_dev_flow_lock)); + rcu_assign_pointer(dev->xps_flow_table, table); + dev->xps_dev_flow_table_cnt = count; + spin_unlock(_dev_flow_lock); + + if (old_table) + call_rcu(_table->rcu, xps_dev_flow_table_release); + + return 0; +} + +static ssize_t xps_dev_flow_table_cnt_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t len) +{ + return netdev_store(dev, attr, buf, len, change_xps_dev_flow_table_cnt); +} + +NETDEVICE_SHOW_RW(xps_dev_flow_table_cnt, fmt_dec); + +#endif + static struct attribute *net_class_attrs[] = { _attr_netdev_group.attr, _attr_type.attr, @@ -531,6 +604,9 @@ static struct attribute *net_class_attrs[] = { _attr_phys_port_name.attr, _attr_phys_switch_id.attr, _attr_proto_down.attr, +#ifdef CONFIG_XPS + _attr_xps_dev_flow_table_cnt.attr, +#endif NULL, }; ATTRIBUTE_GROUPS(net_class); -- 2.8.0.rc2
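The bit-smearing loop in change_xps_dev_flow_table_cnt rounds the requested count up to a power of two without calling roundup_pow_of_two. A standalone copy of that loop behaves as follows (a sketch, assuming count > 0; the helper name `table_mask` is illustrative):

```c
#include <assert.h>

/* Smears the highest set bit of (count - 1) downward until the value
 * is all-ones below that bit, yielding roundup_pow_of_two(count) - 1,
 * i.e. the mask for a table of the next power-of-two size. */
static unsigned long table_mask(unsigned long count)
{
	unsigned long mask = count - 1;

	while ((mask | (mask >> 1)) != mask)
		mask |= (mask >> 1);
	return mask;
}
```

For example, a requested count of 5 produces mask 7 (table size 8), while an exact power of two such as 8 produces mask 7 unchanged; the patch then allocates mask + 1 entries.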
Re: [ovs-dev] [PATCH net-next v11 5/6] openvswitch: add layer 3 flow/port support
On 26 August 2016 at 02:13, Simon Hormanwrote: > On Thu, Aug 25, 2016 at 05:33:57PM -0700, Joe Stringer wrote: >> On 25 August 2016 at 03:08, Simon Horman wrote: >> > Please find my working patch below. >> > >> > From: Simon Horman >> > Subject: [PATCH] system-traffic: Exercise GSO >> > >> > Exercise GSO for: unencapsulated; MPLS; GRE; and MPLS in GRE. >> > >> > There is scope to extend this testing to other encapsulation formats >> > if desired. >> > >> > This is motivated by a desire to test GRE and MPLS encapsulation in >> > the context of L3/VPN (MPLS over non-TEB GRE work). That is not >> > tested here but tests for those cases would idealy be based on those in >> > this patch. >> > >> > Signed-off-by: Simon Horman >> >> I realised that these tests disable TSO, but they don't actually check >> if GSO is enabled. Maybe it's safe to assume this, but it's more >> explicit to actually look for it in the tests. > > Good point, I'll see about checking that. > >> With particular setups (fedora23 in particular, both kernel and >> userspace testsuites) I see this: >> >> ./system-traffic.at:371: ip netns exec at_ns0 sh << NS_EXEC_HEREDOC >> ip route add 10.1.2.0/24 encap mpls 100 via inet 10.1.1.2 dev ns_gre0 >> NS_EXEC_HEREDOC >> --- /dev/null 2016-08-19 01:28:02.15100 + >> +++ >> /home/gitlab-runner/builds/83c49bff/0/root/gitlab-ovs/ovs/tests/system-kmod-testsuite.dir/at-groups/10/stderr >> 2016-08-25 17:16:27.32400 + >> @@ -0,0 +1 @@ >> +Error: either "to" is duplicate, or "encap" is a garbage. >> >> I'm guessing the ip tool is a little out of date. We could detect and >> skip this with something like: >> >> AT_SKIP_IF([ip route help 2>&1 | grep encap]) >> >> in the CHECK_MPLS. > > Thanks, I'll add something like that. > >> Hmm, I'm still seeing the bad counts of segments retransmited even >> with the diff change on a kernel I have built at bf0f500bd019 ("Merge >> tag 'trace-v4.8-1' of >> git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace"). 
>> If it's passing on latest net-next then maybe I just need to swap out >> that box's kernel for a newer build. I'll try that. > > It is possible that it is detecting a bug. > Which test is failing? FWIW I tried with a newer build, commit 9a0a5c4cb1af ("net: systemport: Fix ordering in intrl2_*_mask_clear macro"). I no longer see the issue. Unfortunately I lost my test output. It was one of these two: 8: datapath - ping over gre tunnel FAILED (system-traffic.at:294) 9: datapath - http over gre tunnel FAILED (system-traffic.at:348) I also realised that I didn't have MPLS router enabled in my kernel config so the MPLS tests were getting skipped. I enabled MPLS_ROUTING, but now I see this failure on the "http over mpls" tests: ./system-traffic.at:111: ip netns exec at_ns0 sh << NS_EXEC_HEREDOC ip route add 10.1.1.0/24 encap mpls 100 via inet 172.31.1.2 dev p0 NS_EXEC_HEREDOC --- /dev/null 2016-08-30 15:22:28.813316948 -0700 +++ /home/gitlab-runner/builds/f1d4a2be/0/root/gitlab-ovs/ovs/tests/system-kmod-testsuite.dir/at-groups/4/stderr 2016-08-30 15:33:45.133306581 -0700 @@ -0,0 +1 @@ +RTNETLINK answers: Operation not supported > At this stage I have mostly added TSO/GSO testing to existing checks. > Perhaps it would be better to break them out into separate checks so > ping/http can be be checked without considering TSO/GSO which may have some > value in situations where TSO/GSO is broken which is actually what I am > interested in testing. Sounds reasonable.
Re: [PATCH 3/4] arm64: dts: rockchip: support gmac for rk3399
Am Mittwoch, 31. August 2016, 04:30:06 schrieb Caesar Wang: > This patch adds needed gamc information for rk3399, > also support the gmac pd. > > Signed-off-by: Roger Chen> Signed-off-by: Caesar Wang > --- > > arch/arm64/boot/dts/rockchip/rk3399.dtsi | 90 > 1 file changed, 90 insertions(+) > > diff --git a/arch/arm64/boot/dts/rockchip/rk3399.dtsi > b/arch/arm64/boot/dts/rockchip/rk3399.dtsi index 32aebc8..53ac651 100644 > --- a/arch/arm64/boot/dts/rockchip/rk3399.dtsi > +++ b/arch/arm64/boot/dts/rockchip/rk3399.dtsi > @@ -200,6 +200,26 @@ > }; > }; > > + gmac: eth@fe30 { > + compatible = "rockchip,rk3399-gmac"; > + reg = <0x0 0xfe30 0x0 0x1>; > + rockchip,grf = <>; should move below the reset-names . > + interrupts = ; > + interrupt-names = "macirq"; > + clocks = < SCLK_MAC>, < SCLK_MAC_RX>, > + < SCLK_MAC_TX>, < SCLK_MACREF>, > + < SCLK_MACREF_OUT>, < ACLK_GMAC>, > + < PCLK_GMAC>; > + clock-names = "stmmaceth", "mac_clk_rx", > + "mac_clk_tx", "clk_mac_ref", > + "clk_mac_refout", "aclk_mac", > + "pclk_mac"; > + resets = < SRST_A_GMAC>; > + reset-names = "stmmaceth"; > + power-domains = < RK3399_PD_GMAC>; The driver core should handle regular power-domain handling on device creation already, right? So I should be able to apply patches 3 and 4 even without the dwmac patches, right? Also if resending please move power-domains above resets Heiko
Re: [RFC v2 09/10] landlock: Handle cgroups (performance)
On Aug 30, 2016 1:56 PM, "Alexei Starovoitov"wrote: > > On Tue, Aug 30, 2016 at 10:33:31PM +0200, Mickaël Salaün wrote: > > > > > > On 30/08/2016 22:23, Andy Lutomirski wrote: > > > On Tue, Aug 30, 2016 at 1:20 PM, Mickaël Salaün wrote: > > >> > > >> On 30/08/2016 20:55, Andy Lutomirski wrote: > > >>> On Sun, Aug 28, 2016 at 2:42 AM, Mickaël Salaün > > >>> wrote: > > > > > > On 28/08/2016 10:13, Andy Lutomirski wrote: > > > On Aug 27, 2016 11:14 PM, "Mickaël Salaün" wrote: > > >> > > >> > > >> On 27/08/2016 22:43, Alexei Starovoitov wrote: > > >>> On Sat, Aug 27, 2016 at 09:35:14PM +0200, Mickaël Salaün wrote: > > On 27/08/2016 20:06, Alexei Starovoitov wrote: > > > On Sat, Aug 27, 2016 at 04:06:38PM +0200, Mickaël Salaün wrote: > > >> As said above, Landlock will not run an eBPF programs when not > > >> strictly > > >> needed. Attaching to a cgroup will have the same performance > > >> impact as > > >> attaching to a process hierarchy. > > > > > > Having a prog per cgroup per lsm_hook is the only scalable way I > > > could come up with. If you see another way, please propose. > > > current->seccomp.landlock_prog is not the answer. > > > > Hum, I don't see the difference from a performance point of view > > between > > a cgroup-based or a process hierarchy-based system. > > > > Maybe a better option should be to use an array of pointers with N > > entries, one for each supported hook, instead of a unique pointer > > list? > > >>> > > >>> yes, clearly array dereference is faster than link list walk. > > >>> Now the question is where to keep this prog_array[num_lsm_hooks] ? > > >>> Since we cannot keep it inside task_struct, we have to allocate it. > > >>> Every time the task is creted then. What to do on the fork? That > > >>> will require changes all over. Then the obvious optimization would > > >>> be > > >>> to share this allocated array of prog pointers across multiple > > >>> tasks... > > >>> and little by little this new facility will look like cgroup. 
> > >>> Hence the suggestion to put this array into cgroup from the start. > > >> > > >> I see your point :) > > >> > > >>> > > Anyway, being able to attach an LSM hook program to a cgroup > > thanks to > > the new BPF_PROG_ATTACH seems a good idea (while keeping the > > possibility > > to use a process hierarchy). The downside will be to handle an LSM > > hook > > program which is not triggered by a seccomp-filter, but this > > should be > > needed anyway to handle interruptions. > > >>> > > >>> what do you mean 'not triggered by seccomp' ? > > >>> You're not suggesting that this lsm has to enable seccomp to be > > >>> functional? > > >>> imo that's non starter due to overhead. > > >> > > >> Yes, for now, it is triggered by a new seccomp filter return value > > >> RET_LANDLOCK, which can take a 16-bit value called cookie. This must > > >> not > > >> be needed but could be useful to bind a seccomp filter security > > >> policy > > >> with a Landlock one. Waiting for Kees's point of view… > > >> > > > > > > I'm not Kees, but I'd be okay with that. I still think that doing > > > this by process hierarchy a la seccomp will be easier to use and to > > > understand (which is quite important for this kind of work) than doing > > > it by cgroup. > > > > > > A feature I've wanted to add for a while is to have an fd that > > > represents a seccomp layer, the idea being that you would set up your > > > seccomp layer (with syscall filter, landlock hooks, etc) and then you > > > would have a syscall to install that layer. Then an unprivileged > > > sandbox manager could set up its layer and still be able to inject new > > > processes into it later on, no cgroups needed. > > > > A nice thing I didn't highlight about Landlock is that a process can > > prepare a layer of rules (arraymap of handles + Landlock programs) and > > pass the file descriptors of the Landlock programs to another process. > > This process could then apply this programs to get sandboxed. 
However, > > for now, because a Landlock program is only triggered by a seccomp > > filter (which do not follow the Landlock programs as a FD), they will > > be > > useless. > > > > The FD referring to an arraymap of handles can also be used to update a > > map and change the behavior of a Landlock program. A master process can > > then add or remove restrictions to another process hierarchy on the > > fly. >
Re: [PATCH 2/4] net: stmmac: dwmac-rk: add pd_gmac support for rk3399
Hi David, [auto build test ERROR on rockchip/for-next] [also build test ERROR on v4.8-rc4 next-20160825] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] [Suggest to use git(>=2.9.0) format-patch --base= (or --base=auto for convenience) to record what (public, well-known) commit your patch series was built on] [Check https://git-scm.com/docs/git-format-patch for more information] url: https://github.com/0day-ci/linux/commits/Caesar-Wang/Support-the-rk3399-gmac-pd-function/20160831-043741 base: https://git.kernel.org/pub/scm/linux/kernel/git/mmind/linux-rockchip.git for-next config: xtensa-allmodconfig (attached as .config) compiler: xtensa-linux-gcc (GCC) 4.9.0 reproduce: wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # save the attached .config to linux build tree make.cross ARCH=xtensa All errors (new ones prefixed by >>): drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c: In function 'rk_gmac_powerdown': >> drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c:672:23: error: 'pdev' >> undeclared (first use in this function) pm_runtime_put_sync(>dev); ^ drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c:672:23: note: each undeclared identifier is reported only once for each function it appears in drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c: At top level: drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c:679:12: error: redefinition of 'rk_gmac_init' static int rk_gmac_init(struct platform_device *pdev, void *priv) ^ drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c:638:12: note: previous definition of 'rk_gmac_init' was here static int rk_gmac_init(struct platform_device *pdev, void *priv) ^ drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c: In function 'rk_gmac_init': drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c:683:2: error: implicit declaration of function 'rk_gmac_powerup' [-Werror=implicit-function-declaration] return 
rk_gmac_powerup(bsp_priv); ^ drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c: At top level: drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c:638:12: warning: 'rk_gmac_init' defined but not used [-Wunused-function] static int rk_gmac_init(struct platform_device *pdev, void *priv) ^ cc1: some warnings being treated as errors vim +/pdev +672 drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c 666 667 return 0; 668 } 669 670 static void rk_gmac_powerdown(struct rk_priv_data *gmac) 671 { > 672 pm_runtime_put_sync(>dev); 673 pm_runtime_disable(>dev); 674 675 phy_power_on(gmac, false); --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: Binary data
Re: [RFCv2 07/16] bpf: enable non-core use of the verifier
On 08/30/2016 10:48 PM, Alexei Starovoitov wrote: On Tue, Aug 30, 2016 at 10:22:46PM +0200, Jakub Kicinski wrote: On Tue, 30 Aug 2016 21:07:50 +0200, Daniel Borkmann wrote: Having two modes seems more straight forward and I think we would only need to pay attention in the LD_IMM64 case, I don't think I've seen LLVM generating XORs, it's just the cBPF -> eBPF conversion. Okay, though, I think that the cBPF to eBPF migration wouldn't even pass through the bpf_parse() handling, since verifier is not aware on some of their aspects such as emitting calls directly (w/o *proto) or arg mappings. Probably make sense to reject these (bpf_prog_was_classic()) if they cannot be handled anyway? TBH again I only use cBPF for testing. It's a convenient way of generating certain instruction sequences. I can probably just drop it completely but the XOR patch is just 3 lines of code so not a huge cost either... I'll keep patch 6 in my tree for now. if xor matching is only need for classic, I would drop that patch just to avoid unnecessary state collection. The number of lines is not a concern, but extra state for state prunning is. Alternatively - is there any eBPF assembler out there? Something converting verifier output back into ELF would be quite cool. would certainly be nice. I don't think there is anything standalone. btw llvm can be made to work as assembler only, but simple flex/bison is probably better. Never tried it out, but seems llvm backend doesn't have asm parser implemented? $ clang -target bpf -O2 -c foo.c -S -o foo.S $ llvm-mc -arch bpf foo.S -filetype=obj -o foo.o llvm-mc: error: this target does not support assembly parsing. LLVM IR might work, but maybe too high level(?); alternatively, we could make bpf_asm from tools/net/ eBPF aware for debugging purposes. If you have a toolchain supporting libbfd et al, you could probably make use of bpf_jit_dump() (like JITs do) and then bpf_jit_disasm tool (from same dir as bpf_asm).
Re: [RFC v2 09/10] landlock: Handle cgroups (performance)
On Tue, Aug 30, 2016 at 10:33:31PM +0200, Mickaël Salaün wrote: > > > On 30/08/2016 22:23, Andy Lutomirski wrote: > > On Tue, Aug 30, 2016 at 1:20 PM, Mickaël Salaünwrote: > >> > >> On 30/08/2016 20:55, Andy Lutomirski wrote: > >>> On Sun, Aug 28, 2016 at 2:42 AM, Mickaël Salaün wrote: > > > On 28/08/2016 10:13, Andy Lutomirski wrote: > > On Aug 27, 2016 11:14 PM, "Mickaël Salaün" wrote: > >> > >> > >> On 27/08/2016 22:43, Alexei Starovoitov wrote: > >>> On Sat, Aug 27, 2016 at 09:35:14PM +0200, Mickaël Salaün wrote: > On 27/08/2016 20:06, Alexei Starovoitov wrote: > > On Sat, Aug 27, 2016 at 04:06:38PM +0200, Mickaël Salaün wrote: > >> As said above, Landlock will not run an eBPF programs when not > >> strictly > >> needed. Attaching to a cgroup will have the same performance > >> impact as > >> attaching to a process hierarchy. > > > > Having a prog per cgroup per lsm_hook is the only scalable way I > > could come up with. If you see another way, please propose. > > current->seccomp.landlock_prog is not the answer. > > Hum, I don't see the difference from a performance point of view > between > a cgroup-based or a process hierarchy-based system. > > Maybe a better option should be to use an array of pointers with N > entries, one for each supported hook, instead of a unique pointer > list? > >>> > >>> yes, clearly array dereference is faster than link list walk. > >>> Now the question is where to keep this prog_array[num_lsm_hooks] ? > >>> Since we cannot keep it inside task_struct, we have to allocate it. > >>> Every time the task is creted then. What to do on the fork? That > >>> will require changes all over. Then the obvious optimization would be > >>> to share this allocated array of prog pointers across multiple > >>> tasks... > >>> and little by little this new facility will look like cgroup. > >>> Hence the suggestion to put this array into cgroup from the start. 
> >> > >> I see your point :) > >> > >>> > Anyway, being able to attach an LSM hook program to a cgroup thanks > to > the new BPF_PROG_ATTACH seems a good idea (while keeping the > possibility > to use a process hierarchy). The downside will be to handle an LSM > hook > program which is not triggered by a seccomp-filter, but this should > be > needed anyway to handle interruptions. > >>> > >>> what do you mean 'not triggered by seccomp' ? > >>> You're not suggesting that this lsm has to enable seccomp to be > >>> functional? > >>> imo that's non starter due to overhead. > >> > >> Yes, for now, it is triggered by a new seccomp filter return value > >> RET_LANDLOCK, which can take a 16-bit value called cookie. This must > >> not > >> be needed but could be useful to bind a seccomp filter security policy > >> with a Landlock one. Waiting for Kees's point of view… > >> > > > > I'm not Kees, but I'd be okay with that. I still think that doing > > this by process hierarchy a la seccomp will be easier to use and to > > understand (which is quite important for this kind of work) than doing > > it by cgroup. > > > > A feature I've wanted to add for a while is to have an fd that > > represents a seccomp layer, the idea being that you would set up your > > seccomp layer (with syscall filter, landlock hooks, etc) and then you > > would have a syscall to install that layer. Then an unprivileged > > sandbox manager could set up its layer and still be able to inject new > > processes into it later on, no cgroups needed. > > A nice thing I didn't highlight about Landlock is that a process can > prepare a layer of rules (arraymap of handles + Landlock programs) and > pass the file descriptors of the Landlock programs to another process. > This process could then apply this programs to get sandboxed. However, > for now, because a Landlock program is only triggered by a seccomp > filter (which do not follow the Landlock programs as a FD), they will be > useless. 
> > The FD referring to an arraymap of handles can also be used to update a > map and change the behavior of a Landlock program. A master process can > then add or remove restrictions to another process hierarchy on the fly. > >>> > >>> Maybe this could be extended a little bit. The fd could hold the > >>> seccomp filter *and* the LSM hook filters. FMODE_EXECUTE could give > >>> the ability to install it and FMODE_WRITE could give the ability to > >>> modify it. > >>> > >> > >> This is interesting! It should be possible to append the seccomp
Re: [PATCH V2] dt: net: enhance DWC EQoS binding to support Tegra186
On 08/30/2016 01:01 PM, Rob Herring wrote: On Wed, Aug 24, 2016 at 03:20:46PM -0600, Stephen Warren wrote: From: Stephen Warren

The Synopsys DWC EQoS is a configurable IP block which supports multiple options for bus type, clocking and reset structure, and feature list. Extend the DT binding to define a "compatible value" for the configuration contained in NVIDIA's Tegra186 SoC, and define some new properties and list property entries required by that configuration. diff --git a/Documentation/devicetree/bindings/net/snps,dwc-qos-ethernet.txt b/Documentation/devicetree/bindings/net/snps,dwc-qos-ethernet.txt Required properties: -- clocks: Phandles to the reference clock and the bus clock -- clock-names: Should be "phy_ref_clk" for the reference clock and "apb_pclk" - for the bus clock. +- clocks: Phandle and clock specifiers for each entry in clock-names, in the + same order. See ../clock/clock-bindings.txt. +- clock-names: May contain any/all of the following depending on the IP + configuration, in any order: No, they should be in a defined order. If the binding only defines "clocks", then yes the order must be specified. If the binding defines clock-names too, then the order is arbitrary since all clocks must be looked up via clock-names. That's the entire point of having a clock-names property. ... +The EQOS transmit path clock. The HW signal name is clk_tx_i. +In some configurations (e.g. GMII/RGMII), this clock also drives the PHY TX +path. In other configurations, other clocks (such as tx_125, rmii) may +drive the PHY TX path. + - "rx" +The EQOS receive path clock. The HW signal name is clk_rx_i. +In some configurations (e.g. GMII/RGMII), this clock also drives the PHY RX +path. In other configurations, other clocks (such as rx_125, pmarx_0, +pmarx_1, rmii) may drive the PHY RX path. + - "slave_bus" +(Alternate name "apb_pclk"; only one alternate must appear.) +The CPU/slave-bus (CSR) interface clock. 
Despite the name, this applies to +any bus type; APB, AHB, AXI, etc. The HW signal name is hclk_i (AHB) or +clk_csr_i (other buses). Sounds like 2 clocks here. + - "master_bus" +The master bus interface clock. Only required in configurations that use a +separate clock for the master and slave bus interfaces. The HW signal name +is hclk_i (AHB) or aclk_i (AXI). Sounds like 2 clocks. I'm guessing these are mutually exclusive based on whether you configure the IP for AHB or AXI? Yes, my understanding is that the two clocks are mutually exclusive in both cases. It seems simpler to have an entry in clocks/clock-names for each logical purpose, so that the driver can always retrieve a "slave bus clock" and a "master bus clock". That way, there's never any conditional code in the driver; it just gets two fixed clock names and enables them, no matter what the HW configuration. If the binding specifies 3 clocks, hclk_i, clk_csr_i, and aclk_i, then the driver needs to know which subset of clocks to retrieve based on compatible value or HW configuration. That seems like unnecessary complexity. I suppose the driver could just attempt to retrieve all 3 clocks, and ignore any missing clocks, but that would allow malformed DTs not to be noticed since the driver wouldn't validate the set of clocks present, and could lead to the driver touching the HW without all required clocks active, which at least in Tegra can lead to a HW hang. + Note: Support for additional IP configurations may require adding the + following clocks to this list in the future: clk_rx_125_i, clk_tx_125_i, + clk_pmarx_0_i, clk_pmarx1_i, clk_rmii_i, clk_revmii_rx_i, clk_revmii_tx_i. 
+ + The following compatible values require the following set of clocks: + - "nvidia,tegra186-eqos", "snps,dwc-qos-ethernet-4.10": +- "slave_bus" +- "master_bus" +- "rx" +- "tx" +- "ptp_ref" + - "axis,artpec6-eqos", "snps,dwc-qos-ethernet-4.10": +- "phy_ref_clk" +- "apb_clk" It would be good if this was marked deprecated and the full set of clocks could be described and supported. Not sure if you can figure that out. Is it really only 2 clocks, or these have multiple connections to the same source. Lars, can you answer here? I deliberately didn't attempt to change the binding definition for the existing use-case, since I'm not familiar with that SoC, and don't relish changing DTs for a platform I can't test.
Re: [RFCv2 16/16] nfp: bpf: add offload of TC direct action mode
On Tue, 30 Aug 2016 22:02:10 +0200, Daniel Borkmann wrote: > On 08/30/2016 12:52 PM, Jakub Kicinski wrote: > > On Mon, 29 Aug 2016 23:09:35 +0200, Daniel Borkmann wrote: > [...] > >> > >> In da mode, RECLASSIFY is not supported, so this one could be scratched. > >> For the OK and UNSPEC part, couldn't both be treated the same (as in: OK / > >> pass to stack roughly equivalent as in sch_handle_ingress())? Or is the > >> issue that you cannot populate skb->tc_index when passing to stack (maybe > >> just fine to leave it at 0 for now)? > > > > The comment is a bit confus(ed|ing). The problem is: > > > > tc filter add skip_sw > > tc filter add skip_hw > > > > If packet appears in the stack - was it because of OK or UNSPEC (or > > RECLASSIFY) in filter1? Do we need to run filter2 or not? Passing > > tc_index can be implemented the same way I do mark today. > > Okay, I see, thanks for explaining. So, if passing tc_index (or any other > meta data) can be implemented the same way as we do with mark already, > could we store such verdict, say, in some unused skb->tc_verd bits (the > skb->tc_index could be filled by the program already) and pass that up the > stack to differentiate between them? There should be no prior user before > ingress, so that patch 4 could become something like: > >if (tc_skip_sw(prog->gen_flags)) { > filter_res = tc_map_hw_verd_to_act(skb); >} else if (at_ingress) { > ... >} ... This looks promising! > And I assume it wouldn't make any sense anyway to have a skip_sw filter > being chained /after/ some skip_hw and the like, right? Right. I think it should be enforced by TC core or at least some shared code similar to tc_flags_valid() to reject offload attempts of filters which are not first in line from the wire. Right now AFAICT enabling transparent offload with ethtool may result in things going down to HW completely out of order and user doesn't even have to specify the skip_* flags...
Re: [RFCv2 07/16] bpf: enable non-core use of the verfier
On Tue, Aug 30, 2016 at 10:22:46PM +0200, Jakub Kicinski wrote: > On Tue, 30 Aug 2016 21:07:50 +0200, Daniel Borkmann wrote: > > > Having two modes seems more straight forward and I think we would only > > > need to pay attention in the LD_IMM64 case, I don't think I've seen > > > LLVM generating XORs, it's just the cBPF -> eBPF conversion. > > > > Okay, though, I think that the cBPF to eBPF migration wouldn't even > > pass through the bpf_parse() handling, since verifier is not aware on > > some of their aspects such as emitting calls directly (w/o *proto) or > > arg mappings. Probably make sense to reject these (bpf_prog_was_classic()) > > if they cannot be handled anyway? > > TBH again I only use cBPF for testing. It's a convenient way of > generating certain instruction sequences. I can probably just drop > it completely but the XOR patch is just 3 lines of code so not a huge > cost either... I'll keep patch 6 in my tree for now. if xor matching is only needed for classic, I would drop that patch just to avoid unnecessary state collection. The number of lines is not a concern, but extra state for state pruning is. > Alternatively - is there any eBPF assembler out there? Something > converting verifier output back into ELF would be quite cool. would certainly be nice. I don't think there is anything standalone. btw llvm can be made to work as assembler only, but simple flex/bison is probably better.
Re: [PATCH net-next 2/6] net/mlx5e: Read ETS settings directly from firmware
On Tue, Aug 30, 2016 at 2:29 PM, Saeed Mahameed wrote: > From: Huy Nguyen > > Current implementation does not read the setting > directly from FW when ieee_getets is called. what's wrong with that? explain
Re: [PATCH net-next 1/6] net/mlx5e: Support DCBX CEE API
On Tue, Aug 30, 2016 at 2:29 PM, Saeed Mahameed wrote: > From: Huy Nguyen > > Add DCBX CEE API interface for CX4. Configurations are stored in a > temporary structure and are applied to the card's firmware when the > CEE's setall callback function is called. > > Note: > priority group in CEE is equivalent to traffic class in ConnectX-4 > hardware spec. > > bw allocation per priority in CEE is not supported because CX4 > only supports bw allocation per traffic class. > > user priority in CEE does not have an equivalent term in CX4. > Therefore, user priority to priority mapping in CEE is not supported. basically our driver suites (mlx4/mlx5) are not written for a certain HW, but rather for multiple (past, present and future) devices, using dev caps advertised by the firmware to the driver. I see here lots of explicit CX4 mentioning... so (1) try to avoid it or make the description more general (2) do you base your code on dev caps or hard coded assumptions? > Test: see DCBX_LinuxDriverCX4 document section 6.4 what's the relevance of this for the upstream commit change log? > Signed-off-by: Huy Nguyen > Signed-off-by: Saeed Mahameed
Re: [RFC v2 09/10] landlock: Handle cgroups (performance)
On 30/08/2016 22:23, Andy Lutomirski wrote: > On Tue, Aug 30, 2016 at 1:20 PM, Mickaël Salaünwrote: >> >> On 30/08/2016 20:55, Andy Lutomirski wrote: >>> On Sun, Aug 28, 2016 at 2:42 AM, Mickaël Salaün wrote: On 28/08/2016 10:13, Andy Lutomirski wrote: > On Aug 27, 2016 11:14 PM, "Mickaël Salaün" wrote: >> >> >> On 27/08/2016 22:43, Alexei Starovoitov wrote: >>> On Sat, Aug 27, 2016 at 09:35:14PM +0200, Mickaël Salaün wrote: On 27/08/2016 20:06, Alexei Starovoitov wrote: > On Sat, Aug 27, 2016 at 04:06:38PM +0200, Mickaël Salaün wrote: >> As said above, Landlock will not run an eBPF programs when not >> strictly >> needed. Attaching to a cgroup will have the same performance impact >> as >> attaching to a process hierarchy. > > Having a prog per cgroup per lsm_hook is the only scalable way I > could come up with. If you see another way, please propose. > current->seccomp.landlock_prog is not the answer. Hum, I don't see the difference from a performance point of view between a cgroup-based or a process hierarchy-based system. Maybe a better option should be to use an array of pointers with N entries, one for each supported hook, instead of a unique pointer list? >>> >>> yes, clearly array dereference is faster than link list walk. >>> Now the question is where to keep this prog_array[num_lsm_hooks] ? >>> Since we cannot keep it inside task_struct, we have to allocate it. >>> Every time the task is creted then. What to do on the fork? That >>> will require changes all over. Then the obvious optimization would be >>> to share this allocated array of prog pointers across multiple tasks... >>> and little by little this new facility will look like cgroup. >>> Hence the suggestion to put this array into cgroup from the start. >> >> I see your point :) >> >>> Anyway, being able to attach an LSM hook program to a cgroup thanks to the new BPF_PROG_ATTACH seems a good idea (while keeping the possibility to use a process hierarchy). 
The downside will be to handle an LSM hook program which is not triggered by a seccomp-filter, but this should be needed anyway to handle interruptions. >>> >>> what do you mean 'not triggered by seccomp' ? >>> You're not suggesting that this lsm has to enable seccomp to be >>> functional? >>> imo that's non starter due to overhead. >> >> Yes, for now, it is triggered by a new seccomp filter return value >> RET_LANDLOCK, which can take a 16-bit value called cookie. This must not >> be needed but could be useful to bind a seccomp filter security policy >> with a Landlock one. Waiting for Kees's point of view… >> > > I'm not Kees, but I'd be okay with that. I still think that doing > this by process hierarchy a la seccomp will be easier to use and to > understand (which is quite important for this kind of work) than doing > it by cgroup. > > A feature I've wanted to add for a while is to have an fd that > represents a seccomp layer, the idea being that you would set up your > seccomp layer (with syscall filter, landlock hooks, etc) and then you > would have a syscall to install that layer. Then an unprivileged > sandbox manager could set up its layer and still be able to inject new > processes into it later on, no cgroups needed. A nice thing I didn't highlight about Landlock is that a process can prepare a layer of rules (arraymap of handles + Landlock programs) and pass the file descriptors of the Landlock programs to another process. This process could then apply this programs to get sandboxed. However, for now, because a Landlock program is only triggered by a seccomp filter (which do not follow the Landlock programs as a FD), they will be useless. The FD referring to an arraymap of handles can also be used to update a map and change the behavior of a Landlock program. A master process can then add or remove restrictions to another process hierarchy on the fly. >>> >>> Maybe this could be extended a little bit. 
The fd could hold the >>> seccomp filter *and* the LSM hook filters. FMODE_EXECUTE could give >>> the ability to install it and FMODE_WRITE could give the ability to >>> modify it. >>> >> >> This is interesting! It should be possible to append the seccomp stack >> of a source process to the seccomp stack of the target process when a >> Landlock program is passed and then activated through seccomp(2). >> >> For the FMODE_EXECUTE/FMODE_WRITE, are you suggesting to manage >> permission of the eBPF program FD in a specific way? >> > > This wouldn't be an eBPF program FD -- it
[PATCH 1/4] net: stmmac: dwmac-rk: fixes the gmac resume after PD on/off
From: Roger Chen

GMAC Power Domain (PD) will be disabled during suspend. That causes the GRF registers to be reset, so the corresponding GRF registers for the GMAC must be set up again on resume. Signed-off-by: Roger Chen Signed-off-by: Caesar Wang --- drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c | 20 +++- 1 file changed, 11 insertions(+), 9 deletions(-) diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c index 9210591..ea0e493 100644 --- a/drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c @@ -629,6 +629,17 @@ static struct rk_priv_data *rk_gmac_setup(struct platform_device *pdev, "rockchip,grf"); bsp_priv->pdev = pdev; + gmac_clk_init(bsp_priv); + + return bsp_priv; +} + +static int rk_gmac_init(struct platform_device *pdev, void *priv) +{ + struct rk_priv_data *bsp_priv = priv; + int ret; + struct device *dev = &pdev->dev; + /*rmii or rgmii*/ if (bsp_priv->phy_iface == PHY_INTERFACE_MODE_RGMII) { dev_info(dev, "init for RGMII\n"); @@ -641,15 +652,6 @@ static struct rk_priv_data *rk_gmac_setup(struct platform_device *pdev, dev_err(dev, "NO interface defined!\n"); } - gmac_clk_init(bsp_priv); - - return bsp_priv; -} - -static int rk_gmac_powerup(struct rk_priv_data *bsp_priv) -{ - int ret; - ret = phy_power_on(bsp_priv, true); if (ret) return ret; -- 1.9.1
[PATCH 2/4] net: stmmac: dwmac-rk: add pd_gmac support for rk3399
From: David Wu

Add the gmac power domain support for rk3399, in order to reduce power consumption. Signed-off-by: David Wu Signed-off-by: Caesar Wang --- drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c index ea0e493..71a1ca5 100644 --- a/drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c @@ -30,6 +30,7 @@ #include #include #include +#include <linux/pm_runtime.h> #include "stmmac_platform.h" @@ -660,11 +661,17 @@ static int rk_gmac_init(struct platform_device *pdev, void *priv) if (ret) return ret; + pm_runtime_enable(&pdev->dev); + pm_runtime_get_sync(&pdev->dev); + return 0; } static void rk_gmac_powerdown(struct rk_priv_data *gmac) { + pm_runtime_put_sync(&gmac->pdev->dev); + pm_runtime_disable(&gmac->pdev->dev); + phy_power_on(gmac, false); gmac_clk_enable(gmac, false); } -- 1.9.1
[PATCH 4/4] arm64: dts: rockchip: enable the gmac for rk3399 evb board
We add the required and optional properties for the evb board. See [0] for the detailed information. [0]: Documentation/devicetree/bindings/net/rockchip-dwmac.txt

Signed-off-by: Roger Chen
Signed-off-by: Caesar Wang --- arch/arm64/boot/dts/rockchip/rk3399-evb.dts | 31 + 1 file changed, 31 insertions(+) diff --git a/arch/arm64/boot/dts/rockchip/rk3399-evb.dts b/arch/arm64/boot/dts/rockchip/rk3399-evb.dts index d47b4e9..ed6f2e8 100644 --- a/arch/arm64/boot/dts/rockchip/rk3399-evb.dts +++ b/arch/arm64/boot/dts/rockchip/rk3399-evb.dts @@ -94,12 +94,43 @@ regulator-always-on; regulator-boot-on; }; + + clkin_gmac: external-gmac-clock { + compatible = "fixed-clock"; + clock-frequency = <125000000>; + clock-output-names = "clkin_gmac"; + #clock-cells = <0>; + }; + + vcc_phy: vcc-phy-regulator { + compatible = "regulator-fixed"; + regulator-name = "vcc_phy"; + regulator-always-on; + regulator-boot-on; + }; + }; _phy { status = "okay"; }; +&gmac { + phy-supply = <&vcc_phy>; + phy-mode = "rgmii"; + clock_in_out = "input"; + snps,reset-gpio = < 15 GPIO_ACTIVE_LOW>; + snps,reset-active-low; + snps,reset-delays-us = <0 1 5>; + assigned-clocks = <&cru SCLK_RMII_SRC>; + assigned-clock-parents = <&clkin_gmac>; + pinctrl-names = "default"; + pinctrl-0 = <&rgmii_pins>; + tx_delay = <0x28>; + rx_delay = <0x11>; + status = "okay"; +}; + { status = "okay"; }; -- 1.9.1
[PATCH 0/4] Support the rk3399 gmac pd function
This series adds handling for the gmac pd issue, and supports the rk3399 gmac in the devicetree. Caesar Wang (2): arm64: dts: rockchip: support gmac for rk3399 arm64: dts: rockchip: enable the gmac for rk3399 evb board David Wu (1): net: stmmac: dwmac-rk: add pd_gmac support for rk3399 Roger Chen (1): net: stmmac: dwmac-rk: fixes the gmac resume after PD on/off arch/arm64/boot/dts/rockchip/rk3399-evb.dts| 31 + arch/arm64/boot/dts/rockchip/rk3399.dtsi | 90 ++ drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c | 27 +--- 3 files changed, 139 insertions(+), 9 deletions(-) -- 1.9.1
[PATCH 3/4] arm64: dts: rockchip: support gmac for rk3399
This patch adds the needed gmac information for rk3399, and also supports the gmac pd.

Signed-off-by: Roger Chen
Signed-off-by: Caesar Wang --- arch/arm64/boot/dts/rockchip/rk3399.dtsi | 90 1 file changed, 90 insertions(+) diff --git a/arch/arm64/boot/dts/rockchip/rk3399.dtsi b/arch/arm64/boot/dts/rockchip/rk3399.dtsi index 32aebc8..53ac651 100644 --- a/arch/arm64/boot/dts/rockchip/rk3399.dtsi +++ b/arch/arm64/boot/dts/rockchip/rk3399.dtsi @@ -200,6 +200,26 @@ }; }; + gmac: ethernet@fe300000 { + compatible = "rockchip,rk3399-gmac"; + reg = <0x0 0xfe300000 0x0 0x10000>; + rockchip,grf = <&grf>; + interrupts = ; + interrupt-names = "macirq"; + clocks = <&cru SCLK_MAC>, <&cru SCLK_MAC_RX>, +<&cru SCLK_MAC_TX>, <&cru SCLK_MACREF>, +<&cru SCLK_MACREF_OUT>, <&cru ACLK_GMAC>, +<&cru PCLK_GMAC>; + clock-names = "stmmaceth", "mac_clk_rx", + "mac_clk_tx", "clk_mac_ref", + "clk_mac_refout", "aclk_mac", + "pclk_mac"; + resets = <&cru SRST_A_GMAC>; + reset-names = "stmmaceth"; + power-domains = <&power RK3399_PD_GMAC>; + status = "disabled"; + }; + sdio0: dwmmc@fe310000 { compatible = "rockchip,rk3399-dw-mshc", "rockchip,rk3288-dw-mshc"; @@ -611,6 +631,11 @@ status = "disabled"; }; + qos_gmac: qos@ffa5c000 { + compatible = "syscon"; + reg = <0x0 0xffa5c000 0x0 0x20>; + }; + qos_hdcp: qos@ffa90000 { compatible = "syscon"; reg = <0x0 0xffa90000 0x0 0x20>; }; @@ -704,6 +729,11 @@ #size-cells = <0>; /* These power domains are grouped by VD_CENTER */ + pd_gmac@RK3399_PD_GMAC { + reg = <RK3399_PD_GMAC>; + clocks = <&cru ACLK_GMAC>; + pm_qos = <&qos_gmac>; + }; pd_iep@RK3399_PD_IEP { reg = <RK3399_PD_IEP>; clocks = <&cru ACLK_IEP>, @@ -1183,6 +1213,66 @@ drive-strength = <13>; }; + gmac { + rgmii_pins: rgmii-pins { + rockchip,pins = + /* mac_txclk */ + <3 17 RK_FUNC_1 &pcfg_pull_none_13ma>, + /* mac_rxclk */ + <3 14 RK_FUNC_1 &pcfg_pull_none>, + /* mac_mdio */ + <3 13 RK_FUNC_1 &pcfg_pull_none>, + /* mac_txen */ + <3 12 RK_FUNC_1 &pcfg_pull_none_13ma>, + /* mac_clk */ + <3 11 RK_FUNC_1 &pcfg_pull_none>, + /* mac_rxdv */ + <3 9 RK_FUNC_1 &pcfg_pull_none>, + /* mac_mdc */ + <3 8 RK_FUNC_1 &pcfg_pull_none>, + /* mac_rxd1 */ + <3 7 RK_FUNC_1 &pcfg_pull_none>, + /* mac_rxd0 */ + <3 6 RK_FUNC_1 &pcfg_pull_none>, + /* mac_txd1 */ + <3 5 RK_FUNC_1 &pcfg_pull_none_13ma>, + /* mac_txd0 */ + <3 4 RK_FUNC_1 &pcfg_pull_none_13ma>, + /* mac_rxd3 */ + <3 3 RK_FUNC_1 &pcfg_pull_none>, + /* mac_rxd2 */ + <3 2 RK_FUNC_1 &pcfg_pull_none>, + /* mac_txd3 */ + <3 1 RK_FUNC_1 &pcfg_pull_none_13ma>, + /* mac_txd2 */ + <3 0 RK_FUNC_1 &pcfg_pull_none_13ma>; + }; + + rmii_pins: rmii-pins { + rockchip,pins = + /* mac_mdio */ + <3 13 RK_FUNC_1 &pcfg_pull_none>, + /* mac_txen */ + <3 12 RK_FUNC_1 &pcfg_pull_none_13ma>, + /* mac_clk */ + <3 11 RK_FUNC_1 &pcfg_pull_none>, +
Re: [RFC v2 06/10] landlock: Add LSM hooks
On 30/08/2016 22:18, Andy Lutomirski wrote: > On Tue, Aug 30, 2016 at 1:10 PM, Mickaël Salaünwrote: >> >> On 30/08/2016 20:56, Andy Lutomirski wrote: >>> On Aug 25, 2016 12:34 PM, "Mickaël Salaün" wrote: Add LSM hooks which can be used by userland through Landlock (eBPF) programs. This programs are limited to a whitelist of functions (cf. next commit). The eBPF program context is depicted by the struct landlock_data (cf. include/uapi/linux/bpf.h): * hook: LSM hook ID (useful when using the same program for multiple LSM hooks); * cookie: the 16-bit value from the seccomp filter that triggered this Landlock program; * args[6]: array of LSM hook arguments. The LSM hook arguments can contain raw values as integers or (unleakable) pointers. The only way to use the pointers are to pass them to an eBPF function according to their types (e.g. the bpf_landlock_cmp_fs_beneath_with_struct_file function can use a struct file pointer). For now, there is three hooks for file system access control: * file_open; * file_permission; * mmap_file. >>> >>> What's the purpose of exposing struct cred * to userspace? It's >>> primarily just an optimization to save a bit of RAM, and it's a >>> dubious optimization at that. What are you using it for? Would it >>> make more sense to use struct task_struct * or struct pid * instead? >>> >>> Also, exposing struct cred * has a really weird side-effect: it allows >>> (maybe even encourages) checking for pointer equality between two >>> struct cred * objects. Doing so will have erratic results. >>> >> >> The pointers exposed in the ePBF context are not directly readable by an >> unprivileged eBPF program thanks to the strong typing of the Landlock >> context and the static eBPF verification. There is no way to leak a >> kernel pointer to userspace from an unprivileged eBPF program: pointer >> arithmetic and comparison are prohibited. Pointers can only be pass as >> argument to dedicated eBPF functions. 
> > I'm not talking about leaking the value -- I'm talking about leaking > the predicate (a == b) for two struct cred pointers. That predicate > shouldn't be available because it has very odd effects. I'm pretty sure this case is covered by the impossibility of doing pointer comparisons. > > >> > >> For now, struct cred * is simply not used by any eBPF function and then > >> not usable at all. It only exist here because I map the LSM hook > >> arguments in a generic/automatic way to the eBPF context. > > Maybe remove it from this patch set then? Well, this is done with the LANDLOCK_HOOK* macros but I will remove it.
Re: [RFCv2 07/16] bpf: enable non-core use of the verfier
On Tue, 30 Aug 2016 21:07:50 +0200, Daniel Borkmann wrote: > > Having two modes seems more straight forward and I think we would only > > need to pay attention in the LD_IMM64 case, I don't think I've seen > > LLVM generating XORs, it's just the cBPF -> eBPF conversion. > > Okay, though, I think that the cBPF to eBPF migration wouldn't even > pass through the bpf_parse() handling, since verifier is not aware on > some of their aspects such as emitting calls directly (w/o *proto) or > arg mappings. Probably make sense to reject these (bpf_prog_was_classic()) > if they cannot be handled anyway? TBH again I only use cBPF for testing. It's a convenient way of generating certain instruction sequences. I can probably just drop it completely but the XOR patch is just 3 lines of code so not a huge cost either... I'll keep patch 6 in my tree for now. Alternatively - is there any eBPF assembler out there? Something converting verifier output back into ELF would be quite cool.
Re: [PATCH net-next 6/6] net/mlx5: Add handling for port module event
On Tue, Aug 30, 2016 at 2:29 PM, Saeed Mahameed wrote: > From: Huy Nguyen > +++ b/include/linux/mlx5/device.h > @@ -543,6 +544,15 @@ struct mlx5_eqe_vport_change { > __be32 rsvd1[6]; > } __packed; > > +struct mlx5_eqe_port_module { > + u8 rsvd0[1]; > + u8 module; > + u8 rsvd1[1]; > + u8 module_status; > + u8 rsvd2[2]; > + u8 error_type; > +}; > + Saeed, any reason for this struct and friends not to be @ the FW IFC file?
Re: [RFC v2 09/10] landlock: Handle cgroups (performance)
On Tue, Aug 30, 2016 at 1:20 PM, Mickaël Salaünwrote: > > On 30/08/2016 20:55, Andy Lutomirski wrote: >> On Sun, Aug 28, 2016 at 2:42 AM, Mickaël Salaün wrote: >>> >>> >>> On 28/08/2016 10:13, Andy Lutomirski wrote: On Aug 27, 2016 11:14 PM, "Mickaël Salaün" wrote: > > > On 27/08/2016 22:43, Alexei Starovoitov wrote: >> On Sat, Aug 27, 2016 at 09:35:14PM +0200, Mickaël Salaün wrote: >>> On 27/08/2016 20:06, Alexei Starovoitov wrote: On Sat, Aug 27, 2016 at 04:06:38PM +0200, Mickaël Salaün wrote: > As said above, Landlock will not run an eBPF programs when not > strictly > needed. Attaching to a cgroup will have the same performance impact as > attaching to a process hierarchy. Having a prog per cgroup per lsm_hook is the only scalable way I could come up with. If you see another way, please propose. current->seccomp.landlock_prog is not the answer. >>> >>> Hum, I don't see the difference from a performance point of view between >>> a cgroup-based or a process hierarchy-based system. >>> >>> Maybe a better option should be to use an array of pointers with N >>> entries, one for each supported hook, instead of a unique pointer list? >> >> yes, clearly array dereference is faster than link list walk. >> Now the question is where to keep this prog_array[num_lsm_hooks] ? >> Since we cannot keep it inside task_struct, we have to allocate it. >> Every time the task is creted then. What to do on the fork? That >> will require changes all over. Then the obvious optimization would be >> to share this allocated array of prog pointers across multiple tasks... >> and little by little this new facility will look like cgroup. >> Hence the suggestion to put this array into cgroup from the start. > > I see your point :) > >> >>> Anyway, being able to attach an LSM hook program to a cgroup thanks to >>> the new BPF_PROG_ATTACH seems a good idea (while keeping the possibility >>> to use a process hierarchy). 
The downside will be to handle an LSM hook >>> program which is not triggered by a seccomp-filter, but this should be >>> needed anyway to handle interruptions. >> >> what do you mean 'not triggered by seccomp' ? >> You're not suggesting that this lsm has to enable seccomp to be >> functional? >> imo that's non starter due to overhead. > > Yes, for now, it is triggered by a new seccomp filter return value > RET_LANDLOCK, which can take a 16-bit value called cookie. This must not > be needed but could be useful to bind a seccomp filter security policy > with a Landlock one. Waiting for Kees's point of view… > I'm not Kees, but I'd be okay with that. I still think that doing this by process hierarchy a la seccomp will be easier to use and to understand (which is quite important for this kind of work) than doing it by cgroup. A feature I've wanted to add for a while is to have an fd that represents a seccomp layer, the idea being that you would set up your seccomp layer (with syscall filter, landlock hooks, etc) and then you would have a syscall to install that layer. Then an unprivileged sandbox manager could set up its layer and still be able to inject new processes into it later on, no cgroups needed. >>> >>> A nice thing I didn't highlight about Landlock is that a process can >>> prepare a layer of rules (arraymap of handles + Landlock programs) and >>> pass the file descriptors of the Landlock programs to another process. >>> This process could then apply this programs to get sandboxed. However, >>> for now, because a Landlock program is only triggered by a seccomp >>> filter (which do not follow the Landlock programs as a FD), they will be >>> useless. >>> >>> The FD referring to an arraymap of handles can also be used to update a >>> map and change the behavior of a Landlock program. A master process can >>> then add or remove restrictions to another process hierarchy on the fly. >> >> Maybe this could be extended a little bit. 
The fd could hold the >> seccomp filter *and* the LSM hook filters. FMODE_EXECUTE could give >> the ability to install it and FMODE_WRITE could give the ability to >> modify it. >> > > This is interesting! It should be possible to append the seccomp stack > of a source process to the seccomp stack of the target process when a > Landlock program is passed and then activated through seccomp(2). > > For the FMODE_EXECUTE/FMODE_WRITE, are you suggesting to manage > permission of the eBPF program FD in a specific way? > This wouldn't be an eBPF program FD -- it would be an FD encapsulating an entire configuration including seccomp BPF program, whatever landlock stuff is associated, and eventual seccomp monitor configuration (once I write
Re: [RFC v2 09/10] landlock: Handle cgroups (performance)
On 30/08/2016 20:55, Andy Lutomirski wrote: > On Sun, Aug 28, 2016 at 2:42 AM, Mickaël Salaün wrote: >> >> >> On 28/08/2016 10:13, Andy Lutomirski wrote: >>> On Aug 27, 2016 11:14 PM, "Mickaël Salaün" wrote: On 27/08/2016 22:43, Alexei Starovoitov wrote: > On Sat, Aug 27, 2016 at 09:35:14PM +0200, Mickaël Salaün wrote: >> On 27/08/2016 20:06, Alexei Starovoitov wrote: >>> On Sat, Aug 27, 2016 at 04:06:38PM +0200, Mickaël Salaün wrote: As said above, Landlock will not run eBPF programs when not strictly needed. Attaching to a cgroup will have the same performance impact as attaching to a process hierarchy. >>> >>> Having a prog per cgroup per lsm_hook is the only scalable way I >>> could come up with. If you see another way, please propose. >>> current->seccomp.landlock_prog is not the answer. >> >> Hum, I don't see the difference from a performance point of view between >> a cgroup-based or a process hierarchy-based system. >> >> Maybe a better option would be to use an array of pointers with N >> entries, one for each supported hook, instead of a unique pointer list? > > yes, clearly array dereference is faster than link list walk. > Now the question is where to keep this prog_array[num_lsm_hooks] ? > Since we cannot keep it inside task_struct, we have to allocate it. > Every time a task is created, then. What to do on the fork? That > will require changes all over. Then the obvious optimization would be > to share this allocated array of prog pointers across multiple tasks... > and little by little this new facility will look like cgroup. > Hence the suggestion to put this array into cgroup from the start. I see your point :) > >> Anyway, being able to attach an LSM hook program to a cgroup thanks to >> the new BPF_PROG_ATTACH seems a good idea (while keeping the possibility >> to use a process hierarchy). 
The downside will be to handle an LSM hook >> program which is not triggered by a seccomp-filter, but this should be >> needed anyway to handle interruptions. > > what do you mean 'not triggered by seccomp' ? > You're not suggesting that this lsm has to enable seccomp to be > functional? > imo that's non starter due to overhead. Yes, for now, it is triggered by a new seccomp filter return value RET_LANDLOCK, which can take a 16-bit value called cookie. This must not be needed but could be useful to bind a seccomp filter security policy with a Landlock one. Waiting for Kees's point of view… >>> >>> I'm not Kees, but I'd be okay with that. I still think that doing >>> this by process hierarchy a la seccomp will be easier to use and to >>> understand (which is quite important for this kind of work) than doing >>> it by cgroup. >>> >>> A feature I've wanted to add for a while is to have an fd that >>> represents a seccomp layer, the idea being that you would set up your >>> seccomp layer (with syscall filter, landlock hooks, etc) and then you >>> would have a syscall to install that layer. Then an unprivileged >>> sandbox manager could set up its layer and still be able to inject new >>> processes into it later on, no cgroups needed. >> >> A nice thing I didn't highlight about Landlock is that a process can >> prepare a layer of rules (arraymap of handles + Landlock programs) and >> pass the file descriptors of the Landlock programs to another process. >> This process could then apply this programs to get sandboxed. However, >> for now, because a Landlock program is only triggered by a seccomp >> filter (which do not follow the Landlock programs as a FD), they will be >> useless. >> >> The FD referring to an arraymap of handles can also be used to update a >> map and change the behavior of a Landlock program. A master process can >> then add or remove restrictions to another process hierarchy on the fly. > > Maybe this could be extended a little bit. 
The fd could hold the > seccomp filter *and* the LSM hook filters. FMODE_EXECUTE could give > the ability to install it and FMODE_WRITE could give the ability to > modify it. > This is interesting! It should be possible to append the seccomp stack of a source process to the seccomp stack of the target process when a Landlock program is passed and then activated through seccomp(2). For the FMODE_EXECUTE/FMODE_WRITE, are you suggesting to manage permission of the eBPF program FD in a specific way? signature.asc Description: OpenPGP digital signature
Re: [RFC v2 06/10] landlock: Add LSM hooks
On Tue, Aug 30, 2016 at 1:10 PM, Mickaël Salaün wrote: > > On 30/08/2016 20:56, Andy Lutomirski wrote: >> On Aug 25, 2016 12:34 PM, "Mickaël Salaün" wrote: >>> >>> Add LSM hooks which can be used by userland through Landlock (eBPF) >>> programs. These programs are limited to a whitelist of functions (cf. >>> next commit). The eBPF program context is depicted by the struct >>> landlock_data (cf. include/uapi/linux/bpf.h): >>> * hook: LSM hook ID (useful when using the same program for multiple LSM >>> hooks); >>> * cookie: the 16-bit value from the seccomp filter that triggered this >>> Landlock program; >>> * args[6]: array of LSM hook arguments. >>> >>> The LSM hook arguments can contain raw values as integers or >>> (unleakable) pointers. The only way to use the pointers is to pass them >>> to an eBPF function according to their types (e.g. the >>> bpf_landlock_cmp_fs_beneath_with_struct_file function can use a struct >>> file pointer). >>> >>> For now, there are three hooks for file system access control: >>> * file_open; >>> * file_permission; >>> * mmap_file. >>> >> >> What's the purpose of exposing struct cred * to userspace? It's >> primarily just an optimization to save a bit of RAM, and it's a >> dubious optimization at that. What are you using it for? Would it >> make more sense to use struct task_struct * or struct pid * instead? >> >> Also, exposing struct cred * has a really weird side-effect: it allows >> (maybe even encourages) checking for pointer equality between two >> struct cred * objects. Doing so will have erratic results. >> > > The pointers exposed in the eBPF context are not directly readable by an > unprivileged eBPF program thanks to the strong typing of the Landlock > context and the static eBPF verification. There is no way to leak a > kernel pointer to userspace from an unprivileged eBPF program: pointer > arithmetic and comparison are prohibited. Pointers can only be passed as > arguments to dedicated eBPF functions. 
I'm not talking about leaking the value -- I'm talking about leaking the predicate (a == b) for two struct cred pointers. That predicate shouldn't be available because it has very odd effects. > > For now, struct cred * is simply not used by any eBPF function and then > not usable at all. It only exist here because I map the LSM hook > arguments in a generic/automatic way to the eBPF context. Maybe remove it from this patch set then? --Andy
Re: [RFC v2 06/10] landlock: Add LSM hooks
On 30/08/2016 20:56, Andy Lutomirski wrote: > On Aug 25, 2016 12:34 PM, "Mickaël Salaün" wrote: >> >> Add LSM hooks which can be used by userland through Landlock (eBPF) >> programs. These programs are limited to a whitelist of functions (cf. >> next commit). The eBPF program context is depicted by the struct >> landlock_data (cf. include/uapi/linux/bpf.h): >> * hook: LSM hook ID (useful when using the same program for multiple LSM >> hooks); >> * cookie: the 16-bit value from the seccomp filter that triggered this >> Landlock program; >> * args[6]: array of LSM hook arguments. >> >> The LSM hook arguments can contain raw values as integers or >> (unleakable) pointers. The only way to use the pointers is to pass them >> to an eBPF function according to their types (e.g. the >> bpf_landlock_cmp_fs_beneath_with_struct_file function can use a struct >> file pointer). >> >> For now, there are three hooks for file system access control: >> * file_open; >> * file_permission; >> * mmap_file. >> > > What's the purpose of exposing struct cred * to userspace? It's > primarily just an optimization to save a bit of RAM, and it's a > dubious optimization at that. What are you using it for? Would it > make more sense to use struct task_struct * or struct pid * instead? > > Also, exposing struct cred * has a really weird side-effect: it allows > (maybe even encourages) checking for pointer equality between two > struct cred * objects. Doing so will have erratic results. > The pointers exposed in the eBPF context are not directly readable by an unprivileged eBPF program thanks to the strong typing of the Landlock context and the static eBPF verification. There is no way to leak a kernel pointer to userspace from an unprivileged eBPF program: pointer arithmetic and comparison are prohibited. Pointers can only be passed as arguments to dedicated eBPF functions. For now, struct cred * is simply not used by any eBPF function and thus not usable at all. 
It only exists here because I map the LSM hook arguments in a generic/automatic way to the eBPF context. I'm planning to extend the Landlock context with extra pointers, regardless of the LSM hook. We could then use task_struct, skb or any other kernel objects, in a safe way, with dedicated functions.
Re: [RFCv2 16/16] nfp: bpf: add offload of TC direct action mode
On 08/30/2016 12:52 PM, Jakub Kicinski wrote: On Mon, 29 Aug 2016 23:09:35 +0200, Daniel Borkmann wrote: +* 0,1 okNOT SUPPORTED[1] +* 2 drop 0x22 -> drop, count as stat1 +* 4,5 nuke 0x02 -> drop +* 7 redir 0x44 -> redir, count as stat2 +* * unspec 0x11 -> pass, count as stat0 +* +* [1] We can't support OK and RECLASSIFY because we can't tell TC +* the exact decision made. We are forced to support UNSPEC +* to handle aborts so that's the only one we handle for passing +* packets up the stack. In da mode, RECLASSIFY is not supported, so this one could be scratched. For the OK and UNSPEC part, couldn't both be treated the same (as in: OK / pass to stack roughly equivalent as in sch_handle_ingress())? Or is the issue that you cannot populate skb->tc_index when passing to stack (maybe just fine to leave it at 0 for now)? The comment is a bit confus(ed|ing). The problem is: tc filter add skip_sw tc filter add skip_hw If packet appears in the stack - was it because of OK or UNSPEC (or RECLASSIFY) in filter1? Do we need to run filter2 or not? Passing tc_index can be implemented the same way I do mark today. Okay, I see, thanks for explaining. So, if passing tc_index (or any other meta data) can be implemented the same way as we do with mark already, could we store such verdict, say, in some unused skb->tc_verd bits (the skb->tc_index could be filled by the program already) and pass that up the stack to differentiate between them? There should be no prior user before ingress, so that patch 4 could become something like: if (tc_skip_sw(prog->gen_flags)) { filter_res = tc_map_hw_verd_to_act(skb); } else if (at_ingress) { ... } ... And I assume it wouldn't make any sense anyway to have a skip_sw filter being chained /after/ some skip_hw and the like, right? Just curious, does TC_ACT_REDIRECT work in this scenario? I do the redirects in the card, all the problems stem from the Ok, cool. 
difficulty of passing full ret code in the skb from the driver to tc_classify()/cls_bpf_classify().
Re: [RFC v2 00/10] Landlock LSM: Unprivileged sandboxing
On Tue, Aug 30, 2016 at 12:51 PM, Mickaël Salaün wrote: > > On 30/08/2016 18:06, Andy Lutomirski wrote: >> On Thu, Aug 25, 2016 at 3:32 AM, Mickaël Salaün wrote: >>> Hi, >>> >>> This series is a proof of concept to fill some missing parts of seccomp, such >>> as the ability to check syscall argument pointers or to create more dynamic >>> security policies. The goal of this new stackable Linux Security Module (LSM) called >>> Landlock is to allow any process, including unprivileged ones, to create >>> powerful security sandboxes comparable to the Seatbelt/XNU Sandbox or the >>> OpenBSD Pledge. This kind of sandbox helps to mitigate the security impact of >>> bugs or unexpected/malicious behaviors in userland applications. >> >> Mickaël, will you be at KS and/or LPC? >> > > I won't be at KS/LPC but I will give a talk at Kernel Recipes (Paris) > for which registration will start Thursday (and will not last long). :) There's a teeny tiny chance I'll be there. I've done way too much traveling lately.
Re: [RFC v2 00/10] Landlock LSM: Unprivileged sandboxing
On 30/08/2016 18:06, Andy Lutomirski wrote: > On Thu, Aug 25, 2016 at 3:32 AM, Mickaël Salaün wrote: >> Hi, >> >> This series is a proof of concept to fill some missing parts of seccomp, such as >> the ability to check syscall argument pointers or to create more dynamic security >> policies. The goal of this new stackable Linux Security Module (LSM) called >> Landlock is to allow any process, including unprivileged ones, to create >> powerful security sandboxes comparable to the Seatbelt/XNU Sandbox or the >> OpenBSD Pledge. This kind of sandbox helps to mitigate the security impact of >> bugs or unexpected/malicious behaviors in userland applications. > > Mickaël, will you be at KS and/or LPC? > I won't be at KS/LPC but I will give a talk at Kernel Recipes (Paris) for which registration will start Thursday (and will not last long). :) Mickaël
Re: [PATCH v2] ipv6: Use inbound ifaddr as source addresses for ICMPv6 errors
On Mon, Aug 29, 2016 at 02:34:32AM +0800, Eli Cooper wrote: > Hello, > > > On 2016/8/29 1:18, Guillaume Nault wrote: > > On Sun, Aug 28, 2016 at 11:34:06AM +0800, Eli Cooper wrote: > >> According to RFC 1885 2.2(c), the source address of ICMPv6 > >> errors in response to forwarded packets should be set to the > >> unicast address of the forwarding interface in order to be helpful > >> in diagnosis. > >> > > FWIW, this behaviour was deprecated ten years ago by RFC 4443: > > "The address SHOULD be chosen according to the rules that would be used > > to select the source address for any other packet originated by the > > node, given the destination address of the packet." > > > > The door is left open for other address selection algorithms but, IMHO, > > changing the kernel's behaviour is better justified by real use cases > > than by obsolete RFCs. > > I agree, sorry for the obsolete RFC. This is actually motivated by a > real use case: Say a Linux box is acting as a router that forwards > packets with policy routing from two local networks to two uplinks, > respectively. An outside host is performing a traceroute to a host on > one of the LANs. If the kernel's default route is via the other LAN's > uplink, it will send ICMPv6 packets with a source address that has > nothing to do with the network in question, yet the message probably > will reach the outside host. > > Here, using the address of the inbound or exiting interface as the source > address is evidently "a more informative choice." I surmise this is the reason > why the comment reads "Force OUTPUT device used as source address" when > dealing with hop limit exceeded packets in ip6_forward(), although not > effectively so. The current behaviour not only confuses diagnosis, but > also might be undesirable if the addresses of the networks are best kept > secret from each other. > That makes more sense indeed. 
Would be nice to have this use case in the commit message rather than the blind reference to the obsolete RFC. Regards, Guillaume
[PATCH net-next 1/1] rxrpc: Don't expose skbs to in-kernel users [ver #2]
Don't expose skbs to in-kernel users, such as the AFS filesystem, but instead provide a notification hook that indicates that a call needs attention and another that indicates that there's a new call to be collected. This makes the following possibilities more achievable: (1) Call refcounting can be made simpler if skbs don't hold refs to calls. (2) skbs referring to non-data events will be able to be freed much sooner rather than being queued for AFS to pick up, as rxrpc_kernel_recv_data will be able to consult the call state. (3) We can shortcut the receive phase when a call is remotely aborted because we don't have to go through all the packets to get to the one cancelling the operation. (4) It makes it easier to do encryption/decryption directly between AFS's buffers and sk_buffs. (5) Encryption/decryption can more easily be done in AFS's thread contexts - usually that of the userspace process that issued a syscall - rather than in one of rxrpc's background threads on a workqueue. (6) AFS will be able to wait synchronously on a call inside AF_RXRPC. To make this work, the following interface function has been added: int rxrpc_kernel_recv_data( struct socket *sock, struct rxrpc_call *call, void *buffer, size_t bufsize, size_t *_offset, bool want_more, u32 *_abort_code); This is the recvmsg equivalent. It allows the caller to find out about the state of a specific call and to transfer received data into a buffer piecemeal. afs_extract_data() and rxrpc_kernel_recv_data() now do all the extraction logic between them. They don't wait synchronously yet because the socket lock needs to be dealt with. Five interface functions have been removed: rxrpc_kernel_is_data_last() rxrpc_kernel_get_abort_code() rxrpc_kernel_get_error_number() rxrpc_kernel_free_skb() rxrpc_kernel_data_consumed() As a temporary hack, sk_buffs going to an in-kernel call are queued on the rxrpc_call struct (->knlrecv_queue) rather than being handed over to the in-kernel user. 
To process the queue internally, a temporary function, temp_deliver_data() has been added. This will be replaced with common code between the rxrpc_recvmsg() path and the kernel_rxrpc_recv_data() path in a future patch. Signed-off-by: David Howells--- Documentation/networking/rxrpc.txt | 72 +++--- fs/afs/cmservice.c | 142 ++-- fs/afs/fsclient.c | 148 +--- fs/afs/internal.h | 33 +-- fs/afs/rxrpc.c | 439 +--- fs/afs/vlclient.c |7 - include/net/af_rxrpc.h | 35 +-- net/rxrpc/af_rxrpc.c | 29 +- net/rxrpc/ar-internal.h| 23 ++ net/rxrpc/call_accept.c| 13 + net/rxrpc/call_object.c|5 net/rxrpc/conn_event.c |1 net/rxrpc/input.c | 10 + net/rxrpc/output.c |2 net/rxrpc/recvmsg.c| 191 +--- net/rxrpc/skbuff.c |1 16 files changed, 565 insertions(+), 586 deletions(-) diff --git a/Documentation/networking/rxrpc.txt b/Documentation/networking/rxrpc.txt index cfc8cb91452f..1b63bbc6b94f 100644 --- a/Documentation/networking/rxrpc.txt +++ b/Documentation/networking/rxrpc.txt @@ -748,6 +748,37 @@ The kernel interface functions are as follows: The msg must not specify a destination address, control data or any flags other than MSG_MORE. len is the total amount of data to transmit. + (*) Receive data from a call. + + int rxrpc_kernel_recv_data(struct socket *sock, + struct rxrpc_call *call, + void *buf, + size_t size, + size_t *_offset, + bool want_more, + u32 *_abort) + + This is used to receive data from either the reply part of a client call + or the request part of a service call. buf and size specify how much + data is desired and where to store it. *_offset is added on to buf and + subtracted from size internally; the amount copied into the buffer is + added to *_offset before returning. + + want_more should be true if further data will be required after this is + satisfied and false if this is the last item of the receive phase. 
+ + There are three normal returns: 0 if the buffer was filled and want_more + was true; 1 if the buffer was filled, the last DATA packet has been + emptied and want_more was false; and -EAGAIN if the function needs to be + called again. + + If the last DATA packet is processed but the buffer contains less than + the amount requested, EBADMSG is returned. If
[PATCH net-next 0/1] rxrpc: Remove use of skbs from AFS [ver #2]
Here's a single patch that removes the use of sk_buffs from fs/afs. From this point on they'll be entirely retained within net/rxrpc and AFS just asks AF_RXRPC for linear buffers of data. This needs to be applied on top of the just-posted preparatory patch set. This makes some future developments easier/possible: (1) Simpler rxrpc_call usage counting. (2) Earlier freeing of metadata sk_buffs. (3) Rx phase shortcutting on abort/error. (4) Encryption/decryption in the AFS fs contexts/threads and directly between sk_buffs and AFS buffers. (5) Synchronous waiting in reception for AFS. Changes: (V2) Fixed afs_transfer_reply() whereby call->offset was incorrectly being added to the buffer pointer (it doesn't matter as long as the reply fits entirely inside a single packet). Removed an unused goto-label and an unused variable. The patch can be found here also: http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=rxrpc-rewrite Tagged thusly: git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git rxrpc-rewrite-20160830-2v2 David --- David Howells (1): rxrpc: Don't expose skbs to in-kernel users Documentation/networking/rxrpc.txt | 72 +++--- fs/afs/cmservice.c | 142 ++-- fs/afs/fsclient.c | 148 +--- fs/afs/internal.h | 33 +-- fs/afs/rxrpc.c | 439 +--- fs/afs/vlclient.c |7 - include/net/af_rxrpc.h | 35 +-- net/rxrpc/af_rxrpc.c | 29 +- net/rxrpc/ar-internal.h| 23 ++ net/rxrpc/call_accept.c| 13 + net/rxrpc/call_object.c|5 net/rxrpc/conn_event.c |1 net/rxrpc/input.c | 10 + net/rxrpc/output.c |2 net/rxrpc/recvmsg.c| 191 +--- net/rxrpc/skbuff.c |1 16 files changed, 565 insertions(+), 586 deletions(-)
Re: [PATCH v3 4/5] net: stmmac: add a glue driver for the Amlogic Meson 8b / GXBB DWMAC
On 08/28, Martin Blumenstingl wrote: > +static int meson8b_init_clk(struct meson8b_dwmac *dwmac) > +{ > + struct clk_init_data init; > + int i, ret; > + struct device *dev = &dwmac->pdev->dev; > + char clk_name[32]; > + const char *clk_div_parents[1]; > + const char *mux_parent_names[MUX_CLK_NUM_PARENTS]; > + static struct clk_div_table clk_25m_div_table[] = { > + { .val = 0, .div = 5 }, > + { .val = 1, .div = 10 }, > + { /* sentinel */ }, > + }; > + > + /* get the mux parents from DT */ > + for (i = 0; i < MUX_CLK_NUM_PARENTS; i++) { > + char name[16]; > + > + snprintf(name, sizeof(name), "clkin%d", i); > + dwmac->m250_mux_parent[i] = devm_clk_get(dev, name); > + if (IS_ERR(dwmac->m250_mux_parent[i])) { > + ret = PTR_ERR(dwmac->m250_mux_parent[i]); > + if (ret != -EPROBE_DEFER) > + dev_err(dev, "Missing clock %s\n", name); > + return ret; > + } > + > + mux_parent_names[i] = > + __clk_get_name(dwmac->m250_mux_parent[i]); > + } > + > + /* create the m250_mux */ > + snprintf(clk_name, sizeof(clk_name), "%s#m250_sel", dev_name(dev)); > + init.name = clk_name; > + init.ops = &clk_mux_ops; > + init.flags = CLK_IS_BASIC; Please don't use this flag unless you need it. > + init.parent_names = mux_parent_names; > + init.num_parents = MUX_CLK_NUM_PARENTS; > + > + dwmac->m250_mux.reg = dwmac->regs + PRG_ETH0; > + dwmac->m250_mux.shift = PRG_ETH0_CLK_M250_SEL_SHIFT; > + dwmac->m250_mux.mask = PRG_ETH0_CLK_M250_SEL_MASK; > + dwmac->m250_mux.flags = 0; > + dwmac->m250_mux.table = NULL; > + dwmac->m250_mux.hw.init = &init; > + > + dwmac->m250_mux_clk = devm_clk_register(dev, &dwmac->m250_mux.hw); > + if (WARN_ON(PTR_ERR_OR_ZERO(dwmac->m250_mux_clk))) Why not if (WARN_ON(IS_ERR()))? The OR_ZERO part seems confusing. 
> + return PTR_ERR(dwmac->m250_mux_clk); > + > + /* create the m250_div */ > + snprintf(clk_name, sizeof(clk_name), "%s#m250_div", dev_name(dev)); > + init.name = devm_kstrdup(dev, clk_name, GFP_KERNEL); > + init.ops = &clk_divider_ops; > + init.flags = CLK_IS_BASIC | CLK_SET_RATE_PARENT; > + clk_div_parents[0] = __clk_get_name(dwmac->m250_mux_clk); > + init.parent_names = clk_div_parents; > + init.num_parents = ARRAY_SIZE(clk_div_parents); > + > + dwmac->m250_div.reg = dwmac->regs + PRG_ETH0; > + dwmac->m250_div.shift = PRG_ETH0_CLK_M250_DIV_SHIFT; > + dwmac->m250_div.width = PRG_ETH0_CLK_M250_DIV_WIDTH; > + dwmac->m250_div.hw.init = &init; > + dwmac->m250_div.flags = CLK_DIVIDER_ONE_BASED | CLK_DIVIDER_ALLOW_ZERO; > + > + dwmac->m250_div_clk = devm_clk_register(dev, &dwmac->m250_div.hw); We've been trying to move away from devm_clk_register() to devm_clk_hw_register() so that clk providers aren't also clk consumers. Obviously in this case this driver is a provider and a consumer, so this isn't as important. Kevin did something similar in the mmc driver, so I'll reiterate what I said on that patch. Perhaps we should make __clk_create_clk() into a real clk provider API so that we can use devm_clk_hw_register() here and then generate a clk for this device. That would allow us to have proper consumer tracking without relying on the clk that is returned from clk_register() (the intent is to make that clk instance internal to the framework). 
> + if (WARN_ON(PTR_ERR_OR_ZERO(dwmac->m250_div_clk))) > + return PTR_ERR(dwmac->m250_div_clk); > + > + /* create the m25_div */ > + snprintf(clk_name, sizeof(clk_name), "%s#m25_div", dev_name(dev)); > + init.name = devm_kstrdup(dev, clk_name, GFP_KERNEL); > + init.ops = &clk_divider_ops; > + init.flags = CLK_IS_BASIC | CLK_SET_RATE_PARENT; > + clk_div_parents[0] = __clk_get_name(dwmac->m250_div_clk); > + init.parent_names = clk_div_parents; > + init.num_parents = ARRAY_SIZE(clk_div_parents); > + > + dwmac->m25_div.reg = dwmac->regs + PRG_ETH0; > + dwmac->m25_div.shift = PRG_ETH0_CLK_M25_DIV_SHIFT; > + dwmac->m25_div.width = PRG_ETH0_CLK_M25_DIV_WIDTH; > + dwmac->m25_div.table = clk_25m_div_table; > + dwmac->m25_div.hw.init = &init; > + dwmac->m25_div.flags = CLK_DIVIDER_ALLOW_ZERO; > + > + dwmac->m25_div_clk = devm_clk_register(dev, &dwmac->m25_div.hw); > + if (WARN_ON(PTR_ERR_OR_ZERO(dwmac->m25_div_clk))) > + return PTR_ERR(dwmac->m25_div_clk); > + > + return 0; This could be return WARN_ON(PTR_ERR_OR_ZERO(...)) > + > +static int meson8b_dwmac_probe(struct platform_device *pdev) > +{ > + struct plat_stmmacenet_data *plat_dat; > + struct stmmac_resources stmmac_res; > + struct resource *res; > + struct meson8b_dwmac *dwmac; > + int ret; > + > + ret = stmmac_get_platform_resources(pdev, &stmmac_res); > + if (ret) > +
Re: [RFCv2 07/16] bpf: enable non-core use of the verifier
On 08/30/2016 12:48 PM, Jakub Kicinski wrote: On Mon, 29 Aug 2016 22:17:10 +0200, Daniel Borkmann wrote: On 08/29/2016 10:13 PM, Daniel Borkmann wrote: On 08/27/2016 07:32 PM, Alexei Starovoitov wrote: On Sat, Aug 27, 2016 at 12:40:04PM +0100, Jakub Kicinski wrote: probably array_of_insn_aux_data[num_insns] should do it. Unlike reg_state that is forked on branches, this array is only one. This would be for struct nfp_insn_meta, right? So, struct bpf_ext_parser_ops could become: static const struct bpf_ext_parser_ops nfp_bpf_pops = { .insn_hook = nfp_verify_insn, .insn_size = sizeof(struct nfp_insn_meta), }; ... where bpf_parse() would prealloc that f.e. in env->insn_meta[]. Hm.. this is tempting, I will have to store the pointer type in nfp_insn_meta soon, anyway. (Well, actually everything can live in env->private_data.) We are discussing changing the place verifier keep its pointer type annotation, I don't think we could put that in the private_data. Agree, was also my concern when I read patch 5 and 6. It would not only be related to types, but also different imm values, where the memcmp() could fail on. Potentially the latter can be avoided by only checking types which should be sufficient. Hmm, maybe only bpf_parse() should go through this stricter mode since only relevant for drivers (otoh downside would be that bugs would end up less likely to be found). I don't want only checking types because it would defeat my exit code validation :) I was thinking about doing a lazy evaluation - registering branches to explored_states with UNKNOWN and only upgrading to CONST when someone actually needed the imm value. I'm not sure the complexity would be justified, though. Having two modes seems more straight forward and I think we would only need to pay attention in the LD_IMM64 case, I don't think I've seen LLVM generating XORs, it's just the cBPF -> eBPF conversion. 
Okay, though, I think that the cBPF to eBPF migration wouldn't even pass through the bpf_parse() handling, since verifier is not aware on some of their aspects such as emitting calls directly (w/o *proto) or arg mappings. Probably make sense to reject these (bpf_prog_was_classic()) if they cannot be handled anyway? I see. Indeed then you'd need the verifier to walk all paths to make sure constant return values. I think this would still not cover the cases where you'd fetch a return value/verdict from a map, but this should be ignored/ rejected for now, also since majority of programs are not written in such a way. If you only need yes/no check then such info can probably be collected unconditionally during initial program load. Like prog->cb_access flag. One other comment wrt the header, when you move these things there, would be good to prefix with bpf_* so that this doesn't clash in future with other header files. Good point!
[PATCH v2] cfg80211: Remove deprecated create_singlethread_workqueue
The workqueue "cfg80211_wq" is involved in cleanup, scan and event related works. It queues multiple work items &rdev->event_work, &rdev->dfs_update_channels_wk, &wiphy_to_rdev(request->wiphy)->scan_done_wk, &wiphy_to_rdev(wiphy)->sched_scan_results_wk, which require strict execution ordering. Hence, an ordered dedicated workqueue has been used. Since it's a wireless driver, WQ_MEM_RECLAIM has been set to ensure forward progress under memory pressure. Signed-off-by: Bhaktipriya Shridhar --- net/wireless/core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/wireless/core.c b/net/wireless/core.c index d25c82b..2cd4563 100644 --- a/net/wireless/core.c +++ b/net/wireless/core.c @@ -1218,7 +1218,7 @@ static int __init cfg80211_init(void) if (err) goto out_fail_reg; - cfg80211_wq = create_singlethread_workqueue("cfg80211"); + cfg80211_wq = alloc_ordered_workqueue("cfg80211", WQ_MEM_RECLAIM); if (!cfg80211_wq) { err = -ENOMEM; goto out_fail_wq; -- 2.1.4
Re: [PATCH V2] dt: net: enhance DWC EQoS binding to support Tegra186
On Wed, Aug 24, 2016 at 03:20:46PM -0600, Stephen Warren wrote: > From: Stephen Warren> > The Synopsys DWC EQoS is a configurable IP block which supports multiple > options for bus type, clocking and reset structure, and feature list. > Extend the DT binding to define a "compatible value" for the configuration > contained in NVIDIA's Tegra186 SoC, and define some new properties and > list property entries required by that configuration. > > Signed-off-by: Stephen Warren > --- > v2: > * Add an explicit compatible value for the Axis SoC's version of the EQOS > IP; this allows the driver to handle any SoC-specific integration quirks > that are required, rather than only knowing about the IP block in > isolation. This is good general DT practice. The existing value is still > documented to support existing DTs. > * Reworked the list of clocks the binding requires: > - Combined "tx" and "phy_ref_clk"; for GMII/RGMII configurations, these > are the same thing. > - Added extra description to the "rx" and "tx" clocks, to make it clear > exactly which HW clock they represent. > - Made the new "tx" and "slave_bus" names more prominent than the > original "phy_ref_clk" and "apb_pclk". The new names are more generic > and should work for any enhanced version of the binding (e.g. to > support additional PHY types). New compatible values will hopefully > choose to require the new names. > * Added a couple extra clocks to the list that may need to be supported in > future binding revisions. > * Fixed a typo; "clocks" -> "resets". 
> ---
>  .../bindings/net/snps,dwc-qos-ethernet.txt | 75 --
>  1 file changed, 71 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/devicetree/bindings/net/snps,dwc-qos-ethernet.txt b/Documentation/devicetree/bindings/net/snps,dwc-qos-ethernet.txt
> index 51f8d2eba8d8..1d028259824a 100644
> --- a/Documentation/devicetree/bindings/net/snps,dwc-qos-ethernet.txt
> +++ b/Documentation/devicetree/bindings/net/snps,dwc-qos-ethernet.txt
> @@ -1,21 +1,87 @@
>  * Synopsys DWC Ethernet QoS IP version 4.10 driver (GMAC)
>
> +This binding supports the Synopsys Designware Ethernet QoS (Quality Of Service)
> +IP block. The IP supports multiple options for bus type, clocking and reset
> +structure, and feature list. Consequently, a number of properties and list
> +entries in properties are marked as optional, or only required in specific HW
> +configurations.
>
>  Required properties:
> -- compatible: Should be "snps,dwc-qos-ethernet-4.10"
> +- compatible: One of:
> +  - "axis,artpec6-eqos", "snps,dwc-qos-ethernet-4.10"
> +    Represents the IP core when integrated into the Axis ARTPEC-6 SoC.
> +  - "nvidia,tegra186-eqos", "snps,dwc-qos-ethernet-4.10"
> +    Represents the IP core when integrated into the NVIDIA Tegra186 SoC.
> +  - "snps,dwc-qos-ethernet-4.10"
> +    This combination is deprecated. It should be treated as equivalent to
> +    "axis,artpec6-eqos", "snps,dwc-qos-ethernet-4.10". It is supported to be
> +    compatible with earlier revisions of this binding.
>  - reg: Address and length of the register set for the device
> -- clocks: Phandles to the reference clock and the bus clock
> -- clock-names: Should be "phy_ref_clk" for the reference clock and "apb_pclk"
> -  for the bus clock.
> +- clocks: Phandle and clock specifiers for each entry in clock-names, in the
> +  same order. See ../clock/clock-bindings.txt.
> +- clock-names: May contain any/all of the following depending on the IP
> +  configuration, in any order:

No, they should be in a defined order.

> +  - "tx"
> +    (Alternate name "phy_ref_clk"; only one alternate must appear.)

Obviously, the prior clocks were just made up for what someone needed at the time and did not read the spec. I think it would be better to just separate the old names and state they are deprecated and which compatibles they are for.

> +    The EQOS transmit path clock. The HW signal name is clk_tx_i.
> +    In some configurations (e.g. GMII/RGMII), this clock also drives the PHY TX
> +    path. In other configurations, other clocks (such as tx_125, rmii) may
> +    drive the PHY TX path.
> +  - "rx"
> +    The EQOS receive path clock. The HW signal name is clk_rx_i.
> +    In some configurations (e.g. GMII/RGMII), this clock also drives the PHY RX
> +    path. In other configurations, other clocks (such as rx_125, pmarx_0,
> +    pmarx_1, rmii) may drive the PHY RX path.
> +  - "slave_bus"
> +    (Alternate name "apb_pclk"; only one alternate must appear.)
> +    The CPU/slave-bus (CSR) interface clock. Despite the name, this applies to
> +    any bus type; APB, AHB, AXI, etc. The HW signal name is hclk_i (AHB) or
> +    clk_csr_i (other buses).

Sounds like 2 clocks here.

> +  - "master_bus"
> +    The master bus interface clock. Only required in configurations that use a
> +    separate clock for the
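To make the proposed clock-names concrete, a hypothetical node under the new binding might look like the sketch below. The unit address, reg values, and clock phandles are invented for illustration; only the compatible string and clock names come from the binding text above.

```dts
ethernet@2490000 {
	compatible = "nvidia,tegra186-eqos", "snps,dwc-qos-ethernet-4.10";
	reg = <0x02490000 0x10000>;
	/* entries must match clock-names order */
	clocks = <&car 10>, <&car 11>, <&car 12>;
	clock-names = "slave_bus", "rx", "tx";
};
```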
Re: [RFC v2 06/10] landlock: Add LSM hooks
On Aug 25, 2016 12:34 PM, "Mickaël Salaün" wrote:
> Add LSM hooks which can be used by userland through Landlock (eBPF)
> programs. These programs are limited to a whitelist of functions (cf.
> next commit). The eBPF program context is depicted by the struct
> landlock_data (cf. include/uapi/linux/bpf.h):
> * hook: LSM hook ID (useful when using the same program for multiple LSM
>   hooks);
> * cookie: the 16-bit value from the seccomp filter that triggered this
>   Landlock program;
> * args[6]: array of LSM hook arguments.
>
> The LSM hook arguments can contain raw values as integers or
> (unleakable) pointers. The only way to use the pointers is to pass them
> to an eBPF function according to their types (e.g. the
> bpf_landlock_cmp_fs_beneath_with_struct_file function can use a struct
> file pointer).
>
> For now, there are three hooks for file system access control:
> * file_open;
> * file_permission;
> * mmap_file.

What's the purpose of exposing struct cred * to userspace? It's primarily just an optimization to save a bit of RAM, and it's a dubious optimization at that. What are you using it for? Would it make more sense to use struct task_struct * or struct pid * instead?

Also, exposing struct cred * has a really weird side-effect: it allows (maybe even encourages) checking for pointer equality between two struct cred * objects. Doing so will have erratic results.
Re: [RFC v2 09/10] landlock: Handle cgroups (performance)
On Sun, Aug 28, 2016 at 2:42 AM, Mickaël Salaün wrote:
>
> On 28/08/2016 10:13, Andy Lutomirski wrote:
>> On Aug 27, 2016 11:14 PM, "Mickaël Salaün" wrote:
>>>
>>> On 27/08/2016 22:43, Alexei Starovoitov wrote:
>>>> On Sat, Aug 27, 2016 at 09:35:14PM +0200, Mickaël Salaün wrote:
>>>>> On 27/08/2016 20:06, Alexei Starovoitov wrote:
>>>>>> On Sat, Aug 27, 2016 at 04:06:38PM +0200, Mickaël Salaün wrote:
>>>>>>> As said above, Landlock will not run an eBPF program when not strictly
>>>>>>> needed. Attaching to a cgroup will have the same performance impact as
>>>>>>> attaching to a process hierarchy.
>>>>>>
>>>>>> Having a prog per cgroup per lsm_hook is the only scalable way I
>>>>>> could come up with. If you see another way, please propose.
>>>>>> current->seccomp.landlock_prog is not the answer.
>>>>>
>>>>> Hum, I don't see the difference from a performance point of view between
>>>>> a cgroup-based or a process hierarchy-based system.
>>>>>
>>>>> Maybe a better option would be to use an array of pointers with N
>>>>> entries, one for each supported hook, instead of a unique pointer list?
>>>>
>>>> yes, clearly array dereference is faster than link list walk.
>>>> Now the question is where to keep this prog_array[num_lsm_hooks]?
>>>> Since we cannot keep it inside task_struct, we have to allocate it
>>>> every time the task is created then. What to do on the fork? That will
>>>> require changes all over. Then the obvious optimization would be to
>>>> share this allocated array of prog pointers across multiple tasks...
>>>> and little by little this new facility will look like cgroup.
>>>> Hence the suggestion to put this array into cgroup from the start.
>>>
>>> I see your point :)
>>>
>>>>> Anyway, being able to attach an LSM hook program to a cgroup thanks to
>>>>> the new BPF_PROG_ATTACH seems a good idea (while keeping the possibility
>>>>> to use a process hierarchy). The downside will be to handle an LSM hook
>>>>> program which is not triggered by a seccomp-filter, but this should be
>>>>> needed anyway to handle interruptions.
>>>>
>>>> what do you mean 'not triggered by seccomp'?
>>>> You're not suggesting that this lsm has to enable seccomp to be
>>>> functional? imo that's non starter due to overhead.
>>>
>>> Yes, for now, it is triggered by a new seccomp filter return value
>>> RET_LANDLOCK, which can take a 16-bit value called cookie. This must not
>>> be needed but could be useful to bind a seccomp filter security policy
>>> with a Landlock one. Waiting for Kees's point of view…
>>
>> I'm not Kees, but I'd be okay with that. I still think that doing
>> this by process hierarchy a la seccomp will be easier to use and to
>> understand (which is quite important for this kind of work) than doing
>> it by cgroup.
>>
>> A feature I've wanted to add for a while is to have an fd that
>> represents a seccomp layer, the idea being that you would set up your
>> seccomp layer (with syscall filter, landlock hooks, etc.) and then you
>> would have a syscall to install that layer. Then an unprivileged
>> sandbox manager could set up its layer and still be able to inject new
>> processes into it later on, no cgroups needed.
>
> A nice thing I didn't highlight about Landlock is that a process can
> prepare a layer of rules (arraymap of handles + Landlock programs) and
> pass the file descriptors of the Landlock programs to another process.
> That process could then apply these programs to get sandboxed. However,
> for now, because a Landlock program is only triggered by a seccomp
> filter (which does not follow the Landlock programs as an FD), they
> will be useless.
>
> The FD referring to an arraymap of handles can also be used to update a
> map and change the behavior of a Landlock program. A master process can
> then add or remove restrictions to another process hierarchy on the fly.

Maybe this could be extended a little bit. The fd could hold the seccomp filter *and* the LSM hook filters. FMODE_EXECUTE could give the ability to install it and FMODE_WRITE could give the ability to modify it.
Re: [PATCH v3 0/5] meson: Meson8b and GXBB DWMAC glue driver
On Mon, Aug 29, 2016 at 5:40 AM, David Miller wrote:
> From: Martin Blumenstingl
> Date: Sun, 28 Aug 2016 18:16:32 +0200
>
>> This adds a DWMAC glue driver for the PRG_ETHERNET registers found in
>> Meson8b and GXBB SoCs. Based on the "old" meson6b-dwmac glue driver,
>> the register layout is completely different; thus I introduced a
>> separate driver.
>>
>> Changes since v2:
>> - fixed unloading the glue driver when built as module. This pulls in a
>>   patch from Joachim Eastwood (thanks) to get our private data structure
>>   (bsp_priv).
>
> This doesn't apply cleanly at all to the net-next tree, so I have
> no idea where you expect these changes to be applied.

OK, maybe Kevin can help me out here, as I think the patches should go to various trees. I think patches 1, 3 and 4 should go through the net-next tree (as these touch drivers/net/ethernet/stmicro/stmmac/ and the corresponding documentation). Patch 2 should probably go through clk-meson-gxbb / clk-next (just like the other clk changes we had). The last patch (patch 5) should probably go through the ARM SoC tree (just like the other dts changes we had).

@David, Kevin: would this be fine for you?
[net-next] ixgbe: Eliminate useless message and improve logic
From: Mark Rustad

Remove a useless log message and improve the logic for setting
a PHY address from the contents of the MNG_IF_SEL register.

Signed-off-by: Mark Rustad
Tested-by: Andrew Bowers
Signed-off-by: Jeff Kirsher
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c | 16 +---
 1 file changed, 5 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c
index fb1b819..e092a89 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c
@@ -2394,18 +2394,12 @@ static void ixgbe_read_mng_if_sel_x550em(struct ixgbe_hw *hw)
 	/* If X552 (X550EM_a) and MDIO is connected to external PHY, then set
 	 * PHY address. This register field has only been used for X552.
 	 */
-	if (!hw->phy.nw_mng_if_sel) {
-		if (hw->mac.type == ixgbe_mac_x550em_a) {
-			struct ixgbe_adapter *adapter = hw->back;
-
-			e_warn(drv, "nw_mng_if_sel not set\n");
-		}
-		return;
+	if (hw->mac.type == ixgbe_mac_x550em_a &&
+	    hw->phy.nw_mng_if_sel & IXGBE_NW_MNG_IF_SEL_MDIO_ACT) {
+		hw->phy.mdio.prtad = (hw->phy.nw_mng_if_sel &
+				      IXGBE_NW_MNG_IF_SEL_MDIO_PHY_ADD) >>
+				     IXGBE_NW_MNG_IF_SEL_MDIO_PHY_ADD_SHIFT;
 	}
-
-	hw->phy.mdio.prtad = (hw->phy.nw_mng_if_sel &
-			      IXGBE_NW_MNG_IF_SEL_MDIO_PHY_ADD) >>
-			     IXGBE_NW_MNG_IF_SEL_MDIO_PHY_ADD_SHIFT;
 }

 /** ixgbe_init_phy_ops_X550em - PHY/SFP specific init
--
2.7.4
Re: [PATCH net] tg3: Fix for disallow tx coalescing time to be 0
Hello.

On 08/30/2016 05:38 PM, Ivan Vecera wrote:

> The recent commit 087d7a8c disallows to set Rx coalescing time to be 0

You should specify both 12-digit SHA1 and the commit summary enclosed in ("").

> as this stops generating interrupts for the incoming packets. I found
> the zero Tx coalescing time stops generating interrupts similarly for
> outgoing packets and fires Tx watchdog later. To avoid this, don't
> allow to set Tx coalescing time to 0.
>
> Cc: satish.baddipad...@broadcom.com
> Cc: siva.kal...@broadcom.com
> Cc: michael.c...@broadcom.com
> Signed-off-by: Ivan Vecera
[...]

MBR, Sergei
Re: [PATCH net] l2tp: fix use-after-free during module unload
Hello.

On 08/30/2016 05:05 PM, Sabrina Dubroca wrote:

> Tunnel deletion is delayed by both a workqueue (l2tp_tunnel_delete ->
> wq -> l2tp_tunnel_del_work) and RCU (sk_destruct -> RCU ->
> l2tp_tunnel_destruct).
>
> By the time l2tp_tunnel_destruct() runs to destroy the tunnel and
> finish destroying the socket, the private data reserved via the
> net_generic mechanism has already been freed, but
> l2tp_tunnel_destruct() actually uses this data.
>
> Make sure tunnel deletion for the netns has completed before returning
> from l2tp_net_exit() by first flushing the tunnel removal workqueue, and

The patch tells me the function is named l2tp_exit_net(). :-)

> then waiting for RCU callbacks to complete.
>
> Fixes: 167eb17e0b17 ("l2tp: create tunnel sockets in the right namespace")
> Signed-off-by: Sabrina Dubroca
> ---
>  net/l2tp/l2tp_core.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/net/l2tp/l2tp_core.c b/net/l2tp/l2tp_core.c
> index 1e40dacaa137..a2ed3bda4ddc 100644
> --- a/net/l2tp/l2tp_core.c
> +++ b/net/l2tp/l2tp_core.c
> @@ -1855,6 +1855,9 @@ static __net_exit void l2tp_exit_net(struct net *net)
>  		(void)l2tp_tunnel_delete(tunnel);
>  	}
>  	rcu_read_unlock_bh();
> +
> +	flush_workqueue(l2tp_wq);
> +	rcu_barrier();
>  }
>
>  static struct pernet_operations l2tp_net_ops = {

MBR, Sergei
[PATCH net-next 03/12] net: l3mdev: Allow the l3mdev to be a loopback
Allow an L3 master device to act as the loopback for that L3 domain.
For IPv4 the device can also have the address 127.0.0.1.

Signed-off-by: David Ahern
---
 include/net/l3mdev.h | 6 +++---
 net/ipv4/route.c     | 8 ++--
 net/ipv6/route.c     | 12 ++--
 3 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/include/net/l3mdev.h b/include/net/l3mdev.h
index 74ffe5aef299..5f03a89bb075 100644
--- a/include/net/l3mdev.h
+++ b/include/net/l3mdev.h
@@ -90,7 +90,7 @@ static inline int l3mdev_master_ifindex_by_index(struct net *net, int ifindex)
 }

 static inline
-const struct net_device *l3mdev_master_dev_rcu(const struct net_device *_dev)
+struct net_device *l3mdev_master_dev_rcu(const struct net_device *_dev)
 {
 	/* netdev_master_upper_dev_get_rcu calls
 	 * list_first_or_null_rcu to walk the upper dev list.
@@ -99,7 +99,7 @@ const struct net_device *l3mdev_master_dev_rcu(const struct net_device *_dev)
 	 * typecast to remove the const
 	 */
 	struct net_device *dev = (struct net_device *)_dev;
-	const struct net_device *master;
+	struct net_device *master;

 	if (!dev)
 		return NULL;
@@ -253,7 +253,7 @@ static inline int l3mdev_master_ifindex_by_index(struct net *net, int ifindex)
 }

 static inline
-const struct net_device *l3mdev_master_dev_rcu(const struct net_device *dev)
+struct net_device *l3mdev_master_dev_rcu(const struct net_device *dev)
 {
 	return NULL;
 }
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index a1f2830d8110..1119f18fb720 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2016,7 +2016,9 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 		return ERR_PTR(-EINVAL);

 	if (likely(!IN_DEV_ROUTE_LOCALNET(in_dev)))
-		if (ipv4_is_loopback(fl4->saddr) && !(dev_out->flags & IFF_LOOPBACK))
+		if (ipv4_is_loopback(fl4->saddr) &&
+		    !(dev_out->flags & IFF_LOOPBACK) &&
+		    !netif_is_l3_master(dev_out))
 			return ERR_PTR(-EINVAL);

 	if (ipv4_is_lbcast(fl4->daddr))
@@ -2300,7 +2302,9 @@ struct rtable *__ip_route_output_key_hash(struct net *net, struct flowi4 *fl4,
 			else
 				fl4->saddr = fl4->daddr;
 		}
-		dev_out = net->loopback_dev;
+
+		/* L3 master device is the loopback for that domain */
+		dev_out = l3mdev_master_dev_rcu(dev_out) ? : net->loopback_dev;
 		fl4->flowi4_oif = dev_out->ifindex;
 		flags |= RTCF_LOCAL;
 		goto make_route;
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 49817555449e..4a0f77aa49cf 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2556,8 +2556,16 @@ struct rt6_info *addrconf_dst_alloc(struct inet6_dev *idev,
 {
 	u32 tb_id;
 	struct net *net = dev_net(idev->dev);
-	struct rt6_info *rt = ip6_dst_alloc(net, net->loopback_dev,
-					    DST_NOCOUNT);
+	struct net_device *dev = net->loopback_dev;
+	struct rt6_info *rt;
+
+	/* use L3 Master device as loopback for host routes if device
+	 * is enslaved and address is not link local or multicast
+	 */
+	if (!rt6_need_strict(addr))
+		dev = l3mdev_master_dev_rcu(idev->dev) ? : dev;
+
+	rt = ip6_dst_alloc(net, dev, DST_NOCOUNT);

 	if (!rt)
 		return ERR_PTR(-ENOMEM);
--
2.1.4
[PATCH net-next 00/12] net: Convert vrf from dst to tx hook
The motivation for this series is that ICMP Unreachable - Fragmentation Needed packets are not handled properly for VRFs. Specifically, the FIB lookup in __ip_rt_update_pmtu fails, so no nexthop exception is created with the reduced MTU. As a result connections stall if packets larger than the smallest MTU in the path are generated.

While investigating that problem I also noticed that the MSS for all connections in a VRF is based on the VRF device's MTU and not the interface the packets ultimately go through.

VRF currently uses a dst to direct packets to the device. The first FIB lookup returns this dst and then the lookup in the VRF driver gets the actual output route. A side effect of this design is that the VRF dst is cached on sockets and then used for calculations like the MSS.

This series fixes the problem by removing the output dst that points to the VRF and always doing the actual FIB lookup. This allows the real dst to be cached on sockets and used for MSS. Packets are diverted to the VRF device on Tx using an l3mdev hook in the output path, similar to what is done for Rx.

The end result is a much smaller and faster implementation for VRFs with fewer intrusions into the network stack, less code duplication in the VRF driver (output processing and FIB lookups) and symmetrical packet handling for Rx and Tx paths. The l3mdev and vrf hooks are more tightly focused on the primary goal of controlling the table used for lookups and a secondary goal of providing device based features for VRF such as packet socket hooks for tcpdump and netfilter hooks.

Comparison of netperf performance for a build without l3mdev (best case performance), the old vrf driver and the VRF driver from this series. Data are collected using VMs with virtio + vhost. The netperf client runs in the VM and netserver runs in the host. 1-byte RR tests are done as these packets exaggerate the performance hit due to the extra lookups done for l3mdev and VRF.
Command: netperf -cC -H ${ip} -l 60 -t {TCP,UDP}_RR [-J red]

                TCP_RR           UDP_RR
             IPv4    IPv6     IPv4    IPv6
  no l3mdev  30105   31101    32436   26297
  vrf old    27223   28476    28912   26122
  vrf new    29001   30630    31024   26351

  * Transactions per second as reported by netperf
  * netperf modified to take a bind-to-device argument -- the -J red option

About the series:

- patch 1 adds the flow update (changing oif or iif to L3 master device and
  setting the flag to skip the oif check) to ipv4 and ipv6 paths just before
  hitting the rules. This catches all code paths in a single spot.

- patch 2 adds the Tx hook to push the packet to the l3mdev if relevant

- patch 3 adds some checks so the vrf device can act as a vrf-local loopback.
  These paths were not hit before since the vrf dst was returned from the
  lookup.

- patches 4 and 5 flip the ipv4 and ipv6 stacks to the tx out hook

- patches 6-12 remove no longer needed l3mdev code

David Ahern (12):
  net: flow: Add l3mdev flow update
  net: l3mdev: Add hook to output path
  net: l3mdev: Allow the l3mdev to be a loopback
  net: vrf: Flip the IPv4 path from dst to tx out hook
  net: vrf: Flip the IPv6 path from dst to tx out hook
  net: remove redundant l3mdev calls
  net: l3mdev: Remove l3mdev_get_saddr
  net: ipv6: Remove l3mdev_get_saddr6
  net: l3mdev: Remove l3mdev_get_rtable
  net: l3mdev: Remove l3mdev_get_rt6_dst
  net: l3mdev: Remove l3mdev_fib_oif
  net: flow: Remove FLOWI_FLAG_L3MDEV_SRC flag

 drivers/net/vrf.c       | 545
 include/net/flow.h      |   3 +-
 include/net/l3mdev.h    | 132 +---
 include/net/route.h     |  10 -
 net/ipv4/fib_rules.c    |   3 +
 net/ipv4/ip_output.c    |  11 +-
 net/ipv4/raw.c          |   6 -
 net/ipv4/route.c        |  24 +--
 net/ipv4/udp.c          |   6 -
 net/ipv4/xfrm4_policy.c |   2 +-
 net/ipv6/fib6_rules.c   |   3 +
 net/ipv6/ip6_output.c   |  28 +--
 net/ipv6/ndisc.c        |  11 +-
 net/ipv6/output_core.c  |   7 +
 net/ipv6/raw.c          |   7 +
 net/ipv6/route.c        |  24 +--
 net/ipv6/tcp_ipv6.c     |   8 +-
 net/ipv6/xfrm6_policy.c |   2 +-
 net/l3mdev/l3mdev.c     | 122 ---
 19 files changed, 288 insertions(+), 666 deletions(-)

--
2.1.4
[PATCH net-next 06/12] net: remove redundant l3mdev calls
A previous patch added l3mdev flow update making these hooks redundant. Signed-off-by: David Ahern--- net/ipv4/ip_output.c| 3 +-- net/ipv4/route.c| 12 ++-- net/ipv4/xfrm4_policy.c | 2 +- net/ipv6/ip6_output.c | 2 -- net/ipv6/ndisc.c| 11 ++- net/ipv6/route.c| 7 +-- net/ipv6/tcp_ipv6.c | 8 ++-- net/ipv6/xfrm6_policy.c | 2 +- 8 files changed, 10 insertions(+), 37 deletions(-) diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index 3c727d4eaba9..75f8167615ba 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -1574,8 +1574,7 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb, } oif = arg->bound_dev_if; - if (!oif && netif_index_is_l3_master(net, skb->skb_iif)) - oif = skb->skb_iif; + oif = oif ? : skb->skb_iif; flowi4_init_output(, oif, IP4_REPLY_MARK(net, skb->mark), diff --git a/net/ipv4/route.c b/net/ipv4/route.c index d9936f90a755..ec994380d354 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -1829,7 +1829,7 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr, * Now we are ready to route packet. */ fl4.flowi4_oif = 0; - fl4.flowi4_iif = l3mdev_fib_oif_rcu(dev); + fl4.flowi4_iif = dev->ifindex; fl4.flowi4_mark = skb->mark; fl4.flowi4_tos = tos; fl4.flowi4_scope = RT_SCOPE_UNIVERSE; @@ -2148,7 +2148,6 @@ struct rtable *__ip_route_output_key_hash(struct net *net, struct flowi4 *fl4, unsigned int flags = 0; struct fib_result res; struct rtable *rth; - int master_idx; int orig_oif; int err = -ENETUNREACH; @@ -2158,9 +2157,6 @@ struct rtable *__ip_route_output_key_hash(struct net *net, struct flowi4 *fl4, orig_oif = fl4->flowi4_oif; - master_idx = l3mdev_master_ifindex_by_index(net, fl4->flowi4_oif); - if (master_idx) - fl4->flowi4_oif = master_idx; fl4->flowi4_iif = LOOPBACK_IFINDEX; fl4->flowi4_tos = tos & IPTOS_RT_MASK; fl4->flowi4_scope = ((tos & RTO_ONLINK) ? 
@@ -2261,8 +2257,7 @@ struct rtable *__ip_route_output_key_hash(struct net *net, struct flowi4 *fl4, if (err) { res.fi = NULL; res.table = NULL; - if (fl4->flowi4_oif && - !netif_index_is_l3_master(net, fl4->flowi4_oif)) { + if (fl4->flowi4_oif) { /* Apparently, routing tables are wrong. Assume, that the destination is on link. @@ -2575,9 +2570,6 @@ static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh) fl4.flowi4_oif = tb[RTA_OIF] ? nla_get_u32(tb[RTA_OIF]) : 0; fl4.flowi4_mark = mark; - if (netif_index_is_l3_master(net, fl4.flowi4_oif)) - fl4.flowi4_flags = FLOWI_FLAG_L3MDEV_SRC | FLOWI_FLAG_SKIP_NH_OIF; - if (iif) { struct net_device *dev; diff --git a/net/ipv4/xfrm4_policy.c b/net/ipv4/xfrm4_policy.c index b644a23c3db0..3155ed73d3b3 100644 --- a/net/ipv4/xfrm4_policy.c +++ b/net/ipv4/xfrm4_policy.c @@ -112,7 +112,7 @@ _decode_session4(struct sk_buff *skb, struct flowi *fl, int reverse) int oif = 0; if (skb_dst(skb)) - oif = l3mdev_fib_oif(skb_dst(skb)->dev); + oif = skb_dst(skb)->dev->ifindex; memset(fl4, 0, sizeof(struct flowi4)); fl4->flowi4_mark = skb->mark; diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index 9711f32eedd7..84d1b3feaf2e 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -1067,8 +1067,6 @@ struct dst_entry *ip6_dst_lookup_flow(const struct sock *sk, struct flowi6 *fl6, return ERR_PTR(err); if (final_dst) fl6->daddr = *final_dst; - if (!fl6->flowi6_oif) - fl6->flowi6_oif = l3mdev_fib_oif(dst->dev); return xfrm_lookup_route(net, dst, flowi6_to_flowi(fl6), sk, 0); } diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c index fe65cdc28a45..d8e671457d10 100644 --- a/net/ipv6/ndisc.c +++ b/net/ipv6/ndisc.c @@ -67,7 +67,6 @@ #include #include #include -#include #include #include @@ -457,11 +456,9 @@ static void ndisc_send_skb(struct sk_buff *skb, if (!dst) { struct flowi6 fl6; - int oif = l3mdev_fib_oif(skb->dev); + int oif = skb->dev->ifindex; icmpv6_flow_init(sk, , type, saddr, daddr, oif); - if (oif != 
skb->dev->ifindex) - fl6.flowi6_flags |= FLOWI_FLAG_L3MDEV_SRC; dst = icmp6_dst_alloc(skb->dev, ); if (IS_ERR(dst)) { kfree_skb(skb); @@ -1538,7 +1535,6 @@ void ndisc_send_redirect(struct sk_buff *skb, const struct in6_addr *target) int rd_len;
[PATCH net-next 02/12] net: l3mdev: Add hook to output path
This patch adds the infrastructure to the output path to pass an skb to an l3mdev device if it has a hook registered. This is the Tx parallel to l3mdev_ip{6}_rcv in the receive path and is the basis for removing the dst based hook. Signed-off-by: David Ahern--- include/net/l3mdev.h | 47 +++ net/ipv4/ip_output.c | 8 net/ipv6/ip6_output.c | 8 net/ipv6/output_core.c | 7 +++ net/ipv6/raw.c | 7 +++ 5 files changed, 77 insertions(+) diff --git a/include/net/l3mdev.h b/include/net/l3mdev.h index 81e175e80537..74ffe5aef299 100644 --- a/include/net/l3mdev.h +++ b/include/net/l3mdev.h @@ -11,6 +11,7 @@ #ifndef _NET_L3MDEV_H_ #define _NET_L3MDEV_H_ +#include #include /** @@ -18,6 +19,10 @@ * * @l3mdev_fib_table: Get FIB table id to use for lookups * + * @l3mdev_l3_rcv:Hook in L3 receive path + * + * @l3mdev_l3_out:Hook in L3 output path + * * @l3mdev_get_rtable: Get cached IPv4 rtable (dst_entry) for device * * @l3mdev_get_saddr: Get source address for a flow @@ -29,6 +34,9 @@ struct l3mdev_ops { u32 (*l3mdev_fib_table)(const struct net_device *dev); struct sk_buff * (*l3mdev_l3_rcv)(struct net_device *dev, struct sk_buff *skb, u16 proto); + struct sk_buff * (*l3mdev_l3_out)(struct net_device *dev, + struct sock *sk, struct sk_buff *skb, + u16 proto); /* IPv4 ops */ struct rtable * (*l3mdev_get_rtable)(const struct net_device *dev, @@ -201,6 +209,33 @@ struct sk_buff *l3mdev_ip6_rcv(struct sk_buff *skb) return l3mdev_l3_rcv(skb, AF_INET6); } +static inline +struct sk_buff *l3mdev_l3_out(struct sock *sk, struct sk_buff *skb, u16 proto) +{ + struct net_device *dev = skb_dst(skb)->dev; + struct net_device *master = NULL; + + if (netif_is_l3_slave(dev)) { + master = netdev_master_upper_dev_get_rcu(dev); + if (master && master->l3mdev_ops->l3mdev_l3_out) + skb = master->l3mdev_ops->l3mdev_l3_out(master, sk, + skb, proto); + } + + return skb; +} + +static inline +struct sk_buff *l3mdev_ip_out(struct sock *sk, struct sk_buff *skb) +{ + return l3mdev_l3_out(sk, skb, AF_INET); +} + 
+static inline +struct sk_buff *l3mdev_ip6_out(struct sock *sk, struct sk_buff *skb) +{ + return l3mdev_l3_out(sk, skb, AF_INET6); +} #else static inline int l3mdev_master_ifindex_rcu(const struct net_device *dev) @@ -287,6 +322,18 @@ struct sk_buff *l3mdev_ip6_rcv(struct sk_buff *skb) } static inline +struct sk_buff *l3mdev_ip_out(struct sock *sk, struct sk_buff *skb) +{ + return skb; +} + +static inline +struct sk_buff *l3mdev_ip6_out(struct sock *sk, struct sk_buff *skb) +{ + return skb; +} + +static inline int l3mdev_fib_rule_match(struct net *net, struct flowi *fl, struct fib_lookup_arg *arg) { diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index dde37fb340bf..3c727d4eaba9 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -98,6 +98,14 @@ int __ip_local_out(struct net *net, struct sock *sk, struct sk_buff *skb) iph->tot_len = htons(skb->len); ip_send_check(iph); + + /* if egress device is enslaved to an L3 master device pass the +* skb to its handler for processing +*/ + skb = l3mdev_ip_out(sk, skb); + if (unlikely(!skb)) + return 0; + return nf_hook(NFPROTO_IPV4, NF_INET_LOCAL_OUT, net, sk, skb, NULL, skb_dst(skb)->dev, dst_output); diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index 1dfc402d9ad1..bcec7e73eb0b 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -228,6 +228,14 @@ int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6, if ((skb->len <= mtu) || skb->ignore_df || skb_is_gso(skb)) { IP6_UPD_PO_STATS(net, ip6_dst_idev(skb_dst(skb)), IPSTATS_MIB_OUT, skb->len); + + /* if egress device is enslaved to an L3 master device pass the +* skb to its handler for processing +*/ + skb = l3mdev_ip6_out((struct sock *)sk, skb); + if (unlikely(!skb)) + return 0; + /* hooks should never assume socket lock is held. 
* we promote our socket to non const */ diff --git a/net/ipv6/output_core.c b/net/ipv6/output_core.c index 462f2a76b5c2..7cca8ac66fe9 100644 --- a/net/ipv6/output_core.c +++ b/net/ipv6/output_core.c @@ -148,6 +148,13 @@ int __ip6_local_out(struct net *net, struct sock *sk, struct sk_buff *skb)
[PATCH net-next 01/12] net: flow: Add l3mdev flow update
Add l3mdev hook to set FLOWI_FLAG_SKIP_NH_OIF flag and update oif/iif in flow struct if its oif or iif points to a device enslaved to an L3 Master device. Only 1 needs to be converted to match the l3mdev FIB rule. This moves the flow adjustment for l3mdev to a single point catching all lookups. It is redundant for existing hooks (those are removed in later patches) but is needed for missed lookups such as PMTU updates. Signed-off-by: David Ahern--- include/net/l3mdev.h | 6 ++ net/ipv4/fib_rules.c | 3 +++ net/ipv6/fib6_rules.c | 3 +++ net/l3mdev/l3mdev.c | 35 +++ 4 files changed, 47 insertions(+) diff --git a/include/net/l3mdev.h b/include/net/l3mdev.h index e90095091aa0..81e175e80537 100644 --- a/include/net/l3mdev.h +++ b/include/net/l3mdev.h @@ -49,6 +49,8 @@ struct l3mdev_ops { int l3mdev_fib_rule_match(struct net *net, struct flowi *fl, struct fib_lookup_arg *arg); +void l3mdev_update_flow(struct net *net, struct flowi *fl); + int l3mdev_master_ifindex_rcu(const struct net_device *dev); static inline int l3mdev_master_ifindex(struct net_device *dev) { @@ -290,6 +292,10 @@ int l3mdev_fib_rule_match(struct net *net, struct flowi *fl, { return 1; } +static inline +void l3mdev_update_flow(struct net *net, struct flowi *fl) +{ +} #endif #endif /* _NET_L3MDEV_H_ */ diff --git a/net/ipv4/fib_rules.c b/net/ipv4/fib_rules.c index 6e9ea69e5f75..770bebed6b28 100644 --- a/net/ipv4/fib_rules.c +++ b/net/ipv4/fib_rules.c @@ -56,6 +56,9 @@ int __fib_lookup(struct net *net, struct flowi4 *flp, }; int err; + /* update flow if oif or iif point to device enslaved to l3mdev */ + l3mdev_update_flow(net, flowi4_to_flowi(flp)); + err = fib_rules_lookup(net->ipv4.rules_ops, flowi4_to_flowi(flp), 0, ); #ifdef CONFIG_IP_ROUTE_CLASSID if (arg.rule) diff --git a/net/ipv6/fib6_rules.c b/net/ipv6/fib6_rules.c index 5857c1fc8b67..eea23b57c6a5 100644 --- a/net/ipv6/fib6_rules.c +++ b/net/ipv6/fib6_rules.c @@ -38,6 +38,9 @@ struct dst_entry *fib6_rule_lookup(struct net *net, struct flowi6 
*fl6, .flags = FIB_LOOKUP_NOREF, }; + /* update flow if oif or iif point to device enslaved to l3mdev */ + l3mdev_update_flow(net, flowi6_to_flowi(fl6)); + fib_rules_lookup(net->ipv6.fib6_rules_ops, flowi6_to_flowi(fl6), flags, ); diff --git a/net/l3mdev/l3mdev.c b/net/l3mdev/l3mdev.c index c4a1c3e84e12..43610e5acc4e 100644 --- a/net/l3mdev/l3mdev.c +++ b/net/l3mdev/l3mdev.c @@ -222,3 +222,38 @@ int l3mdev_fib_rule_match(struct net *net, struct flowi *fl, return rc; } + +void l3mdev_update_flow(struct net *net, struct flowi *fl) +{ + struct net_device *dev; + int ifindex; + + rcu_read_lock(); + + if (fl->flowi_oif) { + dev = dev_get_by_index_rcu(net, fl->flowi_oif); + if (dev) { + ifindex = l3mdev_master_ifindex_rcu(dev); + if (ifindex) { + fl->flowi_oif = ifindex; + fl->flowi_flags |= FLOWI_FLAG_SKIP_NH_OIF; + goto out; + } + } + } + + if (fl->flowi_iif) { + dev = dev_get_by_index_rcu(net, fl->flowi_iif); + if (dev) { + ifindex = l3mdev_master_ifindex_rcu(dev); + if (ifindex) { + fl->flowi_iif = ifindex; + fl->flowi_flags |= FLOWI_FLAG_SKIP_NH_OIF; + } + } + } + +out: + rcu_read_unlock(); +} +EXPORT_SYMBOL_GPL(l3mdev_update_flow); -- 2.1.4
[PATCH net-next 04/12] net: vrf: Flip IPv4 path from dst to out hook
Flip the IPv4 output path from use of the vrf dst to the l3mdev tx out hook. Signed-off-by: David Ahern--- drivers/net/vrf.c | 171 -- net/ipv4/route.c | 4 -- 2 files changed, 64 insertions(+), 111 deletions(-) diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c index 1ce7420322ee..7517645347c3 100644 --- a/drivers/net/vrf.c +++ b/drivers/net/vrf.c @@ -230,79 +230,28 @@ static netdev_tx_t vrf_process_v6_outbound(struct sk_buff *skb, static netdev_tx_t vrf_process_v4_outbound(struct sk_buff *skb, struct net_device *vrf_dev) { - struct iphdr *ip4h = ip_hdr(skb); - int ret = NET_XMIT_DROP; - struct flowi4 fl4 = { - /* needed to match OIF rule */ - .flowi4_oif = vrf_dev->ifindex, - .flowi4_iif = LOOPBACK_IFINDEX, - .flowi4_tos = RT_TOS(ip4h->tos), - .flowi4_flags = FLOWI_FLAG_ANYSRC | FLOWI_FLAG_L3MDEV_SRC | - FLOWI_FLAG_SKIP_NH_OIF, - .daddr = ip4h->daddr, - }; - struct net *net = dev_net(vrf_dev); - struct rtable *rt; - - rt = ip_route_output_flow(net, , NULL); - if (IS_ERR(rt)) - goto err; - - if (rt->rt_type != RTN_UNICAST && rt->rt_type != RTN_LOCAL) { - ip_rt_put(rt); - goto err; - } + struct net_vrf *vrf = netdev_priv(vrf_dev); + struct dst_entry *dst = NULL; + struct rtable *rth_local; skb_dst_drop(skb); - /* if dst.dev is loopback or the VRF device again this is locally -* originated traffic destined to a local address. 
Short circuit -* to Rx path using our local dst -*/ - if (rt->dst.dev == net->loopback_dev || rt->dst.dev == vrf_dev) { - struct net_vrf *vrf = netdev_priv(vrf_dev); - struct rtable *rth_local; - struct dst_entry *dst = NULL; - - ip_rt_put(rt); - - rcu_read_lock(); - - rth_local = rcu_dereference(vrf->rth_local); - if (likely(rth_local)) { - dst = _local->dst; - dst_hold(dst); - } - - rcu_read_unlock(); - - if (unlikely(!dst)) - goto err; + rcu_read_lock(); - return vrf_local_xmit(skb, vrf_dev, dst); + rth_local = rcu_dereference(vrf->rth_local); + if (likely(rth_local)) { + dst = _local->dst; + dst_hold(dst); } - skb_dst_set(skb, >dst); - - /* strip the ethernet header added for pass through VRF device */ - __skb_pull(skb, skb_network_offset(skb)); + rcu_read_unlock(); - if (!ip4h->saddr) { - ip4h->saddr = inet_select_addr(skb_dst(skb)->dev, 0, - RT_SCOPE_LINK); + if (unlikely(!dst)) { + vrf_tx_error(vrf_dev, skb); + return NET_XMIT_DROP; } - ret = ip_local_out(dev_net(skb_dst(skb)->dev), skb->sk, skb); - if (unlikely(net_xmit_eval(ret))) - vrf_dev->stats.tx_errors++; - else - ret = NET_XMIT_SUCCESS; - -out: - return ret; -err: - vrf_tx_error(vrf_dev, skb); - goto out; + return vrf_local_xmit(skb, vrf_dev, dst); } static netdev_tx_t is_ip_tx_frame(struct sk_buff *skb, struct net_device *dev) @@ -473,64 +422,71 @@ static int vrf_rt6_create(struct net_device *dev) } #endif -/* modelled after ip_finish_output2 */ +/* run skb through packet sockets for tcpdump with dev set to vrf dev */ static int vrf_finish_output(struct net *net, struct sock *sk, struct sk_buff *skb) { - struct dst_entry *dst = skb_dst(skb); - struct rtable *rt = (struct rtable *)dst; - struct net_device *dev = dst->dev; - unsigned int hh_len = LL_RESERVED_SPACE(dev); - struct neighbour *neigh; - u32 nexthop; - int ret = -EINVAL; - - /* Be paranoid, rather than too clever. 
*/ - if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) { - struct sk_buff *skb2; - - skb2 = skb_realloc_headroom(skb, LL_RESERVED_SPACE(dev)); - if (!skb2) { - ret = -ENOMEM; - goto err; - } - if (skb->sk) - skb_set_owner_w(skb2, skb->sk); - - consume_skb(skb); - skb = skb2; + if (likely(skb_headroom(skb) >= ETH_HLEN)) { + struct ethhdr *eth = (struct ethhdr *)skb_push(skb, ETH_HLEN); + + ether_addr_copy(eth->h_source, skb->dev->dev_addr); + eth_zero_addr(eth->h_dest); + eth->h_proto = skb->protocol; + dev_queue_xmit_nit(skb, skb->dev); +
[PATCH net-next 05/12] net: vrf: Flip IPv6 path from dst to out hook
Flip the IPv6 output path from use of the vrf dst to the l3mdev tx out hook. Signed-off-by: David Ahern--- drivers/net/vrf.c | 156 -- net/ipv6/ip6_output.c | 9 ++- net/ipv6/route.c | 5 -- 3 files changed, 70 insertions(+), 100 deletions(-) diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c index 7517645347c3..df58bc791cfd 100644 --- a/drivers/net/vrf.c +++ b/drivers/net/vrf.c @@ -140,80 +140,42 @@ static int vrf_local_xmit(struct sk_buff *skb, struct net_device *dev, static netdev_tx_t vrf_process_v6_outbound(struct sk_buff *skb, struct net_device *dev) { - const struct ipv6hdr *iph = ipv6_hdr(skb); - struct net *net = dev_net(skb->dev); - struct flowi6 fl6 = { - /* needed to match OIF rule */ - .flowi6_oif = dev->ifindex, - .flowi6_iif = LOOPBACK_IFINDEX, - .daddr = iph->daddr, - .saddr = iph->saddr, - .flowlabel = ip6_flowinfo(iph), - .flowi6_mark = skb->mark, - .flowi6_proto = iph->nexthdr, - .flowi6_flags = FLOWI_FLAG_L3MDEV_SRC | FLOWI_FLAG_SKIP_NH_OIF, - }; - int ret = NET_XMIT_DROP; - struct dst_entry *dst; - struct dst_entry *dst_null = >ipv6.ip6_null_entry->dst; - - dst = ip6_route_output(net, NULL, ); - if (dst == dst_null) - goto err; + struct net_vrf *vrf = netdev_priv(dev); + struct dst_entry *dst = NULL; + struct rt6_info *rt6_local; skb_dst_drop(skb); - /* if dst.dev is loopback or the VRF device again this is locally -* originated traffic destined to a local address. Short circuit -* to Rx path using our local dst -*/ - if (dst->dev == net->loopback_dev || dst->dev == dev) { - struct net_vrf *vrf = netdev_priv(dev); - struct rt6_info *rt6_local; - - /* release looked up dst and use cached local dst */ - dst_release(dst); + rcu_read_lock(); - rcu_read_lock(); + rt6_local = rcu_dereference(vrf->rt6_local); + if (unlikely(!rt6_local)) { + rcu_read_unlock(); + goto err; + } - rt6_local = rcu_dereference(vrf->rt6_local); - if (unlikely(!rt6_local)) { + /* Ordering issue: cached local dst is created on newlink +* before the IPv6 initialization. 
Using the local dst +* requires rt6i_idev to be set so make sure it is. +*/ + if (unlikely(!rt6_local->rt6i_idev)) { + rt6_local->rt6i_idev = in6_dev_get(dev); + if (!rt6_local->rt6i_idev) { rcu_read_unlock(); goto err; } - - /* Ordering issue: cached local dst is created on newlink -* before the IPv6 initialization. Using the local dst -* requires rt6i_idev to be set so make sure it is. -*/ - if (unlikely(!rt6_local->rt6i_idev)) { - rt6_local->rt6i_idev = in6_dev_get(dev); - if (!rt6_local->rt6i_idev) { - rcu_read_unlock(); - goto err; - } - } - - dst = _local->dst; - dst_hold(dst); - - rcu_read_unlock(); - - return vrf_local_xmit(skb, dev, _local->dst); } - skb_dst_set(skb, dst); + dst = _local->dst; + if (likely(dst)) + dst_hold(dst); - /* strip the ethernet header added for pass through VRF device */ - __skb_pull(skb, skb_network_offset(skb)); + rcu_read_unlock(); - ret = ip6_local_out(net, skb->sk, skb); - if (unlikely(net_xmit_eval(ret))) - dev->stats.tx_errors++; - else - ret = NET_XMIT_SUCCESS; + if (unlikely(!dst)) + goto err; - return ret; + return vrf_local_xmit(skb, dev, dst); err: vrf_tx_error(dev, skb); return NET_XMIT_DROP; @@ -286,44 +248,43 @@ static netdev_tx_t vrf_xmit(struct sk_buff *skb, struct net_device *dev) } #if IS_ENABLED(CONFIG_IPV6) -/* modelled after ip6_finish_output2 */ -static int vrf_finish_output6(struct net *net, struct sock *sk, - struct sk_buff *skb) -{ - struct dst_entry *dst = skb_dst(skb); - struct net_device *dev = dst->dev; - struct neighbour *neigh; - struct in6_addr *nexthop; - int ret; +static int vrf_finish_output(struct net *net, struct sock *sk, +struct sk_buff *skb); +/* modelled after ip6_output */ +static int vrf_output6(struct net *net, struct sock *sk, struct sk_buff *skb) +{ skb->protocol =
[PATCH net-next 10/12] net: l3mdev: Remove l3mdev_get_rt6_dst
No longer used Signed-off-by: David Ahern--- drivers/net/vrf.c| 92 +++- include/net/l3mdev.h | 14 net/l3mdev/l3mdev.c | 32 -- 3 files changed, 4 insertions(+), 134 deletions(-) diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c index 08103bc7f1f5..23801647c113 100644 --- a/drivers/net/vrf.c +++ b/drivers/net/vrf.c @@ -48,7 +48,6 @@ static bool add_fib_rules = true; struct net_vrf { struct rtable __rcu *rth_local; - struct rt6_info __rcu *rt6; struct rt6_info __rcu *rt6_local; u32 tb_id; }; @@ -289,25 +288,11 @@ static struct sk_buff *vrf_ip6_out(struct net_device *vrf_dev, /* holding rtnl */ static void vrf_rt6_release(struct net_device *dev, struct net_vrf *vrf) { - struct rt6_info *rt6 = rtnl_dereference(vrf->rt6); struct rt6_info *rt6_local = rtnl_dereference(vrf->rt6_local); struct net *net = dev_net(dev); struct dst_entry *dst; - RCU_INIT_POINTER(vrf->rt6, NULL); - RCU_INIT_POINTER(vrf->rt6_local, NULL); - synchronize_rcu(); - - /* move dev in dst's to loopback so this VRF device can be deleted -* - based on dst_ifdown -*/ - if (rt6) { - dst = >dst; - dev_put(dst->dev); - dst->dev = net->loopback_dev; - dev_hold(dst->dev); - dst_release(dst); - } + rcu_assign_pointer(vrf->rt6_local, NULL); if (rt6_local) { if (rt6_local->rt6i_idev) @@ -327,7 +312,7 @@ static int vrf_rt6_create(struct net_device *dev) struct net_vrf *vrf = netdev_priv(dev); struct net *net = dev_net(dev); struct fib6_table *rt6i_table; - struct rt6_info *rt6, *rt6_local; + struct rt6_info *rt6_local; int rc = -ENOMEM; /* IPv6 can be CONFIG enabled and then disabled runtime */ @@ -338,24 +323,12 @@ static int vrf_rt6_create(struct net_device *dev) if (!rt6i_table) goto out; - /* create a dst for routing packets out a VRF device */ - rt6 = ip6_dst_alloc(net, dev, flags); - if (!rt6) - goto out; - - dst_hold(>dst); - - rt6->rt6i_table = rt6i_table; - rt6->dst.output = vrf_output6; - /* create a dst for local routing - packets sent locally * to local address via the VRF device as a loopback */ 
rt6_local = ip6_dst_alloc(net, dev, flags); - if (!rt6_local) { - dst_release(>dst); + if (!rt6_local) goto out; - } dst_hold(_local->dst); @@ -364,7 +337,6 @@ static int vrf_rt6_create(struct net_device *dev) rt6_local->rt6i_table = rt6i_table; rt6_local->dst.input = ip6_input; - rcu_assign_pointer(vrf->rt6, rt6); rcu_assign_pointer(vrf->rt6_local, rt6_local); rc = 0; @@ -693,7 +665,7 @@ static struct rt6_info *vrf_ip6_route_lookup(struct net *net, rcu_read_lock(); /* fib6_table does not have a refcnt and can not be freed */ - rt6 = rcu_dereference(vrf->rt6); + rt6 = rcu_dereference(vrf->rt6_local); if (likely(rt6)) table = rt6->rt6i_table; @@ -816,66 +788,10 @@ static struct sk_buff *vrf_l3_rcv(struct net_device *vrf_dev, return skb; } -#if IS_ENABLED(CONFIG_IPV6) -static struct dst_entry *vrf_get_rt6_dst(const struct net_device *dev, -struct flowi6 *fl6) -{ - bool need_strict = rt6_need_strict(>daddr); - struct net_vrf *vrf = netdev_priv(dev); - struct net *net = dev_net(dev); - struct dst_entry *dst = NULL; - struct rt6_info *rt; - - /* send to link-local or multicast address */ - if (need_strict) { - int flags = RT6_LOOKUP_F_IFACE; - - /* VRF device does not have a link-local address and -* sending packets to link-local or mcast addresses over -* a VRF device does not make sense -*/ - if (fl6->flowi6_oif == dev->ifindex) { - struct dst_entry *dst = >ipv6.ip6_null_entry->dst; - - dst_hold(dst); - return dst; - } - - if (!ipv6_addr_any(>saddr)) - flags |= RT6_LOOKUP_F_HAS_SADDR; - - rt = vrf_ip6_route_lookup(net, dev, fl6, fl6->flowi6_oif, flags); - if (rt) - dst = >dst; - - } else if (!(fl6->flowi6_flags & FLOWI_FLAG_L3MDEV_SRC)) { - - rcu_read_lock(); - - rt = rcu_dereference(vrf->rt6); - if (likely(rt)) { - dst = >dst; - dst_hold(dst); - } - - rcu_read_unlock(); - } - - /* make sure oif is set
[PATCH net-next 12/12] net: flow: Remove FLOWI_FLAG_L3MDEV_SRC flag
No longer used.

Signed-off-by: David Ahern
---
 include/net/flow.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/include/net/flow.h b/include/net/flow.h
index d47ef4bb5423..035aa7716967 100644
--- a/include/net/flow.h
+++ b/include/net/flow.h
@@ -34,8 +34,7 @@ struct flowi_common {
 	__u8	flowic_flags;
 #define FLOWI_FLAG_ANYSRC		0x01
 #define FLOWI_FLAG_KNOWN_NH		0x02
-#define FLOWI_FLAG_L3MDEV_SRC		0x04
-#define FLOWI_FLAG_SKIP_NH_OIF		0x08
+#define FLOWI_FLAG_SKIP_NH_OIF		0x04
 	__u32	flowic_secid;
 	struct flowi_tunnel	flowic_tun_key;
 };
-- 
2.1.4
[PATCH net-next 11/12] net: l3mdev: Remove l3mdev_fib_oif
No longer used.

Signed-off-by: David Ahern
---
 include/net/l3mdev.h | 29 -----------------------------
 1 file changed, 29 deletions(-)

diff --git a/include/net/l3mdev.h b/include/net/l3mdev.h
index 3c1d71474f55..6aae664b427a 100644
--- a/include/net/l3mdev.h
+++ b/include/net/l3mdev.h
@@ -95,26 +95,6 @@ struct net_device *l3mdev_master_dev_rcu(const struct net_device *_dev)
 	return master;
 }
 
-/* get index of an interface to use for FIB lookups. For devices
- * enslaved to an L3 master device FIB lookups are based on the
- * master index
- */
-static inline int l3mdev_fib_oif_rcu(struct net_device *dev)
-{
-	return l3mdev_master_ifindex_rcu(dev) ? : dev->ifindex;
-}
-
-static inline int l3mdev_fib_oif(struct net_device *dev)
-{
-	int oif;
-
-	rcu_read_lock();
-	oif = l3mdev_fib_oif_rcu(dev);
-	rcu_read_unlock();
-
-	return oif;
-}
-
 u32 l3mdev_fib_table_rcu(const struct net_device *dev);
 u32 l3mdev_fib_table_by_index(struct net *net, int ifindex);
 static inline u32 l3mdev_fib_table(const struct net_device *dev)
@@ -224,15 +204,6 @@ struct net_device *l3mdev_master_dev_rcu(const struct net_device *dev)
 	return NULL;
 }
 
-static inline int l3mdev_fib_oif_rcu(struct net_device *dev)
-{
-	return dev ? dev->ifindex : 0;
-}
-static inline int l3mdev_fib_oif(struct net_device *dev)
-{
-	return dev ? dev->ifindex : 0;
-}
-
 static inline u32 l3mdev_fib_table_rcu(const struct net_device *dev)
 {
 	return 0;
-- 
2.1.4
[PATCH net-next 08/12] net: ipv6: Remove l3mdev_get_saddr6
No longer needed Signed-off-by: David Ahern--- drivers/net/vrf.c | 41 - include/net/l3mdev.h | 11 --- net/ipv6/ip6_output.c | 9 + net/l3mdev/l3mdev.c | 24 4 files changed, 1 insertion(+), 84 deletions(-) diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c index ec65bf2afcb2..cc18319b4b0d 100644 --- a/drivers/net/vrf.c +++ b/drivers/net/vrf.c @@ -909,46 +909,6 @@ static struct dst_entry *vrf_get_rt6_dst(const struct net_device *dev, return dst; } - -/* called under rcu_read_lock */ -static int vrf_get_saddr6(struct net_device *dev, const struct sock *sk, - struct flowi6 *fl6) -{ - struct net *net = dev_net(dev); - struct dst_entry *dst; - struct rt6_info *rt; - int err; - - if (rt6_need_strict(>daddr)) { - rt = vrf_ip6_route_lookup(net, dev, fl6, fl6->flowi6_oif, - RT6_LOOKUP_F_IFACE); - if (unlikely(!rt)) - return 0; - - dst = >dst; - } else { - __u8 flags = fl6->flowi6_flags; - - fl6->flowi6_flags |= FLOWI_FLAG_L3MDEV_SRC; - fl6->flowi6_flags |= FLOWI_FLAG_SKIP_NH_OIF; - - dst = ip6_route_output(net, sk, fl6); - rt = (struct rt6_info *)dst; - - fl6->flowi6_flags = flags; - } - - err = dst->error; - if (!err) { - err = ip6_route_get_saddr(net, rt, >daddr, - sk ? 
inet6_sk(sk)->srcprefs : 0, - >saddr); - } - - dst_release(dst); - - return err; -} #endif static const struct l3mdev_ops vrf_l3mdev_ops = { @@ -958,7 +918,6 @@ static const struct l3mdev_ops vrf_l3mdev_ops = { .l3mdev_l3_out = vrf_l3_out, #if IS_ENABLED(CONFIG_IPV6) .l3mdev_get_rt6_dst = vrf_get_rt6_dst, - .l3mdev_get_saddr6 = vrf_get_saddr6, #endif }; diff --git a/include/net/l3mdev.h b/include/net/l3mdev.h index 8085be19a767..391c46130ef6 100644 --- a/include/net/l3mdev.h +++ b/include/net/l3mdev.h @@ -43,9 +43,6 @@ struct l3mdev_ops { /* IPv6 ops */ struct dst_entry * (*l3mdev_get_rt6_dst)(const struct net_device *dev, struct flowi6 *fl6); - int(*l3mdev_get_saddr6)(struct net_device *dev, - const struct sock *sk, - struct flowi6 *fl6); }; #ifdef CONFIG_NET_L3_MASTER_DEV @@ -172,8 +169,6 @@ static inline bool netif_index_is_l3_master(struct net *net, int ifindex) } struct dst_entry *l3mdev_get_rt6_dst(struct net *net, struct flowi6 *fl6); -int l3mdev_get_saddr6(struct net *net, const struct sock *sk, - struct flowi6 *fl6); static inline struct sk_buff *l3mdev_l3_rcv(struct sk_buff *skb, u16 proto) @@ -291,12 +286,6 @@ struct dst_entry *l3mdev_get_rt6_dst(struct net *net, struct flowi6 *fl6) return NULL; } -static inline int l3mdev_get_saddr6(struct net *net, const struct sock *sk, - struct flowi6 *fl6) -{ - return 0; -} - static inline struct sk_buff *l3mdev_ip_rcv(struct sk_buff *skb) { diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index 84d1b3feaf2e..2d067b0c2f10 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -918,13 +918,6 @@ static int ip6_dst_lookup_tail(struct net *net, const struct sock *sk, int err; int flags = 0; - if (ipv6_addr_any(>saddr) && fl6->flowi6_oif && - (!*dst || !(*dst)->error)) { - err = l3mdev_get_saddr6(net, sk, fl6); - if (err) - goto out_err; - } - /* The correct way to handle this would be to do * ip6_route_get_saddr, and then ip6_route_output; however, * the route-specific preferred source forces the @@ 
-1016,7 +1009,7 @@ static int ip6_dst_lookup_tail(struct net *net, const struct sock *sk, out_err_release: dst_release(*dst); *dst = NULL; -out_err: + if (err == -ENETUNREACH) IP6_INC_STATS(net, NULL, IPSTATS_MIB_OUTNOROUTES); return err; diff --git a/net/l3mdev/l3mdev.c b/net/l3mdev/l3mdev.c index b30034efccff..998e4dc2e6f9 100644 --- a/net/l3mdev/l3mdev.c +++ b/net/l3mdev/l3mdev.c @@ -131,30 +131,6 @@ struct dst_entry *l3mdev_get_rt6_dst(struct net *net, } EXPORT_SYMBOL_GPL(l3mdev_get_rt6_dst); -int l3mdev_get_saddr6(struct net *net, const struct sock *sk, - struct flowi6 *fl6) -{ - struct net_device *dev; - int rc = 0; - - if (fl6->flowi6_oif) { - rcu_read_lock(); - - dev = dev_get_by_index_rcu(net, fl6->flowi6_oif); - if
[PATCH net-next 07/12] net: ipv4: Remove l3mdev_get_saddr
No longer needed Signed-off-by: David Ahern--- drivers/net/vrf.c| 38 -- include/net/l3mdev.h | 12 include/net/route.h | 10 -- net/ipv4/raw.c | 6 -- net/ipv4/udp.c | 6 -- net/l3mdev/l3mdev.c | 31 --- 6 files changed, 103 deletions(-) diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c index df58bc791cfd..ec65bf2afcb2 100644 --- a/drivers/net/vrf.c +++ b/drivers/net/vrf.c @@ -668,43 +668,6 @@ static struct rtable *vrf_get_rtable(const struct net_device *dev, return rth; } -/* called under rcu_read_lock */ -static int vrf_get_saddr(struct net_device *dev, struct flowi4 *fl4) -{ - struct fib_result res = { .tclassid = 0 }; - struct net *net = dev_net(dev); - u32 orig_tos = fl4->flowi4_tos; - u8 flags = fl4->flowi4_flags; - u8 scope = fl4->flowi4_scope; - u8 tos = RT_FL_TOS(fl4); - int rc; - - if (unlikely(!fl4->daddr)) - return 0; - - fl4->flowi4_flags |= FLOWI_FLAG_SKIP_NH_OIF; - fl4->flowi4_iif = LOOPBACK_IFINDEX; - /* make sure oif is set to VRF device for lookup */ - fl4->flowi4_oif = dev->ifindex; - fl4->flowi4_tos = tos & IPTOS_RT_MASK; - fl4->flowi4_scope = ((tos & RTO_ONLINK) ? -RT_SCOPE_LINK : RT_SCOPE_UNIVERSE); - - rc = fib_lookup(net, fl4, , 0); - if (!rc) { - if (res.type == RTN_LOCAL) - fl4->saddr = res.fi->fib_prefsrc ? 
: fl4->daddr; - else - fib_select_path(net, , fl4, -1); - } - - fl4->flowi4_flags = flags; - fl4->flowi4_tos = orig_tos; - fl4->flowi4_scope = scope; - - return rc; -} - static int vrf_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb) { return 0; @@ -991,7 +954,6 @@ static int vrf_get_saddr6(struct net_device *dev, const struct sock *sk, static const struct l3mdev_ops vrf_l3mdev_ops = { .l3mdev_fib_table = vrf_fib_table, .l3mdev_get_rtable = vrf_get_rtable, - .l3mdev_get_saddr = vrf_get_saddr, .l3mdev_l3_rcv = vrf_l3_rcv, .l3mdev_l3_out = vrf_l3_out, #if IS_ENABLED(CONFIG_IPV6) diff --git a/include/net/l3mdev.h b/include/net/l3mdev.h index 5f03a89bb075..8085be19a767 100644 --- a/include/net/l3mdev.h +++ b/include/net/l3mdev.h @@ -25,8 +25,6 @@ * * @l3mdev_get_rtable: Get cached IPv4 rtable (dst_entry) for device * - * @l3mdev_get_saddr: Get source address for a flow - * * @l3mdev_get_rt6_dst: Get cached IPv6 rt6_info (dst_entry) for device */ @@ -41,8 +39,6 @@ struct l3mdev_ops { /* IPv4 ops */ struct rtable * (*l3mdev_get_rtable)(const struct net_device *dev, const struct flowi4 *fl4); - int (*l3mdev_get_saddr)(struct net_device *dev, - struct flowi4 *fl4); /* IPv6 ops */ struct dst_entry * (*l3mdev_get_rt6_dst)(const struct net_device *dev, @@ -175,8 +171,6 @@ static inline bool netif_index_is_l3_master(struct net *net, int ifindex) return rc; } -int l3mdev_get_saddr(struct net *net, int ifindex, struct flowi4 *fl4); - struct dst_entry *l3mdev_get_rt6_dst(struct net *net, struct flowi6 *fl6); int l3mdev_get_saddr6(struct net *net, const struct sock *sk, struct flowi6 *fl6); @@ -291,12 +285,6 @@ static inline bool netif_index_is_l3_master(struct net *net, int ifindex) return false; } -static inline int l3mdev_get_saddr(struct net *net, int ifindex, - struct flowi4 *fl4) -{ - return 0; -} - static inline struct dst_entry *l3mdev_get_rt6_dst(struct net *net, struct flowi6 *fl6) { diff --git a/include/net/route.h b/include/net/route.h index 
ad777d79af94..0429d47cad25 100644 --- a/include/net/route.h +++ b/include/net/route.h @@ -29,7 +29,6 @@ #include #include #include -#include #include #include #include @@ -285,15 +284,6 @@ static inline struct rtable *ip_route_connect(struct flowi4 *fl4, ip_route_connect_init(fl4, dst, src, tos, oif, protocol, sport, dport, sk); - if (!src && oif) { - int rc; - - rc = l3mdev_get_saddr(net, oif, fl4); - if (rc < 0) - return ERR_PTR(rc); - - src = fl4->saddr; - } if (!dst || !src) { rt = __ip_route_output_key(net, fl4); if (IS_ERR(rt)) diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c index 438f50c1a676..90a85c955872 100644 --- a/net/ipv4/raw.c +++ b/net/ipv4/raw.c @@ -606,12 +606,6 @@ static int raw_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) (inet->hdrincl ? FLOWI_FLAG_KNOWN_NH : 0),
[PATCH net-next 09/12] net: l3mdev: Remove l3mdev_get_rtable
No longer used Signed-off-by: David Ahern--- drivers/net/vrf.c| 47 ++- include/net/l3mdev.h | 21 - 2 files changed, 2 insertions(+), 66 deletions(-) diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c index cc18319b4b0d..08103bc7f1f5 100644 --- a/drivers/net/vrf.c +++ b/drivers/net/vrf.c @@ -47,7 +47,6 @@ static bool add_fib_rules = true; struct net_vrf { - struct rtable __rcu *rth; struct rtable __rcu *rth_local; struct rt6_info __rcu *rt6; struct rt6_info __rcu *rt6_local; @@ -460,26 +459,16 @@ static struct sk_buff *vrf_l3_out(struct net_device *vrf_dev, /* holding rtnl */ static void vrf_rtable_release(struct net_device *dev, struct net_vrf *vrf) { - struct rtable *rth = rtnl_dereference(vrf->rth); struct rtable *rth_local = rtnl_dereference(vrf->rth_local); struct net *net = dev_net(dev); struct dst_entry *dst; - RCU_INIT_POINTER(vrf->rth, NULL); RCU_INIT_POINTER(vrf->rth_local, NULL); synchronize_rcu(); /* move dev in dst's to loopback so this VRF device can be deleted * - based on dst_ifdown */ - if (rth) { - dst = >dst; - dev_put(dst->dev); - dst->dev = net->loopback_dev; - dev_hold(dst->dev); - dst_release(dst); - } - if (rth_local) { dst = _local->dst; dev_put(dst->dev); @@ -492,31 +481,20 @@ static void vrf_rtable_release(struct net_device *dev, struct net_vrf *vrf) static int vrf_rtable_create(struct net_device *dev) { struct net_vrf *vrf = netdev_priv(dev); - struct rtable *rth, *rth_local; + struct rtable *rth_local; if (!fib_new_table(dev_net(dev), vrf->tb_id)) return -ENOMEM; - /* create a dst for routing packets out through a VRF device */ - rth = rt_dst_alloc(dev, 0, RTN_UNICAST, 1, 1, 0); - if (!rth) - return -ENOMEM; - /* create a dst for local ingress routing - packets sent locally * to local address via the VRF device as a loopback */ rth_local = rt_dst_alloc(dev, RTCF_LOCAL, RTN_LOCAL, 1, 1, 0); - if (!rth_local) { - dst_release(>dst); + if (!rth_local) return -ENOMEM; - } - - rth->dst.output = vrf_output; - rth->rt_table_id = vrf->tb_id; 
rth_local->rt_table_id = vrf->tb_id; - rcu_assign_pointer(vrf->rth, rth); rcu_assign_pointer(vrf->rth_local, rth_local); return 0; @@ -648,26 +626,6 @@ static u32 vrf_fib_table(const struct net_device *dev) return vrf->tb_id; } -static struct rtable *vrf_get_rtable(const struct net_device *dev, -const struct flowi4 *fl4) -{ - struct rtable *rth = NULL; - - if (!(fl4->flowi4_flags & FLOWI_FLAG_L3MDEV_SRC)) { - struct net_vrf *vrf = netdev_priv(dev); - - rcu_read_lock(); - - rth = rcu_dereference(vrf->rth); - if (likely(rth)) - dst_hold(>dst); - - rcu_read_unlock(); - } - - return rth; -} - static int vrf_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb) { return 0; @@ -913,7 +871,6 @@ static struct dst_entry *vrf_get_rt6_dst(const struct net_device *dev, static const struct l3mdev_ops vrf_l3mdev_ops = { .l3mdev_fib_table = vrf_fib_table, - .l3mdev_get_rtable = vrf_get_rtable, .l3mdev_l3_rcv = vrf_l3_rcv, .l3mdev_l3_out = vrf_l3_out, #if IS_ENABLED(CONFIG_IPV6) diff --git a/include/net/l3mdev.h b/include/net/l3mdev.h index 391c46130ef6..44ceec61de63 100644 --- a/include/net/l3mdev.h +++ b/include/net/l3mdev.h @@ -23,8 +23,6 @@ * * @l3mdev_l3_out:Hook in L3 output path * - * @l3mdev_get_rtable: Get cached IPv4 rtable (dst_entry) for device - * * @l3mdev_get_rt6_dst: Get cached IPv6 rt6_info (dst_entry) for device */ @@ -36,10 +34,6 @@ struct l3mdev_ops { struct sock *sk, struct sk_buff *skb, u16 proto); - /* IPv4 ops */ - struct rtable * (*l3mdev_get_rtable)(const struct net_device *dev, -const struct flowi4 *fl4); - /* IPv6 ops */ struct dst_entry * (*l3mdev_get_rt6_dst)(const struct net_device *dev, struct flowi6 *fl6); @@ -140,15 +134,6 @@ static inline u32 l3mdev_fib_table(const struct net_device *dev) return tb_id; } -static inline struct rtable *l3mdev_get_rtable(const struct net_device *dev, - const struct flowi4 *fl4) -{ - if (netif_is_l3_master(dev) &&
Re: [PATCH] net: pegasus: Remove deprecated create_singlethread_workqueue
On 16-08-30 22:02:47, Bhaktipriya Shridhar wrote:
> The workqueue "pegasus_workqueue" queues a single work item per pegasus
> instance and hence it doesn't require execution ordering. Hence,
> alloc_workqueue has been used to replace the deprecated
> create_singlethread_workqueue instance.
>
> The WQ_MEM_RECLAIM flag has been set to ensure forward progress under
> memory pressure since it's a network driver.
>
> Since there are fixed number of work items, explicit concurrency
> limit is unnecessary here.
>
> Signed-off-by: Bhaktipriya Shridhar
> ---
>  drivers/net/usb/pegasus.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/usb/pegasus.c b/drivers/net/usb/pegasus.c
> index 9bbe0161..1434e5d 100644
> --- a/drivers/net/usb/pegasus.c
> +++ b/drivers/net/usb/pegasus.c
> @@ -1129,7 +1129,8 @@ static int pegasus_probe(struct usb_interface *intf,
>  		return -ENODEV;
>
>  	if (pegasus_count == 0) {
> -		pegasus_workqueue = create_singlethread_workqueue("pegasus");
> +		pegasus_workqueue = alloc_workqueue("pegasus", WQ_MEM_RECLAIM,
> +						    0);
>  		if (!pegasus_workqueue)
>  			return -ENOMEM;
>  	}
> --
> 2.1.4

Nope, there is no need for singlethread-ness here. As long as the flag
you used is doing the right thing I am OK with the patch.

Petko
[PATCH] net: pegasus: Remove deprecated create_singlethread_workqueue
The workqueue "pegasus_workqueue" queues a single work item per pegasus
instance and hence it doesn't require execution ordering. Hence,
alloc_workqueue has been used to replace the deprecated
create_singlethread_workqueue instance.

The WQ_MEM_RECLAIM flag has been set to ensure forward progress under
memory pressure since it's a network driver.

Since there are a fixed number of work items, an explicit concurrency
limit is unnecessary here.

Signed-off-by: Bhaktipriya Shridhar
---
 drivers/net/usb/pegasus.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/usb/pegasus.c b/drivers/net/usb/pegasus.c
index 9bbe0161..1434e5d 100644
--- a/drivers/net/usb/pegasus.c
+++ b/drivers/net/usb/pegasus.c
@@ -1129,7 +1129,8 @@ static int pegasus_probe(struct usb_interface *intf,
 		return -ENODEV;
 
 	if (pegasus_count == 0) {
-		pegasus_workqueue = create_singlethread_workqueue("pegasus");
+		pegasus_workqueue = alloc_workqueue("pegasus", WQ_MEM_RECLAIM,
+						    0);
 		if (!pegasus_workqueue)
 			return -ENOMEM;
 	}
-- 
2.1.4
[PATCH] bonding: Remove deprecated create_singlethread_workqueue
alloc_ordered_workqueue() with WQ_MEM_RECLAIM set replaces the deprecated
create_singlethread_workqueue(). This is the identity conversion.

The workqueue "wq" queues multiple work items viz &bond->mcast_work,
&bond->mii_work, &bond->arp_work, &bond->alb_work, &bond->ad_work and
&bond->slave_arr_work, which require strict execution ordering. Hence,
an ordered dedicated workqueue has been used.

Since it is a network driver, WQ_MEM_RECLAIM has been set to ensure
forward progress under memory pressure.

Signed-off-by: Bhaktipriya Shridhar
---
 drivers/net/bonding/bond_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 941ec99..ebaf1a9 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -4635,7 +4635,7 @@ static int bond_init(struct net_device *bond_dev)
 
 	netdev_dbg(bond_dev, "Begin bond_init\n");
 
-	bond->wq = create_singlethread_workqueue(bond_dev->name);
+	bond->wq = alloc_ordered_workqueue(bond_dev->name, WQ_MEM_RECLAIM);
 	if (!bond->wq)
 		return -ENOMEM;
 
-- 
2.1.4
Re: [PATCH net] sunrpc: fix UDP memory accounting
On 25 Aug 2016, at 12:42, Paolo Abeni wrote:

> The commit f9b2ee714c5c ("SUNRPC: Move UDP receive data path into a
> workqueue context"), as a side effect, moved the skb_free_datagram()
> call outside the scope of the related socket lock, but UDP sockets
> require such lock to be held for proper memory accounting.
> Fix it by replacing skb_free_datagram() with skb_free_datagram_locked().
>
> Fixes: f9b2ee714c5c ("SUNRPC: Move UDP receive data path into a workqueue context")
> Reported-and-tested-by: Jan Stancek
> Signed-off-by: Paolo Abeni

Thanks for finding this. A similar fix in 2009 for svcsock.c was done by
Eric Dumazet: 9d410c796067 ("net: fix sk_forward_alloc corruption")

skb_free_datagram_locked() is used for all xprt types in svcsock.c; should
we use it for the xs_local_transport as well in xprtsock.c?

Ben

> ---
>  net/sunrpc/xprtsock.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
> index 8ede3bc..bf16883 100644
> --- a/net/sunrpc/xprtsock.c
> +++ b/net/sunrpc/xprtsock.c
> @@ -1074,7 +1074,7 @@ static void xs_udp_data_receive(struct sock_xprt *transport)
>  		skb = skb_recv_datagram(sk, 0, 1, &err);
>  		if (skb != NULL) {
>  			xs_udp_data_read_skb(&transport->xprt, sk, skb);
> -			skb_free_datagram(sk, skb);
> +			skb_free_datagram_locked(sk, skb);
>  			continue;
>  		}
>  		if (!test_and_clear_bit(XPRT_SOCK_DATA_READY, &transport->sock_state))
> --
> 1.8.3.1
Re: [PATCH net-next] rxrpc: Remove use of skbs from AFS
Sorry about this, stgit mail is playing silly devils and not inserting the
patch numbers if there's a cover letter but only one patch :-/

David
[PATCH net-next] rxrpc: Remove use of skbs from AFS
Here's a single patch that removes the use of sk_buffs from fs/afs. From
this point on they'll be entirely retained within net/rxrpc and AFS just
asks AF_RXRPC for linear buffers of data.

This needs to be applied on top of the just-posted preparatory patch set.

This makes some future developments easier/possible:

 (1) Simpler rxrpc_call usage counting.

 (2) Earlier freeing of metadata sk_buffs.

 (3) Rx phase shortcutting on abort/error.

 (4) Encryption/decryption in the AFS fs contexts/threads and directly
     between sk_buffs and AFS buffers.

 (5) Synchronous waiting in reception for AFS.

Changes:

 (V2) Fixed afs_transfer_reply() whereby call->offset was incorrectly being
      added to the buffer pointer (it doesn't matter as long as the reply
      fits entirely inside a single packet).

      Removed an unused goto-label and an unused variable.

The patch can be found here also:

	http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=rxrpc-rewrite

Tagged thusly:

	git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
	rxrpc-rewrite-20160830-2v2

David
---
David Howells (1):
      rxrpc: Don't expose skbs to in-kernel users

 Documentation/networking/rxrpc.txt |  72 +++---
 fs/afs/cmservice.c                 | 142 ++--
 fs/afs/fsclient.c                  | 148 +---
 fs/afs/internal.h                  |  33 +--
 fs/afs/rxrpc.c                     | 439 +---
 fs/afs/vlclient.c                  |   7 -
 include/net/af_rxrpc.h             |  35 +--
 net/rxrpc/af_rxrpc.c               |  29 +-
 net/rxrpc/ar-internal.h            |  23 ++
 net/rxrpc/call_accept.c            |  13 +
 net/rxrpc/call_object.c            |   5
 net/rxrpc/conn_event.c             |   1
 net/rxrpc/input.c                  |  10 +
 net/rxrpc/output.c                 |   2
 net/rxrpc/recvmsg.c                | 191 +---
 net/rxrpc/skbuff.c                 |   1
 16 files changed, 565 insertions(+), 586 deletions(-)
Re: [PATCH v4] brcmfmac: add missing header dependencies
Baoyou Xie writes:

> On 29 August 2016 at 23:31, Rafał Miłecki wrote:
>> On 29 August 2016 at 14:39, Baoyou Xie wrote:
>>> We get 1 warning when build kernel with W=1:
>>> drivers/net/wireless/broadcom/brcm80211/brcmfmac/tracepoint.c:23:6:
>>> warning: no previous prototype for '__brcmf_err' [-Wmissing-prototypes]
>>
>> building?
>
> I'm not native English, but I think so.
>
>>> In fact, this function is declared in brcmfmac/debug.h, so this patch
>>> add missing header dependencies.
>>
>> adds
>>
>>> Signed-off-by: Baoyou Xie
>>> Acked-by: Arnd Bergmann
>>
>> Please don't resend patches just to add tags like that. This only
>> increases a noise and patchwork handles this just fine, see:
>> https://patchwork.kernel.org/patch/9303285/
>> https://patchwork.kernel.org/patch/9303285/mbox/
>
> Do I need to resend a patch that fixes two typos (build/add)? Or you modify
> them on your way?

I can fix those when I commit the patch.

-- 
Kalle Valo
Re: [RFC v2 00/10] Landlock LSM: Unprivileged sandboxing
On Thu, Aug 25, 2016 at 3:32 AM, Mickaël Salaün wrote:
> Hi,
>
> This series is a proof of concept to fill some missing part of seccomp as the
> ability to check syscall argument pointers or creating more dynamic security
> policies. The goal of this new stackable Linux Security Module (LSM) called
> Landlock is to allow any process, including unprivileged ones, to create
> powerful security sandboxes comparable to the Seatbelt/XNU Sandbox or the
> OpenBSD Pledge. This kind of sandbox help to mitigate the security impact of
> bugs or unexpected/malicious behaviors in userland applications.

Mickaël, will you be at KS and/or LPC?
Re: [PATCH next] tcp: fastopen: fix rcv_wup initialization for TFO server on SYN/data
Sorry, there's a typo in the subject line: that should be "net" rather
than "next" (I'm proposing "net" since it's a bug fix). Looks like
"git am" strips this mistake, but I'm happy to resubmit if it helps.

thanks,
neal
[PATCH next] tcp: fastopen: fix rcv_wup initialization for TFO server on SYN/data
Yuchung noticed that on the first TFO server data packet sent after
the (TFO) handshake, the server echoed the TCP timestamp value in the
SYN/data instead of the timestamp value in the final ACK of the
handshake. This problem did not happen on regular opens.

The tcp_replace_ts_recent() logic that decides whether to remember an
incoming TS value needs tp->rcv_wup to hold the latest receive
sequence number that we have ACKed (latest tp->rcv_nxt we have ACKed).
This commit fixes this issue by ensuring that a TFO server properly
updates tp->rcv_wup to match tp->rcv_nxt at the time it sends a
SYN/ACK for the SYN/data.

Reported-by: Yuchung Cheng
Signed-off-by: Neal Cardwell
Signed-off-by: Yuchung Cheng
Signed-off-by: Eric Dumazet
Signed-off-by: Soheil Hassas Yeganeh
Fixes: 168a8f58059a ("tcp: TCP Fast Open Server - main code path")
---
 net/ipv4/tcp_fastopen.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
index 54d9f9b..62a5751 100644
--- a/net/ipv4/tcp_fastopen.c
+++ b/net/ipv4/tcp_fastopen.c
@@ -226,6 +226,7 @@ static struct sock *tcp_fastopen_create_child(struct sock *sk,
 	tcp_fastopen_add_skb(child, skb);
 
 	tcp_rsk(req)->rcv_nxt = tp->rcv_nxt;
+	tp->rcv_wup = tp->rcv_nxt;
 
 	/* tcp_conn_request() is sending the SYNACK,
 	 * and queues the child into listener accept queue.
 	 */
-- 
2.8.0.rc3.226.g39d4020
[PATCH net-next] rxrpc: Don't expose skbs to in-kernel users
Don't expose skbs to in-kernel users, such as the AFS filesystem, but
instead provide a notification hook that indicates that a call needs
attention and another that indicates that there's a new call to be
collected.

This makes the following possibilities more achievable:

 (1) Call refcounting can be made simpler if skbs don't hold refs to calls.

 (2) skbs referring to non-data events will be able to be freed much sooner
     rather than being queued for AFS to pick up as rxrpc_kernel_recv_data
     will be able to consult the call state.

 (3) We can shortcut the receive phase when a call is remotely aborted
     because we don't have to go through all the packets to get to the one
     cancelling the operation.

 (4) It makes it easier to do encryption/decryption directly between AFS's
     buffers and sk_buffs.

 (5) Encryption/decryption can more easily be done in the AFS's thread
     contexts - usually that of the userspace process that issued a syscall
     - rather than in one of rxrpc's background threads on a workqueue.

 (6) AFS will be able to wait synchronously on a call inside AF_RXRPC.

To make this work, the following interface function has been added:

	int rxrpc_kernel_recv_data(
		struct socket *sock, struct rxrpc_call *call,
		void *buffer, size_t bufsize, size_t *_offset,
		bool want_more, u32 *_abort_code);

This is the recvmsg equivalent. It allows the caller to find out about the
state of a specific call and to transfer received data into a buffer
piecemeal.

afs_extract_data() and rxrpc_kernel_recv_data() now do all the extraction
logic between them. They don't wait synchronously yet because the socket
lock needs to be dealt with.

Five interface functions have been removed:

	rxrpc_kernel_is_data_last()
	rxrpc_kernel_get_abort_code()
	rxrpc_kernel_get_error_number()
	rxrpc_kernel_free_skb()
	rxrpc_kernel_data_consumed()

As a temporary hack, sk_buffs going to an in-kernel call are queued on the
rxrpc_call struct (->knlrecv_queue) rather than being handed over to the
in-kernel user.
To process the queue internally, a temporary function, temp_deliver_data()
has been added. This will be replaced with common code between the
rxrpc_recvmsg() path and the kernel_rxrpc_recv_data() path in a future
patch.

Signed-off-by: David Howells
---
 Documentation/networking/rxrpc.txt |  72 +++---
 fs/afs/cmservice.c                 | 142 ++--
 fs/afs/fsclient.c                  | 148 +---
 fs/afs/internal.h                  |  34 +--
 fs/afs/rxrpc.c                     | 439 +---
 fs/afs/vlclient.c                  |   7 -
 include/net/af_rxrpc.h             |  35 +--
 net/rxrpc/af_rxrpc.c               |  29 +-
 net/rxrpc/ar-internal.h            |  23 ++
 net/rxrpc/call_accept.c            |  13 +
 net/rxrpc/call_object.c            |   5
 net/rxrpc/conn_event.c             |   1
 net/rxrpc/input.c                  |  10 +
 net/rxrpc/output.c                 |   2
 net/rxrpc/recvmsg.c                | 195 +---
 net/rxrpc/skbuff.c                 |   1
 16 files changed, 570 insertions(+), 586 deletions(-)

diff --git a/Documentation/networking/rxrpc.txt b/Documentation/networking/rxrpc.txt
index cfc8cb91452f..1b63bbc6b94f 100644
--- a/Documentation/networking/rxrpc.txt
+++ b/Documentation/networking/rxrpc.txt
@@ -748,6 +748,37 @@ The kernel interface functions are as follows:
     The msg must not specify a destination address, control data or any
     flags other than MSG_MORE.  len is the total amount of data to transmit.
 
+ (*) Receive data from a call.
+
+	int rxrpc_kernel_recv_data(struct socket *sock,
+				   struct rxrpc_call *call,
+				   void *buf,
+				   size_t size,
+				   size_t *_offset,
+				   bool want_more,
+				   u32 *_abort)
+
+     This is used to receive data from either the reply part of a client call
+     or the request part of a service call.  buf and size specify how much
+     data is desired and where to store it.  *_offset is added on to buf and
+     subtracted from size internally; the amount copied into the buffer is
+     added to *_offset before returning.
+
+     want_more should be true if further data will be required after this is
+     satisfied and false if this is the last item of the receive phase.
+
+     There are three normal returns: 0 if the buffer was filled and want_more
+     was true; 1 if the buffer was filled, the last DATA packet has been
+     emptied and want_more was false; and -EAGAIN if the function needs to be
+     called again.
+
+     If the last DATA packet is processed but the buffer contains less than
+     the amount requested, EBADMSG is returned.  If
[PATCH net-next] rxrpc: Remove use of skbs from AFS
Here's a single patch that removes the use of sk_buffs from fs/afs.  From
this point on they'll be entirely retained within net/rxrpc and AFS just
asks AF_RXRPC for linear buffers of data.  This needs to be applied on top
of the just-posted preparatory patch set.

This makes some future developments easier/possible:

 (1) Simpler rxrpc_call usage counting.

 (2) Earlier freeing of metadata sk_buffs.

 (3) Rx phase shortcutting on abort/error.

 (4) Encryption/decryption in the AFS fs contexts/threads and directly
     between sk_buffs and AFS buffers.

 (5) Synchronous waiting in reception for AFS.

The patch can be found here also:

	http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=rxrpc-rewrite

Tagged thusly:

	git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
	rxrpc-rewrite-20160830-2

David
---
David Howells (1):
      rxrpc: Don't expose skbs to in-kernel users

 Documentation/networking/rxrpc.txt |  72 +++---
 fs/afs/cmservice.c                 | 142 ++--
 fs/afs/fsclient.c                  | 148 +---
 fs/afs/internal.h                  |  34 +--
 fs/afs/rxrpc.c                     | 439 +---
 fs/afs/vlclient.c                  |   7 -
 include/net/af_rxrpc.h             |  35 +--
 net/rxrpc/af_rxrpc.c               |  29 +-
 net/rxrpc/ar-internal.h            |  23 ++
 net/rxrpc/call_accept.c            |  13 +
 net/rxrpc/call_object.c            |   5
 net/rxrpc/conn_event.c             |   1
 net/rxrpc/input.c                  |  10 +
 net/rxrpc/output.c                 |   2
 net/rxrpc/recvmsg.c                | 195 +---
 net/rxrpc/skbuff.c                 |   1
 16 files changed, 570 insertions(+), 586 deletions(-)
Re: [PATCH 4/8] dmaengine: sa11x0: unexport sa11x0_dma_filter_fn and clean up
On Mon, Aug 29, 2016 at 12:26:20PM +0100, Russell King wrote:
> As we now have no users of sa11x0_dma_filter_fn() in the tree, we can
> unexport this function, and remove the now unused header file.

Acked-by: Vinod Koul

-- 
~Vinod
Re: [PATCH 1/8] dmaengine: sa11x0: add DMA filters
On Mon, Aug 29, 2016 at 12:26:04PM +0100, Russell King wrote:
> Add DMA filters for the sa11x0 DMA channels.  This will allow us to
> migrate away from directly using the DMA filter function in drivers.

Acked-by: Vinod Koul

-- 
~Vinod
[PATCH net] net: bridge: don't increment tx_dropped in br_do_proxy_arp
pskb_may_pull may fail due to various reasons (e.g. alloc failure), but the
skb isn't changed/dropped and processing continues so we shouldn't
increment tx_dropped.

CC: Kyeyoon Park
CC: Roopa Prabhu
CC: Stephen Hemminger
CC: bri...@lists.linux-foundation.org
Fixes: 958501163ddd ("bridge: Add support for IEEE 802.11 Proxy ARP")
Signed-off-by: Nikolay Aleksandrov
---
 net/bridge/br_input.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index 8e486203d133..abe11f085479 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -80,13 +80,10 @@ static void br_do_proxy_arp(struct sk_buff *skb, struct net_bridge *br,
 
 	BR_INPUT_SKB_CB(skb)->proxyarp_replied = false;
 
-	if (dev->flags & IFF_NOARP)
+	if ((dev->flags & IFF_NOARP) ||
+	    !pskb_may_pull(skb, arp_hdr_len(dev)))
 		return;
 
-	if (!pskb_may_pull(skb, arp_hdr_len(dev))) {
-		dev->stats.tx_dropped++;
-		return;
-	}
 
 	parp = arp_hdr(skb);
 
 	if (parp->ar_pro != htons(ETH_P_IP) ||
-- 
2.1.4
Re: [PATCH net] tg3: Fix for disallow tx coalescing time to be 0
On Tue, Aug 30, 2016 at 7:38 AM, Ivan Vecera wrote:
> The recent commit 087d7a8c disallows to set Rx coalescing time to be 0
> as this stops generating interrupts for the incoming packets. I found
> the zero Tx coalescing time stops generating interrupts similarly for
> outgoing packets and fires Tx watchdog later. To avoid this, don't allow
> to set Tx coalescing time to 0.
>
> Cc: satish.baddipad...@broadcom.com
> Cc: siva.kal...@broadcom.com
> Cc: michael.c...@broadcom.com
> Signed-off-by: Ivan Vecera
> ---
>  drivers/net/ethernet/broadcom/tg3.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
> index 6592612..07e3beb 100644
> --- a/drivers/net/ethernet/broadcom/tg3.c
> +++ b/drivers/net/ethernet/broadcom/tg3.c
> @@ -14012,6 +14012,7 @@ static int tg3_set_coalesce(struct net_device *dev, struct ethtool_coalesce *ec)
>  	if ((ec->rx_coalesce_usecs > MAX_RXCOL_TICKS) ||
>  	    (!ec->rx_coalesce_usecs) ||
>  	    (ec->tx_coalesce_usecs > MAX_TXCOL_TICKS) ||
> +	    (!ec->tx_coalesce_usecs) ||
>  	    (ec->rx_max_coalesced_frames > MAX_RXMAX_FRAMES) ||
>  	    (ec->tx_max_coalesced_frames > MAX_TXMAX_FRAMES) ||
>  	    (ec->rx_coalesce_usecs_irq > max_rxcoal_tick_int) ||

As Rick pointed out last time, we can remove this check which follows the
block of code above:

	/* No tx interrupts will be generated if both are zero */
	if ((ec->tx_coalesce_usecs == 0) &&
	    (ec->tx_max_coalesced_frames == 0))
		return -EINVAL;
Re: [PATCH 0/4] SA11x0 Clocks and removal of Neponset SMC91x hack
On Tue, 30 Aug 2016, Russell King - ARM Linux wrote:

> This mini-series (which follows several other series on which it
> depends) gets rid of the Assabet/Neponset hack in the smc91x driver.
>
> In order to do that, we need to get several pieces in place first:
> * gpiolib support throughout SA11x0/Assabet/Neponset so that we can
>   represent control signals through gpiolib
> * CCF support, so we can re-use the code in drivers/clk to implement
>   the external crystal oscillator attached to the SMC91x.  This
>   external crystal oscillator is enabled via a control signal.
>
> This series:
> - performs the SA11x0 CCF conversion
> - adds an optional clock to SMC91x to cater for an external crystal
>   oscillator
> - switches the Neponset code to provide a 'struct clk' representing
>   this oscillator
> - removes the SMC91x hack to assert the enable signal
>
> This results in the platform specific includes being removed from the
> SMC91x driver.
>
> Please ack these changes; due to the dependencies, I wish to merge
> them through my tree.  Thanks.

Looks nice to me.

Acked-by: Nicolas Pitre

>  arch/arm/Kconfig                   |   1 +
>  arch/arm/mach-sa1100/clock.c       | 191 +
>  arch/arm/mach-sa1100/neponset.c    |  42 
>  drivers/net/ethernet/smsc/smc91x.c |  47 ++---
>  drivers/net/ethernet/smsc/smc91x.h |   1 +
>  5 files changed, 166 insertions(+), 116 deletions(-)
>
> -- 
> RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
> FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
> according to speedtest.net.
[PATCH net-next 6/8] rxrpc: Provide a way for AFS to ask for the peer address of a call
Provide a function so that kernel users, such as AFS, can ask for the peer
address of a call:

	void rxrpc_kernel_get_peer(struct rxrpc_call *call,
				   struct sockaddr_rxrpc *_srx);

In the future the kernel service won't get sk_buffs to look inside.
Further, this allows us to hide any canonicalisation inside AF_RXRPC for
when IPv6 support is added.

Also propagate this through to afs_find_server() and issue a warning if we
can't handle the address family yet.

Signed-off-by: David Howells
---
 Documentation/networking/rxrpc.txt |  7 +++
 fs/afs/cmservice.c                 | 20 +++-
 fs/afs/internal.h                  |  5 -
 fs/afs/rxrpc.c                     |  2 +-
 fs/afs/server.c                    | 11 ---
 include/net/af_rxrpc.h             |  2 ++
 net/rxrpc/peer_object.c            | 15 +++
 7 files changed, 48 insertions(+), 14 deletions(-)

diff --git a/Documentation/networking/rxrpc.txt b/Documentation/networking/rxrpc.txt
index 70c926ae212d..dfe0b008df74 100644
--- a/Documentation/networking/rxrpc.txt
+++ b/Documentation/networking/rxrpc.txt
@@ -868,6 +868,13 @@ The kernel interface functions are as follows:
     This is used to allocate a null RxRPC key that can be used to indicate
     anonymous security for a particular domain.
 
+ (*) Get the peer address of a call.
+
+	void rxrpc_kernel_get_peer(struct socket *sock, struct rxrpc_call *call,
+				   struct sockaddr_rxrpc *_srx);
+
+     This is used to find the remote peer address of a call.
+
 
 ===
 CONFIGURABLE PARAMETERS

diff --git a/fs/afs/cmservice.c b/fs/afs/cmservice.c
index ca32d891bbc3..77ee481059ac 100644
--- a/fs/afs/cmservice.c
+++ b/fs/afs/cmservice.c
@@ -167,9 +167,9 @@ static void SRXAFSCB_CallBack(struct work_struct *work)
 static int afs_deliver_cb_callback(struct afs_call *call, struct sk_buff *skb,
 				   bool last)
 {
+	struct sockaddr_rxrpc srx;
 	struct afs_callback *cb;
 	struct afs_server *server;
-	struct in_addr addr;
 	__be32 *bp;
 	u32 tmp;
 	int ret, loop;
@@ -178,6 +178,7 @@ static int afs_deliver_cb_callback(struct afs_call *call, struct sk_buff *skb,
 
 	switch (call->unmarshall) {
 	case 0:
+		rxrpc_kernel_get_peer(afs_socket, call->rxcall, &srx);
 		call->offset = 0;
 		call->unmarshall++;
 
@@ -282,8 +283,7 @@ static int afs_deliver_cb_callback(struct afs_call *call, struct sk_buff *skb,
 
 	/* we'll need the file server record as that tells us which set of
 	 * vnodes to operate upon */
-	memcpy(&addr, &ip_hdr(skb)->saddr, 4);
-	server = afs_find_server(&addr);
+	server = afs_find_server(&srx);
 	if (!server)
 		return -ENOTCONN;
 	call->server = server;
@@ -314,12 +314,14 @@ static int afs_deliver_cb_init_call_back_state(struct afs_call *call,
 					       struct sk_buff *skb,
 					       bool last)
 {
+	struct sockaddr_rxrpc srx;
 	struct afs_server *server;
-	struct in_addr addr;
 	int ret;
 
 	_enter(",{%u},%d", skb->len, last);
 
+	rxrpc_kernel_get_peer(afs_socket, call->rxcall, &srx);
+
 	ret = afs_data_complete(call, skb, last);
 	if (ret < 0)
 		return ret;
@@ -329,8 +331,7 @@ static int afs_deliver_cb_init_call_back_state(struct afs_call *call,
 
 	/* we'll need the file server record as that tells us which set of
 	 * vnodes to operate upon */
-	memcpy(&addr, &ip_hdr(skb)->saddr, 4);
-	server = afs_find_server(&addr);
+	server = afs_find_server(&srx);
 	if (!server)
 		return -ENOTCONN;
 	call->server = server;
@@ -347,11 +348,13 @@ static int afs_deliver_cb_init_call_back_state3(struct afs_call *call,
 						struct sk_buff *skb,
 						bool last)
 {
+	struct sockaddr_rxrpc srx;
 	struct afs_server *server;
-	struct in_addr addr;
 
 	_enter(",{%u},%d", skb->len, last);
 
+	rxrpc_kernel_get_peer(afs_socket, call->rxcall, &srx);
+
 	/* There are some arguments that we ignore */
 	afs_data_consumed(call, skb);
 	if (!last)
@@ -362,8 +365,7 @@ static int afs_deliver_cb_init_call_back_state3(struct afs_call *call,
 
 	/* we'll need the file server record as that tells us which set of
 	 * vnodes to operate upon */
-	memcpy(&addr, &ip_hdr(skb)->saddr, 4);
-	server = afs_find_server(&addr);
+	server = afs_find_server(&srx);
 	if (!server)
 		return -ENOTCONN;
 	call->server = server;

diff --git a/fs/afs/internal.h b/fs/afs/internal.h
index df976b2a7f40..d97552de9c59 100644
--- a/fs/afs/internal.h
+++ b/fs/afs/internal.h
@@ -20,6 +20,7 @@
 #include
 #include
[PATCH net-next 8/8] rxrpc: Pass struct socket * to more rxrpc kernel interface functions
Pass struct socket * to more rxrpc kernel interface functions.  They should
be starting from this rather than the socket pointer in the rxrpc_call
struct if they need to access the socket.

I have left:

	rxrpc_kernel_is_data_last()
	rxrpc_kernel_get_abort_code()
	rxrpc_kernel_get_error_number()
	rxrpc_kernel_free_skb()
	rxrpc_kernel_data_consumed()

unmodified as they're all about to be removed (and, in any case, don't
touch the socket).

Signed-off-by: David Howells
---
 Documentation/networking/rxrpc.txt | 11 ---
 fs/afs/rxrpc.c                     | 26 +++---
 include/net/af_rxrpc.h             | 10 +++---
 net/rxrpc/af_rxrpc.c               |  5 +++--
 net/rxrpc/output.c                 | 20 +++-
 5 files changed, 44 insertions(+), 28 deletions(-)

diff --git a/Documentation/networking/rxrpc.txt b/Documentation/networking/rxrpc.txt
index dfe0b008df74..cfc8cb91452f 100644
--- a/Documentation/networking/rxrpc.txt
+++ b/Documentation/networking/rxrpc.txt
@@ -725,7 +725,8 @@ The kernel interface functions are as follows:
 
 (*) End a client call.
 
-	void rxrpc_kernel_end_call(struct rxrpc_call *call);
+	void rxrpc_kernel_end_call(struct socket *sock,
+				   struct rxrpc_call *call);
 
     This is used to end a previously begun call.  The user_call_ID is
     expunged from AF_RXRPC's knowledge and will not be seen again in
     association with
@@ -733,7 +734,9 @@ The kernel interface functions are as follows:
 
 (*) Send data through a call.
 
-	int rxrpc_kernel_send_data(struct rxrpc_call *call, struct msghdr *msg,
+	int rxrpc_kernel_send_data(struct socket *sock,
+				   struct rxrpc_call *call,
+				   struct msghdr *msg,
 				   size_t len);
 
     This is used to supply either the request part of a client call or the
@@ -747,7 +750,9 @@ The kernel interface functions are as follows:
 
 (*) Abort a call.
 
-	void rxrpc_kernel_abort_call(struct rxrpc_call *call, u32 abort_code);
+	void rxrpc_kernel_abort_call(struct socket *sock,
+				     struct rxrpc_call *call,
+				     u32 abort_code);
 
     This is used to abort a call if it's still in an abortable state.  The
     abort code specified will be placed in the ABORT message sent.

diff --git a/fs/afs/rxrpc.c b/fs/afs/rxrpc.c
index a1916750e2f9..7b0d18900f50 100644
--- a/fs/afs/rxrpc.c
+++ b/fs/afs/rxrpc.c
@@ -207,7 +207,7 @@ static void afs_free_call(struct afs_call *call)
 static void afs_end_call_nofree(struct afs_call *call)
 {
 	if (call->rxcall) {
-		rxrpc_kernel_end_call(call->rxcall);
+		rxrpc_kernel_end_call(afs_socket, call->rxcall);
 		call->rxcall = NULL;
 	}
 	if (call->type->destructor)
@@ -325,8 +325,8 @@ static int afs_send_pages(struct afs_call *call, struct msghdr *msg,
 		 * returns from sending the request */
 		if (first + loop >= last)
 			call->state = AFS_CALL_AWAIT_REPLY;
-		ret = rxrpc_kernel_send_data(call->rxcall, msg,
-					     to - offset);
+		ret = rxrpc_kernel_send_data(afs_socket, call->rxcall,
+					     msg, to - offset);
 		kunmap(pages[loop]);
 		if (ret < 0)
 			break;
@@ -406,7 +406,8 @@ int afs_make_call(struct in_addr *addr, struct afs_call *call, gfp_t gfp,
 	 * request */
 	if (!call->send_pages)
 		call->state = AFS_CALL_AWAIT_REPLY;
-	ret = rxrpc_kernel_send_data(rxcall, &msg, call->request_size);
+	ret = rxrpc_kernel_send_data(afs_socket, rxcall,
+				     &msg, call->request_size);
 	if (ret < 0)
 		goto error_do_abort;
 
@@ -421,7 +422,7 @@ int afs_make_call(struct in_addr *addr, struct afs_call *call, gfp_t gfp,
 	return wait_mode->wait(call);
 
 error_do_abort:
-	rxrpc_kernel_abort_call(rxcall, RX_USER_ABORT);
+	rxrpc_kernel_abort_call(afs_socket, rxcall, RX_USER_ABORT);
 	while ((skb = skb_dequeue(&call->rx_queue)))
 		afs_free_skb(skb);
 error_kill_call:
@@ -509,7 +510,8 @@ static void afs_deliver_to_call(struct afs_call *call)
 		if (call->state != AFS_CALL_AWAIT_REPLY)
 			abort_code = RXGEN_SS_UNMARSHAL;
 	do_abort:
-		rxrpc_kernel_abort_call(call->rxcall,
+		rxrpc_kernel_abort_call(afs_socket,
+					call->rxcall,