Re: [PATCH net-next 0/4] rxrpc: Support IPv6
From: David Howells
Date: Tue, 13 Sep 2016 23:41:31 +0100

> Here is a set of patches that add IPv6 support. They need to be applied on
> top of the just-posted miscellaneous fix patches. They are:
>
> (1) Make autobinding of an unconnected socket work when sendmsg() is
>     called to initiate a client call.
>
> (2) Don't specify the protocol when creating the client socket, but rather
>     take the default instead.
>
> (3) Use rxrpc_extract_addr_from_skb() in a couple of places that were
>     doing the same thing manually. This allows the IPv6 address
>     extraction to be done in fewer places.
>
> (4) Add IPv6 support. With this, calls can be made to IPv6 servers from
>     userspace AF_RXRPC programs; AFS, however, can't use IPv6 yet as the
>     RPC calls need to be upgradeable.
 ...
> Tagged thusly:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
>   rxrpc-rewrite-20160913-2

Looks good, pulled, thanks.
Re: [PATCH net-next 00/10] rxrpc: Miscellaneous fixes
From: David Howells
Date: Tue, 13 Sep 2016 23:20:56 +0100

> Here's a set of miscellaneous fix patches. There are a couple of points of
> note:
>
> (1) There is one non-fix patch that adjusts the call ref tracking
>     tracepoint to make kernel API-held refs on calls more obvious. This
>     is a prerequisite for the patch that fixes prealloc refcounting.
>
> (2) The final patch alters how jumbo packets that partially exceed the
>     receive window are handled. Previously, space was being left in the
>     Rx buffer for them, but this significantly hurts performance as the Rx
>     window can't be increased to match the OpenAFS Tx window size.
>
>     Instead, the excess subpackets are discarded and an EXCEEDS_WINDOW ACK
>     is generated for the first. To avoid the problem of someone trying to
>     run the kernel out of space by feeding the kernel a series of
>     overlapping maximal jumbo packets, we stop allowing jumbo packets on a
>     call if we encounter more than three jumbo packets with duplicate or
>     excessive subpackets.
 ...
> Tagged thusly:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
>   rxrpc-rewrite-20160913-1

Pulled, thanks David.
Re: [PATCH] MAINTAINERS: Remove myself from PA Semi entries
From: Michael Ellerman
Date: Wed, 14 Sep 2016 18:57:55 +1000

> Olof Johansson writes:
>
>> Jean, Dave,
>>
>> I was hoping to have Michael merge this since the bulk of the platform is
>> under him, cc:ing you mostly to be aware that I am orphaning a driver in
>> your subsystems.
>
> I'll merge it unless I hear otherwise from Dave.

Feel free to merge this. Thanks.
Re: pull-request: mac80211 2016-09-13
From: Johannes Berg
Date: Tue, 13 Sep 2016 22:03:23 +0200

> We found a few more issues, I'm sending you small fixes here. The diffstat
> would be even shorter, but one of Felix's patches has to move about 30 lines
> of code, which makes it seem much bigger than it really is.
>
> Let me know if there's any problem.

Pulled, thanks Johannes.
Re: [PATCH net-next 3/4] samples/bpf: extend test_tunnel_bpf.sh with IPIP test
Hi Alexei,

Is there a corresponding patch for iproute2? I tested this patch but it fails at:

    + ip link add dev ipip11 type ipip external

because my ip command does not support "external".

Thanks
William

On Thu, Sep 15, 2016 at 1:00 PM, Alexei Starovoitov wrote:
> extend existing tests for vxlan, geneve, gre to include IPIP tunnel.
> It tests both traditional tunnel configuration and
> dynamic via bpf helpers.
>
> Signed-off-by: Alexei Starovoitov
> ---
>  samples/bpf/tcbpf2_kern.c      | 58 ++
>  samples/bpf/test_tunnel_bpf.sh | 56 ++--
>  2 files changed, 106 insertions(+), 8 deletions(-)
>
> diff --git a/samples/bpf/tcbpf2_kern.c b/samples/bpf/tcbpf2_kern.c
> index 7a15289da6cc..c1917d968fb4 100644
> --- a/samples/bpf/tcbpf2_kern.c
> +++ b/samples/bpf/tcbpf2_kern.c
> @@ -1,4 +1,5 @@
>  /* Copyright (c) 2016 VMware
> + * Copyright (c) 2016 Facebook
>   *
>   * This program is free software; you can redistribute it and/or
>   * modify it under the terms of version 2 of the GNU General Public
> @@ -188,4 +189,61 @@ int _geneve_get_tunnel(struct __sk_buff *skb)
>  	return TC_ACT_OK;
>  }
>
> +SEC("ipip_set_tunnel")
> +int _ipip_set_tunnel(struct __sk_buff *skb)
> +{
> +	struct bpf_tunnel_key key = {};
> +	void *data = (void *)(long)skb->data;
> +	struct iphdr *iph = data;
> +	struct tcphdr *tcp = data + sizeof(*iph);
> +	void *data_end = (void *)(long)skb->data_end;
> +	int ret;
> +
> +	/* single length check */
> +	if (data + sizeof(*iph) + sizeof(*tcp) > data_end) {
> +		ERROR(1);
> +		return TC_ACT_SHOT;
> +	}
> +
> +	key.tunnel_ttl = 64;
> +	if (iph->protocol == IPPROTO_ICMP) {
> +		key.remote_ipv4 = 0xac100164; /* 172.16.1.100 */
> +	} else {
> +		if (iph->protocol != IPPROTO_TCP || iph->ihl != 5)
> +			return TC_ACT_SHOT;
> +
> +		if (tcp->dest == htons(5200))
> +			key.remote_ipv4 = 0xac100164; /* 172.16.1.100 */
> +		else if (tcp->dest == htons(5201))
> +			key.remote_ipv4 = 0xac100165; /* 172.16.1.101 */
> +		else
> +			return TC_ACT_SHOT;
> +	}
> +
> +	ret = bpf_skb_set_tunnel_key(skb, &key, sizeof(key), 0);
> +	if (ret < 0) {
> +		ERROR(ret);
> +		return TC_ACT_SHOT;
> +	}
> +
> +	return TC_ACT_OK;
> +}
> +
> +SEC("ipip_get_tunnel")
> +int _ipip_get_tunnel(struct __sk_buff *skb)
> +{
> +	int ret;
> +	struct bpf_tunnel_key key;
> +	char fmt[] = "remote ip 0x%x\n";
> +
> +	ret = bpf_skb_get_tunnel_key(skb, &key, sizeof(key), 0);
> +	if (ret < 0) {
> +		ERROR(ret);
> +		return TC_ACT_SHOT;
> +	}
> +
> +	bpf_trace_printk(fmt, sizeof(fmt), key.remote_ipv4);
> +	return TC_ACT_OK;
> +}
> +
>  char _license[] SEC("license") = "GPL";
> diff --git a/samples/bpf/test_tunnel_bpf.sh b/samples/bpf/test_tunnel_bpf.sh
> index 4956589a83ae..1ff634f187b7 100755
> --- a/samples/bpf/test_tunnel_bpf.sh
> +++ b/samples/bpf/test_tunnel_bpf.sh
> @@ -9,15 +9,13 @@
>  # local 172.16.1.200 remote 172.16.1.100
>  # veth1 IP: 172.16.1.200, tunnel dev 11
>
> -set -e
> -
>  function config_device {
>  	ip netns add at_ns0
>  	ip link add veth0 type veth peer name veth1
>  	ip link set veth0 netns at_ns0
>  	ip netns exec at_ns0 ip addr add 172.16.1.100/24 dev veth0
>  	ip netns exec at_ns0 ip link set dev veth0 up
> -	ip link set dev veth1 up
> +	ip link set dev veth1 up mtu 1500
>  	ip addr add dev veth1 172.16.1.200/24
>  }
>
> @@ -67,6 +65,19 @@ function add_geneve_tunnel {
>  	ip addr add dev $DEV 10.1.1.200/24
>  }
>
> +function add_ipip_tunnel {
> +	# in namespace
> +	ip netns exec at_ns0 \
> +		ip link add dev $DEV_NS type $TYPE local 172.16.1.100 remote 172.16.1.200
> +	ip netns exec at_ns0 ip link set dev $DEV_NS up
> +	ip netns exec at_ns0 ip addr add dev $DEV_NS 10.1.1.100/24
> +
> +	# out of namespace
> +	ip link add dev $DEV type $TYPE external
> +	ip link set dev $DEV up
> +	ip addr add dev $DEV 10.1.1.200/24
> +}
> +
>  function attach_bpf {
>  	DEV=$1
>  	SET_TUNNEL=$2
> @@ -85,6 +96,7 @@ function test_gre {
>  	attach_bpf $DEV gre_set_tunnel gre_get_tunnel
>  	ping -c 1 10.1.1.100
>  	ip netns exec at_ns0 ping -c 1 10.1.1.200
> +	cleanup
>  }
>
>  function test_vxlan {
> @@ -96,6 +108,7 @@ function test_vxlan {
>  	attach_bpf $DEV vxlan_set_tunnel vxlan_get_tunnel
>  	ping -c 1 10.1.1.100
>  	ip netns exec at_ns0 ping -c 1 10.1.1.200
> +	cleanup
>  }
>
>  function test_geneve {
> @@ -107,21 +120,48 @@ function test_geneve {
>  	attach_bpf $DEV
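The "single length check" comment in _ipip_set_tunnel above refers to a common BPF idiom: one bounds comparison proves that both the IP and the TCP header lie inside the packet, so the later field accesses need no further checks. A minimal userspace sketch of the same idea (the struct sizes here are illustrative stand-ins, not the real iphdr/tcphdr):

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for the packet headers; 20 bytes each, like the minimal
 * IPv4 and TCP headers. */
struct iphdr_s  { uint8_t bytes[20]; };
struct tcphdr_s { uint8_t bytes[20]; };

/* Single length check: one comparison proves that both headers lie
 * entirely inside [data, data_end), so subsequent field reads of either
 * header are known to be in bounds. */
static int headers_in_bounds(const uint8_t *data, const uint8_t *data_end)
{
	return data + sizeof(struct iphdr_s) + sizeof(struct tcphdr_s)
		<= data_end;
}
```

In the BPF program the check is written in the failing direction (`> data_end` drops the packet), which is the form the verifier recognizes.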
Re: [PATCH] mwifiex: fix null pointer deference when adapter is null
kbuild test robot <l...@intel.com> writes:

> url: https://github.com/0day-ci/linux/commits/Colin-King/mwifiex-fix-null-pointer-deference-when-adapter-is-null/20160915-231625
> base: https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next.git master
> config: x86_64-randconfig-x013-201637 (attached as .config)
> compiler: gcc-6 (Debian 6.1.1-9) 6.1.1 20160705
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=x86_64
>
> All warnings (new ones prefixed by >>):
>
>    drivers/net/wireless/marvell/mwifiex/main.c: In function 'mwifiex_shutdown_sw':
> >> drivers/net/wireless/marvell/mwifiex/main.c:1433:1: warning: label 'exit_remove' defined but not used [-Wunused-label]
>     exit_remove:
>     ^~~

Looks like a valid warning to me, so please resend.

-- 
Kalle Valo
Re: [PATCHv3 next 3/3] ipvlan: Introduce l3s mode
On 9/15/16 6:14 PM, Mahesh Bandewar wrote:
> diff --git a/drivers/net/ipvlan/ipvlan.h b/drivers/net/ipvlan/ipvlan.h
> index 695a5dc9ace3..371f4548c42d 100644
> --- a/drivers/net/ipvlan/ipvlan.h
> +++ b/drivers/net/ipvlan/ipvlan.h
> @@ -23,11 +23,13 @@
>  #include
>  #include
>  #include
> +#include
>  #include
>  #include
>  #include
>  #include
>  #include
> +#include
>
>  #define IPVLAN_DRV	"ipvlan"
>  #define IPV_DRV_VER	"0.1"
> @@ -96,6 +98,7 @@ struct ipvl_port {
>  	struct work_struct	wq;
>  	struct sk_buff_head	backlog;
>  	int			count;
> +	bool			hooks_attached;

With a refcnt on the hook registration you don't need this bool, and removing it simplifies the set_mode logic.

> diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
> index 18b4e8c7f68a..aca690f41559 100644
> --- a/drivers/net/ipvlan/ipvlan_main.c
> +++ b/drivers/net/ipvlan/ipvlan_main.c
> +static void ipvlan_unregister_nf_hook(void)
> +{
> +	BUG_ON(!ipvl_nf_hook_refcnt);

Not a panic()-worthy issue; just a pr_warn or WARN_ON_ONCE should be ok.
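The reviewer's refcount suggestion can be sketched in plain userspace C. This is only an illustration of the shape of the fix, not the driver's actual code; the variable name ipvl_nf_hook_refcnt follows the patch, while l3s_register/l3s_unregister and the hooks_attached flag are hypothetical stand-ins for the real nf hook registration calls:

```c
#include <assert.h>

/* Global refcount on the hook registration replaces the per-port bool:
 * hooks are attached only on the 0 -> 1 transition and detached on the
 * 1 -> 0 transition, so mode changes need no per-port bookkeeping. */
static int ipvl_nf_hook_refcnt;
static int hooks_attached;	/* stand-in for the real hook state */

static void l3s_register(void)
{
	if (ipvl_nf_hook_refcnt++ == 0)
		hooks_attached = 1;	/* real code: register the nf hooks */
}

static void l3s_unregister(void)
{
	/* reviewer's point: warn, don't BUG_ON, if this underflows */
	assert(ipvl_nf_hook_refcnt > 0);
	if (--ipvl_nf_hook_refcnt == 0)
		hooks_attached = 0;	/* real code: unregister the nf hooks */
}
```

With this shape, two ports in l3s mode share one registration, and tearing down either port leaves the hooks in place for the other.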
Re: [PATCH V3 1/3] Documentation: devicetree: add qca8k binding
On 09/15/2016 07:26 AM, John Crispin wrote:
> Add device-tree binding for ar8xxx switch families.
>
> Cc: devicet...@vger.kernel.org
> Signed-off-by: John Crispin

Reviewed-by: Florian Fainelli

-- 
Florian
Re: Modification to skb->queue_mapping affecting performance
2016-09-14 10:46 GMT-07:00 Michael Ma:
> 2016-09-13 22:22 GMT-07:00 Eric Dumazet:
>> On Tue, 2016-09-13 at 22:13 -0700, Michael Ma wrote:
>>
>>> I don't intend to install multiple qdiscs - the only reason I'm doing
>>> this now is to leverage MQ to work around the lock contention, and
>>> based on the profile this all worked. However, to simplify the HTB
>>> setup I wanted to use TXQs to partition HTB classes so that an HTB
>>> class only belongs to one TXQ, which also requires mapping skbs to
>>> TXQs using some rule (here I'm using priority, but I assume it's
>>> straightforward to use other information such as classid). The problem
>>> I found is that when using priority to infer the TXQ, so that
>>> queue_mapping is changed, bandwidth is affected significantly - the
>>> only thing I can guess is that due to the queue switch there are more
>>> cache misses, assuming processor cores have a static mapping to all
>>> the queues. Any suggestion on what to do next for the investigation?
>>>
>>> I would also guess this is a common problem for anyone who wants to
>>> use MQ+IFB to work around the qdisc lock contention on the receiver
>>> side with a classful qdisc on IFB, but I haven't found a similar
>>> thread here...
>>
>> But why are you changing the queue?
>>
>> The NIC already does the proper RSS thing, meaning all packets of one
>> flow should land on one RX queue. No need to classify yourself and
>> risk lock contention.
>>
>> I use IFB + MQ + netem every day, and it scales to 10 Mpps with no
>> problem.
>>
>> Do you really need to rate limit flows? It's not clear what your goals
>> are - why, for example, you use HTB to begin with.
>>
> Yes. My goal is to set different min/max bandwidth limits for different
> processes, so we started with HTB. However, with HTB the qdisc root
> lock contention caused some unintended correlation between flows in
> different classes. For example, if some flows belonging to one class
> have a large number of small packets, other flows in a different class
> will get their effective bandwidth reduced because they'll wait longer
> for the root lock. Using MQ this can be avoided because I'll just put
> flows belonging to one class on its dedicated TXQ. Classes within one
> HTB on a TXQ will still have the lock contention problem, but classes
> in different HTBs will use different root locks, so that contention
> doesn't exist.
>
> This also means that I'll need to classify packets to different TXQ/HTB
> instances based on some skb metadata (essentially similar to what
> mqprio is doing). So the TXQ might need to be switched to achieve this.

My current theory is that tasklets in IFB might be scheduled on the same
CPU core if the RXQ happens to be the same for two different flows. When
queue_mapping is modified and multiple flows are concentrated onto the
same IFB TXQ because they need to be controlled by the same HTB, they'll
have to use the same tasklet because of the way IFB is implemented. So if
other flows belonging to a different TXQ/tasklet happen to be scheduled
on the same core, that core can be overloaded and becomes the bottleneck.
Without modifying queue_mapping, the chance of this contention is much
lower.

This is speculation based on the increased si time in the softirqd
process. I'll try to affinitize each tasklet with a CPU core to verify
whether this is the problem. I also noticed a past proposal to schedule
the tasklet on a dedicated core, which was not committed
(https://patchwork.ozlabs.org/patch/38486/). I'll try something similar
to verify this theory.
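The metadata-to-TXQ classification being described (mqprio-style, one HTB root per TXQ) boils down to a table lookup keyed on skb metadata. The sketch below is purely illustrative of that mapping, not the poster's actual code; the prio2txq table and select_txq are hypothetical:

```c
#include <assert.h>

/* Map an skb priority to the TXQ whose HTB instance owns that class,
 * mqprio-style.  Each entry pins one class's traffic to one TXQ, so
 * each TXQ's HTB root lock is contended only by that class's flows. */
static unsigned int select_txq(unsigned int skb_priority)
{
	static const unsigned int prio2txq[] = { 0, 1, 2, 3 };
	unsigned int n = sizeof(prio2txq) / sizeof(prio2txq[0]);

	/* priorities beyond the table fall back to TXQ 0 */
	return skb_priority < n ? prio2txq[skb_priority] : 0;
}
```

The trade-off discussed in the thread is that this deliberately overrides the NIC's RSS-derived queue choice, so a flow may be serviced on a different core than the one that received it.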
Re: [net-next PATCH 00/11] iw_cxgb4,cxgbit: remove duplicate code
From: Varun Prakash
Date: Tue, 13 Sep 2016 21:23:55 +0530

> This patch series removes duplicate code from
> iw_cxgb4 and cxgbit by adding common function
> definitions in libcxgb.
>
> Please review.

Series applied, thanks.
Re: [PATCH net-next] openvswitch: avoid deferred execution of recirc actions
From: Lance Richardson
Date: Tue, 13 Sep 2016 10:08:54 -0400

> The ovs kernel data path currently defers the execution of all
> recirc actions until stack utilization is at a minimum.
> This is too limiting for some packet forwarding scenarios due to
> the small size of the deferred action FIFO (10 entries). For
> example, broadcast traffic sent out more than 10 ports with
> recirculation results in packet drops when the deferred action
> FIFO becomes full, as reported here:
>
>     http://openvswitch.org/pipermail/dev/2016-March/067672.html
>
> Since the current recursion depth is available (it is already tracked
> by the exec_actions_level pcpu variable), we can use it to determine
> whether to execute recirculation actions immediately (safe when
> recursion depth is low) or defer execution until more stack space is
> available.
>
> With this change, the deferred action fifo size becomes a non-issue
> for currently failing scenarios because it is no longer used when
> there are three or fewer recursions through ovs_execute_actions().
>
> Suggested-by: Pravin Shelar
> Signed-off-by: Lance Richardson

Applied.
Re: [PATCH net-next V2 0/3] net/sched: cls_flower: Add ports masks
From: Or Gerlitz
Date: Thu, 15 Sep 2016 15:28:21 +0300

> This series adds the ability to specify tcp/udp port masks
> for TC/flower filter matches.
>
> I also removed an unused field from the flower keys struct
> and clarified the format of the recently added vlan attributes.

Series applied.
Re: [PATCH net-next v2 2/5] cxgb4: add common api support for configuring filters
From: Rahul Lakkireddy
Date: Tue, 13 Sep 2016 17:12:26 +0530

> +/* Fill up default masks for set match fields. */
> +static void fill_default_mask(struct ch_filter_specification *fs)
> +{
> +	unsigned int i;
> +	unsigned int lip = 0, lip_mask = 0;
> +	unsigned int fip = 0, fip_mask = 0;

Always order local variable declarations from longest to shortest line.
Please audit your entire submission for this issue.

Thanks.
Re: [PATCH net-next] alx: fix error handling in __alx_open
From: Tobias Regnery
Date: Tue, 13 Sep 2016 12:06:57 +0200

> In commit 9ee7b683ea63 we moved the enablement of msi interrupts earlier in
> alx_init_intr. If there is an error in alx_alloc_rings, __alx_open returns
> with an error but msi (or msi-x) interrupts stay enabled. Add a new error
> label to disable msi (or msi-x) interrupts.
>
> Fixes: 9ee7b683ea63 ("alx: refactor msi enablement and disablement")
> Signed-off-by: Tobias Regnery

Applied.
[PATCHv3 next 0/3] IPvlan introduce l3s mode
From: Mahesh Bandewar

Same old problem with a new approach, incorporating suggestions from the
earlier patch series. First, this is introduced as a new mode rather than
a modification of the old (L3) mode, so the behavior of the existing modes
is preserved as-is while the new L3s mode obeys iptables so that the
intended conn-tracking can work.

To do this, the code uses the newly added l3mdev_rcv() handler and an
iptables hook: l3mdev_rcv() performs an inbound route lookup with the
correct (IPvlan slave) interface, and then the iptables hook at
LOCAL_INPUT changes the input device from master to slave to complete the
formality.

The supporting stack changes are trivial: an export to make the
IPv4-equivalent code available for IPv6, and a change to the netfilter
hook registration code to allow the caller to hold RTNL.

Please look into the individual patches for details.

Mahesh Bandewar (3):
  ipv6: Export ip6_route_input_lookup symbol
  net: Add _nf_(un)register_hooks symbols
  ipvlan: Introduce l3s mode

 Documentation/networking/ipvlan.txt |  7 ++-
 drivers/net/Kconfig                 |  1 +
 drivers/net/ipvlan/ipvlan.h         |  7 +++
 drivers/net/ipvlan/ipvlan_core.c    | 94 +
 drivers/net/ipvlan/ipvlan_main.c    | 92 +---
 include/linux/netfilter.h           |  2 +
 include/net/ip6_route.h             |  3 ++
 include/uapi/linux/if_link.h        |  1 +
 net/ipv6/route.c                    |  7 +--
 net/netfilter/core.c                | 51 ++--
 10 files changed, 249 insertions(+), 16 deletions(-)

v1: Initial post
v2: Text correction and config changed from "select" to "depends on"
v3: Separated nf_hook registration logic and made it independent of port,
    as nf_hook registration is independent of how many IPvlan ports are
    present in the system.

-- 
2.8.0.rc3.226.g39d4020
[PATCHv3 next 2/3] net: Add _nf_(un)register_hooks symbols
From: Mahesh Bandewar

Add _nf_register_hooks() and _nf_unregister_hooks() calls which allow the
caller to hold the RTNL mutex.

Signed-off-by: Mahesh Bandewar
CC: Pablo Neira Ayuso
---
 include/linux/netfilter.h |  2 ++
 net/netfilter/core.c      | 51 ++-
 2 files changed, 48 insertions(+), 5 deletions(-)

diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index 9230f9aee896..e82b76781bf6 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -133,6 +133,8 @@ int nf_register_hook(struct nf_hook_ops *reg);
 void nf_unregister_hook(struct nf_hook_ops *reg);
 int nf_register_hooks(struct nf_hook_ops *reg, unsigned int n);
 void nf_unregister_hooks(struct nf_hook_ops *reg, unsigned int n);
+int _nf_register_hooks(struct nf_hook_ops *reg, unsigned int n);
+void _nf_unregister_hooks(struct nf_hook_ops *reg, unsigned int n);
 
 /* Functions to register get/setsockopt ranges (non-inclusive).  You
    need to check permissions yourself! */
diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index f39276d1c2d7..2c5327e43a88 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -188,19 +188,17 @@ EXPORT_SYMBOL(nf_unregister_net_hooks);
 
 static LIST_HEAD(nf_hook_list);
 
-int nf_register_hook(struct nf_hook_ops *reg)
+static int _nf_register_hook(struct nf_hook_ops *reg)
 {
 	struct net *net, *last;
 	int ret;
 
-	rtnl_lock();
 	for_each_net(net) {
 		ret = nf_register_net_hook(net, reg);
 		if (ret && ret != -ENOENT)
 			goto rollback;
 	}
 	list_add_tail(&reg->list, &nf_hook_list);
-	rtnl_unlock();
 
 	return 0;
 rollback:
@@ -210,19 +208,34 @@ rollback:
 			break;
 		nf_unregister_net_hook(net, reg);
 	}
+	return ret;
+}
+
+int nf_register_hook(struct nf_hook_ops *reg)
+{
+	int ret;
+
+	rtnl_lock();
+	ret = _nf_register_hook(reg);
 	rtnl_unlock();
+
 	return ret;
 }
 EXPORT_SYMBOL(nf_register_hook);
 
-void nf_unregister_hook(struct nf_hook_ops *reg)
+static void _nf_unregister_hook(struct nf_hook_ops *reg)
 {
 	struct net *net;
 
-	rtnl_lock();
 	list_del(&reg->list);
 	for_each_net(net)
 		nf_unregister_net_hook(net, reg);
+}
+
+void nf_unregister_hook(struct nf_hook_ops *reg)
+{
+	rtnl_lock();
+	_nf_unregister_hook(reg);
 	rtnl_unlock();
 }
 EXPORT_SYMBOL(nf_unregister_hook);
@@ -246,6 +259,26 @@ err:
 }
 EXPORT_SYMBOL(nf_register_hooks);
 
+/* Caller MUST take rtnl_lock() */
+int _nf_register_hooks(struct nf_hook_ops *reg, unsigned int n)
+{
+	unsigned int i;
+	int err = 0;
+
+	for (i = 0; i < n; i++) {
+		err = _nf_register_hook(&reg[i]);
+		if (err)
+			goto err;
+	}
+	return err;
+
+err:
+	if (i > 0)
+		_nf_unregister_hooks(reg, i);
+	return err;
+}
+EXPORT_SYMBOL(_nf_register_hooks);
+
 void nf_unregister_hooks(struct nf_hook_ops *reg, unsigned int n)
 {
 	while (n-- > 0)
@@ -253,6 +286,14 @@ void nf_unregister_hooks(struct nf_hook_ops *reg, unsigned int n)
 }
 EXPORT_SYMBOL(nf_unregister_hooks);
 
+/* Caller MUST take rtnl_lock */
+void _nf_unregister_hooks(struct nf_hook_ops *reg, unsigned int n)
+{
+	while (n-- > 0)
+		_nf_unregister_hook(&reg[n]);
+}
+EXPORT_SYMBOL(_nf_unregister_hooks);
+
 unsigned int nf_iterate(struct list_head *head,
 			struct sk_buff *skb,
 			struct nf_hook_state *state,
-- 
2.8.0.rc3.226.g39d4020
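The rollback idiom used by _nf_register_hooks() - register each hook in turn and, on the first failure, unregister everything registered so far in reverse order - can be exercised in a small userspace model. register_one, unregister_many, and the fail_at knob below are hypothetical stand-ins for the real per-hook calls:

```c
#include <assert.h>

#define NHOOKS 4

static int registered[NHOOKS];
static int fail_at = -1;	/* index whose registration fails; -1 = none */

static int register_one(int i)
{
	if (i == fail_at)
		return -1;
	registered[i] = 1;
	return 0;
}

static void unregister_many(int n)
{
	/* unregister hooks n-1 .. 0, mirroring _nf_unregister_hooks() */
	while (n-- > 0)
		registered[n] = 0;
}

/* Register n hooks; on failure at index i, roll back hooks 0 .. i-1 and
 * return the error, leaving no hook registered. */
static int register_many(int n)
{
	int i, err = 0;

	for (i = 0; i < n; i++) {
		err = register_one(i);
		if (err)
			goto err;
	}
	return 0;

err:
	if (i > 0)
		unregister_many(i);
	return err;
}
```

The point of the idiom is the all-or-nothing guarantee: a caller never has to track which of its n hooks made it in before a failure.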
[PATCHv3 next 1/3] ipv6: Export ip6_route_input_lookup symbol
From: Mahesh Bandewar

Make ip6_route_input_lookup available outside of the ipv6 module, similar
to ip_route_input_noref in the IPv4 world.

Signed-off-by: Mahesh Bandewar
---
 include/net/ip6_route.h | 3 +++
 net/ipv6/route.c        | 7 ---
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index d97305d0e71f..e0cd318d5103 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -64,6 +64,9 @@ static inline bool rt6_need_strict(const struct in6_addr *daddr)
 }
 
 void ip6_route_input(struct sk_buff *skb);
+struct dst_entry *ip6_route_input_lookup(struct net *net,
+					 struct net_device *dev,
+					 struct flowi6 *fl6, int flags);
 
 struct dst_entry *ip6_route_output_flags(struct net *net, const struct sock *sk,
 					 struct flowi6 *fl6, int flags);
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index ad4a7ff301fc..4dab585f7642 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1147,15 +1147,16 @@ static struct rt6_info *ip6_pol_route_input(struct net *net, struct fib6_table *
 	return ip6_pol_route(net, table, fl6->flowi6_iif, fl6, flags);
 }
 
-static struct dst_entry *ip6_route_input_lookup(struct net *net,
-						struct net_device *dev,
-						struct flowi6 *fl6, int flags)
+struct dst_entry *ip6_route_input_lookup(struct net *net,
+					 struct net_device *dev,
+					 struct flowi6 *fl6, int flags)
 {
 	if (rt6_need_strict(&fl6->daddr) && dev->type != ARPHRD_PIMREG)
 		flags |= RT6_LOOKUP_F_IFACE;
 
 	return fib6_rule_lookup(net, fl6, flags, ip6_pol_route_input);
 }
+EXPORT_SYMBOL_GPL(ip6_route_input_lookup);
 
 void ip6_route_input(struct sk_buff *skb)
 {
-- 
2.8.0.rc3.226.g39d4020
[PATCHv3 next 3/3] ipvlan: Introduce l3s mode
From: Mahesh Bandewar

In a typical IPvlan L3 setup, the master is in the default-ns and each
slave is in a different (slave) ns. In this setup, egress packet
processing for traffic originating from a slave-ns will hit all NF_HOOKs
in the slave-ns as well as in the default-ns. However, the same is not
true for ingress processing: those NF_HOOKs are hit only in the slave-ns,
skipping them in the default-ns. IPvlan in L3 mode is restrictive, and if
admins want to deploy iptables rules in the default-ns, this asymmetric
data path makes it impossible to do so.

This patch makes use of the l3_rcv() handler (added as part of the l3mdev
enhancements) to perform an input route lookup on RX packets without
changing skb->dev, and then uses an nf_hook at NF_INET_LOCAL_IN to change
skb->dev just before handing the skb over to L4.

Signed-off-by: Mahesh Bandewar
CC: David Ahern
---
 Documentation/networking/ipvlan.txt |  7 ++-
 drivers/net/Kconfig                 |  1 +
 drivers/net/ipvlan/ipvlan.h         |  7 +++
 drivers/net/ipvlan/ipvlan_core.c    | 94 +
 drivers/net/ipvlan/ipvlan_main.c    | 92 +---
 include/uapi/linux/if_link.h        |  1 +
 6 files changed, 194 insertions(+), 8 deletions(-)

diff --git a/Documentation/networking/ipvlan.txt b/Documentation/networking/ipvlan.txt
index 14422f8fcdc4..24196cef7c91 100644
--- a/Documentation/networking/ipvlan.txt
+++ b/Documentation/networking/ipvlan.txt
@@ -22,7 +22,7 @@ The driver can be built into the kernel (CONFIG_IPVLAN=y) or as a module
 There are no module parameters for this driver and it can be configured
 using IProute2/ip utility.
 
-	ip link add link type ipvlan mode { l2 | L3 }
+	ip link add link type ipvlan mode { l2 | l3 | l3s }
 
 	e.g. ip link add link ipvl0 eth0 type ipvlan mode l2
 
@@ -48,6 +48,11 @@ master device for the L2 processing and routing from that instance will be
 used before packets are queued on the outbound device. In this mode the
 slaves will not receive nor can send multicast / broadcast traffic.
 
+4.3 L3S mode:
+	This is very similar to the L3 mode except that iptables (conn-tracking)
+works in this mode and hence it is L3-symmetric (L3s). This will have slightly
+less performance but that shouldn't matter since you are choosing this mode
+over plain-L3 mode to make conn-tracking work.
 
 5. What to choose (macvlan vs. ipvlan)?
 	These two devices are very similar in many regards and the specific use
diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 0c5415b05ea9..8768a625350d 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -149,6 +149,7 @@ config IPVLAN
 	tristate "IP-VLAN support"
 	depends on INET
 	depends on IPV6
+	depends on NET_L3_MASTER_DEV
 	---help---
 	  This allows one to create virtual devices off of a main interface
 	  and packets will be delivered based on the dest L3 (IPv6/IPv4 addr)
diff --git a/drivers/net/ipvlan/ipvlan.h b/drivers/net/ipvlan/ipvlan.h
index 695a5dc9ace3..371f4548c42d 100644
--- a/drivers/net/ipvlan/ipvlan.h
+++ b/drivers/net/ipvlan/ipvlan.h
@@ -23,11 +23,13 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
 #include
 #include
+#include
 
 #define IPVLAN_DRV	"ipvlan"
 #define IPV_DRV_VER	"0.1"
@@ -96,6 +98,7 @@ struct ipvl_port {
 	struct work_struct	wq;
 	struct sk_buff_head	backlog;
 	int			count;
+	bool			hooks_attached;
 	struct rcu_head		rcu;
 };
 
@@ -124,4 +127,8 @@ struct ipvl_addr *ipvlan_find_addr(const struct ipvl_dev *ipvlan,
 				   const void *iaddr, bool is_v6);
 bool ipvlan_addr_busy(struct ipvl_port *port, void *iaddr, bool is_v6);
 void ipvlan_ht_addr_del(struct ipvl_addr *addr);
+struct sk_buff *ipvlan_l3_rcv(struct net_device *dev, struct sk_buff *skb,
+			      u16 proto);
+unsigned int ipvlan_nf_input(void *priv, struct sk_buff *skb,
+			     const struct nf_hook_state *state);
 #endif /* __IPVLAN_H */
diff --git a/drivers/net/ipvlan/ipvlan_core.c b/drivers/net/ipvlan/ipvlan_core.c
index b5f9511d819e..b4e990743e1d 100644
--- a/drivers/net/ipvlan/ipvlan_core.c
+++ b/drivers/net/ipvlan/ipvlan_core.c
@@ -560,6 +560,7 @@ int ipvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev)
 	case IPVLAN_MODE_L2:
 		return ipvlan_xmit_mode_l2(skb, dev);
 	case IPVLAN_MODE_L3:
+	case IPVLAN_MODE_L3S:
 		return ipvlan_xmit_mode_l3(skb, dev);
 	}
 
@@ -664,6 +665,8 @@ rx_handler_result_t ipvlan_handle_frame(struct sk_buff **pskb)
 		return ipvlan_handle_mode_l2(pskb, port);
 	case IPVLAN_MODE_L3:
 		return ipvlan_handle_mode_l3(pskb, port);
+	case IPVLAN_MODE_L3S:
+		return RX_HANDLER_PASS;
 	}
cdc_ncm driver padding problem (WAS: Question about CDC_NCM_FLAG_NDP_TO_END)
Hello guys.

Some very good people managed to detect there is a problem with some Huawei firmwares and NCM padding. I actually don't think I have the hardware to test, btw.

On Wed, 14 Sep 2016, Marek Brudka wrote:

Sorry Marek - I forwarded this message without asking for your consent. Let me know anyway if this is a problem.

thank you all guys for everything,
Enrico

== Date: Wed, 14 Sep 2016 19:31:50
== From: Marek Brudka
== To: Enrico Mioso
== Subject: Re: Question about CDC_NCM_FLAG_NDP_TO_END
==
== Hello Enrico,
==
== As nobody at the openwrt forum replied to my request on how to get the
== exact recompilation of OpenWrt 15.05.1, I decided to switch to the
== development version (12/09/2016), which already contains your patch.
==
== The nice thing is that I got my modem (E3372 HiLink reflashed to E398)
== working in ncm mode!
==
== The bad thing is DHCP. It seems that the cdc_ncm driver somehow
== consumes DHCP replies. I had to manually set up the wwan0 interface as
== well as routing, using the result of the Hayes command
==
== AT^DHCP?
== ^DHCP: EC684764,F8FF,E9684764,E9684764,356002D4,366002D4,4320,4320
== OK
==
== Certainly, I will modify the connect scripts
== https://github.com/zabbius/smarthome/tree/master/openwrt/huawei-ncm/files/usr/sbin
== for me to parse this response. However, it seems that the problem is at
== the driver level and is related to padding. Do you know this issue,
== which is nicely described in the thread
== https://forum.openwrt.org/viewtopic.php?pid=273099
== of the OpenWrt forum?
==
== Thank you
== Marek Brudka
==
== On 11.09.2016 at 15:19, Enrico Mioso wrote:
==> Hello Marek.
==>
==> First of all, thank you for your interest in this driver, and for
==> writing.
==>
==> Unfortunately, I don't know the exact procedure to do that: you might
==> be comfortable putting those patches in generic-patches-kernel_version
==> if I am not wrong, but I may well be wrong or imprecise, and recompile
==> the whole OpenWrt thing? I don't know. But yes, that message should
==> appear in dmesg.
==> NDPs need to be at the end of NCM frames. Oh, I don't remember well
==> what NDP stands for... ugh. Sorry.
==>
==> Anyway, let me know if I can do something for you.
==> Enrico
Re: [PATCH v3] net: ip, diag -- Add diag interface for raw sockets
On 9/15/16 4:48 PM, Eric Dumazet wrote:
> On Fri, 2016-09-16 at 00:01 +0300, Cyrill Gorcunov wrote:
>
>> Here I get kicked off the server. Login back:
>>
>> [cyrill@uranus ~] ssh root@pcs7
>> Last login: Thu Sep 15 23:20:42 2016 from gateway
>> [root@pcs7 ~]# cd /home/iproute2/
>> [root@pcs7 iproute2]# misc/ss -A raw
>> State   Recv-Q Send-Q   Local Address:Port   Peer Address:Port
>> UNCONN  0      0        :::ipv6-icmp         :::*
>> UNCONN  0      0        :::ipv6-icmp         :::*
>>
>> Maybe I do something wrong for testing?
>
> If you kill your shell, maybe /root/sock is killed as well, thus its raw
> sockets are closed.
>
> Try to be selective in the -K, do not kill tcp sockets?

I am running ss -aKw 'dev == red' to kill raw sockets bound to the device named 'red'.
Re: [PATCH 0/5] Make /sys/class/net per net namespace objects belong to container
Dmitry Torokhovwrites: > On Mon, Aug 29, 2016 at 5:38 AM, Eric W. Biederman > wrote: >> David Miller writes: >> >>> From: Dmitry Torokhov >>> Date: Tue, 16 Aug 2016 15:33:10 -0700 >>> There are objects in /sys hierarchy (/sys/class/net/) that logically belong to a namespace/container. Unfortunately all sysfs objects start their life belonging to global root, and while we could change ownership manually, keeping tracks of all objects that come and go is cumbersome. It would be better if kernel created them using correct uid/gid from the beginning. This series changes kernfs to allow creating object's with arbitrary uid/gid, adds get_ownership() callback to ktype structure so subsystems could supply their own logic (likely tied to namespace support) for determining ownership of kobjects, and adjusts sysfs code to make use of this information. Lastly net-sysfs is adjusted to make sure that objects in net namespace are owned by the root user from the owning user namespace. Note that we do not adjust ownership of objects moved into a new namespace (as when moving a network device into a container) as userspace can easily do it. >>> >>> I need some domain experts to review this series please. >> >> I just came back from vacation and I will aim to take a look shortly. >> >> The big picture idea seems sensible. Having a better ownship of sysfs >> files that are part of a network namespace. I will have to look at the >> details to see if the implementation is similarly sensible. > > Eric, > > Did you find anything objectionable in the series or should I fix up > the !CONFIG_SYSFS error in networking patch and resubmit? Thank you for the ping, I put this patchset down and forgot to look back. The notion of a get_ownership call seems sensible. At some level I am not a fan of setting the uids and gids on the sysfs nodes as that requires allocation of an additional data structure and it will increase the code of sysfs nodes. 
Certainly I don't think we should incur that cost if we are not using user
namespaces. sysfs nodes can be expensive data-wise because we sometimes
have so many of them. So skipping the setattr when uid == GLOBAL_ROOT_UID
and gid == GLOBAL_ROOT_GID seems very desirable. Perhaps that is just an
optimization in setattr, but it should be somewhere.

I would very much prefer it if we can find a way not to touch all of the
layers in the stack. As I recall it is the code in drivers/base/core.c
that creates the attributes. So my gut feel says we want to export a
sysfs_setattr modeled after sysfs_chmod from sysfs.h and then just have
the driver core level perform the setattr calls for non-default uids and
gids. Symlinks we don't need to worry about changing their ownership;
they are globally read, write, execute.

As long as the chattr happens before the uevent is triggered the code
should be essentially race free in dealing with userspace.

I think that will lead to a simpler, more comprehensible and more
maintainable implementation. Hooking in where or near where the namespace
bits hook in seems excessively complicated (although there may be a good
reason for it that I am forgetting).

Eric
Re: [PATCHv2 net-next] cxgb4vf: don't offload Rx checksums for IPv6 fragments
From: Hariprasad Shenai
Date: Tue, 13 Sep 2016 13:39:24 +0530

> The checksum provided by the device doesn't include the L3 headers,
> as IPv6 expects.
>
> Signed-off-by: Hariprasad Shenai
> ---
> V2: Fixed compilation issue reported by kbuild bot

Applied.
Re: [PATCH v2 2/2] openvswitch: use percpu flow stats
On Thu, Sep 15, 2016 at 04:09:26PM -0700, Eric Dumazet wrote:
> On Thu, 2016-09-15 at 19:11 -0300, Thadeu Lima de Souza Cascardo wrote:
> > Instead of using flow stats per NUMA node, use it per CPU. When using
> > megaflows, the stats lock can be a bottleneck in scalability.
> >
> > On a E5-2690 12-core system, usual throughput went from ~4Mpps to
> > ~15Mpps when forwarding between two 40GbE ports with a single flow
> > configured on the datapath.
> >
> > This has been tested on a system with possible CPUs 0-7,16-23. After
> > module removal, there was no corruption on the slab cache.
> >
> > Signed-off-by: Thadeu Lima de Souza Cascardo
> > Cc: pravin shelar
> > ---
> > +	/* We open code this to make sure cpu 0 is always considered */
> > +	for (cpu = 0; cpu < nr_cpu_ids; cpu = cpumask_next(cpu, cpu_possible_mask))
> > +		if (flow->stats[cpu])
> >  			kmem_cache_free(flow_stats_cache,
> > -					(struct flow_stats __force *)flow->stats[node]);
> > +					(struct flow_stats __force *)flow->stats[cpu]);
> >  	kmem_cache_free(flow_cache, flow);
> >  }
> >
> > @@ -757,7 +749,7 @@ int ovs_flow_init(void)
> >  	BUILD_BUG_ON(sizeof(struct sw_flow_key) % sizeof(long));
> >
> >  	flow_cache = kmem_cache_create("sw_flow", sizeof(struct sw_flow)
> > -				       + (nr_node_ids
> > +				       + (nr_cpu_ids
> >  					  * sizeof(struct flow_stats *)),
> >  				       0, 0, NULL);
> >  	if (flow_cache == NULL)
>
> Well, if you switch to percpu stats, better use normal
> alloc_percpu(struct flow_stats)
>
> The code was dealing with per node allocation so could not use existing
> helper.
>
> No need to keep this forever.

The problem is that alloc_percpu uses a global spinlock and that affects
some workloads on OVS that create lots of flows, as described in commit
9ac56358dec1a5aa7f4275a42971f55fad1f7f35 ("datapath: Per NUMA node flow
stats."). This problem does not happen with this version as the flow
allocation does not suffer from the same scalability problem as when
using alloc_percpu.

Cascardo.
Re: [PATCH v6 net-next 1/1] net_sched: Introduce skbmod action
From: Jamal Hadi Salim
Date: Mon, 12 Sep 2016 20:13:09 -0400

> From: Jamal Hadi Salim
>
> This action is intended to be an upgrade from a usability perspective
> over pedit (as well as for operational debuggability).
> Compare this:
>
> sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
>	u32 match ip protocol 1 0xff flowid 1:2 \
>	action pedit munge offset -14 u8 set 0x02 \
>	munge offset -13 u8 set 0x15 \
>	munge offset -12 u8 set 0x15 \
>	munge offset -11 u8 set 0x15 \
>	munge offset -10 u16 set 0x1515 \
>	pipe
>
> to:
>
> sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
>	u32 match ip protocol 1 0xff flowid 1:2 \
>	action skbmod dmac 02:15:15:15:15:15
>
> Also try to do a MAC address swap with pedit, or worse,
> try to debug a policy with destination mac, source mac and
> ethertype. Then make a few rules out of those and you'll get my point.
>
> In the future common use cases of pedit can be migrated to this action
> (as an example different fields in IPv4/6, transports like tcp/udp/sctp,
> etc). For this first cut, this allows modifying the basic ethernet header.
>
> The most important ethernet use case at the moment is when redirecting or
> mirroring packets to a remote machine. The dst mac address needs a re-write
> so that it doesn't get dropped by or confuse an interconnecting (learning)
> switch, or get dropped by a target machine (which looks at the dst mac).
> And at times when flipping back the packet a swap of the MAC addresses
> is needed.
>
> Signed-off-by: Jamal Hadi Salim

Applied, thanks.
Re: [PATCH v3 net 1/1] net sched actions: fix GETing actions
From: Jamal Hadi Salim
Date: Mon, 12 Sep 2016 19:07:38 -0400

> From: Jamal Hadi Salim
>
> With the batch changes that translated transient actions into
> a temporary list, lost in the translation was the fact that
> tcf_action_destroy() will eventually delete the action from
> the permanent location if the refcount is zero.
>
> Example of what broke:
> ...add a gact action to drop
> sudo $TC actions add action drop index 10
> ...now retrieve it, looks good
> sudo $TC actions get action gact index 10
> ...retrieve it again and find it is gone!
> sudo $TC actions get action gact index 10
>
> Fixes: 22dc13c837c3 ("net_sched: convert tcf_exts from list to pointer array")
> Fixes: 824a7e8863b3 ("net_sched: remove an unnecessary list_del()")
> Fixes: f07fed82ad79 ("net_sched: remove the leftover cleanup_a()")
>
> Signed-off-by: Jamal Hadi Salim

Please incorporate Sergei's feedback and resubmit, thanks Jamal.
Re: [PATCH net-next 0/2] Misc cls_bpf/act_bpf improvements
From: Daniel Borkmann
Date: Mon, 12 Sep 2016 23:38:41 +0200

> Two minor improvements to {cls,act}_bpf. For details please see
> individual patches.

Series applied.
RE: [Intel-wired-lan] [net-next PATCH v3 1/3] e1000: track BQL bytes regardless of skb or not
> > [ cut here ]
> > WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:316 dev_watchdog+0x1c2/0x1d0
> > NETDEV WATCHDOG: eth1 (e1000): transmit queue 0 timed out
>
> Thanks a lot for the tests! Really appreciate it.

np, I needed to get my old compatibility systems back in running order anyway.
Re: [PATCH net-next 0/5] mlx4 misc fixes and improvements
From: Tariq Toukan
Date: Mon, 12 Sep 2016 16:20:11 +0300

> This patchset contains some bug fixes, a cleanup, and small improvements
> from the team to the mlx4 Eth and core drivers.
>
> Series generated against net-next commit:
> 02154927c115 "net: dsa: bcm_sf2: Get VLAN_PORT_MASK from b53_device"
>
> Please push the following patch to -stable >= 4.6 as well:
> "net/mlx4_core: Fix to clean devlink resources"

Again, coding style fixes and optimizations like branch prediction hints
are not bug fixes and therefore not appropriate for 'net'.
[PATCH net-next] pkt_sched: fq: use proper locking in fq_dump_stats()
From: Eric Dumazet

When fq is used on 32bit kernels, we need to lock the qdisc before
copying 64bit fields.

Otherwise "tc -s qdisc ..." might report bogus values.

Fixes: afe4fd062416 ("pkt_sched: fq: Fair Queue packet scheduler")
Signed-off-by: Eric Dumazet
---
 net/sched/sch_fq.c | 32 ++--
 1 file changed, 18 insertions(+), 14 deletions(-)

diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index e5458b99e09c..dc52cc10d6ed 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -823,20 +823,24 @@ nla_put_failure:
 static int fq_dump_stats(struct Qdisc *sch, struct gnet_dump *d)
 {
 	struct fq_sched_data *q = qdisc_priv(sch);
-	u64 now = ktime_get_ns();
-	struct tc_fq_qd_stats st = {
-		.gc_flows		= q->stat_gc_flows,
-		.highprio_packets	= q->stat_internal_packets,
-		.tcp_retrans		= q->stat_tcp_retrans,
-		.throttled		= q->stat_throttled,
-		.flows_plimit		= q->stat_flows_plimit,
-		.pkts_too_long		= q->stat_pkts_too_long,
-		.allocation_errors	= q->stat_allocation_errors,
-		.flows			= q->flows,
-		.inactive_flows		= q->inactive_flows,
-		.throttled_flows	= q->throttled_flows,
-		.time_next_delayed_flow	= q->time_next_delayed_flow - now,
-	};
+	struct tc_fq_qd_stats st;
+
+	sch_tree_lock(sch);
+
+	st.gc_flows		  = q->stat_gc_flows;
+	st.highprio_packets	  = q->stat_internal_packets;
+	st.tcp_retrans		  = q->stat_tcp_retrans;
+	st.throttled		  = q->stat_throttled;
+	st.flows_plimit		  = q->stat_flows_plimit;
+	st.pkts_too_long	  = q->stat_pkts_too_long;
+	st.allocation_errors	  = q->stat_allocation_errors;
+	st.time_next_delayed_flow = q->time_next_delayed_flow - ktime_get_ns();
+	st.flows		  = q->flows;
+	st.inactive_flows	  = q->inactive_flows;
+	st.throttled_flows	  = q->throttled_flows;
+	st.pad			  = 0;
+
+	sch_tree_unlock(sch);

 	return gnet_stats_copy_app(d, &st, sizeof(st));
 }
Re: [PATCH net-next] net/sched: act_tunnel_key: Remove rcu_read_lock protection
From: Hadar Hen Zion
Date: Mon, 12 Sep 2016 15:19:21 +0300

> Remove rcu_read_lock protection from tunnel_key_dump and use
> rtnl_dereference, dump operation is protected by rtnl lock.
>
> Also, remove rcu_read_lock from tunnel_key_release and use
> rcu_dereference_protected.
>
> Both operations are running exclusively and a writer couldn't modify
> t->params while those functions are executed.
>
> Fixes: 54d94fd89d90 ('net/sched: Introduce act_tunnel_key')
> Signed-off-by: Hadar Hen Zion

Applied.
Re: [PATCH] test_bpf: fix the dummy skb after dissector changes
From: Jakub Kicinski
Date: Mon, 12 Sep 2016 13:04:57 +0100

> Commit d5709f7ab776 ("flow_dissector: For stripped vlan, get vlan
> info from skb->vlan_tci") made flow dissector look at vlan_proto
> when vlan is present. Since test_bpf sets skb->vlan_tci to ~0
> (including VLAN_TAG_PRESENT) we have to populate skb->vlan_proto.
>
> Fixes false negative on test #24:
> test_bpf: #24 LD_PAYLOAD_OFF jited:0 175 ret 0 != 42 FAIL (1 times)
>
> Signed-off-by: Jakub Kicinski
> Reviewed-by: Dinan Gunawardena

Applied.
Re: [PATCH][V2] atm: iphase: fix newline escape and minor tweak to source formatting
From: Colin King
Date: Mon, 12 Sep 2016 13:01:50 +0100

> From: Colin Ian King
>
> The newline escape is incorrect and needs fixing. Also adjust source
> formatting / indentation and add { } to trailing else.
>
> Signed-off-by: Colin Ian King

Applied.
Re: [PATCH v2 2/2] openvswitch: use percpu flow stats
On Thu, 2016-09-15 at 19:11 -0300, Thadeu Lima de Souza Cascardo wrote:
> Instead of using flow stats per NUMA node, use it per CPU. When using
> megaflows, the stats lock can be a bottleneck in scalability.
>
> On a E5-2690 12-core system, usual throughput went from ~4Mpps to
> ~15Mpps when forwarding between two 40GbE ports with a single flow
> configured on the datapath.
>
> This has been tested on a system with possible CPUs 0-7,16-23. After
> module removal, there was no corruption on the slab cache.
>
> Signed-off-by: Thadeu Lima de Souza Cascardo
> Cc: pravin shelar
> ---
> +	/* We open code this to make sure cpu 0 is always considered */
> +	for (cpu = 0; cpu < nr_cpu_ids; cpu = cpumask_next(cpu, cpu_possible_mask))
> +		if (flow->stats[cpu])
>  			kmem_cache_free(flow_stats_cache,
> -					(struct flow_stats __force *)flow->stats[node]);
> +					(struct flow_stats __force *)flow->stats[cpu]);
>  	kmem_cache_free(flow_cache, flow);
>  }
>
> @@ -757,7 +749,7 @@ int ovs_flow_init(void)
>  	BUILD_BUG_ON(sizeof(struct sw_flow_key) % sizeof(long));
>
>  	flow_cache = kmem_cache_create("sw_flow", sizeof(struct sw_flow)
> -				       + (nr_node_ids
> +				       + (nr_cpu_ids
>  					  * sizeof(struct flow_stats *)),
>  				       0, 0, NULL);
>  	if (flow_cache == NULL)

Well, if you switch to percpu stats, better use normal
alloc_percpu(struct flow_stats)

The code was dealing with per node allocation so could not use existing
helper.

No need to keep this forever.
Re: [PATCH v3] net: ip, diag -- Add diag interface for raw sockets
On Fri, 2016-09-16 at 00:01 +0300, Cyrill Gorcunov wrote:
> Here I get kicked off the server. Login back
>
> [cyrill@uranus ~] ssh root@pcs7
> Last login: Thu Sep 15 23:20:42 2016 from gateway
> [root@pcs7 ~]# cd /home/iproute2/
> [root@pcs7 iproute2]# misc/ss -A raw
> State      Recv-Q Send-Q   Local Address:Port   Peer Address:Port
> UNCONN     0      0           :::ipv6-icmp          :::*
> UNCONN     0      0           :::ipv6-icmp          :::*
>
> Maybe I do something wrong for testing?

If you kill your shell, maybe /root/sock is killed as well, thus its raw
sockets are closed.

Try to be selective in the -K, do not kill tcp sockets?
Re: MDB offloading of local ipv4 multicast groups
On Thu, Sep 15, 2016 at 08:58:50PM +0200, John Crispin wrote:
> Hi,
>
> While adding MDB support to the qca8k dsa driver I found that ipv4 mcast
> groups don't always get propagated to the dsa driver. In my setup there
> are 2 clients connected to the switch, both running a mdns client. The
> .port_mdb_add() callback is properly called for 33:33:00:00:00:FB but
> 01:00:5E:00:00:FB never got propagated to the dsa driver.
>
> The reason is that the call to ipv4_is_local_multicast() here [1] will
> return true and the notifier is never called. Is this intentional or is
> there something missing in the code ?

Hi John

I've not looked too deeply at this yet, but here is my take on how it
should work.

By default, the switch needs to flood all multicast traffic from any
port in a bridge, to all other ports in a bridge, including the host.

Adding an mdb entry allows you to reduce where such flooding should
occur, i.e. it allows you to implement IGMP snooping and block traffic
going out a port when you know there is nobody interested in the traffic
on that port.

Andrew
[PATCH v2 2/2] openvswitch: use percpu flow stats
Instead of using flow stats per NUMA node, use it per CPU. When using
megaflows, the stats lock can be a bottleneck in scalability.

On a E5-2690 12-core system, usual throughput went from ~4Mpps to
~15Mpps when forwarding between two 40GbE ports with a single flow
configured on the datapath.

This has been tested on a system with possible CPUs 0-7,16-23. After
module removal, there was no corruption on the slab cache.

Signed-off-by: Thadeu Lima de Souza Cascardo
Cc: pravin shelar
---
v2:
* use smp_processor_id as ovs_flow_stats_update is always called from
  BH context
* use kmem_cache_zalloc to allocate flow
---
 net/openvswitch/flow.c       | 42 ++
 net/openvswitch/flow.h       |  4 ++--
 net/openvswitch/flow_table.c | 26 +-
 3 files changed, 33 insertions(+), 39 deletions(-)

diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 5b80612..0fa45439 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -29,6 +29,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -72,32 +73,33 @@ void ovs_flow_stats_update(struct sw_flow *flow, __be16 tcp_flags,
 {
 	struct flow_stats *stats;
 	int node = numa_node_id();
+	int cpu = smp_processor_id();
 	int len = skb->len + (skb_vlan_tag_present(skb) ? VLAN_HLEN : 0);

-	stats = rcu_dereference(flow->stats[node]);
+	stats = rcu_dereference(flow->stats[cpu]);

-	/* Check if already have node-specific stats. */
+	/* Check if already have CPU-specific stats. */
 	if (likely(stats)) {
 		spin_lock(&stats->lock);
 		/* Mark if we write on the pre-allocated stats. */
-		if (node == 0 && unlikely(flow->stats_last_writer != node))
-			flow->stats_last_writer = node;
+		if (cpu == 0 && unlikely(flow->stats_last_writer != cpu))
+			flow->stats_last_writer = cpu;
 	} else {
 		stats = rcu_dereference(flow->stats[0]); /* Pre-allocated. */
 		spin_lock(&stats->lock);

-		/* If the current NUMA-node is the only writer on the
+		/* If the current CPU is the only writer on the
 		 * pre-allocated stats keep using them.
 		 */
-		if (unlikely(flow->stats_last_writer != node)) {
+		if (unlikely(flow->stats_last_writer != cpu)) {
 			/* A previous locker may have already allocated the
-			 * stats, so we need to check again.  If node-specific
+			 * stats, so we need to check again.  If CPU-specific
 			 * stats were already allocated, we update the pre-
 			 * allocated stats as we have already locked them.
 			 */
-			if (likely(flow->stats_last_writer != NUMA_NO_NODE)
-			    && likely(!rcu_access_pointer(flow->stats[node]))) {
-				/* Try to allocate node-specific stats. */
+			if (likely(flow->stats_last_writer != -1) &&
+			    likely(!rcu_access_pointer(flow->stats[cpu]))) {
+				/* Try to allocate CPU-specific stats. */
 				struct flow_stats *new_stats;

 				new_stats =
@@ -114,12 +116,12 @@ void ovs_flow_stats_update(struct sw_flow *flow, __be16 tcp_flags,
 					new_stats->tcp_flags = tcp_flags;
 					spin_lock_init(&new_stats->lock);

-					rcu_assign_pointer(flow->stats[node],
+					rcu_assign_pointer(flow->stats[cpu],
 							   new_stats);
 					goto unlock;
 				}
 			}
-			flow->stats_last_writer = node;
+			flow->stats_last_writer = cpu;
 		}
 	}
@@ -136,15 +138,15 @@ void ovs_flow_stats_get(const struct sw_flow *flow,
 			struct ovs_flow_stats *ovs_stats,
 			unsigned long *used, __be16 *tcp_flags)
 {
-	int node;
+	int cpu;

 	*used = 0;
 	*tcp_flags = 0;
 	memset(ovs_stats, 0, sizeof(*ovs_stats));

-	/* We open code this to make sure node 0 is always considered */
-	for (node = 0; node < MAX_NUMNODES; node = next_node(node, node_possible_map)) {
-		struct flow_stats *stats = rcu_dereference_ovsl(flow->stats[node]);
+	/* We open code this to make sure cpu 0 is always considered */
+	for (cpu = 0; cpu < nr_cpu_ids; cpu = cpumask_next(cpu, cpu_possible_mask)) {
+		struct flow_stats *stats =
[PATCH v2 1/2] openvswitch: fix flow stats accounting when node 0 is not possible
On a system with only node 1 as possible, all statistics are going to be
accounted on node 0 as it will have a single writer.

However, when getting and clearing the statistics, node 0 is not going
to be considered, as it's not a possible node.

Tested that statistics are not zero on a system with only node 1
possible. Also compile-tested with CONFIG_NUMA off.

Signed-off-by: Thadeu Lima de Souza Cascardo
---
 net/openvswitch/flow.c       | 6 ++++--
 net/openvswitch/flow_table.c | 5 +++--
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 1240ae3..5b80612 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -142,7 +142,8 @@ void ovs_flow_stats_get(const struct sw_flow *flow,
 	*tcp_flags = 0;
 	memset(ovs_stats, 0, sizeof(*ovs_stats));

-	for_each_node(node) {
+	/* We open code this to make sure node 0 is always considered */
+	for (node = 0; node < MAX_NUMNODES; node = next_node(node, node_possible_map)) {
 		struct flow_stats *stats = rcu_dereference_ovsl(flow->stats[node]);

 		if (stats) {
@@ -165,7 +166,8 @@ void ovs_flow_stats_clear(struct sw_flow *flow)
 {
 	int node;

-	for_each_node(node) {
+	/* We open code this to make sure node 0 is always considered */
+	for (node = 0; node < MAX_NUMNODES; node = next_node(node, node_possible_map)) {
 		struct flow_stats *stats = ovsl_dereference(flow->stats[node]);

 		if (stats) {
diff --git a/net/openvswitch/flow_table.c b/net/openvswitch/flow_table.c
index d073fff..957a3c3 100644
--- a/net/openvswitch/flow_table.c
+++ b/net/openvswitch/flow_table.c
@@ -148,8 +148,9 @@ static void flow_free(struct sw_flow *flow)
 		kfree(flow->id.unmasked_key);
 	if (flow->sf_acts)
 		ovs_nla_free_flow_actions((struct sw_flow_actions __force *)flow->sf_acts);
-	for_each_node(node)
-		if (flow->stats[node])
+	/* We open code this to make sure node 0 is always considered */
+	for (node = 0; node < MAX_NUMNODES; node = next_node(node, node_possible_map))
+		if (node != 0 && flow->stats[node])
 			kmem_cache_free(flow_stats_cache,
 					(struct flow_stats __force *)flow->stats[node]);
 	kmem_cache_free(flow_cache, flow);
-- 
2.7.4
Re: [RFC v3 03/22] bpf,landlock: Add a new arraymap type to deal with (Landlock) handles
On 15/09/2016 01:28, Alexei Starovoitov wrote: > On Thu, Sep 15, 2016 at 01:22:49AM +0200, Mickaël Salaün wrote: >> >> On 14/09/2016 20:51, Alexei Starovoitov wrote: >>> On Wed, Sep 14, 2016 at 09:23:56AM +0200, Mickaël Salaün wrote: This new arraymap looks like a set and brings new properties: * strong typing of entries: the eBPF functions get the array type of elements instead of CONST_PTR_TO_MAP (e.g. CONST_PTR_TO_LANDLOCK_HANDLE_FS); * force sequential filling (i.e. replace or append-only update), which allow quick browsing of all entries. This strong typing is useful to statically check if the content of a map can be passed to an eBPF function. For example, Landlock use it to store and manage kernel objects (e.g. struct file) instead of dealing with userland raw data. This improve efficiency and ensure that an eBPF program can only call functions with the right high-level arguments. The enum bpf_map_handle_type list low-level types (e.g. BPF_MAP_HANDLE_TYPE_LANDLOCK_FS_FD) which are identified when updating a map entry (handle). This handle types are used to infer a high-level arraymap type which are listed in enum bpf_map_array_type (e.g. BPF_MAP_ARRAY_TYPE_LANDLOCK_FS). For now, this new arraymap is only used by Landlock LSM (cf. next commits) but it could be useful for other needs. Changes since v2: * add a RLIMIT_NOFILE-based limit to the maximum number of arraymap handle entries (suggested by Andy Lutomirski) * remove useless checks Changes since v1: * arraymap of handles replace custom checker groups * simpler userland API Signed-off-by: Mickaël SalaünCc: Alexei Starovoitov Cc: Andy Lutomirski Cc: Daniel Borkmann Cc: David S. 
Miller Cc: Kees Cook Link: https://lkml.kernel.org/r/calcetrwwtiz3kztkegow24-dvhqq6lftwexh77fd2g5o71y...@mail.gmail.com --- include/linux/bpf.h | 14 include/uapi/linux/bpf.h | 18 + kernel/bpf/arraymap.c| 203 +++ kernel/bpf/verifier.c| 12 ++- 4 files changed, 246 insertions(+), 1 deletion(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index fa9a988400d9..eae4ce4542c1 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -13,6 +13,10 @@ #include #include +#ifdef CONFIG_SECURITY_LANDLOCK +#include /* struct file */ +#endif /* CONFIG_SECURITY_LANDLOCK */ + struct perf_event; struct bpf_map; @@ -38,6 +42,7 @@ struct bpf_map_ops { struct bpf_map { atomic_t refcnt; enum bpf_map_type map_type; + enum bpf_map_array_type map_array_type; u32 key_size; u32 value_size; u32 max_entries; @@ -187,6 +192,9 @@ struct bpf_array { */ enum bpf_prog_type owner_prog_type; bool owner_jited; +#ifdef CONFIG_SECURITY_LANDLOCK + u32 n_entries; /* number of entries in a handle array */ +#endif /* CONFIG_SECURITY_LANDLOCK */ union { char value[0] __aligned(8); void *ptrs[0] __aligned(8); @@ -194,6 +202,12 @@ struct bpf_array { }; }; +#ifdef CONFIG_SECURITY_LANDLOCK +struct map_landlock_handle { + u32 type; /* enum bpf_map_handle_type */ +}; +#endif /* CONFIG_SECURITY_LANDLOCK */ + #define MAX_TAIL_CALL_CNT 32 struct bpf_event_entry { diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 7cd36166f9b7..b68de57f7ab8 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -87,6 +87,15 @@ enum bpf_map_type { BPF_MAP_TYPE_PERCPU_ARRAY, BPF_MAP_TYPE_STACK_TRACE,P_TYPE_CGROUP_ARRAY BPF_MAP_TYPE_CGROUP_ARRAY, + BPF_MAP_TYPE_LANDLOCK_ARRAY, +}; + +enum bpf_map_array_type { + BPF_MAP_ARRAY_TYPE_UNSPEC, +}; + +enum bpf_map_handle_type { + BPF_MAP_HANDLE_TYPE_UNSPEC, }; >>> >>> missing something. why it has to be special to have it's own >>> fd array implementation? 
>>> Please take a look how BPF_MAP_TYPE_PERF_EVENT_ARRAY, >>> BPF_MAP_TYPE_CGROUP_ARRAY and BPF_MAP_TYPE_PROG_ARRAY are done. >>> The all store objects into array map that user space passes via FD. >>> I think the same model should apply here. >> >> The idea is to have multiple way for userland to describe a resource >> (e.g. an open file descriptor, a path or a glob pattern). The kernel >> representation could then be a "struct path *" or dedicated types (e.g. >> custom glob). > > hmm. I think user space api should only deal with FD. Everything > else is
Re: [RFC v3 07/22] landlock: Handle file comparisons
On 15/09/2016 01:24, Alexei Starovoitov wrote:
> On Thu, Sep 15, 2016 at 01:02:22AM +0200, Mickaël Salaün wrote:
>>>
>>> I would suggest for the next RFC to do minimal 7 patches up to this point
>>> with simple example that demonstrates the use case.
>>> I would avoid all unpriv stuff and all of seccomp for the next RFC as well,
>>> otherwise I don't think we can realistically make forward progress, since
>>> there are too many issues raised in the subsequent patches.
>>
>> I hope we will find a common agreement about seccomp vs cgroup… I think
>> both approaches have their advantages, can be complementary and nicely
>> combined.
>
> I don't mind having both task based lsm and cgroup based as long as
> infrastructure is not duplicated and scaling issues from earlier version
> are resolved. It should be much better with this RFC.
> I'm proposing to do cgroup only for the next RFC, since mine and Sargun's
> use case for this bpf+lsm+cgroup is _not_ security or sandboxing.

Well, LSM purpose is to do security stuff. The main goal of Landlock is
to bring security features to userland, including unprivileged processes,
at least via the seccomp interface [1].

> No need for unpriv, no_new_priv to cgroups are other things that Andy
> is concerned about.

I'm concerned about security too! :)

>> Unprivileged sandboxing is the main goal of Landlock. This should not be
>> a problem, even for privileged features, thanks to the new subtype/access.
>
> yes. the point that unpriv stuff can come later after agreement is reached.
> If we keep arguing about seccomp details this set won't go anywhere.
> Even in basic part (which is cgroup+bpf+lsm) are plenty of questions
> to be still agreed.

Using the seccomp(2) (unpriv) *interface* is OK according to a more
recent thread [1].

[1] https://lkml.kernel.org/r/20160915044852.ga66...@ast-mbp.thefacebook.com

>> Agreed. With this RFC, the Checmate features (i.e. network helpers)
>> should be able to sit on top of Landlock.
> I think neither of them should be called fancy names for no technical
> reason. We will have only one bpf based lsm. That's it and it doesn't
> need an obscure name. Directory name can be security/bpf/..stuff.c

I disagree on an LSM named "BPF". I first started with the "seccomp LSM"
name (first RFC) but I later realized that it is confusing because
seccomp is associated to its syscall and the underlying features. The
same goes for BPF. It is also artificially hard to grep on a name so
widely used in the kernel source tree. Making an association between the
generic eBPF mechanism and a security centric approach (i.e. LSM) seems
a bit reductive (for BPF). Moreover, the seccomp interface [1] can still
be used.

Landlock is a nice name to depict a sandbox as an enclave (i.e. a
landlocked country/state). I want to keep this name, which is simple,
expresses the goal of Landlock nicely and is comparable to other sandbox
mechanisms such as Seatbelt or Pledge.

Landlock should not be confused with the underlying eBPF implementation.
Landlock could use more than only eBPF in the future, and eBPF could be
used in other LSMs as well.

Mickaël
[PATCH v2 net-next 4/7] ila: Call library function alloc_bucket_locks
To allocate the array of bucket locks for the hash table we now call
library function alloc_bucket_spinlocks.

Signed-off-by: Tom Herbert
---
 net/ipv6/ila/ila_xlat.c | 36 +---
 1 file changed, 5 insertions(+), 31 deletions(-)

diff --git a/net/ipv6/ila/ila_xlat.c b/net/ipv6/ila/ila_xlat.c
index e604013..7d1c34b 100644
--- a/net/ipv6/ila/ila_xlat.c
+++ b/net/ipv6/ila/ila_xlat.c
@@ -30,34 +30,6 @@ struct ila_net {
 	bool hooks_registered;
 };

-#define	LOCKS_PER_CPU 10
-
-static int alloc_ila_locks(struct ila_net *ilan)
-{
-	unsigned int i, size;
-	unsigned int nr_pcpus = num_possible_cpus();
-
-	nr_pcpus = min_t(unsigned int, nr_pcpus, 32UL);
-	size = roundup_pow_of_two(nr_pcpus * LOCKS_PER_CPU);
-
-	if (sizeof(spinlock_t) != 0) {
-#ifdef CONFIG_NUMA
-		if (size * sizeof(spinlock_t) > PAGE_SIZE)
-			ilan->locks = vmalloc(size * sizeof(spinlock_t));
-		else
-#endif
-		ilan->locks = kmalloc_array(size, sizeof(spinlock_t),
-					    GFP_KERNEL);
-		if (!ilan->locks)
-			return -ENOMEM;
-		for (i = 0; i < size; i++)
-			spin_lock_init(&ilan->locks[i]);
-	}
-	ilan->locks_mask = size - 1;
-
-	return 0;
-}
-
 static u32 hashrnd __read_mostly;
 static __always_inline void __ila_hash_secret_init(void)
 {
@@ -561,14 +533,16 @@ static const struct genl_ops ila_nl_ops[] = {
 	},
 };

-#define ILA_HASH_TABLE_SIZE 1024
+#define LOCKS_PER_CPU 10
+#define MAX_LOCKS 1024

 static __net_init int ila_init_net(struct net *net)
 {
 	int err;
 	struct ila_net *ilan = net_generic(net, ila_net_id);

-	err = alloc_ila_locks(ilan);
+	err = alloc_bucket_spinlocks(&ilan->locks, &ilan->locks_mask,
+				     MAX_LOCKS, LOCKS_PER_CPU, GFP_KERNEL);
 	if (err)
 		return err;

@@ -583,7 +557,7 @@ static __net_exit void ila_exit_net(struct net *net)

 	rhashtable_free_and_destroy(&ilan->rhash_table, ila_free_cb, NULL);

-	kvfree(ilan->locks);
+	free_bucket_spinlocks(ilan->locks);

 	if (ilan->hooks_registered)
 		nf_unregister_net_hooks(net, ila_nf_hook_ops,
-- 
2.8.0.rc2
[PATCH v2 net-next 2/7] spinlock: Add library function to allocate spinlock buckets array
Add two new library functions alloc_bucket_spinlocks and
free_bucket_spinlocks. These are used to allocate and free an array of
spinlocks that are useful as locks for hash buckets. The interface
specifies the maximum number of spinlocks in the array as well as a CPU
multiplier to derive the number of spinlocks to allocate. The number
allocated is rounded up to a power of two to make the array amenable to
hash lookup.

Reviewed by Greg Rose
Acked-by: Thomas Graf
Signed-off-by: Tom Herbert
---
 include/linux/spinlock.h |  6 +
 lib/Makefile             |  2 +-
 lib/bucket_locks.c       | 63
 3 files changed, 70 insertions(+), 1 deletion(-)
 create mode 100644 lib/bucket_locks.c

diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
index 47dd0ce..4ebdfbf 100644
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -416,4 +416,10 @@ extern int _atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock);
 #define atomic_dec_and_lock(atomic, lock) \
 		__cond_lock(lock, _atomic_dec_and_lock(atomic, lock))

+int alloc_bucket_spinlocks(spinlock_t **locks, unsigned int *lock_mask,
+			   unsigned int max_size, unsigned int cpu_mult,
+			   gfp_t gfp);
+
+void free_bucket_spinlocks(spinlock_t *locks);
+
 #endif /* __LINUX_SPINLOCK_H */
diff --git a/lib/Makefile b/lib/Makefile
index 5dc77a8..f91185e 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -36,7 +36,7 @@ obj-y += bcd.o div64.o sort.o parser.o halfmd4.o debug_locks.o random32.o \
 	 gcd.o lcm.o list_sort.o uuid.o flex_array.o iov_iter.o clz_ctz.o \
 	 bsearch.o find_bit.o llist.o memweight.o kfifo.o \
 	 percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o \
-	 once.o
+	 once.o bucket_locks.o
 obj-y += string_helpers.o
 obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o
 obj-y += hexdump.o
diff --git a/lib/bucket_locks.c b/lib/bucket_locks.c
new file mode 100644
index 000..bb9bf11
--- /dev/null
+++ b/lib/bucket_locks.c
@@ -0,0 +1,63 @@
+#include
+#include
+#include
+#include
+#include
+
+/* Allocate an array of spinlocks to be accessed by a hash. Two arguments
+ * indicate the number of elements to allocate in the array. max_size
+ * gives the maximum number of elements to allocate. cpu_mult gives
+ * the number of locks per CPU to allocate. The size is rounded up
+ * to a power of 2 to be suitable as a hash table.
+ */
+int alloc_bucket_spinlocks(spinlock_t **locks, unsigned int *locks_mask,
+			   unsigned int max_size, unsigned int cpu_mult,
+			   gfp_t gfp)
+{
+	unsigned int i, size;
+#if defined(CONFIG_PROVE_LOCKING)
+	unsigned int nr_pcpus = 2;
+#else
+	unsigned int nr_pcpus = num_possible_cpus();
+#endif
+	spinlock_t *tlocks = NULL;
+
+	if (cpu_mult) {
+		nr_pcpus = min_t(unsigned int, nr_pcpus, 64UL);
+		size = min_t(unsigned int, nr_pcpus * cpu_mult, max_size);
+	} else {
+		size = max_size;
+	}
+	size = roundup_pow_of_two(size);
+
+	if (!size)
+		return -EINVAL;
+
+	if (sizeof(spinlock_t) != 0) {
+#ifdef CONFIG_NUMA
+		if (size * sizeof(spinlock_t) > PAGE_SIZE &&
+		    gfp == GFP_KERNEL)
+			tlocks = vmalloc(size * sizeof(spinlock_t));
+#endif
+		if (gfp != GFP_KERNEL)
+			gfp |= __GFP_NOWARN | __GFP_NORETRY;
+
+		if (!tlocks)
+			tlocks = kmalloc_array(size, sizeof(spinlock_t), gfp);
+		if (!tlocks)
+			return -ENOMEM;
+		for (i = 0; i < size; i++)
+			spin_lock_init(&tlocks[i]);
+	}
+	*locks = tlocks;
+	*locks_mask = size - 1;
+
+	return 0;
+}
+EXPORT_SYMBOL(alloc_bucket_spinlocks);
+
+void free_bucket_spinlocks(spinlock_t *locks)
+{
+	kvfree(locks);
+}
+EXPORT_SYMBOL(free_bucket_spinlocks);
-- 
2.8.0.rc2
[PATCH v2 net-next 6/7] net: Generic resolver backend
This patch implements the backend of a resolver; specifically, it provides a means to track unresolved addresses and to time them out. The resolver is mostly a frontend to an rhashtable where the key of the table is whatever address type or object is tracked. A resolver instance is created by net_rslv_create and destroyed by net_rslv_destroy. There are two functions that are used to manipulate entries in the table: net_rslv_lookup_and_create and net_rslv_resolved. net_rslv_lookup_and_create is called with an unresolved address as the argument. It returns a structure of type net_rslv_ent. When called, a lookup is performed to see if an entry for the address is already in the table; if it is, the entry is returned and false is returned in the new bool pointer argument to indicate that the entry was preexisting. If an entry is not found, one is created and true is returned in the new pointer argument. It is expected that when an entry is new the address resolution protocol is initiated (for instance an RTM_ADDR_RESOLVE message may be sent to a userspace daemon, as we will do in ILA). If net_rslv_lookup_and_create returns NULL then presumably the hash table has reached the limit of the number of outstanding unresolved addresses; the caller should take appropriate actions to avoid spamming the resolution protocol. net_rslv_resolved is called when resolution is complete (e.g. an ILA locator mapping was instantiated for a locator). The entry is removed from the hash table. An argument to net_rslv_create indicates a timeout for pending resolutions in milliseconds. If the timer fires before resolution then the entry is removed from the table. Subsequently, another attempt to resolve the same address will result in a new entry in the table. net_rslv_lookup_and_create allocates a net_rslv_ent struct and also allocates related user data. This is the object[] field in the structure. The key (unresolved address) is always the first field in the object.
Following that, the caller may add its own private data fields. The key length and the size of the user object (including the key) are specified in net_rslv_create. There are three callback functions that can be set as arguments in net_rslv_create: - cmp_fn: Compare function for the hash table. Arguments are the key and an object in the table. If this is NULL then the default memcmp of rhashtable is used. - init_fn: Initialize a new net_rslv_ent structure. This allows initialization of the user portion of the structure (the object[]). - destroy_fn: Called right before a net_rslv_ent is freed. This allows cleanup of user data associated with the entry. Note that the resolver backend only tracks unresolved addresses; it is up to the caller to perform the mechanics of resolution. This includes the possibility of queuing packets awaiting resolution, which can be accomplished for instance by maintaining an skbuff queue in the net_rslv_ent user object[] data. DOS mitigation is done by limiting the number of entries in the resolver table (the max_size argument of net_rslv_create) and setting a timeout. If the timeout is set then the maximum rate of new resolution requests is max_table_size / timeout. For instance, with a maximum size of 1000 entries and a timeout of 100 msecs the maximum rate of resolution requests is 10,000/s.
Signed-off-by: Tom Herbert--- include/net/resolver.h | 58 +++ net/Kconfig| 4 + net/core/Makefile | 1 + net/core/resolver.c| 272 + 4 files changed, 335 insertions(+) create mode 100644 include/net/resolver.h create mode 100644 net/core/resolver.c diff --git a/include/net/resolver.h b/include/net/resolver.h new file mode 100644 index 000..9274237 --- /dev/null +++ b/include/net/resolver.h @@ -0,0 +1,58 @@ +#ifndef __NET_RESOLVER_H +#define __NET_RESOLVER_H + +#include + +struct net_rslv; +struct net_rslv_ent; + +typedef int (*net_rslv_cmpfn)(struct net_rslv *nrslv, const void *key, + const void *object); +typedef void (*net_rslv_initfn)(struct net_rslv *nrslv, void *object); +typedef void (*net_rslv_destroyfn)(struct net_rslv_ent *nrent); + +struct net_rslv { + struct rhashtable rhash_table; + struct rhashtable_params params; + net_rslv_cmpfn rslv_cmp; + net_rslv_initfn rslv_init; + net_rslv_destroyfn rslv_destroy; + size_t obj_size; + spinlock_t *locks; + unsigned int locks_mask; + unsigned int hash_rnd; +}; + +struct net_rslv_ent { + struct rcu_head rcu; + union { + /* Fields set when entry is in hash table */ + struct { + struct rhash_head node; + struct delayed_work timeout_work; + struct net_rslv *nrslv; + }; + + /* Fields set when rcu freeing structure */ +
[PATCH v2 net-next 0/7] net: ILA resolver and generic resolver backend
This patch set implements an ILA host side resolver. This uses LWT to implement the hook to a userspace resolver and tracks pending unresolved addresses using the backend net resolver. This patch set contains: - A new library function to allocate an array of spinlocks for use in locking hash buckets. - Make the hash function in rhashtable directly callable. - A generic resolver backend infrastructure. This primarily does two things: tracks unresolved addresses and implements a timeout for when resolution does not happen. These mechanisms provide rate limiting control over resolution requests (for instance in ILA it is used to rate limit requests to userspace to resolve addresses). - The ILA resolver. This implements the path from the kernel ILA implementation to a userspace daemon that is notified when an identifier address needs to be resolved. - Routing messages over netlink are used to indicate resolution requests. Changes from initial RFC: - Added net argument to LWT build_state - Made resolve timeout an attribute of the LWT encap route - Changed ILA notifications to be regular routing messages of event RTM_ADDR_RESOLVE, family RTNL_FAMILY_ILA, and group RTNLGRP_ILA_NOTIFY Tested: - Ran a UDP flood to random addresses in a resolver prefix. Observed that timeouts and limits were working (watching "ip monitor"). - Also ran against an ILA client daemon that runs the resolver protocol. Observed that when resolution completes (ILA encap route is installed) routing messages are no longer sent.
v2: - Fixed function prototype issue found by kbuild - Fix incorrect interpretation of return code from net_rslv_lookup_and_create Tom Herbert (7): lwt: Add net to build_state argument spinlock: Add library function to allocate spinlock buckets array rhashtable: Call library function alloc_bucket_locks ila: Call library function alloc_bucket_locks rhashtable: abstract out function to get hash net: Generic resolver backend ila: Resolver mechanism include/linux/rhashtable.h | 28 +++-- include/linux/spinlock.h | 6 + include/net/lwtunnel.h | 14 +-- include/net/resolver.h | 58 + include/uapi/linux/ila.h | 9 ++ include/uapi/linux/lwtunnel.h | 1 + include/uapi/linux/rtnetlink.h | 8 +- lib/Makefile | 2 +- lib/bucket_locks.c | 63 ++ lib/rhashtable.c | 46 +-- net/Kconfig| 4 + net/core/Makefile | 1 + net/core/lwtunnel.c| 11 +- net/core/resolver.c| 272 + net/ipv4/fib_semantics.c | 7 +- net/ipv4/ip_tunnel_core.c | 12 +- net/ipv6/Kconfig | 1 + net/ipv6/ila/Makefile | 2 +- net/ipv6/ila/ila.h | 16 +++ net/ipv6/ila/ila_common.c | 7 ++ net/ipv6/ila/ila_lwt.c | 15 ++- net/ipv6/ila/ila_resolver.c| 249 + net/ipv6/ila/ila_xlat.c| 51 ++-- net/ipv6/route.c | 2 +- net/mpls/mpls_iptunnel.c | 6 +- 25 files changed, 770 insertions(+), 121 deletions(-) create mode 100644 include/net/resolver.h create mode 100644 lib/bucket_locks.c create mode 100644 net/core/resolver.c create mode 100644 net/ipv6/ila/ila_resolver.c -- 2.8.0.rc2
[PATCH v2 net-next 5/7] rhashtable: abstract out function to get hash
Split out the part of rht_key_hashfn that calculates the hash into its own function. This way the hash function can be called separately to get the hash value. Acked-by: Thomas Graf Signed-off-by: Tom Herbert --- include/linux/rhashtable.h | 28 ++-- 1 file changed, 18 insertions(+), 10 deletions(-) diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h index fd82584..e398a62 100644 --- a/include/linux/rhashtable.h +++ b/include/linux/rhashtable.h @@ -208,34 +208,42 @@ static inline unsigned int rht_bucket_index(const struct bucket_table *tbl, return (hash >> RHT_HASH_RESERVED_SPACE) & (tbl->size - 1); } -static inline unsigned int rht_key_hashfn( - struct rhashtable *ht, const struct bucket_table *tbl, - const void *key, const struct rhashtable_params params) +static inline unsigned int rht_key_get_hash(struct rhashtable *ht, + const void *key, const struct rhashtable_params params, + unsigned int hash_rnd) { unsigned int hash; /* params must be equal to ht->p if it isn't constant.
*/ if (!__builtin_constant_p(params.key_len)) - hash = ht->p.hashfn(key, ht->key_len, tbl->hash_rnd); + hash = ht->p.hashfn(key, ht->key_len, hash_rnd); else if (params.key_len) { unsigned int key_len = params.key_len; if (params.hashfn) - hash = params.hashfn(key, key_len, tbl->hash_rnd); + hash = params.hashfn(key, key_len, hash_rnd); else if (key_len & (sizeof(u32) - 1)) - hash = jhash(key, key_len, tbl->hash_rnd); + hash = jhash(key, key_len, hash_rnd); else - hash = jhash2(key, key_len / sizeof(u32), - tbl->hash_rnd); + hash = jhash2(key, key_len / sizeof(u32), hash_rnd); } else { unsigned int key_len = ht->p.key_len; if (params.hashfn) - hash = params.hashfn(key, key_len, tbl->hash_rnd); + hash = params.hashfn(key, key_len, hash_rnd); else - hash = jhash(key, key_len, tbl->hash_rnd); + hash = jhash(key, key_len, hash_rnd); } + return hash; +} + +static inline unsigned int rht_key_hashfn( + struct rhashtable *ht, const struct bucket_table *tbl, + const void *key, const struct rhashtable_params params) +{ + unsigned int hash = rht_key_get_hash(ht, key, params, tbl->hash_rnd); + return rht_bucket_index(tbl, hash); } -- 2.8.0.rc2
[PATCH v2 net-next 1/7] lwt: Add net to build_state argument
Users of LWT need to know the network namespace (net) if they want to have per-net operations in LWT. Signed-off-by: Tom Herbert --- include/net/lwtunnel.h| 14 +++--- net/core/lwtunnel.c | 11 +++ net/ipv4/fib_semantics.c | 7 --- net/ipv4/ip_tunnel_core.c | 12 ++-- net/ipv6/ila/ila_lwt.c| 6 +++--- net/ipv6/route.c | 2 +- net/mpls/mpls_iptunnel.c | 6 +++--- 7 files changed, 31 insertions(+), 27 deletions(-) diff --git a/include/net/lwtunnel.h b/include/net/lwtunnel.h index ea3f80f..9d1e172 100644 --- a/include/net/lwtunnel.h +++ b/include/net/lwtunnel.h @@ -33,9 +33,9 @@ struct lwtunnel_state { }; struct lwtunnel_encap_ops { - int (*build_state)(struct net_device *dev, struct nlattr *encap, - unsigned int family, const void *cfg, - struct lwtunnel_state **ts); + int (*build_state)(struct net *net, struct net_device *dev, + struct nlattr *encap, unsigned int family, + const void *cfg, struct lwtunnel_state **ts); int (*output)(struct net *net, struct sock *sk, struct sk_buff *skb); int (*input)(struct sk_buff *skb); int (*fill_encap)(struct sk_buff *skb, @@ -106,8 +106,8 @@ int lwtunnel_encap_add_ops(const struct lwtunnel_encap_ops *op, unsigned int num); int lwtunnel_encap_del_ops(const struct lwtunnel_encap_ops *op, unsigned int num); -int lwtunnel_build_state(struct net_device *dev, u16 encap_type, -struct nlattr *encap, +int lwtunnel_build_state(struct net *net, struct net_device *dev, +u16 encap_type, struct nlattr *encap, unsigned int family, const void *cfg, struct lwtunnel_state **lws); int lwtunnel_fill_encap(struct sk_buff *skb, @@ -169,8 +169,8 @@ static inline int lwtunnel_encap_del_ops(const struct lwtunnel_encap_ops *op, return -EOPNOTSUPP; } -static inline int lwtunnel_build_state(struct net_device *dev, u16 encap_type, - struct nlattr *encap, +static inline int lwtunnel_build_state(struct net *net, struct net_device *dev, + u16 encap_type, struct nlattr *encap, unsigned int family, const void *cfg, struct lwtunnel_state **lws) { diff --git a/net/core/lwtunnel.c
b/net/core/lwtunnel.c index e5f84c2..ba8be0b 100644 --- a/net/core/lwtunnel.c +++ b/net/core/lwtunnel.c @@ -39,6 +39,8 @@ static const char *lwtunnel_encap_str(enum lwtunnel_encap_types encap_type) return "MPLS"; case LWTUNNEL_ENCAP_ILA: return "ILA"; + case LWTUNNEL_ENCAP_ILA_NOTIFY: + return "ILA_NOTIFY"; case LWTUNNEL_ENCAP_IP6: case LWTUNNEL_ENCAP_IP: case LWTUNNEL_ENCAP_NONE: @@ -96,9 +98,10 @@ int lwtunnel_encap_del_ops(const struct lwtunnel_encap_ops *ops, } EXPORT_SYMBOL(lwtunnel_encap_del_ops); -int lwtunnel_build_state(struct net_device *dev, u16 encap_type, -struct nlattr *encap, unsigned int family, -const void *cfg, struct lwtunnel_state **lws) +int lwtunnel_build_state(struct net *net, struct net_device *dev, +u16 encap_type, struct nlattr *encap, +unsigned int family, const void *cfg, +struct lwtunnel_state **lws) { const struct lwtunnel_encap_ops *ops; int ret = -EINVAL; @@ -123,7 +126,7 @@ int lwtunnel_build_state(struct net_device *dev, u16 encap_type, } #endif if (likely(ops && ops->build_state)) - ret = ops->build_state(dev, encap, family, cfg, lws); + ret = ops->build_state(net, dev, encap, family, cfg, lws); rcu_read_unlock(); return ret; diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c index 388d3e2..aee4e95 100644 --- a/net/ipv4/fib_semantics.c +++ b/net/ipv4/fib_semantics.c @@ -511,7 +511,8 @@ static int fib_get_nhs(struct fib_info *fi, struct rtnexthop *rtnh, goto err_inval; if (cfg->fc_oif) dev = __dev_get_by_index(net, cfg->fc_oif); - ret = lwtunnel_build_state(dev, nla_get_u16( + ret = lwtunnel_build_state(net, dev, + nla_get_u16( nla_entype), nla, AF_INET, cfg, &lwtstate); @@ -610,7 +611,7 @@ static int fib_encap_match(struct net *net, u16 encap_type, if (oif) dev =
[PATCH v2 net-next 3/7] rhashtable: Call library function alloc_bucket_locks
To allocate the array of bucket locks for the hash table we now call library function alloc_bucket_spinlocks. This function is based on the old alloc_bucket_locks in rhashtable and should produce the same effect. Acked-by: Thomas Graf Signed-off-by: Tom Herbert --- lib/rhashtable.c | 46 -- 1 file changed, 4 insertions(+), 42 deletions(-) diff --git a/lib/rhashtable.c b/lib/rhashtable.c index 06c2872..5b53304 100644 --- a/lib/rhashtable.c +++ b/lib/rhashtable.c @@ -59,50 +59,10 @@ EXPORT_SYMBOL_GPL(lockdep_rht_bucket_is_held); #define ASSERT_RHT_MUTEX(HT) #endif - -static int alloc_bucket_locks(struct rhashtable *ht, struct bucket_table *tbl, - gfp_t gfp) -{ - unsigned int i, size; -#if defined(CONFIG_PROVE_LOCKING) - unsigned int nr_pcpus = 2; -#else - unsigned int nr_pcpus = num_possible_cpus(); -#endif - - nr_pcpus = min_t(unsigned int, nr_pcpus, 64UL); - size = roundup_pow_of_two(nr_pcpus * ht->p.locks_mul); - - /* Never allocate more than 0.5 locks per bucket */ - size = min_t(unsigned int, size, tbl->size >> 1); - - if (sizeof(spinlock_t) != 0) { - tbl->locks = NULL; -#ifdef CONFIG_NUMA - if (size * sizeof(spinlock_t) > PAGE_SIZE && - gfp == GFP_KERNEL) - tbl->locks = vmalloc(size * sizeof(spinlock_t)); -#endif - if (gfp != GFP_KERNEL) - gfp |= __GFP_NOWARN | __GFP_NORETRY; - - if (!tbl->locks) - tbl->locks = kmalloc_array(size, sizeof(spinlock_t), - gfp); - if (!tbl->locks) - return -ENOMEM; - for (i = 0; i < size; i++) - spin_lock_init(&tbl->locks[i]); - } - tbl->locks_mask = size - 1; - - return 0; -} - static void bucket_table_free(const struct bucket_table *tbl) { if (tbl) - kvfree(tbl->locks); + free_bucket_spinlocks(tbl->locks); kvfree(tbl); } @@ -131,7 +91,9 @@ static struct bucket_table *bucket_table_alloc(struct rhashtable *ht, tbl->size = nbuckets; - if (alloc_bucket_locks(ht, tbl, gfp) < 0) { + /* Never allocate more than 0.5 locks per bucket */ + if (alloc_bucket_spinlocks(&tbl->locks, &tbl->locks_mask, + tbl->size >> 1, ht->p.locks_mul, gfp)) {
bucket_table_free(tbl); return NULL; } -- 2.8.0.rc2
[PATCH v2 net-next 7/7] ila: Resolver mechanism
Implement an ILA resolver. This uses LWT to implement the hook to a userspace resolver and tracks pending unresolved addresses using the backend net resolver. The idea is that the kernel sets an ILA resolver route to the SIR prefix, something like: ip route add ::/64 encap ila-resolve \ via 2401:db00:20:911a::27:0 dev eth0 When a packet hits the route the address is looked up in a resolver table. If the entry is created (no entry with the address already exists) then an rtnl message is generated with group RTNLGRP_ILA_NOTIFY and type RTM_ADDR_RESOLVE. A userspace daemon can listen for such messages and perform an ILA resolution protocol to determine the ILA mapping. If the mapping is resolved then a /128 ila encap route is set so that the host can perform ILA translation and send directly to the destination. Signed-off-by: Tom Herbert --- include/uapi/linux/ila.h | 9 ++ include/uapi/linux/lwtunnel.h | 1 + include/uapi/linux/rtnetlink.h | 8 +- net/ipv6/Kconfig | 1 + net/ipv6/ila/Makefile | 2 +- net/ipv6/ila/ila.h | 16 +++ net/ipv6/ila/ila_common.c | 7 ++ net/ipv6/ila/ila_lwt.c | 9 ++ net/ipv6/ila/ila_resolver.c| 249 + net/ipv6/ila/ila_xlat.c| 15 ++- 10 files changed, 307 insertions(+), 10 deletions(-) create mode 100644 net/ipv6/ila/ila_resolver.c diff --git a/include/uapi/linux/ila.h b/include/uapi/linux/ila.h index 948c0a9..f186f8b 100644 --- a/include/uapi/linux/ila.h +++ b/include/uapi/linux/ila.h @@ -42,4 +42,13 @@ enum { ILA_CSUM_NO_ACTION, }; +enum { + ILA_NOTIFY_ATTR_UNSPEC, + ILA_NOTIFY_ATTR_TIMEOUT,/* u32 */ + + __ILA_NOTIFY_ATTR_MAX, +}; + +#define ILA_NOTIFY_ATTR_MAX(__ILA_NOTIFY_ATTR_MAX - 1) + #endif /* _UAPI_LINUX_ILA_H */ diff --git a/include/uapi/linux/lwtunnel.h b/include/uapi/linux/lwtunnel.h index a478fe8..d880e49 100644 --- a/include/uapi/linux/lwtunnel.h +++ b/include/uapi/linux/lwtunnel.h @@ -9,6 +9,7 @@ enum lwtunnel_encap_types { LWTUNNEL_ENCAP_IP, LWTUNNEL_ENCAP_ILA, LWTUNNEL_ENCAP_IP6, + LWTUNNEL_ENCAP_ILA_NOTIFY, __LWTUNNEL_ENCAP_MAX, }; diff --git
a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h index 262f037..a775464 100644 --- a/include/uapi/linux/rtnetlink.h +++ b/include/uapi/linux/rtnetlink.h @@ -12,7 +12,8 @@ */ #define RTNL_FAMILY_IPMR 128 #define RTNL_FAMILY_IP6MR 129 -#define RTNL_FAMILY_MAX129 +#define RTNL_FAMILY_ILA130 +#define RTNL_FAMILY_MAX130 / * Routing/neighbour discovery messages. @@ -144,6 +145,9 @@ enum { RTM_GETSTATS = 94, #define RTM_GETSTATS RTM_GETSTATS + RTM_ADDR_RESOLVE = 95, +#define RTM_ADDR_RESOLVE RTM_ADDR_RESOLVE + __RTM_MAX, #define RTM_MAX(((__RTM_MAX + 3) & ~3) - 1) }; @@ -656,6 +660,8 @@ enum rtnetlink_groups { #define RTNLGRP_MPLS_ROUTE RTNLGRP_MPLS_ROUTE RTNLGRP_NSID, #define RTNLGRP_NSID RTNLGRP_NSID + RTNLGRP_ILA_NOTIFY, +#define RTNLGRP_ILA_NOTIFY RTNLGRP_ILA_NOTIFY __RTNLGRP_MAX }; #define RTNLGRP_MAX(__RTNLGRP_MAX - 1) diff --git a/net/ipv6/Kconfig b/net/ipv6/Kconfig index 2343e4f..cf3ea8e 100644 --- a/net/ipv6/Kconfig +++ b/net/ipv6/Kconfig @@ -97,6 +97,7 @@ config IPV6_ILA tristate "IPv6: Identifier Locator Addressing (ILA)" depends on NETFILTER select LWTUNNEL + select NET_EXT_RESOLVER ---help--- Support for IPv6 Identifier Locator Addressing (ILA). 
diff --git a/net/ipv6/ila/Makefile b/net/ipv6/ila/Makefile index 4b32e59..f2aadc3 100644 --- a/net/ipv6/ila/Makefile +++ b/net/ipv6/ila/Makefile @@ -4,4 +4,4 @@ obj-$(CONFIG_IPV6_ILA) += ila.o -ila-objs := ila_common.o ila_lwt.o ila_xlat.o +ila-objs := ila_common.o ila_lwt.o ila_xlat.o ila_resolver.o diff --git a/net/ipv6/ila/ila.h b/net/ipv6/ila/ila.h index e0170f6..e369611 100644 --- a/net/ipv6/ila/ila.h +++ b/net/ipv6/ila/ila.h @@ -15,6 +15,7 @@ #include #include #include +#include #include #include #include @@ -23,6 +24,16 @@ #include #include +extern unsigned int ila_net_id; + +struct ila_net { + struct rhashtable rhash_table; + spinlock_t *locks; /* Bucket locks for entry manipulation */ + unsigned int locks_mask; + bool hooks_registered; + struct net_rslv *nrslv; +}; + struct ila_locator { union { __u8v8[8]; @@ -114,9 +125,14 @@ void ila_update_ipv6_locator(struct sk_buff *skb, struct ila_params *p, void ila_init_saved_csum(struct ila_params *p); +void ila_rslv_resolved(struct ila_net *ilan, struct ila_addr *iaddr); int ila_lwt_init(void); void ila_lwt_fini(void); int ila_xlat_init(void); void ila_xlat_fini(void); +int ila_rslv_init(void); +void
Re: [PATCH v3] net: ip, diag -- Add diag interface for raw sockets
On Thu, Sep 15, 2016 at 02:54:57PM -0600, David Ahern wrote: > On 9/15/16 2:22 PM, Cyrill Gorcunov wrote: > >> ss -K is not working. Socket lookup fails to find a match due to a > >> protocol mismatch. > >> > >> haven't had time to track down why there is a mismatch since the kill uses > >> the socket returned > >> from the dump. Won't have time to come back to this until early next week. > > > > Have you ran iproute2 patched? I just ran ss -K and all sockets get closed > > (including raw ones), which actually kicked me off the testing machine sshd > > :/ > > > > > This is the patch I applied to iproute2; the change in your goo.gl link plus > a debug to confirm the kill action is initiated by ss: > > diff --git a/misc/ss.c b/misc/ss.c > index 3b268d999426..4d98411738ea 100644 > --- a/misc/ss.c > +++ b/misc/ss.c > @@ -2334,6 +2334,10 @@ static int show_one_inet_sock(const struct sockaddr_nl > *addr, > if (diag_arg->f->f && run_ssfilter(diag_arg->f->f, ) == 0) > return 0; > > + if (diag_arg->f->kill) { > +printf("want to kill:\n"); > + err = inet_show_sock(h, , diag_arg->protocol); > + } > if (diag_arg->f->kill && kill_inet_sock(h, arg) != 0) { > if (errno == EOPNOTSUPP || errno == ENOENT) { > /* Socket can't be closed, or is already closed. */ > @@ -2631,6 +2635,10 @@ static int raw_show(struct filter *f) > > dg_proto = RAW_PROTO; > > +if (!getenv("PROC_NET_RAW") && !getenv("PROC_ROOT") && > +inet_show_netlink(f, NULL, IPPROTO_RAW) == 0) > +return 0; > + > if (f->families&(1<if ((fp = net_raw_open()) == NULL) > goto outerr; > Hmm. Weird. 
I'm running net-next kernel --- [root@pcs7 ~]# /root/sock & [1] 5108 This is a trivial program which opens raw sockets [root@pcs7 iproute2]# misc/ss -A raw State Recv-Q Send-QLocal Address:Port Peer Address:Port ESTAB 0 0 127.0.0.1:ipproto-255 127.0.0.10:ipproto-9090 UNCONN 0 0 127.0.0.10:ipproto-255 *:* UNCONN 0 0:::ipv6-icmp :::* UNCONN 0 0:::ipv6-icmp :::* ESTAB 0 0 ::1:ipproto-255 ::1:ipproto-9091 UNCONN 0 0 ::1:ipproto-255:::* [root@pcs7 iproute2]# [root@pcs7 iproute2]# misc/ss -K Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port u_str ESTAB 0 0/var/run/dbus/system_bus_socket 18071 * 16297 u_str ESTAB 0 0/run/systemd/journal/stdout 18756 * 16188 u_str ESTAB 0 0/run/systemd/journal/stdout 23014 * 23013 u_str ESTAB 0 0 * 18909 * 16298 u_str ESTAB 0 0/var/run/dbus/system_bus_socket 19154 * 18163 ... ???ESTAB 0 0 127.0.0.1:ipproto-255 127.0.0.10:ipproto-9090 ???UNCONN 0 0 127.0.0.10:ipproto-255 *:* ???ESTAB 0 0 ::1:ipproto-255::1:ipproto-9091 ???UNCONN 0 0 ::1:ipproto-255 :::* --- Here I get kicked off the server. Login back [cyrill@uranus ~] ssh root@pcs7 Last login: Thu Sep 15 23:20:42 2016 from gateway [root@pcs7 ~]# cd /home/iproute2/ [root@pcs7 iproute2]# misc/ss -A raw State Recv-Q Send-QLocal Address:Port
Re: [PATCH v3] net: ip, diag -- Add diag interface for raw sockets
On 9/15/16 2:22 PM, Cyrill Gorcunov wrote: >> ss -K is not working. Socket lookup fails to find a match due to a protocol >> mismatch. >> >> haven't had time to track down why there is a mismatch since the kill uses >> the socket returned >> from the dump. Won't have time to come back to this until early next week. > > Have you ran iproute2 patched? I just ran ss -K and all sockets get closed > (including raw ones), which actually kicked me off the testing machine sshd :/ > This is the patch I applied to iproute2; the change in your goo.gl link plus a debug to confirm the kill action is initiated by ss: diff --git a/misc/ss.c b/misc/ss.c index 3b268d999426..4d98411738ea 100644 --- a/misc/ss.c +++ b/misc/ss.c @@ -2334,6 +2334,10 @@ static int show_one_inet_sock(const struct sockaddr_nl *addr, if (diag_arg->f->f && run_ssfilter(diag_arg->f->f, ) == 0) return 0; + if (diag_arg->f->kill) { +printf("want to kill:\n"); + err = inet_show_sock(h, , diag_arg->protocol); + } if (diag_arg->f->kill && kill_inet_sock(h, arg) != 0) { if (errno == EOPNOTSUPP || errno == ENOENT) { /* Socket can't be closed, or is already closed. */ @@ -2631,6 +2635,10 @@ static int raw_show(struct filter *f) dg_proto = RAW_PROTO; +if (!getenv("PROC_NET_RAW") && !getenv("PROC_ROOT") && +inet_show_netlink(f, NULL, IPPROTO_RAW) == 0) +return 0; + if (f->families&(1<
[PATCH net 2/2] bna: fix crash in bnad_get_strings()
Commit 6e7333d "net: add rx_nohandler stat counter" added the new entry rx_nohandler into struct rtnl_link_stats64. Unfortunately the bna driver foolishly depends on the structure. It uses part of it for its ethtool statistics, which is not a problem in itself, but the driver assumes the structure's size is constant as it defines a string for each existing entry. The problem occurs when the structure is extended, because the bna driver then needs to be modified as well. If it is not, any attempt to retrieve ethtool statistics results in a crash in bnad_get_strings(). The patch changes BNAD_ETHTOOL_STATS_NUM so it counts the real number of strings in the array and also removes the rtnl_link_stats64 entries that are not used in the output and are always zero. Fixes: 6e7333d "net: add rx_nohandler stat counter" Signed-off-by: Ivan Vecera --- drivers/net/ethernet/brocade/bna/bnad_ethtool.c | 50 - 1 file changed, 23 insertions(+), 27 deletions(-) diff --git a/drivers/net/ethernet/brocade/bna/bnad_ethtool.c b/drivers/net/ethernet/brocade/bna/bnad_ethtool.c index 5671353..31f61a7 100644 --- a/drivers/net/ethernet/brocade/bna/bnad_ethtool.c +++ b/drivers/net/ethernet/brocade/bna/bnad_ethtool.c @@ -34,12 +34,7 @@ #define BNAD_NUM_RXQ_COUNTERS 7 #define BNAD_NUM_TXQ_COUNTERS 5 -#define BNAD_ETHTOOL_STATS_NUM \ - (sizeof(struct rtnl_link_stats64) / sizeof(u64) + \ - sizeof(struct bnad_drv_stats) / sizeof(u64) + \ - offsetof(struct bfi_enet_stats, rxf_stats[0]) / sizeof(u64)) - -static const char *bnad_net_stats_strings[BNAD_ETHTOOL_STATS_NUM] = { +static const char *bnad_net_stats_strings[] = { "rx_packets", "tx_packets", "rx_bytes", @@ -50,22 +45,10 @@ static const char *bnad_net_stats_strings[BNAD_ETHTOOL_STATS_NUM] = { "tx_dropped", "multicast", "collisions", - "rx_length_errors", - "rx_over_errors", "rx_crc_errors", "rx_frame_errors", - "rx_fifo_errors", - "rx_missed_errors", - - "tx_aborted_errors", - "tx_carrier_errors", "tx_fifo_errors", - "tx_heartbeat_errors", - "tx_window_errors", - - "rx_compressed", - "tx_compressed",
"netif_queue_stop", "netif_queue_wakeup", @@ -254,6 +237,8 @@ static const char *bnad_net_stats_strings[BNAD_ETHTOOL_STATS_NUM] = { "fc_tx_fid_parity_errors", }; +#define BNAD_ETHTOOL_STATS_NUM ARRAY_SIZE(bnad_net_stats_strings) + static int bnad_get_settings(struct net_device *netdev, struct ethtool_cmd *cmd) { @@ -859,9 +844,9 @@ bnad_get_ethtool_stats(struct net_device *netdev, struct ethtool_stats *stats, u64 *buf) { struct bnad *bnad = netdev_priv(netdev); - int i, j, bi; + int i, j, bi = 0; unsigned long flags; - struct rtnl_link_stats64 *net_stats64; + struct rtnl_link_stats64 net_stats64; u64 *stats64; u32 bmap; @@ -876,14 +861,25 @@ bnad_get_ethtool_stats(struct net_device *netdev, struct ethtool_stats *stats, * under the same lock */ spin_lock_irqsave(>bna_lock, flags); - bi = 0; - memset(buf, 0, stats->n_stats * sizeof(u64)); - - net_stats64 = (struct rtnl_link_stats64 *)buf; - bnad_netdev_qstats_fill(bnad, net_stats64); - bnad_netdev_hwstats_fill(bnad, net_stats64); - bi = sizeof(*net_stats64) / sizeof(u64); + memset(_stats64, 0, sizeof(net_stats64)); + bnad_netdev_qstats_fill(bnad, _stats64); + bnad_netdev_hwstats_fill(bnad, _stats64); + + buf[bi++] = net_stats64.rx_packets; + buf[bi++] = net_stats64.tx_packets; + buf[bi++] = net_stats64.rx_bytes; + buf[bi++] = net_stats64.tx_bytes; + buf[bi++] = net_stats64.rx_errors; + buf[bi++] = net_stats64.tx_errors; + buf[bi++] = net_stats64.rx_dropped; + buf[bi++] = net_stats64.tx_dropped; + buf[bi++] = net_stats64.multicast; + buf[bi++] = net_stats64.collisions; + buf[bi++] = net_stats64.rx_length_errors; + buf[bi++] = net_stats64.rx_crc_errors; + buf[bi++] = net_stats64.rx_frame_errors; + buf[bi++] = net_stats64.tx_fifo_errors; /* Get netif_queue_stopped from stack */ bnad->stats.drv_stats.netif_queue_stopped = netif_queue_stopped(netdev); -- 2.7.3
[PATCH net 1/2] bna: add missing per queue ethtool stat
Commit ba5ca784 "bna: check for dma mapping errors" added, besides other things, a statistic that counts the number of DMA buffer mapping failures for each Rx queue. This counter is not included in the ethtool stats output. Fixes: ba5ca784 "bna: check for dma mapping errors" Signed-off-by: Ivan Vecera --- drivers/net/ethernet/brocade/bna/bnad_ethtool.c | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/brocade/bna/bnad_ethtool.c b/drivers/net/ethernet/brocade/bna/bnad_ethtool.c index 0e4fdc3..5671353 100644 --- a/drivers/net/ethernet/brocade/bna/bnad_ethtool.c +++ b/drivers/net/ethernet/brocade/bna/bnad_ethtool.c @@ -31,7 +31,7 @@ #define BNAD_NUM_TXF_COUNTERS 12 #define BNAD_NUM_RXF_COUNTERS 10 #define BNAD_NUM_CQ_COUNTERS (3 + 5) -#define BNAD_NUM_RXQ_COUNTERS 6 +#define BNAD_NUM_RXQ_COUNTERS 7 #define BNAD_NUM_TXQ_COUNTERS 5 #define BNAD_ETHTOOL_STATS_NUM \ @@ -658,6 +658,8 @@ bnad_get_strings(struct net_device *netdev, u32 stringset, u8 *string) string += ETH_GSTRING_LEN; sprintf(string, "rxq%d_allocbuf_failed", q_num); string += ETH_GSTRING_LEN; + sprintf(string, "rxq%d_mapbuf_failed", q_num); + string += ETH_GSTRING_LEN; sprintf(string, "rxq%d_producer_index", q_num); string += ETH_GSTRING_LEN; sprintf(string, "rxq%d_consumer_index", q_num); @@ -678,6 +680,9 @@ bnad_get_strings(struct net_device *netdev, u32 stringset, u8 *string) sprintf(string, "rxq%d_allocbuf_failed", q_num); string += ETH_GSTRING_LEN; + sprintf(string, "rxq%d_mapbuf_failed", + q_num); + string += ETH_GSTRING_LEN; sprintf(string, "rxq%d_producer_index", q_num); string += ETH_GSTRING_LEN; -- 2.7.3
[PATCH 3/3] l2tp: constify net_device_ops structures
Check for net_device_ops structures that are only stored in the netdev_ops field of a net_device structure. This field is declared const, so net_device_ops structures that have this property can be declared as const also. The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // @r disable optional_qualifier@ identifier i; position p; @@ static struct net_device_ops i@p = { ... }; @ok@ identifier r.i; struct net_device e; position p; @@ e.netdev_ops = @p; @bad@ position p != {r.p,ok.p}; identifier r.i; struct net_device_ops e; @@ e@i@p @depends on !bad disable optional_qualifier@ identifier r.i; @@ static +const struct net_device_ops i = { ... }; // The result of size on this file before the change is: text data bss dec hex filename 3401 931 44 4376 1118 net/l2tp/l2tp_eth.o and after the change it is: text data bss dec hex filename 3993 347 44 4384 1120 net/l2tp/l2tp_eth.o Signed-off-by: Julia Lawall --- net/l2tp/l2tp_eth.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/l2tp/l2tp_eth.c b/net/l2tp/l2tp_eth.c index 57fc5a4..ddb744c 100644 --- a/net/l2tp/l2tp_eth.c +++ b/net/l2tp/l2tp_eth.c @@ -121,7 +121,7 @@ static struct rtnl_link_stats64 *l2tp_eth_get_stats64(struct net_device *dev, } -static struct net_device_ops l2tp_eth_netdev_ops = { +static const struct net_device_ops l2tp_eth_netdev_ops = { .ndo_init = l2tp_eth_dev_init, .ndo_uninit = l2tp_eth_dev_uninit, .ndo_start_xmit = l2tp_eth_dev_xmit,
[PATCH 1/3] hisilicon: constify net_device_ops structures
Check for net_device_ops structures that are only stored in the netdev_ops field of a net_device structure. This field is declared const, so net_device_ops structures that have this property can be declared as const also. The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // @r disable optional_qualifier@ identifier i; position p; @@ static struct net_device_ops i@p = { ... }; @ok@ identifier r.i; struct net_device e; position p; @@ e.netdev_ops = @p; @bad@ position p != {r.p,ok.p}; identifier r.i; struct net_device_ops e; @@ e@i@p @depends on !bad disable optional_qualifier@ identifier r.i; @@ static +const struct net_device_ops i = { ... }; // The result of size on this file before the change is: text data bss dec hex filename 7995 848 8 8851 2293 drivers/net/ethernet/hisilicon/hip04_eth.o and after the change it is: text data bss dec hex filename 8571 256 8 8835 2283 drivers/net/ethernet/hisilicon/hip04_eth.o Signed-off-by: Julia Lawall --- drivers/net/ethernet/hisilicon/hip04_eth.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/hisilicon/hip04_eth.c b/drivers/net/ethernet/hisilicon/hip04_eth.c index a90ab40..415ffa1 100644 --- a/drivers/net/ethernet/hisilicon/hip04_eth.c +++ b/drivers/net/ethernet/hisilicon/hip04_eth.c @@ -761,7 +761,7 @@ static const struct ethtool_ops hip04_ethtool_ops = { .get_drvinfo= hip04_get_drvinfo, }; -static struct net_device_ops hip04_netdev_ops = { +static const struct net_device_ops hip04_netdev_ops = { .ndo_open = hip04_mac_open, .ndo_stop = hip04_mac_stop, .ndo_get_stats = hip04_get_stats,
Re: MDB offloading of local ipv4 multicast groups
On Thu, Sep 15, 2016 at 08:58:50PM +0200, John Crispin wrote:
> Hi,
>
> While adding MDB support to the qca8k dsa driver I found that ipv4 mcast
> groups don't always get propagated to the dsa driver. In my setup there
> are 2 clients connected to the switch, both running a mdns client. The
> .port_mdb_add() callback is properly called for 33:33:00:00:00:FB but
> 01:00:5E:00:00:FB never got propagated to the dsa driver.
>
> The reason is that the call to ipv4_is_local_multicast() here [1] will
> return true and the notifier is never called. Is this intentional or is
> there something missing in the code ?

I believe this is based on RFC 4541: "Packets with a destination IP (DIP)
address in the 224.0.0.X range which are not IGMP must be forwarded on all
ports."

https://tools.ietf.org/html/rfc4541

But we are missing the offloading of router ports, which is needed for the
device to correctly flood unregistered multicast packets. That is also
according to the mentioned RFC: "If a switch receives an unregistered
packet, it must forward that packet on all ports to which an IGMP router
is attached." This is implemented in br_flood_multicast(). However, the
marking is done per-port and not per-{port, VID}, and we need the latter
in case VLAN filtering is enabled.

I think Nik is working on that, but he can correct me if I'm wrong :).
The switchdev bits can be added soon after.

> John
>
> [1]
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/net/bridge/br_multicast.c?id=refs/tags/v4.8-rc6#n737
[PATCH 2/3] dwc_eth_qos: constify net_device_ops structures
Check for net_device_ops structures that are only stored in the netdev_ops
field of a net_device structure.  This field is declared const, so
net_device_ops structures that have this property can be declared as const
also.

The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@r disable optional_qualifier@
identifier i;
position p;
@@
static struct net_device_ops i@p = { ... };

@ok@
identifier r.i;
struct net_device e;
position p;
@@
e.netdev_ops = &i@p;

@bad@
position p != {r.p,ok.p};
identifier r.i;
struct net_device_ops e;
@@
e@i@p

@depends on !bad disable optional_qualifier@
identifier r.i;
@@
static
+const
 struct net_device_ops i = { ... };
// </smpl>

The result of size on this file before the change is:

   text    data     bss     dec     hex filename
  21623    1316      40   22979    59c3 drivers/net/ethernet/synopsys/dwc_eth_qos.o

and after the change it is:

   text    data     bss     dec     hex filename
  22199     724      40   22963    59b3 drivers/net/ethernet/synopsys/dwc_eth_qos.o

Signed-off-by: Julia Lawall
---
 drivers/net/ethernet/synopsys/dwc_eth_qos.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/synopsys/dwc_eth_qos.c b/drivers/net/ethernet/synopsys/dwc_eth_qos.c
index c25d971..b5c4554 100644
--- a/drivers/net/ethernet/synopsys/dwc_eth_qos.c
+++ b/drivers/net/ethernet/synopsys/dwc_eth_qos.c
@@ -2761,7 +2761,7 @@ static const struct ethtool_ops dwceqos_ethtool_ops = {
 	.set_link_ksettings = phy_ethtool_set_link_ksettings,
 };
 
-static struct net_device_ops netdev_ops = {
+static const struct net_device_ops netdev_ops = {
 	.ndo_open		= dwceqos_open,
 	.ndo_stop		= dwceqos_stop,
 	.ndo_start_xmit		= dwceqos_start_xmit,
[PATCH 0/3] constify net_device_ops structures
Constify net_device_ops structures.

---

 drivers/net/ethernet/hisilicon/hip04_eth.c  | 2 +-
 drivers/net/ethernet/synopsys/dwc_eth_qos.c | 2 +-
 net/l2tp/l2tp_eth.c                         | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)
Re: [PATCH v3] net: ip, diag -- Add diag interface for raw sockets
On 9/15/16 2:36 PM, Eric Dumazet wrote:
> On Thu, 2016-09-15 at 14:25 -0600, David Ahern wrote:
>> On 9/15/16 2:22 PM, Cyrill Gorcunov wrote:
>>>> ss -K is not working. Socket lookup fails to find a match due to a
>>>> protocol mismatch.
>>>>
>>>> haven't had time to track down why there is a mismatch since the kill
>>>> uses the socket returned from the dump. Won't have time to come back
>>>> to this until early next week.
>>>
>>> Have you ran iproute2 patched? I just ran ss -K and all sockets get
>>> closed (including raw ones), which actually kicked me off the testing
>>> machine sshd :/
>>
>> yes.
>
> And CONFIG_INET_DIAG_DESTROY is also set in your .config ?

yes

dsa@kenny:~/kernel.git$ grep INET_DIAG_DESTROY kbuild/perf/.config
CONFIG_INET_DIAG_DESTROY=y

raw_diag_destroy is getting called, but protocol is 255:

diff --git a/net/ipv4/raw_diag.c b/net/ipv4/raw_diag.c
index c730e14618ab..95542b3dad76 100644
--- a/net/ipv4/raw_diag.c
+++ b/net/ipv4/raw_diag.c
@@ -192,6 +192,11 @@ static int raw_diag_destroy(struct sk_buff *in_skb,
 	struct sock *sk;
 
 	sk = raw_sock_get(net, r);
+
+	if (r->sdiag_family == AF_INET)
+		pr_warn("raw_diag_destroy: family IPv4 protocol %d dst %pI4 src %pI4 dev %d sk %p\n",
+			r->sdiag_protocol, &r->id.idiag_dst[0],
+			&r->id.idiag_src[0], r->id.idiag_if, sk);
+
 	if (IS_ERR(sk))
 		return PTR_ERR(sk);
 	return sock_diag_destroy(sk, ECONNABORTED);

so it never finds a match to an actual raw socket:

diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 03618ed03532..6d0489629e74 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -124,9 +124,14 @@ EXPORT_SYMBOL_GPL(raw_unhash_sk);
 struct sock *__raw_v4_lookup(struct net *net, struct sock *sk,
 		unsigned short num, __be32 raddr, __be32 laddr, int dif)
 {
+	pr_warn("num %d raddr %pI4 laddr %pI4 dif %d\n", num, &raddr, &laddr, dif);
+
 	sk_for_each_from(sk) {
 		struct inet_sock *inet = inet_sk(sk);
 
+		pr_warn("sk: num %d raddr %pI4 laddr %pI4 dif %d\n",
+			inet->inet_num, &inet->inet_daddr,
+			&inet->inet_rcv_saddr, sk->sk_bound_dev_if);
+
 		if (net_eq(sock_net(sk), net) &&
		    inet->inet_num == num &&
		    !(inet->inet_daddr && inet->inet_daddr != raddr) &&
		    !(inet->inet_rcv_saddr && inet->inet_rcv_saddr != laddr) &&

so raw_abort is not called.
Re: [PATCH v3] net: ip, diag -- Add diag interface for raw sockets
On Thu, 2016-09-15 at 14:25 -0600, David Ahern wrote: > On 9/15/16 2:22 PM, Cyrill Gorcunov wrote: > >> ss -K is not working. Socket lookup fails to find a match due to a > >> protocol mismatch. > >> > >> haven't had time to track down why there is a mismatch since the kill uses > >> the socket returned > >> from the dump. Won't have time to come back to this until early next week. > > > > Have you ran iproute2 patched? I just ran ss -K and all sockets get closed > > (including raw ones), which actually kicked me off the testing machine sshd > > :/ > > yes. > And CONFIG_INET_DIAG_DESTROY is also set in your .config ?
Re: [PATCH v3] net: ip, diag -- Add diag interface for raw sockets
On 9/15/16 2:22 PM, Cyrill Gorcunov wrote: >> ss -K is not working. Socket lookup fails to find a match due to a protocol >> mismatch. >> >> haven't had time to track down why there is a mismatch since the kill uses >> the socket returned >> from the dump. Won't have time to come back to this until early next week. > > Have you ran iproute2 patched? I just ran ss -K and all sockets get closed > (including raw ones), which actually kicked me off the testing machine sshd :/ yes.
Re: [PATCH v3] net: ip, diag -- Add diag interface for raw sockets
On Thu, Sep 15, 2016 at 01:53:13PM -0600, David Ahern wrote:
> On 9/13/16 11:19 AM, Cyrill Gorcunov wrote:
> > In criu we are actively using diag interface to collect sockets
> > present in the system when dumping applications. And while for
> > unix, tcp, udp[lite], packet, netlink it works as expected,
> > the raw sockets do not have. Thus add it.
> >
> > v2:
> >  - add missing sock_put calls in raw_diag_dump_one (by eric.dumazet@)
> >  - implement @destroy for diag requests (by dsa@)
> >
> > v3:
> >  - add export of raw_abort for IPv6 (by dsa@)
> >  - pass net-admin flag into inet_sk_diag_fill due to
> >    changes in net-next branch (by dsa@)
> >
> > CC: David S. Miller
> > CC: Eric Dumazet
> > CC: David Ahern
> > CC: Alexey Kuznetsov
> > CC: James Morris
> > CC: Hideaki YOSHIFUJI
> > CC: Patrick McHardy
> > CC: Andrey Vagin
> > CC: Stephen Hemminger
> > Signed-off-by: Cyrill Gorcunov
> > ---
>
> ss -K is not working. Socket lookup fails to find a match due to a
> protocol mismatch.
>
> haven't had time to track down why there is a mismatch since the kill
> uses the socket returned from the dump. Won't have time to come back to
> this until early next week.

Have you run a patched iproute2? I just ran ss -K and all sockets got
closed (including raw ones), which actually kicked me off the testing
machine's sshd :/

	Cyrill
Re: [PATCHv4 net-next 00/15] BPF hardware offload (cls_bpf for now)
On Thu, Sep 15, 2016 at 08:12:20PM +0100, Jakub Kicinski wrote: > In the last year a lot of progress have been made on offloading > simpler TC classifiers. There is also growing interest in using > BPF for generic high-speed packet processing in the kernel. > It seems beneficial to tie those two trends together and think > about hardware offloads of BPF programs. This patch set presents > such offload to Netronome smart NICs. cls_bpf is extended with > hardware offload capabilities and NFP driver gets a JIT translator > which in presence of capable firmware can be used to offload > the BPF program onto the card. Looks great! Thanks for all the hard work.
Re: [PATCHv4 net-next 07/15] bpf: recognize 64bit immediate loads as consts
On Thu, Sep 15, 2016 at 08:12:27PM +0100, Jakub Kicinski wrote: > When running as parser interpret BPF_LD | BPF_IMM | BPF_DW > instructions as loading CONST_IMM with the value stored > in imm. The verifier will continue not recognizing those > due to concerns about search space/program complexity > increase. > > Signed-off-by: Jakub Kicinski> --- > v3: > - limit to parsers. > --- > kernel/bpf/verifier.c | 14 -- > 1 file changed, 12 insertions(+), 2 deletions(-) > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c > index d93e78331b90..f5bed7cce08d 100644 > --- a/kernel/bpf/verifier.c > +++ b/kernel/bpf/verifier.c > @@ -1766,9 +1766,19 @@ static int check_ld_imm(struct bpf_verifier_env *env, > struct bpf_insn *insn) > if (err) > return err; > > - if (insn->src_reg == 0) > - /* generic move 64-bit immediate into a register */ > + if (insn->src_reg == 0) { > + /* generic move 64-bit immediate into a register, > + * only analyzer needs to collect the ld_imm value. > + */ > + u64 imm = ((u64)(insn + 1)->imm << 32) | (u32)insn->imm; > + > + if (!env->analyzer_ops) > + return 0; the check makes sense. thanks. Acked-by: Alexei Starovoitov
Re: [PATCHv4 net-next 06/15] bpf: enable non-core use of the verifier
On Thu, Sep 15, 2016 at 08:12:26PM +0100, Jakub Kicinski wrote: > Advanced JIT compilers and translators may want to use > eBPF verifier as a base for parsers or to perform custom > checks and validations. > > Add ability for external users to invoke the verifier > and provide callbacks to be invoked for every intruction > checked. For now only add most basic callback for > per-instruction pre-interpretation checks is added. More > advanced users may also like to have per-instruction post > callback and state comparison callback. > > Signed-off-by: Jakub Kicinski> --- > v4: > - separate from the header split patch. Acked-by: Alexei Starovoitov
Re: [PATCHv4 net-next 05/15] bpf: expose internal verifier structures
On Thu, Sep 15, 2016 at 08:12:25PM +0100, Jakub Kicinski wrote: > Move verifier's internal structures to a header file and > prefix their names with bpf_ to avoid potential namespace > conflicts. Those structures will soon be used by external > analyzers. > > Signed-off-by: Jakub Kicinski> --- > v4: > - separate from adding the analyzer; > - squash with the prefixing patch. > --- > include/linux/bpf_verifier.h | 78 + > kernel/bpf/verifier.c| 263 > +-- > 2 files changed, 180 insertions(+), 161 deletions(-) > create mode 100644 include/linux/bpf_verifier.h > > diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h > new file mode 100644 > index ..1c0511ef7eaf > --- /dev/null > +++ b/include/linux/bpf_verifier.h ... > +#ifndef _LINUX_BPF_ANALYZER_H > +#define _LINUX_BPF_ANALYZER_H 1 the macro doesn't match the file name. Other than that Acked-by: Alexei Starovoitov
[PATCH net-next 2/4] ip6_tunnel: add collect_md mode to IPv6 tunnels
Similar to gre, vxlan and geneve tunnels, allow IPIP6 and IP6IP6 tunnels
to operate in 'collect metadata' mode. Unlike the IPv4 code, here it is
possible to reuse the ip6_tnl_xmit() function for both collect_md and
traditional tunnels. The bpf_skb_[gs]et_tunnel_key() helpers and ovs (in
the future) are the users.

Signed-off-by: Alexei Starovoitov
Acked-by: Thomas Graf
Acked-by: Daniel Borkmann
---
 include/net/ip6_tunnel.h |   1 +
 net/ipv6/ip6_tunnel.c    | 178 +++
 2 files changed, 134 insertions(+), 45 deletions(-)

diff --git a/include/net/ip6_tunnel.h b/include/net/ip6_tunnel.h
index 43a5a0e4524c..20ed9699fcd4 100644
--- a/include/net/ip6_tunnel.h
+++ b/include/net/ip6_tunnel.h
@@ -23,6 +23,7 @@ struct __ip6_tnl_parm {
 	__u8 proto;		/* tunnel protocol */
 	__u8 encap_limit;	/* encapsulation limit for tunnel */
 	__u8 hop_limit;		/* hop limit for tunnel */
+	bool collect_md;
 	__be32 flowinfo;	/* traffic class and flowlabel for tunnel */
 	__u32 flags;		/* tunnel flags */
 	struct in6_addr laddr;	/* local tunnel end-point address */
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 5c5779720ef1..6a66adba0c22 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -57,6 +57,7 @@
 #include
 #include
 #include
+#include
 
 MODULE_AUTHOR("Ville Nuorvala");
 MODULE_DESCRIPTION("IPv6 tunneling device");
@@ -90,6 +91,7 @@ struct ip6_tnl_net {
 	struct ip6_tnl __rcu *tnls_r_l[IP6_TUNNEL_HASH_SIZE];
 	struct ip6_tnl __rcu *tnls_wc[1];
 	struct ip6_tnl __rcu **tnls[2];
+	struct ip6_tnl __rcu *collect_md_tun;
 };
 
 static struct net_device_stats *ip6_get_stats(struct net_device *dev)
@@ -166,6 +168,10 @@ ip6_tnl_lookup(struct net *net, const struct in6_addr *remote, const struct in6_
 		return t;
 	}
 
+	t = rcu_dereference(ip6n->collect_md_tun);
+	if (t)
+		return t;
+
 	t = rcu_dereference(ip6n->tnls_wc[0]);
 	if (t && (t->dev->flags & IFF_UP))
 		return t;
@@ -209,6 +215,8 @@ ip6_tnl_link(struct ip6_tnl_net *ip6n, struct ip6_tnl *t)
 {
 	struct ip6_tnl __rcu **tp = ip6_tnl_bucket(ip6n, &t->parms);
 
+	if (t->parms.collect_md)
+		rcu_assign_pointer(ip6n->collect_md_tun, t);
 	rcu_assign_pointer(t->next, rtnl_dereference(*tp));
 	rcu_assign_pointer(*tp, t);
 }
@@ -224,6 +232,9 @@ ip6_tnl_unlink(struct ip6_tnl_net *ip6n, struct ip6_tnl *t)
 	struct ip6_tnl __rcu **tp;
 	struct ip6_tnl *iter;
 
+	if (t->parms.collect_md)
+		rcu_assign_pointer(ip6n->collect_md_tun, NULL);
+
 	for (tp = ip6_tnl_bucket(ip6n, &t->parms);
 	     (iter = rtnl_dereference(*tp)) != NULL;
 	     tp = &iter->next) {
@@ -829,6 +840,9 @@ static int __ip6_tnl_rcv(struct ip6_tnl *tunnel, struct sk_buff *skb,
 
 	skb_scrub_packet(skb, !net_eq(tunnel->net, dev_net(tunnel->dev)));
 
+	if (tun_dst)
+		skb_dst_set(skb, (struct dst_entry *)tun_dst);
+
 	gro_cells_receive(&tunnel->gro_cells, skb);
 
 	return 0;
@@ -865,6 +879,7 @@ static int ipxip6_rcv(struct sk_buff *skb, u8 ipproto,
 {
 	struct ip6_tnl *t;
 	const struct ipv6hdr *ipv6h = ipv6_hdr(skb);
+	struct metadata_dst *tun_dst = NULL;
 	int ret = -1;
 
 	rcu_read_lock();
@@ -881,7 +896,12 @@ static int ipxip6_rcv(struct sk_buff *skb, u8 ipproto,
 			goto drop;
 		if (iptunnel_pull_header(skb, 0, tpi->proto, false))
 			goto drop;
-		ret = __ip6_tnl_rcv(t, skb, tpi, NULL, dscp_ecn_decapsulate,
+		if (t->parms.collect_md) {
+			tun_dst = ipv6_tun_rx_dst(skb, 0, 0, 0);
+			if (!tun_dst)
+				return 0;
+		}
+		ret = __ip6_tnl_rcv(t, skb, tpi, tun_dst, dscp_ecn_decapsulate,
 				    log_ecn_error);
 	}
 
@@ -1012,8 +1032,16 @@ int ip6_tnl_xmit(struct sk_buff *skb, struct net_device *dev, __u8 dsfield,
 	int mtu;
 	unsigned int psh_hlen = sizeof(struct ipv6hdr) + t->encap_hlen;
 	unsigned int max_headroom = psh_hlen;
+	u8 hop_limit;
 	int err = -1;
 
+	if (t->parms.collect_md) {
+		hop_limit = skb_tunnel_info(skb)->key.ttl;
+		goto route_lookup;
+	} else {
+		hop_limit = t->parms.hop_limit;
+	}
+
 	/* NBMA tunnel */
 	if (ipv6_addr_any(&t->parms.raddr)) {
 		struct in6_addr *addr6;
@@ -1043,6 +1071,7 @@ int ip6_tnl_xmit(struct sk_buff *skb, struct net_device *dev, __u8 dsfield,
 		goto tx_err_link_failure;
 
 	if (!dst) {
+route_lookup:
 		dst = ip6_route_output(net, NULL, fl6);
 
 		if (dst->error)
@@ -1053,6
[PATCH net-next 3/4] samples/bpf: extend test_tunnel_bpf.sh with IPIP test
Extend the existing tests for vxlan, geneve and gre to include an IPIP
tunnel. The test covers both traditional tunnel configuration and dynamic
configuration via the bpf helpers.

Signed-off-by: Alexei Starovoitov
---
 samples/bpf/tcbpf2_kern.c      | 58 ++++++++++++++++++
 samples/bpf/test_tunnel_bpf.sh | 56 ++++++++++++++---
 2 files changed, 106 insertions(+), 8 deletions(-)

diff --git a/samples/bpf/tcbpf2_kern.c b/samples/bpf/tcbpf2_kern.c
index 7a15289da6cc..c1917d968fb4 100644
--- a/samples/bpf/tcbpf2_kern.c
+++ b/samples/bpf/tcbpf2_kern.c
@@ -1,4 +1,5 @@
 /* Copyright (c) 2016 VMware
+ * Copyright (c) 2016 Facebook
  *
  * This program is free software; you can redistribute it and/or
  * modify it under the terms of version 2 of the GNU General Public
@@ -188,4 +189,61 @@ int _geneve_get_tunnel(struct __sk_buff *skb)
 	return TC_ACT_OK;
 }
 
+SEC("ipip_set_tunnel")
+int _ipip_set_tunnel(struct __sk_buff *skb)
+{
+	struct bpf_tunnel_key key = {};
+	void *data = (void *)(long)skb->data;
+	struct iphdr *iph = data;
+	struct tcphdr *tcp = data + sizeof(*iph);
+	void *data_end = (void *)(long)skb->data_end;
+	int ret;
+
+	/* single length check */
+	if (data + sizeof(*iph) + sizeof(*tcp) > data_end) {
+		ERROR(1);
+		return TC_ACT_SHOT;
+	}
+
+	key.tunnel_ttl = 64;
+	if (iph->protocol == IPPROTO_ICMP) {
+		key.remote_ipv4 = 0xac100164; /* 172.16.1.100 */
+	} else {
+		if (iph->protocol != IPPROTO_TCP || iph->ihl != 5)
+			return TC_ACT_SHOT;
+
+		if (tcp->dest == htons(5200))
+			key.remote_ipv4 = 0xac100164; /* 172.16.1.100 */
+		else if (tcp->dest == htons(5201))
+			key.remote_ipv4 = 0xac100165; /* 172.16.1.101 */
+		else
+			return TC_ACT_SHOT;
+	}
+
+	ret = bpf_skb_set_tunnel_key(skb, &key, sizeof(key), 0);
+	if (ret < 0) {
+		ERROR(ret);
+		return TC_ACT_SHOT;
+	}
+
+	return TC_ACT_OK;
+}
+
+SEC("ipip_get_tunnel")
+int _ipip_get_tunnel(struct __sk_buff *skb)
+{
+	int ret;
+	struct bpf_tunnel_key key;
+	char fmt[] = "remote ip 0x%x\n";
+
+	ret = bpf_skb_get_tunnel_key(skb, &key, sizeof(key), 0);
+	if (ret < 0) {
+		ERROR(ret);
+		return TC_ACT_SHOT;
+	}
+
+	bpf_trace_printk(fmt, sizeof(fmt), key.remote_ipv4);
+	return TC_ACT_OK;
+}
+
 char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/test_tunnel_bpf.sh b/samples/bpf/test_tunnel_bpf.sh
index 4956589a83ae..1ff634f187b7 100755
--- a/samples/bpf/test_tunnel_bpf.sh
+++ b/samples/bpf/test_tunnel_bpf.sh
@@ -9,15 +9,13 @@
 # local 172.16.1.200 remote 172.16.1.100
 # veth1 IP: 172.16.1.200, tunnel dev <type>11
 
-set -e
-
 function config_device {
 	ip netns add at_ns0
 	ip link add veth0 type veth peer name veth1
 	ip link set veth0 netns at_ns0
 	ip netns exec at_ns0 ip addr add 172.16.1.100/24 dev veth0
 	ip netns exec at_ns0 ip link set dev veth0 up
-	ip link set dev veth1 up
+	ip link set dev veth1 up mtu 1500
 	ip addr add dev veth1 172.16.1.200/24
 }
 
@@ -67,6 +65,19 @@ function add_geneve_tunnel {
 	ip addr add dev $DEV 10.1.1.200/24
 }
 
+function add_ipip_tunnel {
+	# in namespace
+	ip netns exec at_ns0 \
+		ip link add dev $DEV_NS type $TYPE local 172.16.1.100 remote 172.16.1.200
+	ip netns exec at_ns0 ip link set dev $DEV_NS up
+	ip netns exec at_ns0 ip addr add dev $DEV_NS 10.1.1.100/24
+
+	# out of namespace
+	ip link add dev $DEV type $TYPE external
+	ip link set dev $DEV up
+	ip addr add dev $DEV 10.1.1.200/24
+}
+
 function attach_bpf {
 	DEV=$1
 	SET_TUNNEL=$2
@@ -85,6 +96,7 @@ function test_gre {
 	attach_bpf $DEV gre_set_tunnel gre_get_tunnel
 	ping -c 1 10.1.1.100
 	ip netns exec at_ns0 ping -c 1 10.1.1.200
+	cleanup
 }
 
 function test_vxlan {
@@ -96,6 +108,7 @@ function test_vxlan {
 	attach_bpf $DEV vxlan_set_tunnel vxlan_get_tunnel
 	ping -c 1 10.1.1.100
 	ip netns exec at_ns0 ping -c 1 10.1.1.200
+	cleanup
 }
 
 function test_geneve {
@@ -107,21 +120,48 @@ function test_geneve {
 	attach_bpf $DEV geneve_set_tunnel geneve_get_tunnel
 	ping -c 1 10.1.1.100
 	ip netns exec at_ns0 ping -c 1 10.1.1.200
+	cleanup
+}
+
+function test_ipip {
+	TYPE=ipip
+	DEV_NS=ipip00
+	DEV=ipip11
+	config_device
+	tcpdump -nei veth1 &
+	cat /sys/kernel/debug/tracing/trace_pipe &
+	add_ipip_tunnel
+	ethtool -K veth1 gso off gro off rx off tx off
+	ip link set dev veth1 mtu 1500
+	attach_bpf $DEV ipip_set_tunnel ipip_get_tunnel
+	ping -c 1 10.1.1.100
+	ip netns exec at_ns0 ping -c 1
[PATCH net-next 0/4] ip_tunnel: add collect_md mode to IPv4/IPv6 tunnels
Similar to geneve, vxlan and gre tunnels, implement 'collect metadata'
mode in ipip, ipip6 and ip6ip6 tunnels.

Alexei Starovoitov (4):
  ip_tunnel: add collect_md mode to IPIP tunnel
  ip6_tunnel: add collect_md mode to IPv6 tunnels
  samples/bpf: extend test_tunnel_bpf.sh with IPIP test
  samples/bpf: add comprehensive ipip, ipip6, ip6ip6 test

 include/net/ip6_tunnel.h       |   1 +
 include/net/ip_tunnels.h       |   2 +
 include/uapi/linux/if_tunnel.h |   1 +
 net/ipv4/ip_tunnel.c           |  76 +
 net/ipv4/ipip.c                |  35 ++
 net/ipv6/ip6_tunnel.c          | 178 --
 samples/bpf/tcbpf2_kern.c      | 190 +
 samples/bpf/test_ipip.sh       | 178 ++
 samples/bpf/test_tunnel_bpf.sh |  56 ++--
 9 files changed, 658 insertions(+), 59 deletions(-)
 create mode 100755 samples/bpf/test_ipip.sh

-- 
2.8.0
[PATCH net-next 4/4] samples/bpf: add comprehensive ipip, ipip6, ip6ip6 test
The test creates three namespaces with veth pairs connected via a bridge.
The first two namespaces simulate two different hosts with the same IPv4
and IPv6 addresses configured on the tunnel interface, and they
communicate with the outside world via standard tunnels. The third
namespace creates a collect_md tunnel that is driven by a BPF program
which selects a different remote host (either the first or the second
namespace) based on the TCP destination port number, while the TCP
destination IP stays the same. This scenario is a rough approximation of
a load balancer use case. The tests check both traditional tunnel
configuration and collect_md mode.

Signed-off-by: Alexei Starovoitov
---
 samples/bpf/tcbpf2_kern.c | 132 ++++++++++++++++++
 samples/bpf/test_ipip.sh  | 178 ++++++++++++++++++++++
 2 files changed, 310 insertions(+)
 create mode 100755 samples/bpf/test_ipip.sh

diff --git a/samples/bpf/tcbpf2_kern.c b/samples/bpf/tcbpf2_kern.c
index c1917d968fb4..3303bb85593b 100644
--- a/samples/bpf/tcbpf2_kern.c
+++ b/samples/bpf/tcbpf2_kern.c
@@ -9,12 +9,15 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
 #include
+#include
 #include "bpf_helpers.h"
 
+#define _htonl __builtin_bswap32
 #define ERROR(ret) do {\
 	char fmt[] = "ERROR line:%d ret:%d\n";\
 	bpf_trace_printk(fmt, sizeof(fmt), __LINE__, ret); \
@@ -246,4 +249,133 @@ int _ipip_get_tunnel(struct __sk_buff *skb)
 	return TC_ACT_OK;
 }
 
+SEC("ipip6_set_tunnel")
+int _ipip6_set_tunnel(struct __sk_buff *skb)
+{
+	struct bpf_tunnel_key key = {};
+	void *data = (void *)(long)skb->data;
+	struct iphdr *iph = data;
+	struct tcphdr *tcp = data + sizeof(*iph);
+	void *data_end = (void *)(long)skb->data_end;
+	int ret;
+
+	/* single length check */
+	if (data + sizeof(*iph) + sizeof(*tcp) > data_end) {
+		ERROR(1);
+		return TC_ACT_SHOT;
+	}
+
+	key.remote_ipv6[0] = _htonl(0x2401db00);
+	key.tunnel_ttl = 64;
+
+	if (iph->protocol == IPPROTO_ICMP) {
+		key.remote_ipv6[3] = _htonl(1);
+	} else {
+		if (iph->protocol != IPPROTO_TCP || iph->ihl != 5) {
+			ERROR(iph->protocol);
+			return TC_ACT_SHOT;
+		}
+
+		if (tcp->dest == htons(5200)) {
+			key.remote_ipv6[3] = _htonl(1);
+		} else if (tcp->dest == htons(5201)) {
+			key.remote_ipv6[3] = _htonl(2);
+		} else {
+			ERROR(tcp->dest);
+			return TC_ACT_SHOT;
+		}
+	}
+
+	ret = bpf_skb_set_tunnel_key(skb, &key, sizeof(key), BPF_F_TUNINFO_IPV6);
+	if (ret < 0) {
+		ERROR(ret);
+		return TC_ACT_SHOT;
+	}
+
+	return TC_ACT_OK;
+}
+
+SEC("ipip6_get_tunnel")
+int _ipip6_get_tunnel(struct __sk_buff *skb)
+{
+	int ret;
+	struct bpf_tunnel_key key;
+	char fmt[] = "remote ip6 %x::%x\n";
+
+	ret = bpf_skb_get_tunnel_key(skb, &key, sizeof(key), BPF_F_TUNINFO_IPV6);
+	if (ret < 0) {
+		ERROR(ret);
+		return TC_ACT_SHOT;
+	}
+
+	bpf_trace_printk(fmt, sizeof(fmt), _htonl(key.remote_ipv6[0]),
+			 _htonl(key.remote_ipv6[3]));
+	return TC_ACT_OK;
+}
+
+SEC("ip6ip6_set_tunnel")
+int _ip6ip6_set_tunnel(struct __sk_buff *skb)
+{
+	struct bpf_tunnel_key key = {};
+	void *data = (void *)(long)skb->data;
+	struct ipv6hdr *iph = data;
+	struct tcphdr *tcp = data + sizeof(*iph);
+	void *data_end = (void *)(long)skb->data_end;
+	int ret;
+
+	/* single length check */
+	if (data + sizeof(*iph) + sizeof(*tcp) > data_end) {
+		ERROR(1);
+		return TC_ACT_SHOT;
+	}
+
+	key.remote_ipv6[0] = _htonl(0x2401db00);
+	key.tunnel_ttl = 64;
+
+	if (iph->nexthdr == NEXTHDR_ICMP) {
+		key.remote_ipv6[3] = _htonl(1);
+	} else {
+		if (iph->nexthdr != NEXTHDR_TCP) {
+			ERROR(iph->nexthdr);
+			return TC_ACT_SHOT;
+		}
+
+		if (tcp->dest == htons(5200)) {
+			key.remote_ipv6[3] = _htonl(1);
+		} else if (tcp->dest == htons(5201)) {
+			key.remote_ipv6[3] = _htonl(2);
+		} else {
+			ERROR(tcp->dest);
+			return TC_ACT_SHOT;
+		}
+	}
+
+	ret = bpf_skb_set_tunnel_key(skb, &key, sizeof(key), BPF_F_TUNINFO_IPV6);
+	if (ret < 0) {
+		ERROR(ret);
+		return TC_ACT_SHOT;
+	}
+
+	return TC_ACT_OK;
+}
+
+SEC("ip6ip6_get_tunnel")
+int _ip6ip6_get_tunnel(struct __sk_buff *skb)
+{
+	int ret;
+	struct bpf_tunnel_key key;
+	char fmt[] = "remote ip6 %x::%x\n";
+
+	ret =
[PATCH net-next 1/4] ip_tunnel: add collect_md mode to IPIP tunnel
Similar to gre, vxlan and geneve tunnels, allow IPIP tunnels to operate
in 'collect metadata' mode. The bpf_skb_[gs]et_tunnel_key() helpers can
make use of it right away. ovs can use it as well in the future (once
appropriate ovs-vport abstractions and user apis are added). Note that,
just like in other tunnels, we cannot cache the dst, since the
tunnel_info metadata can be different for every packet.

Signed-off-by: Alexei Starovoitov
Acked-by: Thomas Graf
Acked-by: Daniel Borkmann
---
 include/net/ip_tunnels.h       |  2 ++
 include/uapi/linux/if_tunnel.h |  1 +
 net/ipv4/ip_tunnel.c           | 76 ++++++++++++++++++++++++++
 net/ipv4/ipip.c                | 35 +++++++++----
 4 files changed, 108 insertions(+), 6 deletions(-)

diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index e598c639aa6f..59557c07904b 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -255,6 +255,8 @@ void ip_tunnel_delete_net(struct ip_tunnel_net *itn, struct rtnl_link_ops *ops);
 void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev,
 		    const struct iphdr *tnl_params, const u8 protocol);
+void ip_md_tunnel_xmit(struct sk_buff *skb, struct net_device *dev,
+		       const u8 proto);
 int ip_tunnel_ioctl(struct net_device *dev, struct ip_tunnel_parm *p, int cmd);
 int __ip_tunnel_change_mtu(struct net_device *dev, int new_mtu, bool strict);
 int ip_tunnel_change_mtu(struct net_device *dev, int new_mtu);
diff --git a/include/uapi/linux/if_tunnel.h b/include/uapi/linux/if_tunnel.h
index 9865c8caedde..18d5dc13985d 100644
--- a/include/uapi/linux/if_tunnel.h
+++ b/include/uapi/linux/if_tunnel.h
@@ -73,6 +73,7 @@ enum {
 	IFLA_IPTUN_ENCAP_FLAGS,
 	IFLA_IPTUN_ENCAP_SPORT,
 	IFLA_IPTUN_ENCAP_DPORT,
+	IFLA_IPTUN_COLLECT_METADATA,
 	__IFLA_IPTUN_MAX,
 };
 #define IFLA_IPTUN_MAX	(__IFLA_IPTUN_MAX - 1)
diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
index 95649ebd2874..5719d6ba0824 100644
--- a/net/ipv4/ip_tunnel.c
+++ b/net/ipv4/ip_tunnel.c
@@ -55,6 +55,7 @@
 #include
 #include
 #include
+#include
 
 #if IS_ENABLED(CONFIG_IPV6)
 #include
@@ -546,6 +547,81 @@ static int tnl_update_pmtu(struct net_device *dev, struct sk_buff *skb,
 	return 0;
 }
 
+void ip_md_tunnel_xmit(struct sk_buff *skb, struct net_device *dev, u8 proto)
+{
+	struct ip_tunnel *tunnel = netdev_priv(dev);
+	u32 headroom = sizeof(struct iphdr);
+	struct ip_tunnel_info *tun_info;
+	const struct ip_tunnel_key *key;
+	const struct iphdr *inner_iph;
+	struct rtable *rt;
+	struct flowi4 fl4;
+	__be16 df = 0;
+	u8 tos, ttl;
+
+	tun_info = skb_tunnel_info(skb);
+	if (unlikely(!tun_info || !(tun_info->mode & IP_TUNNEL_INFO_TX) ||
+		     ip_tunnel_info_af(tun_info) != AF_INET))
+		goto tx_error;
+	key = &tun_info->key;
+	memset(&(IPCB(skb)->opt), 0, sizeof(IPCB(skb)->opt));
+	inner_iph = (const struct iphdr *)skb_inner_network_header(skb);
+	tos = key->tos;
+	if (tos == 1) {
+		if (skb->protocol == htons(ETH_P_IP))
+			tos = inner_iph->tos;
+		else if (skb->protocol == htons(ETH_P_IPV6))
+			tos = ipv6_get_dsfield((const struct ipv6hdr *)inner_iph);
+	}
+	init_tunnel_flow(&fl4, proto, key->u.ipv4.dst, key->u.ipv4.src, 0,
+			 RT_TOS(tos), tunnel->parms.link);
+	if (tunnel->encap.type != TUNNEL_ENCAP_NONE)
+		goto tx_error;
+	rt = ip_route_output_key(tunnel->net, &fl4);
+	if (IS_ERR(rt)) {
+		dev->stats.tx_carrier_errors++;
+		goto tx_error;
+	}
+	if (rt->dst.dev == dev) {
+		ip_rt_put(rt);
+		dev->stats.collisions++;
+		goto tx_error;
+	}
+	tos = ip_tunnel_ecn_encap(tos, inner_iph, skb);
+	ttl = key->ttl;
+	if (ttl == 0) {
+		if (skb->protocol == htons(ETH_P_IP))
+			ttl = inner_iph->ttl;
+		else if (skb->protocol == htons(ETH_P_IPV6))
+			ttl = ((const struct ipv6hdr *)inner_iph)->hop_limit;
+		else
+			ttl = ip4_dst_hoplimit(&rt->dst);
+	}
+	if (key->tun_flags & TUNNEL_DONT_FRAGMENT)
+		df = htons(IP_DF);
+	else if (skb->protocol == htons(ETH_P_IP))
+		df = inner_iph->frag_off & htons(IP_DF);
+	headroom += LL_RESERVED_SPACE(rt->dst.dev) + rt->dst.header_len;
+	if (headroom > dev->needed_headroom)
+		dev->needed_headroom = headroom;
+
+	if (skb_cow_head(skb, dev->needed_headroom)) {
+		ip_rt_put(rt);
+		goto tx_dropped;
+	}
+	iptunnel_xmit(NULL, rt, skb, fl4.saddr, fl4.daddr, proto, key->tos,
+		      key->ttl, df, !net_eq(tunnel->net,
Re: [PATCH v3] net: ip, diag -- Add diag interface for raw sockets
On 9/13/16 11:19 AM, Cyrill Gorcunov wrote:
> In criu we are actively using diag interface to collect sockets
> present in the system when dumping applications. And while for
> unix, tcp, udp[lite], packet, netlink it works as expected,
> the raw sockets do not have. Thus add it.
>
> v2:
>  - add missing sock_put calls in raw_diag_dump_one (by eric.dumazet@)
>  - implement @destroy for diag requests (by dsa@)
>
> v3:
>  - add export of raw_abort for IPv6 (by dsa@)
>  - pass net-admin flag into inet_sk_diag_fill due to
>    changes in net-next branch (by dsa@)
>
> CC: David S. Miller
> CC: Eric Dumazet
> CC: David Ahern
> CC: Alexey Kuznetsov
> CC: James Morris
> CC: Hideaki YOSHIFUJI
> CC: Patrick McHardy
> CC: Andrey Vagin
> CC: Stephen Hemminger
> Signed-off-by: Cyrill Gorcunov
> ---

ss -K is not working. Socket lookup fails to find a match due to a
protocol mismatch.

haven't had time to track down why there is a mismatch since the kill
uses the socket returned from the dump. Won't have time to come back to
this until early next week.
Re: [RFC v3 18/22] cgroup,landlock: Add CGRP_NO_NEW_PRIVS to handle unprivileged hooks
On 15/09/2016 06:48, Alexei Starovoitov wrote: > On Wed, Sep 14, 2016 at 09:38:16PM -0700, Andy Lutomirski wrote: >> On Wed, Sep 14, 2016 at 9:31 PM, Alexei Starovoitov >>wrote: >>> On Wed, Sep 14, 2016 at 09:08:57PM -0700, Andy Lutomirski wrote: On Wed, Sep 14, 2016 at 9:00 PM, Alexei Starovoitov wrote: > On Wed, Sep 14, 2016 at 07:27:08PM -0700, Andy Lutomirski wrote: > > This RFC handle both cgroup and seccomp approaches in a similar way. I > don't see why building on top of cgroup v2 is a problem. Is there > security issues with delegation? What I mean is: cgroup v2 delegation has a functionality problem. Tejun says [1]: We haven't had to face this decision because cgroup has never properly supported delegating to applications and the in-use setups where this happens are custom configurations where there is no boundary between system and applications and adhoc trial-and-error is good enough a way to find a working solution. That wiggle room goes away once we officially open this up to individual applications. Unless and until that changes, I think that landlock should stay away from cgroups. Others could reasonably disagree with me. >>> >>> Ours and Sargun's use cases for cgroup+lsm+bpf is not for security >>> and not for sandboxing. So the above doesn't matter in such contexts. >>> lsm hooks + cgroups provide convenient scope and existing entry points. >>> Please see checmate examples how it's used. >>> >> >> To be clear: I'm not arguing at all that there shouldn't be >> bpf+lsm+cgroup integration. I'm arguing that the unprivileged >> landlock interface shouldn't expose any cgroup integration, at least >> until the cgroup situation settles down a lot. > > ahh. yes. we're perfectly in agreement here. > I'm suggesting that the next RFC shouldn't include unpriv > and seccomp at all. Once bpf+lsm+cgroup is merged, we can > argue about unpriv with cgroups and even unpriv as a whole, > since it's not a given. Seccomp integration is also questionable. 
> I'd rather not have seccomp as a gate keeper for this lsm.
> lsm and seccomp are orthogonal hook points. Syscalls and lsm hooks
> don't have one to one relationship, so mixing them up is only
> asking for trouble further down the road.
> If we really need to carry some information from seccomp to lsm+bpf,
> it's easier to add eBPF support to seccomp and let bpf side deal
> with passing whatever information.
>
>>>> As an argument for keeping seccomp (or an extended seccomp) as the
>>>> interface for an unprivileged bpf+lsm: seccomp already checks off
>>>> most of the boxes for safely letting unprivileged programs sandbox
>>>> themselves.
>>>
>>> you mean the attach part of seccomp syscall that deals with no_new_priv?
>>> sure, that's reusable.
>>>
>>>> Furthermore, to the extent that there are use cases for unprivileged
>>>> bpf+lsm that *aren't* expressible within the seccomp hierarchy, I
>>>> suspect that syscall filters have exactly the same problem and that
>>>> we should fix seccomp to cover it.
>>>
>>> not sure what you mean by 'seccomp hierarchy'. The normal process
>>> hierarchy ?
>>
>> Kind of. I mean the filter layers that are inherited across fork(),
>> the TSYNC mechanism, etc.
>>
>>> imo the main deficiency of seccomp is inability to look into arguments.
>>> One can argue that it's a blessing, since composite args
>>> are not yet copied into the kernel memory.
>>> But in a lot of cases the seccomp arguments are FDs pointing
>>> to kernel objects and if programs could examine those objects
>>> the sandboxing scope would be more precise.
>>> lsm+bpf solves that part and I'd still argue that it's
>>> orthogonal to seccomp's pass/reject flow.
>>> I mean if seccomp says 'ok' the syscall should continue executing
>>> as normal and whatever LSM hooks were triggered by it may have
>>> their own lsm+bpf verdicts.
>>
>> I agree with all of this...
>>> Furthermore in the process hierarchy different children
>>> should be able to set their own lsm+bpf filters that are not
>>> related to parallel seccomp+bpf hierarchy of programs.
>>> seccomp syscall can be an interface to attach programs
>>> to lsm hooks, but nothing more than that.
>>
>> I'm not sure what you mean. I mean that, logically, I think we should
>> be able to do:
>>
>> seccomp(attach a syscall filter);
>> fork();
>> child does seccomp(attach some lsm filters);
>>
>> I think that they *should* be related to the seccomp+bpf hierarchy of
>> programs in that they are entries in the same logical list of filter
>> layers installed. Some of those layers can be syscall filters and
>> some of the layers can be lsm filters. If we subsequently add a way
>> to attach a
Re: [RFC v3 18/22] cgroup,landlock: Add CGRP_NO_NEW_PRIVS to handle unprivileged hooks
On 15/09/2016 03:25, Andy Lutomirski wrote:
> On Wed, Sep 14, 2016 at 3:11 PM, Mickaël Salaün wrote:
>>
>> On 14/09/2016 20:27, Andy Lutomirski wrote:
>>> On Wed, Sep 14, 2016 at 12:24 AM, Mickaël Salaün wrote:
>>>> Add a new flag CGRP_NO_NEW_PRIVS for each cgroup. This flag is
>>>> initially set for all cgroups except the root. The flag is cleared
>>>> when a new process without the no_new_privs flag is attached to the
>>>> cgroup. If a cgroup is landlocked, then any new attempt, from an
>>>> unprivileged process, to attach a process without no_new_privs to
>>>> this cgroup will be denied.
>>>
>>> Until and unless everyone can agree on a way to properly namespace,
>>> delegate, etc cgroups, I think that trying to add unprivileged
>>> semantics to cgroups is nuts. Given the big thread about cgroup v2,
>>> no-internal-tasks, etc, I just don't see how this approach can be
>>> viable.
>>
>> As far as I can tell, the no_new_privs flag of a task is not related
>> to namespaces. The CGRP_NO_NEW_PRIVS flag is only a cache to quickly
>> access the no_new_privs property of *tasks* in a cgroup. The semantic
>> is unchanged.
>>
>> Using cgroups is optional; any task could use the seccomp-based
>> landlocking instead. However, for those that want/need to manage a
>> security policy in a more dynamic way, using cgroups may make sense.
>>
>> I thought cgroup delegation was OK in v2, isn't that the case? Do you
>> have some links?
>>
>>> Can we try to make landlock work completely independently of cgroups
>>> so that it doesn't get stuck and so that programs can use it without
>>> worrying about cgroup v1 vs v2, interactions with cgroup managers,
>>> cgroup managers that (supposedly?) will start migrating processes
>>> around piecemeal and almost certainly blowing up landlock in the
>>> process, etc?
>>
>> This RFC handles both cgroup and seccomp approaches in a similar way.
>> I don't see why building on top of cgroup v2 is a problem. Are there
>> security issues with delegation?
>
> What I mean is: cgroup v2 delegation has a functionality problem.
> Tejun says [1]:
>
> We haven't had to face this decision because cgroup has never properly
> supported delegating to applications and the in-use setups where this
> happens are custom configurations where there is no boundary between
> system and applications and adhoc trial-and-error is good enough a way
> to find a working solution. That wiggle room goes away once we
> officially open this up to individual applications.
>
> Unless and until that changes, I think that landlock should stay away
> from cgroups. Others could reasonably disagree with me.
>
> [1] https://lkml.kernel.org/r/20160909225747.ga30...@mtj.duckdns.org

I don't get the same echo here:
https://lkml.kernel.org/r/20160826155026.gd16...@mtj.duckdns.org

On 26/08/2016 17:50, Tejun Heo wrote:
> Please refer to "2-5. Delegation" of Documentation/cgroup-v2.txt.
> Delegation on v1 is broken on both core and specific controller
> behaviors and thus discouraged. On v2, delegation should work just
> fine.

Tejun, could you please clarify if there is still a problem with cgroup
v2 delegation?

This patch only implements a cache mechanism with the CGRP_NO_NEW_PRIVS
flag. If cgroups can group processes correctly, I don't see any
(security) issue here. It's the administrator's choice to delegate a
part of the cgroup management. It's then the delegatee's responsibility
to correctly put processes in cgroups. This is comparable to a process
which is responsible for correctly calling seccomp(2).

Mickaël
[PATCHv4 net-next 15/15] nfp: bpf: add offload of TC direct action mode
Add offload of TC in direct action mode. We just need to provide appropriate checks in the verifier and a new outro block to translate the exit codes to what data path expects Signed-off-by: Jakub Kicinski--- drivers/net/ethernet/netronome/nfp/nfp_bpf.h | 1 + drivers/net/ethernet/netronome/nfp/nfp_bpf_jit.c | 66 ++ .../net/ethernet/netronome/nfp/nfp_bpf_verifier.c | 11 +++- .../net/ethernet/netronome/nfp/nfp_net_offload.c | 6 +- 4 files changed, 82 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_bpf.h b/drivers/net/ethernet/netronome/nfp/nfp_bpf.h index 378d3c35cad5..25e9bd885f6f 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_bpf.h +++ b/drivers/net/ethernet/netronome/nfp/nfp_bpf.h @@ -61,6 +61,7 @@ enum static_regs { enum nfp_bpf_action_type { NN_ACT_TC_DROP, NN_ACT_TC_REDIR, + NN_ACT_DIRECT, }; /* Software register representation, hardware encoding in asm.h */ diff --git a/drivers/net/ethernet/netronome/nfp/nfp_bpf_jit.c b/drivers/net/ethernet/netronome/nfp/nfp_bpf_jit.c index 60a99e0bf459..ca741763b7e9 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_bpf_jit.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_bpf_jit.c @@ -321,6 +321,16 @@ __emit_br(struct nfp_prog *nfp_prog, enum br_mask mask, enum br_ev_pip ev_pip, nfp_prog_push(nfp_prog, insn); } +static void emit_br_def(struct nfp_prog *nfp_prog, u16 addr, u8 defer) +{ + if (defer > 2) { + pr_err("BUG: branch defer out of bounds %d\n", defer); + nfp_prog->error = -EFAULT; + return; + } + __emit_br(nfp_prog, BR_UNC, BR_EV_PIP_UNCOND, BR_CSS_NONE, addr, defer); +} + static void emit_br(struct nfp_prog *nfp_prog, enum br_mask mask, u16 addr, u8 defer) { @@ -1465,9 +1475,65 @@ static void nfp_outro_tc_legacy(struct nfp_prog *nfp_prog) SHF_SC_L_SHF, 16); } +static void nfp_outro_tc_da(struct nfp_prog *nfp_prog) +{ + /* TC direct-action mode: +* 0,1 okNOT SUPPORTED[1] +* 2 drop 0x22 -> drop, count as stat1 +* 4,5 nuke 0x02 -> drop +* 7 redir 0x44 -> redir, count as stat2 +* * 
unspec 0x11 -> pass, count as stat0 +* +* [1] We can't support OK and RECLASSIFY because we can't tell TC +* the exact decision made. We are forced to support UNSPEC +* to handle aborts so that's the only one we handle for passing +* packets up the stack. +*/ + /* Target for aborts */ + nfp_prog->tgt_abort = nfp_prog_current_offset(nfp_prog); + + emit_br_def(nfp_prog, nfp_prog->tgt_done, 2); + + emit_alu(nfp_prog, reg_a(0), +reg_none(), ALU_OP_NONE, NFP_BPF_ABI_FLAGS); + emit_ld_field(nfp_prog, reg_a(0), 0xc, reg_imm(0x11), SHF_SC_L_SHF, 16); + + /* Target for normal exits */ + nfp_prog->tgt_out = nfp_prog_current_offset(nfp_prog); + + /* if R0 > 7 jump to abort */ + emit_alu(nfp_prog, reg_none(), reg_imm(7), ALU_OP_SUB, reg_b(0)); + emit_br(nfp_prog, BR_BLO, nfp_prog->tgt_abort, 0); + emit_alu(nfp_prog, reg_a(0), +reg_none(), ALU_OP_NONE, NFP_BPF_ABI_FLAGS); + + wrp_immed(nfp_prog, reg_b(2), 0x41221211); + wrp_immed(nfp_prog, reg_b(3), 0x41001211); + + emit_shf(nfp_prog, reg_a(1), +reg_none(), SHF_OP_NONE, reg_b(0), SHF_SC_L_SHF, 2); + + emit_alu(nfp_prog, reg_none(), reg_a(1), ALU_OP_OR, reg_imm(0)); + emit_shf(nfp_prog, reg_a(2), +reg_imm(0xf), SHF_OP_AND, reg_b(2), SHF_SC_R_SHF, 0); + + emit_alu(nfp_prog, reg_none(), reg_a(1), ALU_OP_OR, reg_imm(0)); + emit_shf(nfp_prog, reg_b(2), +reg_imm(0xf), SHF_OP_AND, reg_b(3), SHF_SC_R_SHF, 0); + + emit_br_def(nfp_prog, nfp_prog->tgt_done, 2); + + emit_shf(nfp_prog, reg_b(2), +reg_a(2), SHF_OP_OR, reg_b(2), SHF_SC_L_SHF, 4); + emit_ld_field(nfp_prog, reg_a(0), 0xc, reg_b(2), SHF_SC_L_SHF, 16); +} + static void nfp_outro(struct nfp_prog *nfp_prog) { switch (nfp_prog->act) { + case NN_ACT_DIRECT: + nfp_outro_tc_da(nfp_prog); + break; case NN_ACT_TC_DROP: case NN_ACT_TC_REDIR: nfp_outro_tc_legacy(nfp_prog); diff --git a/drivers/net/ethernet/netronome/nfp/nfp_bpf_verifier.c b/drivers/net/ethernet/netronome/nfp/nfp_bpf_verifier.c index 15c460964810..af2d7c8bd8bf 100644 --- 
a/drivers/net/ethernet/netronome/nfp/nfp_bpf_verifier.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_bpf_verifier.c @@ -86,7 +86,16 @@ nfp_bpf_check_exit(struct nfp_prog *nfp_prog, return -EINVAL; } - if (reg0->imm != 0 && (reg0->imm & ~0U) != ~0U) { + if (nfp_prog->act != NN_ACT_DIRECT && + reg0->imm != 0 && (reg0->imm & ~0U) != ~0U) { +
[PATCHv4 net-next 03/15] net: cls_bpf: add support for marking filters as hardware-only
Add cls_bpf support for the TCA_CLS_FLAGS_SKIP_SW flag. Signed-off-by: Jakub KicinskiAcked-by: Daniel Borkmann --- net/sched/cls_bpf.c | 34 +- 1 file changed, 25 insertions(+), 9 deletions(-) diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c index 1ae5b6798363..1aad314089e9 100644 --- a/net/sched/cls_bpf.c +++ b/net/sched/cls_bpf.c @@ -28,7 +28,7 @@ MODULE_DESCRIPTION("TC BPF based classifier"); #define CLS_BPF_NAME_LEN 256 #define CLS_BPF_SUPPORTED_GEN_FLAGS\ - TCA_CLS_FLAGS_SKIP_HW + (TCA_CLS_FLAGS_SKIP_HW | TCA_CLS_FLAGS_SKIP_SW) struct cls_bpf_head { struct list_head plist; @@ -98,7 +98,9 @@ static int cls_bpf_classify(struct sk_buff *skb, const struct tcf_proto *tp, qdisc_skb_cb(skb)->tc_classid = prog->res.classid; - if (at_ingress) { + if (tc_skip_sw(prog->gen_flags)) { + filter_res = prog->exts_integrated ? TC_ACT_UNSPEC : 0; + } else if (at_ingress) { /* It is safe to push/pull even if skb_shared() */ __skb_push(skb, skb->mac_len); bpf_compute_data_end(skb); @@ -166,32 +168,42 @@ static int cls_bpf_offload_cmd(struct tcf_proto *tp, struct cls_bpf_prog *prog, tp->protocol, ); } -static void cls_bpf_offload(struct tcf_proto *tp, struct cls_bpf_prog *prog, - struct cls_bpf_prog *oldprog) +static int cls_bpf_offload(struct tcf_proto *tp, struct cls_bpf_prog *prog, + struct cls_bpf_prog *oldprog) { struct net_device *dev = tp->q->dev_queue->dev; struct cls_bpf_prog *obj = prog; enum tc_clsbpf_command cmd; + bool skip_sw; + int ret; + + skip_sw = tc_skip_sw(prog->gen_flags) || + (oldprog && tc_skip_sw(oldprog->gen_flags)); if (oldprog && oldprog->offloaded) { if (tc_should_offload(dev, tp, prog->gen_flags)) { cmd = TC_CLSBPF_REPLACE; - } else { + } else if (!tc_skip_sw(prog->gen_flags)) { obj = oldprog; cmd = TC_CLSBPF_DESTROY; + } else { + return -EINVAL; } } else { if (!tc_should_offload(dev, tp, prog->gen_flags)) - return; + return skip_sw ? 
-EINVAL : 0; cmd = TC_CLSBPF_ADD; } - if (cls_bpf_offload_cmd(tp, obj, cmd)) - return; + ret = cls_bpf_offload_cmd(tp, obj, cmd); + if (ret) + return skip_sw ? ret : 0; obj->offloaded = true; if (oldprog) oldprog->offloaded = false; + + return 0; } static void cls_bpf_stop_offload(struct tcf_proto *tp, @@ -499,7 +511,11 @@ static int cls_bpf_change(struct net *net, struct sk_buff *in_skb, if (ret < 0) goto errout; - cls_bpf_offload(tp, prog, oldprog); + ret = cls_bpf_offload(tp, prog, oldprog); + if (ret) { + cls_bpf_delete_prog(tp, prog); + return ret; + } if (oldprog) { list_replace_rcu(>link, >link); -- 1.9.1
[PATCHv4 net-next 14/15] nfp: bpf: add support for legacy redirect action
Data path has redirect support so expressing redirect to the port frame came from is a trivial matter of setting the right result code. Signed-off-by: Jakub Kicinski--- drivers/net/ethernet/netronome/nfp/nfp_bpf.h | 1 + drivers/net/ethernet/netronome/nfp/nfp_bpf_jit.c | 2 ++ drivers/net/ethernet/netronome/nfp/nfp_net_offload.c | 4 3 files changed, 7 insertions(+) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_bpf.h b/drivers/net/ethernet/netronome/nfp/nfp_bpf.h index d550fbc4768a..378d3c35cad5 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_bpf.h +++ b/drivers/net/ethernet/netronome/nfp/nfp_bpf.h @@ -60,6 +60,7 @@ enum static_regs { enum nfp_bpf_action_type { NN_ACT_TC_DROP, + NN_ACT_TC_REDIR, }; /* Software register representation, hardware encoding in asm.h */ diff --git a/drivers/net/ethernet/netronome/nfp/nfp_bpf_jit.c b/drivers/net/ethernet/netronome/nfp/nfp_bpf_jit.c index 42a8afb67fc8..60a99e0bf459 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_bpf_jit.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_bpf_jit.c @@ -1440,6 +1440,7 @@ static void nfp_outro_tc_legacy(struct nfp_prog *nfp_prog) { const u8 act2code[] = { [NN_ACT_TC_DROP] = 0x22, + [NN_ACT_TC_REDIR] = 0x24 }; /* Target for aborts */ nfp_prog->tgt_abort = nfp_prog_current_offset(nfp_prog); @@ -1468,6 +1469,7 @@ static void nfp_outro(struct nfp_prog *nfp_prog) { switch (nfp_prog->act) { case NN_ACT_TC_DROP: + case NN_ACT_TC_REDIR: nfp_outro_tc_legacy(nfp_prog); break; } diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_offload.c b/drivers/net/ethernet/netronome/nfp/nfp_net_offload.c index 0537a53e2174..1ec8e5b74651 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_offload.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_offload.c @@ -123,6 +123,10 @@ nfp_net_bpf_get_act(struct nfp_net *nn, struct tc_cls_bpf_offload *cls_bpf) list_for_each_entry(a, , list) { if (is_tcf_gact_shot(a)) return NN_ACT_TC_DROP; + + if (is_tcf_mirred_redirect(a) && + tcf_mirred_ifindex(a) == 
nn->netdev->ifindex) + return NN_ACT_TC_REDIR; } return -ENOTSUPP; -- 1.9.1
[PATCHv4 net-next 01/15] net: cls_bpf: add hardware offload
This patch adds hardware offload capability to cls_bpf classifier, similar to what have been done with U32 and flower. Signed-off-by: Jakub KicinskiAcked-by: Daniel Borkmann --- v3: - s/filter/prog/ in struct tc_cls_bpf_offload. v2: - drop unnecessary WARN_ON; - reformat error handling a bit. --- include/linux/netdevice.h | 2 ++ include/net/pkt_cls.h | 14 ++ net/sched/cls_bpf.c | 70 +++ 3 files changed, 86 insertions(+) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 2095b6ab3661..3c50db29a114 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -789,6 +789,7 @@ enum { TC_SETUP_CLSU32, TC_SETUP_CLSFLOWER, TC_SETUP_MATCHALL, + TC_SETUP_CLSBPF, }; struct tc_cls_u32_offload; @@ -800,6 +801,7 @@ struct tc_to_netdev { struct tc_cls_u32_offload *cls_u32; struct tc_cls_flower_offload *cls_flower; struct tc_cls_matchall_offload *cls_mall; + struct tc_cls_bpf_offload *cls_bpf; }; }; diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h index a459be5fe1c2..41e8071dff87 100644 --- a/include/net/pkt_cls.h +++ b/include/net/pkt_cls.h @@ -486,4 +486,18 @@ struct tc_cls_matchall_offload { unsigned long cookie; }; +enum tc_clsbpf_command { + TC_CLSBPF_ADD, + TC_CLSBPF_REPLACE, + TC_CLSBPF_DESTROY, +}; + +struct tc_cls_bpf_offload { + enum tc_clsbpf_command command; + struct tcf_exts *exts; + struct bpf_prog *prog; + const char *name; + bool exts_integrated; +}; + #endif diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c index 4742f415ee5b..c3983493aeab 100644 --- a/net/sched/cls_bpf.c +++ b/net/sched/cls_bpf.c @@ -39,6 +39,7 @@ struct cls_bpf_prog { struct list_head link; struct tcf_result res; bool exts_integrated; + bool offloaded; struct tcf_exts exts; u32 handle; union { @@ -140,6 +141,71 @@ static bool cls_bpf_is_ebpf(const struct cls_bpf_prog *prog) return !prog->bpf_ops; } +static int cls_bpf_offload_cmd(struct tcf_proto *tp, struct cls_bpf_prog *prog, + enum tc_clsbpf_command cmd) +{ + struct net_device *dev = 
tp->q->dev_queue->dev; + struct tc_cls_bpf_offload bpf_offload = {}; + struct tc_to_netdev offload; + + offload.type = TC_SETUP_CLSBPF; + offload.cls_bpf = _offload; + + bpf_offload.command = cmd; + bpf_offload.exts = >exts; + bpf_offload.prog = prog->filter; + bpf_offload.name = prog->bpf_name; + bpf_offload.exts_integrated = prog->exts_integrated; + + return dev->netdev_ops->ndo_setup_tc(dev, tp->q->handle, +tp->protocol, ); +} + +static void cls_bpf_offload(struct tcf_proto *tp, struct cls_bpf_prog *prog, + struct cls_bpf_prog *oldprog) +{ + struct net_device *dev = tp->q->dev_queue->dev; + struct cls_bpf_prog *obj = prog; + enum tc_clsbpf_command cmd; + + if (oldprog && oldprog->offloaded) { + if (tc_should_offload(dev, tp, 0)) { + cmd = TC_CLSBPF_REPLACE; + } else { + obj = oldprog; + cmd = TC_CLSBPF_DESTROY; + } + } else { + if (!tc_should_offload(dev, tp, 0)) + return; + cmd = TC_CLSBPF_ADD; + } + + if (cls_bpf_offload_cmd(tp, obj, cmd)) + return; + + obj->offloaded = true; + if (oldprog) + oldprog->offloaded = false; +} + +static void cls_bpf_stop_offload(struct tcf_proto *tp, +struct cls_bpf_prog *prog) +{ + int err; + + if (!prog->offloaded) + return; + + err = cls_bpf_offload_cmd(tp, prog, TC_CLSBPF_DESTROY); + if (err) { + pr_err("Stopping hardware offload failed: %d\n", err); + return; + } + + prog->offloaded = false; +} + static int cls_bpf_init(struct tcf_proto *tp) { struct cls_bpf_head *head; @@ -179,6 +245,7 @@ static int cls_bpf_delete(struct tcf_proto *tp, unsigned long arg) { struct cls_bpf_prog *prog = (struct cls_bpf_prog *) arg; + cls_bpf_stop_offload(tp, prog); list_del_rcu(>link); tcf_unbind_filter(tp, >res); call_rcu(>rcu, __cls_bpf_delete_prog); @@ -195,6 +262,7 @@ static bool cls_bpf_destroy(struct tcf_proto *tp, bool force) return false; list_for_each_entry_safe(prog, tmp, >plist, link) { + cls_bpf_stop_offload(tp, prog); list_del_rcu(>link); tcf_unbind_filter(tp, >res); call_rcu(>rcu, __cls_bpf_delete_prog); @@ -416,6 +484,8
[PATCHv4 net-next 00/15] BPF hardware offload (cls_bpf for now)
Hi!

Dave, this set depends on bitfield.h which is sitting in the pull
request from Kalle. I'm expecting buildbot to complain about patch 8;
please pull wireless-drivers-next before applying.

v4:
 - rename parser -> analyzer;
 - reorganize the analyzer patches a bit;
 - use bitfield.h directly.

---
merge blurb:

In the last year a lot of progress has been made on offloading simpler
TC classifiers. There is also growing interest in using BPF for generic
high-speed packet processing in the kernel. It seems beneficial to tie
those two trends together and think about hardware offloads of BPF
programs. This patch set presents such an offload to Netronome smart
NICs. cls_bpf is extended with hardware offload capabilities and the
NFP driver gets a JIT translator which, in the presence of capable
firmware, can be used to offload the BPF program onto the card.

The BPF JIT implementation is not 100% complete (e.g. missing
instructions) but it is functional. Encouragingly, it should be
possible to offload most (if not all) advanced BPF features onto the
NIC - including packet modification, maps, tunnel encap/decap etc.

Example of basic tests I used:

  __section_cls_entry
  int cls_entry(struct __sk_buff *skb)
  {
	if (load_byte(skb, 0) != 0x0)
		return 0;
	if (load_byte(skb, 4) != 0x1)
		return 0;
	skb->mark = 0xcafe;
	if (load_byte(skb, 50) != 0xff)
		return 0;
	return ~0U;
  }

Above code can be compiled with Clang and loaded like this:

  # ethtool -K p1p1 hw-tc-offload on
  # tc qdisc add dev p1p1 ingress
  # tc filter add dev p1p1 parent : bpf obj prog.o action drop

This set implements the basic transparent offload, the skip_{sw,hw}
flags and reporting statistics for cls_bpf.
Jakub Kicinski (15):
  net: cls_bpf: add hardware offload
  net: cls_bpf: limit hardware offload by software-only flag
  net: cls_bpf: add support for marking filters as hardware-only
  bpf: don't (ab)use instructions to store state
  bpf: expose internal verifier structures
  bpf: enable non-core use of the verifier
  bpf: recognize 64bit immediate loads as consts
  nfp: add BPF to NFP code translator
  nfp: bpf: add hardware bpf offload
  net: cls_bpf: allow offloaded filters to update stats
  nfp: bpf: allow offloaded filters to update stats
  nfp: bpf: add packet marking support
  net: act_mirred: allow statistic updates from offloaded actions
  nfp: bpf: add support for legacy redirect action
  nfp: bpf: add offload of TC direct action mode

 drivers/net/ethernet/netronome/nfp/Makefile        |    7 +
 drivers/net/ethernet/netronome/nfp/nfp_asm.h       |  233 ++
 drivers/net/ethernet/netronome/nfp/nfp_bpf.h       |  212 ++
 drivers/net/ethernet/netronome/nfp/nfp_bpf_jit.c   | 1816 ++++++++++
 .../net/ethernet/netronome/nfp/nfp_bpf_verifier.c  |  160 ++
 drivers/net/ethernet/netronome/nfp/nfp_net.h       |   47 +-
 .../net/ethernet/netronome/nfp/nfp_net_common.c    |  134 +-
 drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h  |   51 +-
 .../net/ethernet/netronome/nfp/nfp_net_ethtool.c   |   12 +
 .../net/ethernet/netronome/nfp/nfp_net_offload.c   |  291 ++
 .../net/ethernet/netronome/nfp/nfp_netvf_main.c    |    2 +-
 include/linux/bpf_verifier.h                       |   89 +
 include/linux/netdevice.h                          |    2 +
 include/net/pkt_cls.h                              |   16 +
 include/uapi/linux/pkt_cls.h                       |    1 +
 kernel/bpf/verifier.c                              |  384 ++--
 net/sched/act_mirred.c                             |    8 +
 net/sched/cls_bpf.c                                |  117 +-
 18 files changed, 3376 insertions(+), 206 deletions(-)
 create mode 100644 drivers/net/ethernet/netronome/nfp/nfp_asm.h
 create mode 100644 drivers/net/ethernet/netronome/nfp/nfp_bpf.h
 create mode 100644 drivers/net/ethernet/netronome/nfp/nfp_bpf_jit.c
 create mode 100644 drivers/net/ethernet/netronome/nfp/nfp_bpf_verifier.c
 create mode 100644 drivers/net/ethernet/netronome/nfp/nfp_net_offload.c
 create mode 100644 include/linux/bpf_verifier.h

-- 
1.9.1
[PATCHv4 net-next 13/15] net: act_mirred: allow statistic updates from offloaded actions
Implement .stats_update() callback. The implementation is generic and can be reused by other simple actions if needed. Signed-off-by: Jakub Kicinski--- net/sched/act_mirred.c | 8 1 file changed, 8 insertions(+) diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c index 6038c85d92f5..f9862d89cb93 100644 --- a/net/sched/act_mirred.c +++ b/net/sched/act_mirred.c @@ -204,6 +204,13 @@ static int tcf_mirred(struct sk_buff *skb, const struct tc_action *a, return retval; } +static void tcf_stats_update(struct tc_action *a, u64 bytes, u32 packets, +u64 lastuse) +{ + tcf_lastuse_update(>tcfa_tm); + _bstats_cpu_update(this_cpu_ptr(a->cpu_bstats), bytes, packets); +} + static int tcf_mirred_dump(struct sk_buff *skb, struct tc_action *a, int bind, int ref) { unsigned char *b = skb_tail_pointer(skb); @@ -280,6 +287,7 @@ static struct tc_action_ops act_mirred_ops = { .type = TCA_ACT_MIRRED, .owner = THIS_MODULE, .act= tcf_mirred, + .stats_update = tcf_stats_update, .dump = tcf_mirred_dump, .cleanup= tcf_mirred_release, .init = tcf_mirred_init, -- 1.9.1
[PATCHv4 net-next 08/15] nfp: add BPF to NFP code translator
Add translator for JITing eBPF to operations which can be executed on NFP's programmable engines. Signed-off-by: Jakub Kicinski--- v4: - use bitfield.h directly. v3: - don't clone the program for the verifier (no longer needed); - temporarily add a local copy of macros from bitfield.h. NOTE: this one will probably trigger buildbot failures because it depends on pull request from wireless-drivers-next. --- drivers/net/ethernet/netronome/nfp/Makefile|6 + drivers/net/ethernet/netronome/nfp/nfp_asm.h | 233 +++ drivers/net/ethernet/netronome/nfp/nfp_bpf.h | 208 +++ drivers/net/ethernet/netronome/nfp/nfp_bpf_jit.c | 1729 .../net/ethernet/netronome/nfp/nfp_bpf_verifier.c | 151 ++ 5 files changed, 2327 insertions(+) create mode 100644 drivers/net/ethernet/netronome/nfp/nfp_asm.h create mode 100644 drivers/net/ethernet/netronome/nfp/nfp_bpf.h create mode 100644 drivers/net/ethernet/netronome/nfp/nfp_bpf_jit.c create mode 100644 drivers/net/ethernet/netronome/nfp/nfp_bpf_verifier.c diff --git a/drivers/net/ethernet/netronome/nfp/Makefile b/drivers/net/ethernet/netronome/nfp/Makefile index 68178819ff12..5f12689bf523 100644 --- a/drivers/net/ethernet/netronome/nfp/Makefile +++ b/drivers/net/ethernet/netronome/nfp/Makefile @@ -5,4 +5,10 @@ nfp_netvf-objs := \ nfp_net_ethtool.o \ nfp_netvf_main.o +ifeq ($(CONFIG_BPF_SYSCALL),y) +nfp_netvf-objs += \ + nfp_bpf_verifier.o \ + nfp_bpf_jit.o +endif + nfp_netvf-$(CONFIG_NFP_NET_DEBUG) += nfp_net_debugfs.o diff --git a/drivers/net/ethernet/netronome/nfp/nfp_asm.h b/drivers/net/ethernet/netronome/nfp/nfp_asm.h new file mode 100644 index ..22484b6fd3e8 --- /dev/null +++ b/drivers/net/ethernet/netronome/nfp/nfp_asm.h @@ -0,0 +1,233 @@ +/* + * Copyright (C) 2016 Netronome Systems, Inc. + * + * This software is dual licensed under the GNU General License Version 2, + * June 1991 as shown in the file COPYING in the top-level directory of this + * source tree or the BSD 2-Clause License provided below. 
You have the + * option to license this software under the complete terms of either license. + * + * The BSD 2-Clause License: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * 1. Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * 2. Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#ifndef __NFP_ASM_H__ +#define __NFP_ASM_H__ 1 + +#include "nfp_bpf.h" + +#define REG_NONE 0 + +#define RE_REG_NO_DST 0x020 +#define RE_REG_IMM 0x020 +#define RE_REG_IMM_encode(x) \ + (RE_REG_IMM | ((x) & 0x1f) | (((x) & 0x60) << 1)) +#define RE_REG_IMM_MAX 0x07fULL +#define RE_REG_XFR 0x080 + +#define UR_REG_XFR 0x180 +#define UR_REG_NN 0x280 +#define UR_REG_NO_DST 0x300 +#define UR_REG_IMM UR_REG_NO_DST +#define UR_REG_IMM_encode(x) (UR_REG_IMM | (x)) +#define UR_REG_IMM_MAX 0x0ffULL + +#define OP_BR_BASE 0x0d80020ULL +#define OP_BR_BASE_MASK0x0f8000c3ce0ULL +#define OP_BR_MASK 0x01fULL +#define OP_BR_EV_PIP 0x300ULL +#define OP_BR_CSS 0x003c000ULL +#define OP_BR_DEFBR0x030ULL +#define OP_BR_ADDR_LO 0x007ffc0ULL +#define OP_BR_ADDR_HI 0x100ULL + +#define nfp_is_br(_insn) \ + (((_insn) & OP_BR_BASE_MASK) == OP_BR_BASE) + +enum br_mask { + BR_BEQ = 0x00, + BR_BNE = 0x01, + BR_BHS = 0x04, + BR_BLO = 0x05, + BR_BGE = 0x08, + BR_UNC = 0x18, +}; + +enum br_ev_pip { + BR_EV_PIP_UNCOND = 0, + BR_EV_PIP_COND = 1, +}; + +enum br_ctx_signal_state { + BR_CSS_NONE = 2, +}; + +#define OP_BBYTE_BASE 0x0c8ULL +#define OP_BB_A_SRC0x0ffULL +#define OP_BB_BYTE 0x300ULL +#define OP_BB_B_SRC0x003fc00ULL +#define OP_BB_I8 0x004ULL +#define OP_BB_EQ 0x008ULL +#define OP_BB_DEFBR0x030ULL +#define OP_BB_ADDR_LO 0x007ffc0ULL
[PATCHv4 net-next 06/15] bpf: enable non-core use of the verifier
Advanced JIT compilers and translators may want to use eBPF verifier as a base for parsers or to perform custom checks and validations. Add ability for external users to invoke the verifier and provide callbacks to be invoked for every intruction checked. For now only add most basic callback for per-instruction pre-interpretation checks is added. More advanced users may also like to have per-instruction post callback and state comparison callback. Signed-off-by: Jakub Kicinski--- v4: - separate from the header split patch. --- include/linux/bpf_verifier.h | 11 +++ kernel/bpf/verifier.c| 68 2 files changed, 79 insertions(+) diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h index 1c0511ef7eaf..e3de907d5bf6 100644 --- a/include/linux/bpf_verifier.h +++ b/include/linux/bpf_verifier.h @@ -59,6 +59,12 @@ struct bpf_insn_aux_data { #define MAX_USED_MAPS 64 /* max number of maps accessed by one eBPF program */ +struct bpf_verifier_env; +struct bpf_ext_analyzer_ops { + int (*insn_hook)(struct bpf_verifier_env *env, +int insn_idx, int prev_insn_idx); +}; + /* single container for all structs * one verifier_env per bpf_check() call */ @@ -68,6 +74,8 @@ struct bpf_verifier_env { int stack_size; /* number of states to be processed */ struct bpf_verifier_state cur_state; /* current verifier state */ struct bpf_verifier_state_list **explored_states; /* search pruning optimization */ + const struct bpf_ext_analyzer_ops *analyzer_ops; /* external analyzer ops */ + void *analyzer_priv; /* pointer to external analyzer's private data */ struct bpf_map *used_maps[MAX_USED_MAPS]; /* array of map's used by eBPF program */ u32 used_map_cnt; /* number of used maps */ u32 id_gen; /* used to generate unique reg IDs */ @@ -75,4 +83,7 @@ struct bpf_verifier_env { struct bpf_insn_aux_data *insn_aux_data; /* array of per-insn state */ }; +int bpf_analyzer(struct bpf_prog *prog, const struct bpf_ext_analyzer_ops *ops, +void *priv); + #endif /* _LINUX_BPF_ANALYZER_H */ diff 
--git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 6e126a417290..d93e78331b90 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -624,6 +624,10 @@ static int check_packet_access(struct bpf_verifier_env *env, u32 regno, int off, static int check_ctx_access(struct bpf_verifier_env *env, int off, int size, enum bpf_access_type t, enum bpf_reg_type *reg_type) { + /* for analyzer ctx accesses are already validated and converted */ + if (env->analyzer_ops) + return 0; + if (env->prog->aux->ops->is_valid_access && env->prog->aux->ops->is_valid_access(off, size, t, reg_type)) { /* remember the offset of last byte accessed in ctx */ @@ -,6 +2226,15 @@ static int is_state_visited(struct bpf_verifier_env *env, int insn_idx) return 0; } +static int ext_analyzer_insn_hook(struct bpf_verifier_env *env, + int insn_idx, int prev_insn_idx) +{ + if (!env->analyzer_ops || !env->analyzer_ops->insn_hook) + return 0; + + return env->analyzer_ops->insn_hook(env, insn_idx, prev_insn_idx); +} + static int do_check(struct bpf_verifier_env *env) { struct bpf_verifier_state *state = >cur_state; @@ -2280,6 +2293,10 @@ static int do_check(struct bpf_verifier_env *env) print_bpf_insn(insn); } + err = ext_analyzer_insn_hook(env, insn_idx, prev_insn_idx); + if (err) + return err; + if (class == BPF_ALU || class == BPF_ALU64) { err = check_alu_op(env, insn); if (err) @@ -2829,3 +2846,54 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr) kfree(env); return ret; } + +int bpf_analyzer(struct bpf_prog *prog, const struct bpf_ext_analyzer_ops *ops, +void *priv) +{ + struct bpf_verifier_env *env; + int ret; + + env = kzalloc(sizeof(struct bpf_verifier_env), GFP_KERNEL); + if (!env) + return -ENOMEM; + + env->insn_aux_data = vzalloc(sizeof(struct bpf_insn_aux_data) * +prog->len); + ret = -ENOMEM; + if (!env->insn_aux_data) + goto err_free_env; + env->prog = prog; + env->analyzer_ops = ops; + env->analyzer_priv = priv; + + /* grab the mutex to protect few globals used by 
verifier */ + mutex_lock(&bpf_verifier_lock); + + log_level = 0; + + env->explored_states = kcalloc(env->prog->len, + sizeof(struct bpf_verifier_state_list *), +
[PATCHv4 net-next 11/15] nfp: bpf: allow offloaded filters to update stats
Periodically poll stats and call into offloaded actions to update them. Signed-off-by: Jakub Kicinski--- v3: - add missing hunk with ethtool stats. --- drivers/net/ethernet/netronome/nfp/nfp_net.h | 19 +++ .../net/ethernet/netronome/nfp/nfp_net_common.c| 3 ++ .../net/ethernet/netronome/nfp/nfp_net_ethtool.c | 12 + .../net/ethernet/netronome/nfp/nfp_net_offload.c | 63 ++ 4 files changed, 97 insertions(+) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net.h b/drivers/net/ethernet/netronome/nfp/nfp_net.h index ea6f5e667f27..13c6a9001b4d 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net.h +++ b/drivers/net/ethernet/netronome/nfp/nfp_net.h @@ -62,6 +62,9 @@ /* Max time to wait for NFP to respond on updates (in seconds) */ #define NFP_NET_POLL_TIMEOUT 5 +/* Interval for reading offloaded filter stats */ +#define NFP_NET_STAT_POLL_IVL msecs_to_jiffies(100) + /* Bar allocation */ #define NFP_NET_CTRL_BAR 0 #define NFP_NET_Q0_BAR 2 @@ -405,6 +408,11 @@ static inline bool nfp_net_fw_ver_eq(struct nfp_net_fw_version *fw_ver, fw_ver->minor == minor; } +struct nfp_stat_pair { + u64 pkts; + u64 bytes; +}; + /** * struct nfp_net - NFP network device structure * @pdev: Backpointer to PCI device @@ -428,6 +436,11 @@ static inline bool nfp_net_fw_ver_eq(struct nfp_net_fw_version *fw_ver, * @rss_cfg:RSS configuration * @rss_key:RSS secret key * @rss_itbl: RSS indirection table + * @rx_filter: Filter offload statistics - dropped packets/bytes + * @rx_filter_prev:Filter offload statistics - values from previous update + * @rx_filter_change: Jiffies when statistics last changed + * @rx_filter_stats_timer: Timer for polling filter offload statistics + * @rx_filter_lock:Lock protecting timer state changes (teardown) * @max_tx_rings: Maximum number of TX rings supported by the Firmware * @max_rx_rings: Maximum number of RX rings supported by the Firmware * @num_tx_rings: Currently configured number of TX rings @@ -504,6 +517,11 @@ struct nfp_net { u8 
rss_key[NFP_NET_CFG_RSS_KEY_SZ]; u8 rss_itbl[NFP_NET_CFG_RSS_ITBL_SZ]; + struct nfp_stat_pair rx_filter, rx_filter_prev; + unsigned long rx_filter_change; + struct timer_list rx_filter_stats_timer; + spinlock_t rx_filter_lock; + int max_tx_rings; int max_rx_rings; @@ -775,6 +793,7 @@ static inline void nfp_net_debugfs_adapter_del(struct nfp_net *nn) } #endif /* CONFIG_NFP_NET_DEBUG */ +void nfp_net_filter_stats_timer(unsigned long data); int nfp_net_bpf_offload(struct nfp_net *nn, u32 handle, __be16 proto, struct tc_cls_bpf_offload *cls_bpf); diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index 51978dfe883b..f091eb758ca2 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -2703,10 +2703,13 @@ struct nfp_net *nfp_net_netdev_alloc(struct pci_dev *pdev, nn->rxd_cnt = NFP_NET_RX_DESCS_DEFAULT; spin_lock_init(&nn->reconfig_lock); + spin_lock_init(&nn->rx_filter_lock); spin_lock_init(&nn->link_status_lock); setup_timer(&nn->reconfig_timer, nfp_net_reconfig_timer, (unsigned long)nn); + setup_timer(&nn->rx_filter_stats_timer, + nfp_net_filter_stats_timer, (unsigned long)nn); return nn; } diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c b/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c index 4c9897220969..3418f2277e9d 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c @@ -106,6 +106,18 @@ static const struct _nfp_net_et_stats nfp_net_et_stats[] = { {"dev_tx_pkts", NN_ET_DEV_STAT(NFP_NET_CFG_STATS_TX_FRAMES)}, {"dev_tx_mc_pkts", NN_ET_DEV_STAT(NFP_NET_CFG_STATS_TX_MC_FRAMES)}, {"dev_tx_bc_pkts", NN_ET_DEV_STAT(NFP_NET_CFG_STATS_TX_BC_FRAMES)}, + + {"bpf_pass_pkts", NN_ET_DEV_STAT(NFP_NET_CFG_STATS_APP0_FRAMES)}, + {"bpf_pass_bytes", NN_ET_DEV_STAT(NFP_NET_CFG_STATS_APP0_BYTES)}, + /* see comments in outro functions in nfp_bpf_jit.c to find out +* how 
different BPF modes use app-specific counters +*/ + {"bpf_app1_pkts", NN_ET_DEV_STAT(NFP_NET_CFG_STATS_APP1_FRAMES)}, + {"bpf_app1_bytes", NN_ET_DEV_STAT(NFP_NET_CFG_STATS_APP1_BYTES)}, + {"bpf_app2_pkts", NN_ET_DEV_STAT(NFP_NET_CFG_STATS_APP2_FRAMES)}, + {"bpf_app2_bytes", NN_ET_DEV_STAT(NFP_NET_CFG_STATS_APP2_BYTES)}, + {"bpf_app3_pkts", NN_ET_DEV_STAT(NFP_NET_CFG_STATS_APP3_FRAMES)}, + {"bpf_app3_bytes", NN_ET_DEV_STAT(NFP_NET_CFG_STATS_APP3_BYTES)}, }; #define NN_ET_GLOBAL_STATS_LEN
[PATCHv4 net-next 09/15] nfp: bpf: add hardware bpf offload
Add hardware bpf offload on our smart NICs. Detect if capable firmware is loaded and use it to load the code JITed with just added translator onto programmable engines. This commit only supports offloading cls_bpf in legacy mode (non-direct action). Signed-off-by: Jakub Kicinski--- drivers/net/ethernet/netronome/nfp/Makefile| 1 + drivers/net/ethernet/netronome/nfp/nfp_net.h | 26 ++- .../net/ethernet/netronome/nfp/nfp_net_common.c| 40 +++- drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h | 44 - .../net/ethernet/netronome/nfp/nfp_net_offload.c | 220 + 5 files changed, 324 insertions(+), 7 deletions(-) create mode 100644 drivers/net/ethernet/netronome/nfp/nfp_net_offload.c diff --git a/drivers/net/ethernet/netronome/nfp/Makefile b/drivers/net/ethernet/netronome/nfp/Makefile index 5f12689bf523..0efb2ba9a558 100644 --- a/drivers/net/ethernet/netronome/nfp/Makefile +++ b/drivers/net/ethernet/netronome/nfp/Makefile @@ -3,6 +3,7 @@ obj-$(CONFIG_NFP_NETVF) += nfp_netvf.o nfp_netvf-objs := \ nfp_net_common.o \ nfp_net_ethtool.o \ + nfp_net_offload.o \ nfp_netvf_main.o ifeq ($(CONFIG_BPF_SYSCALL),y) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net.h b/drivers/net/ethernet/netronome/nfp/nfp_net.h index 690635660195..ea6f5e667f27 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net.h +++ b/drivers/net/ethernet/netronome/nfp/nfp_net.h @@ -220,7 +220,7 @@ struct nfp_net_tx_ring { #define PCIE_DESC_RX_I_TCP_CSUM_OK cpu_to_le16(BIT(11)) #define PCIE_DESC_RX_I_UDP_CSUMcpu_to_le16(BIT(10)) #define PCIE_DESC_RX_I_UDP_CSUM_OK cpu_to_le16(BIT(9)) -#define PCIE_DESC_RX_SPARE cpu_to_le16(BIT(8)) +#define PCIE_DESC_RX_BPF cpu_to_le16(BIT(8)) #define PCIE_DESC_RX_EOP cpu_to_le16(BIT(7)) #define PCIE_DESC_RX_IP4_CSUM cpu_to_le16(BIT(6)) #define PCIE_DESC_RX_IP4_CSUM_OK cpu_to_le16(BIT(5)) @@ -413,6 +413,7 @@ static inline bool nfp_net_fw_ver_eq(struct nfp_net_fw_version *fw_ver, * @is_vf: Is the driver attached to a VF? * @is_nfp3200: Is the driver for a NFP-3200 card? 
* @fw_loaded: Is the firmware loaded? + * @bpf_offload_skip_sw: Offloaded BPF program will not be rerun by cls_bpf * @ctrl: Local copy of the control register/word. * @fl_bufsz: Currently configured size of the freelist buffers * @rx_offset: Offset in the RX buffers where packet data starts @@ -473,6 +474,7 @@ struct nfp_net { unsigned is_vf:1; unsigned is_nfp3200:1; unsigned fw_loaded:1; + unsigned bpf_offload_skip_sw:1; u32 ctrl; u32 fl_bufsz; @@ -561,12 +563,28 @@ struct nfp_net { /* Functions to read/write from/to a BAR * Performs any endian conversion necessary. */ +static inline u16 nn_readb(struct nfp_net *nn, int off) +{ + return readb(nn->ctrl_bar + off); +} + static inline void nn_writeb(struct nfp_net *nn, int off, u8 val) { writeb(val, nn->ctrl_bar + off); } -/* NFP-3200 can't handle 16-bit accesses too well - hence no readw/writew */ +/* NFP-3200 can't handle 16-bit accesses too well */ +static inline u16 nn_readw(struct nfp_net *nn, int off) +{ + WARN_ON_ONCE(nn->is_nfp3200); + return readw(nn->ctrl_bar + off); +} + +static inline void nn_writew(struct nfp_net *nn, int off, u16 val) +{ + WARN_ON_ONCE(nn->is_nfp3200); + writew(val, nn->ctrl_bar + off); +} static inline u32 nn_readl(struct nfp_net *nn, int off) { @@ -757,4 +775,8 @@ static inline void nfp_net_debugfs_adapter_del(struct nfp_net *nn) } #endif /* CONFIG_NFP_NET_DEBUG */ +int +nfp_net_bpf_offload(struct nfp_net *nn, u32 handle, __be16 proto, + struct tc_cls_bpf_offload *cls_bpf); + #endif /* _NFP_NET_H_ */ diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index 252e4924de0f..51978dfe883b 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -60,6 +60,7 @@ #include +#include #include #include "nfp_net_ctrl.h" @@ -2382,6 +2383,31 @@ static struct rtnl_link_stats64 *nfp_net_stat64(struct net_device *netdev, return stats; } +static bool 
nfp_net_ebpf_capable(struct nfp_net *nn) +{ + if (nn->cap & NFP_NET_CFG_CTRL_BPF && + nn_readb(nn, NFP_NET_CFG_BPF_ABI) == NFP_NET_BPF_ABI) + return true; + return false; +} + +static int +nfp_net_setup_tc(struct net_device *netdev, u32 handle, __be16 proto, +struct tc_to_netdev *tc) +{ + struct nfp_net *nn = netdev_priv(netdev); + + if (TC_H_MAJ(handle) != TC_H_MAJ(TC_H_INGRESS)) + return -ENOTSUPP; + if (proto != htons(ETH_P_ALL)) + return -ENOTSUPP; + + if
[PATCHv4 net-next 12/15] nfp: bpf: add packet marking support
Add missing ABI defines and eBPF instructions to allow mark to be passed on and extend prepend parsing on the RX path to pick it up from packet metadata. Signed-off-by: Jakub Kicinski--- v3: - change metadata format. --- drivers/net/ethernet/netronome/nfp/nfp_bpf.h | 2 + drivers/net/ethernet/netronome/nfp/nfp_bpf_jit.c | 19 + drivers/net/ethernet/netronome/nfp/nfp_net.h | 2 + .../net/ethernet/netronome/nfp/nfp_net_common.c| 91 +- drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h | 7 ++ .../net/ethernet/netronome/nfp/nfp_netvf_main.c| 2 +- 6 files changed, 101 insertions(+), 22 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_bpf.h b/drivers/net/ethernet/netronome/nfp/nfp_bpf.h index af43e058be97..d550fbc4768a 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_bpf.h +++ b/drivers/net/ethernet/netronome/nfp/nfp_bpf.h @@ -91,6 +91,8 @@ enum nfp_bpf_reg_type { #define imm_both(np) reg_both((np)->regs_per_thread - STATIC_REG_IMM) #define NFP_BPF_ABI_FLAGS reg_nnr(0) +#define NFP_BPF_ABI_FLAG_MARK 1 +#define NFP_BPF_ABI_MARK reg_nnr(1) #define NFP_BPF_ABI_PKT reg_nnr(2) #define NFP_BPF_ABI_LEN reg_nnr(3) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_bpf_jit.c b/drivers/net/ethernet/netronome/nfp/nfp_bpf_jit.c index dfcf162ccbb8..42a8afb67fc8 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_bpf_jit.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_bpf_jit.c @@ -674,6 +674,16 @@ static int construct_data_ld(struct nfp_prog *nfp_prog, u16 offset, u8 size) return construct_data_ind_ld(nfp_prog, offset, 0, false, size); } +static int wrp_set_mark(struct nfp_prog *nfp_prog, u8 src) +{ + emit_alu(nfp_prog, NFP_BPF_ABI_MARK, +reg_none(), ALU_OP_NONE, reg_b(src)); + emit_alu(nfp_prog, NFP_BPF_ABI_FLAGS, +NFP_BPF_ABI_FLAGS, ALU_OP_OR, reg_imm(NFP_BPF_ABI_FLAG_MARK)); + + return 0; +} + static void wrp_alu_imm(struct nfp_prog *nfp_prog, u8 dst, enum alu_op alu_op, u32 imm) { @@ -1117,6 +1127,14 @@ static int mem_ldx4(struct nfp_prog *nfp_prog, struct 
nfp_insn_meta *meta) return 0; } +static int mem_stx4(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta) +{ + if (meta->insn.off == offsetof(struct sk_buff, mark)) + return wrp_set_mark(nfp_prog, meta->insn.src_reg * 2); + + return -ENOTSUPP; +} + static int jump(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta) { if (meta->insn.off < 0) /* TODO */ @@ -1306,6 +1324,7 @@ static const instr_cb_t instr_cb[256] = { [BPF_LD | BPF_IND | BPF_H] =data_ind_ld2, [BPF_LD | BPF_IND | BPF_W] =data_ind_ld4, [BPF_LDX | BPF_MEM | BPF_W] = mem_ldx4, + [BPF_STX | BPF_MEM | BPF_W] = mem_stx4, [BPF_JMP | BPF_JA | BPF_K] =jump, [BPF_JMP | BPF_JEQ | BPF_K] = jeq_imm, [BPF_JMP | BPF_JGT | BPF_K] = jgt_imm, diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net.h b/drivers/net/ethernet/netronome/nfp/nfp_net.h index 13c6a9001b4d..ed824e11a1e3 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net.h +++ b/drivers/net/ethernet/netronome/nfp/nfp_net.h @@ -269,6 +269,8 @@ struct nfp_net_rx_desc { }; }; +#define NFP_NET_META_FIELD_MASK GENMASK(NFP_NET_META_FIELD_SIZE - 1, 0) + struct nfp_net_rx_hash { __be32 hash_type; __be32 hash; diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index f091eb758ca2..415691edcaa5 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -1293,38 +1293,72 @@ static void nfp_net_rx_csum(struct nfp_net *nn, struct nfp_net_r_vector *r_vec, } } -/** - * nfp_net_set_hash() - Set SKB hash data - * @netdev: adapter's net_device structure - * @skb: SKB to set the hash data on - * @rxd: RX descriptor - * - * The RSS hash and hash-type are pre-pended to the packet data. - * Extract and decode it and set the skb fields. 
- */ static void nfp_net_set_hash(struct net_device *netdev, struct sk_buff *skb, -struct nfp_net_rx_desc *rxd) +unsigned int type, __be32 *hash) { - struct nfp_net_rx_hash *rx_hash; - - if (!(rxd->rxd.flags & PCIE_DESC_RX_RSS) || - !(netdev->features & NETIF_F_RXHASH)) + if (!(netdev->features & NETIF_F_RXHASH)) return; - rx_hash = (struct nfp_net_rx_hash *)(skb->data - sizeof(*rx_hash)); - - switch (be32_to_cpu(rx_hash->hash_type)) { + switch (type) { case NFP_NET_RSS_IPV4: case NFP_NET_RSS_IPV6: case NFP_NET_RSS_IPV6_EX: - skb_set_hash(skb, be32_to_cpu(rx_hash->hash), PKT_HASH_TYPE_L3); + skb_set_hash(skb, get_unaligned_be32(hash),
[PATCHv4 net-next 05/15] bpf: expose internal verfier structures
Move verifier's internal structures to a header file and prefix their names with bpf_ to avoid potential namespace conflicts. Those structures will soon be used by external analyzers. Signed-off-by: Jakub Kicinski--- v4: - separate from adding the analyzer; - squash with the prefixing patch. --- include/linux/bpf_verifier.h | 78 + kernel/bpf/verifier.c| 263 +-- 2 files changed, 180 insertions(+), 161 deletions(-) create mode 100644 include/linux/bpf_verifier.h diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h new file mode 100644 index ..1c0511ef7eaf --- /dev/null +++ b/include/linux/bpf_verifier.h @@ -0,0 +1,78 @@ +/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of version 2 of the GNU General Public + * License as published by the Free Software Foundation. + */ +#ifndef _LINUX_BPF_ANALYZER_H +#define _LINUX_BPF_ANALYZER_H 1 + +#include <linux/bpf.h> /* for enum bpf_reg_type */ +#include <linux/filter.h> /* for MAX_BPF_STACK */ + +struct bpf_reg_state { + enum bpf_reg_type type; + union { + /* valid when type == CONST_IMM | PTR_TO_STACK | UNKNOWN_VALUE */ + s64 imm; + + /* valid when type == PTR_TO_PACKET* */ + struct { + u32 id; + u16 off; + u16 range; + }; + + /* valid when type == CONST_PTR_TO_MAP | PTR_TO_MAP_VALUE | +* PTR_TO_MAP_VALUE_OR_NULL +*/ + struct bpf_map *map_ptr; + }; +}; + +enum bpf_stack_slot_type { + STACK_INVALID,/* nothing was stored in this stack slot */ + STACK_SPILL, /* register spilled into stack */ + STACK_MISC/* BPF program wrote some data into this slot */ +}; + +#define BPF_REG_SIZE 8 /* size of eBPF register in bytes */ + +/* state of the program: + * type of all registers and stack info + */ +struct bpf_verifier_state { + struct bpf_reg_state regs[MAX_BPF_REG]; + u8 stack_slot_type[MAX_BPF_STACK]; + struct bpf_reg_state spilled_regs[MAX_BPF_STACK / BPF_REG_SIZE]; +}; + +/* linked list of verifier states used to prune search */ +struct 
bpf_verifier_state_list { + struct bpf_verifier_state state; + struct bpf_verifier_state_list *next; +}; + +struct bpf_insn_aux_data { + enum bpf_reg_type ptr_type; /* pointer type for load/store insns */ +}; + +#define MAX_USED_MAPS 64 /* max number of maps accessed by one eBPF program */ + +/* single container for all structs + * one verifier_env per bpf_check() call + */ +struct bpf_verifier_env { + struct bpf_prog *prog; /* eBPF program being verified */ + struct bpf_verifier_stack_elem *head; /* stack of verifier states to be processed */ + int stack_size; /* number of states to be processed */ + struct bpf_verifier_state cur_state; /* current verifier state */ + struct bpf_verifier_state_list **explored_states; /* search pruning optimization */ + struct bpf_map *used_maps[MAX_USED_MAPS]; /* array of map's used by eBPF program */ + u32 used_map_cnt; /* number of used maps */ + u32 id_gen; /* used to generate unique reg IDs */ + bool allow_ptr_leaks; + struct bpf_insn_aux_data *insn_aux_data; /* array of per-insn state */ +}; + +#endif /* _LINUX_BPF_ANALYZER_H */ diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index ce9c0d1721c6..6e126a417290 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -14,6 +14,7 @@ #include <linux/types.h> #include <linux/slab.h> #include <linux/bpf.h> +#include <linux/bpf_verifier.h> #include <linux/filter.h> #include <net/netlink.h> #include <linux/file.h> @@ -126,81 +127,16 @@ * are set to NOT_INIT to indicate that they are no longer readable. 
*/ -struct reg_state { - enum bpf_reg_type type; - union { - /* valid when type == CONST_IMM | PTR_TO_STACK | UNKNOWN_VALUE */ - s64 imm; - - /* valid when type == PTR_TO_PACKET* */ - struct { - u32 id; - u16 off; - u16 range; - }; - - /* valid when type == CONST_PTR_TO_MAP | PTR_TO_MAP_VALUE | -* PTR_TO_MAP_VALUE_OR_NULL -*/ - struct bpf_map *map_ptr; - }; -}; - -enum bpf_stack_slot_type { - STACK_INVALID,/* nothing was stored in this stack slot */ - STACK_SPILL, /* register spilled into stack */ - STACK_MISC/* BPF program wrote some data into this slot */ -}; - -#define BPF_REG_SIZE 8 /* size of eBPF register in bytes */ - -/* state of the program: - * type of all registers and stack info - */ -struct verifier_state { - struct reg_state regs[MAX_BPF_REG]; - u8
[PATCHv4 net-next 02/15] net: cls_bpf: limit hardware offload by software-only flag
Add cls_bpf support for the TCA_CLS_FLAGS_SKIP_HW flag. Unlike U32 and flower cls_bpf already has some netlink flags defined. Create a new attribute to be able to use the same flag values as the above. Unlike U32 and flower reject unknown flags. Signed-off-by: Jakub KicinskiAcked-by: Daniel Borkmann --- v3: - reject (instead of clear) unsupported flags; - fix error handling. v2: - rename TCA_BPF_GEN_TCA_FLAGS -> TCA_BPF_FLAGS_GEN; - add comment about clearing unsupported flags; - validate flags after clearing unsupported. --- include/net/pkt_cls.h| 1 + include/uapi/linux/pkt_cls.h | 1 + net/sched/cls_bpf.c | 22 -- 3 files changed, 22 insertions(+), 2 deletions(-) diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h index 41e8071dff87..57af9f3032ff 100644 --- a/include/net/pkt_cls.h +++ b/include/net/pkt_cls.h @@ -498,6 +498,7 @@ struct tc_cls_bpf_offload { struct bpf_prog *prog; const char *name; bool exts_integrated; + u32 gen_flags; }; #endif diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h index f9c287c67eae..91dd136445f3 100644 --- a/include/uapi/linux/pkt_cls.h +++ b/include/uapi/linux/pkt_cls.h @@ -396,6 +396,7 @@ enum { TCA_BPF_FD, TCA_BPF_NAME, TCA_BPF_FLAGS, + TCA_BPF_FLAGS_GEN, __TCA_BPF_MAX, }; diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c index c3983493aeab..1ae5b6798363 100644 --- a/net/sched/cls_bpf.c +++ b/net/sched/cls_bpf.c @@ -27,6 +27,8 @@ MODULE_AUTHOR("Daniel Borkmann "); MODULE_DESCRIPTION("TC BPF based classifier"); #define CLS_BPF_NAME_LEN 256 +#define CLS_BPF_SUPPORTED_GEN_FLAGS\ + TCA_CLS_FLAGS_SKIP_HW struct cls_bpf_head { struct list_head plist; @@ -40,6 +42,7 @@ struct cls_bpf_prog { struct tcf_result res; bool exts_integrated; bool offloaded; + u32 gen_flags; struct tcf_exts exts; u32 handle; union { @@ -55,6 +58,7 @@ struct cls_bpf_prog { static const struct nla_policy bpf_policy[TCA_BPF_MAX + 1] = { [TCA_BPF_CLASSID] = { .type = NLA_U32 }, [TCA_BPF_FLAGS] = { .type = NLA_U32 }, + 
[TCA_BPF_FLAGS_GEN] = { .type = NLA_U32 }, [TCA_BPF_FD]= { .type = NLA_U32 }, [TCA_BPF_NAME] = { .type = NLA_NUL_STRING, .len = CLS_BPF_NAME_LEN }, [TCA_BPF_OPS_LEN] = { .type = NLA_U16 }, @@ -156,6 +160,7 @@ static int cls_bpf_offload_cmd(struct tcf_proto *tp, struct cls_bpf_prog *prog, bpf_offload.prog = prog->filter; bpf_offload.name = prog->bpf_name; bpf_offload.exts_integrated = prog->exts_integrated; + bpf_offload.gen_flags = prog->gen_flags; return dev->netdev_ops->ndo_setup_tc(dev, tp->q->handle, tp->protocol, &bpf_offload); @@ -169,14 +174,14 @@ static void cls_bpf_offload(struct tcf_proto *tp, struct cls_bpf_prog *prog, enum tc_clsbpf_command cmd; if (oldprog && oldprog->offloaded) { - if (tc_should_offload(dev, tp, 0)) { + if (tc_should_offload(dev, tp, prog->gen_flags)) { cmd = TC_CLSBPF_REPLACE; } else { obj = oldprog; cmd = TC_CLSBPF_DESTROY; } } else { - if (!tc_should_offload(dev, tp, 0)) + if (!tc_should_offload(dev, tp, prog->gen_flags)) return; cmd = TC_CLSBPF_ADD; } @@ -372,6 +377,7 @@ static int cls_bpf_modify_existing(struct net *net, struct tcf_proto *tp, { bool is_bpf, is_ebpf, have_exts = false; struct tcf_exts exts; + u32 gen_flags = 0; int ret; is_bpf = tb[TCA_BPF_OPS_LEN] && tb[TCA_BPF_OPS]; @@ -396,8 +402,17 @@ static int cls_bpf_modify_existing(struct net *net, struct tcf_proto *tp, have_exts = bpf_flags & TCA_BPF_FLAG_ACT_DIRECT; } + if (tb[TCA_BPF_FLAGS_GEN]) { + gen_flags = nla_get_u32(tb[TCA_BPF_FLAGS_GEN]); + if (gen_flags & ~CLS_BPF_SUPPORTED_GEN_FLAGS || + !tc_flags_valid(gen_flags)) { + ret = -EINVAL; + goto errout; + } + } prog->exts_integrated = have_exts; + prog->gen_flags = gen_flags; ret = is_bpf ? 
cls_bpf_prog_from_ops(tb, prog) : cls_bpf_prog_from_efd(tb, prog, tp); @@ -569,6 +584,9 @@ static int cls_bpf_dump(struct net *net, struct tcf_proto *tp, unsigned long fh, bpf_flags |= TCA_BPF_FLAG_ACT_DIRECT; if (bpf_flags && nla_put_u32(skb, TCA_BPF_FLAGS, bpf_flags)) goto nla_put_failure; + if (prog->gen_flags && + nla_put_u32(skb, TCA_BPF_FLAGS_GEN,
[PATCHv4 net-next 04/15] bpf: don't (ab)use instructions to store state
Storing state in reserved fields of instructions makes it impossible to run verifier on programs already marked as read-only. Allocate and use an array of per-instruction state instead. While touching the error path rename and move existing jump target. Suggested-by: Alexei StarovoitovSigned-off-by: Jakub Kicinski Acked-by: Alexei Starovoitov --- v3: - new patch. --- kernel/bpf/verifier.c | 51 --- 1 file changed, 32 insertions(+), 19 deletions(-) diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 086b3979380c..ce9c0d1721c6 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -181,6 +181,10 @@ struct verifier_stack_elem { struct verifier_stack_elem *next; }; +struct bpf_insn_aux_data { + enum bpf_reg_type ptr_type; /* pointer type for load/store insns */ +}; + #define MAX_USED_MAPS 64 /* max number of maps accessed by one eBPF program */ /* single container for all structs @@ -196,6 +200,7 @@ struct verifier_env { u32 used_map_cnt; /* number of used maps */ u32 id_gen; /* used to generate unique reg IDs */ bool allow_ptr_leaks; + struct bpf_insn_aux_data *insn_aux_data; /* array of per-insn state */ }; #define BPF_COMPLEXITY_LIMIT_INSNS 65536 @@ -2340,7 +2345,7 @@ static int do_check(struct verifier_env *env) return err; } else if (class == BPF_LDX) { - enum bpf_reg_type src_reg_type; + enum bpf_reg_type *prev_src_type, src_reg_type; /* check for reserved fields is already done */ @@ -2370,16 +2375,18 @@ static int do_check(struct verifier_env *env) continue; } - if (insn->imm == 0) { + prev_src_type = &env->insn_aux_data[insn_idx].ptr_type; + + if (*prev_src_type == NOT_INIT) { /* saw a valid insn * dst_reg = *(u32 *)(src_reg + off) -* use reserved 'imm' field to mark this insn +* save type to validate intersecting paths */ - insn->imm = src_reg_type; + *prev_src_type = src_reg_type; - } else if (src_reg_type != insn->imm && + } else if (src_reg_type != *prev_src_type && (src_reg_type == PTR_TO_CTX || - insn->imm == PTR_TO_CTX)) { + 
*prev_src_type == PTR_TO_CTX)) { /* ABuser program is trying to use the same insn * dst_reg = *(u32*) (src_reg + off) * with different pointer types: @@ -2392,7 +2399,7 @@ static int do_check(struct verifier_env *env) } } else if (class == BPF_STX) { - enum bpf_reg_type dst_reg_type; + enum bpf_reg_type *prev_dst_type, dst_reg_type; if (BPF_MODE(insn->code) == BPF_XADD) { err = check_xadd(env, insn); @@ -2420,11 +2427,13 @@ static int do_check(struct verifier_env *env) if (err) return err; - if (insn->imm == 0) { - insn->imm = dst_reg_type; - } else if (dst_reg_type != insn->imm && + prev_dst_type = &env->insn_aux_data[insn_idx].ptr_type; + + if (*prev_dst_type == NOT_INIT) { + *prev_dst_type = dst_reg_type; + } else if (dst_reg_type != *prev_dst_type && (dst_reg_type == PTR_TO_CTX || - insn->imm == PTR_TO_CTX)) { + *prev_dst_type == PTR_TO_CTX)) { verbose("same insn cannot be used with different pointers\n"); return -EINVAL; } @@ -2703,11 +2712,8 @@ static int convert_ctx_accesses(struct verifier_env *env) else continue; - if (insn->imm != PTR_TO_CTX) { - /* clear internal mark */ - insn->imm = 0; + if (env->insn_aux_data[i].ptr_type != PTR_TO_CTX) continue; - } cnt = env->prog->aux->ops-> convert_ctx_access(type, insn->dst_reg, insn->src_reg, @@ -2772,6 +2778,11 @@ int bpf_check(struct bpf_prog
[PATCHv4 net-next 10/15] net: cls_bpf: allow offloaded filters to update stats
Call into offloaded filters to update stats. Signed-off-by: Jakub KicinskiAcked-by: Daniel Borkmann --- include/net/pkt_cls.h | 1 + net/sched/cls_bpf.c | 11 +++ 2 files changed, 12 insertions(+) diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h index 57af9f3032ff..5ccaa4be7d96 100644 --- a/include/net/pkt_cls.h +++ b/include/net/pkt_cls.h @@ -490,6 +490,7 @@ enum tc_clsbpf_command { TC_CLSBPF_ADD, TC_CLSBPF_REPLACE, TC_CLSBPF_DESTROY, + TC_CLSBPF_STATS, }; struct tc_cls_bpf_offload { diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c index 1aad314089e9..86ef331f78e8 100644 --- a/net/sched/cls_bpf.c +++ b/net/sched/cls_bpf.c @@ -223,6 +223,15 @@ static void cls_bpf_stop_offload(struct tcf_proto *tp, prog->offloaded = false; } +static void cls_bpf_offload_update_stats(struct tcf_proto *tp, +struct cls_bpf_prog *prog) +{ + if (!prog->offloaded) + return; + + cls_bpf_offload_cmd(tp, prog, TC_CLSBPF_STATS); +} + static int cls_bpf_init(struct tcf_proto *tp) { struct cls_bpf_head *head; @@ -578,6 +587,8 @@ static int cls_bpf_dump(struct net *net, struct tcf_proto *tp, unsigned long fh, tm->tcm_handle = prog->handle; + cls_bpf_offload_update_stats(tp, prog); + nest = nla_nest_start(skb, TCA_OPTIONS); if (nest == NULL) goto nla_put_failure; -- 1.9.1
[PATCHv4 net-next 07/15] bpf: recognize 64bit immediate loads as consts
When running as parser interpret BPF_LD | BPF_IMM | BPF_DW instructions as loading CONST_IMM with the value stored in imm. The verifier will continue not recognizing those due to concerns about search space/program complexity increase. Signed-off-by: Jakub Kicinski--- v3: - limit to parsers. --- kernel/bpf/verifier.c | 14 -- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index d93e78331b90..f5bed7cce08d 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -1766,9 +1766,19 @@ static int check_ld_imm(struct bpf_verifier_env *env, struct bpf_insn *insn) if (err) return err; - if (insn->src_reg == 0) - /* generic move 64-bit immediate into a register */ + if (insn->src_reg == 0) { + /* generic move 64-bit immediate into a register, +* only analyzer needs to collect the ld_imm value. +*/ + u64 imm = ((u64)(insn + 1)->imm << 32) | (u32)insn->imm; + + if (!env->analyzer_ops) + return 0; + + regs[insn->dst_reg].type = CONST_IMM; + regs[insn->dst_reg].imm = imm; return 0; + } /* replace_map_fd_with_map_ptr() should have caught bad ld_imm64 */ BUG_ON(insn->src_reg != BPF_PSEUDO_MAP_FD); -- 1.9.1
MDB offloading of local ipv4 multicast groups
Hi, While adding MDB support to the qca8k dsa driver I found that ipv4 mcast groups don't always get propagated to the dsa driver. In my setup there are 2 clients connected to the switch, both running an mDNS client. The .port_mdb_add() callback is properly called for 33:33:00:00:00:FB, but 01:00:5E:00:00:FB is never propagated to the dsa driver. The reason is that the call to ipv4_is_local_multicast() here [1] returns true, so the notifier is never called. Is this intentional, or is there something missing in the code? John [1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/net/bridge/br_multicast.c?id=refs/tags/v4.8-rc6#n737
Re: XDP user interface confusions
On Thu, Sep 15, 2016 at 08:14:02PM +0200, Jesper Dangaard Brouer wrote: > Hi Brenden, > > I don't quite understand the semantics of the XDP userspace interface. > > We allow XDP programs to be (unconditionally) exchanged by another > program, this avoids taking the link down+up and avoids reallocating > RX ring resources (which is great). > > We have two XDP samples programs in samples/bpf/ xdp1 and xdp2. Now I > want to first load xdp1 and then to avoid the linkdown I load xdp2, > and then afterwards remove/stop program xdp1. > > This does NOT work, because (in samples/bpf/xdp1_user.c) when xdp1 > exits it unconditionally removes the running XDP program (loaded by xdp2) > via set_link_xdp_fd(ifindex, -1). The xdp2 user program is still > running, and is unaware of its xdp/bpf program have been unloaded. > > I find this userspace interface confusing. What this your intention? > Perhaps you can explain what the intended semantics or specification is? In practice, we've used a single agent process to manage bpf programs on behalf of the user applications. This agent process uses common linux functionalities to add semantics, while not really relying on the bpf handles themselves to take care of that. For instance, the process may put some lockfiles and what-not in /var/run/$PID, and maybe returns the list of running programs through a http: or unix: interface. So, from a user<->kernel API, the requirements are minimal...the agent process just overwrites the loaded bpf program when the application changes, or a new application comes online. There is nobody to 'notify' when a handle changes. When translating this into the kernel api that you see now, none of this exists, because IMHO the kernel api should be unopinionated and generic. The result is something that appears very "fire-and-forget", which results in something simple yet safe at the same time; the refcounting is done transparently by the kernel. 
So, in practice, there is no xdp1 or xdp2, just xdp-agent at different points in time. Or, better yet, no agent, just the programs running in the kernel, with the handles of the programs residing solely in the device, which are perhaps pinned to /sys/fs/bpf for semantic management purposes. I didn't feel like it was appropriate to conflate different bpf features in the kernel samples, so we don't see (and probably never will) a sample which combines these features into a whole. That is best left to userspace tools. It so happens that this is one of the projects I am currently active on at $DAYJOB, and we fully intend to share the details of that when it's in a suitable state. > > -- > Best regards, > Jesper Dangaard Brouer > MSc.CS, Principal Kernel Engineer at Red Hat > Author of http://www.iptv-analyzer.org > LinkedIn: http://www.linkedin.com/in/brouer
XDP user interface confusions
Hi Brenden, I don't quite understand the semantics of the XDP userspace interface. We allow XDP programs to be (unconditionally) exchanged by another program; this avoids taking the link down+up and avoids reallocating RX ring resources (which is great). We have two XDP sample programs in samples/bpf/: xdp1 and xdp2. Now I want to first load xdp1 and then, to avoid the link down, load xdp2, and afterwards remove/stop program xdp1. This does NOT work, because (in samples/bpf/xdp1_user.c) when xdp1 exits it unconditionally removes the running XDP program (loaded by xdp2) via set_link_xdp_fd(ifindex, -1). The xdp2 user program is still running, and is unaware that its xdp/bpf program has been unloaded. I find this userspace interface confusing. Was this your intention? Perhaps you can explain what the intended semantics or specification is? -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer
[PATCH next] sctp: make use of WORD_TRUNC macro
No functional change. Just to avoid the usage of '&~3'. Also break the line to make it easier to read. Signed-off-by: Marcelo Ricardo Leitner --- net/sctp/chunk.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/net/sctp/chunk.c b/net/sctp/chunk.c index a55e54738b81ff8cf9cd711cf5fc466ac71374c0..adae4a41ca2078cfee387631f76e5cb768c2269c 100644 --- a/net/sctp/chunk.c +++ b/net/sctp/chunk.c @@ -182,9 +182,10 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *asoc, /* This is the biggest possible DATA chunk that can fit into * the packet */ - max_data = (asoc->pathmtu - - sctp_sk(asoc->base.sk)->pf->af->net_header_len - - sizeof(struct sctphdr) - sizeof(struct sctp_data_chunk)) & ~3; + max_data = asoc->pathmtu - + sctp_sk(asoc->base.sk)->pf->af->net_header_len - + sizeof(struct sctphdr) - sizeof(struct sctp_data_chunk); + max_data = WORD_TRUNC(max_data); max = asoc->frag_point; /* If the the peer requested that we authenticate DATA chunks -- 2.7.4
[PATCH net] sctp: fix SSN comparison
This function actually operates on u32 yet its parameters were declared as u16, causing integer truncation upon calling. Note in patch context that ADDIP_SERIAL_SIGN_BIT is already 32 bits. Signed-off-by: Marcelo Ricardo Leitner --- This issue exists since before the git import, so I can't put a Fixes tag. Also, that said, it's probably not worth queueing it to stable. Thanks include/net/sctp/sm.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/net/sctp/sm.h b/include/net/sctp/sm.h index efc01743b9d641bf6b16a37780ee0df34b4ec698..bafe2a0ab9085f24e17038516c55c00cfddd02f4 100644 --- a/include/net/sctp/sm.h +++ b/include/net/sctp/sm.h @@ -382,7 +382,7 @@ enum { ADDIP_SERIAL_SIGN_BIT = (1<<31) }; -static inline int ADDIP_SERIAL_gte(__u16 s, __u16 t) +static inline int ADDIP_SERIAL_gte(__u32 s, __u32 t) { return ((s) == (t)) || (((t) - (s)) & ADDIP_SERIAL_SIGN_BIT); } -- 2.7.4
Re: [PATCH net-next] tcp: prepare skbs for better sack shifting
On Thu, Sep 15, 2016 at 9:33 AM, Eric Dumazet wrote: > > From: Eric Dumazet > > With large BDP TCP flows and lossy networks, it is very important > to keep a low number of skbs in the write queue. > > RACK and SACK processing can perform a linear scan of it. > > We should avoid putting any payload in skb->head, so that SACK > shifting can be done if needed. > > With this patch, we allow to pack ~0.5 MB per skb instead of > the 64KB initially cooked at tcp_sendmsg() time. > > This gives a reduction of number of skbs in write queue by eight. > tcp_rack_detect_loss() likes this. > > We still allow payload in skb->head for first skb put in the queue, > to not impact RPC workloads. > > Signed-off-by: Eric Dumazet > Cc: Yuchung Cheng Acked-by: Yuchung Cheng > --- > net/ipv4/tcp.c | 31 ++++++++++++++++++++++++++++------- > 1 file changed, 24 insertions(+), 7 deletions(-) > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > index > a13fcb369f52fe85def7c9d856259bc0509f3453..7dae800092e62cec330544851289d20a68642561 > 100644 > --- a/net/ipv4/tcp.c > +++ b/net/ipv4/tcp.c > @@ -1020,17 +1020,31 @@ int tcp_sendpage(struct sock *sk, struct page *page, > int offset, > } > EXPORT_SYMBOL(tcp_sendpage); > > -static inline int select_size(const struct sock *sk, bool sg) > +/* Do not bother using a page frag for very small frames. > + * But use this heuristic only for the first skb in write queue. > + * > + * Having no payload in skb->head allows better SACK shifting > + * in tcp_shift_skb_data(), reducing sack/rack overhead, because > + * write queue has less skbs. > + * Each skb can hold up to MAX_SKB_FRAGS * 32Kbytes, or ~0.5 MB. > + * This also speeds up tso_fragment(), since it wont fallback > + * to tcp_fragment(). 
> + */ > +static int linear_payload_sz(bool first_skb) > +{ > + if (first_skb) > + return SKB_WITH_OVERHEAD(2048 - MAX_TCP_HEADER); > + return 0; > +} > + > +static int select_size(const struct sock *sk, bool sg, bool first_skb) > { > const struct tcp_sock *tp = tcp_sk(sk); > int tmp = tp->mss_cache; > > if (sg) { > if (sk_can_gso(sk)) { > - /* Small frames wont use a full page: > -* Payload will immediately follow tcp header. > -*/ > - tmp = SKB_WITH_OVERHEAD(2048 - MAX_TCP_HEADER); > + tmp = linear_payload_sz(first_skb); > } else { > int pgbreak = SKB_MAX_HEAD(MAX_TCP_HEADER); > > @@ -1161,6 +1175,8 @@ restart: > } > > if (copy <= 0 || !tcp_skb_can_collapse_to(skb)) { > + bool first_skb; > + > new_segment: > /* Allocate new segment. If the interface is SG, > * allocate skb fitting to single page. > @@ -1172,10 +1188,11 @@ new_segment: > process_backlog = false; > goto restart; > } > + first_skb = skb_queue_empty(&sk->sk_write_queue); > skb = sk_stream_alloc_skb(sk, > - select_size(sk, sg, > first_skb), > sk->sk_allocation, > - > skb_queue_empty(&sk->sk_write_queue)); > + first_skb); > if (!skb) > goto wait_for_memory; > > >
[PATCH] llc: switch type to bool as the timeout is only tested versus 0
(As asked by Dave in February) Signed-off-by: Alan Cox --- net/llc/af_llc.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/net/llc/af_llc.c b/net/llc/af_llc.c index 8ae3ed9..db916cf 100644 --- a/net/llc/af_llc.c +++ b/net/llc/af_llc.c @@ -38,7 +38,7 @@ static u16 llc_ui_sap_link_no_max[256]; static struct sockaddr_llc llc_ui_addrnull; static const struct proto_ops llc_ui_ops; -static long llc_ui_wait_for_conn(struct sock *sk, long timeout); +static bool llc_ui_wait_for_conn(struct sock *sk, long timeout); static int llc_ui_wait_for_disc(struct sock *sk, long timeout); static int llc_ui_wait_for_busy_core(struct sock *sk, long timeout); @@ -551,7 +551,7 @@ static int llc_ui_wait_for_disc(struct sock *sk, long timeout) return rc; } -static long llc_ui_wait_for_conn(struct sock *sk, long timeout) +static bool llc_ui_wait_for_conn(struct sock *sk, long timeout) { DEFINE_WAIT(wait);
Re: [PATCH] mwifiex: fix memory leak on regd when chan is zero
On 15/09/16 18:10, Kalle Valo wrote: > Colin King writes: > >> From: Colin Ian King >> >> When chan is zero mwifiex_create_custom_regdomain does not kfree >> regd and we have a memory leak. Fix this by freeing regd before >> the return. >> >> Signed-off-by: Colin Ian King >> --- >> drivers/net/wireless/marvell/mwifiex/sta_cmdresp.c | 4 +++- >> 1 file changed, 3 insertions(+), 1 deletion(-) >> >> diff --git a/net/wireless/marvell/mwifiex/sta_cmdresp.c >> b/drivers/net/wireless/marvell/mwifiex/sta_cmdresp.c >> index 3344a26..15a91f3 100644 >> --- a/drivers/net/wireless/marvell/mwifiex/sta_cmdresp.c >> +++ b/drivers/net/wireless/marvell/mwifiex/sta_cmdresp.c >> @@ -1049,8 +1049,10 @@ mwifiex_create_custom_regdomain(struct >> mwifiex_private *priv, >> enum nl80211_band band; >> >> chan = *buf++; >> -if (!chan) >> +if (!chan) { >> +kfree(regd); >> return NULL; >> +} > > Bob sent a similar fix and he also did more: > > mwifiex: fix error handling in mwifiex_create_custom_regdomain > > https://patchwork.kernel.org/patch/9331337/ > Ah, sorry for the duplication noise. Colin
[PATCH net-next] net: l3mdev: Remove netif_index_is_l3_master
No longer used after e0d56fdd73422 ("net: l3mdev: remove redundant calls") Signed-off-by: David Ahern --- include/net/l3mdev.h | 24 ------------------------ 1 file changed, 24 deletions(-) diff --git a/include/net/l3mdev.h b/include/net/l3mdev.h index 3832099289c5..b220dabeab45 100644 --- a/include/net/l3mdev.h +++ b/include/net/l3mdev.h @@ -114,25 +114,6 @@ static inline u32 l3mdev_fib_table(const struct net_device *dev) return tb_id; } -static inline bool netif_index_is_l3_master(struct net *net, int ifindex) -{ - struct net_device *dev; - bool rc = false; - - if (ifindex == 0) - return false; - - rcu_read_lock(); - - dev = dev_get_by_index_rcu(net, ifindex); - if (dev) - rc = netif_is_l3_master(dev); - - rcu_read_unlock(); - - return rc; -} - struct dst_entry *l3mdev_link_scope_lookup(struct net *net, struct flowi6 *fl6); static inline @@ -226,11 +207,6 @@ static inline u32 l3mdev_fib_table_by_index(struct net *net, int ifindex) return 0; } -static inline bool netif_index_is_l3_master(struct net *net, int ifindex) -{ - return false; -} - static inline struct dst_entry *l3mdev_link_scope_lookup(struct net *net, struct flowi6 *fl6) { -- 2.1.4
Re: [PATCH net-next 1/7] lwt: Add net to build_state argument
On 9/14/16, 4:22 PM, Tom Herbert wrote: > Users of LWT need to know net if they want to have per net operations > in LWT. > > Signed-off-by: Tom Herbert > --- Acked-by: Roopa Prabhu
[PATCH net-next] net: vrf: Remove RT_FL_TOS
No longer used after d66f6c0a8f3c0 ("net: ipv4: Remove l3mdev_get_saddr") Signed-off-by: David Ahern --- drivers/net/vrf.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c index 55674b0e65b7..85c271c70d42 100644 --- a/drivers/net/vrf.c +++ b/drivers/net/vrf.c @@ -37,9 +37,6 @@ #include #include -#define RT_FL_TOS(oldflp4) \ - ((oldflp4)->flowi4_tos & (IPTOS_RT_MASK | RTO_ONLINK)) - #define DRV_NAME "vrf" #define DRV_VERSION "1.0" -- 2.1.4