Re: [PATCH net-next 0/7] tcp: second round for EDT conversion
From: Eric Dumazet Date: Mon, 15 Oct 2018 09:37:51 -0700 > First round of EDT patches left TCP stack in a non-optimal state. > > - High speed flows suffered from loss of performance, addressed > by the first patch of this series. > > - Second patch brings pacing to the current state of networking, > since we now reach ~100 Gbit on a single TCP flow. > > - Third patch implements a mitigation for scheduling delays, > like the one we did in sch_fq in the past. > > - Fourth patch removes one special case in sch_fq for ACK packets. > > - Fifth patch removes a serious performance cost for TCP internal > pacing. We should set up the high resolution timer only if > really needed. > > - Sixth patch fixes a typo in BBR. > > - Last patch is one minor change in cdg congestion control. > > Neal Cardwell also has a patch series fixing BBR after > EDT adoption. Series applied, thanks Eric.
Re: [PATCH net] sctp: use the pmtu from the icmp packet to update transport pathmtu
From: Xin Long Date: Mon, 15 Oct 2018 19:58:29 +0800 > Other than asoc pmtu sync from all transports, sctp_assoc_sync_pmtu > is also processing transport pmtu_pending triggered by ICMP packets. But it's > meaningless to use sctp_dst_mtu(t->dst) as the new pmtu for a transport. > > The right pmtu value should come from the ICMP packet, and it would > be saved into transport->mtu_info in this patch and used later when > the pmtu sync happens in sctp_sendmsg_to_asoc or sctp_packet_config. > > Besides, without this patch, as pmtu can only be updated correctly > when receiving an ICMP packet and no place is holding the sock lock, it > can take a long time if the sock is busy sending packets. > > Note that it doesn't process transport->mtu_info in .release_cb(), > as there is not enough information for a pmtu update, such as which > asoc or transport it is for. It is not worth traversing all asocs to check > pmtu_pending. So unlike tcp, sctp does this in the tx path, for which > mtu_info needs to be atomic_t. > > Signed-off-by: Xin Long Applied.
Re: [PATCH net,stable 1/1] net: fec: don't dump RX FIFO register when not available
From: Andy Duan Date: Mon, 15 Oct 2018 05:19:00 + > From: Fugang Duan > > Commit db65f35f50e0 ("net: fec: add support of ethtool get_regs") introduced > the ethtool "--register-dump" interface to dump all FEC registers. > > But not all silicon implementations of the Freescale FEC hardware module > have the FRBR (FIFO Receive Bound Register) and FRSR (FIFO Receive Start > Register) registers, so we should not be trying to dump them on those that > don't. > > To fix it we create a quirk flag, FEC_QUIRK_HAS_RFREG, and check it before > dumping those RX FIFO registers. > > Signed-off-by: Fugang Duan Applied and queued up for -stable.
[PATCH bpf-next] libbpf: Per-symbol visibility for DSO
Make global symbols in libbpf DSO hidden by default with -fvisibility=hidden and export symbols that are part of the ABI explicitly with __attribute__((visibility("default"))). This is common practice that should prevent accidentally exporting a symbol that is not supposed to be part of the ABI, which, in turn, improves both the libbpf developer and user experience. See [1] for more details. Export control becomes more important since more and more projects use libbpf. The patch doesn't export a bunch of netlink-related functions since, as agreed in [2], they'll be reworked. That doesn't break bpftool since bpftool links libbpf statically. [1] https://www.akkadia.org/drepper/dsohowto.pdf (2.2 Export Control) [2] https://www.mail-archive.com/netdev@vger.kernel.org/msg251434.html Signed-off-by: Andrey Ignatov --- tools/lib/bpf/Makefile | 1 + tools/lib/bpf/bpf.h| 118 ++ tools/lib/bpf/btf.h| 22 +++-- tools/lib/bpf/libbpf.h | 186 ++--- 4 files changed, 179 insertions(+), 148 deletions(-) diff --git a/tools/lib/bpf/Makefile b/tools/lib/bpf/Makefile index 79d84413ddf2..425b480bda75 100644 --- a/tools/lib/bpf/Makefile +++ b/tools/lib/bpf/Makefile @@ -125,6 +125,7 @@ override CFLAGS += $(EXTRA_WARNINGS) override CFLAGS += -Werror -Wall override CFLAGS += -fPIC override CFLAGS += $(INCLUDES) +override CFLAGS += -fvisibility=hidden ifeq ($(VERBOSE),1) Q = diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h index 69a4d40c4227..258c3c178333 100644 --- a/tools/lib/bpf/bpf.h +++ b/tools/lib/bpf/bpf.h @@ -27,6 +27,10 @@ #include #include +#ifndef LIBBPF_API +#define LIBBPF_API __attribute__((visibility("default"))) +#endif + struct bpf_create_map_attr { const char *name; enum bpf_map_type map_type; @@ -42,21 +46,24 @@ struct bpf_create_map_attr { __u32 inner_map_fd; }; -int bpf_create_map_xattr(const struct bpf_create_map_attr *create_attr); -int bpf_create_map_node(enum bpf_map_type map_type, const char *name, - int key_size, int value_size, int max_entries, - __u32 map_flags, int 
node); -int bpf_create_map_name(enum bpf_map_type map_type, const char *name, - int key_size, int value_size, int max_entries, - __u32 map_flags); -int bpf_create_map(enum bpf_map_type map_type, int key_size, int value_size, - int max_entries, __u32 map_flags); -int bpf_create_map_in_map_node(enum bpf_map_type map_type, const char *name, - int key_size, int inner_map_fd, int max_entries, - __u32 map_flags, int node); -int bpf_create_map_in_map(enum bpf_map_type map_type, const char *name, - int key_size, int inner_map_fd, int max_entries, - __u32 map_flags); +LIBBPF_API int +bpf_create_map_xattr(const struct bpf_create_map_attr *create_attr); +LIBBPF_API int bpf_create_map_node(enum bpf_map_type map_type, const char *name, + int key_size, int value_size, + int max_entries, __u32 map_flags, int node); +LIBBPF_API int bpf_create_map_name(enum bpf_map_type map_type, const char *name, + int key_size, int value_size, + int max_entries, __u32 map_flags); +LIBBPF_API int bpf_create_map(enum bpf_map_type map_type, int key_size, + int value_size, int max_entries, __u32 map_flags); +LIBBPF_API int bpf_create_map_in_map_node(enum bpf_map_type map_type, + const char *name, int key_size, + int inner_map_fd, int max_entries, + __u32 map_flags, int node); +LIBBPF_API int bpf_create_map_in_map(enum bpf_map_type map_type, +const char *name, int key_size, +int inner_map_fd, int max_entries, +__u32 map_flags); struct bpf_load_program_attr { enum bpf_prog_type prog_type; @@ -74,44 +81,49 @@ struct bpf_load_program_attr { /* Recommend log buffer size */ #define BPF_LOG_BUF_SIZE (256 * 1024) -int bpf_load_program_xattr(const struct bpf_load_program_attr *load_attr, - char *log_buf, size_t log_buf_sz); -int bpf_load_program(enum bpf_prog_type type, const struct bpf_insn *insns, -size_t insns_cnt, const char *license, -__u32 kern_version, char *log_buf, -size_t log_buf_sz); -int bpf_verify_program(enum bpf_prog_type type, const struct bpf_insn *insns, - size_t insns_cnt, int 
strict_alignment, - const char *license, __u32 kern_version, - char *log_buf, size_t log_buf_sz, int log_level); +LIBBPF_API
Re: [PATCH -next] fore200e: fix missing unlock on error in bsq_audit()
From: Wei Yongjun Date: Mon, 15 Oct 2018 03:07:16 + > Add the missing unlock before return from function bsq_audit() > in the error handling case. > > Fixes: 1d9d8be91788 ("fore200e: check for dma mapping failures") > Signed-off-by: Wei Yongjun Applied.
Re: [PATCH net-next 00/23] bnxt_en: Add support for new 57500 chips.
From: Michael Chan Date: Sun, 14 Oct 2018 07:02:36 -0400 > This patch-set is larger than normal because I wanted a complete series > to add basic support for the new 57500 chips. The new chips have the > following main differences compared to legacy chips: > > 1. Requires the PF driver to allocate DMA context memory as a backing > store. > 2. New NQ (notification queue) for interrupt events. > 3. One or more CP rings can be associated with an NQ. > 4. 64-bit doorbells. > > Most other structures and firmware APIs are compatible with legacy > devices with some exceptions. For example, ring groups are no longer > used and RSS table format has changed. > > The patch-set includes the usual firmware spec. update, some refactoring > and restructuring, and adding the new code to add basic support for the > new class of devices. Looks good, series applied, thanks Michael.
Re: [PATCH net] ipv6: mcast: fix a use-after-free in inet6_mc_check
From: Eric Dumazet Date: Fri, 12 Oct 2018 18:58:53 -0700 > syzbot found a use-after-free in inet6_mc_check [1] > > The problem here is that inet6_mc_check() uses rcu > and read_lock(&iml->sflock) > > So the fact that ip6_mc_leave_src() is called under RTNL > and the socket lock does not help us; we need to acquire > iml->sflock in write mode. > > In the future, we should convert all this stuff to RCU. > > [1] > BUG: KASAN: use-after-free in ipv6_addr_equal include/net/ipv6.h:521 [inline] > BUG: KASAN: use-after-free in inet6_mc_check+0xae7/0xb40 net/ipv6/mcast.c:649 > Read of size 8 at addr 8801ce7f2510 by task syz-executor0/22432 ... > Signed-off-by: Eric Dumazet > Reported-by: syzbot Applied and queued up for -stable, thanks.
Re: [PATCH net-next 0/2] selftests: pmtu: Add test choice and captures
From: Stefano Brivio Date: Fri, 12 Oct 2018 23:54:12 +0200 > This series adds a couple of features useful for debugging: 1/2 > allows selecting single tests and 2/2 adds optional traffic > captures. > > Semantics for current invocation of test script are preserved. M0AR SELF TESTS! I love it. Keep them coming. Series applied, thanks.
Re: [PATCH net-next] r8169: simplify rtl8169_set_magic_reg
From: Heiner Kallweit Date: Fri, 12 Oct 2018 23:23:57 +0200 > Simplify this function, no functional change intended. > > Signed-off-by: Heiner Kallweit Applied.
Re: [PATCH net-next] r8169: remove unneeded call to netif_stop_queue in rtl8169_net_suspend
From: Heiner Kallweit Date: Fri, 12 Oct 2018 23:30:52 +0200 > netif_device_detach() stops all tx queues already, so we don't need > this call. > > Signed-off-by: Heiner Kallweit Applied.
Re: [PATCH net-next] nfp: devlink port split support for 1x100G CXP NIC
From: Jakub Kicinski Date: Fri, 12 Oct 2018 11:09:01 -0700 > From: Ryan C Goodfellow > > This commit makes it possible to use devlink to split the 100G CXP > Netronome into two 40G interfaces. Currently when you ask for 2 > interfaces, the math in src/nfp_devlink.c:nfp_devlink_port_split > calculates that you want 5 lanes per port because for some reason > eth_port.port_lanes=10 (shouldn't this be 12 for CXP?). What we really > want when asking for 2 breakout interfaces is 4 lanes per port. This > commit makes that happen by calculating based on 8 lanes if 10 are > present. > > Signed-off-by: Ryan C Goodfellow > Reviewed-by: Jakub Kicinski > Reviewed-by: Greg Weeks Applied.
Re: [PATCH net-next 0/6] dpaa2-eth: code cleanup
From: Ioana Ciornei Date: Fri, 12 Oct 2018 16:27:16 + > There are no functional changes in this patch set, only some cleanup > changes such as: unused parameters, uninitialized variables and > unnecessary Kconfig dependencies. Series applied.
Re: [PATCH net] ipv6: rate-limit probes for neighbourless routes
From: Sabrina Dubroca Date: Fri, 12 Oct 2018 16:22:47 +0200 > When commit 270972554c91 ("[IPV6]: ROUTE: Add Router Reachability > Probing (RFC4191).") introduced router probing, the rt6_probe() function > required that a neighbour entry existed. This neighbour entry is used to > record the timestamp of the last probe via the ->updated field. > > Later, commit 2152caea7196 ("ipv6: Do not depend on rt->n in rt6_probe().") > removed the requirement for a neighbour entry. Neighbourless routes skip > the interval check and are not rate-limited. > > This patch adds rate-limiting for neighbourless routes, by recording the > timestamp of the last probe in the fib6_info itself. > > Fixes: 2152caea7196 ("ipv6: Do not depend on rt->n in rt6_probe().") > Signed-off-by: Sabrina Dubroca > Reviewed-by: Stefano Brivio Applied and queued up for -stable.
Re: [PATCH net-next 0/2] net: phy: improve and simplify state machine
From: Heiner Kallweit Date: Thu, 11 Oct 2018 22:35:35 +0200 > Improve / simplify handling of states PHY_RUNNING and PHY_RESUMING in > phylib state machine. Series applied.
Re: [PATCH net-next v2] vxlan: support NTF_USE refresh of fdb entries
From: Roopa Prabhu Date: Thu, 11 Oct 2018 12:35:13 -0700 > From: Roopa Prabhu > > This makes use of NTF_USE in vxlan driver consistent > with bridge driver. > > Signed-off-by: Roopa Prabhu Applied.
Re: [Patch net] llc: set SOCK_RCU_FREE in llc_sap_add_socket()
From: Cong Wang Date: Thu, 11 Oct 2018 11:15:13 -0700 > When an llc sock is added into the sk_laddr_hash of an llc_sap, > it is not marked with SOCK_RCU_FREE. > > This means the sock could be freed while it is still being > read by __llc_lookup_established() with the RCU read lock. The sock is > refcounted, but with the RCU read lock, nothing prevents the readers > from getting a zero refcnt. > > Fix it by setting SOCK_RCU_FREE in llc_sap_add_socket(). > > Reported-by: syzbot+11e05f04c15e03be5...@syzkaller.appspotmail.com > Signed-off-by: Cong Wang Applied and queued up for -stable.
Re: [PATCH net-next v7] net/ncsi: Extend NC-SI Netlink interface to allow user space to send NC-SI command
From: Date: Thu, 11 Oct 2018 18:07:37 + > The new command (NCSI_CMD_SEND_CMD) is added to allow a user space application > to send NC-SI commands to the network card. > Also, add a new attribute (NCSI_ATTR_DATA) for transferring the request and > response. > > The workflow is as below. > > Request: > User space application > -> Netlink interface (msg) > -> new Netlink handler - ncsi_send_cmd_nl() > -> ncsi_xmit_cmd() > > Response: > Response received - ncsi_rcv_rsp() > -> internal response handler - ncsi_rsp_handler_xxx() > -> ncsi_rsp_handler_netlink() > -> ncsi_send_netlink_rsp () > -> Netlink interface (msg) > -> user space application > > Command timeout - ncsi_request_timeout() > -> ncsi_send_netlink_timeout () > -> Netlink interface (msg with zero data length) > -> user space application > > Error: > Error detected > -> ncsi_send_netlink_err () > -> Netlink interface (err msg) > -> user space application > > > Signed-off-by: Justin Lee Applied.
Re: [PATCH net-next] net: phy: trigger state machine immediately in phy_start_machine
From: Heiner Kallweit Date: Thu, 11 Oct 2018 19:31:47 +0200 > When starting the state machine there may be work to be done > immediately, e.g. if the initial state is PHY_UP then the state > machine may trigger an autonegotiation. Having said that, I see no need > to wait a second until the state machine is run for the first time. > > Signed-off-by: Heiner Kallweit Applied.
Re: [PATCH net-next 0/3] veth: XDP stats improvement
From: Toshiaki Makita Date: Thu, 11 Oct 2018 18:36:47 +0900 > ndo_xdp_xmit in veth did not update packet counters as described in [1]. > Also, current implementation only updates counters on tx side so rx side > events like XDP_DROP were not collected. > This series implements the missing accounting as well as support for > ethtool per-queue stats in veth. > > Patch 1: Update drop counter in ndo_xdp_xmit. > Patch 2: Update packet and byte counters for all XDP path, and drop > counter on XDP_DROP. > Patch 3: Support per-queue ethtool stats for XDP counters. > > Note that counters are maintained on per-queue basis for XDP but not > otherwise (per-cpu and atomic as before). This is because 1) tx path in > veth is essentially lockless so we cannot update per-queue stats on tx, > and 2) rx path is net core routine (process_backlog) which cannot update > per-queue based stats when XDP is disabled. On the other hand there are > real rxqs and napi handlers for veth XDP, so update per-queue stats on > rx for XDP packets, and use them to calculate tx counters as well, > contrary to the existing non-XDP counters. > > [1] https://patchwork.ozlabs.org/cover/953071/#1967449 > > Signed-off-by: Toshiaki Makita Series applied.
Re: [pull request][net 0/3] Mellanox, mlx5 fixes 2018-10-10
From: Saeed Mahameed Date: Wed, 10 Oct 2018 18:32:41 -0700 > This pull request includes some fixes to mlx5 driver, > Please pull and let me know if there's any problem. Pulled. > For -stable v4.11: > ('net/mlx5: Take only bit 24-26 of wqe.pftype_wq for page fault type') > For -stable v4.17: > ('net/mlx5: Fix memory leak when setting fpga ipsec caps') > For -stable v4.18: > ('net/mlx5: WQ, fixes for fragmented WQ buffers API') Queued up.
Re: [pull request][net-next 0/7] Mellanox, mlx5e and IPoIB netlink support fixes
From: Saeed Mahameed Date: Wed, 10 Oct 2018 18:24:37 -0700 > This series was meant to go to -rc but due to this late submission and the > size/complexity of this patchset, I am submitting to net-next. > > This series came to fix a very serious regression in the RDMA > IPoIB netlink child creation API; the patchset contains fixes to two > components and they must come together: > 1) IPoIB netlink implementation to allow allocation of the netdev to be done by > the rtnl netdev code > 2) mlx5e refactoring and changes to correctly initialize netdevices > created by the rdma stack. > > For more details please see the tag log below. > > Please pull and let me know if there's any problem. Pulled, thanks.
Re: [PATCH net v3] net/sched: cls_api: add missing validation of netlink attributes
From: Davide Caratti Date: Wed, 10 Oct 2018 22:00:58 +0200 > Similarly to what has been done in 8b4c3cdd9dd8 ("net: sched: Add policy > validation for tc attributes"), fix classifier code to add validation of > TCA_CHAIN and TCA_KIND netlink attributes. > > tested with: > # ./tdc.py -c filter > > v2: Let sch_api and cls_api share the nla_policy they have in common, thanks > to David Ahern. > v3: Avoid EXPORT_SYMBOL(), as validation of those attributes is not done > by TC modules, thanks to Cong Wang. > While at it, restore the 'Delete / get qdisc' comment to its original > position, just above the tc_get_qdisc() function prototype. > > Fixes: 5bc1701881e39 ("net: sched: introduce multichain support for filters") > Signed-off-by: Davide Caratti Applied and queued up for -stable.
Re: [PATCH net-next v2 0/2] FDDI: DEC FDDIcontroller 700 TURBOchannel adapter support
From: "Maciej W. Rozycki" Date: Tue, 9 Oct 2018 23:57:36 +0100 (BST) > Questions, comments? Otherwise, please apply. Series applied, thank you.
Re: [PATCH net-next] tun: Consistently configure generic netdev params via rtnetlink
From: Serhey Popovych Date: Tue, 9 Oct 2018 21:21:01 +0300 > Configuring generic network device parameters on tun will fail in > the presence of an IFLA_INFO_KIND attribute in the IFLA_LINKINFO nested attribute > since tun_validate() always returns failure. > > This can be visualized with the following ip-link(8) command sequences: > > # ip link set dev tun0 group 100 > # ip link set dev tun0 group 100 type tun > RTNETLINK answers: Invalid argument > > in contrast to the dummy and veth drivers: > > # ip link set dev dummy0 group 100 > # ip link set dev dummy0 type dummy > > # ip link set dev veth0 group 100 > # ip link set dev veth0 group 100 type veth > > Fix by returning zero in tun_validate() when @data is NULL, which is always the case since rtnl_link_ops->maxtype is zero in the tun driver. > > Fixes: f019a7a594d9 ("tun: Implement ip link del tunXXX") > Signed-off-by: Serhey Popovych Applied, thank you.
crash in xt_policy due to skb_dst_drop() in nf_ct_frag6_gather()
I believe that: commit ad8b1ffc3efae2f65080bdb11145c87d299b8f9a Author: Florian Westphal netfilter: ipv6: nf_defrag: drop skb dst before queueing +++ b/net/ipv6/netfilter/nf_conntrack_reasm.c @@ -618,6 +618,8 @@ int nf_ct_frag6_gather(struct net *net, struct sk_buff *skb, u32 user) fq->q.meat == fq->q.len && nf_ct_frag6_reasm(fq, skb, dev)) ret = 0; + else + skb_dst_drop(skb); out_unlock: spin_unlock_bh(&fq->q.lock); is causing a crash on android after upgrading from 4.9.96 to 4.9.119. This is because the clatd ipv4-to-ipv6 translation user space daemon is functionally equivalent to the syzkaller reproducer. It will convert ipv4 frags it receives via tap into ipv6 frags which it will write out via rawv6 sendmsg. However we are also using xt_policy; after stripping cruft this is basically: ip6tables -A OUTPUT -m policy --dir out --pol ipsec Crash is: match_policy_out() const struct dst_entry *dst = skb_dst(skb); // returns NULL if (dst->xfrm == NULL) <-- dst == NULL -> panic [ 1136.606948] c1 2675 [] policy_mt+0x34/0x18c [ 1136.606954] c1 2675 [] ip6t_do_table+0x280/0x684 [ 1136.606961] c1 2675 [] ip6table_filter_hook+0x20/0x28 [ 1136.606969] c1 2675 [] nf_hook_slow+0x98/0x154 [ 1136.606977] c1 2675 [] rawv6_sendmsg+0xd14/0x1520 [ 1136.606985] c1 2675 [] inet_sendmsg+0x100/0x1b0 [ 1136.606993] c1 2675 [] ___sys_sendmsg+0x2a0/0x414 [ 1136.606999] c1 2675 [] SyS_sendmsg+0x94/0xe4 Just checking for NULL in xt_policy.c:match_policy_out() and returning 0 or 1 unconditionally seems to be the wrong thing to do, since after all prior to skb_dst_drop() the skb->dst->xfrm might not have been NULL. Maciej Żenczykowski, Kernel Networking Developer @ Google
[PATCH v2 net-next 06/11] ipmr: Refactor mr_rtm_dumproute
From: David Ahern Move per-table loops from mr_rtm_dumproute to mr_table_dump and export mr_table_dump for dumps by specific table id. Signed-off-by: David Ahern --- include/linux/mroute_base.h | 6 net/ipv4/ipmr_base.c| 88 - 2 files changed, 61 insertions(+), 33 deletions(-) diff --git a/include/linux/mroute_base.h b/include/linux/mroute_base.h index 6675b9f81979..db85373c8d15 100644 --- a/include/linux/mroute_base.h +++ b/include/linux/mroute_base.h @@ -283,6 +283,12 @@ void *mr_mfc_find_any(struct mr_table *mrt, int vifi, void *hasharg); int mr_fill_mroute(struct mr_table *mrt, struct sk_buff *skb, struct mr_mfc *c, struct rtmsg *rtm); +int mr_table_dump(struct mr_table *mrt, struct sk_buff *skb, + struct netlink_callback *cb, + int (*fill)(struct mr_table *mrt, struct sk_buff *skb, + u32 portid, u32 seq, struct mr_mfc *c, + int cmd, int flags), + spinlock_t *lock); int mr_rtm_dumproute(struct sk_buff *skb, struct netlink_callback *cb, struct mr_table *(*iter)(struct net *net, struct mr_table *mrt), diff --git a/net/ipv4/ipmr_base.c b/net/ipv4/ipmr_base.c index 1ad9aa62a97b..132dd2613ca5 100644 --- a/net/ipv4/ipmr_base.c +++ b/net/ipv4/ipmr_base.c @@ -268,6 +268,55 @@ int mr_fill_mroute(struct mr_table *mrt, struct sk_buff *skb, } EXPORT_SYMBOL(mr_fill_mroute); +int mr_table_dump(struct mr_table *mrt, struct sk_buff *skb, + struct netlink_callback *cb, + int (*fill)(struct mr_table *mrt, struct sk_buff *skb, + u32 portid, u32 seq, struct mr_mfc *c, + int cmd, int flags), + spinlock_t *lock) +{ + unsigned int e = 0, s_e = cb->args[1]; + unsigned int flags = NLM_F_MULTI; + struct mr_mfc *mfc; + int err; + + list_for_each_entry_rcu(mfc, >mfc_cache_list, list) { + if (e < s_e) + goto next_entry; + + err = fill(mrt, skb, NETLINK_CB(cb->skb).portid, + cb->nlh->nlmsg_seq, mfc, RTM_NEWROUTE, flags); + if (err < 0) + goto out; +next_entry: + e++; + } + e = 0; + s_e = 0; + + spin_lock_bh(lock); + list_for_each_entry(mfc, >mfc_unres_queue, list) { + if (e < s_e) + goto 
next_entry2; + + err = fill(mrt, skb, NETLINK_CB(cb->skb).portid, + cb->nlh->nlmsg_seq, mfc, RTM_NEWROUTE, flags); + if (err < 0) { + spin_unlock_bh(lock); + goto out; + } +next_entry2: + e++; + } + spin_unlock_bh(lock); + err = 0; + e = 0; + +out: + cb->args[1] = e; + return err; +} + int mr_rtm_dumproute(struct sk_buff *skb, struct netlink_callback *cb, struct mr_table *(*iter)(struct net *net, struct mr_table *mrt), @@ -277,51 +326,24 @@ int mr_rtm_dumproute(struct sk_buff *skb, struct netlink_callback *cb, int cmd, int flags), spinlock_t *lock) { - unsigned int t = 0, e = 0, s_t = cb->args[0], s_e = cb->args[1]; + unsigned int t = 0, s_t = cb->args[0]; struct net *net = sock_net(skb->sk); struct mr_table *mrt; - struct mr_mfc *mfc; + int err; rcu_read_lock(); for (mrt = iter(net, NULL); mrt; mrt = iter(net, mrt)) { if (t < s_t) goto next_table; - list_for_each_entry_rcu(mfc, >mfc_cache_list, list) { - if (e < s_e) - goto next_entry; - if (fill(mrt, skb, NETLINK_CB(cb->skb).portid, -cb->nlh->nlmsg_seq, mfc, -RTM_NEWROUTE, NLM_F_MULTI) < 0) - goto done; -next_entry: - e++; - } - e = 0; - s_e = 0; - - spin_lock_bh(lock); - list_for_each_entry(mfc, >mfc_unres_queue, list) { - if (e < s_e) - goto next_entry2; - if (fill(mrt, skb, NETLINK_CB(cb->skb).portid, -cb->nlh->nlmsg_seq, mfc, -RTM_NEWROUTE, NLM_F_MULTI) < 0) { - spin_unlock_bh(lock); - goto done; - } -next_entry2: - e++; - } - spin_unlock_bh(lock); - e = 0; - s_e = 0; + +
[PATCH v2 net-next 01/11] netlink: Add answer_flags to netlink_callback
From: David Ahern With dump filtering we need a way to ensure the NLM_F_DUMP_FILTERED flag is set on a message back to the user if the data returned is influenced by some input attributes. Normally this can be done as messages are added to the skb, but if the filter results in no data being returned, the user could be confused as to why. This patch adds answer_flags to the netlink_callback allowing dump handlers to set the NLM_F_DUMP_FILTERED at a minimum in the NLMSG_DONE message ensuring the flag gets back to the user. The netlink_callback space is initialized to 0 via a memset in __netlink_dump_start, so init of the new answer_flags is covered. Signed-off-by: David Ahern --- include/linux/netlink.h | 1 + net/netlink/af_netlink.c | 3 ++- 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/include/linux/netlink.h b/include/linux/netlink.h index 72580f1a72a2..4da90a6ab536 100644 --- a/include/linux/netlink.h +++ b/include/linux/netlink.h @@ -180,6 +180,7 @@ struct netlink_callback { u16 family; u16 min_dump_alloc; boolstrict_check; + u16 answer_flags; unsigned intprev_seq, seq; longargs[6]; }; diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c index e613a9f89600..6bb9f3cde0b0 100644 --- a/net/netlink/af_netlink.c +++ b/net/netlink/af_netlink.c @@ -2257,7 +2257,8 @@ static int netlink_dump(struct sock *sk) } nlh = nlmsg_put_answer(skb, cb, NLMSG_DONE, - sizeof(nlk->dump_done_errno), NLM_F_MULTI); + sizeof(nlk->dump_done_errno), + NLM_F_MULTI | cb->answer_flags); if (WARN_ON(!nlh)) goto errout_skb; -- 2.11.0
[PATCH v2 net-next 05/11] net/mpls: Plumb support for filtering route dumps
From: David Ahern Implement kernel side filtering of routes by egress device index and protocol. MPLS uses only a single table and route type. Signed-off-by: David Ahern --- net/mpls/af_mpls.c | 42 +- 1 file changed, 41 insertions(+), 1 deletion(-) diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c index bfcb4759c9ee..48f4cbd9fb38 100644 --- a/net/mpls/af_mpls.c +++ b/net/mpls/af_mpls.c @@ -2067,12 +2067,35 @@ static int mpls_valid_fib_dump_req(struct net *net, const struct nlmsghdr *nlh, } #endif +static bool mpls_rt_uses_dev(struct mpls_route *rt, +const struct net_device *dev) +{ + struct net_device *nh_dev; + + if (rt->rt_nhn == 1) { + struct mpls_nh *nh = rt->rt_nh; + + nh_dev = rtnl_dereference(nh->nh_dev); + if (dev == nh_dev) + return true; + } else { + for_nexthops(rt) { + nh_dev = rtnl_dereference(nh->nh_dev); + if (nh_dev == dev) + return true; + } endfor_nexthops(rt); + } + + return false; +} + static int mpls_dump_routes(struct sk_buff *skb, struct netlink_callback *cb) { const struct nlmsghdr *nlh = cb->nlh; struct net *net = sock_net(skb->sk); struct mpls_route __rcu **platform_label; struct fib_dump_filter filter = {}; + unsigned int flags = NLM_F_MULTI; size_t platform_labels; unsigned int index; @@ -2084,6 +2107,14 @@ static int mpls_dump_routes(struct sk_buff *skb, struct netlink_callback *cb) err = mpls_valid_fib_dump_req(net, nlh, , cb->extack); if (err < 0) return err; + + /* for MPLS, there is only 1 table with fixed type and flags. +* If either are set in the filter then return nothing. 
+*/ + if ((filter.table_id && filter.table_id != RT_TABLE_MAIN) || + (filter.rt_type && filter.rt_type != RTN_UNICAST) || +filter.flags) + return skb->len; } index = cb->args[0]; @@ -2092,15 +2123,24 @@ static int mpls_dump_routes(struct sk_buff *skb, struct netlink_callback *cb) platform_label = rtnl_dereference(net->mpls.platform_label); platform_labels = net->mpls.platform_labels; + + if (filter.filter_set) + flags |= NLM_F_DUMP_FILTERED; + for (; index < platform_labels; index++) { struct mpls_route *rt; + rt = rtnl_dereference(platform_label[index]); if (!rt) continue; + if ((filter.dev && !mpls_rt_uses_dev(rt, filter.dev)) || + (filter.protocol && rt->rt_protocol != filter.protocol)) + continue; + if (mpls_dump_route(skb, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq, RTM_NEWROUTE, - index, rt, NLM_F_MULTI) < 0) + index, rt, flags) < 0) break; } cb->args[0] = index; -- 2.11.0
[PATCH v2 net-next 11/11] net/ipv4: Bail early if user only wants prefix entries
From: David Ahern Unlike IPv6, IPv4 does not have routes marked with RTF_PREFIX_RT. If the flag is set in the dump request, just return. In the process of this change, move the CLONE check to use the new filter flags. Signed-off-by: David Ahern --- net/ipv4/fib_frontend.c | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c index e86ca2255181..5bf653f36911 100644 --- a/net/ipv4/fib_frontend.c +++ b/net/ipv4/fib_frontend.c @@ -886,10 +886,14 @@ static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb) err = ip_valid_fib_dump_req(net, nlh, , cb); if (err < 0) return err; + } else if (nlmsg_len(nlh) >= sizeof(struct rtmsg)) { + struct rtmsg *rtm = nlmsg_data(nlh); + + filter.flags = rtm->rtm_flags & (RTM_F_PREFIX | RTM_F_CLONED); } - if (nlmsg_len(nlh) >= sizeof(struct rtmsg) && - ((struct rtmsg *)nlmsg_data(nlh))->rtm_flags & RTM_F_CLONED) + /* fib entries are never clones and ipv4 does not use prefix flag */ + if (filter.flags & (RTM_F_PREFIX | RTM_F_CLONED)) return skb->len; if (filter.table_id) { -- 2.11.0
[PATCH v2 net-next 07/11] net: Plumb support for filtering ipv4 and ipv6 multicast route dumps
From: David Ahern Implement kernel side filtering of routes by egress device index and table id. If the table id is given in the filter, lookup table and call mr_table_dump directly for it. Signed-off-by: David Ahern --- include/linux/mroute_base.h | 7 --- net/ipv4/ipmr.c | 18 +++--- net/ipv4/ipmr_base.c| 42 +++--- net/ipv6/ip6mr.c| 18 +++--- 4 files changed, 73 insertions(+), 12 deletions(-) diff --git a/include/linux/mroute_base.h b/include/linux/mroute_base.h index db85373c8d15..34de06b426ef 100644 --- a/include/linux/mroute_base.h +++ b/include/linux/mroute_base.h @@ -7,6 +7,7 @@ #include #include #include +#include /** * struct vif_device - interface representor for multicast routing @@ -288,7 +289,7 @@ int mr_table_dump(struct mr_table *mrt, struct sk_buff *skb, int (*fill)(struct mr_table *mrt, struct sk_buff *skb, u32 portid, u32 seq, struct mr_mfc *c, int cmd, int flags), - spinlock_t *lock); + spinlock_t *lock, struct fib_dump_filter *filter); int mr_rtm_dumproute(struct sk_buff *skb, struct netlink_callback *cb, struct mr_table *(*iter)(struct net *net, struct mr_table *mrt), @@ -296,7 +297,7 @@ int mr_rtm_dumproute(struct sk_buff *skb, struct netlink_callback *cb, struct sk_buff *skb, u32 portid, u32 seq, struct mr_mfc *c, int cmd, int flags), -spinlock_t *lock); +spinlock_t *lock, struct fib_dump_filter *filter); int mr_dump(struct net *net, struct notifier_block *nb, unsigned short family, int (*rules_dump)(struct net *net, @@ -346,7 +347,7 @@ mr_rtm_dumproute(struct sk_buff *skb, struct netlink_callback *cb, struct sk_buff *skb, u32 portid, u32 seq, struct mr_mfc *c, int cmd, int flags), -spinlock_t *lock) +spinlock_t *lock, struct fib_dump_filter *filter) { return -EINVAL; } diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c index 44d777058960..3fa988e6a3df 100644 --- a/net/ipv4/ipmr.c +++ b/net/ipv4/ipmr.c @@ -2528,18 +2528,30 @@ static int ipmr_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh, static int ipmr_rtm_dumproute(struct sk_buff 
*skb, struct netlink_callback *cb) { struct fib_dump_filter filter = {}; + int err; if (cb->strict_check) { - int err; - err = ip_valid_fib_dump_req(sock_net(skb->sk), cb->nlh, , cb->extack); if (err < 0) return err; } + if (filter.table_id) { + struct mr_table *mrt; + + mrt = ipmr_get_table(sock_net(skb->sk), filter.table_id); + if (!mrt) { + NL_SET_ERR_MSG(cb->extack, "ipv4: MR table does not exist"); + return -ENOENT; + } + err = mr_table_dump(mrt, skb, cb, _ipmr_fill_mroute, + _unres_lock, ); + return skb->len ? : err; + } + return mr_rtm_dumproute(skb, cb, ipmr_mr_table_iter, - _ipmr_fill_mroute, _unres_lock); + _ipmr_fill_mroute, _unres_lock, ); } static const struct nla_policy rtm_ipmr_policy[RTA_MAX + 1] = { diff --git a/net/ipv4/ipmr_base.c b/net/ipv4/ipmr_base.c index 132dd2613ca5..bfe8fd04afa0 100644 --- a/net/ipv4/ipmr_base.c +++ b/net/ipv4/ipmr_base.c @@ -268,21 +268,45 @@ int mr_fill_mroute(struct mr_table *mrt, struct sk_buff *skb, } EXPORT_SYMBOL(mr_fill_mroute); +static bool mr_mfc_uses_dev(const struct mr_table *mrt, + const struct mr_mfc *c, + const struct net_device *dev) +{ + int ct; + + for (ct = c->mfc_un.res.minvif; ct < c->mfc_un.res.maxvif; ct++) { + if (VIF_EXISTS(mrt, ct) && c->mfc_un.res.ttls[ct] < 255) { + const struct vif_device *vif; + + vif = >vif_table[ct]; + if (vif->dev == dev) + return true; + } + } + return false; +} + int mr_table_dump(struct mr_table *mrt, struct sk_buff *skb, struct netlink_callback *cb, int (*fill)(struct mr_table *mrt, struct sk_buff *skb, u32 portid, u32 seq, struct mr_mfc *c, int cmd, int flags), - spinlock_t *lock) + spinlock_t *lock, struct fib_dump_filter *filter) { unsigned int e = 0,
[PATCH v2 net-next 10/11] net/ipv6: Bail early if user only wants cloned entries
From: David Ahern Similar to IPv4, IPv6 fib no longer contains cloned routes. If a user requests a route dump for only cloned entries, no sense walking the FIB and returning everything. Signed-off-by: David Ahern --- net/ipv6/ip6_fib.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c index 5562c77022c6..2a058b408a6a 100644 --- a/net/ipv6/ip6_fib.c +++ b/net/ipv6/ip6_fib.c @@ -586,10 +586,13 @@ static int inet6_dump_fib(struct sk_buff *skb, struct netlink_callback *cb) } else if (nlmsg_len(nlh) >= sizeof(struct rtmsg)) { struct rtmsg *rtm = nlmsg_data(nlh); - if (rtm->rtm_flags & RTM_F_PREFIX) - arg.filter.flags = RTM_F_PREFIX; + arg.filter.flags = rtm->rtm_flags & (RTM_F_PREFIX|RTM_F_CLONED); } + /* fib entries are never clones */ + if (arg.filter.flags & RTM_F_CLONED) + return skb->len; + w = (void *)cb->args[2]; if (!w) { /* New dump: -- 2.11.0
[PATCH v2 net-next 09/11] net/mpls: Handle kernel side filtering of route dumps
From: David Ahern Update the dump request parsing in MPLS for the non-INET case to enable kernel side filtering. If INET is disabled the only filters that make sense for MPLS are protocol and nexthop device. Signed-off-by: David Ahern --- net/mpls/af_mpls.c | 33 - 1 file changed, 28 insertions(+), 5 deletions(-) diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c index 24381696932a..7d55d4c04088 100644 --- a/net/mpls/af_mpls.c +++ b/net/mpls/af_mpls.c @@ -2044,7 +2044,9 @@ static int mpls_valid_fib_dump_req(struct net *net, const struct nlmsghdr *nlh, struct netlink_callback *cb) { struct netlink_ext_ack *extack = cb->extack; + struct nlattr *tb[RTA_MAX + 1]; struct rtmsg *rtm; + int err, i; if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*rtm))) { NL_SET_ERR_MSG_MOD(extack, "Invalid header for FIB dump request"); @@ -2053,15 +2055,36 @@ static int mpls_valid_fib_dump_req(struct net *net, const struct nlmsghdr *nlh, rtm = nlmsg_data(nlh); if (rtm->rtm_dst_len || rtm->rtm_src_len || rtm->rtm_tos || - rtm->rtm_table || rtm->rtm_protocol || rtm->rtm_scope || - rtm->rtm_type|| rtm->rtm_flags) { + rtm->rtm_table || rtm->rtm_scope|| rtm->rtm_type || + rtm->rtm_flags) { NL_SET_ERR_MSG_MOD(extack, "Invalid values in header for FIB dump request"); return -EINVAL; } - if (nlmsg_attrlen(nlh, sizeof(*rtm))) { - NL_SET_ERR_MSG_MOD(extack, "Invalid data after header in FIB dump request"); - return -EINVAL; + if (rtm->rtm_protocol) { + filter->protocol = rtm->rtm_protocol; + filter->filter_set = 1; + cb->answer_flags = NLM_F_DUMP_FILTERED; + } + + err = nlmsg_parse_strict(nlh, sizeof(*rtm), tb, RTA_MAX, +rtm_mpls_policy, extack); + if (err < 0) + return err; + + for (i = 0; i <= RTA_MAX; ++i) { + int ifindex; + + if (i == RTA_OIF) { + ifindex = nla_get_u32(tb[i]); + filter->dev = __dev_get_by_index(net, ifindex); + if (!filter->dev) + return -ENODEV; + filter->filter_set = 1; + } else if (tb[i]) { + NL_SET_ERR_MSG_MOD(extack, "Unsupported attribute in dump request"); + return 
-EINVAL; + } } return 0; -- 2.11.0
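The MPLS patch above keeps strict validation by whitelisting attributes: it walks every parsed attribute slot, accepts only RTA_OIF, and rejects anything else. A minimal userspace sketch of that loop is below; the `*_DEMO` names and the plain `int *` "attribute table" are hypothetical stand-ins for the kernel's nlattr machinery, not its real API.

```c
#include <errno.h>
#include <stddef.h>

/* Stand-ins for RTA_MAX / RTA_OIF from the rtnetlink uapi. */
#define RTA_MAX_DEMO 8
#define RTA_OIF_DEMO 4

/* Walk all attribute slots; pull the one supported attribute
 * (the egress ifindex) and fail on any other present attribute,
 * mirroring the whitelist loop in mpls_valid_fib_dump_req(). */
static int validate_dump_attrs(const int *tb[RTA_MAX_DEMO + 1], int *oif)
{
	int i;

	for (i = 0; i <= RTA_MAX_DEMO; i++) {
		if (!tb[i])
			continue;
		if (i == RTA_OIF_DEMO)
			*oif = *tb[i];
		else
			return -EINVAL; /* unsupported attribute in dump request */
	}
	return 0;
}
```

The whitelist shape is what makes strict checking forward-compatible: a new, unhandled attribute is an error rather than silently ignored.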
[PATCH v2 net-next 04/11] net/ipv6: Plumb support for filtering route dumps
From: David Ahern Implement kernel side filtering of routes by table id, egress device index, protocol, and route type. If the table id is given in the filter, lookup the table and call fib6_dump_table directly for it. Move the existing route flags check for prefix only routes to the new filter. Signed-off-by: David Ahern --- net/ipv6/ip6_fib.c | 28 ++-- net/ipv6/route.c | 40 2 files changed, 54 insertions(+), 14 deletions(-) diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c index 94e61fe47ff8..a51fc357a05c 100644 --- a/net/ipv6/ip6_fib.c +++ b/net/ipv6/ip6_fib.c @@ -583,10 +583,12 @@ static int inet6_dump_fib(struct sk_buff *skb, struct netlink_callback *cb) err = ip_valid_fib_dump_req(net, nlh, &arg.filter, cb->extack); if (err < 0) return err; - } + } else if (nlmsg_len(nlh) >= sizeof(struct rtmsg)) { struct rtmsg *rtm = nlmsg_data(nlh); - s_h = cb->args[0]; - s_e = cb->args[1]; + if (rtm->rtm_flags & RTM_F_PREFIX) + arg.filter.flags = RTM_F_PREFIX; + } w = (void *)cb->args[2]; if (!w) { @@ -612,6 +614,20 @@ static int inet6_dump_fib(struct sk_buff *skb, struct netlink_callback *cb) arg.net = net; w->args = &arg; + if (arg.filter.table_id) { + tb = fib6_get_table(net, arg.filter.table_id); + if (!tb) { + NL_SET_ERR_MSG_MOD(cb->extack, "FIB table does not exist"); + return -ENOENT; + } + + res = fib6_dump_table(tb, skb, cb); + goto out; + } + + s_h = cb->args[0]; + s_e = cb->args[1]; + rcu_read_lock(); for (h = s_h; h < FIB6_TABLE_HASHSZ; h++, s_e = 0) { e = 0; @@ -621,16 +637,16 @@ static int inet6_dump_fib(struct sk_buff *skb, struct netlink_callback *cb) goto next; res = fib6_dump_table(tb, skb, cb); if (res != 0) - goto out; + goto out_unlock; next: e++; } } -out: +out_unlock: rcu_read_unlock(); cb->args[1] = e; cb->args[0] = h; - +out: res = res < 0 ?
res : skb->len; if (res <= 0) fib6_dump_end(cb); diff --git a/net/ipv6/route.c b/net/ipv6/route.c index f4e08b0689a8..9fd600e42f9d 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -4767,28 +4767,52 @@ static int rt6_fill_node(struct net *net, struct sk_buff *skb, return -EMSGSIZE; } +static bool fib6_info_uses_dev(const struct fib6_info *f6i, + const struct net_device *dev) +{ + if (f6i->fib6_nh.nh_dev == dev) + return true; + + if (f6i->fib6_nsiblings) { + struct fib6_info *sibling, *next_sibling; + + list_for_each_entry_safe(sibling, next_sibling, +&f6i->fib6_siblings, fib6_siblings) { + if (sibling->fib6_nh.nh_dev == dev) + return true; + } + } + + return false; +} + int rt6_dump_route(struct fib6_info *rt, void *p_arg) { struct rt6_rtnl_dump_arg *arg = (struct rt6_rtnl_dump_arg *) p_arg; + struct fib_dump_filter *filter = &arg->filter; + unsigned int flags = NLM_F_MULTI; struct net *net = arg->net; if (rt == net->ipv6.fib6_null_entry) return 0; - if (nlmsg_len(arg->cb->nlh) >= sizeof(struct rtmsg)) { - struct rtmsg *rtm = nlmsg_data(arg->cb->nlh); - - /* user wants prefix routes only */ - if (rtm->rtm_flags & RTM_F_PREFIX && - !(rt->fib6_flags & RTF_PREFIX_RT)) { - /* success since this is not a prefix route */ + if ((filter->flags & RTM_F_PREFIX) && + !(rt->fib6_flags & RTF_PREFIX_RT)) { + /* success since this is not a prefix route */ + return 1; + } + if (filter->filter_set) { + if ((filter->rt_type && rt->fib6_type != filter->rt_type) || + (filter->dev && !fib6_info_uses_dev(rt, filter->dev)) || + (filter->protocol && rt->fib6_protocol != filter->protocol)) { return 1; } + flags |= NLM_F_DUMP_FILTERED; } return rt6_fill_node(net, arg->skb, rt, NULL, NULL, NULL, 0, RTM_NEWROUTE, NETLINK_CB(arg->cb->skb).portid, -arg->cb->nlh->nlmsg_seq, NLM_F_MULTI); +arg->cb->nlh->nlmsg_seq, flags); } static int inet6_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh, -- 2.11.0
[PATCH v2 net-next 02/11] net: Add struct for fib dump filter
From: David Ahern Add struct fib_dump_filter for options on limiting which routes are returned in a dump request. The current list is table id, protocol, route type, rtm_flags and nexthop device index. struct net is needed to lookup the net_device from the index. Declare the filter for each route dump handler and plumb the new arguments from dump handlers to ip_valid_fib_dump_req. Signed-off-by: David Ahern --- include/net/ip6_route.h | 1 + include/net/ip_fib.h| 13 - net/ipv4/fib_frontend.c | 6 -- net/ipv4/ipmr.c | 6 +- net/ipv6/ip6_fib.c | 5 +++-- net/ipv6/ip6mr.c| 5 - net/mpls/af_mpls.c | 12 7 files changed, 37 insertions(+), 11 deletions(-) diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h index cef186dbd2ce..7ab119936e69 100644 --- a/include/net/ip6_route.h +++ b/include/net/ip6_route.h @@ -174,6 +174,7 @@ struct rt6_rtnl_dump_arg { struct sk_buff *skb; struct netlink_callback *cb; struct net *net; + struct fib_dump_filter filter; }; int rt6_dump_route(struct fib6_info *f6i, void *p_arg); diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h index 852e4ebf2209..667013bf4266 100644 --- a/include/net/ip_fib.h +++ b/include/net/ip_fib.h @@ -222,6 +222,16 @@ struct fib_table { unsigned long __data[0]; }; +struct fib_dump_filter { + u32 table_id; + /* filter_set is an optimization that an entry is set */ + bool filter_set; + unsigned char protocol; + unsigned char rt_type; + unsigned int flags; + struct net_device *dev; +}; + int fib_table_lookup(struct fib_table *tb, const struct flowi4 *flp, struct fib_result *res, int fib_flags); int fib_table_insert(struct net *, struct fib_table *, struct fib_config *, @@ -453,6 +463,7 @@ static inline void fib_proc_exit(struct net *net) u32 ip_mtu_from_fib_result(struct fib_result *res, __be32 daddr); -int ip_valid_fib_dump_req(const struct nlmsghdr *nlh, +int ip_valid_fib_dump_req(struct net *net, const struct nlmsghdr *nlh, + struct fib_dump_filter *filter, struct netlink_ext_ack *extack); #endif /* _NET_FIB_H
*/ diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c index 0f1beceb47d5..850850dd80e1 100644 --- a/net/ipv4/fib_frontend.c +++ b/net/ipv4/fib_frontend.c @@ -802,7 +802,8 @@ static int inet_rtm_newroute(struct sk_buff *skb, struct nlmsghdr *nlh, return err; } -int ip_valid_fib_dump_req(const struct nlmsghdr *nlh, +int ip_valid_fib_dump_req(struct net *net, const struct nlmsghdr *nlh, + struct fib_dump_filter *filter, struct netlink_ext_ack *extack) { struct rtmsg *rtm; @@ -837,6 +838,7 @@ static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb) { const struct nlmsghdr *nlh = cb->nlh; struct net *net = sock_net(skb->sk); + struct fib_dump_filter filter = {}; unsigned int h, s_h; unsigned int e = 0, s_e; struct fib_table *tb; @@ -844,7 +846,7 @@ static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb) int dumped = 0, err; if (cb->strict_check) { - err = ip_valid_fib_dump_req(nlh, cb->extack); + err = ip_valid_fib_dump_req(net, nlh, &filter, cb->extack); if (err < 0) return err; } diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c index 91b0d5671649..44d777058960 100644 --- a/net/ipv4/ipmr.c +++ b/net/ipv4/ipmr.c @@ -2527,9 +2527,13 @@ static int ipmr_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh, static int ipmr_rtm_dumproute(struct sk_buff *skb, struct netlink_callback *cb) { + struct fib_dump_filter filter = {}; + if (cb->strict_check) { - int err = ip_valid_fib_dump_req(cb->nlh, cb->extack); + int err; + err = ip_valid_fib_dump_req(sock_net(skb->sk), cb->nlh, + &filter, cb->extack); if (err < 0) return err; } diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c index 0783af11b0b7..94e61fe47ff8 100644 --- a/net/ipv6/ip6_fib.c +++ b/net/ipv6/ip6_fib.c @@ -569,17 +569,18 @@ static int inet6_dump_fib(struct sk_buff *skb, struct netlink_callback *cb) { const struct nlmsghdr *nlh = cb->nlh; struct net *net = sock_net(skb->sk); + struct rt6_rtnl_dump_arg arg = {}; unsigned int h, s_h; unsigned int e = 0, s_e; - struct
rt6_rtnl_dump_arg arg; struct fib6_walker *w; struct fib6_table *tb; struct hlist_head *head; int res = 0; if (cb->strict_check) { - int err = ip_valid_fib_dump_req(nlh, cb->extack); +
[PATCH v2 net-next 08/11] net: Enable kernel side filtering of route dumps
From: David Ahern Update parsing of route dump request to enable kernel side filtering. Allow filtering results by protocol (e.g., which routing daemon installed the route), route type (e.g., unicast), table id and nexthop device. These amount to the low hanging fruit, yet a huge improvement, for dumping routes. ip_valid_fib_dump_req is called with RTNL held, so __dev_get_by_index can be used to look up the device index without taking a reference. From there filter->dev is only used during dump loops with the lock still held. Set NLM_F_DUMP_FILTERED in the answer_flags so the user knows the results have been filtered should no entries be returned. Signed-off-by: David Ahern --- include/net/ip_fib.h| 2 +- net/ipv4/fib_frontend.c | 51 ++--- net/ipv4/ipmr.c | 2 +- net/ipv6/ip6_fib.c | 2 +- net/ipv6/ip6mr.c| 2 +- net/mpls/af_mpls.c | 9 + 6 files changed, 53 insertions(+), 15 deletions(-) diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h index 1eabc9edd2b9..e8d9456bf36e 100644 --- a/include/net/ip_fib.h +++ b/include/net/ip_fib.h @@ -465,5 +465,5 @@ u32 ip_mtu_from_fib_result(struct fib_result *res, __be32 daddr); int ip_valid_fib_dump_req(struct net *net, const struct nlmsghdr *nlh, struct fib_dump_filter *filter, - struct netlink_ext_ack *extack); + struct netlink_callback *cb); #endif /* _NET_FIB_H */ diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c index 37dc8ac366fd..e86ca2255181 100644 --- a/net/ipv4/fib_frontend.c +++ b/net/ipv4/fib_frontend.c @@ -804,9 +804,14 @@ static int inet_rtm_newroute(struct sk_buff *skb, struct nlmsghdr *nlh, int ip_valid_fib_dump_req(struct net *net, const struct nlmsghdr *nlh, struct fib_dump_filter *filter, - struct netlink_ext_ack *extack) + struct netlink_callback *cb) { + struct netlink_ext_ack *extack = cb->extack; + struct nlattr *tb[RTA_MAX + 1]; struct rtmsg *rtm; + int err, i; + + ASSERT_RTNL(); if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*rtm))) { NL_SET_ERR_MSG(extack, "Invalid header for FIB dump 
request"); @@ -815,8 +820,7 @@ int ip_valid_fib_dump_req(struct net *net, const struct nlmsghdr *nlh, rtm = nlmsg_data(nlh); if (rtm->rtm_dst_len || rtm->rtm_src_len || rtm->rtm_tos || - rtm->rtm_table || rtm->rtm_protocol || rtm->rtm_scope || - rtm->rtm_type) { + rtm->rtm_scope) { NL_SET_ERR_MSG(extack, "Invalid values in header for FIB dump request"); return -EINVAL; } @@ -825,9 +829,42 @@ int ip_valid_fib_dump_req(struct net *net, const struct nlmsghdr *nlh, return -EINVAL; } - if (nlmsg_attrlen(nlh, sizeof(*rtm))) { - NL_SET_ERR_MSG(extack, "Invalid data after header in FIB dump request"); - return -EINVAL; + filter->flags = rtm->rtm_flags; + filter->protocol = rtm->rtm_protocol; + filter->rt_type = rtm->rtm_type; + filter->table_id = rtm->rtm_table; + + err = nlmsg_parse_strict(nlh, sizeof(*rtm), tb, RTA_MAX, +rtm_ipv4_policy, extack); + if (err < 0) + return err; + + for (i = 0; i <= RTA_MAX; ++i) { + int ifindex; + + if (!tb[i]) + continue; + + switch (i) { + case RTA_TABLE: + filter->table_id = nla_get_u32(tb[i]); + break; + case RTA_OIF: + ifindex = nla_get_u32(tb[i]); + filter->dev = __dev_get_by_index(net, ifindex); + if (!filter->dev) + return -ENODEV; + break; + default: + NL_SET_ERR_MSG(extack, "Unsupported attribute in dump request"); + return -EINVAL; + } + } + + if (filter->flags || filter->protocol || filter->rt_type || + filter->table_id || filter->dev) { + filter->filter_set = 1; + cb->answer_flags = NLM_F_DUMP_FILTERED; } return 0; @@ -846,7 +883,7 @@ static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb) int dumped = 0, err; if (cb->strict_check) { - err = ip_valid_fib_dump_req(net, nlh, &filter, cb->extack); + err = ip_valid_fib_dump_req(net, nlh, &filter, cb); if (err < 0) return err; } diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c index 3fa988e6a3df..7a3e2acda94c 100644 --- a/net/ipv4/ipmr.c +++ b/net/ipv4/ipmr.c @@ -2532,7 +2532,7 @@ static int ipmr_rtm_dumproute(struct sk_buff *skb,
[PATCH v2 net-next 00/11] net: Kernel side filtering for route dumps
From: David Ahern Implement kernel side filtering of route dumps by protocol (e.g., which routing daemon installed the route), route type (e.g., unicast), table id and nexthop device. iproute2 has been doing this filtering in userspace for years; pushing the filters to the kernel side reduces the amount of data the kernel sends and reduces wasted cycles on both sides processing unwanted data. These initial options provide a huge improvement for efficiently examining routes on large scale systems. v2 - better handling of requests for a specific table. Rather than walking the hash of all tables, lookup the specific table and dump it - refactor mr_rtm_dumproute moving the loop over the table into a helper that can be invoked directly - add hook to return NLM_F_DUMP_FILTERED in DONE message to ensure it is returned even when the dump returns nothing David Ahern (11): netlink: Add answer_flags to netlink_callback net: Add struct for fib dump filter net/ipv4: Plumb support for filtering route dumps net/ipv6: Plumb support for filtering route dumps net/mpls: Plumb support for filtering route dumps ipmr: Refactor mr_rtm_dumproute net: Plumb support for filtering ipv4 and ipv6 multicast route dumps net: Enable kernel side filtering of route dumps net/mpls: Handle kernel side filtering of route dumps net/ipv6: Bail early if user only wants cloned entries net/ipv4: Bail early if user only wants prefix entries include/linux/mroute_base.h | 11 +++- include/linux/netlink.h | 1 + include/net/ip6_route.h | 1 + include/net/ip_fib.h| 17 -- net/ipv4/fib_frontend.c | 76 ++ net/ipv4/fib_trie.c | 37 + net/ipv4/ipmr.c | 22 ++-- net/ipv4/ipmr_base.c| 126 net/ipv6/ip6_fib.c | 34 +--- net/ipv6/ip6mr.c| 21 ++-- net/ipv6/route.c| 40 +++--- net/mpls/af_mpls.c | 92 +++- net/netlink/af_netlink.c| 3 +- 13 files changed, 386 insertions(+), 95 deletions(-) -- 2.11.0
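From userspace, the series is exercised by putting the filter fields straight into the rtmsg header of a strict dump request; a filtering-aware kernel then sets NLM_F_DUMP_FILTERED in its replies. The sketch below only builds the request under the Linux uapi headers; a real dumper (iproute2) also appends attributes such as RTA_OIF for the device filter and RTA_TABLE for table ids above 255.

```c
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* A strict RTM_GETROUTE dump request whose header carries the
 * protocol and table filters described in the cover letter. */
struct route_dump_req {
	struct nlmsghdr nlh;
	struct rtmsg rtm;
};

static void build_filtered_dump_req(struct route_dump_req *req,
				    unsigned char protocol,
				    unsigned char table)
{
	memset(req, 0, sizeof(*req));
	req->nlh.nlmsg_len = NLMSG_LENGTH(sizeof(struct rtmsg));
	req->nlh.nlmsg_type = RTM_GETROUTE;
	req->nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
	req->rtm.rtm_family = AF_INET;
	req->rtm.rtm_protocol = protocol;	/* e.g. RTPROT_STATIC */
	req->rtm.rtm_table = table;		/* ids > 255 need RTA_TABLE */
}
```

On kernels without this series, strict checking rejects non-zero filter fields, so callers should be prepared to fall back to an unfiltered dump.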
[PATCH v2 net-next 03/11] net/ipv4: Plumb support for filtering route dumps
From: David Ahern Implement kernel side filtering of routes by table id, egress device index, protocol and route type. If the table id is given in the filter, lookup the table and call fib_table_dump directly for it. Signed-off-by: David Ahern --- include/net/ip_fib.h| 2 +- net/ipv4/fib_frontend.c | 13 - net/ipv4/fib_trie.c | 37 ++--- 3 files changed, 39 insertions(+), 13 deletions(-) diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h index 667013bf4266..1eabc9edd2b9 100644 --- a/include/net/ip_fib.h +++ b/include/net/ip_fib.h @@ -239,7 +239,7 @@ int fib_table_insert(struct net *, struct fib_table *, struct fib_config *, int fib_table_delete(struct net *, struct fib_table *, struct fib_config *, struct netlink_ext_ack *extack); int fib_table_dump(struct fib_table *table, struct sk_buff *skb, - struct netlink_callback *cb); + struct netlink_callback *cb, struct fib_dump_filter *filter); int fib_table_flush(struct net *net, struct fib_table *table); struct fib_table *fib_trie_unmerge(struct fib_table *main_tb); void fib_table_flush_external(struct fib_table *table); diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c index 850850dd80e1..37dc8ac366fd 100644 --- a/net/ipv4/fib_frontend.c +++ b/net/ipv4/fib_frontend.c @@ -855,6 +855,17 @@ static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb) ((struct rtmsg *)nlmsg_data(nlh))->rtm_flags & RTM_F_CLONED) return skb->len; + if (filter.table_id) { + tb = fib_get_table(net, filter.table_id); + if (!tb) { + NL_SET_ERR_MSG(cb->extack, "ipv4: FIB table does not exist"); + return -ENOENT; + } + + err = fib_table_dump(tb, skb, cb, &filter); + return skb->len ?
: err; + } + s_h = cb->args[0]; s_e = cb->args[1]; @@ -869,7 +880,7 @@ static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb) if (dumped) memset(&cb->args[2], 0, sizeof(cb->args) - 2 * sizeof(cb->args[0])); - err = fib_table_dump(tb, skb, cb); + err = fib_table_dump(tb, skb, cb, &filter); if (err < 0) { if (likely(skb->len)) goto out; diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c index 5bc0c89e81e4..237c9f72b265 100644 --- a/net/ipv4/fib_trie.c +++ b/net/ipv4/fib_trie.c @@ -2003,12 +2003,17 @@ void fib_free_table(struct fib_table *tb) } static int fn_trie_dump_leaf(struct key_vector *l, struct fib_table *tb, -struct sk_buff *skb, struct netlink_callback *cb) +struct sk_buff *skb, struct netlink_callback *cb, +struct fib_dump_filter *filter) { + unsigned int flags = NLM_F_MULTI; __be32 xkey = htonl(l->key); struct fib_alias *fa; int i, s_i; + if (filter->filter_set) + flags |= NLM_F_DUMP_FILTERED; + s_i = cb->args[4]; i = 0; @@ -2016,25 +2021,35 @@ static int fn_trie_dump_leaf(struct key_vector *l, struct fib_table *tb, hlist_for_each_entry_rcu(fa, &l->leaf, fa_list) { int err; - if (i < s_i) { - i++; - continue; - } + if (i < s_i) + goto next; - if (tb->tb_id != fa->tb_id) { - i++; - continue; + if (tb->tb_id != fa->tb_id) + goto next; + + if (filter->filter_set) { + if (filter->rt_type && fa->fa_type != filter->rt_type) + goto next; + + if ((filter->protocol && +fa->fa_info->fib_protocol != filter->protocol)) + goto next; + + if (filter->dev && + !fib_info_nh_uses_dev(fa->fa_info, filter->dev)) + goto next; } err = fib_dump_info(skb, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq, RTM_NEWROUTE, tb->tb_id, fa->fa_type, xkey, KEYLENGTH - fa->fa_slen, - fa->fa_tos, fa->fa_info, NLM_F_MULTI); + fa->fa_tos, fa->fa_info, flags); if (err < 0) { cb->args[4] = i; return err; } +next: i++; } @@ -2044,7 +2059,7 @@ static int
Re: [PATCH bpf-next v2 7/8] bpf: add tls support for testing in test_sockmap
On 10/16/2018 02:42 AM, Andrey Ignatov wrote: > Hi Daniel and John! > > Daniel Borkmann [Fri, 2018-10-12 17:46 -0700]: >> From: John Fastabend >> >> This adds a --ktls option to test_sockmap in order to enable the >> combination of ktls and sockmap to run, which makes for another >> batch of 648 test cases for both in combination. >> >> Signed-off-by: John Fastabend >> Signed-off-by: Daniel Borkmann >> --- >> tools/testing/selftests/bpf/test_sockmap.c | 89 >> ++ >> 1 file changed, 89 insertions(+) >> >> diff --git a/tools/testing/selftests/bpf/test_sockmap.c >> b/tools/testing/selftests/bpf/test_sockmap.c >> index ac7de38..10a5fa8 100644 >> --- a/tools/testing/selftests/bpf/test_sockmap.c >> +++ b/tools/testing/selftests/bpf/test_sockmap.c >> @@ -71,6 +71,7 @@ int txmsg_start; >> int txmsg_end; >> int txmsg_ingress; >> int txmsg_skb; >> +int ktls; >> >> static const struct option long_options[] = { >> {"help",no_argument,NULL, 'h' }, >> @@ -92,6 +93,7 @@ static const struct option long_options[] = { >> {"txmsg_end", required_argument, NULL, 'e'}, >> {"txmsg_ingress", no_argument, &txmsg_ingress, 1 }, >> {"txmsg_skb", no_argument, &txmsg_skb, 1 }, >> +{"ktls", no_argument, &ktls, 1 }, >> {0, 0, NULL, 0 } >> }; >> >> @@ -112,6 +114,76 @@ static void usage(char *argv[]) >> printf("\n"); >> } >> >> +#define TCP_ULP 31 >> +#define TLS_TX 1 >> +#define TLS_RX 2 >> +#include <linux/tls.h> > > This breaks selftest build for me: > test_sockmap.c:120:23: fatal error: linux/tls.h: No such file or directory > #include <linux/tls.h> > ^ > compilation terminated. > > Should include/uapi/linux/tls.h be copied to tools/ not to depend on > host headers? Good point, yes, that should happen; will send a fix tomorrow morning. Thanks, Daniel
Re: [PATCH bpf-next v2 7/8] bpf: add tls support for testing in test_sockmap
Hi Daniel and John! Daniel Borkmann [Fri, 2018-10-12 17:46 -0700]: > From: John Fastabend > > This adds a --ktls option to test_sockmap in order to enable the > combination of ktls and sockmap to run, which makes for another > batch of 648 test cases for both in combination. > > Signed-off-by: John Fastabend > Signed-off-by: Daniel Borkmann > --- > tools/testing/selftests/bpf/test_sockmap.c | 89 > ++ > 1 file changed, 89 insertions(+) > > diff --git a/tools/testing/selftests/bpf/test_sockmap.c > b/tools/testing/selftests/bpf/test_sockmap.c > index ac7de38..10a5fa8 100644 > --- a/tools/testing/selftests/bpf/test_sockmap.c > +++ b/tools/testing/selftests/bpf/test_sockmap.c > @@ -71,6 +71,7 @@ int txmsg_start; > int txmsg_end; > int txmsg_ingress; > int txmsg_skb; > +int ktls; > > static const struct option long_options[] = { > {"help",no_argument,NULL, 'h' }, > @@ -92,6 +93,7 @@ static const struct option long_options[] = { > {"txmsg_end", required_argument, NULL, 'e'}, > {"txmsg_ingress", no_argument, &txmsg_ingress, 1 }, > {"txmsg_skb", no_argument, &txmsg_skb, 1 }, > + {"ktls", no_argument, &ktls, 1 }, > {0, 0, NULL, 0 } > }; > > @@ -112,6 +114,76 @@ static void usage(char *argv[]) > printf("\n"); > } > > +#define TCP_ULP 31 > +#define TLS_TX 1 > +#define TLS_RX 2 > +#include <linux/tls.h> This breaks selftest build for me: test_sockmap.c:120:23: fatal error: linux/tls.h: No such file or directory #include <linux/tls.h> ^ compilation terminated. Should include/uapi/linux/tls.h be copied to tools/ not to depend on host headers? 
> + > +char *sock_to_string(int s) > +{ > + if (s == c1) > + return "client1"; > + else if (s == c2) > + return "client2"; > + else if (s == s1) > + return "server1"; > + else if (s == s2) > + return "server2"; > + else if (s == p1) > + return "peer1"; > + else if (s == p2) > + return "peer2"; > + else > + return "unknown"; > +} > + > +static int sockmap_init_ktls(int verbose, int s) > +{ > + struct tls12_crypto_info_aes_gcm_128 tls_tx = { > + .info = { > + .version = TLS_1_2_VERSION, > + .cipher_type = TLS_CIPHER_AES_GCM_128, > + }, > + }; > + struct tls12_crypto_info_aes_gcm_128 tls_rx = { > + .info = { > + .version = TLS_1_2_VERSION, > + .cipher_type = TLS_CIPHER_AES_GCM_128, > + }, > + }; > + int so_buf = 6553500; > + int err; > + > + err = setsockopt(s, 6, TCP_ULP, "tls", sizeof("tls")); > + if (err) { > + fprintf(stderr, "setsockopt: TCP_ULP(%s) failed with error > %i\n", sock_to_string(s), err); > + return -EINVAL; > + } > + err = setsockopt(s, SOL_TLS, TLS_TX, (void *)&tls_tx, sizeof(tls_tx)); > + if (err) { > + fprintf(stderr, "setsockopt: TLS_TX(%s) failed with error > %i\n", sock_to_string(s), err); > + return -EINVAL; > + } > + err = setsockopt(s, SOL_TLS, TLS_RX, (void *)&tls_rx, sizeof(tls_rx)); > + if (err) { > + fprintf(stderr, "setsockopt: TLS_RX(%s) failed with error > %i\n", sock_to_string(s), err); > + return -EINVAL; > + } > + err = setsockopt(s, SOL_SOCKET, SO_SNDBUF, &so_buf, sizeof(so_buf)); > + if (err) { > + fprintf(stderr, "setsockopt: (%s) failed sndbuf with error > %i\n", sock_to_string(s), err); > + return -EINVAL; > + } > + err = setsockopt(s, SOL_SOCKET, SO_RCVBUF, &so_buf, sizeof(so_buf)); > + if (err) { > + fprintf(stderr, "setsockopt: (%s) failed rcvbuf with error > %i\n", sock_to_string(s), err); > + return -EINVAL; > + } > + > + if (verbose) > + fprintf(stdout, "socket(%s) kTLS enabled\n", sock_to_string(s)); > + return 0; > +} static int sockmap_init_sockets(int verbose) { int i, err, one = 1; @@ -456,6 +528,21 @@ static int 
sendmsg_test(struct sockmap_options *opt) > else > rx_fd = p2; > > + if (ktls) { > + /* Redirecting into non-TLS socket which sends into a TLS > + * socket is not a valid test. So in this case lets not > + * enable kTLS but still run the test. > + */ > + if (!txmsg_redir || (txmsg_redir && txmsg_ingress)) { > + err = sockmap_init_ktls(opt->verbose, rx_fd); > + if (err) > + return err; > + } > + err = sockmap_init_ktls(opt->verbose, c1); > + if (err) > + return err; > + } > + > rxpid = fork(); > if (rxpid == 0) { > if (opt->drop_expected) > @@ -907,6 +994,8 @@ static
pull-request: bpf-next 2018-10-16
Hi David, The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Convert BPF sockmap and kTLS to both use a new sk_msg API and enable sk_msg BPF integration for the latter, from Daniel and John. 2) Enable BPF syscall side to indicate for maps that they do not support a map lookup operation as opposed to just missing key, from Prashant. 3) Add bpftool map create command which after map creation pins the map into bpf fs for further processing, from Jakub. 4) Add bpftool support for attaching programs to maps allowing sock_map and sock_hash to be used from bpftool, from John. 5) Improve syscall BPF map update/delete path for map-in-map types to wait a RCU grace period for pending references to complete, from Daniel. 6) Couple of follow-up fixes for the BPF socket lookup to get it enabled also when IPv6 is compiled as a module, from Joe. 7) Fix a generic-XDP bug to handle the case when the Ethernet header was mangled and thus update skb's protocol and data, from Jesper. 8) Add a missing BTF header length check between header copies from user space, from Wenwen. 9) Minor fixups in libbpf to use __u32 instead u32 types and include proper perf_event.h uapi header instead of perf internal one, from Yonghong. 10) Allow to pass user-defined flags through EXTRA_CFLAGS and EXTRA_LDFLAGS to bpftool's build, from Jiri. 11) BPF kselftest tweaks to add LWTUNNEL to config fragment and to install with_addr.sh script from flow dissector selftest, from Anders. Please consider pulling these changes from: git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git Thanks a lot! 
The following changes since commit 071a234ad744ab9a1e9c948874d5f646a2964734: Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next (2018-10-08 23:42:44 -0700) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git for you to fetch changes up to 0b592b5a01bef5416472ec610d3191e019c144a5: tools: bpftool: add map create command (2018-10-15 16:39:21 -0700) Alexei Starovoitov (5): Merge branch 'unsupported-map-lookup' Merge branch 'xdp-vlan' Merge branch 'sockmap_and_ktls' Merge branch 'ipv6_sk_lookup_fixes' Merge branch 'bpftool_sockmap' Anders Roxell (2): selftests: bpf: add config fragment LWTUNNEL selftests: bpf: install script with_addr.sh Daniel Borkmann (5): tcp, ulp: enforce sock_owned_by_me upon ulp init and cleanup tcp, ulp: remove ulp bits from sockmap bpf, sockmap: convert to generic sk_msg interface tls: convert to generic sk_msg interface bpf, doc: add maintainers entry to related files Daniel Colascione (1): bpf: wait for running BPF programs when updating map-in-map Jakub Kicinski (1): tools: bpftool: add map create command Jesper Dangaard Brouer (3): net: fix generic XDP to handle if eth header was mangled bpf: make TC vlan bpf_helpers avail to selftests selftests/bpf: add XDP selftests for modifying and popping VLAN headers Jiri Olsa (2): bpftool: Allow to add compiler flags via EXTRA_CFLAGS variable bpftool: Allow add linker flags via EXTRA_LDFLAGS variable Joe Stringer (3): bpf: Fix dev pointer dereference from sk_skb bpf: Allow sk_lookup with IPv6 module bpf: Fix IPv6 dport byte-order in bpf_sk_lookup John Fastabend (5): tls: replace poll implementation with read hook tls: add bpf support to sk_msg handling bpf: add tls support for testing in test_sockmap bpf: bpftool, add support for attaching programs to maps bpf: bpftool, add flag to allow non-compat map definitions Prashant Bhole (6): bpf: error handling when map_lookup_elem isn't supported bpf: return EOPNOTSUPP when map lookup isn't 
supported tools/bpf: bpftool, split the function do_dump() tools/bpf: bpftool, print strerror when map lookup error occurs selftests/bpf: test_verifier, change names of fixup maps selftests/bpf: test_verifier, check bpf_map_lookup_elem access in bpf prog Wenwen Wang (1): bpf: btf: Fix a missing check bug Yonghong Song (1): tools/bpf: use proper type and uapi perf_event.h header for libbpf MAINTAINERS | 10 + include/linux/bpf.h | 33 +- include/linux/bpf_types.h|2 +- include/linux/filter.h | 21 - include/linux/skmsg.h| 410 include/net/addrconf.h |5 + include/net/sock.h |4 - include/net/tcp.h| 28 +- include/net/tls.h| 24 +- kernel/bpf/Makefile
[PATCH net 3/3] nfp: flower: use offsets provided by pedit instead of index for ipv6
From: Pieter Jansen van Vuuren Previously when populating the set ipv6 address action, we incorrectly made use of pedit's key index to determine which 32bit word should be set. We now calculate which word has been selected based on the offset provided by the pedit action. Fixes: 354b82bb320e ("nfp: add set ipv6 source and destination address") Signed-off-by: Pieter Jansen van Vuuren Reviewed-by: Jakub Kicinski --- .../ethernet/netronome/nfp/flower/action.c| 26 +++ 1 file changed, 15 insertions(+), 11 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/flower/action.c b/drivers/net/ethernet/netronome/nfp/flower/action.c index c39d7fdf73e6..7a1e9cd9cc62 100644 --- a/drivers/net/ethernet/netronome/nfp/flower/action.c +++ b/drivers/net/ethernet/netronome/nfp/flower/action.c @@ -450,12 +450,12 @@ nfp_fl_set_ip4(const struct tc_action *action, int idx, u32 off, } static void -nfp_fl_set_ip6_helper(int opcode_tag, int idx, __be32 exact, __be32 mask, +nfp_fl_set_ip6_helper(int opcode_tag, u8 word, __be32 exact, __be32 mask, struct nfp_fl_set_ipv6_addr *ip6) { - ip6->ipv6[idx % 4].mask |= mask; - ip6->ipv6[idx % 4].exact &= ~mask; - ip6->ipv6[idx % 4].exact |= exact & mask; + ip6->ipv6[word].mask |= mask; + ip6->ipv6[word].exact &= ~mask; + ip6->ipv6[word].exact |= exact & mask; ip6->reserved = cpu_to_be16(0); ip6->head.jump_id = opcode_tag; @@ -468,6 +468,7 @@ nfp_fl_set_ip6(const struct tc_action *action, int idx, u32 off, struct nfp_fl_set_ipv6_addr *ip_src) { __be32 exact, mask; + u8 word; /* We are expecting tcf_pedit to return a big endian value */ mask = (__force __be32)~tcf_pedit_mask(action, idx); @@ -476,17 +477,20 @@ nfp_fl_set_ip6(const struct tc_action *action, int idx, u32 off, if (exact & ~mask) return -EOPNOTSUPP; - if (off < offsetof(struct ipv6hdr, saddr)) + if (off < offsetof(struct ipv6hdr, saddr)) { return -EOPNOTSUPP; - else if (off < offsetof(struct ipv6hdr, daddr)) - nfp_fl_set_ip6_helper(NFP_FL_ACTION_OPCODE_SET_IPV6_SRC, idx, + } else if 
(off < offsetof(struct ipv6hdr, daddr)) { + word = (off - offsetof(struct ipv6hdr, saddr)) / sizeof(exact); + nfp_fl_set_ip6_helper(NFP_FL_ACTION_OPCODE_SET_IPV6_SRC, word, exact, mask, ip_src); - else if (off < offsetof(struct ipv6hdr, daddr) + - sizeof(struct in6_addr)) - nfp_fl_set_ip6_helper(NFP_FL_ACTION_OPCODE_SET_IPV6_DST, idx, + } else if (off < offsetof(struct ipv6hdr, daddr) + + sizeof(struct in6_addr)) { + word = (off - offsetof(struct ipv6hdr, daddr)) / sizeof(exact); + nfp_fl_set_ip6_helper(NFP_FL_ACTION_OPCODE_SET_IPV6_DST, word, exact, mask, ip_dst); - else + } else { return -EOPNOTSUPP; + } return 0; } -- 2.17.1
[PATCH net 0/3] nfp: fix pedit set action offloads
Hi,

Pieter says:

This set fixes set actions when using multiple pedit actions with partial masks and with multiple keys per pedit action. Additionally it fixes set ipv6 pedit action offloads when using it in combination with other header keys.

The problem would only trigger if one combines multiple pedit actions of the same type with partial masks, e.g.:

$ tc filter add dev netdev protocol ip parent : \
    flower indev netdev \
    ip_proto tcp \
    action pedit ex munge \
        ip src set 11.11.11.11 retain 65535 munge \
        ip src set 22.22.22.22 retain 4294901760 pipe \
    csum ip and tcp pipe \
    mirred egress redirect dev netdev

Pieter Jansen van Vuuren (3):
  nfp: flower: fix pedit set actions for multiple partial masks
  nfp: flower: fix multiple keys per pedit action
  nfp: flower: use offsets provided by pedit instead of index for ipv6

 .../ethernet/netronome/nfp/flower/action.c | 51 ---
 1 file changed, 33 insertions(+), 18 deletions(-)

-- 
2.17.1
[PATCH net 1/3] nfp: flower: fix pedit set actions for multiple partial masks
From: Pieter Jansen van Vuuren Previously we did not correctly change headers when using multiple pedit actions with partial masks. We now take this into account and no longer just commit the last pedit action. Fixes: c0b1bd9a8b8a ("nfp: add set ipv4 header action flower offload") Signed-off-by: Pieter Jansen van Vuuren Reviewed-by: Jakub Kicinski --- .../net/ethernet/netronome/nfp/flower/action.c| 15 +-- 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/flower/action.c b/drivers/net/ethernet/netronome/nfp/flower/action.c index 46ba0cf257c6..91de7a9b0190 100644 --- a/drivers/net/ethernet/netronome/nfp/flower/action.c +++ b/drivers/net/ethernet/netronome/nfp/flower/action.c @@ -429,12 +429,14 @@ nfp_fl_set_ip4(const struct tc_action *action, int idx, u32 off, switch (off) { case offsetof(struct iphdr, daddr): - set_ip_addr->ipv4_dst_mask = mask; - set_ip_addr->ipv4_dst = exact; + set_ip_addr->ipv4_dst_mask |= mask; + set_ip_addr->ipv4_dst &= ~mask; + set_ip_addr->ipv4_dst |= exact & mask; break; case offsetof(struct iphdr, saddr): - set_ip_addr->ipv4_src_mask = mask; - set_ip_addr->ipv4_src = exact; + set_ip_addr->ipv4_src_mask |= mask; + set_ip_addr->ipv4_src &= ~mask; + set_ip_addr->ipv4_src |= exact & mask; break; default: return -EOPNOTSUPP; @@ -451,8 +453,9 @@ static void nfp_fl_set_ip6_helper(int opcode_tag, int idx, __be32 exact, __be32 mask, struct nfp_fl_set_ipv6_addr *ip6) { - ip6->ipv6[idx % 4].mask = mask; - ip6->ipv6[idx % 4].exact = exact; + ip6->ipv6[idx % 4].mask |= mask; + ip6->ipv6[idx % 4].exact &= ~mask; + ip6->ipv6[idx % 4].exact |= exact & mask; ip6->reserved = cpu_to_be16(0); ip6->head.jump_id = opcode_tag; -- 2.17.1
[PATCH net 2/3] nfp: flower: fix multiple keys per pedit action
From: Pieter Jansen van Vuuren Previously we only allowed a single header key per pedit action to change the header. This used to result in the last header key in the pedit action overwriting previous headers. We now keep track of them and allow multiple header keys per pedit action. Fixes: c0b1bd9a8b8a ("nfp: add set ipv4 header action flower offload") Fixes: 354b82bb320e ("nfp: add set ipv6 source and destination address") Fixes: f8b7b0a6b113 ("nfp: add set tcp and udp header action flower offload") Signed-off-by: Pieter Jansen van Vuuren Reviewed-by: Jakub Kicinski --- .../net/ethernet/netronome/nfp/flower/action.c | 16 1 file changed, 12 insertions(+), 4 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/flower/action.c b/drivers/net/ethernet/netronome/nfp/flower/action.c index 91de7a9b0190..c39d7fdf73e6 100644 --- a/drivers/net/ethernet/netronome/nfp/flower/action.c +++ b/drivers/net/ethernet/netronome/nfp/flower/action.c @@ -544,7 +544,7 @@ nfp_fl_pedit(const struct tc_action *action, struct tc_cls_flower_offload *flow, struct nfp_fl_set_eth set_eth; enum pedit_header_type htype; int idx, nkeys, err; - size_t act_size; + size_t act_size = 0; u32 offset, cmd; u8 ip_proto = 0; @@ -602,7 +602,9 @@ nfp_fl_pedit(const struct tc_action *action, struct tc_cls_flower_offload *flow, act_size = sizeof(set_eth); memcpy(nfp_action, &set_eth, act_size); *a_len += act_size; - } else if (set_ip_addr.head.len_lw) { + } + if (set_ip_addr.head.len_lw) { + nfp_action += act_size; act_size = sizeof(set_ip_addr); memcpy(nfp_action, &set_ip_addr, act_size); *a_len += act_size; @@ -610,10 +612,12 @@ nfp_fl_pedit(const struct tc_action *action, struct tc_cls_flower_offload *flow, /* Hardware will automatically fix IPv4 and TCP/UDP checksum.
*/ *csum_updated |= TCA_CSUM_UPDATE_FLAG_IPV4HDR | nfp_fl_csum_l4_to_flag(ip_proto); - } else if (set_ip6_dst.head.len_lw && set_ip6_src.head.len_lw) { + } + if (set_ip6_dst.head.len_lw && set_ip6_src.head.len_lw) { /* TC compiles set src and dst IPv6 address as a single action, * the hardware requires this to be 2 separate actions. */ + nfp_action += act_size; act_size = sizeof(set_ip6_src); memcpy(nfp_action, &set_ip6_src, act_size); *a_len += act_size; @@ -626,6 +630,7 @@ nfp_fl_pedit(const struct tc_action *action, struct tc_cls_flower_offload *flow, /* Hardware will automatically fix TCP/UDP checksum. */ *csum_updated |= nfp_fl_csum_l4_to_flag(ip_proto); } else if (set_ip6_dst.head.len_lw) { + nfp_action += act_size; act_size = sizeof(set_ip6_dst); memcpy(nfp_action, &set_ip6_dst, act_size); *a_len += act_size; @@ -633,13 +638,16 @@ nfp_fl_pedit(const struct tc_action *action, struct tc_cls_flower_offload *flow, /* Hardware will automatically fix TCP/UDP checksum. */ *csum_updated |= nfp_fl_csum_l4_to_flag(ip_proto); } else if (set_ip6_src.head.len_lw) { + nfp_action += act_size; act_size = sizeof(set_ip6_src); memcpy(nfp_action, &set_ip6_src, act_size); *a_len += act_size; /* Hardware will automatically fix TCP/UDP checksum. */ *csum_updated |= nfp_fl_csum_l4_to_flag(ip_proto); - } else if (set_tport.head.len_lw) { + } + if (set_tport.head.len_lw) { + nfp_action += act_size; act_size = sizeof(set_tport); memcpy(nfp_action, &set_tport, act_size); *a_len += act_size; -- 2.17.1
Re: [PATCH bpf-next v2] tools: bpftool: add map create command
On Mon, Oct 15, 2018 at 04:30:36PM -0700, Jakub Kicinski wrote: > Add a way of creating maps from user space. The command takes > as parameters most of the attributes of the map creation system > call command. After map is created its pinned to bpffs. This makes > it possible to easily and dynamically (without rebuilding programs) > test various corner cases related to map creation. > > Map type names are taken from bpftool's array used for printing. > In general these days we try to make use of libbpf type names, but > there are no map type names in libbpf as of today. > > As with most features I add the motivation is testing (offloads) :) > > Signed-off-by: Jakub Kicinski > Reviewed-by: Quentin Monnet Applied, Thanks
Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
On Fri, Sep 21, 2018 at 3:15 PM Alexei Starovoitov wrote: > > On Fri, Sep 21, 2018 at 09:25:00AM -0300, Arnaldo Carvalho de Melo wrote: > > > > > I have considered adding MUNMAP to match existing MMAP, but went > > > without it because I didn't want to introduce new bit in perf_event_attr > > > and emit these new events in a misbalanced conditional way for prog > > > load/unload. > > > Like old perf is asking kernel for mmap events via mmap bit, so prog load > > > events > > > > By prog load events you mean that old perf, having perf_event_attr.mmap = 1 > > || > > perf_event_attr.mmap2 = 1 will cause the new kernel to emit > > PERF_RECORD_MMAP records for the range of addresses that a BPF program > > is being loaded on, right? > > right. it would be weird when prog load events are there, but not unload. > > > > will be in perf.data, but old perf report won't recognize them anyway. > > > > Why not? It should lookup the symbol and find it in the rb_tree of maps, > > with a DSO name equal to what was in the PERF_RECORD_MMAP emitted by the > > BPF core, no? It'll be an unresolved symbol, but a resolved map. > > > > > Whereas new perf would certainly want to catch bpf events and will set > > > both mmap and mumap bits. > > > > new perf with your code will find a symbol, not a map, because your code > > catches a special case PERF_RECORD_MMAP and instead of creating a > > 'struct map' will create a 'struct symbol' and insert it in the kallsyms > > 'struct map', right? > > right. > bpf progs are more similar to kernel functions than to modules. > For modules it makes sense to create a new map and insert symbols into it. > For bpf JITed images there is no DSO to parse. > Single bpf elf file may contain multiple bpf progs and each prog may contain > multiple bpf functions. They will be loaded at different time and > will have different life time. 
> > > In theory the old perf should catch the PERF_RECORD_MMAP with a string > > in the filename part and insert a new map into the kernel mmap rb_tree, > > and then samples would be resolved to this map, but since there is no > > backing DSO with a symtab, it would stop at that, just stating that the > > map is called NAME-OF-BPF-PROGRAM. This is all from memory, possibly > > there is something in there that makes it ignore this PERF_RECORD_MMAP > > emitted by the BPF kernel code when loading a new program. > > In /proc/kcore there is already a section for module range. > Hence when perf processes bpf load/unload events the map is already created. > Therefore the patch 3 only searches for it and inserts new symbol into it. > > In that sense the reuse of RECORD_MMAP event for bpf progs is indeed > not exactly clean, since no new map is created. > It's probably better to introduce PERF_RECORD_[INSERT|ERASE]_KSYM events ? > > Such event potentially can be used for offline ksym resolution. > perf could process /proc/kallsyms during perf record and emit all of them > as synthetic PERF_RECORD_INSERT_KSYM into perf.data, so perf report can run > on a different server and still find the right symbols. > > I guess, we can do bpf specific events too and keep RECORD_MMAP as-is. > How about single PERF_RECORD_BPF event with internal flag for load/unload ? > > > Right, that is another unfortunate state of affairs, kernel module > > load/unload should already be supported, reported by the kernel via a > > proper PERF_RECORD_MODULE_LOAD/UNLOAD > > I agree with Peter here. It would nice, but low priority. > modules are mostly static. Loaded once and stay there. 
> > > There is another longstanding TODO list entry: PERF_RECORD_MMAP records > > should include a build-id, to avoid either userspace getting confused > > when there is an update of some mmap DSO, for long running sessions, for > > instance, or to have to scan the just recorded perf.data file for DSOs > > with samples to then read it from the file system (more races). > > > > Have you ever considered having a build-id for bpf objects that could be > > used here? > > build-id concept is not applicable to bpf. > bpf elf files on the disc don't have good correlation with what is > running in the kernel. bpf bytestream is converted and optimized > by the verifier. Then JITed. > So debug info left in .o file and original bpf bytestream in .o are > mostly useless. > For bpf programs we have 'program tag'. It is computed over original > bpf bytestream, so both kernel and user space can compute it. > In libbcc we use /var/tmp/bcc/bpf_prog_TAG/ directory to store original > source code of the program, so users looking at kernel stack traces > with bpf_prog_TAG can find the source. > It's similar to build-id, but not going to help perf to annotate > actual x86 instructions inside JITed image and show src code. > Since JIT runs in the kernel this problem cannot be solved by user space only. > It's a difficult problem and we have a plan to tackle that, > but it's step 2. A bunch of infra is needed on bpf side to > preserve the
[PATCH bpf-next v2] tools: bpftool: add map create command
Add a way of creating maps from user space. The command takes as parameters most of the attributes of the map creation system call command. After map is created its pinned to bpffs. This makes it possible to easily and dynamically (without rebuilding programs) test various corner cases related to map creation. Map type names are taken from bpftool's array used for printing. In general these days we try to make use of libbpf type names, but there are no map type names in libbpf as of today. As with most features I add the motivation is testing (offloads) :) Signed-off-by: Jakub Kicinski Reviewed-by: Quentin Monnet --- .../bpf/bpftool/Documentation/bpftool-map.rst | 15 ++- tools/bpf/bpftool/Documentation/bpftool.rst | 4 +- tools/bpf/bpftool/bash-completion/bpftool | 38 +- tools/bpf/bpftool/common.c| 21 tools/bpf/bpftool/main.h | 1 + tools/bpf/bpftool/map.c | 110 +- 6 files changed, 183 insertions(+), 6 deletions(-) diff --git a/tools/bpf/bpftool/Documentation/bpftool-map.rst b/tools/bpf/bpftool/Documentation/bpftool-map.rst index a6258bc8ec4f..3497f2d80328 100644 --- a/tools/bpf/bpftool/Documentation/bpftool-map.rst +++ b/tools/bpf/bpftool/Documentation/bpftool-map.rst @@ -15,13 +15,15 @@ SYNOPSIS *OPTIONS* := { { **-j** | **--json** } [{ **-p** | **--pretty** }] | { **-f** | **--bpffs** } } *COMMANDS* := - { **show** | **list** | **dump** | **update** | **lookup** | **getnext** | **delete** - | **pin** | **help** } + { **show** | **list** | **create** | **dump** | **update** | **lookup** | **getnext** + | **delete** | **pin** | **help** } MAP COMMANDS = | **bpftool** **map { show | list }** [*MAP*] +| **bpftool** **map create** *FILE* **type** *TYPE* **key** *KEY_SIZE* **value** *VALUE_SIZE* \ +| **entries** *MAX_ENTRIES* **name** *NAME* [**flags** *FLAGS*] [**dev** *NAME*] | **bpftool** **map dump** *MAP* | **bpftool** **map update** *MAP* **key** *DATA* **value** *VALUE* [*UPDATE_FLAGS*] | **bpftool** **map lookup** *MAP* **key** *DATA* @@ -36,6 +38,11 @@ MAP 
COMMANDS | *PROG* := { **id** *PROG_ID* | **pinned** *FILE* | **tag** *PROG_TAG* } | *VALUE* := { *DATA* | *MAP* | *PROG* } | *UPDATE_FLAGS* := { **any** | **exist** | **noexist** } +| *TYPE* := { **hash** | **array** | **prog_array** | **perf_event_array** | **percpu_hash** +| | **percpu_array** | **stack_trace** | **cgroup_array** | **lru_hash** +| | **lru_percpu_hash** | **lpm_trie** | **array_of_maps** | **hash_of_maps** +| | **devmap** | **sockmap** | **cpumap** | **xskmap** | **sockhash** +| | **cgroup_storage** | **reuseport_sockarray** | **percpu_cgroup_storage** } DESCRIPTION === @@ -47,6 +54,10 @@ DESCRIPTION Output will start with map ID followed by map type and zero or more named attributes (depending on kernel version). + **bpftool map create** *FILE* **type** *TYPE* **key** *KEY_SIZE* **value** *VALUE_SIZE* **entries** *MAX_ENTRIES* **name** *NAME* [**flags** *FLAGS*] [**dev** *NAME*] + Create a new map with given parameters and pin it to *bpffs* + as *FILE*. + **bpftool map dump***MAP* Dump all entries in a given *MAP*. 
diff --git a/tools/bpf/bpftool/Documentation/bpftool.rst b/tools/bpf/bpftool/Documentation/bpftool.rst index 65488317fefa..04cd4f92ab89 100644 --- a/tools/bpf/bpftool/Documentation/bpftool.rst +++ b/tools/bpf/bpftool/Documentation/bpftool.rst @@ -22,8 +22,8 @@ SYNOPSIS | { **-j** | **--json** } [{ **-p** | **--pretty** }] } *MAP-COMMANDS* := - { **show** | **list** | **dump** | **update** | **lookup** | **getnext** | **delete** - | **pin** | **event_pipe** | **help** } + { **show** | **list** | **create** | **dump** | **update** | **lookup** | **getnext** + | **delete** | **pin** | **event_pipe** | **help** } *PROG-COMMANDS* := { **show** | **list** | **dump jited** | **dump xlated** | **pin** | **load** | **attach** | **detach** | **help** } diff --git a/tools/bpf/bpftool/bash-completion/bpftool b/tools/bpf/bpftool/bash-completion/bpftool index ac85207cba8d..c56545e87b0d 100644 --- a/tools/bpf/bpftool/bash-completion/bpftool +++ b/tools/bpf/bpftool/bash-completion/bpftool @@ -387,6 +387,42 @@ _bpftool() ;; esac ;; +create) +case $prev in +$command) +_filedir +return 0 +;; +type) +COMPREPLY=( $( compgen -W 'hash array prog_array \ +
Re: [bpf-next PATCH v3 0/2] bpftool support for sockmap use cases
On Mon, Oct 15, 2018 at 11:19:44AM -0700, John Fastabend wrote: > The first patch adds support for attaching programs to maps. This is > needed to support sock{map|hash} use from bpftool. Currently, I carry > around custom code to do this so doing it using standard bpftool will > be great. > > The second patch adds a compat mode to ignore non-zero entries in > the map def. This allows using bpftool with maps that have a extra > fields that the user knows can be ignored. This is needed to work > correctly with maps being loaded by other tools or directly via > syscalls. > > v3: add bash completion and doc updates for --mapcompat Applied, Thanks
Re: [PATCH bpf-next 05/13] bpf: get better bpf_prog ksyms based on btf func type_id
On Fri, Oct 12, 2018 at 11:54:42AM -0700, Yonghong Song wrote: > This patch added interface to load a program with the following > additional information: >. prog_btf_fd >. func_info and func_info_len > where func_info will provides function range and type_id > corresponding to each function. > > If verifier agrees with function range provided by the user, > the bpf_prog ksym for each function will use the func name > provided in the type_id, which is supposed to provide better > encoding as it is not limited by 16 bytes program name > limitation and this is better for bpf program which contains > multiple subprograms. > > The bpf_prog_info interface is also extended to > return btf_id and jited_func_types, so user spaces can > print out the function prototype for each jited function. Some nits. > > Signed-off-by: Yonghong Song > --- > include/linux/bpf.h | 2 + > include/linux/bpf_verifier.h | 1 + > include/linux/btf.h | 2 + > include/uapi/linux/bpf.h | 11 + > kernel/bpf/btf.c | 16 +++ > kernel/bpf/core.c| 9 > kernel/bpf/syscall.c | 86 +++- > kernel/bpf/verifier.c| 50 + > 8 files changed, 176 insertions(+), 1 deletion(-) > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h > index 9b558713447f..e9c63ffa01af 100644 > --- a/include/linux/bpf.h > +++ b/include/linux/bpf.h > @@ -308,6 +308,8 @@ struct bpf_prog_aux { > void *security; > #endif > struct bpf_prog_offload *offload; > + struct btf *btf; > + u32 type_id; /* type id for this prog/func */ > union { > struct work_struct work; > struct rcu_head rcu; > diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h > index 9e8056ec20fa..e84782ec50ac 100644 > --- a/include/linux/bpf_verifier.h > +++ b/include/linux/bpf_verifier.h > @@ -201,6 +201,7 @@ static inline bool bpf_verifier_log_needed(const struct > bpf_verifier_log *log) > struct bpf_subprog_info { > u32 start; /* insn idx of function entry point */ > u16 stack_depth; /* max. 
stack depth used by this function */ > + u32 type_id; /* btf type_id for this subprog */ > }; > > /* single container for all structs > diff --git a/include/linux/btf.h b/include/linux/btf.h > index e076c4697049..90e91b52aa90 100644 > --- a/include/linux/btf.h > +++ b/include/linux/btf.h > @@ -46,5 +46,7 @@ void btf_type_seq_show(const struct btf *btf, u32 type_id, > void *obj, > struct seq_file *m); > int btf_get_fd_by_id(u32 id); > u32 btf_id(const struct btf *btf); > +bool is_btf_func_type(const struct btf *btf, u32 type_id); > +const char *btf_get_name_by_id(const struct btf *btf, u32 type_id); > > #endif > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > index f9187b41dff6..7ebbf4f06a65 100644 > --- a/include/uapi/linux/bpf.h > +++ b/include/uapi/linux/bpf.h > @@ -332,6 +332,9 @@ union bpf_attr { >* (context accesses, allowed helpers, etc). >*/ > __u32 expected_attach_type; > + __u32 prog_btf_fd;/* fd pointing to BTF type data > */ > + __u32 func_info_len; /* func_info length */ > + __aligned_u64 func_info; /* func type info */ > }; > > struct { /* anonymous struct used by BPF_OBJ_* commands */ > @@ -2585,6 +2588,9 @@ struct bpf_prog_info { > __u32 nr_jited_func_lens; > __aligned_u64 jited_ksyms; > __aligned_u64 jited_func_lens; > + __u32 btf_id; > + __u32 nr_jited_func_types; > + __aligned_u64 jited_func_types; > } __attribute__((aligned(8))); > > struct bpf_map_info { > @@ -2896,4 +2902,9 @@ struct bpf_flow_keys { > }; > }; > > +struct bpf_func_info { > + __u32 insn_offset; > + __u32 type_id; > +}; > + > #endif /* _UAPI__LINUX_BPF_H__ */ > diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c > index 794a185f11bf..85b8eeccddbd 100644 > --- a/kernel/bpf/btf.c > +++ b/kernel/bpf/btf.c > @@ -486,6 +486,15 @@ static const struct btf_type *btf_type_by_id(const > struct btf *btf, u32 type_id) > return btf->types[type_id]; > } > > +bool is_btf_func_type(const struct btf *btf, u32 type_id) > +{ > + const struct btf_type *type = btf_type_by_id(btf, 
type_id); > + > + if (!type || BTF_INFO_KIND(type->info) != BTF_KIND_FUNC) > + return false; > + return true; > +} Can btf_type_is_func() (from patch 2) be reused? The btf_type_by_id() can be done by the caller. I don't think it worths to add a similar helper for just one user for now. The !type check can be added to btf_type_is_func() if it is needed. > + > /* > * Regular int is not a bit field and it must be either > * u8/u16/u32/u64. > @@ -2579,3 +2588,10 @@ u32 btf_id(const struct btf *btf) > { > return btf->id; > } > + > +const char
Re: [PATCH bpf-next 0/2] IPv6 sk-lookup fixes
On Mon, Oct 15, 2018 at 10:27:44AM -0700, Joe Stringer wrote: > This series includes a couple of fixups for the IPv6 socket lookup > helper, to make the API more consistent (always supply all arguments in > network byte-order) and to allow its use when IPv6 is compiled as a > module. Applied, Thanks
Re: [PATCH bpf-next] tools: bpftool: add map create command
On Mon, 15 Oct 2018 12:58:07 -0700, Alexei Starovoitov wrote: > > > > fprintf(stderr, > > > > "Usage: %s %s { show | list } [MAP]\n" > > > > + " %s %s create FILE type TYPE key KEY_SIZE > > > > value VALUE_SIZE \\\n" > > > > + " entries MAX_ENTRIES > > > > [name NAME] [flags FLAGS] \\\n" > > > > + " [dev NAME]\n" > > > > > > I suspect as soon as bpftool has an ability to create standalone maps > > > some folks will start relying on such interface. > > > > That'd be cool, do you see any real life use cases where its useful > > outside of corner case testing? > > In our XDP use case we have an odd protocol for different apps to share > common prog_array that is pinned in bpffs. > If cmdline creation of it via bpftool was available that would have been > an option to consider. Not saying that it would have been a better option. > Just another option. I see, I didn't think of prog arrays. > > > Therefore I'd like to request to make 'name' argument to be mandatory. > > > > Will do in v2! > > thx! > > > > I think in the future we will require BTF to be mandatory too. > > > We need to move towards more transparent and debuggable infra. > > > Do you think requiring json description of key/value would be managable > > > to implement? > > > Then bpftool could convert it to BTF and the map full be fully defined. > > > I certainly understand that bpf prog can disregard the key/value layout > > > today, > > > but we will make verifier to enforce that in the future too. > > > > I was hoping that we can leave BTF support as a future extension, and > > then once we have the option for the verifier to enforce BTF (a sysctl?) > > the bpftool map create without a BTF will get rejected as one would > > expect. > > right. something like sysctl in the future. > > > IOW it's fine not to make BTF required at bpftool level and > > leave it to system configuration. > > > > I'd love to implement the BTF support right away, but I'm not sure I > > can afford that right now time-wise. 
The whole map create command is > > pretty trivial, but for BTF we don't even have a way of dumping it > > AFAICT. We can pretty print values, but what is the format in which to > > express the BTF itself? We could do JSON, do we use an external > > library? Should we have a separate BTF command for that? > > I prefer standard C type description for both input and output :) > Anyway that wasn't a request for you to do it now. More of the feature > request for somebody to put on todo list :) Oh, okay :) I will wait for John's patches to get merged and post v2, otherwise we'd conflict on the man page.
Re: [PATCH bpf-next 02/13] bpf: btf: Add BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO
On 10/12/2018 08:54 PM, Yonghong Song wrote: [...] > +static bool btf_name_valid_identifier(const struct btf *btf, u32 offset) > +{ > + /* offset must be valid */ > + const char *src = &btf->strings[offset]; > + > + if (!isalpha(*src) && *src != '_') > + return false; > + > + src++; > + while (*src) { > + if (!isalnum(*src) && *src != '_') > + return false; > + src++; > + } > + > + return true; > +} Should there be an upper name length limit like KSYM_NAME_LEN? (Is it implied by the kvmalloc() limit?) > static const char *btf_name_by_offset(const struct btf *btf, u32 offset) > { > if (!offset) > @@ -747,7 +782,9 @@ static bool env_type_is_resolve_sink(const struct > btf_verifier_env *env, > /* int, enum or void is a sink */ > return !btf_type_needs_resolve(next_type); > case RESOLVE_PTR: > - /* int, enum, void, struct or array is a sink for ptr */ > + /* int, enum, void, struct, array or func_proto is a sink > + * for ptr > + */ > return !btf_type_is_modifier(next_type) && > !btf_type_is_ptr(next_type); > case RESOLVE_STRUCT_OR_ARRAY:
Re: Fw: [Bug 201423] New: eth0: hw csum failure
On 15 October 2018 17:41:47 CEST, Eric Dumazet wrote: >On Mon, Oct 15, 2018 at 8:15 AM Stephen Hemminger > wrote: >> >> >> >> Begin forwarded message: >> >> Date: Sun, 14 Oct 2018 10:42:48 + >> From: bugzilla-dae...@bugzilla.kernel.org >> To: step...@networkplumber.org >> Subject: [Bug 201423] New: eth0: hw csum failure >> >> >> https://bugzilla.kernel.org/show_bug.cgi?id=201423 >> >> Bug ID: 201423 >>Summary: eth0: hw csum failure >>Product: Networking >>Version: 2.5 >> Kernel Version: 4.19.0-rc7 >> Hardware: Intel >> OS: Linux >> Tree: Mainline >> Status: NEW >> Severity: normal >> Priority: P1 >> Component: Other >> Assignee: step...@networkplumber.org >> Reporter: ross...@inwind.it >> Regression: No >> >> I have a P6T DELUXE V2 motherboard and using the sky2 driver for the >ethernet >> ports. I get the following error message: >> >> [ 433.727397] eth0: hw csum failure >> [ 433.727406] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.19.0-rc7 >#19 >> [ 433.727406] Hardware name: System manufacturer System Product >Name/P6T >> DELUXE V2, BIOS 120212/22/2010 >> [ 433.727407] Call Trace: >> [ 433.727409] >> [ 433.727415] dump_stack+0x46/0x5b >> [ 433.727419] __skb_checksum_complete+0xb0/0xc0 >> [ 433.727423] tcp_v4_rcv+0x528/0xb60 >> [ 433.727426] ? ipt_do_table+0x2d0/0x400 >> [ 433.727429] ip_local_deliver_finish+0x5a/0x110 >> [ 433.727430] ip_local_deliver+0xe1/0xf0 >> [ 433.727431] ? ip_sublist_rcv_finish+0x60/0x60 >> [ 433.727432] ip_rcv+0xca/0xe0 >> [ 433.727434] ? ip_rcv_finish_core.isra.0+0x300/0x300 >> [ 433.727436] __netif_receive_skb_one_core+0x4b/0x70 >> [ 433.727438] netif_receive_skb_internal+0x4e/0x130 >> [ 433.727439] napi_gro_receive+0x6a/0x80 >> [ 433.727442] sky2_poll+0x707/0xd20 >> [ 433.727446] ? 
rcu_check_callbacks+0x1b4/0x900 >> [ 433.727447] net_rx_action+0x237/0x380 >> [ 433.727449] __do_softirq+0xdc/0x1e0 >> [ 433.727452] irq_exit+0xa9/0xb0 >> [ 433.727453] do_IRQ+0x45/0xc0 >> [ 433.727455] common_interrupt+0xf/0xf >> [ 433.727456] >> [ 433.727459] RIP: 0010:cpuidle_enter_state+0x124/0x200 >> [ 433.727461] Code: 53 60 89 c3 e8 dd 90 ad ff 65 8b 3d 96 58 a7 7e >e8 d1 8f >> ad ff 31 ff 49 89 c4 e8 27 99 ad ff fb 48 ba cf f7 53 e3 a5 9b c4 20 ><4c> 89 e1 >> 4c 29 e9 48 89 c8 48 c1 f9 3f 48 f7 ea b8 ff ff ff 7f 48 >> [ 433.727462] RSP: :c90a3e98 EFLAGS: 0282 ORIG_RAX: >> ffde >> [ 433.727463] RAX: 880237b1f280 RBX: 0004 RCX: >> 001f >> [ 433.727464] RDX: 20c49ba5e353f7cf RSI: 2fe419c1 RDI: >> >> [ 433.727465] RBP: 880237b263a0 R08: 0714 R09: >> 00650512105d >> [ 433.727465] R10: R11: 0342 R12: >> 0064fc2a8b1c >> [ 433.727466] R13: 0064fc25b35f R14: 0004 R15: >> 8204af20 >> [ 433.727468] ? cpuidle_enter_state+0x119/0x200 >> [ 433.727471] do_idle+0x1bf/0x200 >> [ 433.727473] cpu_startup_entry+0x6a/0x70 >> [ 433.727475] start_secondary+0x17f/0x1c0 >> [ 433.727476] secondary_startup_64+0xa4/0xb0 >> [ 441.662954] eth0: hw csum failure >> [ 441.662959] CPU: 4 PID: 4347 Comm: radeon_cs:0 Not tainted >4.19.0-rc7 #19 >> [ 441.662960] Hardware name: System manufacturer System Product >Name/P6T >> DELUXE V2, BIOS 120212/22/2010 >> [ 441.662960] Call Trace: >> [ 441.662963] >> [ 441.662968] dump_stack+0x46/0x5b >> [ 441.662972] __skb_checksum_complete+0xb0/0xc0 >> [ 441.662975] tcp_v4_rcv+0x528/0xb60 >> [ 441.662979] ? ipt_do_table+0x2d0/0x400 >> [ 441.662981] ip_local_deliver_finish+0x5a/0x110 >> [ 441.662983] ip_local_deliver+0xe1/0xf0 >> [ 441.662985] ? ip_sublist_rcv_finish+0x60/0x60 >> [ 441.662986] ip_rcv+0xca/0xe0 >> [ 441.662988] ? 
ip_rcv_finish_core.isra.0+0x300/0x300 >> [ 441.662990] __netif_receive_skb_one_core+0x4b/0x70 >> [ 441.662993] netif_receive_skb_internal+0x4e/0x130 >> [ 441.662994] napi_gro_receive+0x6a/0x80 >> [ 441.662998] sky2_poll+0x707/0xd20 >> [ 441.663000] net_rx_action+0x237/0x380 >> [ 441.663002] __do_softirq+0xdc/0x1e0 >> [ 441.663005] irq_exit+0xa9/0xb0 >> [ 441.663007] do_IRQ+0x45/0xc0 >> [ 441.663009] common_interrupt+0xf/0xf >> [ 441.663010] >> [ 441.663012] RIP: 0010:merge+0x22/0xb0 >> [ 441.663014] Code: c3 31 c0 c3 90 90 90 90 41 56 41 55 41 54 55 48 >89 d5 53 >> 48 89 cb 48 83 ec 18 65 48 8b 04 25 28 00 00 00 48 89 44 24 10 31 c0 ><48> 85 c9 >> 74 70 48 85 d2 74 6b 49 89 fd 49 89 f6 49 89 e4 eb 14 48 >> [ 441.663015] RSP: 0018:c990b988 EFLAGS: 0246 ORIG_RAX: >> ffde >> [ 441.663017] RAX: RBX: 88021ab2d408 RCX: >> 88021ab2d408 >> [ 441.663018] RDX: 88021ab2d388 RSI: a021c440 RDI: >> >> [ 441.663019] RBP: 88021ab2d388 R08: 5ecf
Re: [PATCH bpf-next 02/13] bpf: btf: Add BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO
On 10/12/2018 08:54 PM, Yonghong Song wrote: > This patch adds BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO > support to the type section. BTF_KIND_FUNC_PROTO is used > to specify the type of a function pointer. With this, > BTF has a complete set of C types (except float). > > BTF_KIND_FUNC is used to specify the signature of a > defined subprogram. BTF_KIND_FUNC_PROTO can be referenced > by another type, e.g., a pointer type, and BTF_KIND_FUNC > type cannot be referenced by another type. > > For both BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO types, > the func return type is in t->type (where t is a > "struct btf_type" object). The func args are an array of > u32s immediately following object "t". > > As a concrete example, for the C program below, > $ cat test.c > int foo(int (*bar)(int)) { return bar(5); } > with latest llvm trunk built with Debug mode, we have > $ clang -target bpf -g -O2 -mllvm -debug-only=btf -c test.c > Type Table: > [1] FUNC name_off=1 info=0x0c01 size/type=2 > param_type=3 > [2] INT name_off=11 info=0x0100 size/type=4 > desc=0x0120 > [3] PTR name_off=0 info=0x0200 size/type=4 > [4] FUNC_PROTO name_off=0 info=0x0d01 size/type=2 > param_type=2 > > String Table: > 0 : > 1 : foo > 5 : .text > 11 : int > 15 : test.c > 22 : int foo(int (*bar)(int)) { return bar(5); } > > FuncInfo Table: > sec_name_off=5 > insn_offset= type_id=1 > > ... > > (Eventually we shall have bpftool to dump btf information > like the above.) > > Function "foo" has a FUNC type (type_id = 1). > The parameter of "foo" has type_id 3 which is PTR->FUNC_PROTO, > where FUNC_PROTO refers to function pointer "bar". Should also "bar" be part of the string table (at least at some point in future)? Iow, if verifier hints to an issue in the program when it would for example walk pointers and rewrite ctx access, then it could dump the var name along with it. It might be useful as well in combination with 22 from str table, when annotating the source. 
We might need support for variadic functions, though. How is LLVM handling the latter with the recent BTF support? > In FuncInfo Table, for section .text, the function, > with to-be-determined offset (marked as ), > has type_id=1 which refers to a FUNC type. > This way, the function signature is > available to both kernel and user space. > Here, the insn offset is not available during the dump time > as relocation is resolved pretty late in the compilation process. > > Signed-off-by: Martin KaFai Lau > Signed-off-by: Yonghong Song
Re: [PATCH net] sctp: use the pmtu from the icmp packet to update transport pathmtu
On Mon, Oct 15, 2018 at 07:58:29PM +0800, Xin Long wrote: > Other than asoc pmtu sync from all transports, sctp_assoc_sync_pmtu > is also processing transport pmtu_pending by icmp packets. But it's > meaningless to use sctp_dst_mtu(t->dst) as new pmtu for a transport. > > The right pmtu value should come from the icmp packet, and it would > be saved into transport->mtu_info in this patch and used later when > the pmtu sync happens in sctp_sendmsg_to_asoc or sctp_packet_config. > > Besides, without this patch, as pmtu can only be updated correctly > when receiving a icmp packet and no place is holding sock lock, it > will take long time if the sock is busy with sending packets. > > Note that it doesn't process transport->mtu_info in .release_cb(), > as there is no enough information for pmtu update, like for which > asoc or transport. It is not worth traversing all asocs to check > pmtu_pending. So unlike tcp, sctp does this in tx path, for which > mtu_info needs to be atomic_t. > > Signed-off-by: Xin Long Acked-by: Marcelo Ricardo Leitner > --- > include/net/sctp/structs.h | 2 ++ > net/sctp/associola.c | 3 ++- > net/sctp/input.c | 1 + > net/sctp/output.c | 6 ++ > 4 files changed, 11 insertions(+), 1 deletion(-) > > diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h > index 28a7c8e..a11f937 100644 > --- a/include/net/sctp/structs.h > +++ b/include/net/sctp/structs.h > @@ -876,6 +876,8 @@ struct sctp_transport { > unsigned long sackdelay; > __u32 sackfreq; > > + atomic_t mtu_info; > + > /* When was the last time that we heard from this transport? We use >* this to pick new active and retran paths. >*/ > diff --git a/net/sctp/associola.c b/net/sctp/associola.c > index 297d9cf..a827a1f 100644 > --- a/net/sctp/associola.c > +++ b/net/sctp/associola.c > @@ -1450,7 +1450,8 @@ void sctp_assoc_sync_pmtu(struct sctp_association *asoc) > /* Get the lowest pmtu of all the transports. 
*/ > list_for_each_entry(t, &asoc->peer.transport_addr_list, transports) { > if (t->pmtu_pending && t->dst) { > - sctp_transport_update_pmtu(t, sctp_dst_mtu(t->dst)); > + sctp_transport_update_pmtu(t, > + atomic_read(&t->mtu_info)); > t->pmtu_pending = 0; > } > if (!pmtu || (t->pathmtu < pmtu)) > diff --git a/net/sctp/input.c b/net/sctp/input.c > index 9bbc5f9..5c36a99 100644 > --- a/net/sctp/input.c > +++ b/net/sctp/input.c > @@ -395,6 +395,7 @@ void sctp_icmp_frag_needed(struct sock *sk, struct > sctp_association *asoc, > return; > > if (sock_owned_by_user(sk)) { > + atomic_set(&t->mtu_info, pmtu); > asoc->pmtu_pending = 1; > t->pmtu_pending = 1; > return; > diff --git a/net/sctp/output.c b/net/sctp/output.c > index 7f849b0..67939ad 100644 > --- a/net/sctp/output.c > +++ b/net/sctp/output.c > @@ -120,6 +120,12 @@ void sctp_packet_config(struct sctp_packet *packet, > __u32 vtag, > sctp_assoc_sync_pmtu(asoc); > } > > + if (asoc->pmtu_pending) { > + if (asoc->param_flags & SPP_PMTUD_ENABLE) > + sctp_assoc_sync_pmtu(asoc); > + asoc->pmtu_pending = 0; > + } > + > /* If there a is a prepend chunk stick it on the list before >* any other chunks get appended. >*/ > -- > 2.1.0 >
Re: [PATCH iproute2] macsec: fix off-by-one when parsing attributes
2018-10-15, 09:36:58 -0700, Stephen Hemminger wrote: > On Fri, 12 Oct 2018 17:34:12 +0200 > Sabrina Dubroca wrote: > > > I seem to have had a massive brainfart with uses of > > parse_rtattr_nested(). The rtattr* array must have MAX+1 elements, and > > the call to parse_rtattr_nested must have MAX as its bound. Let's fix > > those. > > > > Fixes: b26fc590ce62 ("ip: add MACsec support") > > Signed-off-by: Sabrina Dubroca > > Applied, > How did it ever work?? I'm guessing it wrote over some other stack variables before their first use. It worked without issue until the JSON patch. Thanks, -- Sabrina
Re: [PATCH bpf-next 01/13] bpf: btf: Break up btf_type_is_void()
On 10/12/2018 08:54 PM, Yonghong Song wrote: > This patch breaks up btf_type_is_void() into > btf_type_is_void() and btf_type_is_fwd(). > > It also adds btf_type_nosize() to better describe it is > testing a type has nosize info. > > Signed-off-by: Martin KaFai Lau > --- Yonghong, your SoB is missing here. Thanks, Daniel
Re: [PATCH bpf-next 0/2] IPv6 sk-lookup fixes
On 10/15/2018 07:27 PM, Joe Stringer wrote: > This series includes a couple of fixups for the IPv6 socket lookup > helper, to make the API more consistent (always supply all arguments in > network byte-order) and to allow its use when IPv6 is compiled as a > module. > > Joe Stringer (2): > bpf: Allow sk_lookup with IPv6 module > bpf: Fix IPv6 dport byte-order in bpf_sk_lookup > > include/net/addrconf.h | 5 + > net/core/filter.c | 15 +-- > net/ipv6/af_inet6.c| 1 + > 3 files changed, 15 insertions(+), 6 deletions(-) > LGTM, thanks for following up on this. Series: Acked-by: Daniel Borkmann
Re: [PATCH net-next 11/18] vxlan: Add netif_is_vxlan()
On Mon, 15 Oct 2018 13:30:41 -0700 Jakub Kicinski wrote: > On Mon, 15 Oct 2018 23:27:41 +0300, Ido Schimmel wrote: > > On Mon, Oct 15, 2018 at 01:16:42PM -0700, Stephen Hemminger wrote: > > > On Mon, 15 Oct 2018 22:57:48 +0300 > > > Ido Schimmel wrote: > > > > > > > On Mon, Oct 15, 2018 at 11:57:56AM -0700, Jakub Kicinski wrote: > > > > > On Sat, 13 Oct 2018 17:18:38 +, Ido Schimmel wrote: > > > > > > Add the ability to determine whether a netdev is a VxLAN netdev by > > > > > > calling the above mentioned function that checks the netdev's > > > > > > private > > > > > > flags. > > > > > > > > > > > > This will allow modules to identify netdev events involving a VxLAN > > > > > > netdev and act accordingly. For example, drivers capable of VxLAN > > > > > > offload will need to configure the underlying device when a VxLAN > > > > > > netdev > > > > > > is being enslaved to an offloaded bridge. > > > > > > > > > > > > Signed-off-by: Ido Schimmel > > > > > > Reviewed-by: Petr Machata > > > > > > > > > > Is this preferable over > > > > > > > > > > !strcmp(netdev->rtnl_link_ops->kind, "vxlan") > > > > > > > > > > which is what TC offloads do? > > > > > > > > Using a flag seemed like the more standard way. > > > > > > > > That being said, we considered using net_device_ops instead, given we > > > > are about to run out of available private flags, so I don't mind > > > > adopting a technique already employed by another driver. > > > > > > > > P.S. Had to Cc netdev again. I think your client somehow messed the Cc > > > > list? I see Cc list in your reply, but with back slashes at the end of > > > > two email addresses. > > > > > > Agree that using a global resource bit in flags is probably overkill. > > > If you can use kind that would be good example for other drivers as well. > > > > > > > OK, will change. > > > > Jakub, any objections if I implement netif_is_vxlan() using 'kind' and > > convert nfp to use the helper? 
Having all these helpers in the same > > location will increase the chances of others reusing them. > > Sounds very good :) We could even do this for bridge, and other devices that are using private flags.
Re: [PATCH net-next 11/18] vxlan: Add netif_is_vxlan()
On Mon, 15 Oct 2018 23:27:41 +0300, Ido Schimmel wrote: > On Mon, Oct 15, 2018 at 01:16:42PM -0700, Stephen Hemminger wrote: > > On Mon, 15 Oct 2018 22:57:48 +0300 > > Ido Schimmel wrote: > > > > > On Mon, Oct 15, 2018 at 11:57:56AM -0700, Jakub Kicinski wrote: > > > > On Sat, 13 Oct 2018 17:18:38 +, Ido Schimmel wrote: > > > > > Add the ability to determine whether a netdev is a VxLAN netdev by > > > > > calling the above mentioned function that checks the netdev's private > > > > > flags. > > > > > > > > > > This will allow modules to identify netdev events involving a VxLAN > > > > > netdev and act accordingly. For example, drivers capable of VxLAN > > > > > offload will need to configure the underlying device when a VxLAN > > > > > netdev > > > > > is being enslaved to an offloaded bridge. > > > > > > > > > > Signed-off-by: Ido Schimmel > > > > > Reviewed-by: Petr Machata > > > > > > > > Is this preferable over > > > > > > > > !strcmp(netdev->rtnl_link_ops->kind, "vxlan") > > > > > > > > which is what TC offloads do? > > > > > > Using a flag seemed like the more standard way. > > > > > > That being said, we considered using net_device_ops instead, given we > > > are about to run out of available private flags, so I don't mind > > > adopting a technique already employed by another driver. > > > > > > P.S. Had to Cc netdev again. I think your client somehow messed the Cc > > > list? I see Cc list in your reply, but with back slashes at the end of > > > two email addresses. > > > > Agree that using a global resource bit in flags is probably overkill. > > If you can use kind that would be good example for other drivers as well. > > OK, will change. > > Jakub, any objections if I implement netif_is_vxlan() using 'kind' and > convert nfp to use the helper? Having all these helpers in the same > location will increase the chances of others reusing them. Sounds very good :)
Re: [PATCH net-next 11/18] vxlan: Add netif_is_vxlan()
On Mon, Oct 15, 2018 at 01:16:42PM -0700, Stephen Hemminger wrote: > On Mon, 15 Oct 2018 22:57:48 +0300 > Ido Schimmel wrote: > > > On Mon, Oct 15, 2018 at 11:57:56AM -0700, Jakub Kicinski wrote: > > > On Sat, 13 Oct 2018 17:18:38 +, Ido Schimmel wrote: > > > > Add the ability to determine whether a netdev is a VxLAN netdev by > > > > calling the above mentioned function that checks the netdev's private > > > > flags. > > > > > > > > This will allow modules to identify netdev events involving a VxLAN > > > > netdev and act accordingly. For example, drivers capable of VxLAN > > > > offload will need to configure the underlying device when a VxLAN netdev > > > > is being enslaved to an offloaded bridge. > > > > > > > > Signed-off-by: Ido Schimmel > > > > Reviewed-by: Petr Machata > > > > > > Is this preferable over > > > > > > !strcmp(netdev->rtnl_link_ops->kind, "vxlan") > > > > > > which is what TC offloads do? > > > > Using a flag seemed like the more standard way. > > > > That being said, we considered using net_device_ops instead, given we > > are about to run out of available private flags, so I don't mind > > adopting a technique already employed by another driver. > > > > P.S. Had to Cc netdev again. I think your client somehow messed the Cc > > list? I see Cc list in your reply, but with back slashes at the end of > > two email addresses. > > Agree that using a global resource bit in flags is probably overkill. > If you can use kind that would be good example for other drivers as well. OK, will change. Jakub, any objections if I implement netif_is_vxlan() using 'kind' and convert nfp to use the helper? Having all these helpers in the same location will increase the chances of others reusing them.
[iproute PATCH] ip-addrlabel: Fix printing of label value
Passing the return value of RTA_DATA() to rta_getattr_u32() is wrong since that function will call RTA_DATA() by itself already. Fixes: a7ad1c8a6845d ("ipaddrlabel: add json support") Signed-off-by: Phil Sutter --- ip/ipaddrlabel.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ip/ipaddrlabel.c b/ip/ipaddrlabel.c index 2f79c56dcead2..8abe5722bafd1 100644 --- a/ip/ipaddrlabel.c +++ b/ip/ipaddrlabel.c @@ -95,7 +95,7 @@ int print_addrlabel(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg } if (tb[IFAL_LABEL] && RTA_PAYLOAD(tb[IFAL_LABEL]) == sizeof(uint32_t)) { - uint32_t label = rta_getattr_u32(RTA_DATA(tb[IFAL_LABEL])); + uint32_t label = rta_getattr_u32(tb[IFAL_LABEL]); print_uint(PRINT_ANY, "label", "label %u ", label); -- 2.19.0
Re: [PATCH net-next 11/18] vxlan: Add netif_is_vxlan()
On Mon, 15 Oct 2018 22:57:48 +0300 Ido Schimmel wrote: > On Mon, Oct 15, 2018 at 11:57:56AM -0700, Jakub Kicinski wrote: > > On Sat, 13 Oct 2018 17:18:38 +, Ido Schimmel wrote: > > > Add the ability to determine whether a netdev is a VxLAN netdev by > > > calling the above mentioned function that checks the netdev's private > > > flags. > > > > > > This will allow modules to identify netdev events involving a VxLAN > > > netdev and act accordingly. For example, drivers capable of VxLAN > > > offload will need to configure the underlying device when a VxLAN netdev > > > is being enslaved to an offloaded bridge. > > > > > > Signed-off-by: Ido Schimmel > > > Reviewed-by: Petr Machata > > > > Is this preferable over > > > > !strcmp(netdev->rtnl_link_ops->kind, "vxlan") > > > > which is what TC offloads do? > > Using a flag seemed like the more standard way. > > That being said, we considered using net_device_ops instead, given we > are about to run out of available private flags, so I don't mind > adopting a technique already employed by another driver. > > P.S. Had to Cc netdev again. I think your client somehow messed the Cc > list? I see Cc list in your reply, but with back slashes at the end of > two email addresses. Agree that using a global resource bit in flags is probably overkill. If you can use kind that would be good example for other drivers as well.
Re: [PATCH bpf-next] tools: bpftool: add map create command
On Mon, Oct 15, 2018 at 09:49:08AM -0700, Jakub Kicinski wrote: > On Fri, 12 Oct 2018 23:16:59 -0700, Alexei Starovoitov wrote: > > On Fri, Oct 12, 2018 at 11:06:14AM -0700, Jakub Kicinski wrote: > > > Add a way of creating maps from user space. The command takes > > > as parameters most of the attributes of the map creation system > > > call command. After map is created its pinned to bpffs. This makes > > > it possible to easily and dynamically (without rebuilding programs) > > > test various corner cases related to map creation. > > > > > > Map type names are taken from bpftool's array used for printing. > > > In general these days we try to make use of libbpf type names, but > > > there are no map type names in libbpf as of today. > > > > > > As with most features I add the motivation is testing (offloads) :) > > > > > > Signed-off-by: Jakub Kicinski > > > Reviewed-by: Quentin Monnet > > ... > > > fprintf(stderr, > > > "Usage: %s %s { show | list } [MAP]\n" > > > + " %s %s create FILE type TYPE key KEY_SIZE value > > > VALUE_SIZE \\\n" > > > + " entries MAX_ENTRIES [name NAME] > > > [flags FLAGS] \\\n" > > > + " [dev NAME]\n" > > > > I suspect as soon as bpftool has an ability to create standalone maps > > some folks will start relying on such interface. > > That'd be cool, do you see any real life use cases where its useful > outside of corner case testing? In our XDP use case we have an odd protocol for different apps to share common prog_array that is pinned in bpffs. If cmdline creation of it via bpftool was available that would have been an option to consider. Not saying that it would have been a better option. Just another option. > > > Therefore I'd like to request to make 'name' argument to be mandatory. > > Will do in v2! thx! > > I think in the future we will require BTF to be mandatory too. > > We need to move towards more transparent and debuggable infra. > > Do you think requiring json description of key/value would be managable to > > implement? 
> > Then bpftool could convert it to BTF and the map full be fully defined. > > I certainly understand that bpf prog can disregard the key/value layout > > today, > > but we will make verifier to enforce that in the future too. > > I was hoping that we can leave BTF support as a future extension, and > then once we have the option for the verifier to enforce BTF (a sysctl?) > the bpftool map create without a BTF will get rejected as one would > expect. right. something like sysctl in the future. > IOW it's fine not to make BTF required at bpftool level and > leave it to system configuration. > > I'd love to implement the BTF support right away, but I'm not sure I > can afford that right now time-wise. The whole map create command is > pretty trivial, but for BTF we don't even have a way of dumping it > AFAICT. We can pretty print values, but what is the format in which to > express the BTF itself? We could do JSON, do we use an external > library? Should we have a separate BTF command for that? I prefer standard C type description for both input and output :) Anyway that wasn't a request for you to do it now. More of the feature request for somebody to put on todo list :)
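For reference, an invocation of the proposed command might look like the following transcript (hypothetical pin path and map name; syntax taken from the usage string quoted above):

```shell
# create and pin a prog_array, then inspect it (paths illustrative)
bpftool map create /sys/fs/bpf/jmp_table type prog_array \
        key 4 value 4 entries 64 name jmp_table
bpftool map show pinned /sys/fs/bpf/jmp_table
```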
Re: [PATCH net-next 11/18] vxlan: Add netif_is_vxlan()
On Mon, Oct 15, 2018 at 11:57:56AM -0700, Jakub Kicinski wrote: > On Sat, 13 Oct 2018 17:18:38 +, Ido Schimmel wrote: > > Add the ability to determine whether a netdev is a VxLAN netdev by > > calling the above mentioned function that checks the netdev's private > > flags. > > > > This will allow modules to identify netdev events involving a VxLAN > > netdev and act accordingly. For example, drivers capable of VxLAN > > offload will need to configure the underlying device when a VxLAN netdev > > is being enslaved to an offloaded bridge. > > > > Signed-off-by: Ido Schimmel > > Reviewed-by: Petr Machata > > Is this preferable over > > !strcmp(netdev->rtnl_link_ops->kind, "vxlan") > > which is what TC offloads do? Using a flag seemed like the more standard way. That being said, we considered using net_device_ops instead, given we are about to run out of available private flags, so I don't mind adopting a technique already employed by another driver. P.S. Had to Cc netdev again. I think your client somehow messed the Cc list? I see Cc list in your reply, but with back slashes at the end of two email addresses.
Re: [PATCH bpf-next v2 0/8] sockmap integration for ktls
On Sat, Oct 13, 2018 at 02:45:55AM +0200, Daniel Borkmann wrote: > This work adds a generic sk_msg layer and converts both sockmap > and later ktls over to make use of it as a common data structure > for application data (similarly as sk_buff for network packets). > With that in place the sk_msg framework spans accross ULP layer > in the kernel and allows for introspection or filtering of L7 > data with the help of BPF programs operating on a common input > context. > > In a second step, we enable the latter for ktls which was previously > not possible, meaning, ktls and sk_msg verdict programs were > mutually exclusive in the ULP layer which created challenges for > the orchestrator when trying to apply TCP based policy, for > example. Leveraging the prior consolidation we can finally overcome > this limitation. > > Note, there's no change in behavior when ktls is not used in > combination with BPF, and also no change in behavior for stand > alone sockmap. The kselftest suites for ktls, sockmap and ktls > with sockmap combined also runs through successfully. For further > details please see individual patches. > > Thanks! > > v1 -> v2: > - Removed leftover comment spotted by Alexei > - Improved commit messages, rebase Applied, Thanks
[PATCH net-next] net: phy: merge phy_start_aneg and phy_start_aneg_priv
After commit 9f2959b6b52d ("net: phy: improve handling delayed work") the sync parameter isn't needed any longer in phy_start_aneg_priv(). This allows to merge phy_start_aneg() and phy_start_aneg_priv(). Signed-off-by: Heiner Kallweit --- drivers/net/phy/phy.c | 21 +++-- 1 file changed, 3 insertions(+), 18 deletions(-) diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c index d03bdbbd1..1d73ac330 100644 --- a/drivers/net/phy/phy.c +++ b/drivers/net/phy/phy.c @@ -482,16 +482,15 @@ static int phy_config_aneg(struct phy_device *phydev) } /** - * phy_start_aneg_priv - start auto-negotiation for this PHY device + * phy_start_aneg - start auto-negotiation for this PHY device * @phydev: the phy_device struct - * @sync: indicate whether we should wait for the workqueue cancelation * * Description: Sanitizes the settings (if we're not autonegotiating * them), and then calls the driver's config_aneg function. * If the PHYCONTROL Layer is operating, we change the state to * reflect the beginning of Auto-negotiation or forcing. */ -static int phy_start_aneg_priv(struct phy_device *phydev, bool sync) +int phy_start_aneg(struct phy_device *phydev) { bool trigger = 0; int err; @@ -541,20 +540,6 @@ static int phy_start_aneg_priv(struct phy_device *phydev, bool sync) return err; } - -/** - * phy_start_aneg - start auto-negotiation for this PHY device - * @phydev: the phy_device struct - * - * Description: Sanitizes the settings (if we're not autonegotiating - * them), and then calls the driver's config_aneg function. - * If the PHYCONTROL Layer is operating, we change the state to - * reflect the beginning of Auto-negotiation or forcing. 
- */ -int phy_start_aneg(struct phy_device *phydev) -{ - return phy_start_aneg_priv(phydev, true); -} EXPORT_SYMBOL(phy_start_aneg); static int phy_poll_aneg_done(struct phy_device *phydev) @@ -1085,7 +1070,7 @@ void phy_state_machine(struct work_struct *work) mutex_unlock(&phydev->lock); if (needs_aneg) - err = phy_start_aneg_priv(phydev, false); + err = phy_start_aneg(phydev); else if (do_suspend) phy_suspend(phydev); -- 2.19.1
Re: [PATCH net] net/sched: properly init chain in case of multiple control actions
On Sat, Oct 13, 2018 at 8:23 AM Davide Caratti wrote: > > On Fri, 2018-10-12 at 13:57 -0700, Cong Wang wrote: > > Why not just validate the fallback action in each action init()? > > For example, checking tcfg_paction in tcf_gact_init(). > > > > I don't see the need of making it generic. > > hello Cong, once again thanks for looking at this. > > what you say is doable, and I evaluated doing it before proposing this > patch. > > But I felt uncomfortable, because I needed to pass struct tcf_proto *tp in > tcf_gact_init() to initialize a->goto_chain with the chain_idx encoded in > the fallback action. So, I would have changed all the init() functions in > all TC actions, just to fix two of them. > > A (legal?) trick is to let tcf_action store the fallback action when it > contains a 'goto chain' command, I just posted a proposal for gact. If you > think it's ok, I will test and post the same for act_police. Do we really need to support TC_ACT_GOTO_CHAIN for gact->tcfg_paction etc.? I mean, is it useful in practice or is it just for completeness? IF we don't need to support it, we can just make it invalid without needing to initialize it in ->init() at all. If we do, however, we really need to move it into each ->init(), because we have to lock each action if we are modifying an existing one. With your patch, tcf_action_goto_chain_init() is still called without the per-action lock. What's more, if we support two different actions in gact, that is, tcfg_paction and tcf_action, how could you still only have one a->goto_chain pointer? There should be two pointers for each of them. :) Thanks.
Re: [bpf-next PATCH v3 2/2] bpf: bpftool, add flag to allow non-compat map definitions
On Mon, 15 Oct 2018 11:19:55 -0700, John Fastabend wrote: > Multiple map definition structures exist and user may have non-zero > fields in their definition that are not recognized by bpftool and > libbpf. The normal behavior is to then fail loading the map. Although > this is a good default behavior users may still want to load the map > for debugging or other reasons. This patch adds a --mapcompat flag > that can be used to override the default behavior and allow loading > the map even when it has additional non-zero fields. > > For now the only user is 'bpftool prog' we can switch over other > subcommands as needed. The library exposes an API that consumes > a flags field now but I kept the original API around also in case > users of the API don't want to expose this. The flags field is an > int in case we need more control over how the API call handles > errors/features/etc in the future. > > Signed-off-by: John Fastabend Acked-by: Jakub Kicinski Thank you!
[bpf-next PATCH v3 2/2] bpf: bpftool, add flag to allow non-compat map definitions
Multiple map definition structures exist and user may have non-zero fields in their definition that are not recognized by bpftool and libbpf. The normal behavior is to then fail loading the map. Although this is a good default behavior users may still want to load the map for debugging or other reasons. This patch adds a --mapcompat flag that can be used to override the default behavior and allow loading the map even when it has additional non-zero fields. For now the only user is 'bpftool prog' we can switch over other subcommands as needed. The library exposes an API that consumes a flags field now but I kept the original API around also in case users of the API don't want to expose this. The flags field is an int in case we need more control over how the API call handles errors/features/etc in the future. Signed-off-by: John Fastabend --- tools/bpf/bpftool/Documentation/bpftool.rst |4 tools/bpf/bpftool/bash-completion/bpftool |2 +- tools/bpf/bpftool/main.c|7 ++- tools/bpf/bpftool/main.h|3 ++- tools/bpf/bpftool/prog.c|2 +- 5 files changed, 14 insertions(+), 4 deletions(-) diff --git a/tools/bpf/bpftool/Documentation/bpftool.rst b/tools/bpf/bpftool/Documentation/bpftool.rst index 25c0872..6548831 100644 --- a/tools/bpf/bpftool/Documentation/bpftool.rst +++ b/tools/bpf/bpftool/Documentation/bpftool.rst @@ -57,6 +57,10 @@ OPTIONS -p, --pretty Generate human-readable JSON output. Implies **-j**. + -m, --mapcompat + Allow loading maps with unknown map definitions. 
+ + SEE ALSO **bpftool-map**\ (8), **bpftool-prog**\ (8), **bpftool-cgroup**\ (8) diff --git a/tools/bpf/bpftool/bash-completion/bpftool b/tools/bpf/bpftool/bash-completion/bpftool index 0826519..ac85207 100644 --- a/tools/bpf/bpftool/bash-completion/bpftool +++ b/tools/bpf/bpftool/bash-completion/bpftool @@ -184,7 +184,7 @@ _bpftool() # Deal with options if [[ ${words[cword]} == -* ]]; then -local c='--version --json --pretty --bpffs' +local c='--version --json --pretty --bpffs --mapcompat' COMPREPLY=( $( compgen -W "$c" -- "$cur" ) ) return 0 fi diff --git a/tools/bpf/bpftool/main.c b/tools/bpf/bpftool/main.c index 79dc3f1..828dde3 100644 --- a/tools/bpf/bpftool/main.c +++ b/tools/bpf/bpftool/main.c @@ -55,6 +55,7 @@ bool pretty_output; bool json_output; bool show_pinned; +int bpf_flags; struct pinned_obj_table prog_table; struct pinned_obj_table map_table; @@ -341,6 +342,7 @@ int main(int argc, char **argv) { "pretty", no_argument,NULL, 'p' }, { "version",no_argument,NULL, 'V' }, { "bpffs", no_argument,NULL, 'f' }, + { "mapcompat", no_argument,NULL, 'm' }, { 0 } }; int opt, ret; @@ -355,7 +357,7 @@ int main(int argc, char **argv) hash_init(map_table.table); opterr = 0; - while ((opt = getopt_long(argc, argv, "Vhpjf", + while ((opt = getopt_long(argc, argv, "Vhpjfm", options, NULL)) >= 0) { switch (opt) { case 'V': @@ -379,6 +381,9 @@ int main(int argc, char **argv) case 'f': show_pinned = true; break; + case 'm': + bpf_flags = MAPS_RELAX_COMPAT; + break; default: p_err("unrecognized option '%s'", argv[optind - 1]); if (json_output) diff --git a/tools/bpf/bpftool/main.h b/tools/bpf/bpftool/main.h index 40492cd..91fd697 100644 --- a/tools/bpf/bpftool/main.h +++ b/tools/bpf/bpftool/main.h @@ -74,7 +74,7 @@ #define HELP_SPEC_PROGRAM \ "PROG := { id PROG_ID | pinned FILE | tag PROG_TAG }" #define HELP_SPEC_OPTIONS \ - "OPTIONS := { {-j|--json} [{-p|--pretty}] | {-f|--bpffs} }" + "OPTIONS := { {-j|--json} [{-p|--pretty}] | {-f|--bpffs} | {-m|--mapcompat}" #define 
HELP_SPEC_MAP \ "MAP := { id MAP_ID | pinned FILE }" @@ -89,6 +89,7 @@ enum bpf_obj_type { extern json_writer_t *json_wtr; extern bool json_output; extern bool show_pinned; +extern int bpf_flags; extern struct pinned_obj_table prog_table; extern struct pinned_obj_table map_table; diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c index 99ab42c..3350289 100644 --- a/tools/bpf/bpftool/prog.c +++ b/tools/bpf/bpftool/prog.c @@ -908,7 +908,7 @@ static int do_load(int argc, char **argv) } } - obj = bpf_object__open_xattr(&attr); + obj = __bpf_object__open_xattr(&attr, bpf_flags); if (IS_ERR_OR_NULL(obj)) { p_err("failed to
[bpf-next PATCH v3 1/2] bpf: bpftool, add support for attaching programs to maps
Sock map/hash introduce support for attaching programs to maps. To date I have been doing this with custom tooling but this is less than ideal as we shift to using bpftool as the single CLI for our BPF uses. This patch adds new sub commands 'attach' and 'detach' to the 'prog' command to attach programs to maps and then detach them. Signed-off-by: John Fastabend Reviewed-by: Jakub Kicinski --- tools/bpf/bpftool/Documentation/bpftool-prog.rst | 11 ++ tools/bpf/bpftool/Documentation/bpftool.rst |2 tools/bpf/bpftool/bash-completion/bpftool| 19 tools/bpf/bpftool/prog.c | 99 ++ 4 files changed, 128 insertions(+), 3 deletions(-) diff --git a/tools/bpf/bpftool/Documentation/bpftool-prog.rst b/tools/bpf/bpftool/Documentation/bpftool-prog.rst index 64156a1..12c8030 100644 --- a/tools/bpf/bpftool/Documentation/bpftool-prog.rst +++ b/tools/bpf/bpftool/Documentation/bpftool-prog.rst @@ -25,6 +25,8 @@ MAP COMMANDS | **bpftool** **prog dump jited** *PROG* [{**file** *FILE* | **opcodes**}] | **bpftool** **prog pin** *PROG* *FILE* | **bpftool** **prog load** *OBJ* *FILE* [**type** *TYPE*] [**map** {**idx** *IDX* | **name** *NAME*} *MAP*] [**dev** *NAME*] +| **bpftool** **prog attach** *PROG* *ATTACH_TYPE* *MAP* +| **bpftool** **prog detach** *PROG* *ATTACH_TYPE* *MAP* | **bpftool** **prog help** | | *MAP* := { **id** *MAP_ID* | **pinned** *FILE* } @@ -37,6 +39,7 @@ MAP COMMANDS | **cgroup/bind4** | **cgroup/bind6** | **cgroup/post_bind4** | **cgroup/post_bind6** | | **cgroup/connect4** | **cgroup/connect6** | **cgroup/sendmsg4** | **cgroup/sendmsg6** | } +| *ATTACH_TYPE* := { **msg_verdict** | **skb_verdict** | **skb_parse** } DESCRIPTION @@ -90,6 +93,14 @@ DESCRIPTION Note: *FILE* must be located in *bpffs* mount. +**bpftool prog attach** *PROG* *ATTACH_TYPE* *MAP* + Attach bpf program *PROG* (with type specified by *ATTACH_TYPE*) + to the map *MAP*. 
+ +**bpftool prog detach** *PROG* *ATTACH_TYPE* *MAP* + Detach bpf program *PROG* (with type specified by *ATTACH_TYPE*) + from the map *MAP*. + **bpftool prog help** Print short help message. diff --git a/tools/bpf/bpftool/Documentation/bpftool.rst b/tools/bpf/bpftool/Documentation/bpftool.rst index 8dda77d..25c0872 100644 --- a/tools/bpf/bpftool/Documentation/bpftool.rst +++ b/tools/bpf/bpftool/Documentation/bpftool.rst @@ -26,7 +26,7 @@ SYNOPSIS | **pin** | **event_pipe** | **help** } *PROG-COMMANDS* := { **show** | **list** | **dump jited** | **dump xlated** | **pin** - | **load** | **help** } + | **load** | **attach** | **detach** | **help** } *CGROUP-COMMANDS* := { **show** | **list** | **attach** | **detach** | **help** } diff --git a/tools/bpf/bpftool/bash-completion/bpftool b/tools/bpf/bpftool/bash-completion/bpftool index df1060b..0826519 100644 --- a/tools/bpf/bpftool/bash-completion/bpftool +++ b/tools/bpf/bpftool/bash-completion/bpftool @@ -292,6 +292,23 @@ _bpftool() fi return 0 ;; +attach|detach) +if [[ ${#words[@]} == 7 ]]; then +COMPREPLY=( $( compgen -W "id pinned" -- "$cur" ) ) +return 0 +fi + +if [[ ${#words[@]} == 6 ]]; then +COMPREPLY=( $( compgen -W "msg_verdict skb_verdict skb_parse" -- "$cur" ) ) +return 0 +fi + +if [[ $prev == "$command" ]]; then +COMPREPLY=( $( compgen -W "id pinned" -- "$cur" ) ) +return 0 +fi +return 0 +;; load) local obj @@ -347,7 +364,7 @@ _bpftool() ;; *) [[ $prev == $object ]] && \ -COMPREPLY=( $( compgen -W 'dump help pin load \ +COMPREPLY=( $( compgen -W 'dump help pin attach detach load \ show list' -- "$cur" ) ) ;; esac diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c index b1cd3bc..99ab42c 100644 --- a/tools/bpf/bpftool/prog.c +++ b/tools/bpf/bpftool/prog.c @@ -77,6 +77,26 @@ [BPF_PROG_TYPE_FLOW_DISSECTOR] = "flow_dissector", }; +static const char * const attach_type_strings[] = { + [BPF_SK_SKB_STREAM_PARSER] = "stream_parser", + [BPF_SK_SKB_STREAM_VERDICT] = "stream_verdict", + 
[BPF_SK_MSG_VERDICT] = "msg_verdict", + [__MAX_BPF_ATTACH_TYPE] = NULL, +}; + +enum bpf_attach_type parse_attach_type(const char *str) +{ + enum
[bpf-next PATCH v3 0/2] bpftool support for sockmap use cases

The first patch adds support for attaching programs to maps. This is needed to support sock{map|hash} use from bpftool. Currently, I carry around custom code to do this so doing it using standard bpftool will be great. The second patch adds a compat mode to ignore non-zero entries in the map def. This allows using bpftool with maps that have extra fields that the user knows can be ignored. This is needed to work correctly with maps being loaded by other tools or directly via syscalls. v3: add bash completion and doc updates for --mapcompat --- John Fastabend (2): bpf: bpftool, add support for attaching programs to maps bpf: bpftool, add flag to allow non-compat map definitions tools/bpf/bpftool/Documentation/bpftool-prog.rst | 11 ++ tools/bpf/bpftool/Documentation/bpftool.rst |6 + tools/bpf/bpftool/bash-completion/bpftool| 21 - tools/bpf/bpftool/main.c |7 +- tools/bpf/bpftool/main.h |3 - tools/bpf/bpftool/prog.c | 101 ++ 6 files changed, 142 insertions(+), 7 deletions(-) -- Signature
Re: [PATCH stable 4.9 v2 00/29] backport of IP fragmentation fixes
On Mon, Oct 15, 2018 at 10:47 AM Florian Fainelli wrote: > > > > On 10/10/2018 12:29 PM, Florian Fainelli wrote: > > This is based on Stephen's v4.14 patches, with the necessary merge > > conflicts, and the lack of timer_setup() on the 4.9 baseline. > > > > Perf results on a gigabit capable system, before and after are below. > > > > Series can also be found here: > > > > https://github.com/ffainelli/linux/commits/fragment-stack-v4.9-v2 > > > > Changes in v2: > > > > - drop "net: sk_buff rbnode reorg" > > - added original "ip: use rb trees for IP frag queue." commit > > Eric, does this look reasonable to you? Yes, thanks a lot Florian. > > > > > Before patches: > > > >PerfTop: 180 irqs/sec kernel:78.9% exact: 0.0% [4000Hz > > cycles:ppp], (all, 4 CPUs) > > --- > > > > 34.81% [kernel] [k] ip_defrag > > 4.57% [kernel] [k] arch_cpu_idle > > 2.09% [kernel] [k] fib_table_lookup > > 1.74% [kernel] [k] finish_task_switch > > 1.57% [kernel] [k] v7_dma_inv_range > > 1.47% [kernel] [k] __netif_receive_skb_core > > 1.06% [kernel] [k] __slab_free > > 1.04% [kernel] [k] __netdev_alloc_skb > > 0.99% [kernel] [k] ip_route_input_noref > > 0.96% [kernel] [k] dev_gro_receive > > 0.96% [kernel] [k] tick_nohz_idle_enter > > 0.93% [kernel] [k] bcm_sysport_poll > > 0.92% [kernel] [k] skb_release_data > > 0.91% [kernel] [k] __memzero > > 0.90% [kernel] [k] __free_page_frag > > 0.87% [kernel] [k] ip_rcv > > 0.77% [kernel] [k] eth_type_trans > > 0.71% [kernel] [k] _raw_spin_unlock_irqrestore > > 0.68% [kernel] [k] tick_nohz_idle_exit > > 0.65% [kernel] [k] bcm_sysport_rx_refill > > > > After patches: > > > >PerfTop: 214 irqs/sec kernel:80.4% exact: 0.0% [4000Hz > > cycles:ppp], (all, 4 CPUs) > > --- > > > > 6.61% [kernel] [k] arch_cpu_idle > > 3.77% [kernel] [k] ip_defrag > > 3.65% [kernel] [k] v7_dma_inv_range > > 3.18% [kernel] [k] fib_table_lookup > > 3.04% [kernel] [k] __netif_receive_skb_core > > 2.31% [kernel] [k] finish_task_switch > > 2.31% [kernel] [k] _raw_spin_unlock_irqrestore 
> > 1.65% [kernel] [k] bcm_sysport_poll > > 1.63% [kernel] [k] ip_route_input_noref > > 1.63% [kernel] [k] __memzero > > 1.58% [kernel] [k] __netdev_alloc_skb > > 1.47% [kernel] [k] tick_nohz_idle_enter > > 1.40% [kernel] [k] __slab_free > > 1.32% [kernel] [k] ip_rcv > > 1.32% [kernel] [k] __softirqentry_text_start > > 1.30% [kernel] [k] dev_gro_receive > > 1.23% [kernel] [k] bcm_sysport_rx_refill > > 1.11% [kernel] [k] tick_nohz_idle_exit > > 1.06% [kernel] [k] memcmp > > 1.02% [kernel] [k] dma_cache_maint_page > > > > > > Dan Carpenter (1): > > ipv4: frags: precedence bug in ip_expire() > > > > Eric Dumazet (21): > > inet: frags: change inet_frags_init_net() return value > > inet: frags: add a pointer to struct netns_frags > > inet: frags: refactor ipfrag_init() > > inet: frags: refactor ipv6_frag_init() > > inet: frags: refactor lowpan_net_frag_init() > > ipv6: export ip6 fragments sysctl to unprivileged users > > rhashtable: add schedule points > > inet: frags: use rhashtables for reassembly units > > inet: frags: remove some helpers > > inet: frags: get rif of inet_frag_evicting() > > inet: frags: remove inet_frag_maybe_warn_overflow() > > inet: frags: break the 2GB limit for frags storage > > inet: frags: do not clone skb in ip_expire() > > ipv6: frags: rewrite ip6_expire_frag_queue() > > rhashtable: reorganize struct rhashtable layout > > inet: frags: reorganize struct netns_frags > > inet: frags: get rid of ipfrag_skb_cb/FRAG_CB > > inet: frags: fix ip6frag_low_thresh boundary > > net: speed up skb_rbtree_purge() > > net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends > > net: add rb_to_skb() and other rb tree helpers > > > > Florian Westphal (1): > > ipv6: defrag: drop non-last frags smaller than min mtu > > > > Peter Oskolkov (5): > > ip: discard IPv4 datagrams with overlapping segments. > > net: modify skb_rbtree_purge to return the truesize of all purged > > skbs. > > ip: use rb trees for IP frag queue. 
> > ip: add helpers to process in-order fragments faster. > > ip: process in-order fragments efficiently > > > > Taehee Yoo (1): > > ip: frags: fix crash in ip_do_fragment() > > > > Documentation/networking/ip-sysctl.txt | 13 +- > > include/linux/rhashtable.h | 4 +- > > include/linux/skbuff.h
Re: [PATCH stable 4.9 v2 00/29] backport of IP fragmentation fixes
On 10/10/2018 12:29 PM, Florian Fainelli wrote: > This is based on Stephen's v4.14 patches, with the necessary merge > conflicts, and the lack of timer_setup() on the 4.9 baseline. > > Perf results on a gigabit capable system, before and after are below. > > Series can also be found here: > > https://github.com/ffainelli/linux/commits/fragment-stack-v4.9-v2 > > Changes in v2: > > - drop "net: sk_buff rbnode reorg" > - added original "ip: use rb trees for IP frag queue." commit Eric, does this look reasonable to you? > > Before patches: > >PerfTop: 180 irqs/sec kernel:78.9% exact: 0.0% [4000Hz cycles:ppp], > (all, 4 CPUs) > --- > > 34.81% [kernel] [k] ip_defrag > 4.57% [kernel] [k] arch_cpu_idle > 2.09% [kernel] [k] fib_table_lookup > 1.74% [kernel] [k] finish_task_switch > 1.57% [kernel] [k] v7_dma_inv_range > 1.47% [kernel] [k] __netif_receive_skb_core > 1.06% [kernel] [k] __slab_free > 1.04% [kernel] [k] __netdev_alloc_skb > 0.99% [kernel] [k] ip_route_input_noref > 0.96% [kernel] [k] dev_gro_receive > 0.96% [kernel] [k] tick_nohz_idle_enter > 0.93% [kernel] [k] bcm_sysport_poll > 0.92% [kernel] [k] skb_release_data > 0.91% [kernel] [k] __memzero > 0.90% [kernel] [k] __free_page_frag > 0.87% [kernel] [k] ip_rcv > 0.77% [kernel] [k] eth_type_trans > 0.71% [kernel] [k] _raw_spin_unlock_irqrestore > 0.68% [kernel] [k] tick_nohz_idle_exit > 0.65% [kernel] [k] bcm_sysport_rx_refill > > After patches: > >PerfTop: 214 irqs/sec kernel:80.4% exact: 0.0% [4000Hz cycles:ppp], > (all, 4 CPUs) > --- > > 6.61% [kernel] [k] arch_cpu_idle > 3.77% [kernel] [k] ip_defrag > 3.65% [kernel] [k] v7_dma_inv_range > 3.18% [kernel] [k] fib_table_lookup > 3.04% [kernel] [k] __netif_receive_skb_core > 2.31% [kernel] [k] finish_task_switch > 2.31% [kernel] [k] _raw_spin_unlock_irqrestore > 1.65% [kernel] [k] bcm_sysport_poll > 1.63% [kernel] [k] ip_route_input_noref > 1.63% [kernel] [k] __memzero > 1.58% [kernel] [k] __netdev_alloc_skb > 1.47% [kernel] [k] tick_nohz_idle_enter > 1.40% 
[kernel] [k] __slab_free > 1.32% [kernel] [k] ip_rcv > 1.32% [kernel] [k] __softirqentry_text_start > 1.30% [kernel] [k] dev_gro_receive > 1.23% [kernel] [k] bcm_sysport_rx_refill > 1.11% [kernel] [k] tick_nohz_idle_exit > 1.06% [kernel] [k] memcmp > 1.02% [kernel] [k] dma_cache_maint_page > > > Dan Carpenter (1): > ipv4: frags: precedence bug in ip_expire() > > Eric Dumazet (21): > inet: frags: change inet_frags_init_net() return value > inet: frags: add a pointer to struct netns_frags > inet: frags: refactor ipfrag_init() > inet: frags: refactor ipv6_frag_init() > inet: frags: refactor lowpan_net_frag_init() > ipv6: export ip6 fragments sysctl to unprivileged users > rhashtable: add schedule points > inet: frags: use rhashtables for reassembly units > inet: frags: remove some helpers > inet: frags: get rif of inet_frag_evicting() > inet: frags: remove inet_frag_maybe_warn_overflow() > inet: frags: break the 2GB limit for frags storage > inet: frags: do not clone skb in ip_expire() > ipv6: frags: rewrite ip6_expire_frag_queue() > rhashtable: reorganize struct rhashtable layout > inet: frags: reorganize struct netns_frags > inet: frags: get rid of ipfrag_skb_cb/FRAG_CB > inet: frags: fix ip6frag_low_thresh boundary > net: speed up skb_rbtree_purge() > net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends > net: add rb_to_skb() and other rb tree helpers > > Florian Westphal (1): > ipv6: defrag: drop non-last frags smaller than min mtu > > Peter Oskolkov (5): > ip: discard IPv4 datagrams with overlapping segments. > net: modify skb_rbtree_purge to return the truesize of all purged > skbs. > ip: use rb trees for IP frag queue. > ip: add helpers to process in-order fragments faster. 
> ip: process in-order fragments efficiently > > Taehee Yoo (1): > ip: frags: fix crash in ip_do_fragment() > > Documentation/networking/ip-sysctl.txt | 13 +- > include/linux/rhashtable.h | 4 +- > include/linux/skbuff.h | 34 +- > include/net/inet_frag.h | 133 +++--- > include/net/ip.h| 1 - > include/net/ipv6.h | 26 +- > include/uapi/linux/snmp.h | 1 + > lib/rhashtable.c| 5 +- > net/core/skbuff.c
[PATCH bpf-next 1/2] bpf: Allow sk_lookup with IPv6 module
This is a more complete fix than d71019b54bff ("net: core: Fix build with CONFIG_IPV6=m"), so that IPv6 sockets may be looked up if the IPv6 module is loaded (not just if it's compiled in). Signed-off-by: Joe Stringer --- include/net/addrconf.h | 5 + net/core/filter.c | 12 +++- net/ipv6/af_inet6.c| 1 + 3 files changed, 13 insertions(+), 5 deletions(-) diff --git a/include/net/addrconf.h b/include/net/addrconf.h index 6def0351bcc3..14b789a123e7 100644 --- a/include/net/addrconf.h +++ b/include/net/addrconf.h @@ -265,6 +265,11 @@ extern const struct ipv6_stub *ipv6_stub __read_mostly; struct ipv6_bpf_stub { int (*inet6_bind)(struct sock *sk, struct sockaddr *uaddr, int addr_len, bool force_bind_address_no_port, bool with_lock); + struct sock *(*udp6_lib_lookup)(struct net *net, + const struct in6_addr *saddr, __be16 sport, + const struct in6_addr *daddr, __be16 dport, + int dif, int sdif, struct udp_table *tbl, + struct sk_buff *skb); }; extern const struct ipv6_bpf_stub *ipv6_bpf_stub __read_mostly; diff --git a/net/core/filter.c b/net/core/filter.c index b844761b5d4c..21aba2a521c7 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -4842,7 +4842,7 @@ static struct sock *sk_lookup(struct net *net, struct bpf_sock_tuple *tuple, sk = __udp4_lib_lookup(net, src4, tuple->ipv4.sport, dst4, tuple->ipv4.dport, dif, sdif, &udp_table, skb); -#if IS_REACHABLE(CONFIG_IPV6) +#if IS_ENABLED(CONFIG_IPV6) } else { struct in6_addr *src6 = (struct in6_addr *)&tuple->ipv6.saddr; struct in6_addr *dst6 = (struct in6_addr *)&tuple->ipv6.daddr; @@ -4853,10 +4853,12 @@ static struct sock *sk_lookup(struct net *net, struct bpf_sock_tuple *tuple, src6, tuple->ipv6.sport, dst6, tuple->ipv6.dport, dif, sdif, &refcounted); - else - sk = __udp6_lib_lookup(net, src6, tuple->ipv6.sport, - dst6, tuple->ipv6.dport, - dif, sdif, &udp_table, skb); + else if (likely(ipv6_bpf_stub)) + sk = ipv6_bpf_stub->udp6_lib_lookup(net, + src6, tuple->ipv6.sport, + dst6, tuple->ipv6.dport, + dif, sdif, + &udp_table, skb); #endif } diff --git 
a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c index e9c8cfdf4b4c..3f4d61017a69 100644 --- a/net/ipv6/af_inet6.c +++ b/net/ipv6/af_inet6.c @@ -901,6 +901,7 @@ static const struct ipv6_stub ipv6_stub_impl = { static const struct ipv6_bpf_stub ipv6_bpf_stub_impl = { .inet6_bind = __inet6_bind, + .udp6_lib_lookup = __udp6_lib_lookup, }; static int __init inet6_init(void) -- 2.17.1
[PATCH bpf-next 0/2] IPv6 sk-lookup fixes
This series includes a couple of fixups for the IPv6 socket lookup helper, to make the API more consistent (always supply all arguments in network byte-order) and to allow its use when IPv6 is compiled as a module. Joe Stringer (2): bpf: Allow sk_lookup with IPv6 module bpf: Fix IPv6 dport byte-order in bpf_sk_lookup include/net/addrconf.h | 5 + net/core/filter.c | 15 +-- net/ipv6/af_inet6.c| 1 + 3 files changed, 15 insertions(+), 6 deletions(-) -- 2.17.1
[PATCH bpf-next 2/2] bpf: Fix IPv6 dport byte-order in bpf_sk_lookup
Commit 6acc9b432e67 ("bpf: Add helper to retrieve socket in BPF") mistakenly passed the destination port in network byte-order to the IPv6 TCP/UDP socket lookup functions, which meant that BPF writers would need to manually swap the byte-order of this field, or IPv6 sockets could not be located via this helper. Fix the issue by swapping the byte-order appropriately in the helper. This also makes the API more consistent with the IPv4 version. Fixes: 6acc9b432e67 ("bpf: Add helper to retrieve socket in BPF") Signed-off-by: Joe Stringer --- net/core/filter.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/net/core/filter.c b/net/core/filter.c index 21aba2a521c7..d877c4c599ce 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -4846,17 +4846,18 @@ static struct sock *sk_lookup(struct net *net, struct bpf_sock_tuple *tuple, } else { struct in6_addr *src6 = (struct in6_addr *)&tuple->ipv6.saddr; struct in6_addr *dst6 = (struct in6_addr *)&tuple->ipv6.daddr; + u16 hnum = ntohs(tuple->ipv6.dport); int sdif = inet6_sdif(skb); if (proto == IPPROTO_TCP) sk = __inet6_lookup(net, &tcp_hashinfo, skb, 0, src6, tuple->ipv6.sport, - dst6, tuple->ipv6.dport, + dst6, hnum, dif, sdif, &refcounted); else if (likely(ipv6_bpf_stub)) sk = ipv6_bpf_stub->udp6_lib_lookup(net, src6, tuple->ipv6.sport, - dst6, hnum, + dst6, hnum, dif, sdif, &udp_table, skb); #endif -- 2.17.1
Re: [PATCH bpf-next] tools: bpftool: add map create command
On Fri, 12 Oct 2018 23:16:59 -0700, Alexei Starovoitov wrote: > On Fri, Oct 12, 2018 at 11:06:14AM -0700, Jakub Kicinski wrote: > > Add a way of creating maps from user space. The command takes > > as parameters most of the attributes of the map creation system > > call command. After the map is created it's pinned to bpffs. This makes > > it possible to easily and dynamically (without rebuilding programs) > > test various corner cases related to map creation. > > > > Map type names are taken from bpftool's array used for printing. > > In general these days we try to make use of libbpf type names, but > > there are no map type names in libbpf as of today. > > > > As with most features I add the motivation is testing (offloads) :) > > > > Signed-off-by: Jakub Kicinski > > Reviewed-by: Quentin Monnet > ... > > fprintf(stderr, > > "Usage: %s %s { show | list } [MAP]\n" > > + " %s %s create FILE type TYPE key KEY_SIZE value > > VALUE_SIZE \\\n" > > + " entries MAX_ENTRIES [name NAME] > > [flags FLAGS] \\\n" > > + " [dev NAME]\n" > I suspect as soon as bpftool has an ability to create standalone maps > some folks will start relying on such interface. That'd be cool, do you see any real life use cases where it's useful outside of corner case testing? > Therefore I'd like to request to make 'name' argument to be mandatory. Will do in v2! > I think in the future we will require BTF to be mandatory too. > We need to move towards more transparent and debuggable infra. > Do you think requiring json description of key/value would be manageable to > implement? > Then bpftool could convert it to BTF and the map will be fully defined. > I certainly understand that bpf prog can disregard the key/value layout today, > but we will make verifier to enforce that in the future too. I was hoping that we can leave BTF support as a future extension, and then once we have the option for the verifier to enforce BTF (a sysctl?) 
the bpftool map create without a BTF will get rejected as one would expect. IOW it's fine not to make BTF required at bpftool level and leave it to system configuration. I'd love to implement the BTF support right away, but I'm not sure I can afford that right now time-wise. The whole map create command is pretty trivial, but for BTF we don't even have a way of dumping it AFAICT. We can pretty print values, but what is the format in which to express the BTF itself? We could do JSON, do we use an external library? Should we have a separate BTF command for that?
Re: [PATCH iproute 2/2] utils: fix get_rtnl_link_stats_rta stats parsing
On Thu, 11 Oct 2018 14:24:03 +0200 Lorenzo Bianconi wrote: > > > iproute2 walks through the list of available tunnels using netlink > > > protocol in order to get device info instead of reading > > > them from proc filesystem. However the kernel reports device statistics > > > using IFLA_INET6_STATS/IFLA_INET6_ICMP6STATS attributes nested in > > > IFLA_PROTINFO one but iproutes expects these info in > > > IFLA_STATS64/IFLA_STATS attributes. > > > The issue can be triggered with the following reproducer: > > > > > > $ip link add ip6d0 type ip6tnl mode ip6ip6 local ::1 remote ::1 > > > $ip -6 -d -s tunnel show ip6d0 > > > ip6d0: ipv6/ipv6 remote ::1 local ::1 encaplimit 4 hoplimit 64 > > > tclass 0x00 flowlabel 0x0 (flowinfo 0x) > > > Dump terminated > > > > > > Fix the issue introducing IFLA_INET6_STATS attribute parsing > > > > > > Fixes: 3e953938717f ("iptunnel/ip6tunnel: Use netlink to walk through > > > tunnels list") > > > > > > Signed-off-by: Lorenzo Bianconi > > > > Can't we fix the kernel to report statistics properly, rather than > > starting iproute2 doing more /proc interfaces. > > > > Hi Stephen, > > sorry, I did not get what you mean. Current iproute implementation > walks through tunnels list using netlink protocol and parses device > statistics in the kernel netlink message. However it does not take > into account the actual netlink message layout since the statistic > attribute is nested in IFLA_PROTINFO one. > Moreover AFAIU the related kernel code has not changed since iproute > commit 3e953938717f, so I guess we should fix the issue in iproute code > instead in the kernel one. Do you agree? > > Regards, > Lorenzo Applied to current iproute2.
[PATCH net-next 7/7] tcp: cdg: use tcp high resolution clock cache
We store in the tcp socket a cache of the most recent high resolution clock; there is no need to call local_clock() again, since this cache is good enough. Signed-off-by: Eric Dumazet --- net/ipv4/tcp_cdg.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/ipv4/tcp_cdg.c b/net/ipv4/tcp_cdg.c index 06fbe102a425f28b43294925d8d13af4a13ec776..37eebd9103961be4731323cfb4d933b51954e802 100644 --- a/net/ipv4/tcp_cdg.c +++ b/net/ipv4/tcp_cdg.c @@ -146,7 +146,7 @@ static void tcp_cdg_hystart_update(struct sock *sk) return; if (hystart_detect & HYSTART_ACK_TRAIN) { - u32 now_us = div_u64(local_clock(), NSEC_PER_USEC); + u32 now_us = tp->tcp_mstamp; if (ca->last_ack == 0 || !tcp_is_cwnd_limited(sk)) { ca->last_ack = now_us; -- 2.19.0.605.g01d371f741-goog
[PATCH net-next 3/7] tcp: mitigate scheduling jitter in EDT pacing model
In commit fefa569a9d4b ("net_sched: sch_fq: account for schedule/timers drifts") we added a mitigation for scheduling jitter in the fq packet scheduler. This patch does the same in the TCP stack, now that it is using the EDT model. Note that this mitigation is valid for both external (fq packet scheduler) and internal TCP pacing. This uses the same strategy as the above commit, allowing a time credit of half the packet currently sent. Consider the following case: An skb is sent, after an idle period of 300 usec. The air-time (skb->len/pacing_rate) is 500 usec. Instead of setting the pacing timer to now+500 usec, it will use now+min(500/2, 300) -> now+250 usec. This is like having a token bucket with a depth of half an skb. Tested: tc qdisc replace dev eth0 root pfifo_fast Before netperf -P0 -H remote -- -q 10 # 8000Mbit 54 262144 262144 10.00 7710.43 After : netperf -P0 -H remote -- -q 10 # 8000 Mbit 54 262144 262144 10.00 7999.75 # Much closer to 8000Mbit target Signed-off-by: Eric Dumazet --- net/ipv4/tcp_output.c | 19 +-- 1 file changed, 13 insertions(+), 6 deletions(-) diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index f4aa4109334a043d02b17b18bef346d805dab501..5474c9854f252e50cdb1136435417873861d7618 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -985,7 +985,8 @@ static void tcp_internal_pacing(struct sock *sk) sock_hold(sk); } -static void tcp_update_skb_after_send(struct sock *sk, struct sk_buff *skb) +static void tcp_update_skb_after_send(struct sock *sk, struct sk_buff *skb, + u64 prior_wstamp) { struct tcp_sock *tp = tcp_sk(sk); @@ -998,7 +999,12 @@ static void tcp_update_skb_after_send(struct sock *sk, struct sk_buff *skb) * this is a minor annoyance. 
*/ if (rate != ~0UL && rate && tp->data_segs_out >= 10) { - tp->tcp_wstamp_ns += div64_ul((u64)skb->len * NSEC_PER_SEC, rate); + u64 len_ns = div64_ul((u64)skb->len * NSEC_PER_SEC, rate); + u64 credit = tp->tcp_wstamp_ns - prior_wstamp; + + /* take into account OS jitter */ + len_ns -= min_t(u64, len_ns / 2, credit); + tp->tcp_wstamp_ns += len_ns; tcp_internal_pacing(sk); } @@ -1029,6 +1035,7 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, struct sk_buff *oskb = NULL; struct tcp_md5sig_key *md5; struct tcphdr *th; + u64 prior_wstamp; int err; BUG_ON(!skb || !tcp_skb_pcount(skb)); @@ -1050,7 +1057,7 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, return -ENOBUFS; } - /* TODO: might take care of jitter here */ + prior_wstamp = tp->tcp_wstamp_ns; tp->tcp_wstamp_ns = max(tp->tcp_wstamp_ns, tp->tcp_clock_cache); skb->skb_mstamp_ns = tp->tcp_wstamp_ns; @@ -1169,7 +1176,7 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, err = net_xmit_eval(err); } if (!err && oskb) { - tcp_update_skb_after_send(sk, oskb); + tcp_update_skb_after_send(sk, oskb, prior_wstamp); tcp_rate_skb_sent(sk, oskb); } return err; @@ -2321,7 +2328,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle, if (unlikely(tp->repair) && tp->repair_queue == TCP_SEND_QUEUE) { /* "skb_mstamp" is used as a start point for the retransmit timer */ - tcp_update_skb_after_send(sk, skb); + tcp_update_skb_after_send(sk, skb, tp->tcp_wstamp_ns); goto repair; /* Skip network transmission */ } @@ -2896,7 +2903,7 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs) } tcp_skb_tsorted_restore(skb); if (!err) { - tcp_update_skb_after_send(sk, skb); + tcp_update_skb_after_send(sk, skb, tp->tcp_wstamp_ns); tcp_rate_skb_sent(sk, skb); } } else { -- 2.19.0.605.g01d371f741-goog
[PATCH net-next 2/7] net: extend sk_pacing_rate to unsigned long
sk_pacing_rate has been introduced as a u32 field in 2013, effectively limiting per flow pacing to 34Gbit. We believe it is time to allow TCP to pace high speed flows on 64bit hosts, as we now can reach 100Gbit on one TCP flow. This patch adds no cost for 32bit kernels. The tcpi_pacing_rate and tcpi_max_pacing_rate were already exported as 64bit, so the iproute2/ss command requires no changes. Unfortunately the SO_MAX_PACING_RATE socket option will stay 32bit and we will need to add a new option to let applications control high pacing rates. State Recv-Q Send-Q Local Address:Port Peer Address:Port ESTAB 0 1787144 10.246.9.76:49992 10.246.9.77:36741 timer:(on,003ms,0) ino:91863 sk:2 <-> skmem:(r0,rb54,t66440,tb2363904,f605944,w1822984,o0,bl0,d0) ts sack bbr wscale:8,8 rto:201 rtt:0.057/0.006 mss:1448 rcvmss:536 advmss:1448 cwnd:138 ssthresh:178 bytes_acked:256699822585 segs_out:177279177 segs_in:3916318 data_segs_out:177279175 bbr:(bw:31276.8Mbps,mrtt:0,pacing_gain:1.25,cwnd_gain:2) send 28045.5Mbps lastrcv:7 pacing_rate 38705.0Mbps delivery_rate 22997.6Mbps busy:7ms unacked:135 retrans:0/157 rcv_space:14480 notsent:2085120 minrtt:0.013 Signed-off-by: Eric Dumazet --- include/net/sock.h| 4 ++-- net/core/filter.c | 4 ++-- net/core/sock.c | 9 + net/ipv4/tcp.c| 10 +- net/ipv4/tcp_bbr.c| 6 +++--- net/ipv4/tcp_output.c | 19 +++ net/sched/sch_fq.c| 20 7 files changed, 40 insertions(+), 32 deletions(-) diff --git a/include/net/sock.h b/include/net/sock.h index 751549ac0a849144ab0382203ee5c877374523e2..cfaf261936c8787b3a65ce832fd9c871697d00f4 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -422,8 +422,8 @@ struct sock { struct timer_list sk_timer; __u32 sk_priority; __u32 sk_mark; - u32 sk_pacing_rate; /* bytes per second */ - u32 sk_max_pacing_rate; + unsigned long sk_pacing_rate; /* bytes per second */ + unsigned long sk_max_pacing_rate; struct page_frag sk_frag; netdev_features_t sk_route_caps; netdev_features_t sk_route_nocaps; diff --git a/net/core/filter.c 
b/net/core/filter.c index 4bbc6567fcb818e91617bfa9a2fd7fbebbd129f8..80da21b097b8d05eb7b9fa92afa86762334ac0ae 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -3927,8 +3927,8 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, bpf_sock, sk->sk_userlocks |= SOCK_SNDBUF_LOCK; sk->sk_sndbuf = max_t(int, val * 2, SOCK_MIN_SNDBUF); break; - case SO_MAX_PACING_RATE: - sk->sk_max_pacing_rate = val; + case SO_MAX_PACING_RATE: /* 32bit version */ + sk->sk_max_pacing_rate = (val == ~0U) ? ~0UL : val; sk->sk_pacing_rate = min(sk->sk_pacing_rate, sk->sk_max_pacing_rate); break; diff --git a/net/core/sock.c b/net/core/sock.c index 7e8796a6a0892efbb7dfce67d12b8062b2d5daa9..fdf9fc7d3f9875f2718575078a0f263674c80b4f 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -998,7 +998,7 @@ int sock_setsockopt(struct socket *sock, int level, int optname, cmpxchg(&sk->sk_pacing_status, SK_PACING_NONE, SK_PACING_NEEDED); - sk->sk_max_pacing_rate = val; + sk->sk_max_pacing_rate = (val == ~0U) ? ~0UL : val; sk->sk_pacing_rate = min(sk->sk_pacing_rate, sk->sk_max_pacing_rate); break; @@ -1336,7 +1336,8 @@ int sock_getsockopt(struct socket *sock, int level, int optname, #endif case SO_MAX_PACING_RATE: - v.val = sk->sk_max_pacing_rate; + /* 32bit version */ + v.val = min_t(unsigned long, sk->sk_max_pacing_rate, ~0U); break; case SO_INCOMING_CPU: @@ -2810,8 +2811,8 @@ void sock_init_data(struct socket *sock, struct sock *sk) sk->sk_ll_usec = sysctl_net_busy_read; #endif - sk->sk_max_pacing_rate = ~0U; - sk->sk_pacing_rate = ~0U; + sk->sk_max_pacing_rate = ~0UL; + sk->sk_pacing_rate = ~0UL; sk->sk_pacing_shift = 10; sk->sk_incoming_cpu = -1; diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 43ef83b2330e6238a55c9843580a585d87708e0c..b8ba8fa34effac5138aea76b0d0fc2a9f1c05c4f 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -3111,10 +3111,10 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info) { const struct tcp_sock *tp = tcp_sk(sk); /* iff sk_type == SOCK_STREAM */ const 
struct inet_connection_sock
[PATCH net-next 6/7] tcp_bbr: fix typo in bbr_pacing_margin_percent
From: Neal Cardwell There was a typo in this parameter name. Signed-off-by: Neal Cardwell Signed-off-by: Eric Dumazet --- net/ipv4/tcp_bbr.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c index 33f4358615e6d63b5c98a30484f12ffae66334a2..b88081285fd172444a844b6aec5d038c0f882594 100644 --- a/net/ipv4/tcp_bbr.c +++ b/net/ipv4/tcp_bbr.c @@ -129,7 +129,7 @@ static const u32 bbr_probe_rtt_mode_ms = 200; static const int bbr_min_tso_rate = 120; /* Pace at ~1% below estimated bw, on average, to reduce queue at bottleneck. */ -static const int bbr_pacing_marging_percent = 1; +static const int bbr_pacing_margin_percent = 1; /* We use a high_gain value of 2/ln(2) because it's the smallest pacing gain * that will allow a smoothly increasing pacing rate that will double each RTT @@ -214,7 +214,7 @@ static u64 bbr_rate_bytes_per_sec(struct sock *sk, u64 rate, int gain) rate *= mss; rate *= gain; rate >>= BBR_SCALE; - rate *= USEC_PER_SEC / 100 * (100 - bbr_pacing_marging_percent); + rate *= USEC_PER_SEC / 100 * (100 - bbr_pacing_margin_percent); return rate >> BW_SCALE; } -- 2.19.0.605.g01d371f741-goog
[PATCH net-next 4/7] net_sched: sch_fq: no longer use skb_is_tcp_pure_ack()
With the new EDT model, sch_fq no longer has to special case TCP pure acks, since their skb->tstamp will allow them being sent without pacing delay. Signed-off-by: Eric Dumazet --- net/sched/sch_fq.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c index 3923d14095335df61c270f69e50cb7cbfde4c796..4b1af706896c07e5a0fe6d542dfcd530acdcf8f5 100644 --- a/net/sched/sch_fq.c +++ b/net/sched/sch_fq.c @@ -444,7 +444,7 @@ static struct sk_buff *fq_dequeue(struct Qdisc *sch) } skb = f->head; - if (skb && !skb_is_tcp_pure_ack(skb)) { + if (skb) { u64 time_next_packet = max_t(u64, ktime_to_ns(skb->tstamp), f->time_next_packet); -- 2.19.0.605.g01d371f741-goog
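The dequeue-side comparison the patch simplifies can be sketched as follows; under EDT a pure ACK carries skb->tstamp == 0, so the max() naturally releases it as soon as the flow's own time_next_packet allows. Names here are illustrative, not the kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One skb at the head of one flow: may it leave the qdisc now? */
static bool fq_can_send(uint64_t skb_tstamp_ns,
			uint64_t flow_time_next_packet_ns,
			uint64_t now_ns)
{
	uint64_t time_next_packet = skb_tstamp_ns > flow_time_next_packet_ns ?
				    skb_tstamp_ns : flow_time_next_packet_ns;

	/* Before the patch, a pure ACK bypassed this check entirely;
	 * now every skb goes through the same EDT comparison. */
	return now_ns >= time_next_packet;
}
```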
[PATCH net-next 5/7] tcp: optimize tcp internal pacing
When TCP implements its own pacing (when no fq packet scheduler is used), it arms a high resolution timer after a packet is sent. But in many cases (like TCP_RR kind of workloads), this high resolution timer expires before the application attempts to write the following packet. This overhead also happens when the flow is ACK clocked and cwnd limited instead of being limited by the pacing rate. This leads to extra overhead (a high number of IRQs). Now that tcp_wstamp_ns is reserved for the pacing timer only (after commit "tcp: do not change tcp_wstamp_ns in tcp_mstamp_refresh"), we can set up the timer only when a packet is about to be sent, and only if tcp_wstamp_ns is in the future. This leads to a ~10% performance increase in TCP_RR workloads. Signed-off-by: Eric Dumazet --- net/ipv4/tcp_output.c | 31 --- 1 file changed, 16 insertions(+), 15 deletions(-) diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 5474c9854f252e50cdb1136435417873861d7618..d212e4cbc68902e873afb4a12b43b467ccd6069b 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -975,16 +975,6 @@ enum hrtimer_restart tcp_pace_kick(struct hrtimer *timer) return HRTIMER_NORESTART; } -static void tcp_internal_pacing(struct sock *sk) -{ - if (!tcp_needs_internal_pacing(sk)) - return; - hrtimer_start(&tcp_sk(sk)->pacing_timer, - ns_to_ktime(tcp_sk(sk)->tcp_wstamp_ns), - HRTIMER_MODE_ABS_PINNED_SOFT); - sock_hold(sk); -} - static void tcp_update_skb_after_send(struct sock *sk, struct sk_buff *skb, u64 prior_wstamp) { @@ -1005,8 +995,6 @@ static void tcp_update_skb_after_send(struct sock *sk, struct sk_buff *skb, /* take into account OS jitter */ len_ns -= min_t(u64, len_ns / 2, credit); tp->tcp_wstamp_ns += len_ns; - - tcp_internal_pacing(sk); } } list_move_tail(&skb->tcp_tsorted_anchor, &tp->tsorted_sent_queue); @@ -2186,10 +2174,23 @@ static int tcp_mtu_probe(struct sock *sk) return -1; } -static bool tcp_pacing_check(const struct sock *sk) +static bool tcp_pacing_check(struct sock *sk) { - return
tcp_needs_internal_pacing(sk) && - hrtimer_is_queued(&tcp_sk(sk)->pacing_timer); + struct tcp_sock *tp = tcp_sk(sk); + + if (!tcp_needs_internal_pacing(sk)) + return false; + + if (tp->tcp_wstamp_ns <= tp->tcp_clock_cache) + return false; + + if (!hrtimer_is_queued(&tp->pacing_timer)) { + hrtimer_start(&tp->pacing_timer, + ns_to_ktime(tp->tcp_wstamp_ns), + HRTIMER_MODE_ABS_PINNED_SOFT); + sock_hold(sk); + } + return true; } /* TCP Small Queues : -- 2.19.0.605.g01d371f741-goog
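The control flow of the reworked tcp_pacing_check() boils down to two questions: is internal pacing in use, and is the next departure time actually in the future? A minimal model of that decision (a bool stands in for the hrtimer; everything else is a stand-in, not kernel API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct pacing_state {
	uint64_t tcp_wstamp_ns;    /* departure time of next packet */
	uint64_t tcp_clock_cache;  /* last cached clock sample */
	bool timer_armed;          /* stands in for hrtimer_is_queued() */
};

/* Returns true when transmission must be deferred to the pacing timer. */
static bool pacing_check(struct pacing_state *tp, bool internal_pacing)
{
	if (!internal_pacing)
		return false;
	if (tp->tcp_wstamp_ns <= tp->tcp_clock_cache)
		return false;            /* no pacing delay needed: send now */
	if (!tp->timer_armed)
		tp->timer_armed = true;  /* arm hrtimer at tcp_wstamp_ns */
	return true;
}
```

The win described in the changelog comes from the second test: for an ACK-clocked or cwnd-limited flow, tcp_wstamp_ns never gets ahead of the clock, so the hrtimer is simply never armed.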
[PATCH net-next 1/7] tcp: do not change tcp_wstamp_ns in tcp_mstamp_refresh
In EDT design, I made the mistake of using tcp_wstamp_ns to store the last tcp_clock_ns() sample and to store the pacing virtual timer. This causes major regressions at high speed flows. Introduce tcp_clock_cache to store last tcp_clock_ns(). This is needed because some arches have slow high-resolution kernel time service. tcp_wstamp_ns is only updated when a packet is sent. Note that we can remove tcp_mstamp in the future since tcp_mstamp is essentially tcp_clock_cache/1000, so the apparent socket size increase is temporary. Fixes: 9799ccb0e984 ("tcp: add tcp_wstamp_ns socket field") Signed-off-by: Eric Dumazet Acked-by: Soheil Hassas Yeganeh --- include/linux/tcp.h | 1 + net/ipv4/tcp_output.c | 9 ++--- net/ipv4/tcp_timer.c | 2 +- 3 files changed, 8 insertions(+), 4 deletions(-) diff --git a/include/linux/tcp.h b/include/linux/tcp.h index 848f5b25e178288ce870637b68a692ab88dc7d4d..8ed77bb4ed8636e9294389a011529fd9a667dce4 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -249,6 +249,7 @@ struct tcp_sock { u32 tlp_high_seq; /* snd_nxt at the time of TLP retransmit. 
*/ u64 tcp_wstamp_ns; /* departure time for next sent data packet */ + u64 tcp_clock_cache; /* cache last tcp_clock_ns() (see tcp_mstamp_refresh()) */ /* RTT measurement */ u64 tcp_mstamp; /* most recent packet received/sent */ diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 059b67af28b137fb9566eaef370b270fc424bffb..f14df66a0c858dcb22b8924b9691c375eb5fcbc5 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -52,9 +52,8 @@ void tcp_mstamp_refresh(struct tcp_sock *tp) { u64 val = tcp_clock_ns(); - /* departure time for next data packet */ - if (val > tp->tcp_wstamp_ns) - tp->tcp_wstamp_ns = val; + if (val > tp->tcp_clock_cache) + tp->tcp_clock_cache = val; val = div_u64(val, NSEC_PER_USEC); if (val > tp->tcp_mstamp) @@ -1050,6 +1049,10 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, if (unlikely(!skb)) return -ENOBUFS; } + + /* TODO: might take care of jitter here */ + tp->tcp_wstamp_ns = max(tp->tcp_wstamp_ns, tp->tcp_clock_cache); + skb->skb_mstamp_ns = tp->tcp_wstamp_ns; inet = inet_sk(sk); diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c index 61023d50cd604d5e19464a32c33b65d29c75c81e..676020663ce80a79341ad1a05352742cc8dd5850 100644 --- a/net/ipv4/tcp_timer.c +++ b/net/ipv4/tcp_timer.c @@ -360,7 +360,7 @@ static void tcp_probe_timer(struct sock *sk) */ start_ts = tcp_skb_timestamp(skb); if (!start_ts) - skb->skb_mstamp_ns = tp->tcp_wstamp_ns; + skb->skb_mstamp_ns = tp->tcp_clock_cache; else if (icsk->icsk_user_timeout && (s32)(tcp_time_stamp(tp) - start_ts) > icsk->icsk_user_timeout) goto abort; -- 2.19.0.605.g01d371f741-goog
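The split this patch introduces can be modeled in a few lines: tcp_clock_cache only ratchets forward with the clock, while tcp_wstamp_ns, the pacing virtual clock, is pulled up to it no earlier than transmit time. Field names are from the patch; the harness around them is illustrative.

```c
#include <assert.h>
#include <stdint.h>

struct clk_model {
	uint64_t tcp_clock_cache;  /* last tcp_clock_ns() sample */
	uint64_t tcp_wstamp_ns;    /* pacing virtual clock */
};

static void mstamp_refresh(struct clk_model *tp, uint64_t now_ns)
{
	if (now_ns > tp->tcp_clock_cache)
		tp->tcp_clock_cache = now_ns;  /* monotone cache only */
}

/* Called from the transmit path, as in __tcp_transmit_skb(). */
static uint64_t transmit_stamp(struct clk_model *tp)
{
	if (tp->tcp_wstamp_ns < tp->tcp_clock_cache)
		tp->tcp_wstamp_ns = tp->tcp_clock_cache;
	return tp->tcp_wstamp_ns;  /* becomes skb->skb_mstamp_ns */
}
```

Refreshing the clock between sends no longer drags the pacing clock forward, which is exactly the high-speed regression the changelog describes.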
[PATCH net-next 0/7] tcp: second round for EDT conversion
First round of EDT patches left TCP stack in a non-optimal state. - High speed flows suffered from loss of performance, addressed by the first patch of this series. - Second patch brings pacing to the current state of networking, since we now reach ~100 Gbit on a single TCP flow. - Third patch implements a mitigation for scheduling delays, like the one we did in sch_fq in the past. - Fourth patch removes one special case in sch_fq for ACK packets. - Fifth patch removes a serious performance cost for TCP internal pacing. We should set up the high resolution timer only if really needed. - Sixth patch fixes a typo in BBR. - Last patch is one minor change in cdg congestion control. Neal Cardwell also has a patch series fixing BBR after EDT adoption. Eric Dumazet (6): tcp: do not change tcp_wstamp_ns in tcp_mstamp_refresh net: extend sk_pacing_rate to unsigned long tcp: mitigate scheduling jitter in EDT pacing model net_sched: sch_fq: no longer use skb_is_tcp_pure_ack() tcp: optimize tcp internal pacing tcp: cdg: use tcp high resolution clock cache Neal Cardwell (1): tcp_bbr: fix typo in bbr_pacing_margin_percent include/linux/tcp.h | 1 + include/net/sock.h| 4 +-- net/core/filter.c | 4 +-- net/core/sock.c | 9 +++--- net/ipv4/tcp.c| 10 +++--- net/ipv4/tcp_bbr.c| 10 +++--- net/ipv4/tcp_cdg.c| 2 +- net/ipv4/tcp_output.c | 72 ++- net/ipv4/tcp_timer.c | 2 +- net/sched/sch_fq.c| 22 +++-- 10 files changed, 78 insertions(+), 58 deletions(-) -- 2.19.0.605.g01d371f741-goog
Re: [PATCH iproute2] macsec: fix off-by-one when parsing attributes
On Fri, 12 Oct 2018 17:34:12 +0200 Sabrina Dubroca wrote: > I seem to have had a massive brainfart with uses of > parse_rtattr_nested(). The rtattr* array must have MAX+1 elements, and > the call to parse_rtattr_nested must have MAX as its bound. Let's fix > those. > > Fixes: b26fc590ce62 ("ip: add MACsec support") > Signed-off-by: Sabrina Dubroca Applied. How did it ever work??
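The convention being restored here: an attribute table indexed by type 0..MAX needs MAX+1 slots, while the parse bound passed in is MAX itself. A stand-in parser (not the libnetlink API; DEMO_ATTR_MAX is a hypothetical `*_MAX` constant) shows why an array of only MAX elements would overflow on the highest attribute type:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_ATTR_MAX 3  /* hypothetical *_MAX from a netlink header */

/* Minimal analogue of parse_rtattr_nested(): tb[] must have
 * max + 1 entries because tb[max] itself is a valid slot. */
static void demo_parse(int *tb, int max, const int *types, size_t n)
{
	memset(tb, 0, (size_t)(max + 1) * sizeof(*tb));
	for (size_t i = 0; i < n; i++)
		if (types[i] <= max)
			tb[types[i]] = 1;
}
```

Declaring `int tb[DEMO_ATTR_MAX]` and calling `demo_parse(tb, DEMO_ATTR_MAX, ...)` would write one element past the end, which is the off-by-one the fix addresses.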
Re: [PATCH iproute2] json: make 0xhex handle u64
On Fri, 12 Oct 2018 17:34:32 +0200 Sabrina Dubroca wrote: > Stephen converted macsec's sci to use 0xhex, but 0xhex handles > unsigned int's, not 64 bits ints. Thus, the output of the "ip macsec > show" command is mangled, with half of the SCI replaced with 0s: > > # ip macsec show > 11: macsec0: [...] > cipher suite: GCM-AES-128, using ICV length 16 > TXSC: 01560001 on SA 0 > > # ip -d link show macsec0 > 11: macsec0@ens3: [...] > link/ether 52:54:00:12:01:56 brd ff:ff:ff:ff:ff:ff promiscuity 0 > macsec sci 5254001201560001 [...] > > where TXSC and sci should match. > > Fixes: c0b904de6211 ("macsec: support JSON") > Signed-off-by: Sabrina Dubroca Thanks for finding this. We should add JSON (and macsec) to tests.
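The truncation is easy to reproduce: pushing a 64-bit SCI through a formatter that takes an unsigned int keeps only the low 32 bits, which is exactly why TXSC showed 01560001 instead of the full 5254001201560001. The helper names below are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Old 0xhex behaviour: value narrowed to unsigned int before printing. */
static void hex_u32(char *buf, size_t len, uint64_t v)
{
	snprintf(buf, len, "%x", (unsigned int)v);  /* high half lost */
}

/* Fixed behaviour: keep all 64 bits. */
static void hex_u64(char *buf, size_t len, uint64_t v)
{
	snprintf(buf, len, "%llx", (unsigned long long)v);
}
```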
Re: [iproute PATCH] bridge: fdb: Fix for missing keywords in non-JSON output
On Tue, 9 Oct 2018 14:44:08 +0200 Phil Sutter wrote: > While migrating to JSON print library, some keywords were dropped from > standard output by accident. Add them back to unbreak output parsers. > > Fixes: c7c1a1ef51aea ("bridge: colorize output and use JSON print library") > Signed-off-by: Phil Sutter Good catch. Applied.
Re: BBR and TCP internal pacing causing interrupt storm with pfifo_fast
On 10/15/2018 07:50 AM, Eric Dumazet wrote: > On Mon, Oct 15, 2018 at 3:26 AM Gasper Zejn wrote: >> >> >> I've tried to isolate the issue as best I could. There seems to be an >> issue if the TCP socket has keepalive set and send queue is not empty >> and the route goes away. >> >> https://github.com/zejn/bbr_pfifo_interrupts_issue >> >> Hope this helps, >> Gasper > > This is awesome Gasper, I will take a look thanks. > > Note that we are about to send a patch series (targeting net-next) to > polish the EDT patch series that was merged last month for linux-4.20. > TCP internal pacing is going to be much better performance-wise. > Yeah, I believe that commit c092dd5f4a7f4e4dbbcc8cf2e50b516bf07e432f ("tcp: switch tcp_internal_pacing() to tcp_wstamp_ns") has incidentally fixed the issue. That is because it calls tcp_internal_pacing() from tcp_update_skb_after_send(), which is called only if the packet was correctly sent by the IP layer. Before this patch, tcp_internal_pacing() was called from __tcp_transmit_skb() before we attempted to send the clone, and the clone could be dropped in the IP layer (lack of route for example) right away. So in case the packet was not sent because of a route problem, the high resolution timer would kick in soon after and the TCP xmit path would be entered again, triggering this loop problem. I am going to send the 2nd round of EDT patches, so that you can try David Miller's net-next tree with all the patches we believe are needed for 4.20. Once proven to work, we might have to backport the series to 4.18 and 4.19. Thanks !
Re: [Bug 201423] New: eth0: hw csum failure
On Mon, 15 Oct 2018 08:41:47 -0700 Eric Dumazet wrote: > On Mon, Oct 15, 2018 at 8:15 AM Stephen Hemminger > wrote: > > > > > > > > Begin forwarded message: > > > > Date: Sun, 14 Oct 2018 10:42:48 + > > From: bugzilla-dae...@bugzilla.kernel.org > > To: step...@networkplumber.org > > Subject: [Bug 201423] New: eth0: hw csum failure > > > > > > https://bugzilla.kernel.org/show_bug.cgi?id=201423 > > > > Bug ID: 201423 > >Summary: eth0: hw csum failure > >Product: Networking > >Version: 2.5 > > Kernel Version: 4.19.0-rc7 > > Hardware: Intel > > OS: Linux > > Tree: Mainline > > Status: NEW > > Severity: normal > > Priority: P1 > > Component: Other > > Assignee: step...@networkplumber.org > > Reporter: ross...@inwind.it > > Regression: No > > > > I have a P6T DELUXE V2 motherboard and using the sky2 driver for the > > ethernet > > ports. I get the following error message: > > > > [ 433.727397] eth0: hw csum failure > > [ 433.727406] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.19.0-rc7 #19 > > [ 433.727406] Hardware name: System manufacturer System Product Name/P6T > > DELUXE V2, BIOS 120212/22/2010 > > [ 433.727407] Call Trace: > > [ 433.727409] > > [ 433.727415] dump_stack+0x46/0x5b > > [ 433.727419] __skb_checksum_complete+0xb0/0xc0 > > [ 433.727423] tcp_v4_rcv+0x528/0xb60 > > [ 433.727426] ? ipt_do_table+0x2d0/0x400 > > [ 433.727429] ip_local_deliver_finish+0x5a/0x110 > > [ 433.727430] ip_local_deliver+0xe1/0xf0 > > [ 433.727431] ? ip_sublist_rcv_finish+0x60/0x60 > > [ 433.727432] ip_rcv+0xca/0xe0 > > [ 433.727434] ? ip_rcv_finish_core.isra.0+0x300/0x300 > > [ 433.727436] __netif_receive_skb_one_core+0x4b/0x70 > > [ 433.727438] netif_receive_skb_internal+0x4e/0x130 > > [ 433.727439] napi_gro_receive+0x6a/0x80 > > [ 433.727442] sky2_poll+0x707/0xd20 > > [ 433.727446] ? 
rcu_check_callbacks+0x1b4/0x900 > > [ 433.727447] net_rx_action+0x237/0x380 > > [ 433.727449] __do_softirq+0xdc/0x1e0 > > [ 433.727452] irq_exit+0xa9/0xb0 > > [ 433.727453] do_IRQ+0x45/0xc0 > > [ 433.727455] common_interrupt+0xf/0xf > > [ 433.727456] > > [ 433.727459] RIP: 0010:cpuidle_enter_state+0x124/0x200 > > [ 433.727461] Code: 53 60 89 c3 e8 dd 90 ad ff 65 8b 3d 96 58 a7 7e e8 d1 > > 8f > > ad ff 31 ff 49 89 c4 e8 27 99 ad ff fb 48 ba cf f7 53 e3 a5 9b c4 20 <4c> > > 89 e1 > > 4c 29 e9 48 89 c8 48 c1 f9 3f 48 f7 ea b8 ff ff ff 7f 48 > > [ 433.727462] RSP: :c90a3e98 EFLAGS: 0282 ORIG_RAX: > > ffde > > [ 433.727463] RAX: 880237b1f280 RBX: 0004 RCX: > > 001f > > [ 433.727464] RDX: 20c49ba5e353f7cf RSI: 2fe419c1 RDI: > > > > [ 433.727465] RBP: 880237b263a0 R08: 0714 R09: > > 00650512105d > > [ 433.727465] R10: R11: 0342 R12: > > 0064fc2a8b1c > > [ 433.727466] R13: 0064fc25b35f R14: 0004 R15: > > 8204af20 > > [ 433.727468] ? cpuidle_enter_state+0x119/0x200 > > [ 433.727471] do_idle+0x1bf/0x200 > > [ 433.727473] cpu_startup_entry+0x6a/0x70 > > [ 433.727475] start_secondary+0x17f/0x1c0 > > [ 433.727476] secondary_startup_64+0xa4/0xb0 > > [ 441.662954] eth0: hw csum failure > > [ 441.662959] CPU: 4 PID: 4347 Comm: radeon_cs:0 Not tainted 4.19.0-rc7 #19 > > [ 441.662960] Hardware name: System manufacturer System Product Name/P6T > > DELUXE V2, BIOS 120212/22/2010 > > [ 441.662960] Call Trace: > > [ 441.662963] > > [ 441.662968] dump_stack+0x46/0x5b > > [ 441.662972] __skb_checksum_complete+0xb0/0xc0 > > [ 441.662975] tcp_v4_rcv+0x528/0xb60 > > [ 441.662979] ? ipt_do_table+0x2d0/0x400 > > [ 441.662981] ip_local_deliver_finish+0x5a/0x110 > > [ 441.662983] ip_local_deliver+0xe1/0xf0 > > [ 441.662985] ? ip_sublist_rcv_finish+0x60/0x60 > > [ 441.662986] ip_rcv+0xca/0xe0 > > [ 441.662988] ? 
ip_rcv_finish_core.isra.0+0x300/0x300 > > [ 441.662990] __netif_receive_skb_one_core+0x4b/0x70 > > [ 441.662993] netif_receive_skb_internal+0x4e/0x130 > > [ 441.662994] napi_gro_receive+0x6a/0x80 > > [ 441.662998] sky2_poll+0x707/0xd20 > > [ 441.663000] net_rx_action+0x237/0x380 > > [ 441.663002] __do_softirq+0xdc/0x1e0 > > [ 441.663005] irq_exit+0xa9/0xb0 > > [ 441.663007] do_IRQ+0x45/0xc0 > > [ 441.663009] common_interrupt+0xf/0xf > > [ 441.663010] > > [ 441.663012] RIP: 0010:merge+0x22/0xb0 > > [ 441.663014] Code: c3 31 c0 c3 90 90 90 90 41 56 41 55 41 54 55 48 89 d5 > > 53 > > 48 89 cb 48 83 ec 18 65 48 8b 04 25 28 00 00 00 48 89 44 24 10 31 c0 <48> > > 85 c9 > > 74 70 48 85 d2 74 6b 49 89 fd 49 89 f6 49 89 e4 eb 14 48 > > [ 441.663015] RSP: 0018:c990b988 EFLAGS: 0246 ORIG_RAX: > > ffde > > [ 441.663017] RAX: RBX: 88021ab2d408 RCX: > > 88021ab2d408 > > [ 441.663018]
Re: Fw: [Bug 201423] New: eth0: hw csum failure
Hi Eric. On Mon, 15 Oct 2018 at 16:42, Eric Dumazet wrote: > > On Mon, Oct 15, 2018 at 8:15 AM Stephen Hemminger > wrote: > > > > > > > > Begin forwarded message: > > > > Date: Sun, 14 Oct 2018 10:42:48 + > > From: bugzilla-dae...@bugzilla.kernel.org > > To: step...@networkplumber.org > > Subject: [Bug 201423] New: eth0: hw csum failure > > > > > > https://bugzilla.kernel.org/show_bug.cgi?id=201423 > > > > Bug ID: 201423 > >Summary: eth0: hw csum failure > >Product: Networking > >Version: 2.5 > > Kernel Version: 4.19.0-rc7 > > Hardware: Intel > > OS: Linux > > Tree: Mainline > > Status: NEW > > Severity: normal > > Priority: P1 > > Component: Other > > Assignee: step...@networkplumber.org > > Reporter: ross...@inwind.it > > Regression: No > > > > I have a P6T DELUXE V2 motherboard and using the sky2 driver for the > > ethernet > > ports. I get the following error message: > > > > [ 433.727397] eth0: hw csum failure > > [ 433.727406] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.19.0-rc7 #19 > > [ 433.727406] Hardware name: System manufacturer System Product Name/P6T > > DELUXE V2, BIOS 120212/22/2010 > > [ 433.727407] Call Trace: > > [ 433.727409] > > [ 433.727415] dump_stack+0x46/0x5b > > [ 433.727419] __skb_checksum_complete+0xb0/0xc0 > > [ 433.727423] tcp_v4_rcv+0x528/0xb60 > > [ 433.727426] ? ipt_do_table+0x2d0/0x400 > > [ 433.727429] ip_local_deliver_finish+0x5a/0x110 > > [ 433.727430] ip_local_deliver+0xe1/0xf0 > > [ 433.727431] ? ip_sublist_rcv_finish+0x60/0x60 > > [ 433.727432] ip_rcv+0xca/0xe0 > > [ 433.727434] ? ip_rcv_finish_core.isra.0+0x300/0x300 > > [ 433.727436] __netif_receive_skb_one_core+0x4b/0x70 > > [ 433.727438] netif_receive_skb_internal+0x4e/0x130 > > [ 433.727439] napi_gro_receive+0x6a/0x80 > > [ 433.727442] sky2_poll+0x707/0xd20 > > [ 433.727446] ? 
rcu_check_callbacks+0x1b4/0x900 > > [ 433.727447] net_rx_action+0x237/0x380 > > [ 433.727449] __do_softirq+0xdc/0x1e0 > > [ 433.727452] irq_exit+0xa9/0xb0 > > [ 433.727453] do_IRQ+0x45/0xc0 > > [ 433.727455] common_interrupt+0xf/0xf > > [ 433.727456] > > [ 433.727459] RIP: 0010:cpuidle_enter_state+0x124/0x200 > > [ 433.727461] Code: 53 60 89 c3 e8 dd 90 ad ff 65 8b 3d 96 58 a7 7e e8 d1 > > 8f > > ad ff 31 ff 49 89 c4 e8 27 99 ad ff fb 48 ba cf f7 53 e3 a5 9b c4 20 <4c> > > 89 e1 > > 4c 29 e9 48 89 c8 48 c1 f9 3f 48 f7 ea b8 ff ff ff 7f 48 > > [ 433.727462] RSP: :c90a3e98 EFLAGS: 0282 ORIG_RAX: > > ffde > > [ 433.727463] RAX: 880237b1f280 RBX: 0004 RCX: > > 001f > > [ 433.727464] RDX: 20c49ba5e353f7cf RSI: 2fe419c1 RDI: > > > > [ 433.727465] RBP: 880237b263a0 R08: 0714 R09: > > 00650512105d > > [ 433.727465] R10: R11: 0342 R12: > > 0064fc2a8b1c > > [ 433.727466] R13: 0064fc25b35f R14: 0004 R15: > > 8204af20 > > [ 433.727468] ? cpuidle_enter_state+0x119/0x200 > > [ 433.727471] do_idle+0x1bf/0x200 > > [ 433.727473] cpu_startup_entry+0x6a/0x70 > > [ 433.727475] start_secondary+0x17f/0x1c0 > > [ 433.727476] secondary_startup_64+0xa4/0xb0 > > [ 441.662954] eth0: hw csum failure > > [ 441.662959] CPU: 4 PID: 4347 Comm: radeon_cs:0 Not tainted 4.19.0-rc7 #19 > > [ 441.662960] Hardware name: System manufacturer System Product Name/P6T > > DELUXE V2, BIOS 120212/22/2010 > > [ 441.662960] Call Trace: > > [ 441.662963] > > [ 441.662968] dump_stack+0x46/0x5b > > [ 441.662972] __skb_checksum_complete+0xb0/0xc0 > > [ 441.662975] tcp_v4_rcv+0x528/0xb60 > > [ 441.662979] ? ipt_do_table+0x2d0/0x400 > > [ 441.662981] ip_local_deliver_finish+0x5a/0x110 > > [ 441.662983] ip_local_deliver+0xe1/0xf0 > > [ 441.662985] ? ip_sublist_rcv_finish+0x60/0x60 > > [ 441.662986] ip_rcv+0xca/0xe0 > > [ 441.662988] ? 
ip_rcv_finish_core.isra.0+0x300/0x300 > > [ 441.662990] __netif_receive_skb_one_core+0x4b/0x70 > > [ 441.662993] netif_receive_skb_internal+0x4e/0x130 > > [ 441.662994] napi_gro_receive+0x6a/0x80 > > [ 441.662998] sky2_poll+0x707/0xd20 > > [ 441.663000] net_rx_action+0x237/0x380 > > [ 441.663002] __do_softirq+0xdc/0x1e0 > > [ 441.663005] irq_exit+0xa9/0xb0 > > [ 441.663007] do_IRQ+0x45/0xc0 > > [ 441.663009] common_interrupt+0xf/0xf > > [ 441.663010] > > [ 441.663012] RIP: 0010:merge+0x22/0xb0 > > [ 441.663014] Code: c3 31 c0 c3 90 90 90 90 41 56 41 55 41 54 55 48 89 d5 > > 53 > > 48 89 cb 48 83 ec 18 65 48 8b 04 25 28 00 00 00 48 89 44 24 10 31 c0 <48> > > 85 c9 > > 74 70 48 85 d2 74 6b 49 89 fd 49 89 f6 49 89 e4 eb 14 48 > > [ 441.663015] RSP: 0018:c990b988 EFLAGS: 0246 ORIG_RAX: > > ffde > > [ 441.663017] RAX: RBX: 88021ab2d408 RCX: > > 88021ab2d408 > > [
Re: [bpf-next PATCH v2 2/2] bpf: bpftool, add flag to allow non-compat map definitions
On Mon, 15 Oct 2018 08:17:53 -0700, John Fastabend wrote: > Multiple map definition structures exist and user may have non-zero > fields in their definition that are not recognized by bpftool and > libbpf. The normal behavior is to then fail loading the map. Although > this is a good default behavior users may still want to load the map > for debugging or other reasons. This patch adds a --mapcompat flag > that can be used to override the default behavior and allow loading > the map even when it has additional non-zero fields. > > For now the only user is 'bpftool prog' we can switch over other > subcommands as needed. The library exposes an API that consumes > a flags field now but I kept the original API around also in case > users of the API don't want to expose this. The flags field is an > int in case we need more control over how the API call handles > errors/features/etc in the future. > > Signed-off-by: John Fastabend No strong opinion on the functionality, but may I be a grump and again request adding the new option to completions and the man page? :)
Re: [bpf-next PATCH v2 1/2] bpf: bpftool, add support for attaching programs to maps
On Mon, 15 Oct 2018 08:17:48 -0700, John Fastabend wrote: > Sock map/hash introduce support for attaching programs to maps. To > date I have been doing this with custom tooling but this is less than > ideal as we shift to using bpftool as the single CLI for our BPF uses. > This patch adds new sub commands 'attach' and 'detach' to the 'prog' > command to attach programs to maps and then detach them. > > Signed-off-by: John Fastabend Reviewed-by: Jakub Kicinski
Re: Bug in MACSec - stops passing traffic after approx 5TB
And confirmed, starting with a high packet number results in a very short testbed run, 296 packets and then nothing, just as you surmised. Sorry for raising the alarm falsely. Looks like I need to roll my own build of wpa_supplicant as the Ubuntu builds don't include the macsec driver, haven't tested Gentoo's ebuilds yet to see if they do. Josh Coombs On Sun, Oct 14, 2018 at 4:52 PM Josh Coombs wrote: > > On Sun, Oct 14, 2018 at 4:24 PM Sabrina Dubroca wrote: > > > > 2018-10-14, 10:59:31 -0400, Josh Coombs wrote: > > > I initially mistook this for a traffic control issue, but after > > > stripping the test beds down to just the MACSec component, I can still > > > replicate the issue. After approximately 5TB of transfer / 4 billion > > > packets over a MACSec link it stops passing traffic. > > > > I think you're just hitting packet number exhaustion. After 2^32 > > packets, the packet number would wrap to 0 and start being reused, > > which breaks the crypto used by macsec. Before this point, you have to > > add a new SA, and tell the macsec device to switch to it. > > I had not considered that, I naively thought as long as I didn't > specify a replay window, it'd roll the PN over on its own and life > would be good. I'll test that theory tomorrow, should be easy to > prove out. > > > That's why you should be using wpa_supplicant. It will monitor the > > growth of the packet number, and handle the rekey for you. > > Thank you for the heads up, I'll read up on this as well. > > Josh C
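The numbers line up with Sabrina's diagnosis: a 32-bit packet number gives 2^32 ≈ 4.29 billion packets per SA, which at typical packet sizes is roughly the observed 5 TB. A back-of-the-envelope helper (illustrative only, not MACsec API):

```c
#include <assert.h>
#include <stdint.h>

#define MACSEC_PN_SPACE (1ULL << 32)  /* 32-bit packet number per SA */

/* Whole seconds until the PN wraps at a steady packet rate. */
static uint64_t seconds_until_pn_exhaustion(uint64_t pkts_per_sec)
{
	return MACSEC_PN_SPACE / pkts_per_sec;
}

/* Bytes carried before the wrap, at a given average packet size. */
static uint64_t bytes_before_wrap(uint64_t avg_pkt_bytes)
{
	return MACSEC_PN_SPACE * avg_pkt_bytes;
}
```

At 1 Mpps an SA lasts about 4294 seconds, a little over an hour, and 2^32 packets averaging ~1200 bytes is ~5.15 TB, matching the failure point Josh saw; hence the need to rekey (install a new SA) well before the wrap, as wpa_supplicant does automatically.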
Re: Fw: [Bug 201423] New: eth0: hw csum failure
On Mon, Oct 15, 2018 at 8:15 AM Stephen Hemminger wrote: > > > > Begin forwarded message: > > Date: Sun, 14 Oct 2018 10:42:48 + > From: bugzilla-dae...@bugzilla.kernel.org > To: step...@networkplumber.org > Subject: [Bug 201423] New: eth0: hw csum failure > > > https://bugzilla.kernel.org/show_bug.cgi?id=201423 > > Bug ID: 201423 >Summary: eth0: hw csum failure >Product: Networking >Version: 2.5 > Kernel Version: 4.19.0-rc7 > Hardware: Intel > OS: Linux > Tree: Mainline > Status: NEW > Severity: normal > Priority: P1 > Component: Other > Assignee: step...@networkplumber.org > Reporter: ross...@inwind.it > Regression: No > > I have a P6T DELUXE V2 motherboard and using the sky2 driver for the ethernet > ports. I get the following error message: > > [ 433.727397] eth0: hw csum failure > [ 433.727406] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.19.0-rc7 #19 > [ 433.727406] Hardware name: System manufacturer System Product Name/P6T > DELUXE V2, BIOS 120212/22/2010 > [ 433.727407] Call Trace: > [ 433.727409] > [ 433.727415] dump_stack+0x46/0x5b > [ 433.727419] __skb_checksum_complete+0xb0/0xc0 > [ 433.727423] tcp_v4_rcv+0x528/0xb60 > [ 433.727426] ? ipt_do_table+0x2d0/0x400 > [ 433.727429] ip_local_deliver_finish+0x5a/0x110 > [ 433.727430] ip_local_deliver+0xe1/0xf0 > [ 433.727431] ? ip_sublist_rcv_finish+0x60/0x60 > [ 433.727432] ip_rcv+0xca/0xe0 > [ 433.727434] ? ip_rcv_finish_core.isra.0+0x300/0x300 > [ 433.727436] __netif_receive_skb_one_core+0x4b/0x70 > [ 433.727438] netif_receive_skb_internal+0x4e/0x130 > [ 433.727439] napi_gro_receive+0x6a/0x80 > [ 433.727442] sky2_poll+0x707/0xd20 > [ 433.727446] ? 
rcu_check_callbacks+0x1b4/0x900 > [ 433.727447] net_rx_action+0x237/0x380 > [ 433.727449] __do_softirq+0xdc/0x1e0 > [ 433.727452] irq_exit+0xa9/0xb0 > [ 433.727453] do_IRQ+0x45/0xc0 > [ 433.727455] common_interrupt+0xf/0xf > [ 433.727456] > [ 433.727459] RIP: 0010:cpuidle_enter_state+0x124/0x200 > [ 433.727461] Code: 53 60 89 c3 e8 dd 90 ad ff 65 8b 3d 96 58 a7 7e e8 d1 8f > ad ff 31 ff 49 89 c4 e8 27 99 ad ff fb 48 ba cf f7 53 e3 a5 9b c4 20 <4c> 89 > e1 > 4c 29 e9 48 89 c8 48 c1 f9 3f 48 f7 ea b8 ff ff ff 7f 48 > [ 433.727462] RSP: :c90a3e98 EFLAGS: 0282 ORIG_RAX: > ffde > [ 433.727463] RAX: 880237b1f280 RBX: 0004 RCX: > 001f > [ 433.727464] RDX: 20c49ba5e353f7cf RSI: 2fe419c1 RDI: > > [ 433.727465] RBP: 880237b263a0 R08: 0714 R09: > 00650512105d > [ 433.727465] R10: R11: 0342 R12: > 0064fc2a8b1c > [ 433.727466] R13: 0064fc25b35f R14: 0004 R15: > 8204af20 > [ 433.727468] ? cpuidle_enter_state+0x119/0x200 > [ 433.727471] do_idle+0x1bf/0x200 > [ 433.727473] cpu_startup_entry+0x6a/0x70 > [ 433.727475] start_secondary+0x17f/0x1c0 > [ 433.727476] secondary_startup_64+0xa4/0xb0 > [ 441.662954] eth0: hw csum failure > [ 441.662959] CPU: 4 PID: 4347 Comm: radeon_cs:0 Not tainted 4.19.0-rc7 #19 > [ 441.662960] Hardware name: System manufacturer System Product Name/P6T > DELUXE V2, BIOS 120212/22/2010 > [ 441.662960] Call Trace: > [ 441.662963] > [ 441.662968] dump_stack+0x46/0x5b > [ 441.662972] __skb_checksum_complete+0xb0/0xc0 > [ 441.662975] tcp_v4_rcv+0x528/0xb60 > [ 441.662979] ? ipt_do_table+0x2d0/0x400 > [ 441.662981] ip_local_deliver_finish+0x5a/0x110 > [ 441.662983] ip_local_deliver+0xe1/0xf0 > [ 441.662985] ? ip_sublist_rcv_finish+0x60/0x60 > [ 441.662986] ip_rcv+0xca/0xe0 > [ 441.662988] ? 
ip_rcv_finish_core.isra.0+0x300/0x300 > [ 441.662990] __netif_receive_skb_one_core+0x4b/0x70 > [ 441.662993] netif_receive_skb_internal+0x4e/0x130 > [ 441.662994] napi_gro_receive+0x6a/0x80 > [ 441.662998] sky2_poll+0x707/0xd20 > [ 441.663000] net_rx_action+0x237/0x380 > [ 441.663002] __do_softirq+0xdc/0x1e0 > [ 441.663005] irq_exit+0xa9/0xb0 > [ 441.663007] do_IRQ+0x45/0xc0 > [ 441.663009] common_interrupt+0xf/0xf > [ 441.663010] > [ 441.663012] RIP: 0010:merge+0x22/0xb0 > [ 441.663014] Code: c3 31 c0 c3 90 90 90 90 41 56 41 55 41 54 55 48 89 d5 53 > 48 89 cb 48 83 ec 18 65 48 8b 04 25 28 00 00 00 48 89 44 24 10 31 c0 <48> 85 > c9 > 74 70 48 85 d2 74 6b 49 89 fd 49 89 f6 49 89 e4 eb 14 48 > [ 441.663015] RSP: 0018:c990b988 EFLAGS: 0246 ORIG_RAX: > ffde > [ 441.663017] RAX: RBX: 88021ab2d408 RCX: > 88021ab2d408 > [ 441.663018] RDX: 88021ab2d388 RSI: a021c440 RDI: > > [ 441.663019] RBP: 88021ab2d388 R08: 5ecf R09: > 8500 > [ 441.663020] R10: ea000877ec00 R11: 880236803500 R12: > a021c440 > [ 441.663021] R13: 88021ab2d448 R14: 0004 R15: >
[bpf-next PATCH v2 2/2] bpf: bpftool, add flag to allow non-compat map definitions
Multiple map definition structures exist and user may have non-zero fields in their definition that are not recognized by bpftool and libbpf. The normal behavior is to then fail loading the map. Although this is a good default behavior users may still want to load the map for debugging or other reasons. This patch adds a --mapcompat flag that can be used to override the default behavior and allow loading the map even when it has additional non-zero fields. For now the only user is 'bpftool prog' we can switch over other subcommands as needed. The library exposes an API that consumes a flags field now but I kept the original API around also in case users of the API don't want to expose this. The flags field is an int in case we need more control over how the API call handles errors/features/etc in the future. Signed-off-by: John Fastabend --- tools/bpf/bpftool/main.c |7 ++- tools/bpf/bpftool/main.h |3 ++- tools/bpf/bpftool/prog.c |2 +- tools/lib/bpf/bpf.h |3 +++ tools/lib/bpf/libbpf.c | 27 ++- tools/lib/bpf/libbpf.h |2 ++ 6 files changed, 32 insertions(+), 12 deletions(-) diff --git a/tools/bpf/bpftool/main.c b/tools/bpf/bpftool/main.c index 79dc3f1..828dde3 100644 --- a/tools/bpf/bpftool/main.c +++ b/tools/bpf/bpftool/main.c @@ -55,6 +55,7 @@ bool pretty_output; bool json_output; bool show_pinned; +int bpf_flags; struct pinned_obj_table prog_table; struct pinned_obj_table map_table; @@ -341,6 +342,7 @@ int main(int argc, char **argv) { "pretty", no_argument,NULL, 'p' }, { "version",no_argument,NULL, 'V' }, { "bpffs", no_argument,NULL, 'f' }, + { "mapcompat", no_argument,NULL, 'm' }, { 0 } }; int opt, ret; @@ -355,7 +357,7 @@ int main(int argc, char **argv) hash_init(map_table.table); opterr = 0; - while ((opt = getopt_long(argc, argv, "Vhpjf", + while ((opt = getopt_long(argc, argv, "Vhpjfm", options, NULL)) >= 0) { switch (opt) { case 'V': @@ -379,6 +381,9 @@ int main(int argc, char **argv) case 'f': show_pinned = true; break; + case 'm': + bpf_flags = 
MAPS_RELAX_COMPAT; + break; default: p_err("unrecognized option '%s'", argv[optind - 1]); if (json_output) diff --git a/tools/bpf/bpftool/main.h b/tools/bpf/bpftool/main.h index 40492cd..91fd697 100644 --- a/tools/bpf/bpftool/main.h +++ b/tools/bpf/bpftool/main.h @@ -74,7 +74,7 @@ #define HELP_SPEC_PROGRAM \ "PROG := { id PROG_ID | pinned FILE | tag PROG_TAG }" #define HELP_SPEC_OPTIONS \ - "OPTIONS := { {-j|--json} [{-p|--pretty}] | {-f|--bpffs} }" + "OPTIONS := { {-j|--json} [{-p|--pretty}] | {-f|--bpffs} | {-m|--mapcompat}" #define HELP_SPEC_MAP \ "MAP := { id MAP_ID | pinned FILE }" @@ -89,6 +89,7 @@ enum bpf_obj_type { extern json_writer_t *json_wtr; extern bool json_output; extern bool show_pinned; +extern int bpf_flags; extern struct pinned_obj_table prog_table; extern struct pinned_obj_table map_table; diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c index 99ab42c..3350289 100644 --- a/tools/bpf/bpftool/prog.c +++ b/tools/bpf/bpftool/prog.c @@ -908,7 +908,7 @@ static int do_load(int argc, char **argv) } } - obj = bpf_object__open_xattr(&attr); + obj = __bpf_object__open_xattr(&attr, bpf_flags); if (IS_ERR_OR_NULL(obj)) { p_err("failed to open object file"); goto err_free_reuse_maps; diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h index 87520a8..69a4d40 100644 --- a/tools/lib/bpf/bpf.h +++ b/tools/lib/bpf/bpf.h @@ -69,6 +69,9 @@ struct bpf_load_program_attr { __u32 prog_ifindex; }; +/* Flags to direct loading requirements */ +#define MAPS_RELAX_COMPAT 0x01 + /* Recommend log buffer size */ #define BPF_LOG_BUF_SIZE (256 * 1024) int bpf_load_program_xattr(const struct bpf_load_program_attr *load_attr, diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index 176cf55..bd71efc 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -562,8 +562,9 @@ static int compare_bpf_map(const void *_a, const void *_b) } static int -bpf_object__init_maps(struct bpf_object *obj) +bpf_object__init_maps(struct bpf_object *obj, int flags) { +
bool strict = !(flags & MAPS_RELAX_COMPAT); int i, map_idx, map_def_sz, nr_maps = 0; Elf_Scn *scn; Elf_Data *data; @@ -685,7 +686,8 @@ static int compare_bpf_map(const void *_a, const void *_b)
[bpf-next PATCH v2 1/2] bpf: bpftool, add support for attaching programs to maps
Sock map/hash introduce support for attaching programs to maps. To date I have been doing this with custom tooling, but this is less than ideal as we shift to using bpftool as the single CLI for our BPF uses. This patch adds new sub-commands 'attach' and 'detach' to the 'prog' command to attach programs to maps and then detach them.

Signed-off-by: John Fastabend
---
 tools/bpf/bpftool/Documentation/bpftool-prog.rst |   11 ++
 tools/bpf/bpftool/Documentation/bpftool.rst      |    2
 tools/bpf/bpftool/bash-completion/bpftool        |   19
 tools/bpf/bpftool/prog.c                         |   99 ++
 4 files changed, 128 insertions(+), 3 deletions(-)

diff --git a/tools/bpf/bpftool/Documentation/bpftool-prog.rst b/tools/bpf/bpftool/Documentation/bpftool-prog.rst
index 64156a1..12c8030 100644
--- a/tools/bpf/bpftool/Documentation/bpftool-prog.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool-prog.rst
@@ -25,6 +25,8 @@ MAP COMMANDS
 |	**bpftool** **prog dump jited**  *PROG* [{**file** *FILE* | **opcodes**}]
 |	**bpftool** **prog pin** *PROG* *FILE*
 |	**bpftool** **prog load** *OBJ* *FILE* [**type** *TYPE*] [**map** {**idx** *IDX* | **name** *NAME*} *MAP*] [**dev** *NAME*]
+|	**bpftool** **prog attach** *PROG* *ATTACH_TYPE* *MAP*
+|	**bpftool** **prog detach** *PROG* *ATTACH_TYPE* *MAP*
 |	**bpftool** **prog help**
 |
 |	*MAP* := { **id** *MAP_ID* | **pinned** *FILE* }
@@ -37,6 +39,7 @@ MAP COMMANDS
 |		**cgroup/bind4** | **cgroup/bind6** | **cgroup/post_bind4** | **cgroup/post_bind6** |
 |		**cgroup/connect4** | **cgroup/connect6** | **cgroup/sendmsg4** | **cgroup/sendmsg6**
 |	}
+|	*ATTACH_TYPE* := { **msg_verdict** | **skb_verdict** | **skb_parse** }


 DESCRIPTION
@@ -90,6 +93,14 @@ DESCRIPTION

 		  Note: *FILE* must be located in *bpffs* mount.

+	**bpftool prog attach** *PROG* *ATTACH_TYPE* *MAP*
+		  Attach bpf program *PROG* (with type specified by
+		  *ATTACH_TYPE*) to the map *MAP*.
+
+	**bpftool prog detach** *PROG* *ATTACH_TYPE* *MAP*
+		  Detach bpf program *PROG* (with type specified by
+		  *ATTACH_TYPE*) from the map *MAP*.
+
 	**bpftool prog help**
 		  Print short help message.

diff --git a/tools/bpf/bpftool/Documentation/bpftool.rst b/tools/bpf/bpftool/Documentation/bpftool.rst
index 8dda77d..25c0872 100644
--- a/tools/bpf/bpftool/Documentation/bpftool.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool.rst
@@ -26,7 +26,7 @@ SYNOPSIS
 	| **pin** | **event_pipe** | **help** }

 	*PROG-COMMANDS* := { **show** | **list** | **dump jited** | **dump xlated** | **pin**
-	| **load** | **help** }
+	| **load** | **attach** | **detach** | **help** }

 	*CGROUP-COMMANDS* := { **show** | **list** | **attach** | **detach** | **help** }

diff --git a/tools/bpf/bpftool/bash-completion/bpftool b/tools/bpf/bpftool/bash-completion/bpftool
index df1060b..0826519 100644
--- a/tools/bpf/bpftool/bash-completion/bpftool
+++ b/tools/bpf/bpftool/bash-completion/bpftool
@@ -292,6 +292,23 @@ _bpftool()
                     fi
                     return 0
                     ;;
+                attach|detach)
+                    if [[ ${#words[@]} == 7 ]]; then
+                        COMPREPLY=( $( compgen -W "id pinned" -- "$cur" ) )
+                        return 0
+                    fi
+
+                    if [[ ${#words[@]} == 6 ]]; then
+                        COMPREPLY=( $( compgen -W "msg_verdict skb_verdict skb_parse" -- "$cur" ) )
+                        return 0
+                    fi
+
+                    if [[ $prev == "$command" ]]; then
+                        COMPREPLY=( $( compgen -W "id pinned" -- "$cur" ) )
+                        return 0
+                    fi
+                    return 0
+                    ;;
                 load)
                     local obj

@@ -347,7 +364,7 @@ _bpftool()
             ;;
         *)
             [[ $prev == $object ]] && \
-                COMPREPLY=( $( compgen -W 'dump help pin load \
+                COMPREPLY=( $( compgen -W 'dump help pin attach detach load \
                     show list' -- "$cur" ) )
             ;;
     esac
diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index b1cd3bc..99ab42c 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -77,6 +77,26 @@
 	[BPF_PROG_TYPE_FLOW_DISSECTOR]	= "flow_dissector",
 };

+static const char * const attach_type_strings[] = {
+	[BPF_SK_SKB_STREAM_PARSER] = "stream_parser",
+	[BPF_SK_SKB_STREAM_VERDICT] = "stream_verdict",
+	[BPF_SK_MSG_VERDICT] = "msg_verdict",
+	[__MAX_BPF_ATTACH_TYPE] = NULL,
+};
+
+enum bpf_attach_type parse_attach_type(const char *str)
+{
+	enum bpf_attach_type type;
+
+	for